I'm trying to write the result of multiple operations into an AWS Aurora PostgreSQL cluster. All the calculations performs right but, when I try to write the result into the database I get the next error:
py4j.protocol.Py4JJavaError: An error occurred while calling o12179.jdbc.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
I already tried to increase cluster size (15 r4.2xlarge machines), change number of partitions for the data to 120 partitions, change executor and driver memory to 4Gb each and I'm facing the same results.
The current SparkSession configuration is the next:
spark = pyspark.sql.SparkSession\
.builder\
.appName("profile")\
.config("spark.sql.shuffle.partitions", 120)\
.config("spark.executor.memory", "4g").config("spark.driver.memory", "4g")\
.getOrCreate()
I don't know if is a Spark configuration problem or if it's a programming problem.