I am currently running a real-time Spark Streaming job on a 50-node cluster with Spark 1.3 and Python 2.7. The streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark job:
```
spark-submit --master yarn-client \
  --executor-cores 5 --num-executors 10 \
  --driver-memory 10g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  --conf spark.network.timeout=300 \
  --executor-memory 10g
```
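For context, here is a stripped-down sketch of how the streaming part is set up (the path, app name, and per-batch logic are placeholders, not my actual code):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hdfs-dir-stream")
# 180-second batch interval, as in the job described above
ssc = StreamingContext(sc, 180)

# Pick up new files as they land in the monitored HDFS directory
lines = ssc.textFileStream("hdfs:///path/to/input")

# Placeholder per-batch processing
lines.foreachRDD(lambda rdd: rdd.count())

ssc.start()
ssc.awaitTermination()
```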
The job runs fine for the most part. However, after around 15 hours it throws a Py4J exception saying it cannot obtain a communication channel.
I tried reducing the batch interval, but then the processing time becomes greater than the batch interval and batches start queuing up.
Below is a screenshot of the error:
I did some research and found that it might be an issue with socket descriptor leakage, as described in SPARK-12617.
However, I have not been able to work around the error and resolve it. Is there a way to manually close the open connections that might be exhausting the available ports? Or do I have to make specific changes in the code to resolve this?
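In case it's relevant, this is roughly how I can watch the driver's open descriptor count between batches (a Linux-only diagnostic sketch; it only observes the leak, it doesn't close anything, and the interval is arbitrary):

```python
import os
import threading
import time

def monitor_driver_fds(interval=60):
    # /proc/self/fd lists every file descriptor the driver process holds,
    # including sockets, so a steadily growing count points at a leak.
    while True:
        print("open fds on driver: %d" % len(os.listdir('/proc/self/fd')))
        time.sleep(interval)

# Daemon thread so the check doesn't keep the job alive on shutdown
t = threading.Thread(target=monitor_driver_fds)
t.daemon = True
t.start()
```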
TIA