
I'm trying to define a system property for my log file, e.g. 'logger.File'='/tmp/some_name.log', in order to dynamically assign different log files to different Spark jobs. I've reduced the test case to the bare minimum below. It fails when the path comes from the property; everything works if I hardcode the value in log4j.properties.

Can someone please enlighten me?

Thanks!

Python code test_log4j.py:

from pyspark.context import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession

SparkContext.setSystemProperty('logger.File', '/tmp/some_name.log')
spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()

logger = spark._jvm.org.apache.log4j.LogManager.getLogger('default')
logger.warn('hey')

log4j.properties:

log4j.rootLogger=INFO, spark_jobs
log4j.logger.spark_jobs=INFO

log4j.appender.spark_jobs=org.apache.log4j.RollingFileAppender
log4j.appender.spark_jobs.layout=org.apache.log4j.PatternLayout
log4j.appender.spark_jobs.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.spark_jobs.Threshold=INFO

log4j.appender.spark_jobs.File=${logger.File}

Test:

/usr/hdp/current/spark2-client/bin/spark-submit --master local --conf spark.driver.extraJavaOptions="-Dlog4j.debug -Dlog4j.configuration=file:/tmp/log4j.properties" --conf spark.executor.extraJavaOptions="-Dlog4j.debug -Dlog4j.configuration=file:/tmp/log4j.properties" test_log4j.py

Result:

log4j: Using URL [file:/tmp/log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/tmp/log4j.properties
log4j: Parsing for [root] with value=[INFO, spark_jobs].
log4j: Level token is [INFO].
log4j: Category root set to INFO
log4j: Parsing appender named "spark_jobs".
log4j: Parsing layout options for "spark_jobs".
log4j: Setting property [conversionPattern] to [%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n].
log4j: End of parsing for "spark_jobs".
log4j: Setting property [file] to [].
log4j: Setting property [threshold] to [INFO].
log4j: setFile called: , true
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException:  (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
        at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
        at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
        at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
        at org.apache.spark.deploy.SparkSubmit$.initializeLogIfNecessary(SparkSubmit.scala:71)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:128)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j: Parsed "spark_jobs" options.
log4j: Parsing for [spark_jobs] with value=[INFO].
log4j: Level token is [INFO].
log4j: Category spark_jobs set to INFO
log4j: Handling log4j.additivity.spark_jobs=[null]
log4j: Finished configuring.
  • My 2 cents: to start the Spark session, you need a running JVM; and as soon as the JVM starts, Log4J starts automatically. So when you try to configure Log4J in your code, well, it's too late, you already hit the error. – Samson Scharfrichter Jun 16 '20 at 21:07
  • Since PySpark starts Py4J under the covers, and Py4J then starts a JVM under the covers, you cannot set Java properties on the `java` command line with `-Dprop.erty=value`; and you cannot set the properties in the Py4J init method. You have to use some black magic, such as the env variables described in https://stackoverflow.com/questions/28327620/difference-between-java-options-java-tool-options-and-java-opts -- but again, set the env variable **before** you start the Spark session -- and preferably before you even start Python (see the environment-variable sketch after these comments). – Samson Scharfrichter Jun 16 '20 at 21:12
  • Thanks, I knew about pyspark, which is why I used spark-submit, but I didn't realize that spark-submit actually starts the JVM before the Spark session is created, or before any Python code is executed. So in fact the error above is generated by the spark-submit command itself, and not by anything in the Python code (as I thought). – Savam Jun 17 '20 at 05:11
  • So it looks like the only way to do this is to add -Dlogger.File=/tmp/some_name.log to the extraJavaOptions in the spark-submit command. So indeed, this works: /usr/hdp/current/spark2-client/bin/spark-submit --master local --conf spark.driver.extraJavaOptions="-Dlogger.File=/tmp/some_file.log -Dlog4j.debug -Dlog4j.configuration=file:/tmp/log4j.properties" --conf spark.executor.extraJavaOptions="-Dlogger.File=/tmp/some_file.log -Dlog4j.debug -Dlog4j.configuration=file:/tmp/log4j.properties" test_log4j.py – Savam Jun 17 '20 at 05:14
  • This is not really a solution, though. What I want is to make up the log file name and location from within the Python code. Hard to believe something so basic is not possible... – Savam Jun 17 '20 at 05:20
  • _"something so basic"_ >> interfacing Python and Java/Scala is not basic; Py4J is more of a band-aid solution. You might try something tricky: setup Log4J to use _stderr_ by default, at startup, then use the Py4J gateway to invoke the Log4J API from Python, and set a new "appender" to use whatever file you want. – Samson Scharfrichter Jun 17 '20 at 08:22
  • ...but that would work only for the driver, not for the executors. Setting hooks to execute arbitrary code at executor startup is experimental & undocumented, has been overhauled in V3, and requires a JAR. – Samson Scharfrichter Jun 17 '20 at 08:24
  • ERRATUM - since you use `spark-submit` and start the JVM (driver-side then executor-side) explicitly from the Python code, you can set an env variable with the Java system props to use at startup _before_ importing the PySpark module. But again, if the executors run on a remote cluster, they require a static Log4J conf. – Samson Scharfrichter Jun 17 '20 at 08:45
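
A minimal sketch of the environment-variable idea from the comments, assuming the script is launched with plain python rather than via spark-submit, so that no JVM exists yet. JAVA_TOOL_OPTIONS is read by any JVM at startup, including the one Py4J launches; the log path is the same made-up example as above:

import os

# Must be set before the driver JVM starts, i.e. before the first Spark/Py4J call
# (and, to be safe, before importing pyspark at all).
os.environ['JAVA_TOOL_OPTIONS'] = (
    '-Dlogger.File=/tmp/some_name.log '
    '-Dlog4j.configuration=file:/tmp/log4j.properties'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
logger = spark._jvm.org.apache.log4j.LogManager.getLogger('default')
logger.warn('hey')  # should now end up in /tmp/some_name.log

Under spark-submit the driver JVM is already up before any Python runs, so there the equivalent is the -Dlogger.File=... entry in spark.driver.extraJavaOptions / spark.executor.extraJavaOptions shown in the working command above.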
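
A rough sketch of the runtime-appender idea from the comments: let log4j.properties log to the console/stderr by default, then attach a file appender through the Py4J gateway once the session exists. The classes are standard Log4J 1.x; the file name is made up, and this only reconfigures the driver, not the executors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()

# Build a RollingFileAppender on the driver JVM and hang it off the root logger.
log4j = spark._jvm.org.apache.log4j
layout = log4j.PatternLayout('%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n')
appender = log4j.RollingFileAppender(layout, '/tmp/some_name.log', True)  # append=True
log4j.Logger.getRootLogger().addAppender(appender)

logger = log4j.LogManager.getLogger('default')
logger.warn('hey')  # written to /tmp/some_name.log (in addition to whatever the static config set up)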
