
I need your help: I'm trying to submit an external configuration file for my Spark application using Typesafe Config.

I'm loading the application.conf file in my application code like this:

  import com.typesafe.config.ConfigFactory

  lazy val conf = ConfigFactory.load()

The content of application.conf is:

  ingestion {
    process {
      value = "sas"
    }
    sas {
      origin {
        value = "/route"
      }
      destination {
        value = "/route"
      }
      extension {
        value = ".sas7bdat"
      }
      file {
        value = "mytable"
      }
      month {
        value = "201010,201011"
      }
      table {
        value = "tbl"
      }
    }
  }
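
For context, here is a minimal sketch of how these settings would be read with Typesafe Config (the object and value names are illustrative; only the config paths come from the file above):

  import com.typesafe.config.{Config, ConfigFactory}

  object IngestionSettings {
    // Loads application.conf from the classpath, or from the location given by -Dconfig.file if set
    lazy val conf: Config = ConfigFactory.load()

    // Paths taken from the application.conf shown above
    val process: String     = conf.getString("ingestion.process.value")
    val origin: String      = conf.getString("ingestion.sas.origin.value")
    val destination: String = conf.getString("ingestion.sas.destination.value")
    val extension: String   = conf.getString("ingestion.sas.extension.value")
    val file: String        = conf.getString("ingestion.sas.file.value")
    val months: Seq[String] = conf.getString("ingestion.sas.month.value").split(",").toSeq
    val table: String       = conf.getString("ingestion.sas.table.value")
  }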

My spark-submit command is:

spark2-submit --class com.antonio.Main --master yarn --deploy-mode client \
    --driver-memory 10G --driver-cores 8 \
    --executor-memory 13G --executor-cores 4 --num-executors 10 \
    --verbose \
    --files properties.conf \
    /home/user/ingestion-1.0-SNAPSHOT-jar-with-dependencies.jar --files application.conf

But for some reason, I'm receiving

com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ingestion'

Everything looks configured correctly to me. Have I missed something?

thanks,

Antonio

Antonio Cachuan

1 Answer


By default, your application.conf must be present at the root of the classpath for ConfigFactory.load() to find it. Alternatively, you can change where Typesafe Config looks for the application.conf file through system properties. Therefore, your options are as follows.

The first alternative is to add the root directory of the job to the classpath (the spark.executor.extraClassPath setting is only needed if you load the config on executors):

spark2-submit ... \
    --conf spark.driver.extraClassPath=./ \
    --conf spark.executor.extraClassPath=./ \
    ...

Keep the --files option as is. Note that if you run your job in client mode, you must pass the spark.driver.extraClassPath option the proper path to the directory where application.conf is located on the driver machine.
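
In client mode that could look roughly like the following sketch (the /home/user/ directory is a hypothetical location of application.conf on the driver machine; the remaining options are elided as above):

spark2-submit ... \
    --conf spark.driver.extraClassPath=/home/user/ \
    --conf spark.executor.extraClassPath=./ \
    --files application.conf \
    ...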

The second alternative (and I think this one is superior) is to use the config.file system property, which affects where ConfigFactory.load() looks for the config file:

spark2-submit ... \
    --conf spark.driver.extraJavaOptions=-Dconfig.file=./application.conf \
    --conf spark.executor.extraJavaOptions=-Dconfig.file=./application.conf \
    ...

The remarks above about loading config on executors and about keeping the --files option also apply here.
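
Putting the second alternative together with the command from the question, a complete submit might look roughly like this (a sketch, assuming application.conf sits in the directory where spark2-submit is launched so that the relative ./application.conf path resolves on the client-mode driver, while --files ships the file to the executors' working directories):

spark2-submit --class com.antonio.Main --master yarn --deploy-mode client \
    --driver-memory 10G --driver-cores 8 \
    --executor-memory 13G --executor-cores 4 --num-executors 10 \
    --conf spark.driver.extraJavaOptions=-Dconfig.file=./application.conf \
    --conf spark.executor.extraJavaOptions=-Dconfig.file=./application.conf \
    --files application.conf \
    /home/user/ingestion-1.0-SNAPSHOT-jar-with-dependencies.jar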

Vladimir Matveev
    Thanks for your fast answer @vladimir-matveev I just got it work with this code `spark2-submit --class com.demo.Main --master yarn --deploy-mode client --driver-memory 10G --driver-cores 8 --executor-memory 13G --executor-cores 4 --num-executors 10 --verbose --conf "spark.driver.extraJavaOptions=-Dconfig.file=/home/user/application.conf" --conf "spark.executor.extraJavaOptions=-Dconfig.file=/home/user/application.conf" --files "application.conf" /home/user/ingestion-1.0-SNAPSHOT-jar-with-dependencies.jar` – Antonio Cachuan Nov 14 '18 at 21:31
  • @AntonioCachuan note that specifying an absolute path for config file on executors is not a good idea on most environments, because usually these files are sent to executors via the YARN distributed cache, which can copy that file virtually anywhere, but usually it should be the "working directory" of the executor JVM, and therefore accessible through relative paths. So consider using relative paths instead, because it would make your job more resilient and able to be transferred between environments correctly. – Vladimir Matveev Nov 14 '18 at 21:43
  • That being said, in many cases executors don't load configuration explicitly at all (they usually capture it through closures, which are serialized and sent to executors together with config objects), so passing `-Dconfig.file` option on executors might be unnecessary. If you would ever run your job in the cluster deployment mode, however, you won't be able to use absolute paths too even on the driver, because in the cluster mode the driver is also started somewhere on the cluster and should use the distributed version of the configuration file. – Vladimir Matveev Nov 14 '18 at 21:46
  • @VladimirMatveev thank you so much, can I pass more than one file, like `--conf spark.driver.extraJavaOptions=-Dconfig.file=./application.conf --conf spark.driver.extraJavaOptions=-Dconfig.file=./application2.conf`? i.e. application.conf & application2.conf – BdEngineer Aug 02 '19 at 13:23
  • @VladimirMatveev when I keep it like this: `--files /local/apps/log4j.properties, /local/apps/applicationNew.properties --conf spark.driver.extraJavaOptions=-Dconfig.file=applicationNew.properties --conf spark.executor.extraJavaOptions=-Dconfig.file=applicationNew.properties`, it's throwing the error: Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps/applicationNew.properties at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657) – BdEngineer Aug 02 '19 at 14:00
  • @VladimirMatveev when I print config.file it shows ... 19/08/02 14:19:09 ERROR ExtractionDriver: config.file : null ... what am I doing wrong here? – BdEngineer Aug 02 '19 at 14:21
  • @Shyam comments are not a good place for a discussion. Consider asking a separate question with details specific for your use case. – Vladimir Matveev Aug 02 '19 at 22:39
  • @VladimirMatveev got you sure ... https://stackoverflow.com/questions/57330285/how-to-access-external-property-file-in-spark-submit-job thank you – BdEngineer Aug 05 '19 at 07:31
  • @VladimirMatveev - I used the second set of options you mentioned and it worked fine. Now I edited my config file and re-ran the spark-submit command, but everything was done according to the previous config file. Is there a way to flush out the content so that the new version can be read? – Debapratim Chakraborty Mar 25 '21 at 11:34
  • @DebapratimChakraborty I don't think that's how Spark operates - I believe it does not cache anything, at least by default. It might be some issue with your setup. – Vladimir Matveev Mar 25 '21 at 19:40
  • @VladimirMatveev I am using `--files ` and `--conf spark.driver.extraJavaOptions=-Dconfig.file=./application.properties`. Is it possible that because of the --conf property, it is picking up the file I stored in that directory earlier, and not the S3 file that I intend to? If so, what would be a workaround? – Debapratim Chakraborty Mar 26 '21 at 08:08
  • Sorry, I haven't worked with the S3 integration of Spark, or with Spark Standalone (I assume you use it?), so I don't know how Spark behaves in this case. `-Dconfig.file=./application.properties` should load the file relative to the Spark driver JVM's working directory in whatever environment it is started; how this environment is populated most likely depends on your cluster technology. – Vladimir Matveev Mar 27 '21 at 03:02