I use fileStream to read files from an HDFS directory in Spark (streaming context). If my Spark application shuts down and starts again after some time, I would like to read only the new files in the directory. I don't want to re-read old files that were already read and processed by Spark; I am trying to avoid duplicates here.
// assuming: import org.apache.hadoop.io.{LongWritable, Text} and import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
Any code snippets to help?
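For context, here is a minimal sketch of the direction I am considering, assuming Spark Streaming checkpointing (ssc.checkpoint plus StreamingContext.getOrCreate) is the right mechanism to recover file-tracking state across restarts; the app name, checkpoint path, and batch interval are placeholders:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsFileStreamApp {
  // Checkpoint directory on HDFS; path is a placeholder.
  val checkpointDir = "/home/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("HdfsFileStreamApp")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)

    // Same fileStream as above; each batch should only pick up files
    // that are new relative to what the stream has already seen.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/home/File")
    lines.map(_._2.toString).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, recover the context (and its state) from the checkpoint
    // instead of building a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

Is this the right approach to avoid re-reading files that were already processed before the shutdown, or is there a better way?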