0

I am using Spark running in the cloud. My storage is not traditional HDFS, but I connect to my files via a URL. So for example I can do the following and Spark will receive all the files in the directory. I can also use any other HDFS function calls.

sc.TextFile("hdfs://mystorage001/folder/month1/*")

My data is spread between 5 different drives, and I would like spark to round robin between reading from each drive, so I can read from all 5 in parallel. I can do the following currently and process all my data, but it does not read the drives in parallel. Instead spark reads all the files in one drive, then moves to the next.

sc.TextFile("hdfs://mystorage001/folder/month1/*, hdfs://mystorage002/folder/month2/*, hdfs://mystorage003/folder/month3/*, hdfs://mystorage004/folder/month4/*,hdfs://mystorage005/folder/month5/*")

I have 100 executors. So I have also tried this, but this gives me even worst performance.

sc.TextFile("hdfs://mystorage001/folder/month1/*, 20) union sc.TextFile("hdfs://mystorage002/folder/month2/*, 20) union sc.TextFile("hdfs://mystorage003/folder/month3/*, 20) union sc.TextFile("hdfs://mystorage004/folder/month4/*, 20) union sc.TextFile("hdfs://mystorage005/folder/month5/*")

I know that in each directory my has files named 000000_0, 000001_0, 000002_0, etc... So if I could order the list of files spark reads by that part of the name, I think this will accomplish what I want, but the only way I have figured out how to return the list is via wholeTextFile() which needs to load all the data first anyway.

Dan Ciborowski - MSFT
  • 6,807
  • 10
  • 53
  • 88

1 Answers1

0
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val path = Array(new Path("hdfs://mystorage001/folder/month1/"), 
new Path("hdfs://mystorage001/folder/month2/"), 
new Path("hdfs://mystorage001/folder/month3/"), 
new Path("hdfs://mystorage001/folder/month4/"), 
new Path("hdfs://mystorage001/folder/month5/"))
path.map(fs.listStatus(_)).flatMap(_.zipWithIndex).sortBy(_._2).map(_._1.getPath).mkString(",")
Pankaj Arora
  • 544
  • 2
  • 6