
I have a requirement in my project to process multiple .txt message files using PySpark. The files are moved from a local directory to an HDFS path (hdfs://messageDir/..) in batches, and each batch contains a few thousand .txt files with a total size of around 100 GB. Almost all of the files are less than 1 MB.

May I know how HDFS stores these files and performs splits? Because every file is less than 1 MB (less than the HDFS block size of 64/128 MB), I don't think any split would happen, but the files will be replicated and stored on 3 different data nodes.

When I use Spark to read all the files inside the HDFS directory (hdfs://messageDir/..) using wildcard matching like *.txt, as below:

rdd = sc.textFile('hdfs://messageDir/*.txt')

How does Spark read the files and perform partitioning, given that HDFS doesn't have any partitions for these small files?

What if my data volume increases over time and I end up with 1 TB of small files for every batch? Can someone tell me how this can be handled?

AngiSen
    Good news, you're there already! The block size is the minimum file size, so each 1 MB file takes at least 64-128 MB! And then we add the replicas! – Elliott Frisch Oct 21 '18 at 06:21
  • Is there a reason you didn't compress the files before uploading to HDFS? – OneCricketeer Oct 21 '18 at 06:24
  • @cricket_007, Yes, that is one option, but I would like to know how Spark behaves when there is a huge volume of small text files. – AngiSen Oct 21 '18 at 06:35
  • It'll obviously be much slower. There needs to be one namenode request for each file individually – OneCricketeer Oct 21 '18 at 06:39
  • But I assume all the text files will be read as one RDD and then partitioning happens. I would like to understand more about how the RDD built from this big chunk of files performs the partitioning under the hood. – AngiSen Oct 21 '18 at 06:41

2 Answers


I think you are mixing things up a little.

  1. You have files sitting in HDFS. Here, block size is the important factor. Depending on your configuration, a block is normally 64 MB or 128 MB. A small file does not waste a whole block of disk space (a 1 MB file still stores only 1 MB per replica), but every file gets its own block and its own metadata entry in the NameNode's memory, so thousands of tiny files add up to a lot of overhead. Can you concatenate these .txt files together? Otherwise you will exhaust the NameNode really quickly. HDFS is not made to store a large number of small files.

  2. Spark can read files from HDFS, the local filesystem, MySQL, and so on. It cannot control the storage principles used there. Because Spark works with RDDs, the data is partitioned so that parts of it reach the workers. The number of partitions can be checked and controlled (using repartition); for reads from HDFS, it is determined by the number of files and blocks. A minimal sketch of checking and controlling this follows below.

Here is a nice explanation of how SparkContext.textFile() handles partitioning and splits on HDFS: How does Spark partition(ing) work on files in HDFS?
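To make that concrete, here is a minimal PySpark sketch, using the same hdfs://messageDir/*.txt path from the question; the target of 200 partitions is only an illustrative value, not a recommendation:

rdd = sc.textFile('hdfs://messageDir/*.txt')

# With thousands of small files, Hadoop's input format gives you roughly
# one partition per file, so this count can easily be in the thousands.
print(rdd.getNumPartitions())

# Collapse to a more sensible number of partitions before shuffling or
# writing; coalesce avoids a full shuffle, repartition would force one.
compacted = rdd.coalesce(200)
print(compacted.getNumPartitions())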

Tobi

You can read small files with Spark just fine; the problem is HDFS. The HDFS block size is usually large (64 MB, 128 MB, or bigger), so a large number of small files creates NameNode overhead.

If you want to produce bigger files, you need to control the number of output partitions (the "reducers"). The number of files written is determined by how many tasks write output. You can use the coalesce or repartition methods to control it (see the sketch below).

Another way is to add an extra step that merges the files. I wrote a Spark application that does this with coalesce: I pass in a target number of records per output file, the application counts the total number of records, and from that it estimates how far to coalesce.
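A rough sketch of that kind of merge step, assuming a SparkContext sc and the input path from the question; the target of one million records per file and the output path hdfs://messageDirMerged are illustrative assumptions, not values from the original application:

rdd = sc.textFile('hdfs://messageDir/*.txt')

records_per_file = 1000000           # illustrative target size of each output file
total_records = rdd.count()          # one full pass over the data
num_files = max(1, total_records // records_per_file)

# coalesce down to the estimated number of partitions and write;
# each partition becomes one output file in the target directory.
rdd.coalesce(num_files).saveAsTextFile('hdfs://messageDirMerged')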

You can also do this kind of merging with Hive or other tools.

Juhong Jung