
I am curious to understand the behavior of the Spark framework when the file size is greater than the cluster memory size. Let us hypothetically assume that we have 2 nodes in the cluster with 64 GB of memory (32 GB + 32 GB) and the file to process is 100 GB. I have read that 50% of the memory on each node is allocated as storage memory for RDD persistence, and the remaining 50% is allocated as working (execution) memory. Working memory can grow and shrink, and may borrow storage memory if it is available for use.

So let us suppose that there are no persisted blocks and the whole 64 GB of memory is available for use. In that case, will Spark process the 100 GB file, or will it fail because it cannot fit the 100 GB file into memory?
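
For concreteness, here is a minimal sketch of what I mean (the path and session setup are just placeholders), with no cache() or persist() anywhere:

    import org.apache.spark.sql.SparkSession

    // hypothetical job: read a 100 GB file and run a simple action,
    // without calling cache() or persist() anywhere
    val spark = SparkSession.builder()
      .appName("large-file-read")
      .getOrCreate()

    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/big_file_100gb.csv")  // placeholder path

    println(df.count())  // forces the whole file to be read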

shankar

1 Answer


Unless you explicitly persist the data, Spark reads it in partitions sized appropriately for the available memory and the size of the cluster, and processes a few partitions at a time rather than loading the whole file at once. As long as there is sufficient disk space, Spark will handle files larger than the available memory.
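
As a rough illustration (the path below is a placeholder), you can read the file and check how many input partitions Spark chose, without ever persisting it:

    // Spark splits the input into many partitions (by default roughly one per
    // 128 MB input split), and each executor processes a few at a time, so the
    // whole file never has to sit in memory at once.
    val df = spark.read.text("hdfs:///data/big_file_100gb.csv")  // placeholder path
    println(df.rdd.getNumPartitions)  // typically hundreds of partitions for 100 GB
    println(df.count())               // succeeds even though the file exceeds cluster memory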

If you explicitly persist the data in memory only, however, you may run into an Out of Memory error.
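
To make the difference concrete, here is a minimal sketch of the two common storage levels (df is just the DataFrame from the read above; this is an illustration, not a recommendation):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK (the default for DataFrame cache()): partitions that do
    // not fit in memory spill to local disk, so the job keeps running.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    // MEMORY_ONLY: partitions that do not fit are dropped and recomputed, and
    // the extra memory pressure is what can lead to OOM errors.
    // (A DataFrame keeps a single storage level, so pick one.)
    // df.persist(StorageLevel.MEMORY_ONLY)

    df.count()      // materializes the cache
    df.unpersist()  // release it when done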

See the official Spark FAQ for further detail.

For most of my work, I let Spark handle its own memory management and rarely call .cache().

Lars Skaug
  • Lars, thanks for the response. If I use persist with MEMORY_AND_DISK, it will keep some amount of data in memory and spill the remaining data to disk. But I am not using cache or persist here; I just want to read the 100 GB file. Let us suppose the 100 GB file is divided into 4 partitions of 25 GB each, but the memory on each node is 32 GB. Will the read happen successfully, or does Spark still use the persist logic even if I don't invoke it manually? – shankar Jul 24 '20 at 03:47
  • Of course I will perform some action on the file that is read. But first of all, the data has to be read into memory, right? This 100 GB of data cannot fit into 64 GB of memory. So will it fail with an out-of-memory error, or will it use the persist logic without my knowledge and load the first 2 partitions into memory while spilling the other 2 partitions to disk? Just curious to know what's happening in the background. – shankar Jul 24 '20 at 03:58
  • I think it's worth reading the link Samson Scharfrichter posted. https://stackoverflow.com/questions/30520428/what-is-the-difference-between-memory-only-and-memory-and-disk-caching-level-in – Lars Skaug Jul 24 '20 at 16:19
  • Samson's post mostly focuses on the persist storage levels. I am not persisting the data; I am reading a 100 GB file on a cluster with 64 GB of memory. I want to know whether the read will succeed or fail. If the 100 GB file is divided into 4 partitions of 25 GB each, then how can they fit into 64 GB of memory? Is Spark loading the first 2 partitions into memory and keeping the other 2 partitions on disk? I want to understand the behavior. – shankar Jul 25 '20 at 06:18
  • Spark will run on any file size, as long as you have sufficient disk space for data to spill. https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main#:~:text=Spark's%20operators%20spill%20data%20to,well%20on%20any%20sized%20data.&text=In%20Apache%20Spark%20if%20the,level%20to%20persist%20the%20data. – Lars Skaug Jul 25 '20 at 18:02
  • Thanks a lot, Lars. That answers my question. – shankar Jul 25 '20 at 18:20
  • Glad to be of help. Speaking of which, if you found any of the details I provided helpful, please upvote. If you accept my answer, I would also be grateful. I am, like you, fairly new to Stack Overflow, and every point counts! – Lars Skaug Jul 25 '20 at 18:24