I have a large table saved as Parquet, and when I try to load it I get a crazy amount of GC time, around 80%. I use Spark 2.4.3. The Parquet is saved with the following directory layout:
/parentfolder/part_0001/parquet.file
/parentfolder/part_0002/parquet.file
/parentfolder/part_0003/parquet.file
[...]
2432 parts in total.
The table is 2.6 TiB overall and looks like this (both fields are 64-bit ints):
+-----------+------------+
| a | b |
+-----------+------------+
|85899366440|515396105374|
|85899374731|463856482626|
|85899353599|661424977446|
[...]
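In Spark terms the schema is just two LongType columns; a minimal sketch of it (the nullability flags here are an assumption):

from pyspark.sql.types import StructType, StructField, LongType

# Both columns are 64-bit integers, i.e. Spark LongType.
schema = StructType([
    StructField("a", LongType(), nullable=True),
    StructField("b", LongType(), nullable=True),
])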
I have 7.4 TiB of cluster memory in total, with 480 cores across 10 workers, and I read the Parquet like this:
df = spark.read.parquet('/main/parentfolder/*/').cache()
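For completeness, the session is created and the read is forced roughly like this (the executor memory and core settings below are placeholders illustrating the 10-worker / 480-core setup, not my exact submit flags):

from pyspark.sql import SparkSession

# Placeholder configuration -- the real values come from spark-submit;
# the numbers below only sketch the 10-worker / 480-core cluster.
spark = (
    SparkSession.builder
    .appName("load_parquet_table")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "48")
    .config("spark.executor.memory", "700g")
    .getOrCreate()
)

# Read every part folder under the parent directory, cache, and force
# materialization so the cache actually starts to fill.
df = spark.read.parquet('/main/parentfolder/*/').cache()
df.count()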
As I said, I get a crazy amount of garbage collection time; right now it stands at Task Time (GC Time) | 116.9 h (104.8 h), with only 110 GiB loaded after 22 minutes of wall time.
I monitor one of the workers, and its memory usually hovers around 546G/748G.
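For reference, the per-executor GC and storage numbers can also be pulled from the Spark REST API, roughly like this (the driver host/port and picking the first application are placeholders for my setup):

import requests

# Placeholder driver host/port -- substitute the host running the Spark UI.
BASE = "http://driver-host:4040/api/v1"
app_id = requests.get(BASE + "/applications").json()[0]["id"]

# The executors endpoint reports totalDuration / totalGCTime (in ms) and
# memoryUsed / maxMemory (block-manager storage memory, in bytes).
for ex in requests.get(BASE + "/applications/" + app_id + "/executors").json():
    gc_ratio = ex["totalGCTime"] / max(ex["totalDuration"], 1)
    print(ex["id"],
          "GC {:.0%}".format(gc_ratio),
          "storage {:.1f}/{:.1f} GiB".format(ex["memoryUsed"] / 2**30,
                                             ex["maxMemory"] / 2**30))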
What am I doing wrong here? Do I need a larger cluster? If my dataset is 2.6 TiB, why isn't 7.4 TiB of memory enough? But then again, why isn't the memory full on my worker?