I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset with gene names on the columns and individual experiments on the rows, with expression levels in the cells. The dataset is 55,000 genes wide and 150,000 experiments tall, so I broke the computation into chunks because my computer cannot hold the entire set in memory.
This was my attempt:
import pandas as pd
import numpy as np

file_path = 'data.tsv'
chunksize = 10**6

corr_matrix = pd.DataFrame()
for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    # correlation of just this chunk of rows
    chunk_corr = chunk.corr()
    # running average of the per-chunk correlation matrices
    corr_matrix = (corr_matrix + chunk_corr) / 2
print(corr_matrix)
However, running this code slowly eats up RAM until it crashes my system/JupyterLab once all of it is consumed.
Is there a better way to run this, maybe one that uses multiple cores? I'm not familiar with making Python work with data this large.
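For what it's worth, I suspect the averaging step is also mathematically wrong: correlation matrices from separate chunks can't simply be averaged into the correlation of the whole dataset. I think the chunked version would have to accumulate per-gene sums and cross-products over the row chunks and only form the correlation at the end. A rough sketch of what I mean (untested; it assumes every column is a numeric expression value, with any experiment-ID column dropped or set as the index, and it still needs a couple of dense gene-by-gene arrays in memory, i.e. tens of GB at ~55,000 genes):

import numpy as np
import pandas as pd

file_path = 'data.tsv'
chunksize = 10000                 # rows per chunk; tune to available RAM

n = 0                             # experiments seen so far
col_sum = None                    # running per-gene sums
cross = None                      # running sum of X.T @ X (genes x genes)

for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    X = chunk.to_numpy(dtype=np.float32)          # assumes every column is numeric
    if cross is None:
        genes = chunk.columns
        col_sum = np.zeros(X.shape[1], dtype=np.float64)
        cross = np.zeros((X.shape[1], X.shape[1]), dtype=np.float32)
    n += X.shape[0]
    col_sum += X.sum(axis=0)
    cross += X.T @ X                              # accumulate cross-products chunk by chunk

# sample covariance and Pearson correlation from the accumulated statistics
cov = (cross - np.outer(col_sum, col_sum) / n) / (n - 1)
std = np.sqrt(np.diag(cov))
corr_matrix = pd.DataFrame(cov / np.outer(std, std), index=genes, columns=genes)

The idea is that only the running sums and the cross-product matrix have to survive between chunks, so the 150,000 experiments never need to be in memory at once.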
UPDATE:
I discovered Dask, which supposedly handles both the size and the multithreading. I rewrote my code as:
import dask.dataframe as dd

# Read the dataframe with a large sample size to get past a ValueError during dtype inference
df = dd.read_csv('data.tsv', sep='\t', sample=1000000000)

# Build the correlation matrix lazily, then trigger the actual computation
corr_matrix = df.corr(method='pearson')
corr_matrix = corr_matrix.compute()

# Print correlation matrix
print(corr_matrix)
UPDATE 2: This also slowly eats up RAM and crashes once it hits my RAM limit. Back to the drawing board.
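Doing the arithmetic on the output alone makes me think the dense result may be the real problem, independent of how I compute it:

# back-of-envelope size of the dense gene-by-gene result
n_genes = 55_000
print(n_genes ** 2 * 8 / 1e9)   # ~24 GB as float64, before any intermediates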
UPDATE 3: No one on this website was helpful, so I just used a supercomputer with 60 GB of RAM to generate the correlation matrix with pandas.
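For completeness, what ran on that machine was essentially just plain pandas, something like this (a reconstruction, not the exact script; the pickle save is an addition so the result never has to be recomputed):

import pandas as pd

# plain pandas on the 60 GB machine: load everything, correlate, save the result
df = pd.read_csv('data.tsv', delimiter='\t')
corr_matrix = df.corr(method='pearson')
corr_matrix.to_pickle('corr_matrix.pkl')   # keep the result on disk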