I'm trying to generate a correlation matrix based on gene expression levels. I have a dataset with gene names on the columns and individual experiments on the rows, with expression levels in the cells. The dataset is 55,000 genes wide and 150,000 experiments tall, so I broke the computation into chunks because my computer cannot hold the entire set in memory.
This was my attempt:
import pandas as pd
import numpy as np

file_path = 'data.tsv'
chunksize = 10**6

corr_matrix = pd.DataFrame()
for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    # correlation of just this chunk of rows
    chunk_corr = chunk.corr()
    # running average of the per-chunk correlation matrices
    corr_matrix = (corr_matrix + chunk_corr) / 2
print(corr_matrix)
However, running this code slowly eats up RAM until it crashes my system/JupyterLab once all of it is consumed.
Is there a better way to run this, maybe one that uses multiple cores? I'm not familiar with making Python work with data this large.
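For what it's worth, I suspect the averaging step is also mathematically wrong: correlation matrices from separate chunks can't simply be averaged into the correlation of the whole dataset. I think the chunked version would have to accumulate per-gene sums and cross-products over the row chunks and only form the correlation at the end. A rough sketch of what I mean (untested; it assumes every column is a numeric expression value, with any experiment-ID column dropped or set as the index, and it still needs a couple of dense gene-by-gene arrays in memory, i.e. tens of GB at ~55,000 genes):

import numpy as np
import pandas as pd

file_path = 'data.tsv'
chunksize = 10000                 # rows per chunk; tune to available RAM

n = 0                             # experiments seen so far
col_sum = None                    # running per-gene sums
cross = None                      # running sum of X.T @ X (genes x genes)

for chunk in pd.read_csv(file_path, delimiter='\t', chunksize=chunksize):
    X = chunk.to_numpy(dtype=np.float32)          # assumes every column is numeric
    if cross is None:
        genes = chunk.columns
        col_sum = np.zeros(X.shape[1], dtype=np.float64)
        cross = np.zeros((X.shape[1], X.shape[1]), dtype=np.float32)
    n += X.shape[0]
    col_sum += X.sum(axis=0)
    cross += X.T @ X                              # accumulate cross-products chunk by chunk

# sample covariance and Pearson correlation from the accumulated statistics
cov = (cross - np.outer(col_sum, col_sum) / n) / (n - 1)
std = np.sqrt(np.diag(cov))
corr_matrix = pd.DataFrame(cov / np.outer(std, std), index=genes, columns=genes)

The idea is that only the running sums and the cross-product matrix have to survive between chunks, so the 150,000 experiments never need to be in memory at once.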
UPDATE:
I discovered Dask, which supposedly handles both the size and the multithreading. I rewrote my code as:
import dask.dataframe as dd

# Read the dataframe with a large sample size to get past a ValueError during dtype inference
df = dd.read_csv('data.tsv', sep='\t', sample=1000000000)

# Build the correlation matrix lazily, then trigger the actual computation
corr_matrix = df.corr(method='pearson')
corr_matrix = corr_matrix.compute()

# Print correlation matrix
print(corr_matrix)
UPDATE 2: This also slowly eats up RAM and crashes once it hits my RAM limit. Back to the drawing board.
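Doing the arithmetic on the output alone makes me think the dense result may be the real problem, independent of how I compute it:

# back-of-envelope size of the dense gene-by-gene result
n_genes = 55_000
print(n_genes ** 2 * 8 / 1e9)   # ~24 GB as float64, before any intermediates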
UPDATE 3: No one on this website was helpful, so I just used a supercomputer with 60 GB of RAM to generate the correlation matrix with pandas.
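For completeness, what ran on that machine was essentially just plain pandas, something like this (a reconstruction, not the exact script; the pickle save is an addition so the result never has to be recomputed):

import pandas as pd

# plain pandas on the 60 GB machine: load everything, correlate, save the result
df = pd.read_csv('data.tsv', delimiter='\t')
corr_matrix = df.corr(method='pearson')
corr_matrix.to_pickle('corr_matrix.pkl')   # keep the result on disk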