I am working on a 600,000 x 128 dataset with a hierarchical index, arranged in the following form (left: input, right: desired output):
             Var1  Var2 ...                    Var1_mean  Var2_mean ...
ID  Date                        ID  Date
1   2017-12     1   0.1         1   2017-12            2       0.3
1   2018-01     2   0.3         1   2018-01            2       0.3
1   2018-02     3   0.5   -->   1   2018-02            2       0.3
2   2018-01     2   0.2         2   2018-01            3       0.15
2   2018-02     4   0.1         2   2018-02            3       0.15
Now, for all 55,000 customers (= unique IDs), I want to compute the per-customer average of selected variables and store it in new columns (variable name + '_mean', see right).
I wrote a function to do this, but even after parallelizing the task it takes over four hours:
import numpy as np
from joblib import Parallel, delayed

identifiers = set(df_raw_sort.index.get_level_values(0))  # unique IDs

def avg_per_customer(column):
    df_raw_sort.loc[:, column + '_mean'] = 0  # create the new column
    for unique in identifiers:
        meanvalue = np.mean(df_raw_sort[column].loc[unique])
        df_raw_sort.loc[unique, column + '_mean'] = meanvalue

# transform holds the names of the selected columns
Parallel(n_jobs=2, verbose=10)(delayed(avg_per_customer)(col) for col in transform)
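For context, I suspect the per-customer loop could be replaced by a single vectorized groupby, but I haven't benchmarked it on the full data. A minimal sketch on toy data mirroring the structure above (the column names and index are assumptions matching my example):

```python
import pandas as pd

# Toy frame with the same (ID, Date) MultiIndex as above.
df = pd.DataFrame(
    {"Var1": [1, 2, 3, 2, 4], "Var2": [0.1, 0.3, 0.5, 0.2, 0.1]},
    index=pd.MultiIndex.from_tuples(
        [(1, "2017-12"), (1, "2018-01"), (1, "2018-02"),
         (2, "2018-01"), (2, "2018-02")],
        names=["ID", "Date"],
    ),
)

transform = ["Var1", "Var2"]  # columns to average per customer

# One pass: per-ID means broadcast back onto every row of that ID.
means = df.groupby(level="ID")[transform].transform("mean")
df[[c + "_mean" for c in transform]] = means

print(df)
```

This avoids both the Python-level loop over IDs and the repeated `.loc` writes into the big frame.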
What could I do to speed this thing up?
Thanks a lot in advance.