I am working on a 600,000 x 128 dataset with a hierarchical index, arranged in the following form (left: input, right: desired output):
             Var1  Var2 ...                    Var1_mean  Var2_mean ...
ID  Date                        ID  Date
1   2017-12     1   0.1         1   2017-12            2       0.3
1   2018-01     2   0.3         1   2018-01            2       0.3
1   2018-02     3   0.5   -->   1   2018-02            2       0.3
2   2018-01     2   0.2         2   2018-01            3       0.15
2   2018-02     4   0.1         2   2018-02            3       0.15
Now, for all 55,000 customers (= unique IDs), I want to compute the per-customer average of selected variables and store it in new columns (variable name + '_mean', see right).
I wrote a function to do this, but even after parallelizing the task it takes over four hours:
import numpy as np
from joblib import Parallel, delayed

identifiers = set(df_raw_sort.index.get_level_values(0))  # unique IDs

def avg_per_customer(column):
    df_raw_sort.loc[:, column + '_mean'] = 0  # create the new column
    for unique in identifiers:
        meanvalue = np.mean(df_raw_sort[column].loc[unique])
        df_raw_sort.loc[unique, column + '_mean'] = meanvalue

# transform holds the names of the selected columns
Parallel(n_jobs=2, verbose=10)(delayed(avg_per_customer)(col) for col in transform)
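For context, I suspect the per-customer loop could be replaced by a single vectorized groupby, but I haven't benchmarked it on the full data. A minimal sketch on toy data mirroring the structure above (the column names and index are assumptions matching my example):

```python
import pandas as pd

# Toy frame with the same (ID, Date) MultiIndex as above.
df = pd.DataFrame(
    {"Var1": [1, 2, 3, 2, 4], "Var2": [0.1, 0.3, 0.5, 0.2, 0.1]},
    index=pd.MultiIndex.from_tuples(
        [(1, "2017-12"), (1, "2018-01"), (1, "2018-02"),
         (2, "2018-01"), (2, "2018-02")],
        names=["ID", "Date"],
    ),
)

transform = ["Var1", "Var2"]  # columns to average per customer

# One pass: per-ID means broadcast back onto every row of that ID.
means = df.groupby(level="ID")[transform].transform("mean")
df[[c + "_mean" for c in transform]] = means

print(df)
```

This avoids both the Python-level loop over IDs and the repeated `.loc` writes into the big frame.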
What could I do to speed this thing up?
Thanks a lot in advance.