4

I'm trying to fill out missing values in a pandas dataframe by interpolating or copying the last-known value within a group (identified by trip). My data looks like this:

    brake   speed   trip
0   0.0     NaN     1
1   1.0     NaN     1
2   NaN     1.264   1
3   NaN     0.000   1
4   0.0     NaN     1
5   NaN     1.264   1
6   NaN     6.704   1
7   1.0     NaN     1
8   0.0     NaN     1
9   NaN     11.746  2
10  1.0     NaN     2
11  0.0     NaN     2
12  NaN     16.961  3
13  1.0     NaN     3
14  NaN     11.832  3
15  0.0     NaN     3
16  NaN     17.082  3
17  NaN     22.435  3
18  NaN     28.707  3
19  NaN     34.216  3

I have found Pandas interpolate within a groupby but I need brake to simply be copied from the last-known, yet speed to be interpolated (my actual dataset has 12 columns that each need such treatment)

kevins_1
  • 1,268
  • 2
  • 9
  • 27
retorquere
  • 1,496
  • 1
  • 14
  • 27

1 Answers1

4

You can apply separate methods to each column. For example:

# interpolate speed
df['speed'] = df.groupby('trip').speed.transform(lambda x: x.interpolate())
# fill brake with last known value 
df['brake'] = df.groupby('trip').brake.transform(lambda x: x.fillna(method='ffill'))

>>> df
    brake    speed  trip
0     0.0      NaN     1
1     1.0      NaN     1
2     1.0   1.2640     1
3     1.0   0.0000     1
4     0.0   0.6320     1
5     0.0   1.2640     1
6     0.0   6.7040     1
7     1.0   6.7040     1
8     0.0   6.7040     1
9     NaN  11.7460     2
10    1.0  11.7460     2
11    0.0  11.7460     2
12    NaN  16.9610     3
13    1.0  14.3965     3
14    1.0  11.8320     3
15    0.0  14.4570     3
16    0.0  17.0820     3
17    0.0  22.4350     3
18    0.0  28.7070     3
19    0.0  34.2160     3

Note that this means you remain with some NaN in brake, because there was no "last known value" for the first row of a trip, and some NaNs in speed when the first few rows were NaN. You can replace these as you see fit with fillna()

sacuL
  • 49,704
  • 8
  • 81
  • 106
  • 1
    That's actually fine. I keep forgetting that I can do stuff with individual columns. Is this the fastest way to go about this? I have about 6 million rows; this solution is doable for me at a minute and a half, but if there's a faster way I'm very much interested. – retorquere May 13 '18 at 22:00