Pandas Select Multiple Indices Belonging to Final Date

Question

I have a DataFrame that looks as follows:


df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": [19990101, 19990101, 20000101, 20000101, 19990101, 19990101, 20000101, 20000101],
    "Name": ["A", "B", "A", "C", "A", "B", "A", "D"]
})

print(df)

   ID      Date Name
0   1  19990101    A
1   1  19990101    B
2   1  20000101    A
3   1  20000101    C
4   2  19990101    A
5   2  19990101    B
6   2  20000101    A
7   2  20000101    D

What I need to do is, for each ID, select only all observations which have the highest date. Currently I have solved the problem with the following code:

fin_df = pd.DataFrame()

for i, d in df.groupby("ID")["Date"].max().items():
    fin_df = fin_df.append(df.loc[(df.ID == i) & (df.Date == d)])

print(fin_df)

   ID      Date Name
2   1  20000101    A
3   1  20000101    C
6   2  20000101    A
7   2  20000101    D

Which works, but is extremely slow when I have to iterate over all dates and ID's in my original DataFrame. I'm sure there must be a better way to do this but I can't figure out how. Hopefully someone can point me in the right direction.

Use `df = df[df.groupby("ID")["Date"].transform('max') == df['Date']]` — jezrael, May 10 '19 at 09:31
This should work: `temp = df.groupby(['ID'])` `res = temp.apply(lambda x: x.sort_values(by=['Date']).head(1))` — K.Maj, May 10 '19 at 09:40
@jezrael Amazing thanks. Guess I didn't know how to properly phrase my question. — Menno Van Dijk, May 10 '19 at 09:55

Pandas Select Multiple Indices Belonging to Final Date

0 Answers0