1

I have a DataFrame that looks as follows:


df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Date": [19990101, 19990101, 20000101, 20000101, 19990101, 19990101, 20000101, 20000101],
    "Name": ["A", "B", "A", "C", "A", "B", "A", "D"]
})

print(df)

   ID      Date Name
0   1  19990101    A
1   1  19990101    B
2   1  20000101    A
3   1  20000101    C
4   2  19990101    A
5   2  19990101    B
6   2  20000101    A
7   2  20000101    D

What I need to do is, for each ID, select only all observations which have the highest date. Currently I have solved the problem with the following code:

fin_df = pd.DataFrame()

for i, d in df.groupby("ID")["Date"].max().items():
    fin_df = fin_df.append(df.loc[(df.ID == i) & (df.Date == d)])

print(fin_df)

   ID      Date Name
2   1  20000101    A
3   1  20000101    C
6   2  20000101    A
7   2  20000101    D

Which works, but is extremely slow when I have to iterate over all dates and ID's in my original DataFrame. I'm sure there must be a better way to do this but I can't figure out how. Hopefully someone can point me in the right direction.

Menno Van Dijk
  • 863
  • 6
  • 24

0 Answers0