I have a DataFrame that looks as follows:
df = pd.DataFrame({
"ID": [1, 1, 1, 1, 2, 2, 2, 2],
"Date": [19990101, 19990101, 20000101, 20000101, 19990101, 19990101, 20000101, 20000101],
"Name": ["A", "B", "A", "C", "A", "B", "A", "D"]
})
print(df)
ID Date Name
0 1 19990101 A
1 1 19990101 B
2 1 20000101 A
3 1 20000101 C
4 2 19990101 A
5 2 19990101 B
6 2 20000101 A
7 2 20000101 D
What I need to do is, for each ID, select only all observations which have the highest date. Currently I have solved the problem with the following code:
fin_df = pd.DataFrame()
for i, d in df.groupby("ID")["Date"].max().items():
fin_df = fin_df.append(df.loc[(df.ID == i) & (df.Date == d)])
print(fin_df)
ID Date Name
2 1 20000101 A
3 1 20000101 C
6 2 20000101 A
7 2 20000101 D
Which works, but is extremely slow when I have to iterate over all dates and ID's in my original DataFrame. I'm sure there must be a better way to do this but I can't figure out how. Hopefully someone can point me in the right direction.