Get latest row from csv based on custom condition

Question

I have a table material

+--------+-----+-------------------+----------------+-----------+          
| ID     | REV | name              | Description    | curr      |
+--------+-----+-------------------+----------------+-----------+
| 211-32 | 001 | Screw 1.0         | Used in MAT 1  | READY     |
| 211-32 | 002 | Screw 2 plus      | can be Used-32 | WITHDRAWN |
| 212-41 | 001 | Bolt H1           | Light solid    | READY     |
| 212-41 | 002 | BOLT H2+Form      | Heavy solid    | READY     |
| 101-24 | 001 | HexHead 1-A       | NOR-1          | READY     |
| 101-24 | 002 | HexHead Spl       | NOR-22         | READY     |
| 423-98 | 001 | Nut Repair spare  | NORM1          | READY     |
| 423-98 | 002 | Nut Repair Part-C | NORM2          | WITHDRAWN |
| 423-98 | 003 | Nut SP-C          | NORM2+NORM1    | NULL      |
| 654-01 | 001 | Bar               | Specific only  | WITHDRAWN |
| 654-01 | 002 | Bar rod-S         | Designed+Spe   | WITHDRAWN |
| 654-01 | 003 | Bar OPG           | Hard spec      | NULL      |
+--------+-----+-------------------+----------------+-----------+

Here each ID can have multiple revisions. I want to take latest revisions (i.e highest of 001,002,003 etc.,). But If the latest revision has curr as either NULL(string) or WITHDRAWN then I have take the previous revision and its corresponding value. If even that's curr is NULL or WITHDRAWN I have to again go to previous revision. If all the revision has the same issue then we can ignore it. so the expected output is

+--------+-----+------------------+---------------+-------+
| ID     | REV | name             | Description   | curr  |
+--------+-----+------------------+---------------+-------+
| 211-32 | 001 | Screw 1.0        | Used in MAT 1 | READY |
| 212-41 | 002 | BOLT H2+Form     | Heavy solid   | READY |
| 101-24 | 002 | HexHead Spl      | NOR-22        | READY |
| 423-98 | 001 | Nut Repair spare | NORM1         | READY |
+--------+-----+------------------+---------------+-------+

I'm very new to Python. I have tried below code, but i'm not working. Any suggestion are highly appreciated.

import pandas as pd
import numpy as np

mydata = pd.read_csv('C:/Myfolder/Python/myfile.csv')

mydata.sort_values(['ID','REV'], ascending=[True, False]).drop_duplicates('',keep=last)

score 2 · Answer 1 · answered Jul 30 '20 at 14:21

you can select the rows that don't have NULL or WITHDRAW in it with isin, then do the sort_values and drop_duplicates:

mydata = mydata[~mydata['curr'].isin(['NULL','WITHDRAW'])]
mydata = mydata.sort_values(['ID','REV']).drop_duplicates('ID',keep='last')

score 2 · Accepted Answer · answered Jul 30 '20 at 14:24

2

We can create a psuedo column to get the max and return its index.

first step is to filter out the values we want to ignore.

df1 = df.loc[
    df[~df["curr"].isin(["WITHDRAWN", "NULL"])]
    .assign(key=df["REV"].astype(int))
    .groupby("ID")["key"]
    .idxmax()
]


         ID  REV                 name       Description   curr
6   101-24   002   HexHead Spl          NOR-22           READY
1   211-32   001   Screw 1.0            Used in MAT 1    READY
4   212-41   002   BOLT H2+Form         Heavy solid      READY
7   423-98   001   Nut Repair spare     NORM1            READY

answered Jul 30 '20 at 14:24

Umar.H

22,559
7
39
74

Thanks a lot. I see you are from SQL background. :) I think you are best person to answer my question as i'm a SQL developer and new to python. 1) is `.loc` similar to `where` condition? 2) `.astype` is equivalent to `cast`? 3) Where can i learn replicate SQL functions in pandas. – Avinash Jul 30 '20 at 15:01
`CAST` is indeed similar to astype but pandas has other functions such as `pd.to_numeric` which is like `COALESCE` i would say `.loc` is pandas version of filtering by index and columns, I think its equivilent would be WHERE – Umar.H Jul 30 '20 at 15:05
Thanks for the tips. One last question, what is `groupby("ID")["key"]` ? can i write this as `groupby("ID","key")`? Bit confused here – Avinash Jul 30 '20 at 15:27
no, this is the aggeragte operation think `SELECT NAME,MAX(COL) GROUP BY NAME` @Avinash – Umar.H Jul 30 '20 at 15:47
1

Ah. if my understanding is correct then `["key"] .idxmax()` = `MAX("Key")` ? – Avinash Jul 30 '20 at 15:51
@Avinash exactly, re run the query in SQL to see the results – Umar.H Jul 30 '20 at 15:54
1

Great. Thank you so much once again! – Avinash Jul 30 '20 at 16:05
Can you please help me with this [question](https://stackoverflow.com/questions/63379510/how-sql-recursion-can-be-done-in-pandas) if possible ? – Avinash Aug 13 '20 at 05:13

Ambarish Singh · Answer 3 · 2020-07-30T14:26:02.017

1

I think first you should do is remove the NULL or WITHDRAW from the table.

mydata[mydata[curr] == 'Ready']       # this should do I think...

then you can try your sort and take the max rev value.

mydata = mydata.sort_values(['ID','REV']).drop_duplicates('ID',keep='last')

edited Jul 30 '20 at 14:26

answered Jul 30 '20 at 14:20

Ambarish Singh

123
5

Get latest row from csv based on custom condition

3 Answers3