1

I'm reading data from 2 csv using pandas read_csv.

Details.csv

ID,VALID
P1,Yes
P2,No
P3,Yes
P4,No

Relations.csv

ParentID,ChildID
P1,C1
P1,C2
C1,C1A
C2,C2A
C1A,C1AA
P2,D1
P2,D2
D2,D2A
D2A,D2AA
P3,C4
P4,C5

Now i stored both in separate dataframes. I have to check the ID's from Details in Relationship and for each ID find all level of its children(until no further child). If the ID has Yes for VALID column, then all its child should have "VALID" if not then those are "NOT VALID".

Expected output

P1,VALID
C1,VALID
C2,VALID
C1A,VALID
C2A,VALID
C1AA,VALID
P2,NOT VALID
D1,NOT VALID
D2,NOT VALID
D2A,NOT VALID
D2AA,NOT VALID
P3,VALID
C4,VALID
P4,NOT VALID
C5,NOT VALID

Currently I'm doing this in SQL. I have no idea how to replicate this in python. Are there any functions available in pandas or I have to do with for loop. Any suggestion would be appreciated.

From this question, i have tried something like below but it is not working.

import pandas as pd

details = pd.read_csv('C:/Myfolder/Python/Details.csv')
relationship = pd.read_csv('C:/Myfolder/Python/Relationship.csv')

def nlevel(details.id, parent_dict=relationship.ParentID, _cache={0:0}):
if details.id in _cache:
    return _cache[details.id]

return 1+nlevel(parent_dict[details.id],parent_dict)
Avinash
  • 533
  • 5
  • 15
  • Is the relation.csv ordered, by this I mean that all the parent-child relation for the same parent in one block: for example, could you have P1,C3 after the row P3,C4, hence P1,C3 is not in the same block than the other P1-child relation? – Ben.T Aug 12 '20 at 15:26
  • Hi @Ben.T. Thanks for you time. No it is not ordered. It can be any order. – Avinash Aug 12 '20 at 15:32
  • Maybe have a look at this [post](https://stackoverflow.com/questions/57228605/creating-dictionary-of-parent-child-pairs-in-pandas-dataframe) in the meantime someone answers, that addresses your problem while not a full dup – Ben.T Aug 12 '20 at 15:47

1 Answers1

1

You can do this recursively using a stack. First, put the elements into an adjacency list to simplify traversal, then empty the stack, adding children for each node and storing validity strings in a separate structure. Preserving order, build a result dataframe from the ordered node-validity pairs and dump to CSV.

import pandas as pd
from collections import defaultdict

details_df = pd.read_csv("Details.csv")
relationship_df = pd.read_csv("Relationship.csv")
order = {k: 1 for k in relationship_df.values.flatten()}
graph = defaultdict(list)
validity = {}

for parent, child in relationship_df.values:
    graph[parent].append(child)

for root, valid in details_df.values:
    stack = [root]

    while stack:
        curr = stack.pop()
        validity[curr] = "VALID" if valid == "Yes" else "NOT VALID"
        stack.extend(graph[curr])

result_df = pd.DataFrame([[x, validity[x]] for x in order])
print(result_df.to_csv(index=False, header=False))

Output:

P1,VALID
C1,VALID
C2,VALID
C1A,VALID
C2A,VALID
C1AA,VALID
P2,NOT VALID
D1,NOT VALID
D2,NOT VALID
D2A,NOT VALID
D2AA,NOT VALID
P3,VALID
C4,VALID
P4,NOT VALID
C5,NOT VALID
ggorlen
  • 44,755
  • 7
  • 76
  • 106