1

Background Information

This question is closely related to my previous question. Unfortunately while making up an general example it was not specific enough to be applied to my personal problem. That is why this question is more specific.

Example - Code Snippet

import pandas as pd
import numpy as np

inp = [{'ID_Code':1,'information 1':[10,22,44],'information 2':[1,0,1]},
       {'ID_Code':2,'information 1':[400,323],'information 2':[1,1]},
       {'ID_Code':2,'information 1':[243],'information 2':[0]},
       {'ID_Code':2,'information 1':[333,555],'information 2':[0]},
       {'ID_Code':3,'information 1':[12,27,43,54],'information 2':[1,0,1,1]},
       {'ID_Code':3,'information 1':[31,42,13,14],'information 2':[1,0,0,0]},
       {'ID_Code':3,'information 1':[14,24,34,14],'information 2':[1,0,1,1]},
       {'ID_Code':4,'information 1':[15,25,33,44],'information 2':[0,0,0,1]},
       {'ID_Code':5,'information 1':[12,12,13,14],'information 2':[1,1,1,0]},
       {'ID_Code':5,'information 1':[12,12,13,24],'information 2':[1,0,1,1]},
       {'ID_Code':5,'information 1':[21,22,23,14],'information 2':[1,1,1,1]},
       {'ID_Code':6,'information 1':[10,12,23,4],'information 2':[1,0,1,0]},
       {'ID_Code':7,'information 1':[112,212,143,124],'information 2':[0,0,0,0]},
       {'ID_Code':7,'information 1':[211,321],'information 2':[1]},
       {'ID_Code':7,'information 1':[431],'information 2':[1,0]},
       {'ID_Code':8,'information 1':[1,2,3,4],'information 2':[1,0,0,1]}]


df = pd.DataFrame(inp)

df1=df.groupby("ID_Code")["information 1"].apply(list).to_frame()
df2=df.groupby("ID_Code")["information 2"].apply(list).to_frame()
df3=pd.concat([df1, df2],axis=1, sort=False)

The Output

ID_Code    information 1                                            information 2
1          [[10, 22, 44]]                                           [[1, 0, 1]]
2          [[400, 323], [243], [333, 555]]                          [[1, 1], [0], [0]]
3          [[12, 27, 43, 54], [31, 42, 13, 14], [14, 24, 34, 14]]   [[1, 0, 1, 1], [1, 0, 0, 0], [1, 0, 1, 1]]
4          [[15, 25, 33, 44]]                                       [[0, 0, 0, 1]]
5          [[12, 12, 13, 14], [12, 12, 13, 24], [21, 22, 23, 14]]   [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 1, 1]]
6          [[10, 12, 23, 4]]                                        [[1, 0, 1, 0]]
7          [[112, 212, 143, 124], [211, 321], [431]]                [[0, 0, 0, 0], [1], [1, 0]]
8          [[1, 2, 3, 4]]                                           [[1, 0, 0, 1]]

Where ID_Code is no longer a column but the index. Which is the problem that I hadn't specified in my previous question.

The Task

With the given Dataframe "df3", to get rid of ID_Code = 1 and store its information in ID_Code = 3, and get rid of ID_Code = 5 and ID_Code = 7 and store that information in ID_Code = 2, so that the DataFrame looks like this:

ID_Code    information 1                                                                                                                       information 2
2          [[400, 323], [243], [333, 555], [12, 12, 13, 14], [12, 12, 13, 24], [21, 22, 23, 14], [112, 212, 143, 124], [211, 321], [431]]     [[1, 1], [0], [0], [1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [1], [1, 0]]
3          [[12, 27, 43, 54], [31, 42, 13, 14], [14, 24, 34, 14], [10, 22, 44]]                                                               [[1, 0, 1, 1], [1, 0, 0, 0], [1, 0, 1, 1], [1, 0, 1]]
4          [[15, 25, 33, 44]]                                                                                                                 [[0, 0, 0, 1]]
6          [[10, 12, 23, 4]]                                                                                                                  [[1, 0, 1, 0]]
8          [[1, 2, 3, 4]]                                                                                                                     [[1, 0, 0, 1]]

It would be a huge help, if someone could help me solve this.

Robert Redisch
  • 161
  • 1
  • 10
  • What's the pattern in selecting the ID and storing its information in another ID? – YOLO Jun 28 '20 at 10:00
  • Unfortunately there is no pattern, in my original Dataset I have ~15 "ID's" that need to be deleted and the information restored in ID's that are somewhat "related" to the ones deleted. I had to manually choose the most related ID's as well. That is why it needs to be a code, that can specifically choose a row to extract the information and delete and a row where the information has to be added into. – Robert Redisch Jun 28 '20 at 10:35
  • @RobertRedisch How do you select the row with ID_Code=1 to be added to the row with ID_Code=3? What is the logic behind it? – davidbilla Jun 28 '20 at 17:17
  • As I've written before, there is no rule which decides which ones have to be merged, I have to manually select it in my original dataset. The column, that I've called ID_Code here, has Codes that are related to each other. An example would be postcodes. You could imagine that I need to delete some postcodes for some reason, and I have to add the information to a postcode which is geographically close by, and therefore has to be selected by hand. In short: I need to be able to specifically choose a row to extract and delete the information, and a row where the information has to be added into. – Robert Redisch Jun 28 '20 at 18:11

1 Answers1

0

It worked for me with the answer given by Datanovice from the previous question with some changes to the indexing.

As the Question states, the problem lies in the 'ID_Code' being an Index rather than a column. So my solution involves adding a column with the unique ID_Code's. For that I found two possible approaches.

Solution 1

Using .unique() in combination with pd.Dataframe() as .unique() returns a numpy.ndarray which has to be converted to a Dataframe again.

df4 = pd.DataFrame(df['ID_Code'].unique(),columns=['ID_Code'],index=df['ID_Code'].unique())
df5 = pd.concat([df4,df3],axis=1)
col = 'ID_Code'
cond = [df5[col].eq(1),
       df5[col].isin([5,7])]

outputs = [3,2]

df5[col] = np.select(cond,outputs,default=df5[col])

df6 = df5.groupby(col).sum()

Solution 2

Using .reset_index() to move the ID_Code out of the index into a seperate column.

df3 = df3.reset_index()
col = 'ID_Code'
cond = [df3[col].eq(1),
       df3[col].isin([5,7])]

outputs = [3,2]

df3[col] = np.select(cond,outputs,default=df3[col])

df4 = df3.groupby(col).sum()
Robert Redisch
  • 161
  • 1
  • 10