3 data frames and 3 rules in operation to insert data into another dataframe - No common columns - Big data

Question

I have three different DataFrames, which can be generated using the code below:

import numpy as np
import pandas as pd

data_file = pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European'],'Marital_status': ['Single','Married','Widowed'],'Smoke_status':['Yes','No','No']})
map_file = pd.DataFrame({'gender': ['1.Male','2. Female','3. Not disclosed'],'ethnicity': ['1.Chinese','2. Indian','3.European'],
              'Marital_status':['1.Single','2. Married','3 Widowed'],'Smoke_status':['1. Yes','2. No',np.nan]})
hash_file = pd.DataFrame({'keys':['gender','ethnicity','Marital_status','Smoke_status','Yes','No','Male','Female','Single','Married','Widowed','Chinese','Indian','European'],'values':[21,22,23,24,125,126,127,128,129,130,131,141,142,0]})

The output should go into another, initially empty DataFrame, which can be generated using the code below:

columns = ['person_id','obsid','valuenum','valuestring','valueid']
obs = pd.DataFrame(columns=columns)

What I am trying to achieve is shown in the table, which lists the rules and describes how the data is to be filled.

I tried a for-loop approach, but as soon as I unstack the data I lose the column names, and I am not sure how to proceed further.

a=1
for i in range(len(data_file)):
   df_temp = data_file[i:a]
   a=a+1
   df_temp=df_temp.unstack()
   df_temp = df_temp.to_frame().reset_index()
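
A note on the loop above: `unstack()` does not discard the column names. They end up in an unnamed index level, which `reset_index()` then labels generically (for example `level_0`). The following is a minimal sketch, using a cut-down version of the sample `data_file`, that reshapes all rows at once and names the levels explicitly. The names `variable` and `valuestring` are chosen here for illustration and are not part of the original attempt.

```python
import pandas as pd

# Cut-down sample: two of the question's value columns
data_file = pd.DataFrame({
    'person_id': [1, 2, 3],
    'gender': ['Male', 'Female', 'Not disclosed'],
    'Smoke_status': ['Yes', 'No', 'No'],
})

# unstack() on a DataFrame with a flat index returns a Series indexed by
# (column name, row label), so the column names survive as the first level.
long_form = (
    data_file.set_index('person_id')
             .unstack()
             .rename_axis(['variable', 'person_id'])  # label both index levels
             .reset_index(name='valuestring')
)
print(long_form)
```

This gives one row per person and column without a Python-level loop, which matters at 25k persons by 400 columns.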

How can I get my output DataFrame filled as shown below? (P.S. I have shown only person_id = 1 and 4 columns, but in the real data I have more than 25k persons and 400 columns for each person.) So an elegant and efficient approach, unlike my for loop, would be helpful.

Any other approach to do this is also helpful – The Great, Jun 12 '19 at 07:02

Chris Adams · Answer 1 · 2019-06-12T09:52:00.950

Here is an alternative approach using DataFrame.melt and Series.map:

# Solution for pandas V 0.24.0 +

columns = ['person_id','obsid','valuenum','valuestring','valueid']

# Create map Series
hash_map = hash_file.set_index('keys')['values']
value_map = map_file.stack().str.split(r'\.\s?', expand=True).set_index(1, append=True).droplevel(0)[0]

# Melt and add mapped columns
obs = data_file.melt(id_vars=['person_id'], value_name='valuestring')
obs['obsid'] = obs.variable.map(hash_map)
obs['valueid'] = obs.valuestring.map(hash_map).astype('Int64')
obs['valuenum'] = obs[['variable', 'valuestring']].apply(tuple, axis=1).map(value_map)

# Reindex and sort for desired output
obs.reindex(columns=columns).sort_values('person_id')

[out]

    person_id  obsid valuenum    valuestring  valueid
0           1     21        1           Male      127
3           1     22        1        Chinese      141
6           1     23        1         Single      129
9           1     24        1            Yes      125
1           2     21        2         Female      128
4           2     22        2         Indian      142
7           2     23        2        Married      130
10          2     24        2             No      126
2           3     21        3  Not disclosed      NaN
5           3     22        3       European        0
8           3     23        3        Widowed      131
11          3     24        2             No      126
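
The solution above is marked as needing pandas 0.24.0 or later because it calls `droplevel`. On an older version, one possible workaround is `reset_index(level=0, drop=True)`, which removes the same index level. The sketch below uses a cut-down `map_file` and adds an explicit `dropna()` after `stack()` (an addition made here so that the missing `Smoke_status` entry is skipped regardless of pandas version); it is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Cut-down sample of the question's map_file
map_file = pd.DataFrame({
    'gender': ['1.Male', '2. Female', '3. Not disclosed'],
    'Smoke_status': ['1. Yes', '2. No', np.nan],
})

# Column 0 holds the leading number, column 1 the label;
# the index is (row position, column name).
parts = map_file.stack().dropna().str.split(r'\.\s?', expand=True)

value_map = (
    parts.set_index(1, append=True)[0]
         .reset_index(level=0, drop=True)  # same effect as .droplevel(0)
)

# The map is keyed by (column name, label)
print(value_map.loc[('gender', 'Female')])
```

Note that the split pattern requires a dot after the number, so an entry written without one (such as `3 Widowed` in the question's sample) would not be split into number and label.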

edited Jun 12 '19 at 09:52

answered Jun 12 '19 at 07:23

Chris Adams


Will update the solution soon. Currently away for lunch – The Great Jun 12 '19 at 07:32
I get an error message like No attribute 'droplevel'. Can you help me understand what's the issue? – The Great Jun 12 '19 at 09:48
I am just using the same statement (value_map) – The Great Jun 12 '19 at 09:49
This is a new method added to `v0.24.0` of pandas.... check your version? `print(pd.__version__)` - maybe just needs an upgrade. You can run `pip install pandas -U` from a terminal if you need to upgrade – Chris Adams Jun 12 '19 at 09:50
I encountered another error as "InvalidIndexError: Reindexing only valid with uniquely valued Index objects" – The Great Jun 12 '19 at 10:06
That means there must be duplicates in your data - what does `print(columns)` show? – Chris Adams Jun 12 '19 at 10:06
Yeah, My dataset is huge so there might be duplicates but it is valid to have such records. – The Great Jun 12 '19 at 10:07
obs.variable.map(hash_map) - this is the line that causes error – The Great Jun 12 '19 at 10:08
try running `hash_map[hash_map.duplicated(keep=False)]` to see which values are causing issues – Chris Adams Jun 12 '19 at 10:09
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/194813/discussion-between-avles-and-chris-a). – The Great Jun 12 '19 at 10:09
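
On the `InvalidIndexError` discussed in the comments above: `Series.map` with a Series as the mapper needs that mapper's index to be unique, so the duplicates that matter are duplicate keys in `hash_map`, not duplicate values. `Series.duplicated` checks values, while `Index.duplicated` checks keys. The sketch below uses a made-up duplicate key (`'Yes'` with the invented value 999) to show one way to find such keys and, as one possible policy, keep the first value per key. Which value to keep is a decision about the data, not something the thread settles.

```python
import pandas as pd

# Hypothetical hash_file with a duplicated key ('Yes' appears twice)
hash_file = pd.DataFrame({
    'keys': ['gender', 'Yes', 'No', 'Yes'],
    'values': [21, 125, 126, 999],
})
hash_map = hash_file.set_index('keys')['values']

# Keys that occur more than once in the index
dupes = hash_map[hash_map.index.duplicated(keep=False)]
print(dupes)

# One possible policy: keep the first value seen for each key
hash_map_unique = hash_map[~hash_map.index.duplicated(keep='first')]

# With a unique index the lookup works; unknown labels map to NaN
obs = pd.DataFrame({'valuestring': ['Yes', 'No', 'Unknown']})
obs['valueid'] = obs['valuestring'].map(hash_map_unique)
print(obs)
```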

score 1 · Accepted Answer · answered Jun 13 '19 at 10:17

After the discussion in chat and removing the duplicate data, it is possible to use:

s = hash_file.set_index('VARIABLE')['concept_id']
df1 = map_file.melt().dropna(subset=['value'])
df1[['valueid','valuestring']] = df1.pop('value').str.extract(r'(\d+)\.(.+)')
df1['valuestring'] = df1['valuestring'].str.strip()

columns = ['studyid','obsid','valuenum','valuestring','valueid']
obs = data_file.melt('studyid', value_name='valuestring').sort_values('studyid')

#merge by 2 columns variable, valuestring
obs = (obs.merge(df1, on=['variable','valuestring'], how='left')
          .rename(columns={'valueid':'valuenum'}))
obs['obsid'] = obs['variable'].map(s)
obs['valueid'] = obs['valuestring'].map(s)

#map by only one column variable
s1 = df1.drop_duplicates('variable').set_index('variable')['valueid']
obs['valuenum_new'] = obs['variable'].map(s1)

obs = obs.reindex(columns + ['valuenum_new'], axis=1)
print(obs)

#compare number of non missing rows
print(len(obs.dropna(subset=['valuenum'])))
print(len(obs.dropna(subset=['valuenum_new'])))
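
The code above uses column names from the asker's real data (`VARIABLE`, `concept_id`, `studyid`). As a sketch of the same melt-and-merge idea against the sample frames from the question, assuming the `keys`/`values` columns of `hash_file` and `person_id` as the identifier, something like the following could be used. The frames are cut down to two value columns, and the intermediate names `long_map` and `parts` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data_file = pd.DataFrame({
    'person_id': [1, 2, 3],
    'gender': ['Male', 'Female', 'Not disclosed'],
    'Smoke_status': ['Yes', 'No', 'No'],
})
map_file = pd.DataFrame({
    'gender': ['1.Male', '2. Female', '3. Not disclosed'],
    'Smoke_status': ['1. Yes', '2. No', np.nan],
})
hash_file = pd.DataFrame({
    'keys': ['gender', 'Smoke_status', 'Yes', 'No', 'Male', 'Female'],
    'values': [21, 24, 125, 126, 127, 128],
})

s = hash_file.set_index('keys')['values']

# Long-format lookup table: one row per (variable, label) with its number
long_map = map_file.melt().dropna(subset=['value'])
parts = long_map['value'].str.extract(r'(\d+)\.(.+)')
df1 = pd.DataFrame({
    'variable': long_map['variable'],
    'valuenum': parts[0],               # kept as strings, as in the answer
    'valuestring': parts[1].str.strip(),
})

# Melt the data, then attach valuenum by (variable, valuestring)
obs = data_file.melt('person_id', value_name='valuestring')
obs = obs.merge(df1, on=['variable', 'valuestring'], how='left')
obs['obsid'] = obs['variable'].map(s)
obs['valueid'] = obs['valuestring'].map(s)  # NaN when the label has no key

columns = ['person_id', 'obsid', 'valuenum', 'valuestring', 'valueid']
obs = (obs.reindex(columns=columns)
          .sort_values('person_id', kind='stable')
          .reset_index(drop=True))
print(obs)
```

Here `valuenum` comes out as strings because it is extracted with a regular expression; it can be converted afterwards with `pd.to_numeric` if numeric values are needed.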

Hello Jezrael, Can you please help me with this post? https://stackoverflow.com/questions/56807104/wide-to-long-returns-empty-output-python-dataframe — The Great, Jun 28 '19 at 13:10
