Merging two unequal length dfs with matching subset of index

Question

Description: I have two data frames foreign (having options in English and its Urdu translation) and native,having German version of options, Lengths of both DFs is unequal the native is much bigger than foreign and carries complete list of options. I want to merge them two using idx column so that native df searches for each entry in its idx in the foreign; if idx is matching than its adds two columns namely: option_eng and option_urdu otherwise fill it with NAN

I have two dataframs
foreign:
idx                    option_eng                    option_urdu
1.1.01-001   This is english of option1             urdu translation of option1
1.1.01-001   This is english of option2             urdu translation of option2
1.1.01-001   This is english of option3             urdu translation of option3 
1.1.01-002   This is english of option1             urdu translation of option1
1.1.01-002   This is english of option2             urdu translation of option2
1.1.01-002   This is english of option3              Urdu translation of option3
1.1.01-003   This is english of option1              urdu translation of option1
1.1.01-003   This is english of option2              urdu translation of option2
1.1.01-004   This is english of option1              urdu translation of option1
1.1.01-004   This is english of option2               urdu translation of option2

Native:>> a unequal to df1
  idx    sort_code  iscorrect   count                option_de
1.1.01-001  1          1          1              German tranlation option3
1.1.01-001  2          1          2              German tranlation option2
1.1.01-001  3          0          3              German tranlation option1
1.1.01-002  1          1          4              German tranlation option3
1.1.01-002  2          1          5             German tranlation option2
1.1.01-002  3          1          6             German tranlation option1
1.1.01-003  1          1          7             German tranlation option3
1.1.01-003  2          1          8             German tranlation option2
1.1.01-003  3          0          9             German tranlation option1
1.1.01-012  1          1          10             German tranlation option3
1.1.01-012  2          1          11             German tranlation option2
1.1.01-012  3          1          12             German tranlation option1
1.1.01-101  1          1          13             German tranlation option3
1.1.01-101  2          1          14             German tranlation option2

My desired output should look like

idx               option_de                       option_eng           option_urdu
1.1.01-001  German tranlation option3   This is english of option3  This is urdu of option3
1.1.01-001  German tranlation option2   This is english of option2  This is urdu of option2
1.1.01-001  German tranlation option1   This is english of option1  This is urdu of option1
1.1.01-002  German tranlation option3   This is english of option3  This is urdu of option3
1.1.01-002  German tranlation option2   This is english of option2  This is urdu of option2
1.1.01-002  German tranlation option1   This is english of option1  This is urdu of option1
1.1.01-003  German tranlation option3   This is english of option3  This is urdu of option3
1.1.01-003  German tranlation option2   This is english of option2  This is urdu of option2
1.1.01-003  German tranlation option1   This is english of option1  This is urdu of option1
1.1.01-012  German tranlation option3     na                                  na
1.1.01-012  German tranlation option2     na                                  na
1.1.01-012  German tranlation option1     na                                  na
1.1.01-101  German tranlation option3     na                                  na
1.1.01-101  German tranlation option2     na                                  na

Note Please note that the df2 the German option are sorted inversely like option 3 ,2,1 and in df1 the option_en and option_urdu are in opposite order I want to match them.

What I have tried?

I have tried different options.native.merge(foreign,how='left',left_on='idx',right_on='idx')

Merged always gives me duplicated rows of idxs of both df (like every idx 3 times) I have also tries pd.combine_first but its changes the orders of options.

I have tried it with native['key']=native.groupby(["code"]).cumcount()

source code to reproduce it

Foreigndic={'idx': {0: '1.1.01-001', 1: '1.1.01-001', 2: '1.1.01-001', 3: '1.1.01-002', 4: '1.1.01-002', 5: '1.1.01-002', 6: '1.1.01-003', 7: '1.1.01-003', 8: '1.1.01-004', 9: '1.1.01-004'}, 'option_eng': {0: ' Stopping as a precaution at every crossroads', 1: ' Not insisting on your rights', 2: " Allowing for other people's mistakes", 3: ' Inattentiveness', 4: ' Driving too close behind the vehicle in front', 5: ' Unexpectedly heavy braking', 6: '  respond as soon as possible to probable changes in the road traffic situation', 7: '  attempt to identify the intentions of other road users, as soon as possible', 8: ' - of its braking characteristics', 9: ' - of its manoeuvrability'}, 'option_urdu': {0: 'احتیاط کے طور پر ہر دوراہے پر رکنا', 1: 'اپنے حقوق پر اصرار نہیں کرنا', 2: 'دوسرے لوگوں کی غلطیوں کی اجازت دینا', 3: 'غفلت', 4: 'اگلی گاڑی سے مناسب فاصلہ نہ رکھنا', 5: 'غیر متوقع طور پر اگلی گاڑی کا اچانک بریک لگا دینا', 6: 'سڑک پر موجود ٹریفک کی صورتحال میں ممکنہ تبدیلیوں کا جلد سے جلد جواب دینا', 7: 'جتنی جلدی ممکن ہو سڑک کے دوسرے صارفین کے ارادوں کی نشاندہی کرنے کی کوشش کریں', 8: ' اس کی بریک خصوصیات', 9: 'اس کے سسٹم کی جانکاری'}}
Nativedict={'code': {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 1, 7: 2, 8: 3, 9: 1, 10: 2, 11: 3, 12: 1, 13: 2, 14: 3}, 'ans_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15}, 'iscorrect': {0: 1, 1: 1, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1}, 'ans_id_txt': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15}, 'text': {0: 'Nicht auf dem eigenen Recht bestehen', 1: 'Mit Fehlern anderer rechnen', 2: 'Vorsorglich an jeder Kreuzung anhalten', 3: 'Unerwartet starkes Bremsen', 4: 'Unaufmerksamkeit', 5: 'Zu dichtes Auffahren', 6: 'Ich reagiere möglichst frühzeitig auf wahrscheinliche Veränderungen der Verkehrssituation', 7: 'Ich versuche, die Absichten anderer Verkehrsteilnehmer möglichst frühzeitig zu erkennen', 8: 'Ich beschränke die Verkehrsbeobachtung möglichst auf das direkt vor mir fahrende Fahrzeug', 9: 'Ich weiß möglicherweise zu wenig über die Besonderheiten in der Anordnung und der Funktion der Bedienelemente', 10: 'Ich weiß möglicherweise zu wenig über die Besonderheiten im Bremsverhalten', 11: 'Ich weiß möglicherweise zu wenig über die Besonderheiten im Lenkverhalten', 12: 'Der Vorausfahrende bremst unerwartet', 13: 'Der Vorausfahrende betätigt den Blinker vor dem Abbiegen zu spät', 14: 'Der Vorausfahrende hält unerwartet an, um nach dem Weg zu fragen'}, 'ques_id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 5, 13: 5, 14: 5}, 'idx': {0: '1.1.01-001', 1: '1.1.01-001', 2: '1.1.01-001', 3: '1.1.01-002', 4: '1.1.01-002', 5: '1.1.01-002', 6: '1.1.01-003', 7: '1.1.01-003', 8: '1.1.01-003', 9: '1.1.01-004', 10: '1.1.01-004', 11: '1.1.01-004', 12: '1.1.01-101', 13: '1.1.01-101', 14: '1.1.01-101'}}

if you want to merge without duplicates you'll need to eliminate dulicates in one of the dataframes OR merge on more keys to create a unique row per key. — Umar.H, Apr 30 '21 at 13:40
The problem is, that it is unclear which foreign // native row pair is valid. In your expected output it seems like there is only 1 allowed valid pair possible. But from you provided data it is imposible to determine, how this pairing works. — Andreas, Apr 30 '21 at 13:44
I think this is a dupe of [Pandas Merge 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) imo. — Umar.H, Apr 30 '21 at 13:49

score 1 · Answer 1 · edited Apr 30 '21 at 14:53

1

Assuming that you don't need to sort the tables, pd.concat() with axis=1 as an argument will do the trick.

pd.concat([
    native[['idx', 'option_de']], 
    foreign[['option_eng', 'option_urdu']]
], axis=1)

That worked for me

native['key']=native.groupby(["code"]).cumcount()
native= native.sort_values(['key','code'],ascending=[True,False]).drop('key',1).reset_index()

native=native.rename(columns={'idx.1':'idx'})
mrg=pd.concat([native[['idx', 'text','iscorrect']],foreign[['option_eng', 'option_urdu']]], axis=1)

edited Apr 30 '21 at 14:53

Faisal

151
3
10

answered Apr 30 '21 at 14:03

Gusti Adli

1,225
4
13

In this way the order of matching index cannot be preserved. It is just blink addition of two extra columns, but I want to match based on index that is common in both dfs – Faisal Apr 30 '21 at 15:35

score 1 · Accepted Answer · answered Apr 30 '21 at 14:12

Try creating a unique index for each idx group by adding an order element. Then merge together on the idx and the order column.

import pandas as pd

foreign = pd.DataFrame({'idx': {0: '1.1.01-001', 1: '1.1.01-001',
                                2: '1.1.01-001', 3: '1.1.01-002',
                                4: '1.1.01-002', 5: '1.1.01-002',
                                6: '1.1.01-003', 7: '1.1.01-003',
                                8: '1.1.01-004', 9: '1.1.01-004'},
                        'option_eng': {
                            0: ' Stopping as a precaution at every crossroads',
                            1: ' Not insisting on your rights',
                            2: " Allowing for other people's mistakes",
                            3: ' Inattentiveness',
                            4: ' Driving too close behind the vehicle in front',
                            5: ' Unexpectedly heavy braking',
                            6: '  respond as soon as possible to probable changes in the road traffic situation',
                            7: '  attempt to identify the intentions of other road users, as soon as possible',
                            8: ' - of its braking characteristics',
                            9: ' - of its manoeuvrability'
                        },
                        'option_urdu': {
                            0: 'احتیاط کے طور پر ہر دوراہے پر رکنا',
                            1: 'اپنے حقوق پر اصرار نہیں کرنا',
                            2: 'دوسرے لوگوں کی غلطیوں کی اجازت دینا',
                            3: 'غفلت',
                            4: 'اگلی گاڑی سے مناسب فاصلہ نہ رکھنا',
                            5: 'غیر متوقع طور پر اگلی گاڑی کا اچانک بریک لگا دینا',
                            6: 'سڑک پر موجود ٹریفک کی صورتحال میں ممکنہ تبدیلیوں کا جلد سے جلد جواب دینا',
                            7: 'جتنی جلدی ممکن ہو سڑک کے دوسرے صارفین کے ارادوں کی نشاندہی کرنے کی کوشش کریں',
                            8: ' اس کی بریک خصوصیات',
                            9: 'اس کے سسٹم کی جانکاری'}
                        })

native = pd.DataFrame(
    {'code': {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 1, 7: 2,
              8: 3, 9: 1, 10: 2, 11: 3, 12: 1, 13: 2, 14: 3},
     'ans_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8,
                8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15},
     'iscorrect': {0: 1, 1: 1, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1,
                   8: 0, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1},
     'ans_id_txt': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8,
                    8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14,
                    14: 15},
     'text': {0: 'Nicht auf dem eigenen Recht bestehen',
              1: 'Mit Fehlern anderer rechnen',
              2: 'Vorsorglich an jeder Kreuzung anhalten',
              3: 'Unerwartet starkes Bremsen',
              4: 'Unaufmerksamkeit',
              5: 'Zu dichtes Auffahren',
              6: 'Ich reagiere möglichst frühzeitig auf '
                 'wahrscheinliche Veränderungen der Verkehrssituation',
              7: 'Ich versuche, die Absichten anderer '
                 'Verkehrsteilnehmer möglichst frühzeitig zu erkennen',
              8: 'Ich beschränke die Verkehrsbeobachtung '
                 'möglichst auf das direkt vor mir fahrende Fahrzeug',
              9: 'Ich weiß möglicherweise zu wenig über die Besonderheiten '
                 'in der Anordnung und der Funktion der Bedienelemente',
              10: 'Ich weiß möglicherweise zu wenig über '
                  'die Besonderheiten im Bremsverhalten',
              11: 'Ich weiß möglicherweise zu wenig über '
                  'die Besonderheiten im Lenkverhalten',
              12: 'Der Vorausfahrende bremst unerwartet',
              13: 'Der Vorausfahrende betätigt den Blinker vor dem Abbiegen zu spät',
              14: 'Der Vorausfahrende hält unerwartet an, um nach dem Weg zu fragen'},
     'ques_id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3,
                 8: 3, 9: 4, 10: 4, 11: 4, 12: 5, 13: 5, 14: 5},
     'idx': {0: '1.1.01-001', 1: '1.1.01-001', 2: '1.1.01-001',
             3: '1.1.01-002', 4: '1.1.01-002', 5: '1.1.01-002',
             6: '1.1.01-003', 7: '1.1.01-003', 8: '1.1.01-003',
             9: '1.1.01-004', 10: '1.1.01-004', 11: '1.1.01-004',
             12: '1.1.01-101', 13: '1.1.01-101', 14: '1.1.01-101'}}
)

# Grab Columns and add key for group order (level_1)
foreign = foreign[['idx', 'option_eng', 'option_urdu']].groupby('idx') \
    .apply(lambda x: x.reset_index(drop=True)) \
    .drop(columns=['idx']) \
    .reset_index()
native = native[['idx', 'text']].groupby('idx') \
    .apply(lambda x: x.reset_index(drop=True)) \
    .drop(columns=['idx']) \
    .reset_index()

# Merge Together on idx and level_1
merged = native.merge(foreign, on=['idx', 'level_1'], how='outer') \
    .drop(columns='level_1') \
    .fillna('na')

print(merged.to_string(index=False))

idx	text	option_eng	option_urdu
1.1.01-001	Nicht auf dem eigenen Recht bestehen	Stopping as a precaution at every crossroads	احتیاط کے طور پر ہر دوراہے پر رکنا
1.1.01-001	Mit Fehlern anderer rechnen	Not insisting on your rights	اپنے حقوق پر اصرار نہیں کرنا
1.1.01-001	Vorsorglich an jeder Kreuzung anhalten	Allowing for other people's mistakes	دوسرے لوگوں کی غلطیوں کی اجازت دینا
1.1.01-002	Unerwartet starkes Bremsen	Inattentiveness	غفلت
1.1.01-002	Unaufmerksamkeit	Driving too close behind the vehicle in front	اگلی گاڑی سے مناسب فاصلہ نہ رکھنا
1.1.01-002	Zu dichtes Auffahren	Unexpectedly heavy braking	غیر متوقع طور پر اگلی گاڑی کا اچانک بریک لگا دینا
1.1.01-003	Ich reagiere möglichst frühzeitig auf wahrscheinliche Veränderungen der Verkehrssituation	respond as soon as possible to probable changes in the road traffic situation	سڑک پر موجود ٹریفک کی صورتحال میں ممکنہ تبدیلیوں کا جلد سے جلد جواب دینا
1.1.01-003	Ich versuche, die Absichten anderer Verkehrsteilnehmer möglichst frühzeitig zu erkennen	attempt to identify the intentions of other road users, as soon as possible	جتنی جلدی ممکن ہو سڑک کے دوسرے صارفین کے ارادوں کی نشاندہی کرنے کی کوشش کریں
1.1.01-003	Ich beschränke die Verkehrsbeobachtung möglichst auf das direkt vor mir fahrende Fahrzeug	na	na
1.1.01-004	Ich weiß möglicherweise zu wenig über die Besonderheiten in der Anordnung und der Funktion der Bedienelemente	- of its braking characteristics	اس کی بریک خصوصیات
1.1.01-004	Ich weiß möglicherweise zu wenig über die Besonderheiten im Bremsverhalten	- of its manoeuvrability	اس کے سسٹم کی جانکاری
1.1.01-004	Ich weiß möglicherweise zu wenig über die Besonderheiten im Lenkverhalten	na	na
1.1.01-101	Der Vorausfahrende bremst unerwartet	na	na
1.1.01-101	Der Vorausfahrende betätigt den Blinker vor dem Abbiegen zu spät	na	na
1.1.01-101	Der Vorausfahrende hält unerwartet an, um nach dem Weg zu fragen	na	na

Aside: Your provided data does not appear to align correctly. Foreign: 1.1.01-002,0,Inattentiveness is the first row of the 1.1.01-002 group, but that aligns with Native: 1.1.01-002,0,Unerwartet starkes Bremsen which is incorrect. It should align with the second row 1.1.01-002,1,Unaufmerksamkeit. But assuming you have translations in the correct order in both tables this should work.

Yes Actually that is the problem. We can change the order with `code` that is `sort_code` 1,2,3, but I need to inverse it to match it with `eng` and `Urdu translation` I have use this method to rearrange . `native['key']=native.groupby(["code"]).cumcount() #print(native.head(100)) #native= native.sort_values(['key','code'],ascending=[True,False]).drop('key',1)#.reset_index()` the problem is when I merge it automatically change the order — Faisal, Apr 30 '21 at 14:20
The alignment is still not consistent. For group `1.1.01-001` the mapping from `native` to `foreign` is 0 -> 1, 1 -> 0, 2 -> 2. That seems like you have some kind of rotation in `foreign`. But `1.1.01-002` the mapping is 0 -> 0, 1-> 2, 2 -> 1. Which is a different kind of rotation in `foreign`. — Henry Ecker, Apr 30 '21 at 14:36

Merging two unequal length dfs with matching subset of index

2 Answers2