Slicing pandas dataframe based on rearranged duplicates (or how to drop rearranged duplicates)

Question

I have a big dataframe with the format

   col1    col2    val1    val2
[0]A       B       0.8     0.1
[1]B       A       0.8     0.1
[2]A       C       0.3     0.9
[3]A       D       0.2     0.8
[4]D       A       0.2     0.8

As you can see, some rows are duplicated with only col1 and col2 reversed. For instance, row 1 is duplicate of row 0, and row 4 is duplicate of row 3. Can you drop the duplicates based on the match of col1:col2 being equal to those of col2:col1?

score 1 · Accepted Answer · answered Dec 10 '19 at 13:51

You can sorting both columns by np.sort and assign back, then use DataFrame.drop_duplicates with specify some columns:

df[['col1','col2']] = np.sort(df[['col1','col2']], axis=1)
df1 = df.drop_duplicates(['col1','col2'])
print (df1)
  col1 col2  val1  val2
0    A    B   0.8   0.1
2    A    C   0.3   0.9
3    A    D   0.2   0.8

Of remove duplicates by all columns:

df2 = df.drop_duplicates()
print (df2)
  col1 col2  val1  val2
0    A    B   0.8   0.1
2    A    C   0.3   0.9
3    A    D   0.2   0.8

Slicing pandas dataframe based on rearranged duplicates (or how to drop rearranged duplicates)

1 Answers1