I'm trying to find the similarity between columns of a dataframe in Python. Can I get the similarity as a percentage, or as a value between 0 and 1?

I found a VLOOKUP alternative in Python for the case where I know which column to join on (ref: vlookup in Pandas using join). However, I don't know which column of the second data frame will contain the match, so I want to run the lookup against every column in the second data frame and measure the similarity.

df.merge(df1, on='id', how='left')

Ex 1:

id   name      flag
128  shyam     T
129  ram       F
130  alex      F
131  chinming  F
132  jose      T
133  khader    T
133 khader  T

Ex 2:

ex_id  hig
129    FULL
130    LOW
133    MID

Ex 3:

c_id  loc
129   hy
132   tx
134   ca

I'm not sure which column to join on with either of the two data frames above (Ex 2 and Ex 3), but I want to find a relationship or similarity between their columns and the columns of Ex 1.

martineau
Punith Raj
  • Hey! Can I ask, are you trying to join two dataframes based on the similarity of their columns, or are you trying to find the similarity of two columns in a single dataframe? – the_good_pony Oct 22 '19 at 16:01
  • Hey! I am trying to find the similarity between columns across data frames. In the example above, I want to figure out whether "id" can be mapped to "ex_id" and "c_id" based on the similarity of their values (ideally in the range 0 to 1). PS: Many-to-many comparisons have to be performed. Thanks – Punith Raj Oct 23 '19 at 04:49

1 Answer

Assuming you're looking to compare the similarity of two columns in a single dataframe, you can do something like this using spaCy.

Import the required packages

import pandas as pd 
import spacy

import en_core_web_sm
nlp = en_core_web_sm.load() 

Create example dataframe

df = pd.DataFrame({
    "A": ["Cat", "Puppy", "Small Fish"],
    "B": ["Cat", "Dog", "Fish"],
    "C": ["Kitten", "Pikachu", "Large Goldfish"],
    "D": ["Lion", "Charmander", "Goldfish"]})

Create a function to compare two strings for similarity

def get_similarity(term1, term2):
    # Parse each term separately so that multi-word values such as
    # "Small Fish" are compared as a whole, not just by their first tokens
    doc1, doc2 = nlp(term1), nlp(term2)
    score = doc1.similarity(doc2)

    print(doc1.text, "|", doc2.text, score)

    return score

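Apply the function row by row to create a new column holding the similarity score between columns A and B. Note that en_core_web_sm ships without word vectors, so spaCy will warn that the scores are less reliable; a model with vectors, such as en_core_web_md, should give more meaningful results.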

df['A_B_similarity'] = df.apply(lambda x: get_similarity(x['A'], x['B']), axis = 1)
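
If installing spaCy is not an option, a character-level score from the standard library can stand in for the semantic one. The sketch below swaps in `difflib.SequenceMatcher`, which is a different technique from the spaCy similarity above: it measures how many characters two strings share, not how close their meanings are. The helper name `string_similarity` is made up for illustration.

```python
from difflib import SequenceMatcher

def string_similarity(term1, term2):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    # Lower-case both terms so that "Cat" and "cat" count as identical
    return SequenceMatcher(None, term1.lower(), term2.lower()).ratio()

# The same value pairs as columns A and B of the example dataframe
rows = [("Cat", "Cat"), ("Puppy", "Dog"), ("Small Fish", "Fish")]
scores = [string_similarity(a, b) for a, b in rows]

for (a, b), score in zip(rows, scores):
    print(a, "|", b, round(score, 3))
```

The function takes two strings, so it should drop into the same `df.apply(..., axis=1)` call in place of `get_similarity`. Unlike the spaCy score, it rates "Puppy" and "Dog" as completely dissimilar because they share no characters.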

You end up with a dataframe that has an extra A_B_similarity column holding the score for each row.

For more information, see the spaCy documentation on vectors and similarity (vectors-similarity).
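
The approach above scores pairs of strings within one dataframe. For the case clarified in the question comments (finding out that "id" can be mapped to "ex_id" and "c_id"), one possible approach is to score every pair of columns by how many distinct values they share. The following is a minimal sketch under that assumption, not a definitive method. The helpers `jaccard`, `containment` and `compare_columns` are hypothetical names, and the columns are written as plain dicts of lists mirroring Ex 1 to Ex 3 so the snippet runs without pandas.

```python
from itertools import product

def jaccard(a, b):
    """Shared distinct values divided by all distinct values (0.0 to 1.0)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def containment(a, b):
    """Fraction of the distinct values in b that also appear in a."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sb) if sb else 0.0

def compare_columns(left, right, metric=jaccard):
    """Score every column of `left` against every column of `right`."""
    return {(l, r): metric(left[l], right[r]) for l, r in product(left, right)}

# Ex 1, Ex 2 and Ex 3 from the question, as column name -> list of values
ex1 = {
    "id": [128, 129, 130, 131, 132, 133],
    "name": ["shyam", "ram", "alex", "chinming", "jose", "khader"],
    "flag": ["T", "F", "F", "F", "T", "T"],
}
ex2 = {"ex_id": [129, 130, 133], "hig": ["FULL", "LOW", "MID"]}
ex3 = {"c_id": [129, 132, 134], "loc": ["hy", "tx", "ca"]}

scores_ex2 = compare_columns(ex1, ex2)
scores_ex3 = compare_columns(ex1, ex3, metric=containment)

print(scores_ex2[("id", "ex_id")])  # 3 shared of 6 distinct values
print(scores_ex3[("id", "c_id")])   # 2 of the 3 c_id values appear in id
```

Because iterating a pandas DataFrame yields its column labels and `set()` accepts a Series, passing the dataframes themselves to `compare_columns` should work the same way. Jaccard penalises a large size difference between the columns, while containment answers "how much of the smaller column is found in the larger one", which may be closer to what a lookup needs.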

the_good_pony