I have about 30,000 bank names in a DataFrame. I would like to group them under a base name, since most of them are the same bank and differ only by location. However, I do not know in advance which bank names are in the data.

Below is a subset of the dataset. From this subset I can identify two banks, ROYAL BANK and BARCLAYS, so I would like to get two groups:

ROYAL BANK (count: 13), BARCLAYS (count: 7)

ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
THE ROYAL BANK OF SCOTLAND PLC
THE ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF CANADA CAYMAN ISLANDS
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL, LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL LTD.
ROYAL BANK OF SCOTLAND, N.V.
RBC ROYAL BANK (BAHAMAS), LTD.
ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF SCOTLAND PLC
BARCLAYS BANK PLC
BARCLAYS BANK DELAWARE
BARCLAYS BANK OF GHANA, LTD.
BARCLAYS BANK DELAWARE
BARCLAYCARD GERMANY
BARCLAYS BANK PLC
BARCLAYS BANK PLC

There are other banks with a similar pattern as well. I would like a generalized method that identifies the list of unique groups (base bank names) and assigns each similar name to one of these groups.

VKB
    Just because they have similar names doesn't mean they are the same bank - there are a lot of 'royal banks' that have nothing to do with one another. You need to define precise grouping rules (i.e. what is similar and what is not...) if you want them grouped. – zwer Apr 06 '18 at 00:24

1 Answer

Do you want something like this?

[ ROYAL BANK ]
ROYAL BANK OF CANADA
ROYAL BANK OF CANADA
THE ROYAL BANK OF SCOTLAND PLC
THE ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF CANADA CAYMAN ISLANDS
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
RBC ROYAL BANK (TRINIDAD AND TOBAGO), LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL, LTD.
THE ROYAL BANK OF SCOTLAND INTERNATIONAL LTD.
ROYAL BANK OF SCOTLAND, N.V.
RBC ROYAL BANK (BAHAMAS), LTD.
ROYAL BANK OF SCOTLAND PLC
ROYAL BANK OF SCOTLAND PLC

[ BARCLAY ]
BARCLAYS BANK PLC
BARCLAYS BANK DELAWARE
BARCLAYS BANK OF GHANA, LTD.
BARCLAYS BANK DELAWARE
BARCLAYCARD GERMANY
BARCLAYS BANK PLC
BARCLAYS BANK PLC

The regex used is:

(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$

In this regex, the full matched bank name is captured in group 1, and the detected keyword (ROYAL BANK or BARCLAY) is captured in group 2 or group 3, so that a Python script can use them to classify banks by name.

The following Python script illustrates the basic idea behind the name classification you are asking for.
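To make the capture groups concrete, here is a minimal sketch that applies the regex above to two names from the sample, one at a time. The variable names are only illustrative.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r'(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$')

royal = pattern.search("THE ROYAL BANK OF SCOTLAND PLC")
barclay = pattern.search("BARCLAYCARD GERMANY")

# Group 1 is the full name; group 2 or group 3 holds whichever keyword matched.
print(royal.group(1), "|", royal.group(2), "|", royal.group(3))
# THE ROYAL BANK OF SCOTLAND PLC | ROYAL BANK | None
print(barclay.group(1), "|", barclay.group(2), "|", barclay.group(3))
# BARCLAYCARD GERMANY | None | BARCLAY
```

With `search`, a group that did not take part in the match is `None`; with `findall`, the same group comes back as an empty string.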

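import re

# Paste the sample bank names here, one per line.
ss = """ copy & paste sample text in this area """

royalbank = []
barclay = []

# Group 1: full bank name; group 2: "ROYAL BANK" keyword; group 3: "BARCLAY" keyword.
# Note: \s also matches newlines, so a line without either keyword
# can be pulled into the match that starts on the following line.
regx = re.compile(r'(?m)^\s*([A-Z\s]*?(?:(ROYAL BANK)|(BARCLAY)).*)$')

# findall returns one (name, royal_kw, barclay_kw) tuple per match;
# a group that did not participate is returned as an empty string.
for name, royal_kw, barclay_kw in regx.findall(ss):
    if royal_kw:
        royalbank.append(name)
    elif barclay_kw:
        barclay.append(name)

print("\n[ ROYAL BANK ]")
for e in royalbank:
    print(e)
print("\n[ BARCLAY ]")
for e in barclay:
    print(e)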
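
The same idea can be extended beyond two hard-coded lists. The following is only a sketch, assuming you can supply the keywords yourself: `group_by_keyword` is a hypothetical helper (not part of any library), "SOME OTHER BANK" is a made-up name added to show the unmatched case, and the code does not discover keywords on its own.

```python
import re

def group_by_keyword(names, keywords):
    """Assign each name to the first keyword found in it (hypothetical helper)."""
    # re.escape keeps any punctuation in a keyword from being read as regex syntax.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    groups = {k: [] for k in keywords}
    unmatched = []
    for name in names:
        m = pattern.search(name)
        if m:
            groups[m.group(0)].append(name)
        else:
            unmatched.append(name)
    return groups, unmatched

names = [
    "ROYAL BANK OF CANADA",
    "THE ROYAL BANK OF SCOTLAND PLC",
    "RBC ROYAL BANK (BAHAMAS), LTD.",
    "BARCLAYS BANK PLC",
    "BARCLAYCARD GERMANY",
    "SOME OTHER BANK",  # made-up name, not from the sample data
]

groups, unmatched = group_by_keyword(names, ["ROYAL BANK", "BARCLAY"])
for keyword, members in groups.items():
    print(f"\n[ {keyword} ] (count: {len(members)})")
    for name in members:
        print(name)
print("\nUnmatched:", unmatched)
```

As the comment on the question points out, a shared keyword does not prove two names belong to the same bank, so the keyword list is where you would encode your own grouping rules.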
Thm Lee