I have a dataframe with bankmutations.
The dataframe contains a description column. In this column there are limitative field names with their content.
It looks like:
AAm.loc[0, ’OmsBank'] => ‘ fieldname1: content fn1 fieldname3: content fn3 ‘
AAm.loc[1, ’OmsBank'] => ‘ fieldname5: content fn5 fieldname2: content fn2 ‘
I did a lot of googling if there was a solution for this problem in general (breakdown a long string in fieldnames and content). Until now no simple solution found.
I could use double space as delimiter for splitting, but inside the fields content there are also double spaces. So this is not a vallid option. Tried to use split with a list/array, but also no luck.
Until now i have created a small second dataframe with the limitative fieldnames, starting position of fieldname and starting position of the content and end position of content.
I loop through the rows of the bankstatements, check which fieldnames are present and store the position of start label and start content in the small dataframe I sort the small dataframe by postion of fieldname, reset the index and then i can determine the end position of the content (start of the next fieldname). The last fieldname has no next fieldname. So there i have to store the length of the description string as end value. Then i can store the content of specific elements in the bankstatement dataframe.
I have the feeling that it can be done more ‘pythonic’ and efficient, vectorise, lambda functions?
If somebody has suggestions, i would highly appreciate this. Also a solution for this problem in general could help others.
My code until now:
import pandas as pd
# data
tosplit = ['SEPA Inc dl Incassant: blabla Naam: Insurance Machtiging: number Omschrijving: Rel.nr. number2 IBAN: iban#',
'SEPA Overboeking IBAN: iban# BIC: bic# Naam: blabla Omschrijving: Refund Kenmerk: refcode',
'SEPA iDEAL IBAN: iban# BIC: bic# Naam: seller Omschrijving: description double space Kenmerk: refcode',
'SEPA iDEAL IBAN: iban# BIC: bic# Naam: city Omschrijving: citytax']
AAm = pd.DataFrame(tosplit, columns= ['OmsBank'])
AAm['Naam'] = ''
AAm['Oms'] = ''
AAm
# field list
fld = [' IBAN: ', ' BIC: ', ' Naam: ', ' Omschrijving: ', ' Kenmerk: ', ' Referentie: ', ' Machtiging: ', ' Incassant: ']
fld.sort()
# store field list in dataframe and add columns for position
fld= pd.DataFrame(fld, columns = ['fldnm'] )
ser= [0]*len(fld)
fld = fld.assign(bgn = ser, bgc = ser, end = ser )
# loop through AAm (dataframe with bankmutations) and proces data
for i in AAm.index:
# search for fieldnames and store postion (begin of fieldname and begin of fieldcontent) if the are found
for j in fld.index:
temp = AAm.loc[i, 'OmsBank'].find(fld.loc[j,'fldnm'])
if temp == -1:
temp2 = -1
else:
temp2 = temp+len(fld.loc[j,'fldnm'])
fld.loc[j,'bgn'] = temp
fld.loc[j,'bgc'] = temp2
# order by postion and reset index
fld = fld.sort_values(by ='bgc')
fld.reset_index(drop=True, inplace = True)
# establish end of content position. For the last one it is length general string
for j in fld.index:
if (fld.loc[j, 'bgn']>0) & (j < len(fld)-1):
fld.loc[j, 'end'] = fld.loc[(j+1), 'bgn']
fld.loc[len(fld)-1, 'end'] = len(AAm.loc[i, 'OmsBank'])
# store start/end of relevant field content in AAm
nm_om = fld[(fld['fldnm'] == ' Naam: ') | (fld['fldnm'] == ' Omschrijving: ')]
nm_om.reset_index(drop=True, inplace = True)
AAm.loc[i,'Naam'] = AAm.loc[i,'OmsBank'][nm_om.loc[0, 'bgc']:nm_om.loc[0, 'end']].strip()
AAm.loc[i,'Oms'] = AAm.loc[i,'OmsBank'][nm_om.loc[1, 'bgc']:nm_om.loc[1, 'end']].strip()
#reset values in fieldlist
fld[['bgn', 'bgc', 'end']]=0