split string no delimiter with limitative field names and content

Question

I have a dataframe with bankmutations.
The dataframe contains a description column. In this column there are limitative field names with their content. It looks like:

AAm.loc[0, ’OmsBank'] => ‘  fieldname1: content fn1   fieldname3: content fn3     ‘
AAm.loc[1, ’OmsBank'] => ‘  fieldname5: content fn5   fieldname2: content fn2     ‘

I did a lot of googling if there was a solution for this problem in general (breakdown a long string in fieldnames and content). Until now no simple solution found.

I could use double space as delimiter for splitting, but inside the fields content there are also double spaces. So this is not a vallid option. Tried to use split with a list/array, but also no luck.

Until now i have created a small second dataframe with the limitative fieldnames, starting position of fieldname and starting position of the content and end position of content.

I loop through the rows of the bankstatements, check which fieldnames are present and store the position of start label and start content in the small dataframe I sort the small dataframe by postion of fieldname, reset the index and then i can determine the end position of the content (start of the next fieldname). The last fieldname has no next fieldname. So there i have to store the length of the description string as end value. Then i can store the content of specific elements in the bankstatement dataframe.

I have the feeling that it can be done more ‘pythonic’ and efficient, vectorise, lambda functions?
If somebody has suggestions, i would highly appreciate this. Also a solution for this problem in general could help others.

My code until now:

import pandas as pd
# data
tosplit = ['SEPA Inc dl Incassant: blabla  Naam: Insurance    Machtiging: number  Omschrijving: Rel.nr. number2    IBAN: iban#', 
          'SEPA Overboeking  IBAN: iban#  BIC: bic#  Naam: blabla  Omschrijving: Refund  Kenmerk: refcode', 
          'SEPA iDEAL  IBAN: iban#  BIC: bic#  Naam: seller  Omschrijving: description  double space  Kenmerk: refcode',
          'SEPA iDEAL  IBAN: iban#  BIC: bic#  Naam: city  Omschrijving: citytax']
AAm = pd.DataFrame(tosplit, columns= ['OmsBank'])
AAm['Naam'] = ''
AAm['Oms'] = ''
AAm
# field list
fld = [' IBAN: ', ' BIC: ', ' Naam: ', ' Omschrijving: ', ' Kenmerk: ', ' Referentie: ', ' Machtiging: ', ' Incassant: ']
fld.sort()

# store field list in dataframe and add columns for position
fld= pd.DataFrame(fld, columns = ['fldnm'] )
ser= [0]*len(fld)
fld = fld.assign(bgn = ser, bgc = ser, end = ser )
# loop through AAm (dataframe with bankmutations) and proces data
for i in AAm.index:

    # search for fieldnames and store postion (begin of fieldname and begin of fieldcontent) if the are found
    for j in fld.index:
        temp = AAm.loc[i, 'OmsBank'].find(fld.loc[j,'fldnm'])
        if temp == -1: 
            temp2 = -1 
        else:
            temp2 = temp+len(fld.loc[j,'fldnm'])
            fld.loc[j,'bgn'] = temp
            fld.loc[j,'bgc'] = temp2

    # order by postion and reset index
    fld = fld.sort_values(by ='bgc')
    fld.reset_index(drop=True, inplace = True)
    # establish end of content position. For the last one it is length general string
    for j in fld.index:
        if (fld.loc[j, 'bgn']>0) & (j < len(fld)-1):
            fld.loc[j, 'end'] = fld.loc[(j+1), 'bgn']
        fld.loc[len(fld)-1, 'end'] = len(AAm.loc[i, 'OmsBank'])

    # store start/end of relevant field content in AAm
    nm_om = fld[(fld['fldnm'] == ' Naam: ') | (fld['fldnm'] == ' Omschrijving: ')]
    nm_om.reset_index(drop=True, inplace = True)
    AAm.loc[i,'Naam'] = AAm.loc[i,'OmsBank'][nm_om.loc[0, 'bgc']:nm_om.loc[0, 'end']].strip()
    AAm.loc[i,'Oms'] = AAm.loc[i,'OmsBank'][nm_om.loc[1, 'bgc']:nm_om.loc[1, 'end']].strip()

    #reset values in fieldlist
    fld[['bgn', 'bgc', 'end']]=0

a best practice for (more complex) questions, is to provide a so called "reproducable" scenario, where people can copy and paste your dataframe + code in their IDE and run into the same problem. For you that would mean to include a dataset with `pd.DataFrame`, so people can instantly get the data in the correct format, also since that's quite important for your question. For example see one of my [questions](https://stackoverflow.com/questions/57774352/fill-in-same-amount-of-characters-where-other-column-is-nan) — Erfan, Mar 20 '20 at 11:22
@Erfan: it is quite a large set. My code works, but i think it is far from optimal. I will try to add a small dataframe to make it reproducable — Pieter de Wit, Mar 20 '20 at 11:29
Yes, the idea is to break down your problem to a small example dataset, not to include your whole dataset. I would also break down your problem in small parts, right now you posted your whole code, which will scare most people away, since they don't want to go through 80 lines of code to debug. — Erfan, Mar 20 '20 at 11:38
You gave a description for 2 rows. A simple way to produce a dataframe extract is `df.iloc[:10].to_dict(orient='list')`: it gives a copy-pastable content of the 10 first rows (adapt the slice to whatever is relevant) — Serge Ballesta, Mar 20 '20 at 13:54

Pieter de Wit · Answer 1 · 2022-02-04T11:46:41.707

I picked up my problem and discovered regex. You can search for all kind of (text)patterns. My pattern was ' fieldname: field value secondfieldname: second field value'. You can put these blocks in brackets and refer to them with group(number), where 0 is whole search string, 1 is first group etc.

One of the example rows is:

tosplit = 'SEPA Inc dl Incassant: blabla  Naam: Insurance    Machtiging: number  Omschrijving: Rel.nr. number2    IBAN: iban#'

Import regex and search for the pattern in the string to split and use brackets to divide them in groups:

import re    
splitted = re.search(r'( Naam: )(.*)( Machtiging: )', tosplit)

Name content can be easily obtained with

name = splitted.group(2)

split string no delimiter with limitative field names and content

1 Answers1