There are two ways. Both are super kludgy and depend on the original string varying very little. However, you can modify the code to offer a little more flexibility.
Both options depend on each line meeting these characteristics.
Each grouping of interest must...
- Start with a letter or slash, probably capitalized
- Have a title that is followed by a colon (":")
- Have a value that is ONLY the first word after the colon
Method 1, regex. This can only grab TWO blocks of data. The second group is "everything else" because I can't get the search pattern to repeat properly :P
code:
import re

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

pattern = ''.join([
    r"(",          # Start capturing group 1
    r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], plus trailing whitespace
    r")",          # End capturing group 1
    r"(.*)",       # Group 2: everything else
])

for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # re.search returns None when the line does not match
        print(m.group(1))
        print(m.group(2))
Output:
----------------
MC/MX/FF Number(s): None
DUNS Number: --
----------------
Power Units: 1
Drivers: 1
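If you do want the pattern to repeat, one possible approach is to drop the "everything else" group and let re.findall walk the string instead. This is only a minimal sketch, not a tested replacement for the code above: the pattern and the extract_fields name are made up for illustration, and it assumes a header never contains a colon and a value is a single run of non-whitespace.

```python
import re

# Hypothetical pattern (an assumption, not the pattern from Method 1):
#   [A-Z/]   header starts with a capital letter or a forward slash
#   [^:]*:   everything up to and including the first colon
#   \s*\S+   optional whitespace, then the first "word" of the value
FIELD = re.compile(r"[A-Z/][^:]*:\s*\S+")

def extract_fields(line):
    """Return every 'Header: value' block found in line."""
    return FIELD.findall(line)

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']
for s in l:
    print(extract_fields(s))
```

Using \S+ instead of \w+ for the value lets it pick up values like "--", which \w would not match.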
Method two, parsing the string word by word. This method has the same basic requirements as the regex, but can handle more than two blocks of interest. It works like this...
- Start parsing each string word by word, loading each word into newstring.
- When it hits a colon, set a flag.
- Add the first word from the next loop iteration to newstring. You could change this to the first 2, 3, or n words if you wanted. You could also have it keep adding words after colonflag is set until some criterion is met, like a word with a capital... although that could break on values like "None". You could go until you hit a word that is ALL capitals, but then a header that isn't all capitals would break it.
- Add newstring to newlist, reset the flag, and keep parsing words.
code:
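l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

for s in l:
    newlist = []
    newstring = ""
    colonflag = False  # True once the previous word contained a colon
    for w in s.split():
        newstring += " " + w
        if colonflag:
            # w is the first word after the colon, so this block is complete
            newlist.append(newstring)
            newstring = ""
            colonflag = False
        if ":" in w:
            colonflag = True
    print(newlist)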
Output:
[' MC/MX/FF Number(s): None', ' DUNS Number: --']
[' Power Units: 1', ' Drivers: 1']
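The "first n words" variation mentioned in the steps could look something like the following. Treat it as a rough sketch: split_fields and n_words are names invented here, it joins the words without the leading space, and it assumes every value is exactly n_words long (a trailing header with too few value words is silently dropped).

```python
def split_fields(line, n_words=1):
    """Split line into 'Header: value' blocks, keeping n_words after each colon."""
    fields = []
    current = []    # words collected for the block being built
    remaining = 0   # value words still to collect; 0 means we are reading a header
    for word in line.split():
        current.append(word)
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                # Collected all the value words, so the block is complete
                fields.append(" ".join(current))
                current = []
        elif ":" in word:
            # End of the header; start counting value words
            remaining = n_words
    return fields

print(split_fields('MC/MX/FF Number(s): None DUNS Number: -- '))
print(split_fields('Date: Jan 2020 Units: 1 2', n_words=2))
```

The second call uses a made-up input line purely to show the n_words=2 case.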
Third option:
Create a list of all the expected headers, like header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:"]
and have the code split/parse the string based on those.
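A rough sketch of the header-list idea, assuming the headers appear in the string exactly as written in header_list. The split_on_headers function is hypothetical and is just one way to do the split: it finds each known header with a regex alternation and takes the text up to the next header as the value.

```python
import re

header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):",
               "DUNS Number:", "Power Units:", "Drivers:"]

def split_on_headers(line, headers=header_list):
    """Map each known header found in line to the text that follows it."""
    # Longest headers first, so a long header wins over a shorter one it starts with
    ordered = sorted(headers, key=len, reverse=True)
    alternation = "|".join(re.escape(h) for h in ordered)
    matches = list(re.finditer(alternation, line))
    result = {}
    for i, m in enumerate(matches):
        # The value runs from the end of this header to the start of the next one
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        result[m.group()] = line[m.end():end].strip()
    return result

print(split_on_headers('MC/MX/FF Number(s): None DUNS Number: -- '))
print(split_on_headers('Power Units: 1 Drivers: 1 '))
```

Unlike the first two methods, this does not care how many words a value has, but it only finds headers you listed up front.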
Fourth option:
Use Natural Language Processing and Machine Learning to actually figure out where the logical sentences are ;)