I'm trying to find the best way to parse this type of string:

Operating Status: NOT AUTHORIZED Out of Service Date: None

I need the output to be like this:

['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']

Is there an easy way of doing this? I am parsing hundreds of strings like this. The text itself is not predictable, but it's always in the above format.

Other string examples:

MC/MX/FF Number(s): None  DUNS Number: -- 
Power Units: 1  Drivers: 1 

Expected Output:

['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']
Jan
user2353003
  • just an approach, try to keep a list for all of the key values by yourself and then proceed – sahasrara62 Oct 18 '19 at 08:19
  • The question doesn't appear to include any attempt at all to solve the problem. Please edit the question to show what you've tried, and show a specific roadblock you're running into with [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). For more information, please see [How to Ask](https://stackoverflow.com/help/how-to-ask). – Andreas Oct 18 '19 at 08:21
  • Sorry bro, but I can't find any pattern between the possible strings. so my oppinion is that it can't be solved without knowing all possible strings. – Sofien Oct 18 '19 at 08:27
  • You could split on :, but the problem is that there is no way to know if it should be ['Operating Status: NOT AUTHORIZED', 'Out of Service Data: None'] or ['Operating Status: NOT AUTHORIZED Out of Service', 'Data: None'] – Astrogat Oct 18 '19 at 08:29
  • Possible duplicate of [Splitting on last delimiter in Python string?](https://stackoverflow.com/questions/15012228/splitting-on-last-delimiter-in-python-string) – Muhammed Fasil Oct 18 '19 at 18:35

2 Answers

There are two ways. Both are kludgy and break easily if the input strings deviate from the expected format. However, you can modify the code to make it a little more flexible.

Both options depend on each block of interest in the line meeting these characteristics:

  1. It starts with a letter or slash, probably capitalized.
  2. That header is followed by a colon (":").
  3. The value is ONLY the first word after the colon, so a multi-word value such as "NOT AUTHORIZED" is cut after its first word.

Method 1: regex. This can only grab TWO blocks of data. The second group is "everything else", because I couldn't get the search pattern to repeat properly :P

code:

import re

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

# Raw strings keep the backslashes intact for the regex engine
pattern = ''.join([
    r"(",          # Start capturing group 1
    r"\s*[A-Z/]",  # Optional whitespace, then one capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the first colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # First word after the colon (letters, digits, underscore) plus trailing whitespace
    r")",          # End capturing group 1
    r"(.*)",       # Group 2: everything else
])

for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # The pattern defines only two groups
        print(m.group(1))
        print(m.group(2))

Output:

----------------
MC/MX/FF Number(s): None 
DUNS Number: -- 
----------------
Power Units: 1 
Drivers: 1 
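
The repetition that Method 1 is missing can be had by letting re.findall apply a smaller pattern over and over instead of capturing "everything else" in a second group. The following is only a sketch under the same three assumptions listed above; the names BLOCK_RE and split_blocks are made up for this illustration.

```python
import re

# One block = a capital letter or slash, then everything up to the next colon,
# then optional whitespace and exactly one whitespace-free "word" as the value.
BLOCK_RE = re.compile(r"[A-Z/][^:]*:\s*\S+")

def split_blocks(line):
    """Return every 'Header: value' block found in line (hypothetical helper)."""
    return BLOCK_RE.findall(line)

for s in ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']:
    print(split_blocks(s))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
# ['Power Units: 1', 'Drivers: 1']
```

Because the value is still limited to a single word, 'Operating Status: NOT AUTHORIZED Out of Service Date: None' comes back as ['Operating Status: NOT', 'AUTHORIZED Out of Service Date: None'], so this shares the one-word-value limitation.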

Method 2: parsing the string word by word. This method makes the same basic assumptions as the regex, but can handle more than two blocks of interest. It works like this:

  1. Parse each string word by word, adding each word to newstring.
  2. When a word contains a colon, set a flag.
  3. On the next loop iteration, add that first word after the colon to newstring. You could change this to take the first 2, 3, or n words if you wanted. You could also keep adding words after colonflag is set until some criterion is met, such as a capitalized word, although that would break on values like "None". You could go until a word that is ALL capitals, but then a header that is not all capitals would break it.
  4. Add newstring to newlist, reset the flag, and keep parsing words.

code:

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

for s in l:
    newlist = []
    newstring = ""
    colonflag = False  # True when the previous word ended a header
    for w in s.split():
        newstring += " " + w
        if colonflag:
            # w is the single value word; the block is complete
            newlist.append(newstring.strip())
            newstring = ""
            colonflag = False
        elif ":" in w:
            colonflag = True
    print(newlist)

Output:

['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']

Third option: Create a list of all the expected headers, like header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:", ] and have it split/parse based on those.
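
A rough sketch of this third option follows. It assumes the header list is complete and that no header text ever appears inside a value; HEADERS, HEADER_RE and split_on_headers are names invented for this example.

```python
import re

# Assumed, hand-maintained list of every header that can occur
HEADERS = [
    "Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):",
    "DUNS Number:", "Power Units:", "Drivers:",
]

# Try longer headers first so a header that is a prefix of another cannot shadow it
HEADER_RE = re.compile(
    "|".join(re.escape(h) for h in sorted(HEADERS, key=len, reverse=True))
)

def split_on_headers(line):
    """Cut line at the start of each known header (hypothetical helper)."""
    matches = list(HEADER_RE.finditer(line))
    blocks = []
    for i, m in enumerate(matches):
        # A block runs from this header to the next header, or to the end of the line
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        blocks.append(line[m.start():end].strip())
    return blocks

print(split_on_headers("Operating Status: NOT AUTHORIZED Out of Service Date: None"))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
```

Unlike the two methods above, this keeps multi-word values such as "NOT AUTHORIZED" intact, at the cost of maintaining the header list by hand. Any text before the first known header is dropped.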

Fourth option

Use Natural Language Processing and Machine Learning to actually figure out where the logical sentences are ;)

RightmireM

Have a look at pyparsing. It is perhaps the most 'natural' way to express combinations of words, detect the (grammatical) relations between them, and produce a structured result. There are plenty of tutorials and docs on the net.

You can install pyparsing using `pip install pyparsing`.

Parsing:

Operating Status: NOT AUTHORIZED Out of Service Date: None

would require something like:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#  test_pyparsing2.py
#
#  Copyright 2019 John Coppens <john@jcoppens.com>
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#  along with this program; if not, write to the Free Software
#  Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
#  MA 02110-1301, USA.
#
#

import pyparsing as pp

def create_parser():
    opstatus = pp.Keyword("Operating Status:")
    auth     = pp.Combine(pp.Optional(pp.Keyword("NOT"))) + pp.Keyword("AUTHORIZED")
    status   = pp.Keyword("Out of Service Date:")
    date     = pp.Keyword("None")

    part1    = pp.Group(opstatus + auth)
    part2    = pp.Group(status + date)

    return part1 + part2



def main(args):
    parser = create_parser()

    msg = "Operating Status: NOT AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))

    msg = "Operating Status: AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))

    return 0

if __name__ == '__main__':
    import sys
    sys.exit(main(sys.argv))

Running the program:

[['Operating Status:', 'NOT', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
[['Operating Status:', '', 'AUTHORIZED'], ['Out of Service Date:', 'None']]

Using Combine and Group, you can change how the output is organized.

jcoppens