1

I'm working on a project where I have to read scanned images of financial statements. I used Tesseract 4 to convert the images to text; here is a snippet of the output:

REVENUE 9,000,000 900,000

COST OF SALES 900,000 900,000

GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000

I would like to break the above into a list of three entries, where the first entry is the text, then the second and third entries would be the numbers. For example the first row would look something like this:

[[REVENUE], [9,000,000], [9,000,000]]

I came across this Stack Overflow post, where someone uses re.match() together with the .groups() method to find the pattern: How to split strings into text and number?

I'm new to regex and I'm struggling to understand the syntax and documentation. I'm using a cheat sheet for now, but I'm having a tough time figuring out how to approach this. Please help.

Sam
  • The posted question does not appear to include [any attempt](https://idownvotedbecau.se/noattempt/) at all to solve the problem. StackOverflow expects you to [try to solve your own problem first](https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users), as your attempts help us to better understand what you want. Please edit the question to show what you've tried, so as to illustrate a specific roadblock you're running into in a [MCVE]. For more information, please see [ask] and take the [tour]. – CertainPerformance Dec 07 '18 at 04:49
  • Is this what you want? `r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+)"`. You need to use `re.findall()`. – KC. Dec 07 '18 at 05:46

3 Answers

1

Regexes are overkill for this problem as you've stated it.

text.split() and a join of the items before the last two is better suited to this.

lines = [ "REVENUE 9,000,000 900,000",
          "COST OF SALES 900,000 900,000",
          "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000" ]
out = []
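for line in lines:
    parts = line.split()
    if len(parts) < 3:
        # Need at least a one-word label and two numbers.
        raise ValueError(f"expected a label and two numbers: {line!r}")
    if len(parts) == 3:
        out.append(parts)
    else:
        # Everything before the last two items is the label.
        out.append([' '.join(parts[0:len(parts)-2]), parts[-2], parts[-1]])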

out will contain

 [['REVENUE', '9,000,000', '900,000'], 
  ['COST OF SALES', '900,000', '900,000'], 
  ['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']]

If the label text needs further extraction, you could use regexes, or you could simply look at the items in parts[0:len(parts)-2] and process them based on the words and numbers there.
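The split-from-the-right idea can also be sketched with str.rsplit(), which keeps the label in one piece without a join. This is only a sketch under the same assumption as above (the last two whitespace-separated items are the numbers); the function name split_statement_line is made up for illustration.

```python
def split_statement_line(line):
    """Split an OCR'd line into [label, number, number].

    Assumes the last two whitespace-separated items are the numbers
    and everything before them is the label.
    """
    # Split at most twice, starting from the right-hand side.
    parts = line.strip().rsplit(maxsplit=2)
    if len(parts) < 3:
        raise ValueError(f"expected a label and two numbers: {line!r}")
    return parts


rows = [split_statement_line(line) for line in [
    "REVENUE 9,000,000 900,000",
    "GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000",
]]
```

Here rows holds the same kind of nested list as out, with the label of the second row kept intact as 'GROSS PROFIT (90%; 2016 - 90%)'.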

Joe McMahon
1

I wrote this regex based on your first expected output, but I am not sure what output you want for your third line.

  1. ([A-Za-z ]+)(?=\d|\S) matches the name (letters and spaces) up to the first digit or symbol.
  2. .*? lazily skips the text we do not care about.
  3. ([\d,]+)\s([\d,]+|(?=-\n|-$)) matches one or two groups of numbers. If there is only one number, the second group is empty, and the number must be followed by whitespace and a hyphen that comes right before a newline or the end of the text.

Test code (edited):

import re

regex = r"([A-Za-z ]+)(?=\d|\S).*?([\d,]+)\s([\d,]+|(?=-\n|-$))"

text = """
REVENUE 9,000,000 900,000

COST OF SALES 900,000 900,000

GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000

Business taxes 999 -
"""

print(re.findall(regex,text))
# [('REVENUE ', '9,000,000', '900,000'), ('COST OF SALES ', '900,000', '900,000'), ('GROSS PROFIT ', '900,000', '900,000'), ('Business taxes ', '999', '')]
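If negative figures are printed in parentheses, as in (9,000,000), and a blank figure appears as a lone hyphen, one possible variation is to treat each line separately and anchor the two amounts at the end of the line. The following is only a sketch under those assumptions; AMOUNT, ROW, split_row and to_int are names invented for this example, not part of the regex above.

```python
import re

# Assumption: an amount is digits and commas, optionally wrapped in
# parentheses (a negative), or a lone hyphen (a blank figure).
AMOUNT = r"\([\d,]+\)|[\d,]+|-"

# Label (matched lazily), then exactly two amounts at the end of the line.
ROW = re.compile(rf"^(.+?)\s+({AMOUNT})\s+({AMOUNT})$")


def split_row(line):
    """Return [label, amount, amount], or None if the line does not fit."""
    m = ROW.match(line.strip())
    return list(m.groups()) if m else None


def to_int(amount):
    """Convert '9,000' to 9000, '(9,000)' to -9000 and '-' to None."""
    if amount == "-":
        return None
    value = int(amount.strip("()").replace(",", ""))
    return -value if amount.startswith("(") else value


print(split_row("Business taxes 999 -"))
# ['Business taxes', '999', '-']
print(split_row("GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"))
# ['GROSS PROFIT (90%; 2016 - 90%)', '900,000', '900,000']
```

Because the amounts are anchored with $, the parenthesised percentage inside the label is not mistaken for a negative figure. Lines with more or fewer than two trailing amounts return None and would need their own handling.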
KC.
  • How do we account for negative numbers? On financial documents they are displayed like this: (9,000,000) – Sam Dec 07 '18 at 15:17
  • Also, how would you account for a case like `Business taxes 999 -`? Basically I'm looking to have the output as [[business taxes], [999], [-]], where the hyphen is a blank number. – Sam Dec 07 '18 at 15:40
  • Add a special case: after the number there is a hyphen followed by either the end of the text or a `\n`. – KC. Dec 07 '18 at 16:24
0

To detect the string

rev_str = "[[REVENUE], [9,000,000], [9,000,000]]"

and extract the values

("REVENUE", "9,000,000", "9,000,000")

you would do

import re
x = re.match(r"\[\[([A-Z]+)\], \[([0-9,]+)\], \[([0-9,]+)\]\]", rev_str)
x.groups()
# ('REVENUE', '9,000,000', '9,000,000')

Let's unpack this big ol' string.

  • Square brackets define a character class, i.e. a set of characters to match. For example, [A-Z] matches any single letter from A to Z, whereas [0-9,] matches any digit 0 through 9 or the character ,. The - here is an operator used inside square brackets to denote a range of characters that we want.
  • The + operator means one or more occurrences of whatever immediately precedes it. For example, the expression [A-Z]+ matches one or more of the letters A through Z. You can use the * operator instead to match zero or more occurrences of whatever precedes it.
  • The round brackets (i.e. parentheses) signify a group to be extracted from the regex. Whenever the pattern matches, the text matched by each parenthesised expression is captured as a group. For example, ([A-Z]+) means to look for one or more of the letters A through Z, and then save whatever that turns out to be. We access this by calling x.groups() after assigning the result of the regex match to a variable x.
  • Otherwise, it's straightforward - the rest is literal text accommodating the pattern [[TEXT], [NUMBER], [NUMBER]]. The square brackets are escaped with the \ character, because we want to interpret them literally, rather than as a character class.
  • Overall, the re.match() function checks whether the pattern matches at the beginning of rev_str (use re.search() to find a match anywhere in the string), keeps track of the groups within that match, and returns a match object; calling x.groups() on it returns those groups. If the pattern does not match, re.match() returns None.
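The pieces described in the list above can be tried one at a time. This is a small sketch on a made-up string, not on rev_str; the variable names are only for illustration.

```python
import re

text = "REVENUE 9,000,000"

# [A-Z]+ : one or more characters from the set A..Z.
label = re.match(r"[A-Z]+", text).group()            # 'REVENUE'

# re.match() only tries the start of the string, so a number pattern
# finds nothing there; re.search() scans forward for the first match.
at_start = re.match(r"[0-9,]+", text)                # None
anywhere = re.search(r"[0-9,]+", text).group()       # '9,000,000'

# "*" allows zero occurrences, so this matches an empty string at position 0.
empty = re.match(r"[0-9,]*", text).group()           # ''

# Parentheses capture; groups() returns the captured pieces in order.
pieces = re.match(r"([A-Z]+) ([0-9,]+)", text).groups()
# ('REVENUE', '9,000,000')
```

Since re.match() returns None when nothing matches at the start, it is worth checking the result before calling .group() or .groups() on real OCR output.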

This is a fairly simple example, but you've gotta start somewhere, right? You should be able to use this as a starting point for building a more complicated regular expression to process more of your text.

Green Cloak Guy