I'm working on a project where I have to read scanned images of financial statements. I used Tesseract 4 to convert each image to text output, which looks like this (here is a snippet):
REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000
I would like to break each line above into a list of three entries, where the first entry is the text and the second and third entries are the numbers. For example, the first row would look something like this:
[[REVENUE], [9,000,000], [900,000]]
I came across this Stack Overflow post where someone uses re.match() together with the .groups() method to find the pattern: How to split strings into text and number?
I'm new to regex and I'm struggling to understand the syntax and documentation. I'm working from a cheat sheet for now, but I'm having a tough time figuring out how to approach this. Any help would be appreciated.
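For reference, here is a rough sketch of the kind of split I have in mind, adapted from the re.match() / .groups() idea in that post. It assumes every row ends with exactly two amounts made up only of digits and commas, and that everything before them is the label. The names ROW_PATTERN and split_row are just ones I made up for illustration, and I'm not sure this is the right approach:

```python
import re

# Assumption: each row ends with exactly two amounts (digits and commas only);
# everything before those two amounts is treated as the label.
ROW_PATTERN = re.compile(r"^(.*?)\s+([\d,]+)\s+([\d,]+)\s*$")


def split_row(line):
    """Return [[label], [first amount], [second amount]], or None if no match."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    label, first, second = match.groups()
    return [[label], [first], [second]]


ocr_text = """REVENUE 9,000,000 900,000
COST OF SALES 900,000 900,000
GROSS PROFIT (90%; 2016 - 90%) 900,000 900,000"""

# The amounts stay as strings here; converting them to int would be a separate step.
rows = [split_row(line) for line in ocr_text.splitlines()]
for row in rows:
    print(row)
```

Anchoring the pattern at the end of the line is what keeps the "2016" and "90" inside the GROSS PROFIT label from being picked up as amounts. It would not handle negative amounts written in parentheses, missing columns, or OCR noise inside the numbers.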