
I am trying to write a program that reads text from screenshots and then identifies various kinds of PII in it. I use pytesseract to read the text, and I am trying to write regexes for URLs, email IDs, etc. Here is an example of a function that takes a string and returns True for email IDs and False otherwise:

import re

def email_regex(text):
    pattern = re.compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))

This function works well for all email IDs in a proper format (abc@xyz.dd), but since the input to the function is text read by pytesseract, the text is not guaranteed to be properly formatted. My function returns False for abc@xyzdd. I'm running into the same issues with my URL regex, domain name regex, etc. Is there a way to make my regular expressions more robust to pytesseract's reading errors?

I have tried following the accepted solution to this answer, but that leads to the regex functions returning True for random words as well. Any help to resolve this would be greatly appreciated.

EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these would be very useful.

def domain_regex(text):
    pattern = re.compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
    return pattern.match(text)


def url_regex(text):
    pattern = re.compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)
Gustav Rasmussen
  • Where do you want to draw the line? Isn't something like `.+@[^@]+` sufficient, matching everything that has exactly one @ symbol in the middle? If your OCR fails to recognize the @ symbol correctly, there's little hope that the result is easily recognizable as an email address. – Thomas May 25 '20 at 12:22
  • Something like `abc@xyzcom` should be acceptable. A missing @ is going to be difficult to manage, as you pointed out. Similarly, `httyis://www.facebook.com` should be acceptable to url_regex. – WitchKingofAngmar May 25 '20 at 12:25
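
One way to read the permissive suggestion above is as a token-level filter: split the OCR text on whitespace and keep any token that has exactly one @ with something on both sides. The sketch below assumes that reading. `LOOSE_EMAIL` and `find_email_candidates` are hypothetical names, and the pattern is deliberately stricter than `.+@[^@]+` about extra @ symbols.

```python
import re

# Assumption: a token is a candidate if it contains exactly one "@"
# with at least one non-space, non-"@" character on each side.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+")


def find_email_candidates(ocr_text):
    """Return whitespace-separated tokens that look roughly like email IDs."""
    return [tok for tok in ocr_text.split() if LOOSE_EMAIL.fullmatch(tok)]


# Hypothetical OCR output, used only for illustration.
print(find_email_candidates("contact abc@xyzcom or a@b@c today"))
```

A filter this loose accepts `abc@xyzcom`, but it will also accept things like social media handles glued to other text, so it probably works best as a first pass before a stricter check.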

1 Answer


Perhaps adding some flags would help, such as IGNORECASE and DOTALL (for newlines). Note that multiple flags are combined with `|`:

# Match email ID:
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$", re.I | re.S)
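
As a rough illustration of what the flags change, the sketch below compiles the email pattern above twice, once without flags and once with `re.I | re.S`, and runs both on a few sample strings. The names `EMAIL`, `strict_case` and `any_case` and the sample strings are made up for this example.

```python
import re

# The email pattern from this answer, kept as a plain string so it can be
# compiled with different flags.
EMAIL = r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$"

# re.compile takes a single flags argument; several flags are combined with "|".
strict_case = re.compile(EMAIL)
any_case = re.compile(EMAIL, re.I | re.S)

# Hypothetical OCR outputs, used only for illustration.
for sample in ["abc@xyz.dd", "aBc@xyZ.dD", "abc@xyzdd", "hello"]:
    print(sample, bool(strict_case.match(sample)), bool(any_case.match(sample)))
```

`re.S` only changes what an unescaped `.` matches, and this pattern has none outside a character class, so the flag that makes a difference here is `re.I`.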

Match URLs:

https://gist.github.com/gruber/8891611

Gustav Rasmussen
  • Can you help me with an example here, where adding a flag takes care of the case I've outlined in the body of the question? – WitchKingofAngmar May 25 '20 at 12:28
  • Yes: the character sets currently check only lower-case letters (a-z), which won't match addresses like aBc@xyZ.dD, right? – Gustav Rasmussen May 25 '20 at 12:31
  • So adding the re.I flag will give more robustness. – Gustav Rasmussen May 25 '20 at 12:31
  • That's true. Thanks for the tip. Is there anything I can do to make the regex I've posted in the question more robust against edge cases like the ones I've described? – WitchKingofAngmar May 25 '20 at 12:34
  • The False return value for the "abc@xyzdd" email ID could be solved by making the dot optional: put the quantifier "?" (zero or one occurrences) right after the \. (escaped dot). – Gustav Rasmussen May 25 '20 at 12:38
  • This was helpful. Can you edit your answer so that I can select it as the accepted answer? Also, if you could look at the updated question body and the other functions as well, I'd be grateful. – WitchKingofAngmar May 25 '20 at 12:47
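
The optional-dot suggestion from the comments, applied to the email pattern in the question, might look like the sketch below. It assumes the curly quote in the question's character class was meant to be a backtick. The names `LOCAL`, `LABEL`, `STRICT` and `LENIENT` are made up here to keep the pattern readable.

```python
import re

# Building blocks of the question's email pattern (names are hypothetical).
LOCAL = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

# Original behaviour: every domain label must be followed by a literal dot.
STRICT = re.compile(rf"\A{LOCAL}@(?:{LABEL}\.)+{LABEL}", re.IGNORECASE)

# Lenient variant: "\.?" makes the dot optional, so a dot dropped by OCR
# no longer prevents a match.
LENIENT = re.compile(rf"\A{LOCAL}@(?:{LABEL}\.?)+{LABEL}", re.IGNORECASE)

# Hypothetical OCR outputs, used only for illustration.
for sample in ["abc@xyz.dd", "abc@xyzdd", "hello"]:
    print(sample, bool(STRICT.match(sample)), bool(LENIENT.match(sample)))
```

The trade-off is more false positives: with the dot optional, any two or more alphanumeric characters after the @ are accepted as a domain.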