I am working on a project in which I need to extract invoice numbers from email bodies. The invoice number could appear anywhere in the mail body, which I am trying to search using Python. The problem is that the email senders do not use a standard keyword; they use a variety of words to introduce invoice numbers, e.g. Invoice Number, invoice#, inv no., invoice no., inv-no, etc.
This inconsistency makes it difficult for me to extract the invoice number from the mail body since there is no specific keyword.
After reading hundreds of emails I have identified the most common words used before invoice numbers and created a list of them (around 15 keywords). However, I am not able to search the string for that list of keywords and retrieve the word that follows each match, which should be the invoice number. The invoice number can also be either numeric or alphanumeric, which adds more complexity.
I have made some progress, shown below, but I am not getting the desired output.
inv_list = ['invoice number','inv no','invoice#','invoice','invoices','inv number','invoice-number','inv-number','inv#','invoice no.'] # list of keywords used before invoice number
example_string = '''Hi Team, Could you please confirm the status of payment
for invoice# 12345678 and AP-8765432?
Also, please confirm the status of existing invoice no. 7652908.
Thanks'''
# Basic code to test if any word from inv_list exists in example_string
for item in inv_list:
    if item in example_string:
        print(item)
# gives the output like
invoice#
invoice no.
Next, after searching for a couple of hours, I found a function in the question "how to get a list with words that are next to a specific word in a string in python", but I am not able to adapt it to a list of words. I tried:
def get_next_words(mailbody, invoice_text_list, sep=' '):
    mail_body_words = mailbody.split(sep)
    for word in invoice_text_list:
        if word in mail_body_words:
            yield next(mail_body_words)

words = get_next_words(example_string, inv_list)
for w in words:
    print(w)
and getting
TypeError: 'list' object is not an iterator
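As far as I can tell, the error comes from calling next() on the plain list returned by split(), since next() expects an iterator. A minimal reproduction, using a made-up two-word list purely for illustration:

```python
# Hypothetical two-word list standing in for mailbody.split(sep)
mail_body_words = ['invoice#', '12345678']

try:
    next(mail_body_words)          # a list is iterable, but not an iterator
except TypeError as exc:
    message = str(exc)

print(message)

# Wrapping the list with iter() gives an object that next() accepts
word_iter = iter(mail_body_words)
first = next(word_iter)
second = next(word_iter)
print(first, second)
```

Even with iter(), next() would only return words from the start of the list, not the word after the matched keyword, so this alone does not solve the problem.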
The expected output is the word in 'example_string' that immediately follows any keyword from 'inv_list' (I am assuming that I can identify the invoice number from the returned word).
For the given example the output should be:
Match1: 'invoice#'
Expected Output: '12345678'
Match2: 'invoice no.'
Expected Output: '7652908'
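One direction that might produce this output is a regular expression built from the keyword list. The following is only a sketch under some assumptions: the function name find_invoice_numbers is made up, the invoice number is assumed to be a single token of letters, digits and hyphens containing at least one digit, and matching is case-insensitive. Longer keywords are tried first so that 'invoice no.' wins over plain 'invoice'.

```python
import re

inv_list = ['invoice number', 'inv no', 'invoice#', 'invoice', 'invoices',
            'inv number', 'invoice-number', 'inv-number', 'inv#', 'invoice no.']

example_string = '''Hi Team, Could you please confirm the status of payment
for invoice# 12345678 and AP-8765432?
Also, please confirm the status of existing invoice no. 7652908.
Thanks'''


def find_invoice_numbers(mailbody, keywords):
    """Return (keyword, token) pairs where token directly follows a keyword."""
    # Longest keywords first, so the alternation prefers the most specific one
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(re.escape(k) for k in ordered)
    # keyword, optional ':' or '-' separator, then a token with at least one digit
    pattern = re.compile(
        r'\b(' + alternation + r')\s*[:\-]?\s*([A-Za-z0-9-]*\d[A-Za-z0-9-]*)',
        re.IGNORECASE,
    )
    return [(m.group(1), m.group(2)) for m in pattern.finditer(mailbody)]


for keyword, number in find_invoice_numbers(example_string, inv_list):
    print(keyword, '->', number)
# invoice# -> 12345678
# invoice no. -> 7652908
```

Note that AP-8765432 is not picked up by this sketch, because it follows "and" rather than one of the keywords; handling lists of numbers after a single keyword would need extra logic.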
Please let me know if further details are required. Any help is appreciated!