I'm trying to write a Python script that finds specific words in PDF files. Right now I have to scroll through the output to find the lines where the words occur.

I want only the lines containing the words to be printed or saved to a separate file.

# import packages
import PyPDF2
import re

# open the pdf file
object = PyPDF2.PdfFileReader("Filename.pdf")

# get number of pages
NumPages = object.getNumPages()

# define keyterms
Strings = "House|Property|street"

# extract text and do the search
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i)) 
    Text = PageObj.extractText() 
    # print(Text)
    ResSearch = re.search(Strings, Text)
    print(ResSearch)

When I run the code above, I have to scroll through the output to find the lines where the words occur. I would like the lines containing the words to be printed or saved to a separate file, or the pages containing those lines to be saved to a separate PDF or TXT file. Thanks in advance for the help.

Michael
  • Welcome to SO! Is it correct that your problem is not specific to text from PDFs? I.e., is it just about finding the lines that match your search? – Pieter Oct 30 '19 at 20:42
  • That's right. I want a way to print only the lines containing the words instead of having to scroll through the output. – Michael Oct 31 '19 at 08:36

1 Answer

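You could split the text of each page into lines and test each line with re.search. Note that re.match only matches at the start of the string, so it would miss a word that appears later in the line.

As an example:

for i in range(0, NumPages):
    page = object.getPage(i)
    text = page.extractText()
    for line in text.splitlines():
        # re.search finds the term anywhere in the line, not just at its start
        if re.search('House|Property|street', line):
            print(line)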
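
To also cover the "save to a separate file" part of the question, the same idea could be wrapped in two small helpers. This is only a sketch: find_matching_lines, save_hits and the "page N: line" output format are made-up names and choices for illustration, not part of PyPDF2, and the helpers work on a plain list of page texts so they do not depend on how the text was extracted.

```python
import re
from pathlib import Path

# Same key terms as in the question (case-sensitive, as written there).
PATTERN = re.compile(r"House|Property|street")


def find_matching_lines(page_texts, pattern=PATTERN):
    """Return (page_number, line) pairs for every line containing a match.

    page_texts is a list of strings, one per page (hypothetical input shape).
    Page numbers are 0-based, like the loop index in the question.
    """
    hits = []
    for page_number, text in enumerate(page_texts):
        for line in text.splitlines():
            # search() looks anywhere in the line, unlike match()
            if pattern.search(line):
                hits.append((page_number, line.strip()))
    return hits


def save_hits(hits, out_path):
    """Write one 'page N: line' entry per hit to a UTF-8 text file."""
    content = "".join(f"page {page_number}: {line}\n" for page_number, line in hits)
    Path(out_path).write_text(content, encoding="utf-8")
```

With the reader from the question, the list of page texts could be built as [object.getPage(i).extractText() for i in range(NumPages)] and passed to find_matching_lines; save_hits(hits, "matches.txt") would then write the matching lines to a file of your choosing.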
Pieter