0

I am interested in extracting the 'Company Name' column from this link: https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf

I was able to achieve something similar with the solution from "How do I decode text from a pdf online with Requests?"

However, how would I go about extracting only the company name column, since that solution returns all of the text in an unstructured format? Thanks in advance; I am fairly new to Python and having difficulties.

Xin

2 Answers

1

There is a Python library named tabula-py. You can install it using "pip install tabula-py" and use it as follows:

import tabula

file = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"

tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)

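You can also convert the tables to a CSV file. Passing pages="all" covers the whole document rather than only the first page:

tabula.convert_into(file, "table.csv", output_format="csv", pages="all")

Then you can use the csv library to read only the columns you want.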
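As a minimal sketch of that last step, the function below pulls a single column out of the CSV rows with the standard csv module. The helper name extract_column and the header text "Company Name" are assumptions: the header that tabula actually writes may be spelled differently, so check the first line of table.csv first.

```python
import csv


def extract_column(lines, column_name="Company Name"):
    """Return the non-empty values of one column from CSV text lines.

    `lines` is any iterable of CSV lines, for example an open file.
    `column_name` is an assumed header; adjust it to match the real CSV.
    """
    values = []
    for row in csv.DictReader(lines):
        value = (row.get(column_name) or "").strip()
        # Skip blank cells and header rows repeated on later pages
        if value and value != column_name:
            values.append(value)
    return values


# Usage, assuming table.csv was written by tabula.convert_into:
# with open("table.csv", newline="", encoding="utf-8") as f:
#     print(extract_column(f))
```

If the column is missing from the file, the function returns an empty list rather than raising an error.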

dewDevil
  • Thanks for the reply. I tried the code and got this error: HTTPError: HTTP Error 403: Forbidden – Xin May 15 '20 at 21:53
1

You get the error because the server appears to be blocking automated requests. I don't fully understand it either, but I found a workaround: download the file locally first and then use tabula to read it, like so:

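import requests
from tabula import read_pdf

url = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the server refuses this request too

# Save the PDF locally so tabula reads it from disk instead of the URL
with open("data.pdf", "wb") as f:
    f.write(r.content)

tables = read_pdf("data.pdf", pages="all", multiple_tables=True)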

You may then get the following error:

tabula.errors.JavaNotFoundError: `java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`

To fix it, follow the steps from this thread: "`java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`". After that, everything should work.
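Once read_pdf succeeds, tables is a list of pandas DataFrames, one per detected table. A rough sketch of getting just the company names out of that list follows. The helper name company_names and the header text "Company Name" are assumptions, since tabula may extract the header with different spelling or spacing; print tables[0].columns to check.

```python
import pandas as pd


def company_names(tables, column="Company Name"):
    """Collect one column from a list of DataFrames.

    Tables that do not contain `column` (an assumed header name) are skipped.
    """
    parts = [t[column] for t in tables if column in t.columns]
    if not parts:
        return []
    names = pd.concat(parts, ignore_index=True).dropna().astype(str).str.strip()
    # Drop empty cells and header rows that were picked up as data
    names = names[(names != "") & (names != column)]
    return names.tolist()


# Usage, assuming `tables` comes from read_pdf("data.pdf", pages="all", multiple_tables=True):
# print(company_names(tables))
```

Skipping tables without the column keeps the function from failing on pages where tabula detects a differently shaped table.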

Computeshorts