I'm trying to scrape a really long web page with beautifulsoup4 and python3. Due to the size of the page, http.client raises an error when I try to search for something in it:

  File "/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/anaconda3/lib/python3.6/http/client.py", line 570, in _readall_chunked
    raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(16109 bytes read)

Is there any way to get around this error?

Niellles
Evan Hsueh
    The [`http.client`](https://docs.python.org/3/library/http.client.html) library is pretty low-level. I believe you can solve this by manually reading and assembling the chunks, but it's kind of a pain. If possible, it would be much easier to switch to `requests` if you can use a third-party library, or `urllib` if you can't. (In fact, it even says this right at the top of the docs…) Is there a reason you can't do that? – abarnert Jul 07 '18 at 20:39
    If you _do_ need to stick with `http.client`, and you want us to show how to fix your code, you're going to have to give us the relevant code (and ideally a [mcve], not just a snippet from a larger program that can't be run and debugged). – abarnert Jul 07 '18 at 20:40

1 Answer

As the docs for http.client tell you right at the top, this is a very low-level library, meant primarily to support urllib, and:

See also: The Requests package is recommended for a higher-level HTTP client interface.

If you can conda install requests or pip install requests, your problem becomes trivial:

import requests
from bs4 import BeautifulSoup

# requests handles the chunked transfer encoding for you
req = requests.get('https://www.worldcubeassociation.org/results/events.php?eventId=222&regionId=&years=&show=All%2BPersons&average=Average')
soup = BeautifulSoup(req.text, 'lxml')

If you can't install a third-party library, working around this is possible, but it is neither supported nor easy. None of the chunk-handling code in http.client is public or documented, but the docs do link you to the source, where you can see the private methods. In particular, notice that read calls a method named _readall_chunked, which loops, calling _get_chunk_left and passing each result to _safe_read. That _safe_read method is the code you'll need to replace (e.g., by subclassing HTTPResponse, or by monkeypatching it) to work around this problem. That probably isn't going to be nearly as easy or fun as just using a higher-level library.
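If replacing _safe_read is more than you want to take on, a cruder stopgap that stays within the standard library is to catch the exception and keep whatever was received. This is a different technique from the one described above, not a fix for it: the IncompleteRead exception carries the bytes read so far in its partial attribute, so the page you get back may be truncated. The helper names read_allowing_partial and fetch are made up for illustration; treat this as a minimal sketch rather than a robust client.

```python
import http.client


def read_allowing_partial(response):
    """Return the response body, or the partial body if the stream ends early."""
    try:
        return response.read()
    except http.client.IncompleteRead as exc:
        # exc.partial holds the bytes http.client assembled before giving up;
        # the document may be cut off, so parse it with that in mind.
        return exc.partial


def fetch(host, path):
    """Hypothetical helper: GET https://<host><path> and return the body bytes."""
    conn = http.client.HTTPSConnection(host, timeout=30)
    try:
        conn.request("GET", path)
        return read_allowing_partial(conn.getresponse())
    finally:
        conn.close()
```

The returned bytes can be passed straight to BeautifulSoup, which is generally tolerant of markup that stops partway through. If you need the complete page, this is not enough, and switching to a higher-level library remains the simpler route.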

abarnert