Scrapy beginner here. I am trying to scrape data from multiple pages. Each page contains 20 entries, and you click the next button to go to the next page. However, the URL does not change, because the link tag is:

<a href="#" onClick="nextPage(2);"> Click me! </a>

This page seems unusual because it does not appear to use XHR requests like many other examples I have found. A few other answers suggest monitoring the GET requests through the Chrome/Firefox developer tools and then replicating them. As far as I can tell, this site doesn't produce XHR requests, only a series of JavaScript requests (one of them AJAX).

I want to use Scrapy to move to the next page (preferably without Selenium, if possible) so I can continue scraping the data shown there.

This is the webpage for reference: http://www.australianschoolsdirectory.com.au/search-result.php

First time asker. Thank you in advance!

Matt Allan
    You need to make a [POST request](http://stackoverflow.com/questions/17625053/how-to-send-post-data-in-start-urls-of-the-scrapy-spider) and add `form-data` like this `pageNum : `. – vold May 20 '17 at 16:25

1 Answer

To get the next page you need to make a POST request and pass form data with `pageNum` as the key and the page number as the value. The following Scrapy shell session (`fetch` and `view` are Scrapy shell shortcuts) fetches the first 5 pages and opens each response in the browser:

>>> from scrapy.http import FormRequest
>>> url = 'http://www.australianschoolsdirectory.com.au/search-result.php'
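>>> for i in range(1, 6):
...     # Form values must be strings, hence str(i)
...     payload = {'pageNum': str(i)}
...     # FormRequest defaults to POST when formdata is given
...     r = FormRequest(url, formdata=payload)
...     fetch(r)
...     view(response)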
...
2017-05-20 21:52:22 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-20 21:52:25 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-20 21:52:28 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)

If you need all pages, simply change the 6 in `range(1, 6)` to 488.

vold
  • Thanks for this, worked great! (I can't upvote yet, sorry!) – Matt Allan May 21 '17 at 10:43
  • How did you find that the field name was 'pageNum'? – Matt Allan May 21 '17 at 11:10
  • In the `Network` tab, select the `Doc` filter and click next on the page. You will see the POST request. If you click on that request and scroll down, you can see `Form Data` with the `pageNum` key. You can use a separate tool to experiment with and test requests before writing the actual Scrapy code. I use [Postman](https://chrome.google.com/webstore/detail/postman/fhbjgbiflinjbdggehcddcbncdddomop) for debugging and testing requests. – vold May 21 '17 at 11:18
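
If you would rather check the request shape from Python than from a GUI tool, something like the following standard-library sketch may help. It only builds the POST request and prints what would be sent, so it can be compared with the `Form Data` shown in the Network tab; `build_page_request` is a helper name I made up, and the `Content-Type` header is my assumption about how the browser submits the form.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "http://www.australianschoolsdirectory.com.au/search-result.php"


def build_page_request(page_num):
    """Build (but do not send) the POST request for one result page."""
    # urlencode turns {'pageNum': '2'} into the form body 'pageNum=2'.
    body = urlencode({"pageNum": str(page_num)}).encode("ascii")
    return Request(
        URL,
        data=body,
        # Assumed encoding for a plain HTML form submission.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_page_request(2)
# Supplying data makes urllib treat the request as a POST.
print(req.get_method(), req.full_url)
print(req.data)
```

Passing the result to `urllib.request.urlopen(req)` would actually send it.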