
I develop websites. Sometimes clients already have a site that needs a complete revamp, but most of the content and images need to stay the same. I'm looking for software, even a paid or desktop application, that will let me enter a URL and scrape all of the content to a designated folder on my local machine. Any help would be much appreciated.

cklingdesigns
    File > Save As… > [ Website, Complete ] — It won't get you every page, but it'll get you all of the assets on the current page. – coreyward Apr 25 '11 at 15:00
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Apr 25 '11 at 17:09
  • possible duplicate of [Save Full Webpage](http://stackoverflow.com/questions/1722433/save-full-webpage) – Gordon Apr 25 '11 at 17:10

8 Answers


HTTrack will work just fine for you. It is an offline browser that pulls down websites, and you can configure it however you wish. It obviously won't pull down the PHP, since PHP is server-side code; the only things you can pull down are the HTML, JavaScript, and any images pushed to the browser.
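
For example, assuming the command-line build of HTTrack is installed (the URL and output path below are placeholders, not anything from the question), a basic mirror might look like this:

    httrack "http://www.example.com/" -O "/path/to/local/folder" "+*.example.com/*" -v

The -O option sets the local output directory, and the "+*.example.com/*" filter keeps the crawl on the original domain.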

k to the z
file_put_contents('/some/directory/scrape_content.html', file_get_contents('http://google.com'));
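
If you also need the referenced images copied down, a rough extension of the same idea (the URL and target directory are placeholders, and error handling is omitted) could look like this:

    // Grab the page markup and save it locally
    $url  = 'http://www.example.com/';
    $html = file_get_contents($url);
    file_put_contents('/some/directory/scrape_content.html', $html);

    // Parse the markup and copy each referenced image alongside it
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Naive approach: assumes the src attributes are absolute URLs
        file_put_contents('/some/directory/' . basename($src), file_get_contents($src));
    }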

Save your money for charity.

John Cartwright

By content, do you mean the entire page contents? If so, you can just "Save As…" the whole page, which includes most of the embedded media.

Firefox, under Tools -> Page Info -> Media, includes a listing of every piece of media on the page, and you can download each one.

Tony Lukasavage

Don't bother with PHP for something like this. You can use wget to grab an entire site trivially. Be aware, though, that it won't parse things like CSS for you, so it won't grab any files referenced via (say) background-image: url('/images/pic.jpg'), but it will snag most everything else.
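
As a rough sketch, assuming GNU wget (the URL and output directory are placeholders), a mirroring command could look like this:

    wget --mirror --convert-links --page-requisites --no-parent -P ./site-copy http://www.example.com/

--mirror turns on recursion, --page-requisites pulls the images, scripts, and stylesheets each page needs, and --convert-links rewrites the links so the copy can be browsed locally.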

Marc B

This class can help you scrape the content: http://simplehtmldom.sourceforge.net/
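
As a quick sketch of how the library is used (based on its documented file_get_html() and find() helpers; the URL is a placeholder), something like this pulls out every image source on a page:

    // simple_html_dom.php is the file downloaded from the link above
    include 'simple_html_dom.php';

    // Load the remote page into a DOM-like object
    $html = file_get_html('http://www.example.com/');

    // Print the src attribute of every image on the page
    foreach ($html->find('img') as $img) {
        echo $img->src . "\n";
    }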

Klaus S.
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Apr 25 '11 at 17:10
  • Thanks for the suggestions, Gordon. Really good. :D – Klaus S. Apr 26 '11 at 21:31
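
To illustrate the DOM-based route Gordon mentions in the comment above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than a third-party library (the URL is a placeholder and error handling is omitted):

    // Fetch the page and load it into PHP's native HTML parser
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents('http://www.example.com/'));

    // Use XPath to list every link target on the page
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a/@href') as $href) {
        echo $href->nodeValue . "\n";
    }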

You can scrape websites with http://scrapy.org and get the content you want.

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


You can also achieve this with the browser's save option: in Firefox, go to File -> Save Page As, and the page, along with all of its images and JS, will be saved into one folder.

jimy

I started using HTTrack a couple of years ago and I'm happy with it. It seems to go out of its way to get pages I wouldn't even see on my own.

Pete Wilson