
I develop websites. Sometimes clients already have a site that needs a complete revamp, but most of the content and images need to stay the same. I'm looking for software, even a paid or desktop application, that will let me enter a URL and scrape all of the content to a designated folder on my local machine. Any help would be much appreciated.

cklingdesigns
    File > Save As… > [ Website, Complete ] — It won't get you every page, but it'll get you all of the assets on the current page. – coreyward Apr 25 '11 at 15:00
  • possible duplicate of [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html) – Gordon Apr 25 '11 at 17:09
  • possible duplicate of [Save Full Webpage](http://stackoverflow.com/questions/1722433/save-full-webpage) – Gordon Apr 25 '11 at 17:10

8 Answers


HTTrack will work just fine for you. It is an offline browser that pulls down websites, and you can configure it however you wish. It obviously won't pull down the PHP, since PHP is server-side code; the only things you can pull down are the HTML, JavaScript, and any images pushed to the browser.
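
For example, assuming the command-line build of HTTrack is installed (the URL and output path below are placeholders, not anything from the question), a basic mirror might look like this:

    httrack "http://www.example.com/" -O "/path/to/local/folder" "+*.example.com/*" -v

The -O option sets the local output directory, and the "+*.example.com/*" filter keeps the crawl on the original domain.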

k to the z
file_put_contents('/some/directory/scrape_content.html', file_get_contents('http://google.com'));
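
If you also need the referenced images copied down, a rough extension of the same idea (the URL and target directory are placeholders, and error handling is omitted) could look like this:

    // Grab the page markup and save it locally
    $url  = 'http://www.example.com/';
    $html = file_get_contents($url);
    file_put_contents('/some/directory/scrape_content.html', $html);

    // Parse the markup and copy each referenced image alongside it
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Naive approach: assumes the src attributes are absolute URLs
        file_put_contents('/some/directory/' . basename($src), file_get_contents($src));
    }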

Save your money for charity.

John Cartwright

By content, do you mean the entire page contents? If so, you can just "Save As…" the whole page, which includes most of the embedded media.

Firefox, under Tools -> Page Info -> Media, includes a listing of every piece of media on the page, and you can download each one.

Tony Lukasavage

Don't bother with PHP for something like this. You can use wget to grab an entire site trivially. Be aware, though, that it won't parse things like CSS for you, so it won't grab any files referenced via (say) background-image: url('/images/pic.jpg'), but it will snag most everything else.
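
As a rough sketch, assuming GNU wget (the URL and output directory are placeholders), a mirroring command could look like this:

    wget --mirror --convert-links --page-requisites --no-parent -P ./site-copy http://www.example.com/

--mirror turns on recursion, --page-requisites pulls the images, scripts, and stylesheets each page needs, and --convert-links rewrites the links so the copy can be browsed locally.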

Marc B

This class can help you scrape the content: http://simplehtmldom.sourceforge.net/
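
As a quick sketch of how the library is used (based on its documented file_get_html() and find() helpers; the URL is a placeholder), something like this pulls out every image source on a page:

    // simple_html_dom.php is the file downloaded from the link above
    include 'simple_html_dom.php';

    // Load the remote page into a DOM-like object
    $html = file_get_html('http://www.example.com/');

    // Print the src attribute of every image on the page
    foreach ($html->find('img') as $img) {
        echo $img->src . "\n";
    }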

Klaus S.
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Apr 25 '11 at 17:10
  • Thanks for the suggestions, Gordon. Really good. :D – Klaus S. Apr 26 '11 at 21:31
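
To illustrate the DOM-based route Gordon mentions in the comment above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than a third-party library (the URL is a placeholder and error handling is omitted):

    // Fetch the page and load it into PHP's native HTML parser
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents('http://www.example.com/'));

    // Use XPath to list every link target on the page
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a/@href') as $href) {
        echo $href->nodeValue . "\n";
    }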

You can scrape websites with http://scrapy.org and get the content you want.

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


You can also achieve this with the browser's save option: in Firefox, go to File -> Save Page As, and the page, along with all of its images and JS, will be saved into one folder.

jimy

I started using HTTrack a couple of years ago and I'm happy with it. It seems to go out of its way to get pages I wouldn't even see on my own.

Pete Wilson