
I'm looking for a plugin or some simple code that fetches images from a link faster. I have been using http://simplehtmldom.sourceforge.net/ to extract the first 3 images from a given link.

simplehtmldom is quite slow, and many users on my site are reporting the slowness as an issue.

Correct me if I'm wrong, but I believe this plugin takes a lot of time because it fetches the complete HTML code from the URL I pass and only then searches for img tags.

Can someone please suggest a technique to improve the speed of fetching the HTML code, or an alternative plugin I can try?

What I'm thinking is something like fetching the HTML code only until the first three img tags are found, and then killing the fetching process, so that things get faster.

I'm not sure if that's possible with PHP, although I'm trying hard to design it using jQuery.

Thanks for your help!

Yesh
  • What can you tell us about the page that you're getting the images from? That's most likely the bottleneck, rather than a simple parse and find. Could we see the page in question? – Reinstate Monica Cellio Jan 25 '13 at 17:46
  • I'm not talking about a particular page. It's just like pinterest or facebook caching a web page's image when a user passes a URL. – Yesh Jan 25 '13 at 17:59

1 Answer


The browser's same-origin (cross-site) restrictions will prevent you from doing something like this in jQuery/JS (unless you control all the domains that you'll be grabbing content from). What you're doing is not going to be super fast in any case, but try writing your own using file_get_contents() paired with DOMDocument... DOMDocument's getElementsByTagName() method may be faster than simplehtmldom's find() method.
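
For example, a rough sketch of that approach (assuming $url holds the page you're fetching) might look like this:

$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; suppress parse warnings
$doc->loadHTML($html);
libxml_clear_errors();

$sources = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
    if (count($sources) >= 3) {
        break;                       // stop once the first three images are collected
    }
}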

You could also try a regex approach. It won't be as fool-proof as a true DOM parser, but it will probably be faster... Something like:

// fetch the page and grab every <img ... src="..."> occurrence
$html = file_get_contents($url);
preg_match_all('/<img[^>]*?src="([^">]*)"[^>]*>/i', $html, $arr, PREG_PATTERN_ORDER);

If you want to avoid reading whole large files, you can also skip the file_get_contents() call and sub in an fopen() / while(!feof()) loop, checking for images after each line is read from the remote server (see the sketch below). If you take this approach, however, make sure you're regexing the WHOLE buffered string, not just the most recent line, as the code for an image could easily be broken across several lines.
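
A minimal sketch of that streaming idea (again assuming $url, and that allow_url_fopen is enabled) could look something like:

$handle = fopen($url, 'r');
$buffer = '';
$images = array();

while (!feof($handle)) {
    $buffer .= fgets($handle, 4096);   // append the next line/chunk to the running buffer
    // regex the WHOLE buffer, since an <img> tag may be split across lines
    preg_match_all('/<img[^>]*?src="([^">]*)"[^>]*>/i', $buffer, $matches);
    if (count($matches[1]) >= 3) {
        $images = array_slice($matches[1], 0, 3);
        break;                         // stop downloading once three images are found
    }
}
fclose($handle);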

Keep in mind that real-life variability in HTML will make regex an imperfect solution at best, but if speed is a major concern it might be your best option.

Ben D
  • Thanks! What do you think popular sites like Pinterest and Facebook are doing? Or is it just that they have good servers? – Yesh Jan 25 '13 at 17:50
  • As I understand it, Facebook is using javascript to do this. They are unaffected by the cross-scripting rules because the js is embedded *in* the code of the target site (i.e. it's not going out and fetching the page, it's a part of the page in the target site). Where they don't rely on this method, I'm sure they have a much more sophisticated approach (parsing the code as it's being read from the remote site until it's found an image, and then stopping the read so it doesn't have to download the full content, perhaps?). – Ben D Jan 25 '13 at 18:05
  • Exactly. As I have mentioned in my question, I'm gathering techniques to achieve the same, i.e. parsing until 3 img tags (and those images being larger than a standard size!) are found and then stopping the process. – Yesh Jan 25 '13 at 18:27
  • 1
    This approach will probably break an PGP DOM-parsing libraries, but you could use the regex approach paired with a `fopen/feof` loop (look at the last answer [here](http://stackoverflow.com/questions/5970632/php-how-to-read-big-remote-files-efficiently-and-use-buffer-in-loop). this will read the code one line at a time... you can then regex the single buffered line and stop if you've hit three images. – Ben D Jan 25 '13 at 19:29