I want to crawl a site's search results. My script works fine for normal pages, but how do I crawl login-protected pages? I have the login details; how do I send them along with the URL? My code:

crawl_page("http://www.site.com/v3/search/results?start=1&sortCol=MyDefault");

function crawl_page($url) {
    $html = file_get_contents($url);
    if ($html === false) {
        die('Cannot fetch page: ' . $url);
    }
    // Capture the href value of every anchor tag.
    preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);
    $str = "http://www.site.com/v3/features/id?Profile=";

    // Open the output file once instead of once per matching link.
    $my_file = 'link.txt';
    $handle = fopen($my_file, 'a') or die('Cannot open file:  ' . $my_file);
    foreach ($matches[1] as $newurl) {
        // Keep only links that point to profile pages.
        if (strpos($newurl, $str) !== false) {
            fwrite($handle, $newurl . PHP_EOL);
        }
    }
    fclose($handle);
}

Any help is appreciated. Thanks.

Shawon
  • 1
    Put the username and password combination ability into your crawler. With a very long random unguessable password. Pretty sure you can't crawl secure pages without login. That would defeat the purpose. – PenguinCoder Feb 06 '13 at 14:48
  • 7
    Automated flirting? Sounds like you're trying to abuse this system. – Joe Feb 06 '13 at 14:48
  • I have a username and password – Shawon Feb 06 '13 at 14:49
  • 2
    Still. Sounds creepy and almost certainly a breach of the terms. – Joe Feb 06 '13 at 14:53
  • Read the site's ToS, look for their API, get the keys and implement against that – CSᵠ Feb 06 '13 at 15:04
  • This is a legitimate question, and a real problem. Some interfaces don't have APIs. I personally use this kind of process to install Joomla extensions on test servers, and there's nothing wrong with that. Plus, what (s)he's doing is none of your business. – Bgi Feb 06 '13 at 15:29

2 Answers

This depends heavily on the authentication method the site uses. The simplest one is HTTP Basic Auth. For that method, you only need to build a stream context like this:

$context = stream_context_create(array(
    'http' => array(
        'header'  => "Authorization: Basic " . base64_encode("$username:$password")
    )
));
$data = file_get_contents($url, false, $context);

This way, file_get_contents sends the HTTP Basic Auth credentials with the request.

Other auth methods may require more work, like sending passwords via POST to login pages and storing session cookies.
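
A rough sketch of that POST-and-cookie flow using only stream contexts, under stated assumptions: the function names cookies_from_headers() and fetch_after_login() are made up for illustration, and the login URL and form field names depend entirely on the target site. It also ignores CSRF tokens, which many login forms require.

```php
<?php
// Sketch only: URLs and form field names passed in are placeholders.

/**
 * Build a Cookie request header value from raw response header lines,
 * keeping only the name=value part of each Set-Cookie header.
 */
function cookies_from_headers(array $headerLines): string
{
    $pairs = array();
    foreach ($headerLines as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            $value = trim(substr($line, strlen('Set-Cookie:')));
            // Drop attributes such as "path=/" or "HttpOnly".
            $parts = explode(';', $value);
            $pairs[] = trim($parts[0]);
        }
    }
    return implode('; ', $pairs);
}

/**
 * POST the login form, then fetch a protected page with the returned cookies.
 */
function fetch_after_login(string $loginUrl, array $fields, string $targetUrl)
{
    $login = stream_context_create(array(
        'http' => array(
            'method'          => 'POST',
            'header'          => "Content-Type: application/x-www-form-urlencoded",
            'content'         => http_build_query($fields),
            'follow_location' => 0, // stop at the login response so its Set-Cookie is visible
        )
    ));
    if (file_get_contents($loginUrl, false, $login) === false) {
        return false;
    }
    // PHP fills $http_response_header in the local scope after an HTTP request.
    $cookie = cookies_from_headers($http_response_header);

    $context = stream_context_create(array(
        'http' => array('header' => "Cookie: " . $cookie)
    ));
    return file_get_contents($targetUrl, false, $context);
}
```

A call might look like fetch_after_login($loginUrl, array('username' => $username, 'password' => $password), $url), where the two field names are guesses that must be checked against the site's login form.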

Martin Müller

My answer applies only to form-based authentication, which is the most common kind.

Basically, when you browse a website, you open a "session" on it. When you log in on the website, your session gets "authenticated" and you're granted access everywhere based on that.

Your browser tells the server which session is yours by sending a session ID stored in a cookie.

So you must first request the login page, submit your credentials, and then request the page you want, sending the cookie with every request. The cookie is what links all the pages you browse.
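
As an illustration of that sequence, here is a minimal sketch that lets cURL's own cookie jar carry the session cookie between requests. The helper names session_options() and login_and_fetch() are hypothetical, the cookie file path and field names are placeholders, and forms protected by a CSRF token need the extra step of reading the hidden inputs first.

```php
<?php
// Sketch only: URLs, field names and the cookie file path are placeholders.

/**
 * cURL options that make one handle behave like one browser session:
 * cookies are read from and written back to the same file.
 */
function session_options(string $cookieFile): array
{
    return array(
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies to this file
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the redirect after login
    );
}

/**
 * Log in once, then request a protected page on the same handle.
 */
function login_and_fetch(string $loginUrl, array $fields, string $targetUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, session_options($cookieFile));

    // Step 1: submit the login form; cURL remembers the session cookie.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    if (curl_exec($ch) === false) {
        return false;
    }

    // Step 2: switch back to GET; cURL resends the cookie automatically.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    return curl_exec($ch);
}
```

Because both requests share one handle, the cookie never has to be copied by hand; the file only matters if a later script run should reuse the same session.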

I faced the same problem a while ago and wrote a class that handles the cookie for you.

You can skim the class itself; its details are not important. Pay more attention to the usage example that follows it. The class lets you submit forms that are protected by a CSRF token.

The class has the following features:

- Works with CSRF token-based protection
- Sends a "common" User-Agent. Some websites reject requests that don't send one
- Sends a Referer header. Some websites reject requests that don't send one (this is another anti-CSRF protection)
- Keeps the cookie across calls

File: WebClient.php

<?php
/**
 * Webclient
 *
 * Helper class to browse the web
 *
 * @author Bgi
 */

class WebClient
{
    private $ch;
    private $cookie = '';
    private $html;

    public function Navigate($url, $post = array()) 
    {
        curl_setopt($this->ch, CURLOPT_URL, $url);
        curl_setopt($this->ch, CURLOPT_COOKIE, $this->cookie);
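        if (!empty($post)) {
            curl_setopt($this->ch, CURLOPT_POST, TRUE);
            curl_setopt($this->ch, CURLOPT_POSTFIELDS, $post);
        } else {
            // Reset to GET; otherwise the handle keeps re-sending the previous POST.
            curl_setopt($this->ch, CURLOPT_HTTPGET, TRUE);
        }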
        $response = $this->exec();
        if ($response['Code'] !== 200) {
            return FALSE;
        }
        //echo curl_getinfo($this->ch, CURLINFO_HEADER_OUT);
        return $response['Html'];
    }

    public function getInputs() 
    {
        $return = array();

        $dom = new DOMDocument();
        @$dom->loadHtml($this->html);
        $inputs = $dom->getElementsByTagName('input');
        foreach($inputs as $input)
        {
            if ($input->hasAttributes() && $input->attributes->getNamedItem('name') !== NULL)
            {
                if ($input->attributes->getNamedItem('value') !== NULL)
                    $return[$input->attributes->getNamedItem('name')->value] = $input->attributes->getNamedItem('value')->value;
                else
                    $return[$input->attributes->getNamedItem('name')->value] = NULL;
            }
        }

        return $return;
    }

    public function __construct()
    {
        $this->init();
    }

    public function __destruct()
    {
        $this->close();
    }

    private function init() 
    {
        $this->ch = curl_init();
        curl_setopt($this->ch, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1");
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($this->ch, CURLOPT_MAXREDIRS, 5);
        curl_setopt($this->ch, CURLINFO_HEADER_OUT, TRUE);
        curl_setopt($this->ch, CURLOPT_HEADER, TRUE);
        curl_setopt($this->ch, CURLOPT_AUTOREFERER, TRUE);
    }

    private function exec() 
    {
        $headers = array();
        $html = '';

        ob_start();
        curl_exec($this->ch);
        $output = ob_get_contents();
        ob_end_clean(); 

        $retcode = curl_getinfo($this->ch, CURLINFO_HTTP_CODE);

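        if ($retcode == 200) {
            // CURLINFO_HEADER_SIZE covers every header block, including those of followed redirects.
            $headerSize = curl_getinfo($this->ch, CURLINFO_HEADER_SIZE);

            $html = substr($output, $headerSize);

            $h = trim(substr($output, 0, $headerSize));
            $lines = explode("\n", $h);
            foreach($lines as $line) {
                // Split on the first ':' only, so values containing ':' (dates, URLs) are kept.
                $kv = explode(':', $line, 2);

                if (count($kv) == 2) {
                    $k = trim($kv[0]);
                    $v = trim($kv[1]);
                    $headers[$k] = $v;
                }
            }
        }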

        // TODO: it would deserve to be tested extensively.
        if (!empty($headers['Set-Cookie']))
            $this->cookie = $headers['Set-Cookie'];

        $this->html = $html;

        return array('Code' => $retcode, 'Headers' => $headers, 'Html' => $html);
    }

    private function close()
    {
        curl_close($this->ch);
    }
}

How do I use it?

In this example, I log in to a website, browse to a page that contains a file upload form, and then upload the file:

<?php
    require_once('WebClient.php');
    $url = 'http://example.com/administrator/index.php'; // This is a Joomla admin URL

    $wc = new WebClient();
    $page = $wc->Navigate($url);
    if ($page === FALSE) {
         die('Failed to load login page.');
    }

    echo('Logging in...');

    $post = $wc->getInputs();
    $post['username'] = $username;
    $post['passwd'] = $passwd;

    $page = $wc->Navigate($url, $post);
    if ($page === FALSE) {
        die('Failed to post credentials.');
    }

    echo('Initializing installation...');

    $page = $wc->Navigate($url.'?option=com_installer');
    if ($page === FALSE) {
        die('Failed to access installer.');
    }

    echo('Installing...');

    $post = $wc->getInputs();
    $post['install_package'] = new CURLFile($file); // File upload; the old '@'.$file syntax no longer works in PHP 7+

    $page = $wc->Navigate($url.'?option=com_installer&view=install', $post);
    if ($page === FALSE) {
        die('Failed to upload file.');
    }

    echo('Done.');

The Navigate() method returns either FALSE or the HTML content of the requested page.

Oh, and one last thing: don't use regexes to parse HTML; they are the wrong tool for the job. There is a legendary Stack Overflow answer about that: see here.
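
As a rough sketch of the DOM approach applied to the link extraction from the question: extract_links() is a made-up helper name, not part of WebClient, and the URL prefix is simply the one the question filters on.

```php
<?php
// Sketch only: extract_links() is a hypothetical helper name.

/**
 * Return the href of every <a> element whose URL contains $needle.
 */
function extract_links(string $html, string $needle): array
{
    $links = array();
    if ($html === '') {
        return $links; // loadHTML() rejects empty input on recent PHP versions
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; silence the parser warnings.
    @$dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && strpos($href, $needle) !== false) {
            $links[] = $href;
        }
    }
    return $links;
}
```

With the page returned by Navigate(), extract_links($page, 'http://www.site.com/v3/features/id?Profile=') would give the list of profile links, which can then be appended to link.txt.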

Community
Bgi
  • I used your code, but it shows 'Failed to post credentials.' Is there any way to use the browser's cookie? – Shawon Feb 07 '13 at 05:06
  • Are you sure you correctly identified the names of the inputs (in my case they are username and passwd, but they are probably different on your site)? You should try debugging the headers. – Bgi Feb 07 '13 at 10:00