3

I am trying to crawl some domains with different user-agents. My crawler works fine; the problem happens when a domain does not have a valid SSL certificate and is insecure, in which case I do not get any response with HttpClient. To get around that I use HttpClientHandler and accept the server certificate myself. With this solution I get a 301 for all of those domains, as if AllowAutoRedirect were false, although it is not (it defaults to true). I also tried setting MaxAutomaticRedirections to 5, but that did not work either.

Here is my code:

public async Task<int> Crawl(string userAgent, string url)
{
    var handler = new HttpClientHandler();
    handler.ClientCertificateOptions = ClientCertificateOption.Manual;
    // Accept any server certificate so insecure domains do not fail the request.
    handler.ServerCertificateCustomValidationCallback =
        (httpRequestMessage, cert, certChain, policyErrors) =>
    {
        return true;
    };

    var httpClient = new HttpClient(handler);

    httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);

    var statusCode = (int)(await httpClient.SendAsync(new HttpRequestMessage(HttpMethod.Get, url))).StatusCode;

    return statusCode;
}
Sajjad Mortazavi
  • Have you tried using [HttpClientHandler.DangerousAcceptAnyServerCertificateValidator](https://learn.microsoft.com/en-us/dotnet/api/system.net.http.httpclienthandler.dangerousacceptanyservercertificatevalidator?view=net-5.0) instead of `ClientCertificateOption.Manual`? – keenthinker Jan 19 '21 at 17:45 (a short sketch of this suggestion follows below)
  • You may want to take a look at [HttpClient doesn't redirect even when AllowAutoRedirect = true](https://stackoverflow.com/q/42405183/215552) – Heretic Monkey Jan 19 '21 at 17:49
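Regarding the first comment, a minimal sketch of that suggestion (my own illustration, assuming .NET Core 2.0 or later, where the property exists):

using System.Net.Http;

// Built-in "accept any server certificate" validator replaces the hand-written
// callback. Only reasonable for crawling; never use it for traffic that must
// stay authenticated.
var handler = new HttpClientHandler
{
    ServerCertificateCustomValidationCallback =
        HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
};
var httpClient = new HttpClient(handler);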

1 Answer

0

The domains I was trying to crawl did not have SSL certificates, so the HTTPS request was answered with a redirect to the HTTP version of the site. My guess is that HttpClient will not automatically follow a redirect from HTTPS down to plain HTTP, so it simply stopped and returned the 301.

My problem was solved by crawling the HTTP version of the domains instead, for example: http://example.com
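If you still need to start from the HTTPS URL, here is a minimal sketch (my own illustration, not from the original post; the method name CrawlInsecure and the single-hop redirect handling are assumptions) of following the HTTPS-to-HTTP downgrade redirect by hand, since the automatic redirect will not cross that boundary:

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class CrawlerSketch
{
    // Sketch only: accepts any certificate and manually follows a single
    // redirect, which covers the HTTPS -> HTTP downgrade that auto-redirect refuses.
    public static async Task<int> CrawlInsecure(string userAgent, string url)
    {
        var handler = new HttpClientHandler();
        // Crawl-only; never do this for sensitive traffic.
        handler.ServerCertificateCustomValidationCallback =
            (request, cert, chain, errors) => true;

        using var httpClient = new HttpClient(handler);
        httpClient.DefaultRequestHeaders.Add("User-Agent", userAgent);

        var response = await httpClient.GetAsync(url);

        var code = (int)response.StatusCode;
        if (code >= 300 && code < 400 && response.Headers.Location != null)
        {
            // Resolve a relative Location header against the original URL.
            var target = response.Headers.Location.IsAbsoluteUri
                ? response.Headers.Location
                : new Uri(new Uri(url), response.Headers.Location);
            response = await httpClient.GetAsync(target);
        }

        return (int)response.StatusCode;
    }
}

The same idea can be wrapped in a loop with a hop limit if a target chains several redirects.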

Sajjad Mortazavi