I am trying to scrape several websites using the BeautifulSoup and mechanize libraries in Python. However, I came across a website with the following robots.txt:
User-Agent: *
Allow: /$
Disallow: /
According to Wikipedia, an Allow directive counteracts a following Disallow directive. I have read through simpler examples and understand how that works, but this situation is a bit confusing for me. Am I right to assume that my crawler is allowed to access everything on this website? If so, it seems really strange that the website would even bother writing a robots.txt in the first place...
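For what it's worth, here is how I can poke at those three lines with Python's standard-library parser (urllib.robotparser in Python 3, robotparser in Python 2). As far as I can tell it only does simple prefix matching and does not implement the $ end-of-URL anchor, so its verdict may not match Google's documented interpretation; the URL below is just a placeholder:

```python
# Python 3; in Python 2 the module is simply `robotparser`
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Allow: /$",
    "Disallow: /",
])

# The stdlib parser compares rule paths against the URL path by prefix,
# so it likely treats "/$" literally rather than as "exactly the root".
print(rp.can_fetch("*", "http://website.com/"))             # the root page
print(rp.can_fetch("*", "http://website.com/other-stuff"))  # any deeper path
```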
Extra information:
Mechanize gave me an error when I tried to scrape this website; it was something along the lines of HTTP Error 403, crawling is prohibited because of robots.txt. If my assumption stated above is correct, then I think mechanize returned the error because it is either not equipped to handle such a robots.txt, or it follows a different standard of interpreting robots.txt files. (In which case I will just have to make my crawler ignore robots.txt.)
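If it comes to that, mechanize can be told not to fetch or obey robots.txt at all via set_handle_robots(False). A minimal sketch, assuming BeautifulSoup 4 and using a placeholder URL and user agent:

```python
import mechanize
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4; the BS3 import differs

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or honour robots.txt
br.addheaders = [("User-Agent", "my-crawler/0.1")]  # placeholder identifier

response = br.open("http://website.com/")  # placeholder URL
soup = BeautifulSoup(response.read(), "html.parser")
print(soup.title)
```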
Update:
I just stumbled upon this question:
robots.txt allow root only, disallow everything else?
In particular, I looked at @eywu's answer, and now I think my initial assumption was wrong: I am only allowed to access website.com, but not website.com/other-stuff.
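To convince myself, I sketched the matching rules that Google documents (a trailing $ anchors the pattern to the end of the URL path, and the longest matching rule wins). The helper names here are my own, not any library's API:

```python
import re

# The rules from the robots.txt above, as (directive, path-pattern) pairs.
RULES = [("Allow", "/$"), ("Disallow", "/")]

def pattern_matches(pattern, path):
    # '*' matches any run of characters; a trailing '$' anchors the pattern
    # to the end of the URL path (both are extensions Google documents).
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

def is_allowed(path):
    matches = [(kind, pat) for kind, pat in RULES if pattern_matches(pat, path)]
    if not matches:
        return True  # nothing matched, so nothing forbids the path
    # The longest (most specific) matching pattern wins, which is why
    # "Allow: /$" beats "Disallow: /" for the root URL only.
    kind, _ = max(matches, key=lambda rule: len(rule[1]))
    return kind == "Allow"

print(is_allowed("/"))             # True  -> website.com/ is allowed
print(is_allowed("/other-stuff"))  # False -> everything else is blocked
```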