I am trying to scrape several websites using the BeautifulSoup and mechanize libraries in Python. However, I came across a website with the following robots.txt:
User-Agent: *
Allow: /$
Disallow: /
According to Wikipedia, an Allow directive counteracts a following Disallow directive. I have read through simpler examples and understand how that works, but this situation is a bit confusing for me. Am I right to assume that my crawler is allowed to access everything on this website? If so, it seems really strange that the website would even bother writing a robots.txt in the first place...
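For what it's worth, here is how I can poke at those three lines with Python's standard-library parser (urllib.robotparser in Python 3, robotparser in Python 2). As far as I can tell it only does simple prefix matching and does not implement the $ end-of-URL anchor, so its verdict may not match Google's documented interpretation; the URL below is just a placeholder:

```python
# Python 3; in Python 2 the module is simply `robotparser`
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Allow: /$",
    "Disallow: /",
])

# The stdlib parser compares rule paths against the URL path by prefix,
# so it likely treats "/$" literally rather than as "exactly the root".
print(rp.can_fetch("*", "http://website.com/"))             # the root page
print(rp.can_fetch("*", "http://website.com/other-stuff"))  # any deeper path
```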
Extra information:
Mechanize gave me an error when I tried to scrape this website; it was something along the lines of HTTP Error 403, crawling is prohibited because of robots.txt. If my assumption stated above is correct, then I think mechanize returned the error because it is either not equipped to handle such a robots.txt, or it follows a different standard of interpreting robots.txt files. (In which case I will just have to make my crawler ignore robots.txt.)
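If it comes to that, mechanize can be told not to fetch or obey robots.txt at all via set_handle_robots(False). A minimal sketch, assuming BeautifulSoup 4 and using a placeholder URL and user agent:

```python
import mechanize
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4; the BS3 import differs

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or honour robots.txt
br.addheaders = [("User-Agent", "my-crawler/0.1")]  # placeholder identifier

response = br.open("http://website.com/")  # placeholder URL
soup = BeautifulSoup(response.read(), "html.parser")
print(soup.title)
```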
Update:
I just stumbled upon this question:
robots.txt allow root only, disallow everything else?
In particular, I looked at @eywu's answer, and now I think my initial assumption was wrong: I am only allowed to access website.com, but not website.com/other-stuff.
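To convince myself, I sketched the matching rules that Google documents (a trailing $ anchors the pattern to the end of the URL path, and the longest matching rule wins). The helper names here are my own, not any library's API:

```python
import re

# The rules from the robots.txt above, as (directive, path-pattern) pairs.
RULES = [("Allow", "/$"), ("Disallow", "/")]

def pattern_matches(pattern, path):
    # '*' matches any run of characters; a trailing '$' anchors the pattern
    # to the end of the URL path (both are extensions Google documents).
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

def is_allowed(path):
    matches = [(kind, pat) for kind, pat in RULES if pattern_matches(pat, path)]
    if not matches:
        return True  # nothing matched, so nothing forbids the path
    # The longest (most specific) matching pattern wins, which is why
    # "Allow: /$" beats "Disallow: /" for the root URL only.
    kind, _ = max(matches, key=lambda rule: len(rule[1]))
    return kind == "Allow"

print(is_allowed("/"))             # True  -> website.com/ is allowed
print(is_allowed("/other-stuff"))  # False -> everything else is blocked
```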