
I can't seem to get this to work but it seems really basic.

I want the domain root to be crawled

http://www.example.com

But nothing else to be crawled; all the subdirectories are dynamic:

http://www.example.com/*

I tried

User-agent: *
Allow: /
Disallow: /*/

but the Google webmaster test tool says all subdirectories are allowed.

Anyone have a solution for this? Thanks :)

cotopaxi
  • Try removing the `Allow` line or putting it after the `Disallow`. Crawlers are supposed to stop at the first match. – Brian Roach Aug 29 '11 at 05:43
  • Brian is right, first match rules, but beware that if you disallow everything this way, the Google "quick view" won't be able to load any images or scripts, so the display might get altered. So perhaps you would need to create at least one public folder in order to have your homepage displayed properly in "quick view". – Seb T. Aug 29 '11 at 06:15
  • For Googlebot, it isn't "first match" that wins; it is the "longest matching rule" that wins. – Stephen Ostermiller Nov 30 '19 at 12:12

2 Answers


According to the Backus-Naur Form (BNF) parsing definitions in Google's robots.txt documentation, the order of the Allow and Disallow directives doesn't matter. So changing the order really won't help you.

Instead, use the $ operator to mark the end of your path. $ means "the end of the URL" (i.e., don't match anything from this point on).

Test this robots.txt. I'm certain it should work for you (I've also verified in Google Search Console):

user-agent: *
Allow: /$
Disallow: /

This will allow http://www.example.com and http://www.example.com/ to be crawled, but block everything else.

Note: the Allow directive satisfies your particular use case, but if you have index.html or default.php, those URLs will not be crawled.

Side note: I'm only really familiar with Googlebot and bingbot behaviors. If there are any other engines you are targeting, they may or may not have specific rules on how the directives are listed out. So if you want to be "extra" sure, you can always swap the positions of the Allow and Disallow directive blocks; I only set them that way to debunk some of the comments.
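
If you want to sanity-check the matching yourself, here is a minimal Python sketch (my own illustration, not Google's parser) of the Googlebot-style precedence rule mentioned in the comments, where the longest matching pattern wins and Allow beats Disallow on a tie:

import re

# The two rules from the robots.txt above, as (directive, pattern) pairs.
RULES = [
    ("Allow", "/$"),
    ("Disallow", "/"),
]

def pattern_to_regex(pattern):
    # robots.txt wildcards: '*' matches any run of characters,
    # '$' anchors the pattern to the end of the URL path.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

def is_allowed(path, rules=RULES):
    # Longest matching pattern wins; on a tie, Allow beats Disallow.
    # A path that matches no rule at all is allowed by default.
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

print(is_allowed("/"))                # True  -> the home page is crawlable
print(is_allowed("/electr/pr.html"))  # False -> every other path is blocked

Real crawlers differ in details such as percent-encoding and tie-breaking, so treat this only as a quick illustration of the precedence rule, not as a substitute for the Search Console tester.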

eywu
  • Would only the root page be crawled, or would http://www.example.com/electr/pr.html also be OK? – GML-VS Mar 29 '17 at 14:42
  • 2
    The longest rule is `/$` so it takes precedence. However it only matches the home page and no other URLs. All other URLs fall into the `Disallow: /` rule which blocks all crawling. – Stephen Ostermiller Nov 30 '19 at 12:14
  • I used the code in this answer (i.e. `user-agent: * Allow: /$ Disallow: /`) and noticed that Twitter posts sharing the site's link weren't showing a preview image. I went to https://cards-dev.twitter.com/validator; no image was shown, and there was a message that said `WARN: The image URL https://www.my-domain-name.com/img/my_share_img.png specified by the 'og:image' metatag may be restricted by the site's robots.txt file, which will prevent Twitter from fetching it.`. Can anyone please confirm whether this would be a valid solution? `user-agent: * Allow: /$ Allow: /img/my_share_img.png Disallow: /` – user1063287 Sep 23 '20 at 09:28
  • Answer to my comment question above: Yes, that seems to work (all on new lines of course). – user1063287 Sep 23 '20 at 09:45

When you look at the Google robots.txt specification, you can see that:

Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:

  1. * designates 0 or more instances of any valid character
  2. $ designates the end of the URL

see https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#example-path-matches
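
To make those two wildcards concrete, here is a small Python sketch of how they translate into ordinary regular expressions (the /*.php$ pattern and sample URLs are illustrative, modelled on the path-matching examples in the linked doc, not taken from this question):

import re

def wildcard_to_regex(pattern):
    # '*' = any run of characters, '$' = end of the URL path.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

# '$' pins a rule to the end of the path, so "/$" matches only the bare root.
print(bool(wildcard_to_regex("/$").match("/")))            # True
print(bool(wildcard_to_regex("/$").match("/some/page")))   # False

# '*' spans any characters; '$' stops the match before a query string.
print(bool(wildcard_to_regex("/*.php$").match("/index.php")))      # True
print(bool(wildcard_to_regex("/*.php$").match("/index.php?x=1")))  # False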

Then, as eywu said, the solution is:

user-agent: *
Allow: /$
Disallow: /
charlesdg