How Google interprets the robots.txt specification

Google’s automated
crawlers support the
Robots Exclusion Protocol (REP).
This means that before crawling a site, Google’s crawlers download and parse the site’s
robots.txt file to extract information about which parts of the site may be crawled. The REP
isn’t applicable to Google’s crawlers that are controlled by users (for example, feed
subscriptions), or crawlers that are used to increase user safety (for example, malware
analysis).

What is a robots.txt file

If you don’t want crawlers to access sections of your site, you can create a robots.txt file
with appropriate rules. A robots.txt file is a simple text file containing rules about which
crawlers may access which parts of a site.

File location and range of validity

You must place the robots.txt file in the top-level directory of a site, on a supported
protocol. In the case of Google Search, the supported protocols are HTTP,
HTTPS, and FTP. On HTTP and HTTPS, crawlers fetch the robots.txt file with an HTTP
non-conditional GET request; on FTP, crawlers use a standard
RETR (RETRIEVE) command, using anonymous login.

The rules listed in the robots.txt file apply only to the host, protocol, and port number
where the robots.txt file is hosted.
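
As an illustration of this scope rule, here is a minimal Python sketch (not Google’s implementation; robots_txt_url and rules_apply are names made up for this example) that derives the governing robots.txt URL for a page and checks whether two URLs share the same protocol, host, and port, treating default ports as equivalent to no port:

from urllib.parse import urlsplit, urlunsplit

_DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url: the same scheme,
    host, and port, with the path fixed to /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def _origin(url: str):
    """(scheme, host, port) triple; a missing port counts as the protocol's
    default port, so :80 and no port are equivalent on HTTP."""
    parts = urlsplit(url)
    port = parts.port or _DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def rules_apply(robots_url: str, page_url: str) -> bool:
    """True if rules fetched from robots_url are valid for page_url."""
    return _origin(robots_url) == _origin(page_url)

print(robots_txt_url("http://example.com/folder/file"))                          # http://example.com/robots.txt
print(rules_apply("http://example.com/robots.txt", "http://example.com:80/"))    # True
print(rules_apply("http://example.com/robots.txt", "https://example.com/"))      # False
print(rules_apply("http://example.com/robots.txt", "http://example.com:8181/"))  # False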

Examples of valid robots.txt URLs

http://example.com/robots.txt

Valid for:

  • http://example.com/
  • http://example.com/folder/file

Not valid for:

  • http://other.example.com/
  • https://example.com/
  • http://example.com:8181/

http://www.example.com/robots.txt

Valid for:

  • http://www.example.com/

Not valid for:

  • http://example.com/
  • http://shop.www.example.com/
  • http://www.shop.example.com/

http://example.com/folder/robots.txt

Not a valid robots.txt file. Crawlers don’t check for robots.txt files in subdirectories.

http://www.exämple.com/robots.txt

Valid for:

  • http://www.exämple.com/
  • http://xn--exmple-cua.com/

Not valid for:

  • http://www.example.com/

ftp://example.com/robots.txt

Valid for:

  • ftp://example.com/

Not valid for:

  • http://example.com/

http://212.96.82.21/robots.txt

Valid for:

  • http://212.96.82.21/

Not valid for:

  • http://example.com/ (even if hosted on 212.96.82.21)

http://example.com:80/robots.txt

Valid for:

  • http://example.com:80/
  • http://example.com/

Not valid for:

  • http://example.com:81/

http://example.com:8181/robots.txt

Valid for:

  • http://example.com:8181/

Not valid for:

  • http://example.com/

Handling of errors and HTTP status codes

When requesting a robots.txt file, the HTTP status code of the server’s response affects how
the robots.txt file will be used by Google’s crawlers.

2xx (successful)

HTTP status codes that signal success prompt Google’s crawlers to process the robots.txt
file as provided by the server.
3xx (redirection)

Google follows at least five redirect hops as defined by
RFC 1945 and then
stops and treats it as a 404 for the robots.txt. This also applies to any
disallowed URLs in the redirect chain, since the crawler couldn’t fetch rules due to
the redirects.

Google doesn’t follow logical redirects in robots.txt files (frames, JavaScript, or
meta refresh-type redirects).

4xx (client errors)

Google’s crawlers treat all 4xx errors as if a valid robots.txt file
didn’t exist, which means crawling without restrictions.

5xx (server error)

Because the server couldn’t give a definite response to Google’s robots.txt request,
Google temporarily interprets server errors as if the site is fully disallowed. Google
will try to crawl the robots.txt file until it obtains a non-server-error HTTP status
code. A 503 (service unavailable) error results in fairly frequent
retrying. If the robots.txt is unreachable for more than 30 days, Google will use the
last cached copy of the robots.txt. If unavailable, Google assumes that there are no
crawl restrictions.

If we are able to determine that a site is incorrectly configured to return
5xx instead of a 404 status code for missing pages, we treat
the 5xx error from that site as a 404. For example, if the
error message on a page that returns a 5xx status code is “Page not
found”, we would interpret the status code as 404 (not found).

Other errors

A robots.txt file which cannot be fetched due to DNS or networking issues, such as
timeouts, invalid responses, reset or interrupted connections, and HTTP chunking errors,
is treated as a server error.
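
The table above can be summarized as a small decision function. The following Python sketch only illustrates the described behavior (crawl_policy_for_status is a hypothetical name) and assumes redirects have already been resolved by the HTTP client:

def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy,
    following the behavior described in the table above."""
    if 200 <= status < 300:
        return "parse"          # use the rules exactly as served
    if 300 <= status < 400:
        # Reaching this point means the client gave up on the redirect
        # chain (more than five hops): treated like a 404.
        return "allow_all"
    if 400 <= status < 500:
        return "allow_all"      # as if no robots.txt exists: no restrictions
    if 500 <= status < 600:
        # Temporary full disallow; retry until a non-5xx status, or fall
        # back to the last cached copy after 30 days.
        return "disallow_all"
    return "error"              # anything else is unexpected

print(crawl_policy_for_status(200))  # parse
print(crawl_policy_for_status(404))  # allow_all
print(crawl_policy_for_status(503))  # disallow_all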

Caching

Google generally caches the contents of a robots.txt file for up to 24 hours, but may cache it
longer in situations where refreshing the cached version isn’t possible (for example, due to
timeouts or 5xx errors). The cached response may be shared by different crawlers.
Google may increase or decrease the cache lifetime based on
max-age Cache-Control
HTTP headers.
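
A crawler that wanted to mirror this caching behavior might derive the cache lifetime from the Cache-Control header and fall back to 24 hours otherwise. A minimal sketch (cache_ttl is a name invented for this illustration):

import re

DEFAULT_TTL = 24 * 60 * 60  # 24 hours, in seconds

def cache_ttl(cache_control):
    """Choose a cache lifetime (in seconds) for a fetched robots.txt file:
    honor max-age from the Cache-Control header when present, otherwise
    fall back to the usual 24 hours."""
    if cache_control:
        match = re.search(r"max-age=(\d+)", cache_control)
        if match:
            return int(match.group(1))
    return DEFAULT_TTL

print(cache_ttl("public, max-age=3600"))  # 3600
print(cache_ttl(None))                    # 86400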

File format

The robots.txt file must be a
UTF-8 encoded plain text
file and the lines must be separated by CR, CR/LF, or
LF.

Google ignores invalid lines in robots.txt files, including the Unicode
Byte Order Mark
(BOM) at the beginning of the robots.txt file, and uses only valid lines. For example, if the
content downloaded is HTML instead of robots.txt rules, Google will try to parse the content
and extract rules, and ignore everything else.

Similarly, if the character encoding of the robots.txt file isn’t UTF-8, Google may ignore
characters that are not part of the UTF-8 range, potentially rendering robots.txt rules
invalid.

Google currently enforces a robots.txt file size limit of 500
kibibytes (KiB). Content after the maximum file size is ignored. You can reduce the size of the robots.txt
file by consolidating directives that would result in an oversized robots.txt file. For
example, place excluded material in a separate directory.
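
Taken together, the rules in this section suggest roughly the following decoding steps. This is only a sketch of the described behavior, not Google’s parser; decode_robots_txt and MAX_SIZE are names invented for the example:

MAX_SIZE = 500 * 1024  # 500 KiB; anything beyond this is ignored

def decode_robots_txt(raw: bytes) -> list[str]:
    """Decode a fetched robots.txt body along the lines described above:
    truncate to 500 KiB, decode as UTF-8 while dropping bytes outside the
    UTF-8 range, strip a leading byte order mark, and split the result on
    CR, CR/LF, or LF line breaks."""
    raw = raw[:MAX_SIZE]
    text = raw.decode("utf-8", errors="ignore")
    text = text.lstrip("\ufeff")   # ignore a Unicode BOM at the start
    return text.splitlines()       # handles CR, CR/LF, and LF

lines = decode_robots_txt(b"\xef\xbb\xbfuser-agent: *\r\ndisallow: /private/\n")
print(lines)  # ['user-agent: *', 'disallow: /private/']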

Syntax

Valid robots.txt lines consist of a field, a colon, and a value. Spaces are optional, but
recommended to improve readability. Space at the beginning and at the end of the line is
ignored. To include comments, precede your comment with the # character. Keep in
mind that everything after the # character will be ignored. The general format is
<field>:<value><#optional-comment>.
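
A single line in this format can be split with a small parser. The sketch below is illustrative only (parse_line is a hypothetical helper); it assumes field names are normalized to lowercase for comparison and drops everything after the # character:

import re

# <field>:<value><#optional-comment>, with optional surrounding whitespace.
_LINE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*([^#]*)(?:#.*)?$")

def parse_line(line):
    """Split one robots.txt line into a (field, value) pair, or return
    None for blank lines, comment-only lines, and lines that don't fit
    the <field>:<value> format."""
    match = _LINE.match(line)
    if match is None:
        return None
    field = match.group(1).lower()   # normalize the field name for comparison
    value = match.group(2).strip()   # keep the value itself as-is, minus whitespace
    return field, value

print(parse_line("User-agent: Googlebot  # web search crawler"))
# ('user-agent', 'Googlebot')
print(parse_line("disallow: /private/"))
# ('disallow', '/private/')
print(parse_line("# just a comment"))
# None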

Google supports the following fields:

  • user-agent: identifies which crawler the rules apply to.
  • allow: a URL path that may be crawled.
  • disallow: a URL path that may not be crawled.
  • sitemap: the complete URL of a sitemap.

The allow and disallow fields are also called directives.
These directives are always specified in the form of
directive: [path] where [path] is optional. By default, there are no
crawling restrictions for the designated crawlers. Crawlers ignore directives without a
[path].

The [path] value, if specified, is relative to the root of the website from where
the robots.txt file was fetched (using the same protocol, port number, host and domain names).
The path value must start with / to designate the root and the value is
case-sensitive. Learn more about
URL matching based on path values.

user-agent

The user-agent line identifies which crawler the rules apply to. See
Google’s crawlers and user-agent strings
for a comprehensive list of user-agent strings you can use in your robots.txt file.

The value of the user-agent line is case-insensitive.

disallow

The disallow directive specifies paths that must not be accessed by the crawlers
identified by the user-agent line the disallow directive is grouped with.
Crawlers ignore the directive without a path.

The value of the disallow directive is case-sensitive.

Usage:

disallow: [path]

allow

The allow directive specifies paths that may be accessed by the designated
crawlers. When no path is specified, the directive is ignored.

The value of the allow directive is case-sensitive.

Usage:

allow: [path]

sitemap

Google, Bing, and other major search engines support the sitemap field in
robots.txt, as defined by sitemaps.org.

The value of the sitemap field is case-sensitive.

Usage:

sitemap: [absoluteURL]

The [absoluteURL] line points to the location of a sitemap or sitemap index file.
It must be a fully qualified URL, including the protocol and host, and doesn’t have to be
URL-encoded. The URL doesn’t have to be on the same host as the robots.txt file. You can
specify multiple sitemap fields. The sitemap field isn’t tied to any specific
user agent and may be followed by all crawlers, provided it isn’t disallowed for crawling.

For example:

user-agent: otherbot
disallow: /kale

sitemap: https://example.com/sitemap.xml
sitemap: https://cdn.example.org/other-sitemap.xml
sitemap: https://ja.example.org/テスト-サイトマップ.xml

Grouping of lines and rules

You can group together rules that apply to multiple user agents by repeating
user-agent lines for each crawler.

For example:

user-agent: a
disallow: /c

user-agent: b
disallow: /d

user-agent: e
user-agent: f
disallow: /g

user-agent: h

In this example there are four distinct rule groups:

  • One group for user agent “a”.
  • One group for user agent “b”.
  • One group for both “e” and “f” user agents.
  • One group for user agent “h”.

For the technical description of a group, see
section 2.1 of the REP.

Order of precedence for user agents

Only one group is valid for a particular crawler. Google’s crawlers determine the correct
group of rules by finding in the robots.txt file the group with the most specific user agent
that matches the crawler’s user agent. Other groups are ignored. All non-matching text is
ignored (for example, both googlebot/1.2 and googlebot* are
equivalent to googlebot). The order of the groups within the robots.txt file is
irrelevant.

If there’s more than one group declared for a specific user agent, all the rules from the
groups applicable to the specific user agent are combined internally into a single group.
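
One way to picture this selection is as a lookup over normalized user-agent names. The following Python sketch approximates the described matching and is not Google’s implementation; normalize_agent and select_group are invented names, and the crawler’s fallback tokens (for example, googlebot for googlebot-image) are assumed to be known from the crawler documentation:

import re

def normalize_agent(name: str) -> str:
    """Normalize a user-agent value for comparison: matching is
    case-insensitive and anything from a '/' or '*' onwards is ignored,
    so 'googlebot/1.2' and 'googlebot*' both become 'googlebot'."""
    return re.sub(r"[/*].*$", "", name).strip().lower()

def select_group(groups, crawler_tokens):
    """Pick the single rule group a crawler follows. groups maps a
    lowercased user-agent name (or '*') to its merged list of rules;
    crawler_tokens lists the crawler's product tokens from most to
    least specific, e.g. ['googlebot-news', 'googlebot']."""
    for token in crawler_tokens:
        token = normalize_agent(token)
        if token in groups:
            return groups[token]
    # No specific match: fall back to the wildcard group, if any.
    return groups.get("*", [])

groups = {"googlebot-news": ["group 1"], "*": ["group 2"], "googlebot": ["group 3"]}
print(select_group(groups, ["googlebot-news", "googlebot"]))   # ['group 1']
print(select_group(groups, ["googlebot-image", "googlebot"]))  # ['group 3']
print(select_group(groups, ["otherbot"]))                      # ['group 2']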

Examples

Matching of user-agent fields

user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)

This is how the crawlers would choose the relevant group:

Group followed per crawler

Googlebot News
googlebot-news follows group 1, because group 1 is the most specific group.

Googlebot (web)
googlebot follows group 3.

Googlebot Images
googlebot-images follows group 3: there is no specific googlebot-images group, so the more generic googlebot group is followed.

Googlebot News (when crawling images)
When crawling images, googlebot-news follows group 1. googlebot-news doesn’t crawl the images for Google Images, so it only follows group 1.

Otherbot (web)
Other Google crawlers follow group 2.

Otherbot (news)
Other Google crawlers that crawl news content, but don’t identify as googlebot-news, follow group 2. Even if there is an entry for a related crawler, it is only valid if it’s specifically matching.

Grouping of rules

If there are multiple groups in a robots.txt file that are relevant to a specific user agent,
Google’s crawlers internally merge the groups. For example:

user-agent: googlebot-news
disallow: /fish

user-agent: *
disallow: /carrots

user-agent: googlebot-news
disallow: /shrimp

The crawlers internally group the rules based on user agent, for example:

user-agent: googlebot-news
disallow: /fish
disallow: /shrimp

user-agent: *
disallow: /carrots
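
A sketch of this merging step (merge_groups is a hypothetical helper; it consumes (field, value) pairs such as those produced by the parse_line sketch earlier, and is not Google’s implementation):

def merge_groups(parsed_lines):
    """Collect rules per user agent: consecutive user-agent lines share one
    group, and repeated groups for the same agent are merged into a single
    rule list. parsed_lines is a sequence of (field, value) pairs in file
    order."""
    groups = {}
    current_agents = []
    seen_rule = True   # a user-agent line after rules starts a new group
    for field, value in parsed_lines:
        if field == "user-agent":
            if seen_rule:
                current_agents = []
                seen_rule = False
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])   # register even an empty group
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups

merged = merge_groups([
    ("user-agent", "googlebot-news"), ("disallow", "/fish"),
    ("user-agent", "*"), ("disallow", "/carrots"),
    ("user-agent", "googlebot-news"), ("disallow", "/shrimp"),
])
print(merged["googlebot-news"])  # [('disallow', '/fish'), ('disallow', '/shrimp')]
print(merged["*"])               # [('disallow', '/carrots')]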

URL matching based on path values

Google uses the path value in the allow and disallow directives as a
basis to determine whether or not a rule applies to a specific URL on a site. This works by
comparing the rule to the path component of the URL that the crawler is trying to fetch.
Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-escaped
UTF-8 encoded characters per
RFC 3986.

Google, Bing, and other major search engines support a limited form of wildcards for
path values. These are:

  • * designates 0 or more instances of any valid character.
  • $ designates the end of the URL.
Example path matches

/
Matches the root and any lower level URL.

/*
Equivalent to /. The trailing wildcard is ignored.

/$
Matches only the root. Any lower level URL is allowed for crawling.

/fish

Matches any path that starts with /fish.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn’t match:

  • /Fish.asp
  • /catfish
  • /?id=fish
  • /desert/fish

/fish*

Equivalent to /fish. The trailing wildcard is ignored.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn’t match:

  • /Fish.asp
  • /catfish
  • /?id=fish

/fish/

Matches anything in the /fish/ folder.

Matches:

  • /fish/
  • /animals/fish/
  • /fish/?id=anything
  • /fish/salmon.htm

Doesn’t match:

  • /fish
  • /fish.html
  • /Fish/Salmon.asp

/*.php

Matches any path that contains .php.

Matches:

  • /index.php
  • /filename.php
  • /folder/filename.php
  • /folder/filename.php?parameters
  • /folder/any.php.file.html
  • /filename.php/

Doesn’t match:

  • / (even if it maps to /index.php)
  • /windows.PHP

/*.php$

Matches any path that ends with .php.

Matches:

  • /filename.php
  • /folder/filename.php

Doesn’t match:

  • /filename.php?parameters
  • /filename.php/
  • /filename.php5
  • /windows.PHP

/fish*.php

Matches any path that contains /fish and .php, in that order.

Matches:

  • /fish.php
  • /fishheads/catfish.php?parameters

Doesn’t match:

  • /Fish.PHP
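
One way to approximate this wildcard matching is to translate a path value into a regular expression applied to the URL’s path plus any query string. This is an illustrative sketch, not Google’s matcher; path_pattern_to_regex is an invented name:

import re

def path_pattern_to_regex(pattern):
    """Translate a robots.txt path value into a regular expression:
    '*' matches zero or more characters, a trailing '$' anchors the match
    at the end of the URL; matching is case-sensitive and is applied to
    the URL's path plus any query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything else, then restore the wildcard semantics of '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(path_pattern_to_regex("/fish*.php").match("/fishheads/catfish.php?parameters")))  # True
print(bool(path_pattern_to_regex("/*.php$").match("/filename.php?parameters")))              # False
print(bool(path_pattern_to_regex("/fish").match("/desert/fish")))                            # False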

Order of precedence for rules

When matching robots.txt rules to URLs, crawlers use the most specific rule based on the
length of its path. In case of conflicting rules, including those with wildcards, Google uses
the least restrictive rule.
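
A simple approximation of this precedence, reusing the path_pattern_to_regex helper sketched above (applicable_rule is an invented name; specificity is read here as the length of the rule’s path value, and conflicts resolve to allow as the least restrictive rule). The print statements mirror some of the sample situations below:

def applicable_rule(rules, path):
    """Pick the rule that applies to path: among the matching rules, the
    one with the longest path value wins; when matching rules conflict,
    allow wins as the least restrictive. rules is a list of
    ('allow' | 'disallow', path_value) pairs."""
    matches = [
        (kind, value) for kind, value in rules
        if path_pattern_to_regex(value).match(path)
    ]
    if not matches:
        return None   # nothing matches: crawling is allowed by default
    # Longest path value first; 'allow' sorts before 'disallow' on ties.
    matches.sort(key=lambda rule: (-len(rule[1]), rule[0] != "allow"))
    return matches[0]

print(applicable_rule([("allow", "/p"), ("disallow", "/")], "/page"))
# ('allow', '/p')
print(applicable_rule([("allow", "/folder"), ("disallow", "/folder")], "/folder/page"))
# ('allow', '/folder')
print(applicable_rule([("allow", "/page"), ("disallow", "/*.htm")], "/page.htm"))
# ('disallow', '/*.htm')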

The following examples demonstrate which rule Google’s crawlers will apply on a given URL.

Sample situations
http://example.com/page
allow: /p
disallow: /

Applicable rule: allow: /p, because it’s more specific.

http://example.com/folder/page
allow: /folder
disallow: /folder

Applicable rule: allow: /folder, because in case of
matching rules, Google uses the least restrictive rule.

http://example.com/page.htm
allow: /page
disallow: /*.htm

Applicable rule: disallow: /*.htm, because it matches
more characters in the URL, so it’s more specific.

http://example.com/page.php5
allow: /page
disallow: /*.ph

Applicable rule: allow: /page, because in case of
matching rules, Google uses the least restrictive rule.

http://example.com/
allow: /$
disallow: /

Applicable rule: allow: /$, because it’s more specific.

http://example.com/page.htm
allow: /$
disallow: /

Applicable rule: disallow: /, because the
allow rule only applies on the root URL.
