Robots.txt test methods
By using the robots.txt file, webmasters can control the behavior of search engine robots (crawlers) when they index a website. The entire site or some of its areas can be closed to indexing. Robots can also be given important information such as the location of Sitemap files, the expected delay between two consecutive accesses to the site, etc.
Errors in the robots.txt file and deviations from the standard can cause problems with the indexing of your websites by some search engines.
Moderatobot checks the robots.txt file in order to identify potential problems that search robots might face when processing it. It then offers its own version of the robots.txt file that best meets the standard and can be understood equally by most major robots.
Status of this document
This document is neither a description of the robots.txt standard nor a manual on how to use robots.txt files to interact with robots. It describes the approaches that Moderatobot uses when testing robots.txt files.
Most robot developers do not reveal their algorithms. So we can only infer how robots check robots.txt files from the specifications that some developers provide and from the robots.txt standard, assuming that the robots follow this standard. We cannot guarantee that all robots use the same approaches as Moderatobot. However, verification of the robots.txt file by Moderatobot can uncover potential problems that robots may encounter at work.
- A Standard for Robot Exclusion
Some major search robots follow this standard. This is the document that Moderatobot refers to as the standard.
Some robots use this standard with extended capabilities. As a result, different robots can understand the same directives in different ways. To be sure that a particular robot will process your file exactly as you expect, you should refer to the developer documentation for that specific robot.
File name and location
Moderatobot processes files with the name in lowercase and uppercase.
The robots.txt file must be placed in the root directory of your domain. Each subdomain must have its own robots.txt file placed in the root directory of the subdomain. If you are using different protocols and ports to access your site, you must create a robots.txt for each combination of protocol and port.
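The rule above can be illustrated with a minimal Python sketch that derives the robots.txt address for any page URL. The function name robots_txt_url and the example.com addresses are our own assumptions, used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port, so every
    # protocol/host/port combination maps to its own robots.txt file.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For example, a page on https://shop.example.com:8443 is governed by https://shop.example.com:8443/robots.txt, not by the file on http://example.com.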
When robots try to access the robots.txt file, they receive an HTTP status code in the server response. Different robots can interpret these status codes differently.
2xx - Success. The file exists and is accessible. The robot receives the file and uses its instructions.
3xx - Redirect. According to the standard, robots should follow redirects until the file is found. But some robots follow redirects only up to 5 levels deep or until they meet a cycle. If the file is still not reached, these robots assume that the file does not exist and that access to the site is not restricted by the robots.txt file.
4xx - Client Error. According to the standard, robots have to assume that access to the whole site is prohibited when they receive code 401 (Unauthorized) or 403 (Forbidden), and that no access rules are specified and the site is not restricted by the robots.txt file when they receive code 404 (Not Found). But some robots interpret all 4xx codes as meaning that access to the site is not restricted.
5xx - Server Error. According to the standard, robots should suspend attempts to access the site until the file is available again. Some robots periodically retry to get the robots.txt file until they receive an HTTP status code other than 5xx. But some robots interpret all 5xx codes as meaning that access to the site is not restricted.
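The interpretation attributed to the standard in the list above can be sketched as a small lookup function. This is only an illustration: the function name and the returned labels are our own, and the handling of 4xx codes other than 401, 403 and 404 is an assumption, since the text does not cover them.

```python
def access_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt request to a crawl policy label."""
    if 200 <= status < 300:
        return "use-file"          # parse the file and obey its rules
    if 300 <= status < 400:
        return "follow-redirect"   # keep following until the file is found
    if status in (401, 403):
        return "disallow-all"      # whole site is treated as prohibited
    if 400 <= status < 500:
        # 404 means "no restrictions"; treating the remaining 4xx codes
        # the same way is an assumption made for this sketch.
        return "allow-all"
    if 500 <= status < 600:
        return "retry-later"       # suspend crawling until the file is back
    raise ValueError(f"unexpected HTTP status: {status}")
```

A robot that treats every non-200 response as "no restrictions" would replace all branches except the first with "allow-all".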
Moderatobot follows up to 10 redirects and begins to warn after the fifth redirect.
Moderatobot does not retry to get the robots.txt file if 5xx codes are received.
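The redirect limits can be pictured with a sketch that follows a simulated redirect table instead of making network requests. It is not Moderatobot's code: the function name, the dictionary-based table and the warning texts are assumptions, and we assume that "after the fifth redirect" means a chain longer than five hops.

```python
def follow_redirects(start, redirects, max_redirects=10, warn_after=5):
    """Follow start through the redirects mapping (source URL -> target URL).

    Returns (final_url, warnings); final_url is None when the file is not reached.
    """
    warnings = []
    seen = {start}
    url, hops = start, 0
    while url in redirects:
        if hops == max_redirects:
            warnings.append(f"gave up after {hops} redirects")
            return None, warnings
        url = redirects[url]
        hops += 1
        if url in seen:
            warnings.append("redirect cycle detected")
            return None, warnings
        seen.add(url)
    if hops > warn_after:
        warnings.append(f"long redirect chain: {hops} redirects")
    return url, warnings
```

A chain of three redirects returns the final URL without warnings, a chain of seven returns it with a warning, and a chain of eleven or a cycle returns None.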
Note that some robots will only process the responses with HTTP code 200. If a response is received with any other code, the robots assume that there are no restrictions and the website is fully accessible for indexing.
We have no information about how long different bots try to download the file before timing out. Moderatobot tries to get the file within 10 seconds, but it starts to warn about low speed after 5 seconds, since a 5-second download is already slow.
According to the standard, the file format is text/plain. Some robots expect the file to be encoded in UTF-8. Moderatobot tries to verify whether the file encoding is UTF-8 by checking the BOM signature of the file (if present).
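A BOM check of this kind can be sketched with the constants from Python's codecs module. The function name and the set of signatures tested are our own choices. A file without a BOM yields None, which says nothing about its real encoding.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def bom_encoding(data: bytes):
    """Return the encoding named by the BOM at the start of data, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```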
If robots.txt is not a text file or has format errors, robots can ignore the file or a part of it.
The file consists of records that either follow one another or are separated by blank line(s). The record types are:
A comment is any sequence of characters beginning with a # character and ending with a CR, CR/LF or LF. All bots should ignore comments. Records consisting entirely of comments are ignored completely and so cannot be considered blank lines. Moderatobot, in accordance with the recommendations of the standard, moves any comment that does not start at the beginning of a line to a line of its own before that line.
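The comment handling can be sketched as follows. We assume here that a comment trailing a directive is moved to a separate line placed directly before that directive. The function name is hypothetical and this is not Moderatobot's actual code.

```python
def hoist_inline_comments(text: str) -> str:
    """Move comments that trail a directive onto their own preceding line."""
    out = []
    for line in text.splitlines():  # accepts CR, CR/LF and LF line endings
        body, sep, comment = line.partition("#")
        if sep and body.strip():
            out.append("#" + comment)   # the comment goes first
            out.append(body.rstrip())   # then the directive without it
        else:
            out.append(line)            # blank, comment-only or plain record
    return "\n".join(out)
```

With this sketch, "Disallow: /tmp # temp files" becomes a "# temp files" line followed by "Disallow: /tmp".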
All directives must be in the format:
A name is case-insensitive. A value may be omitted; it is case-insensitive, or case-sensitive if it represents a URL.
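A minimal sketch of reading one directive, assuming the common "name: value" form with an optional trailing comment. The function name is our own. The name is lowercased because it is case-insensitive, while the value is left untouched because it may be a URL.

```python
def parse_record(line: str):
    """Return (name, value) for a directive line, or None for other lines."""
    line = line.split("#", 1)[0].strip()  # drop any trailing comment
    if ":" not in line:
        return None                       # blank, comment-only or malformed
    name, value = line.split(":", 1)      # split on the first colon only
    return name.strip().lower(), value.strip()
```

Splitting on the first colon only keeps URL values such as http://example.com/sitemap.xml intact.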
Standard and non-standard directives
The standard directives are User-agent and Disallow. All other directives are non-standard.
The robots which use the extended standard can process the directives in their own way.
If robots encounter an unknown directive, they should skip the whole record.
All the robots.txt directives are divided into group and non-group. Group directives should be combined in groups. Non-group directives can be anywhere in the file, however, it is recommended to place them outside the groups.
A group opens with one or more User-agent directives. Groups must be separated from each other by one or more blank lines. Blank lines inside a group are prohibited. When Moderatobot encounters a blank line, it closes the current group if one is open.
If there is no blank line before a record with the User-agent directive, except in the very first group in the file, Moderatobot inserts a blank line before the record and starts a new group.
Each group holds the directives for the robot(s) listed in its leading User-agent directive(s). The file must contain only one group with a particular robot's name token in its User-agent directive. If it contains more than one group per robot, the robot must take into account only the first group.
Directives such as User-agent, Disallow, Allow and Crawl-delay must be used only within groups.
If no group is open and Moderatobot encounters any group directive other than User-agent, it considers the group incorrect and ignores all subsequent group directives until it encounters a non-group directive or a blank line.
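The grouping rules above can be sketched as a simple line-by-line pass. This is our own reading of the rules, not Moderatobot's code: the function name and the dictionary layout are assumptions, and a User-agent record is assumed to start a new group only when it follows other group directives (or when no group is open).

```python
GROUP_DIRECTIVES = {"user-agent", "disallow", "allow", "crawl-delay"}


def split_groups(lines):
    """Return a list of groups: {"agents": [...], "rules": [(name, value), ...]}."""
    groups = []
    current = None  # the open group, or None when no group is open
    for raw in lines:
        stripped = raw.strip()
        if stripped.startswith("#"):
            continue  # comment-only records are not blank lines
        if not stripped:
            current = None  # a blank line closes the open group
            continue
        name, _, value = stripped.split("#", 1)[0].partition(":")
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            # A User-agent after other group directives starts a new group.
            if current is None or current["rules"]:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
        elif name in GROUP_DIRECTIVES and current is not None:
            current["rules"].append((name, value))
        # Group directives outside a group and non-group directives are skipped here.
    return groups
```

A Disallow record that appears before any User-agent record is dropped, which mirrors the "incorrect group" case described above.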
User-agent: name token of the robot.
The value of the User-agent directive is case-insensitive and contains the robot's name token, one name token per directive.
A record ‘User-agent: *’ represents the group for all robots which are not mentioned in any User-agent directive.
Some search engines have more than one robot. For example:
Googlebot – the main robot;
Googlebot-News, Googlebot-Image, etc. – the secondary robots.
According to the Google Robots.txt Specifications, any Google robot must find the group with the most specific user-agent. So the Googlebot-News robot will search for the User-agent directive with the “Googlebot-News” name token. If that group is not found, it will search for the “Googlebot” name token. And finally, if that is still not found, it will search for “*”.
Some robots use a Google-like strategy for finding groups. But note that there is another strategy: the robot searches for the first User-agent whose value contains the name token of the robot as a substring. In this case, the search result for the main robot will depend on the order of the groups.
If you expect certain behavior from a robot when it scans the website, it is better to create a group with the exact name of the robot, to be sure that the robot finds the correct group and applies the necessary directives. Also order the groups in accordance with the specification of the robot.
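The two search strategies can be compared with a small sketch. Both functions are hypothetical simplifications: the "most specific" strategy is modelled as the longest User-agent value that is a prefix of the robot's name, which is only an approximation of what real robots do.

```python
def pick_most_specific(robot, tokens):
    """Google-like strategy: the longest matching token wins, then '*'."""
    robot = robot.lower()
    matches = [t for t in tokens if t != "*" and robot.startswith(t.lower())]
    if matches:
        return max(matches, key=len)
    return "*" if "*" in tokens else None


def pick_first_substring(robot, tokens):
    """Order-dependent strategy: first value containing the robot's token."""
    robot = robot.lower()
    for t in tokens:
        if t != "*" and robot in t.lower():
            return t
    return "*" if "*" in tokens else None
```

With the groups ordered as "*", "Googlebot-News", "Googlebot", the first function picks "Googlebot" for the main robot, while the second picks "Googlebot-News" because that value comes first and contains "Googlebot" as a substring. This is why an exact-name group and a deliberate group order are the safer choice.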
Moderatobot moves the group with ‘User-agent: *’ to the end of the file to let robots find their most specific name tokens first.
The use of the * symbol in combination with other symbols in the value of the User-agent directive is not described in the standard and can lead to ambiguous interpretation of this value by robots.
According to the standard, when robots encounter unknown directives they should skip them. But we do not recommend inserting directives that are non-standard or unique to particular robots into the common group ‘User-agent: *’. It is safer to place them in the groups for those particular robots.
If Moderatobot finds Host and Clean-param directives, which are unique to the Yandex bot, it moves them into the Yandex group. If a group with “User-agent: Yandex” is not found, Moderatobot creates this group using the “User-agent: *” group as a pattern and moves these directives into it.
According to the standard, each group ought to have at least one Disallow directive. If Disallow directives are not found, Moderatobot inserts a record 'Disallow: ' into the group. That means all pages of the site are available for robots of the group.
The Allow directive is not a standard one, though it is recognized by most major robots.
Some robots ignore Allow directives without values, but some may consider them equal to “Disallow: /”, which prohibits access to the whole site.
The Allow directive value “/” is pointless, as all pages may be crawled by default.
The value of the Disallow and Allow directives is case-sensitive, as it represents a URL.
The “*” and “$” symbols in a URL are not standard. Some robots recognize these symbols, but others do not. So a URL with wildcards may be understood differently by different robots. We have no information on whether robots understand any other regular expressions in the URL. So it is safer not to use them.
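For robots that do recognize the wildcards, matching is commonly described as: “*” stands for any sequence of characters and a trailing “$” anchors the end of the URL. The sketch below implements that common reading with Python's re module. It is an assumption about such robots, not a description of any particular one, and the function names are our own.

```python
import re


def rule_to_regex(pattern: str):
    """Compile a Disallow/Allow value with '*' and trailing '$' to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if path matches pattern from its beginning (case-sensitive)."""
    return rule_to_regex(pattern).match(path) is not None
```

A robot that follows only the original standard would instead compare the value as a plain prefix, so "/private/*.html$" would match nothing useful for it.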
The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.
Disallow vs. Allow
When using Disallow and Allow directives with conflicting values in one group of records, you should know that this situation may be processed differently by different robots.
Allow is not a standard directive, so some robots that strictly follow the standard will ignore it and consider that the resource remains closed to crawling by the Disallow directive.
Some robots are sensitive to the order of records in a group and use only the first directive with a matching value, whether it is Disallow or Allow.
Some major robots consider that the directive with the most specific value wins and that the order of the records does not matter. If Allow and Disallow have the same value, the Allow directive wins. Note that the wildcard characters "*" and "$" can distort the order of precedence for rules.
Moderatobot also considers that the Allow directive should win, because the only reason to insert it into a group is to cancel the effect of the corresponding Disallow directive. To be compatible with most robots, Moderatobot sorts the records with Disallow and Allow directives within each group as follows: all Allow directives first and then all Disallow directives.
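The "most specific value wins" rule and the Allow-first ordering can be sketched like this. The sketch assumes plain prefix matching without wildcards and measures specificity by the length of the value. The function names and the (directive, value) tuple layout are our own.

```python
def is_allowed(path, rules):
    """rules: list of (directive, value), directive is 'allow' or 'disallow'."""
    best = None  # (value length, is_allow) of the winning rule so far
    for directive, value in rules:
        if value and path.startswith(value):  # an empty value matches nothing
            candidate = (len(value), directive == "allow")
            # Longer values win; on equal length True (Allow) beats False.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


def sort_allow_first(rules):
    """Stable sort: all Allow records first, then all Disallow records."""
    return sorted(rules, key=lambda rule: rule[0] != "allow")
```

Placing the Allow records first means that a robot using the first matching record meets the Allow exception before the Disallow it is meant to cancel.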
The Crawl-delay directive is not a standard one; it can be ignored or understood quite differently by different robots. Some treat its value as an interval in seconds within which the robot may access the site once. Others treat it as the duration of the pause between two consecutive accesses to the site.
Most robots accept only positive whole numbers as values of Crawl-delay. Some limit the value to the range from 1 to 30.
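A value check along these lines might look as follows. The function name and the returned labels are hypothetical, and the 1 to 30 range is applied only as a warning because not every robot enforces it.

```python
def check_crawl_delay(value: str, low: int = 1, high: int = 30) -> str:
    """Classify a Crawl-delay value as 'ok', 'invalid' or 'out-of-range'."""
    value = value.strip()
    # Only plain ASCII digits count as a positive whole number here.
    if not (value.isascii() and value.isdigit()) or int(value) == 0:
        return "invalid"
    return "ok" if low <= int(value) <= high else "out-of-range"
```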
Such directives as Sitemap, Host and Clean-param are non-group directives. They will be processed by robots, regardless of their position in the file.
But note that the Host and Clean-param directives are used only by the Yandex robot, so it is better to put them into the Yandex-specific group. Sitemap, by contrast, is understood by most major robots, so the best place for Sitemap directive(s) is at the end of the file, after all groups.
Sitemap is not a standard directive and will be ignored by some robots.
Sitemap is a non-group directive and should not be located inside a group of records. Moderatobot moves it to the bottom of the robots.txt file when it finds it elsewhere.
It is not recommended to end the directive value with '/'. It is recommended to include the protocol (http://, etc.) and the domain name in the directive value.
The sitemap file must have the extension ".xml", or ".xml.gz" if you apply GZIP compression to the file.
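The three recommendations for the Sitemap value can be checked with a short sketch. The function name and the warning texts are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def sitemap_warnings(value: str) -> list:
    """Return a list of warnings for a Sitemap directive value."""
    warnings = []
    parts = urlsplit(value)
    if not parts.scheme or not parts.netloc:
        warnings.append("missing protocol or domain name")
    if value.endswith("/"):
        warnings.append("value ends with '/'")
    # str.endswith accepts a tuple of allowed suffixes.
    if not parts.path.lower().endswith((".xml", ".xml.gz")):
        warnings.append("extension is not .xml or .xml.gz")
    return warnings
```

An empty list means that the value follows all three recommendations.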
The Host directive is used only by the Yandex robot. It is recommended to insert it at the end of the Yandex group. Moderatobot moves the Host directive there. If the group is not found, it is created.
It is not recommended to include the protocol (http://, etc.) with the domain name in the Host directive value.
Only one Host directive is allowed. Moderatobot skips all Host directives except the first one.
The Clean-param directive is used only by the Yandex robot. It is recommended to insert it at the end of the Yandex group. Moderatobot moves the Clean-param directive there. If the group is not found, it is created.
A Clean-param rule cannot be longer than 500 characters.
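The Host and Clean-param checks can be sketched together. The function name, the record layout and the warning texts are hypothetical, and the 500-character limit is assumed to apply to the directive value.

```python
def yandex_directive_warnings(records):
    """records: list of (name, value) pairs with lowercased names."""
    warnings = []
    host_seen = False
    for name, value in records:
        if name == "host":
            if host_seen:
                warnings.append(f"extra Host ignored: {value}")
            host_seen = True
            if "://" in value:
                warnings.append(f"Host should not include a protocol: {value}")
        elif name == "clean-param" and len(value) > 500:
            # Assumption: the limit is measured on the value only.
            warnings.append("Clean-param rule longer than 500 characters")
    return warnings
```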
An Extended Standard for Robot Exclusion
As has been said before, we have no information about the use of this standard by robots. So it is safer to use its directives only in groups for particular robots that you can be sure about.
According to the document, it is recommended to insert the directive “Robot-version: 2.0” right at the beginning of the group to let robots know that they ought to interpret the commands of the group as this standard does. If Moderatobot finds it anywhere other than just after the last User-agent directive of the group, it moves the directive there. On the other hand, if Moderatobot encounters any version 2.0 specific directive and the “Robot-version: 2.0” directive is missing, Moderatobot does not add the Robot-version directive, as this may change the parsing logic.
The directive can accept the values "1.0" and "2.0".
Request-rate, Visit-time, Comment directives
These directives may be understood by some robots without version 2.0 being declared.
Please check the robot's specification to see whether the robot supports these directives before using them in the group for this robot.