Robots.txt is a small but fundamental file on your website. Its purpose is to tell search engine crawlers which URLs they shouldn't crawl. From a technical SEO perspective, it's important to understand what a robots.txt file is, how it works, how to give instructions to crawlers, and how to test that those instructions are valid and effective.
This article will walk you through the ins and outs of the robots.txt file, so you can understand what it is and how to use it to support your SEO efforts and improve your visibility in search results.
What is a robots.txt file?
The robots.txt file is a simple text file that sits in the root directory of your website. It gives search engine crawlers instructions about which pages they may crawl. Valid instructions are based on the robots exclusion standard, discussed later in this article, and are given primarily through the User-Agent and Disallow directives.
Together, User-Agent and Disallow tell search engine crawlers which URLs they are prevented from crawling on your website. A robots.txt file that contains only User-Agent: * Disallow: / is perfectly valid. In this case, the instruction given to crawlers is to avoid crawling the entire site.
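Written out as it would appear in the file, that looks like this:

```
User-agent: *
Disallow: /
```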
(The above instructions would block your entire site from being crawled)
Crawlers access your site and add the URLs to the crawl queue. They do this for both newly discovered and previously known URLs. A crawler will first check the root directory of your website, looking for the robots.txt file. If it’s not there, they will crawl your entire site. However, if a robots.txt exists, they will crawl your website as per the directives you specify.
The main reason for updating and maintaining a robots.txt file for your website is so that your website does not become bogged down with excess crawler requests. Robots.txt is not a way to stop pages from getting indexed by Google.
A common myth is that the directives in your robots.txt file can be used to prevent pages from ending up in Google search results. The reality is that Google can still index your pages if there are other signals, such as links from other websites.
Misconfiguration of your robots.txt can have serious consequences for your website. Mistakenly telling crawlers not to access your pages can be costly. This problem can be further amplified for very large websites. You could inadvertently prevent crawlers from accessing large portions of essential pages.
Furthermore, it's not a given that all crawlers will obey the directives you have specified in your robots.txt file. Most legitimate crawlers will not crawl pages blocked by robots.txt, but malicious bots may simply ignore it, so do not rely on robots.txt to protect sensitive pages on your site.
How to use robots.txt
Search engine crawlers check your robots.txt file before crawling the URLs on your website. If there are particular pages or sections of your site that you don't want crawled, such as pages that add no value in search engine results, you can use robots.txt to Disallow them.
The most useful reason to include and maintain a robots.txt file is to optimise your crawl budget. Crawl budget describes how much time and resources search engine crawlers will spend on your site. The problem you are trying to address is crawlers wasting that budget on pointless or unwanted pages.
Busting the myth: blocking indexing with robots.txt
Robots.txt is not a reliable tool to prevent search engines from indexing pages. Pages can still be indexed in search results even if they are prevented from being crawled in robots.txt.
If a page is blocked from crawling in your robots.txt file, Google won't show a detailed snippet describing it in search results. Instead, it will show a message explaining that no description is available because of the robots.txt directive.
(Image credit: Barry Schwartz gives a good example of a page being blocked by robots.txt)
A page can still be indexed in a search engine if:
- The page is included in the sitemap.xml.
- There’s an internal link pointing to the page.
- There’s an external link pointing to the page.
The most reliable way of blocking pages from being indexed does not involve the robots.txt file at all. It's most effectively achieved with the noindex directive. When a search engine crawler reads the noindex directive, it will drop the page from the search results, effectively removing it from the index completely.
You can effectively block a page from being indexed in one of two ways:
- Use a meta tag.
- Use an HTTP response header.
Block indexing with a meta tag
Most search engine crawlers will respect noindex as implemented by a meta tag. Some shady crawlers and bots may still ignore this directive, so further measures may need to be taken. However, we are mainly concerned with legitimate search engine crawlers and bots, and we are confident that they will obey this directive as specified.
Placing a noindex meta tag in the head of your page will prevent those crawlers from indexing it. You can block all robots from indexing a page by placing this code into its head section:
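The tag itself is a single line:

```html
<!-- Place inside the <head> element; "robots" applies to all crawlers -->
<meta name="robots" content="noindex">
```

To target only a specific crawler, replace robots with its User-Agent name, for example googlebot.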
Block indexing with an HTTP response header
A more advanced way to prevent legitimate search engine crawlers from indexing your pages is to use an HTTP response header. On an Apache server, you can add the X-Robots-Tag header to your pages' HTTP responses by editing the .htaccess file.
X-Robots-Tag in .htaccess
You will need to edit the .htaccess file on your web server. Apache reads this file and uses it to attach the header to the HTTP response.
Depending on your web server setup, using X-Robots-Tag may look something like the following example:
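This sketch assumes the Apache mod_headers module is enabled; matching .pdf files is only an illustration, as any file pattern could be used:

```apache
# Requires mod_headers; adds the noindex header to all PDF responses
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```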
Disclaimer: The above-mentioned methods for noindexing are advanced and can cause serious impacts to search results if misconfigured. It is recommended to seek the advice of an experienced Technical SEO before implementation.
What does a robots.txt file look like?
Not every website automatically includes a robots.txt file, so you may need to create it if you don’t already have one. You can check to see if there is one already by using the browser navigation bar and entering the URL for the robots.txt. For example, here is the robots.txt file for the StudioHawk website: https://studiohawk.com.au/robots.txt
(StudioHawk’s robots.txt file)
As you can see from the StudioHawk robots.txt file, you actually may not need a lengthy and complicated file. The main aim here is to Disallow the /wp-admin/ section of the website from being crawled. This saves crawl budget.
However, we do want to Allow the /wp-admin/admin-ajax.php file to be crawled, because it's a resource search engines may need in order to render and understand the website properly.
There is also a link to the website's sitemap, which helps search engines discover it, and it signals ownership and trust, since only the website owner has the authority to edit the robots.txt file.
Robots exclusion standard
The robots exclusion standard defines how you give instructions to search engine crawlers. It provides ways to direct crawlers to the areas of your site they should or should not crawl, so your robots.txt should contain a list of your most important directives.
Not all crawlers or robots obey the instructions in your robots.txt file. These crawlers are often called BadBots. They include robots looking for security vulnerabilities, which may deliberately crawl or scan the very sections you have told robots to stay out of. Common BadBots also include spambots, malware and email harvesters.
We will focus here on the kinds of instructions we can give to legitimate search engine crawlers, and how we can guide them to the areas of our site we want them to crawl. The basics are handled with a combination of User-Agent and Disallow.
User-Agent specifies a search engine crawler. Depending on the search engine, you can target the appropriate User-Agent you want to Disallow.
Below is a table of some common User-Agent strings to use in your robots.txt:
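The exact strings are documented by each search engine; some of the most common ones are:

| Search engine | User-Agent |
| --- | --- |
| Google | Googlebot |
| Google Images | Googlebot-Image |
| Bing | Bingbot |
| Yahoo | Slurp |
| DuckDuckGo | DuckDuckBot |
| Baidu | Baiduspider |
| Yandex | YandexBot |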
Disallow tells a User-Agent not to crawl certain sections of your site. You need to place a path after the word Disallow for the rule to be valid.
An example used to Block all crawlers:
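```
User-agent: *
Disallow: /
```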
Here, we are using the asterisk to specify all User-Agents, and we use a forward slash to indicate the start of all URLs.
An example used to Disallow Googlebot from crawling the /photos directory:
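```
User-agent: Googlebot
Disallow: /photos
```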
Example used to more specifically Disallow Googlebot and then all other crawlers:
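A sketch with placeholder paths: crawlers follow the most specific User-Agent group that matches them, so Googlebot obeys only its own group here, while every other crawler falls through to the wildcard group.

```
# Specific rule for Googlebot only
User-agent: Googlebot
Disallow: /photos

# Rule for all other crawlers (/private is a placeholder path)
User-agent: *
Disallow: /private
```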
Non-standard robots exclusion directives
In addition to the standard directives User-Agent and Disallow, you can also make use of the non-standard directives. Please keep in mind that there’s no guarantee all search engine crawlers will follow all these non-standard directives. However, for the main search engines, these are fairly consistent.
Allow is supported by the major search engine crawlers, and it's helpful when used in conjunction with a Disallow directive. Use it to let crawlers access a specific file within a directory that is otherwise blocked by a Disallow.
We need a little specific syntax to ensure that the major search engine crawlers will respect the Allow directive. Make sure you place your Allow directives in the line above the Disallow.
An example used to Allow a file inside a directory with a Disallow:
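Using the WordPress paths mentioned earlier as the example:

```
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
```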
Crawl-delay is not supported by all the major search engine crawlers. It's used to limit the speed of a crawler, commonly when your website is experiencing performance issues due to heavy crawler activity. However, poor site performance is often a sign of inadequate web hosting and is better addressed by improving your hosting.
However, Google does not respect Crawl-delay, and will simply ignore this directive if it appears in your robots.txt file. If you want to rate-limit the Googlebot crawler, you need to adjust the crawl rate setting in the old version of Google Search Console.
An example used to set Bingbot Crawl-delay:
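The delay value is commonly interpreted as a wait of that many seconds between requests, though each engine documents its own interpretation:

```
User-agent: Bingbot
Crawl-delay: 10
```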
It’s considered best practice to include your XML sitemap in your robots.txt. It’s thought that this can aid in the discovery of your sitemap, and therefore help files be discovered and crawled faster.
Example of your XML Sitemap in your robots.txt file:
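The directive takes the full URL of the sitemap (example.com is a placeholder):

```
Sitemap: https://www.example.com/sitemap.xml
```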
Wildcards are supported by all major search engine crawlers when implemented correctly. You can use these wildcards to group files together by file type. The wildcard replaces the name of the file and will match any filename where the wildcard is used.
Example to prevent crawling of all .png images and .jpg images in a specific directory:
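A sketch using a hypothetical /images/ directory; the asterisk matches any filename, and the dollar sign anchors the match to the end of the URL:

```
User-agent: *
Disallow: /images/*.png$
Disallow: /images/*.jpg$
```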
Adding and editing robots.txt on your server
You will need server access to add a robots.txt file, although some CMS platforms, such as WordPress, let you add and edit robots.txt from within the CMS itself.
Add a robots.txt file
If you have access to your web hosting files via a cPanel or some other similar hosting management console, you can create a robots.txt if you don’t already have one.
You may need to add a robots.txt by uploading it via FTP. This is a commonly performed task, and your web hosting provider can point you in the right direction for this.
Edit a robots.txt file
If you have the Yoast plugin installed for WordPress, you can edit your robots.txt file directly from within the dashboard.
By default, the robots.txt file for WordPress will look like the following.
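For recent WordPress versions, the automatically generated (virtual) file contains the following; depending on your version and plugins, a Sitemap line may also appear:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```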
To begin editing your robots.txt with Yoast, follow these steps:
- Click on ‘SEO’ in the left-hand side Yoast menu in the WordPress dashboard.
- Click on ‘Tools’ from the expanded settings options.
- Click on ‘File Editor’, and you can now edit your robots.txt.
Note: Please make sure file editing is enabled, otherwise this option will not be available.
- Make the desired changes to your robots.txt.
- Save the file to make sure the changes take effect.
If you want more detailed instructions, they are available on the Yoast website.
If you cannot access the robots.txt via your CMS, you will need to access your web hosting files directly via cPanel or your web hosting console.
Testing your robots.txt
Testing any changes before setting them live is a critical discipline for SEOs. A mistake in your robots.txt can stop crawlers from accessing your entire website, leading to severe drops in rankings and heavy traffic losses.
We recommend that you test and validate your robots.txt before making it live. Thankfully, Google provides a handy robots.txt testing tool, available from the old version menu in Google Search Console.
Robots.txt is a vital component of your SEO efforts. If it is configured poorly or misconfigured, your SEO efforts could experience real impacts. You will want to make sure you have not mistakenly blocked crawlers from the most important parts of your website.
You will also want to block crawlers from excessively crawling unimportant URLs. This saves crawl budget, so it can be spent on the pages that are important to search engines.
Using robots.txt to noindex pages is a mistake. Keeping pages out of the search results is best achieved with meta robots noindex or the X-Robots-Tag header. Otherwise, a page may still end up indexed if it is referenced by other pages.
Mastering robots.txt is a very powerful yet fundamental skill for every SEO. Understanding how it works and when to use it will help you take more control over your SEO results.