When and why to use robots.txt files

A closer look at how to implement robots.txt files on your website.

If you’re reading this post, you’re probably either a human or a robot. If you’re a human, you freely browse the internet and go to whatever websites interest you. If you’re a robot, how do you determine where to go? Is there any guidance? Are there limits? Also, what’s a robot?

The Robot Defined

First things first: a robot is an automated program that travels across the internet to gather information. For example, Google has Googlebot, which crawls the web, does some magic to the information it finds, and then presents us with useful results when we're looking for something on the internet. Other search engines, like Bing and Yahoo, have bots of their own. And not all robots are good internet citizens: spammers run robots that scan the internet for email addresses, which can later be used in email blasts.

In order to give well-behaved robots an idea of what to crawl and what not to crawl, a protocol (or set of rules) was created for them to follow. Referred to as the Robots Exclusion Standard, it's a set of instructions that tells robots which parts of a website should and should not be crawled.

It's easiest to understand how this works by diving right in. To implement the protocol on a website, place a file called robots.txt in the root directory of your site, so it's served at /robots.txt. Once this file exists, protocol-conforming robots will find and read it. You can then add rules that look like the following:

User-agent: googlebot
Disallow: /secret

This robots.txt file instructs Google's crawling robot (and only that robot) to ignore the /secret directory on your website. As a result, the contents of that directory should not appear in Google search results.
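If you'd like to check how a conforming crawler would interpret a rule like this before deploying it, Python's standard library includes a parser for the protocol, urllib.robotparser. The sketch below feeds it the rule from above; the example.com URLs are placeholders:

```python
import urllib.robotparser

# The same rule shown above, as the lines of a robots.txt file.
rules = [
    "User-agent: googlebot",
    "Disallow: /secret",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot may not fetch anything under /secret...
print(parser.can_fetch("googlebot", "https://example.com/secret/page.html"))  # False
# ...but the rest of the site is fair game,
print(parser.can_fetch("googlebot", "https://example.com/about.html"))        # True
# ...and other bots aren't restricted at all.
print(parser.can_fetch("bingbot", "https://example.com/secret/page.html"))    # True
```

This is the same logic a well-behaved crawler applies: match the User-agent line, then compare the requested path against each Disallow prefix.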

If you wanted to prevent all web crawlers (that conform to this protocol) from crawling your entire website, you could do this:

User-agent: *
Disallow: /

It's not uncommon to use this rule when you'd like to put a website online for development purposes while keeping it hidden from the general public.
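The same urllib.robotparser check from before confirms that the wildcard rule applies to every crawler and every path (the bot names and URLs here are placeholders):

```python
import urllib.robotparser

# Disallow everything for every conforming crawler.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# The wildcard group matches any user agent, and "/" prefixes every path.
print(parser.can_fetch("googlebot", "https://example.com/"))               # False
print(parser.can_fetch("anybot", "https://example.com/blog/post.html"))    # False
```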

The protocol is also flexible: you can address multiple crawlers individually if you'd like:

User-agent: googlebot
Disallow: /

User-agent: bingbot
Disallow: /secret
Disallow: /images
Disallow: /audio

This tells Google not to crawl your site at all, while telling Bing to skip only your /secret, /images, and /audio directories.
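Checking the multi-crawler file with urllib.robotparser shows how each bot matches only its own group of rules (again, example.com is a placeholder):

```python
import urllib.robotparser

rules = [
    "User-agent: googlebot",
    "Disallow: /",
    "",
    "User-agent: bingbot",
    "Disallow: /secret",
    "Disallow: /images",
    "Disallow: /audio",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot is shut out of the entire site...
print(parser.can_fetch("googlebot", "https://example.com/blog/"))   # False
# ...while Bingbot only skips the three listed directories.
print(parser.can_fetch("bingbot", "https://example.com/secret/"))   # False
print(parser.can_fetch("bingbot", "https://example.com/blog/"))     # True
```

Note the blank line separating the two groups; each crawler obeys only the group whose User-agent line matches it.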

It's important to emphasize that this protocol merely advises robots on what to do; there is no guarantee that your suggestions will be followed. If you want to keep a webpage secret and secure, there are much better ways to do that (such as requiring authentication) than simply relying on a robots.txt file.

In summary, if you’d like to advise certain web robots to avoid some or all sections of your site, it’s a good practice to create a robots.txt file.