What exactly is robots.txt?
robots.txt is a small text file located in the root directory of a website. It is used to instruct search engine bots, and it gives webmasters a way to tell search engines which files or pages should be crawled (i.e. visited) and which should not.
How do I make a robots.txt file?
The robots.txt file can be created with a simple text editor; just make sure it is named "robots.txt". Search engines read and analyze the lines in this file. Each entry in the robots.txt has two parts:
The first part is the "User-agent" line. It names the bot the entry applies to, e.g. Googlebot.
To address every bot at once, start the entry with "User-agent: *", which tells all crawling bots to follow the lines below it.
The second part defines what may and may not be crawled, using Allow and Disallow. These lines tell a bot whether it is allowed to crawl a page or not. For example, "Disallow: /" means that the bots named in the entry may not crawl any files or pages.
What does a simple robots.txt look like?
A simple robots.txt file contains two lines and allows all bots to crawl and read all files and pages of your site.
# Full access to your site:
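A minimal sketch of such a two-line file, assuming the common convention that an empty Disallow value blocks nothing (the comment lines are explanatory only):

```
# Applies to all bots
User-agent: *
# Empty value: nothing is disallowed
Disallow:
```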
The next example shows the content of a robots.txt file that doesn't allow any page on your site to be crawled by search engines, so that your pages generally won't show up in search results:
# Website closed to search engines:
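A minimal sketch of such a file, built from the "User-agent: *" and "Disallow: /" lines described earlier:

```
# Applies to all bots
User-agent: *
# A single slash matches every path on the site
Disallow: /
```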
Access to certain files or pages can be denied by having the following lines in your robots.txt:
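As a sketch, the entry could look like the following. The directory and file names are hypothetical placeholders; replace them with paths from your own site:

```
User-agent: *
# Hypothetical directory: blocks everything below /private/
Disallow: /private/
# Hypothetical single file
Disallow: /drafts/old-page.html
```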
To deny access to your website for only certain search engine bots, address each of those bots by name in the User-agent part of an entry:
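A sketch of what this could look like. Googlebot is the bot named earlier in this article; "ExampleBot" is a made-up name used purely for illustration:

```
# Entry for Googlebot only
User-agent: Googlebot
Disallow: /

# Entry for a hypothetical bot called ExampleBot
User-agent: ExampleBot
Disallow: /
```

Bots that are not named in any entry are not restricted by this file.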
You can also use Allow to explicitly allow a crawler to read one page or file:
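A sketch that blocks a directory but explicitly allows one file inside it. Both paths are hypothetical, and while major crawlers such as Googlebot understand Allow, other bots may handle it differently:

```
User-agent: *
# Hypothetical directory that should not be crawled
Disallow: /gallery/
# Hypothetical file inside it that may still be crawled
Allow: /gallery/featured.html
```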
Which content can robots.txt deny access to?
The robots.txt file can be used to keep any page from being crawled, so that pages you don't want to be found generally stay out of search engines. For example, you might want to keep unnecessary picture galleries from showing up in Google's search results.
You can also state in the robots.txt where the sitemap.xml is located on your site. This information can help search engines find even more pages and content on your website.
The same applies to video or picture sitemaps.
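A sketch of how sitemap locations could be listed, one per line with a full URL. The domain www.example.com and the file names are placeholders, not real locations:

```
# Placeholder URLs: use your own domain and sitemap file names
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/video-sitemap.xml
```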
5 Facts you should know about the robots.txt
- The robots.txt is always at the root level of your website (the same place your index.php or index.html is), and search engine bots check it first when they come to your site.
- The majority of search engine bots will follow the instructions in the robots.txt.
- It can happen that one of your denied pages, or even your entire website, still ends up in a search engine's index. This can happen if there are external links that point to that page. Since another page linked to it, the search engine may consider your denied page important and add it to the index anyway.
- Important: Be very careful when creating or altering your robots.txt. A simple mistake can keep your entire website from being crawled, so that you no longer appear in search engines and rapidly lose any rankings you had.
- You can use Google Webmaster Tools (since renamed Google Search Console) to check that your robots.txt is set up correctly and has no errors.
16 Apr, 2015