robots.txt
January 17, 2009 1:00 PM

One of the reasons the web became so popular and powerful is that pages can be interlinked, so search engines can discover content by following links from one page to another.  Many websites also host web applications, and some application pages should not be indexed by search engines.  Web authors may not want every content page indexed either.

The web needed a way to tell search engine spiders not to index certain pages on a site, so the robots exclusion protocol was introduced.  It involves placing a simple text file called robots.txt at the root of your domain.  The file contains instructions telling spiders which URLs on your site to avoid.

A robots.txt file should be included in every website, even if there are no pages you want excluded.  All the major search engine spiders look for the file and expect it to exist.

The most basic robots.txt file that allows full indexing of all content of your site is:

User-agent: *
Disallow:

To mark pages that spiders should avoid, add each URL on its own line, prefixed by Disallow:.  For example, to get spiders to avoid the tmp folder on your site, use Disallow: /tmp/
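
One way to check how a rule such as Disallow: /tmp/ will be read is to run it through a parser.  The following is a minimal sketch using the robots.txt parser in Python's standard library; the host www.example.com and the helper name is_allowed are assumptions made for illustration, and real spiders may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules under test: every spider is asked to stay out of /tmp/.
RULES = """\
User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines, so no network
# request is needed for this check.
parser.parse(RULES.splitlines())


def is_allowed(url, agent="*"):
    """Return True if the rules permit `agent` to fetch `url` (hypothetical helper)."""
    return parser.can_fetch(agent, url)


print(is_allowed("http://www.example.com/tmp/cache.html"))  # False
print(is_allowed("http://www.example.com/index.html"))      # True
```

Because the file is parsed from a string, the same check can be run against a draft robots.txt before it is published.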

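The original robots exclusion standard does not define wildcards in Disallow lines.  Each value is matched as a prefix of the URL path, so you must list every URL prefix that is to be ignored.  Some major search engines accept * and $ as pattern extensions, but other spiders may not understand them.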
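
Since each excluded URL needs its own Disallow line, it can be convenient to generate the file from a list of paths.  The sketch below shows one way to do that; the function name build_robots_txt and the sample paths are hypothetical and not part of any standard tool.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build robots.txt content with one Disallow line per path (hypothetical helper)."""
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        # No wildcards are used: every excluded path prefix gets its own line.
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
    else:
        # An empty Disallow value allows full indexing of the site.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


# Sample paths chosen for illustration only.
print(build_robots_txt(["/tmp/", "/cgi-bin/"]), end="")
```

Calling build_robots_txt([]) produces the basic file that allows full indexing.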

Keep in mind that the robots.txt file is available anonymously, so do not put secrets in it.  Attackers can read the file for ideas about which URLs contain protected content worth attacking.  Finally, nothing on the web requires spiders to respect the robots.txt file.  Most spiders used for harvesting email addresses for spam, for example, will ignore it.  Treat the file as a hint for the major search engines, and as a way to keep them away from URLs in your web application that might consume a lot of bandwidth or processing resources.

A robots.txt file should also be part of a comprehensive strategy that includes rel="nofollow" on links as well as HTML robots meta tags.
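
To make the page-level signals concrete, the sketch below scans a small HTML sample for a robots meta tag and for links marked nofollow, roughly as a spider might.  The sample page, the class name RobotsHintParser, and its attribute names are assumptions for illustration; it uses only the HTML parser from Python's standard library.

```python
from html.parser import HTMLParser


class RobotsHintParser(HTMLParser):
    """Collect robots meta directives and nofollow links (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.meta_directives = set()
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            content = attr.get("content") or ""
            self.meta_directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )
        elif tag == "a" and "nofollow" in (attr.get("rel") or "").lower().split():
            self.nofollow_links.append(attr.get("href"))


# A made-up page: the meta tag covers the whole page, rel covers one link.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body>
<a href="/login" rel="nofollow">Sign in</a>
<a href="/about">About</a>
</body></html>"""

hints = RobotsHintParser()
hints.feed(SAMPLE_PAGE)
print(sorted(hints.meta_directives))  # ['nofollow', 'noindex']
print(hints.nofollow_links)           # ['/login']
```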

The easiest way to implement a robots.txt file on an ADXSTUDIO CMS website is to upload the file as a resource in the CMS and give the resource the custom URL /robots.txt.  This strategy allows the file to be content managed by authors instead of being a static file that is included in the site template.
