Restrict bot access
Introduction
Search engine spiders (bots) are programs that crawl your site so that search engines can index it and point searchers to it. This is important, since it is how most researchers will find your site. However, the robots can put an excessive load on your site by crawling unnecessary pages. You can mitigate most of the problem by selectively disallowing the bots from parts of your site. The Robots Exclusion Standard, honoured by all major search engines, lets you specify the parts of your site that you want excluded from indexing. One page that can safely be excluded is the calendar page: all of the information on it is available on other pages. The same goes for most trees. To exclude pages, create a file named robots.txt. This article explains how to make that file.
File name and format
Search engines look in the root of your domain for a special file named "robots.txt" (for example, http://www.example.com/robots.txt). The file tells the robot (spider) which files it may download. This system is called the Robots Exclusion Standard.
The robots.txt file has a simple format. It consists of records, and each record consists of two kinds of fields: a User-agent line and one or more Disallow: lines. Each line has the form:
<Field> ":" <value>
The robots.txt file should be created with Unix line endings. Most good text editors have a Unix mode, or your FTP client should do the conversion for you. Do not use an HTML editor that lacks a plain-text mode to create a robots.txt file.
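Putting the two fields together, a complete record looks like the sketch below (the directory name is only a placeholder; substitute the parts of your own site you want excluded):
# Example record: keep all robots out of a /private/ directory
User-agent: *
Disallow: /private/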
User-agent
The User-agent line specifies the robot. For example:
User-agent: googlebot
You may also use the wildcard character "*" to specify all robots:
User-agent: *
You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have short names for their spiders.
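A single robots.txt file can hold several records, separated by blank lines, so you can give one spider its own rules and cover everyone else with the wildcard. The sketch below assumes placeholder paths:
# Rules for googlebot only
User-agent: googlebot
Disallow: /cgi-bin/

# Rules for all other robots
User-agent: *
Disallow: /images/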
Disallow
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders not to download email.htm:
Disallow: email.htm
You may also specify directories:
Disallow: /cgi-bin/
which blocks spiders from your cgi-bin directory.
The Disallow directive works as a prefix match. The standard dictates that Disallow: /bob would disallow both /bob.html and /bob/index.html (the file bob and anything in the bob directory will not be indexed).
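As a sketch of that prefix matching (the names are illustrative only):
# Blocks /bob, /bob.html and everything under /bob/
Disallow: /bob
# Blocks only what lies inside the /bob/ directory; /bob.html remains allowed
Disallow: /bob/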
If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent record to be correct. A completely empty robots.txt file is treated the same as if it were not present.
White Space & Comments
Any line in robots.txt that begins with # is treated as a comment. The standard also allows comments at the end of directive lines, but this is bad style:
Disallow: bob #comment
Some spiders will not interpret the above line correctly and will instead attempt to disallow "bob#comment". The moral is to place comments on lines of their own.
White space at the beginning of a line is allowed, but not recommended.
  Disallow: bob
Examples
The following allows all robots to visit all files, because the wildcard "*" matches all robots and the blank Disallow line excludes nothing.
User-agent: *
Disallow:
This one keeps all robots out.
User-agent: *
Disallow: /
The next one bars all robots from the cgi-bin and images directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
This one bans Roverdog from all files on the server:
User-agent: Roverdog
Disallow: /
This one keeps googlebot from getting at the cheese.htm file:
User-agent: googlebot
Disallow: cheese.htm
For more complex examples, try retrieving the robots.txt files of big sites such as CNN or LookSmart.
Examples submitted by Users
Example 1
User-agent: *
Disallow: /phpgedview/reportengine.php
Disallow: /phpgedview/fanchart.php
Disallow: /phpgedview/search.php
Disallow: /phpgedview/login.php
Disallow: /phpgedview/clippings.php
Disallow: /phpgedview/sosabook.php
Disallow: /phpgedview/timeline.php
Disallow: /phpgedview/calendar.php
Disallow: /phpgedview/images/
In this case PhpGedView was installed in the phpgedview directory. Place the robots.txt file in your server root.
Example 2
Submitted By: pmarfell
The following shows *part* of the content of my robots.txt. As you can see, it goes on to ban certain bots completely. Search the web for more information that might be relevant to your situation:
User-agent: *
Disallow: /bin/
Disallow: /cgi-bin/
Disallow: /dev/
Disallow: /mypostnuke/
Disallow: /phpfunc/
Disallow: /phpGedView/reportengine.php
Disallow: /phpGedView/fanchart.php
Disallow: /phpGedView/search.php
Disallow: /phpGedView/login.php
Disallow: /phpGedView/clippings.php
Disallow: /phpGedView/sosabook.php
Disallow: /phpGedView/timeline.php
Disallow: /phpGedView/calendar.php
Disallow: /phpGedView/hourglass.php
Disallow: /phpGedView/ancestry.php
Disallow: /phpGedView/descendancy.php
Disallow: /phpGedView/pedigree.php
Disallow: /phpGedView/family.php
Disallow: /phpGedView/relationship.php
Disallow: /phpGedView/famlist.php
Disallow: /phpGedView/patriarchlist.php
Disallow: /phpGedView/repolist.php
Disallow: /phpGedView/aliveinyear.php

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /
Bad Bots
What do you do if a spider does not obey the exclusions you have carefully crafted in robots.txt? The easiest way to stop the spider in its tracks is to deny it access to your site using .htaccess.
For example, the OmniExplorer spider does not appear to obey Disallow:, so deny it access by adding these lines to the .htaccess file in the phpGedView directory:
RewriteEngine On
RewriteBase /
RewriteCond %{REMOTE_ADDR} "^64\.127\.124\." [OR]
RewriteCond %{REMOTE_ADDR} "^65\.19\.150\."
RewriteRule .* - [F,L]
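If the bot crawls from changing IP addresses but sends a recognisable User-Agent string, a similar rule can match on that header instead. This is only a sketch; the string to match (here "OmniExplorer", assumed from the bot's usual User-Agent) should be checked against your own log files:
RewriteEngine On
RewriteBase /
# Deny any request whose User-Agent header contains "OmniExplorer" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} "OmniExplorer" [NC]
RewriteRule .* - [F,L]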