Robots.txt

robots.txt is a file that you can place in the root of your web server to let search engine robots know how to treat your site, for example which URLs they may visit and how often they should revisit it to update their search indexes.

Usage

  • Create a plain-text file named robots.txt
  • Make this file accessible via HTTP at the local URL "/robots.txt", i.e. place it in the document root of your web server (see the sketch below)
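
As a quick check that the file is really being served from the root, the sketch below simply fetches it over HTTP; the host name example.com is only a placeholder for your own server.

from urllib import request

# Fetch the robots.txt from the document root (example.com is a placeholder host).
url = "http://example.com/robots.txt"
with request.urlopen(url) as resp:
    print(resp.status)                                    # expect 200 once the file is in place
    print(resp.read().decode("utf-8", errors="replace"))  # the rules the robots will see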

Format

<field>:<optionalspace><value><optionalspace>
User-agent:   <name of the robot the following rules apply to; "*" matches any robot>
Disallow:     <partial URL that is not to be visited; this can be a full or partial path, and any URL whose path starts with this value will not be retrieved>
Request-rate: <maximum request rate, e.g. 1/5 for one page every 5 seconds>
Visit-time:   <time range during which the site may be visited>
Crawl-delay:  <number of seconds to wait between successive requests to the same server>
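
To see how a well-behaved crawler consumes these fields, here is a minimal sketch using Python's standard urllib.robotparser (Python 3.6 or later for crawl_delay and request_rate; Visit-time is not exposed by this parser). The host example.com and the robot name mybot are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                                    # fetch and parse the file

agent = "mybot"                              # hypothetical robot name
print(rp.can_fetch(agent, "http://example.com/tmp/index.html"))  # allowed or blocked by the Disallow rules
print(rp.crawl_delay(agent))                 # seconds from Crawl-delay, or None
print(rp.request_rate(agent))                # RequestRate(requests, seconds) from Request-rate, or None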

Examples

The examples below are taken from robotstxt.org.

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", "/tmp/", or "/foo.html":

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/            # these will soon disappear
Disallow: /foo.html
Request-rate: 1/5          # maximum rate is one page every 5 seconds
Visit-time: 1000-1200      # only visit between 10:00 and 12:00 UTC (GMT)
Crawl-delay: 10            # wait 10 seconds between successive requests to the same server
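
A crawler that honors this file would skip the disallowed paths and pause between fetches according to Crawl-delay. The rough sketch below assumes the file above is served at example.com and uses made-up page paths.

from urllib import robotparser
import time

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay("*") or 10            # fall back to 10 s if no Crawl-delay is given
for page in ("/index.html", "/about.html"):  # made-up paths for illustration
    if rp.can_fetch("*", page):
        print("fetching", page)              # a real crawler would download the page here
        time.sleep(delay)                    # honor the delay between requests
    else:
        print("skipping", page)              # e.g. anything under /tmp/ or /foo.html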

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
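
One way to convince yourself that the exception works is to feed these lines to Python's urllib.robotparser and query it for both agents; the path area1.html is a made-up example.

from urllib import robotparser

lines = """\
User-agent: *
Disallow: /cyberworld/map/

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)

print(rp.can_fetch("*", "/cyberworld/map/area1.html"))            # False: blocked for everyone else
print(rp.can_fetch("cybermapper", "/cyberworld/map/area1.html"))  # True: cybermapper is exempt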

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /

Efficiency

  • The protocol is purely advisory: compliance depends on the cooperation of the robot, and misbehaving crawlers can simply ignore the file.

External links

  • robotstxt.org – The Web Robots Pages, the source of the examples above