The spiders that legitimate search engines use to retrieve website data follow rules defined in a file called 'robots.txt', which should be placed in the root directory of a website.
This file contains instructions about what a spider may and may not follow and index within the site's structure, and therefore which directories, pages, images, etc. can be retrieved and indexed.
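To illustrate where a crawler looks for the file, the short Python sketch below derives the robots.txt location from an arbitrary page address; the example.com URL is only a placeholder, not a site referenced by this article.

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # robots.txt always sits at the root of the host, regardless of
    # which page the crawler started from.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://example.com/beta/new-page.html"))
# prints https://example.com/robots.txt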
Example:
User-agent: *
Disallow: /cgi-bin/
Disallow: /jscript/
Disallow: /beta/
Disallow: /images/
Disallow: /bogus.htm
This robots.txt file explains to the spiders that:
- User-agent: * means all search engines are welcome to collect data from the site.
- Disallow: /cgi-bin/ (and the other Disallow lines for directories) means those directories, and all the files/pages within them, are not to be retrieved or indexed.
- Disallow: /bogus.htm means the page bogus.htm in the site root should similarly not be retrieved and included.
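A well-behaved crawler applies these rules before requesting any page. As a minimal sketch, assuming the file shown above is served at https://example.com/robots.txt (a placeholder address), Python's standard urllib.robotparser module can be used to test individual URLs:

import urllib.robotparser

# Download and parse the site's robots.txt.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# can_fetch() reports whether the given user agent may retrieve a URL
# under the parsed rules (results assume the example file above).
print(parser.can_fetch("*", "https://example.com/index.html"))   # True, not disallowed
print(parser.can_fetch("*", "https://example.com/cgi-bin/run"))  # False, /cgi-bin/ is disallowed
print(parser.can_fetch("*", "https://example.com/bogus.htm"))    # False, /bogus.htm is disallowed

The same checks could be made for any specific user agent string, since the example file applies its rules to all spiders via User-agent: *.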