When a search engine spider visits your site, the first thing it will do is look for a special file named 'robots.txt'.
This is a simple text file (txt) that should be located in your root directory ...
http://www.change-domain.com/robots.txt
If you do not have this file on your website, create one now using a text editor. Save it as plain text under the exact name 'robots.txt' and upload it to the root directory of your site.
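Because the file must sit in the root directory, its address can be worked out from any page on the same site. The following is a minimal sketch in Python, assuming the standard library's urljoin; the helper name robots_url and the page path '/products/list.html' are made up for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site serving page_url."""
    # A leading slash makes urljoin discard the page's own path and
    # resolve against the root of the site.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page on the example domain used above.
print(robots_url("http://www.change-domain.com/products/list.html"))
# -> http://www.change-domain.com/robots.txt
```

Whichever page you start from, the result is the same root-level address, which is where spiders look for the file.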
What does it do?
The 'robots.txt' file tells search engine spiders which of your website pages NOT to crawl.
By default, a spider will crawl all of your web pages unless you tell it not to. As such, you don't need to specify WHAT to spider, but rather WHAT NOT to spider.
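If you have sensitive files or folders (perhaps with client details or personal information) it is recommended that you add a command in the robots.txt file to tell all search engines NOT to spider these sensitive areas. Bear in mind that the robots.txt file is publicly readable and only well-behaved spiders obey it, so such areas should also be protected with a password.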
Some examples
The following example will allow the Google spider (googlebot) to crawl all your site pages except those contained in the 'cgi-bin' directory ...
User-agent: googlebot
Disallow: /cgi-bin/
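One way to check what these two lines permit is to feed them to a robots.txt parser before uploading the file. The sketch below uses Python's standard urllib.robotparser module; the helper is_allowed, the spider name 'otherbot' and the page paths are assumptions made for illustration, and a real search engine's own parser may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# The two-line example from above.
RULES = """\
User-agent: googlebot
Disallow: /cgi-bin/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


SITE = "http://www.change-domain.com"  # example domain from this article

# googlebot is kept out of cgi-bin but may crawl everything else.
print(is_allowed(RULES, "googlebot", SITE + "/cgi-bin/form.pl"))  # False
print(is_allowed(RULES, "googlebot", SITE + "/index.html"))       # True

# A spider that is not named in the file is not restricted at all.
print(is_allowed(RULES, "otherbot", SITE + "/cgi-bin/form.pl"))   # True
```

The last check shows that a rule written for googlebot alone leaves other spiders free to crawl the 'cgi-bin' directory.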
Similarly, the following example will allow all spiders to crawl all of your site pages except those contained in the 'cgi-bin' and 'admin' directories ...
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Be careful. If your 'robots.txt' is badly written, you may give search engines the wrong instructions and they may end up not indexing your site correctly.
- make sure the directory names are written in the correct upper or lower case
- add only one directory per line (do not add multiple directories on the same line)
- start each instruction at the left margin, with no spaces or tabs in front of it
- a disallowed directory will block all files and subfolders within that directory from being crawled
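The points about letter case and directory contents can be tried out with a short script. This is only a sketch using Python's standard urllib.robotparser module, assuming a rule set that keeps all spiders out of 'cgi-bin' and 'admin'; the spider name 'anybot' and the file paths are invented for the test, and individual search engines may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# One directory per line, each instruction starting at the left margin.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

SITE = "http://www.change-domain.com"  # example domain from this article

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str) -> bool:
    """Report whether a hypothetical spider named 'anybot' may fetch path."""
    return parser.can_fetch("anybot", SITE + path)


# Hypothetical paths chosen to exercise each point in the list.
checks = {
    path: allowed(path)
    for path in (
        "/admin/users.html",         # file directly inside a blocked directory
        "/admin/reports/2020.html",  # file in a subfolder of a blocked directory
        "/Admin/users.html",         # different letter case, so the rule does not match
        "/index.html",               # not listed, so crawlable
    )
}

for path, ok in checks.items():
    print(f"{path}: {'allowed' if ok else 'blocked'}")
```

In this parser the '/Admin/' path slips through because the rule was written as '/admin/', which is why the case of each directory name has to match the server exactly.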
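Remember, the original robots.txt standard has no 'allow' command, although major search engines such as Google and Bing now support an 'Allow' instruction.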
You only need to list the directories that you do not want to be indexed. All others will be indexed if they are linked to on your site.