Disallow the Internet Archive from your website using robots.txt


The Internet Archive runs the "Wayback Machine" at www.archive.org, which archives pages from websites and keeps copies of them as they appeared at several points in time. If you do not want your website archived in this way, you can prevent it with a robots.txt file.

robots.txt entry

Add the following two lines to your robots.txt file. If you do not already have one, create a text file containing these lines and save it as robots.txt in the root directory of your website.

User-agent: ia_archiver
Disallow: /

This prevents the Internet Archive from archiving your site in the future, and also removes any existing entries from the archive.
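To check that the two lines are read the way you expect, a short Python sketch like the one below can help. It uses the standard library's urllib.robotparser module. The is_allowed helper, the example.com URL and the "SomeOtherBot" name are placeholders made up for illustration. This only shows how one parser interprets the rules. It does not confirm how the Internet Archive itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two lines from this post, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules.

    Hypothetical helper for illustration; it parses the rules from a
    string instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Because the rule names only ia_archiver, other crawlers are unaffected by it. The second check illustrates this.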

ia_archiver entry in log files

The web server log files can tell you whether the archiver has been accessing your website. The following example from an Apache combined log file shows the Internet Archive bot requesting the robots.txt file:

67.202.59.141 - - [28/Aug/2009:13:26:31 +1200] "GET /robots.txt HTTP/1.0" 200 153 "-" 
  "ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"

I've split the above log entry onto two lines for readability.
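If you would rather not scan the log by eye, a small Python sketch along these lines can pick out the archiver's requests. It assumes the log uses the Apache combined format shown above. The regular expression and the archiver_requests function are my own illustration and not part of any Apache tool, so adjust them if your log format differs.

```python
import re

# Pattern for one line in Apache combined log format (an assumption based
# on the sample entry above; custom LogFormat settings will need changes).
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The sample entry from this post, joined back onto a single line.
SAMPLE = (
    '67.202.59.141 - - [28/Aug/2009:13:26:31 +1200] '
    '"GET /robots.txt HTTP/1.0" 200 153 "-" '
    '"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"'
)


def archiver_requests(lines):
    """Yield (host, time, request) for lines whose user agent mentions ia_archiver."""
    for line in lines:
        match = COMBINED_LOG.match(line)
        if match and "ia_archiver" in match.group("agent"):
            yield match.group("host"), match.group("time"), match.group("request")


if __name__ == "__main__":
    for host, time, request in archiver_requests([SAMPLE]):
        print(host, time, request)
```

To run it against a real log, pass an open file object in place of the [SAMPLE] list. The function reads it one line at a time.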



