(Introduction, for those who don't know what robots.txt files are):
Robots.txt files are used on many websites to tell search engines which files and folders they should and should not crawl (and hence index). Search engine spiders read them to work out which links they can follow and which files and folders should be included or excluded. It is worth noting that a search engine does not have to follow the instructions in robots.txt; they are only a suggestion.
The files contain the names of files and folders in the website, specified relative to the root of the site in plain text, and are thus easily readable by both man and machine.
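As a rough illustration of how a well-behaved spider might consult the file, here is a minimal Python sketch using the standard library's urllib.robotparser module. The site www.example.com, the "ExampleBot" user agent name and the /private/ entry are all made up for the example, and a real crawler would fetch the file over HTTP rather than use a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /customerinfo.xls
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/customerinfo.xls", "/private/report.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Note that the check happens entirely on the crawler's side. Nothing on the server enforces it, which is why the file is only a suggestion.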
I have been thinking about whether this listing (i.e. a partial or complete list of files and folders) is a security risk. Obviously, nobody with any sense would put confidential information on the site and link to it, but in a robots.txt file an entry reading:
Disallow: /customerinfo.xls
recommends that search engine spiders not follow any links pointing to this file, but it also tells me as a human that there is a file called customerinfo.xls in the root of the domain, which, given the extension, is probably an Excel spreadsheet. It is a small matter to construct the appropriate URL to fetch the file, and anybody with an inquisitive nature, or wishing to get details about the company, may well do that.
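To show just how small a matter it is, here is a minimal Python sketch that turns the Disallow lines of a robots.txt file (Disallow being the directive name the robots exclusion standard uses) into full URLs. The file contents and the www.example.com address are invented for illustration, and the function name disallowed_urls is my own.

```python
from urllib.parse import urljoin

# Hypothetical robots.txt contents for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /customerinfo.xls   # the entry discussed above
Disallow: /private/
"""


def disallowed_urls(robots_txt: str, base_url: str) -> list[str]:
    """Build a full URL for every non-empty Disallow path in the file."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        path = value.strip()
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if field.strip().lower() == "disallow" and path:
            urls.append(urljoin(base_url, path))
    return urls


if __name__ == "__main__":
    for url in disallowed_urls(ROBOTS_TXT, "http://www.example.com/"):
        print(url)
```

Anything the file asks spiders to stay away from ends up in that list, so the entries are only as safe as the access controls on the files themselves.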
I have looked at some high-profile sites (both commercial and public sector) and have found that many of them don't have such a file, presumably because of the possible security risk of exposing paths to some of the content. Other sites that offer free material for download after registration don't have one either, because if the exact URL were included in the file, anybody would be able to obtain the downloads without registering. These sites are effectively relying on the "security through obscurity" approach.
On the other hand, not having one could result in a URL that was accidentally included on a public site, particularly a high-profile one, being picked up by search engines and published externally.
I was wondering what other people's thoughts are on this. Is having a robots.txt file a security risk to website content, or is not having one a greater risk?
John