Tek-Tips is the largest IT community on the Internet today!

Is a ROBOTS.TXT file a security risk?

Status
Not open for further replies.

jrbarnett

Programmer
Jul 20, 2001
9,645
GB
(Introduction, for those who don't know what they are):
Robots.txt files are used on many websites to tell search engines which files and folders they may and may not index. Search engine spiders read them to work out which links they can follow and which files and folders should be included or excluded. Note that a search engine does not have to follow the instructions in robots.txt; they are only a suggestion.
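
As a rough illustration of how a well-behaved spider consults the file, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the www.example.com host, the "ExampleBot" agent name and the allowed() helper are all made up for the example; nothing here comes from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /customerinfo.xls
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access


def allowed(path, agent="ExampleBot"):
    """Ask the parsed rules whether this agent may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("/index.html"))           # not covered by any Disallow rule
print(allowed("/private/report.html"))  # falls under the /private/ prefix
print(allowed("/customerinfo.xls"))     # listed explicitly
```

The check happens entirely inside the spider's own code, which is why the file can only ever be advisory.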

The contents of the files include names of files and folders in the website, specified relative to the root of the site in plain text and are thus easily readable by both man and machine.

I have been thinking about whether this (i.e. a partial or complete list of files or folders) is a security risk. Obviously, nobody with any sense would put confidential information on the site and link to it, but in a robots.txt file an entry reading:

Disallow: /customerinfo.xls

asks search engine spiders not to follow any links that point to this file. It also tells me, as a human, that there is a file called customerinfo.xls in the root of the domain, which given the extension is probably an Excel spreadsheet. It is a small matter to construct the appropriate URL to fetch the file, and anybody with an inquisitive nature, or wishing to get details about the company, may well do that.
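
To show how little effort that takes, here is a small sketch that turns the exclusion entries of a robots.txt file into complete URLs. It assumes the standard "Disallow" field name; the disallowed_urls() helper, the sample rules and the www.example.com host are hypothetical.

```python
from urllib.parse import urljoin


def disallowed_urls(robots_txt, base_url):
    """Return a full URL for every non-empty Disallow path in robots_txt."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            path = value.strip()
            if path:  # an empty Disallow means "nothing is excluded"
                urls.append(urljoin(base_url, path))
    return urls


# Hypothetical file contents, for illustration only.
sample = """\
User-agent: *
Disallow: /customerinfo.xls
Disallow: /private/
"""

print(disallowed_urls(sample, "http://www.example.com/"))
```

Whether any of those URLs actually returns content depends on how the server is configured, not on robots.txt.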

I have looked at some high-profile sites (both commercial and public sector) and found that many of them don't have such a file, presumably because of the possible security risk of exposing paths to some of the content. Other sites that offer free material for download after registration don't have one either: if the exact URL were included in the file, anybody would be able to obtain the downloads without registering. These sites are effectively relying on "security through obscurity".

On the other hand, not having one means that a URL accidentally exposed on a public site, particularly a high-profile one, could be indexed and published elsewhere.

I was wondering what other people's thoughts were about this? Is having a robots.txt file a security risk to website content or is not having one a greater risk?

John
 
Potentially, but if it is then there's a larger problem.

Sensitive data should always be protected, even by a simple .htaccess file. That has the benefit of keeping both search engines and prying human eyes out.

If you simply *must* have sensitive data exposed which could be indexed by a search engine, then I'd at least organize the site in such a way that it's all under a directory structure with an innocuous name. Better still is to use two sites, and put the robots.txt in the root of one with "disallow /*
 
Allowing crawlers in is already a huge security risk.

I agree with lgarner. Keep sensitive data on a non-crawlable server, and access it only through a trusted application, when needed.

__________________________________________
Try forum1391 for lively discussions
 
I'm still undecided on the matter after much consideration, which is why I posted it here to find out other people's opinions.
I fully agree that sensitive data shouldn't be held on a public web server, whether protected with a robots.txt file or not.

John
 
Robots.txt files are simply suggestions to web crawlers about how they should behave. In your example, it's definitely a risk. It really depends on how it's used. I use them simply to narrow down what's put into search engines.

"whether protected with a robots.txt file or not."... No protection at all is provided by robots.txt. It doesn't even pretend to provide any security. It's just a suggestion, which well-behaved search engines respect. There is nothing to prevent a "bad" search engine from simply ignoring the file completely and indexing everything on your site.
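
A short sketch of why that is: honouring the file is one optional step in the crawler's own code. The plan_crawl() function, its respect_robots flag, the rules and the URLs below are hypothetical and only illustrate the idea.

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(urls, robots_lines, agent="ExampleBot", respect_robots=True):
    """Return the URLs a crawler would go on to request."""
    if not respect_robots:
        # A "bad" crawler simply skips the check; nothing stops it.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [u for u in urls if parser.can_fetch(agent, u)]


# Hypothetical rules and links, for illustration only.
rules = ["User-agent: *", "Disallow: /private/"]
links = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
]

print(plan_crawl(links, rules))                        # polite crawler
print(plan_crawl(links, rules, respect_robots=False))  # crawler that ignores the file
```

Only access control enforced by the server itself keeps the second kind of crawler out.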

Finally, I'd venture that the real reason high-profile sites don't have a robots.txt file is that they *want* their site completely indexed. It is a public site, and they want as much visibility as possible. They just don't put any sensitive information on the web server.
 