
pandora - taking full site content


mrmtek
Programmer - Oct 13, 2002 - AU
A web crawler run by the national library has been set up to copy all information from web sites and then republish it as an archive, so that university staff and the general public can search and browse it. I would have thought that this was illegal, since they are replicating web sites within their own web pages. Doesn't Google etc. do this anyway, but only extracting header info, not the whole site content?


Found them in my log file and noticed the traffic bandwidth was higher than normal, then investigated further - what do you think?

Do not take life too seriously, because in the end, you won't come out alive anyway


 
Looks like they're trying to extend the normal state/national archive requirement for printed material into the digital world. They do claim that they ask permission, though.
________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first
'If we're supposed to work in Hex, why have we only got A fingers?'
Essex Steam UK for steam enthusiasts
 
Can't see how they can take a copy of my work, or anyone else's for that matter, and then make it available to other organisations and universities to use - if it was just header info from the web pages, fine, but not all the pages plus images and JavaScript. It is blatant copyright infringement, or more to the point, theft...


Do not take life too seriously, because in the end, you won't come out alive anyway
 
There seem to be two things going on with Pandora.

There's a long-standing programme of identifying and archiving web sites that are of significance to Australia. Such sites are contacted, permission is sought, and the site is added to the archive. This is not untypical - I know because I received a similar request from the National Library of Scotland for one of my sites (I was flattered to receive it).

They also appear to be doing a one-off (for now) trawl of the whole .au domain. There's nothing new about this kind of thing; the people who are actually doing the work - the Internet Archive - have been doing it for years. See their site at archive.org. I don't think it's unethical, provided they acknowledge the original source of the site. If you really don't like it, you can add a robots.txt file that will stop them harvesting your site. Create a text file at /robots.txt (in the root of your site) that says:
Code:
User-agent: archive.org_bot
Disallow: /
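
Incidentally, the same mechanism keeps out any well-behaved robot, not just the archive's - the standard wildcard form is:
Code:
User-agent: *
Disallow: /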

I don't really understand your point about Google. Of course they extract the whole site content - how else would they index it? If you click on the "Cached" link at the bottom of most search results, you can see the copy they make of your site.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
Thanks Chris, it is just a very intrusive crawler and it consumed a hell of a lot of bandwidth.

Do not take life too seriously, because in the end, you won't come out alive anyway
 
As long as it's confining itself to the public portion of your site, why worry? It's only grabbing the information that any member of the public might have seen. Yes, it consumes a lot of bandwidth in doing so, but if that's a problem, consider putting a robots.txt file on your site to keep it out of the "expensive" parts.
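
For example, something along these lines - the directory names here are only placeholders for whatever the "expensive" parts of your site actually are:
Code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /downloads/
Disallow: /images/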

Chip H.


____________________________________________________________________
If you want to get the best response to a question, please read FAQ222-2244 first
 
Yes, true - it's just that I have scripts to ban opposition companies from seeing my applications, and since these guys are taking my pages I lose control over who sees what; that's why I'm upset with the process. So I have done the following:
changed robots.txt to disallow them, and banned the robot's IP address range just to be safe.
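
For anyone wanting to do the same, the IP ban is the sort of thing below - this assumes an Apache server with .htaccess enabled, and the address range shown is just a placeholder, not the crawler's real one:
Code:
# .htaccess - let everyone in except the crawler's address range
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24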

paranoid, yes, but it pays well...

Do not take life too seriously, because in the end, you won't come out alive anyway
 
On another point, I was never contacted - they just went in and took my pages - so I was a bit miffed.

Do not take life too seriously, because in the end, you won't come out alive anyway
 
Technically, isn't this the exact same thing as the Wayback Machine?

And mrmtek, the easiest thing to do is password protect your applications. Many sites have protected areas that the roving robots cannot access at all.
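
If you're on Apache, the usual way is HTTP basic authentication via .htaccess - a rough sketch (the file path and realm name are just placeholders):
Code:
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Members only"
AuthUserFile /path/to/.htpasswd
Require valid-user
The .htpasswd file itself is created with the htpasswd utility that ships with Apache.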
 
lol - yes, always done, but I have to restrict access to people who may be able to read and understand it - it's not only the application that has to be protected but also the marketing page used to describe the product; can't let that be read by certain people. So it is not as easy as it sounds - username/password security and vetting of access are done on top of the rest.

Do not take life too seriously, because in the end, you won't come out alive anyway
 