
pandora - taking full site content


mrmtek
Programmer - Oct 13, 2002 - AU
A web crawler run by the national library has been set up to copy all information from web sites and then republish it as an archive, so that university staff and the general public can search and browse it. I would have thought that this was illegal, since they are replicating web sites within their own web pages. Doesn't Google etc. do this anyway, but only extracting header info, not the whole site content?


Found them in my log file and noticed the traffic bandwidth was higher than normal, then investigated further - what do you think?

Do not take life too seriously, because in the end, you won't come out alive anyway


 
Looks like they're trying to extend the normal state/national archive requirement for printed material into the digital world. They do claim that they ask permission, though.
________________________________________________________________
If you want to get the best response to a question, please check out FAQ222-2244 first
'If we're supposed to work in Hex, why have we only got A fingers?'
Essex Steam UK for steam enthusiasts
 
Can't see how they can take a copy of my work, or anyone else's for that matter, and then make it available to other organisations and universities to use - if it was just header info from the web pages, fine, but not all the pages plus images and JavaScript. It is blatant copyright infringement, or more to the point, theft...


Do not take life too seriously, because in the end, you won't come out alive anyway
 
There seem to be two things going on with Pandora.

There's a long-standing programme of identifying and archiving web sites that are of significance to Australia. Such sites are contacted, permission is sought, and the site is added to the archive. This is not untypical - I know because I received a similar request from the National Library of Scotland for one of my sites (I was flattered to receive it).

They also appear to be doing a one-off (for now) trawl of the whole .au domain. There's nothing new about this kind of thing; the people who are actually doing the work - the Internet Archive - have been doing it for years. See their site at archive.org. I don't think it's unethical, provided they acknowledge the original source of the site. If you really don't like it, you can add a robots.txt file that will stop them harvesting your site. Create a text file at /robots.txt (in the root of your site) that says:
Code:
User-agent: archive.org_bot
Disallow: /
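
Incidentally, the same mechanism keeps out any well-behaved robot, not just the archive's - the standard wildcard form is:
Code:
User-agent: *
Disallow: /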

I don't really understand your point about Google. Of course they extract the whole site content - how else would they index it? If you click on the "Cached" link at the bottom of most search results, you can see the copy they make of your site.

-- Chris Hunt
Webmaster & Tragedian
Extra Connections Ltd
 
Thanks Chris, it is just a very intrusive crawler and it consumed a hell of a lot of bandwidth.

Do not take life too seriously, because in the end, you won't come out alive anyway
 
As long as it's confining itself to the public portion of your site, why worry? It's only grabbing the information that any member of the public might have seen. Yes, it consumes a lot of bandwidth in doing so, but if that's a problem, consider putting a robots.txt file on your site to keep it out of the "expensive" parts.
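
For example, something along these lines - the directory names here are only placeholders for whatever the "expensive" parts of your site actually are:
Code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /downloads/
Disallow: /images/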

Chip H.


____________________________________________________________________
If you want to get the best response to a question, please read FAQ222-2244 first
 
Yes, true - it's just that I have scripts to ban opposition companies from seeing my applications, and since these guys are taking my pages I lose control over who sees what; that's why I'm upset with the process. So I have done the following:
changed robots.txt to disallow them, and banned the robot's IP address range just to be safe.
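
For anyone wanting to do the same, the IP ban is the sort of thing below - this assumes an Apache server with .htaccess enabled, and the address range shown is just a placeholder, not the crawler's real one:
Code:
# .htaccess - let everyone in except the crawler's address range
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24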

paranoid, yes, but it pays well...

Do not take life too seriously, because in the end, you won't come out alive anyway
 
On another point, I was never contacted - they just went in and took my pages - so I was a bit miffed.

Do not take life too seriously, because in the end, you won't come out alive anyway
 
Technically, isn't this the exact same thing as the Wayback Machine?

And mrmtek, the easiest thing to do is password protect your applications. Many sites have protected areas that the roving robots cannot access at all.
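
If you're on Apache, the usual way is HTTP basic authentication via .htaccess - a rough sketch (the file path and realm name are just placeholders):
Code:
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Members only"
AuthUserFile /path/to/.htpasswd
Require valid-user
The .htpasswd file itself is created with the htpasswd utility that ships with Apache.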
 
lol - yes, always done, but I have to restrict access to people who may be able to read and understand it - it's not only the application that has to be protected but also the marketing page used to describe the product; can't let that be read by certain people. So it is not as easy as it sounds - username/password security and vetting of access are done on top of the rest.

Do not take life too seriously, because in the end, you won't come out alive anyway
 