I'd try just adding every site you have already spidered to a cache of some sort (like an ArrayList of all already-visited URLs — a HashSet would make the lookups faster). The same page might still get spidered more than once if it shows up under different URL strings, but at least you wouldn't loop indefinitely.
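A minimal sketch of that visited-URL cache, assuming a crawl(String url) entry point of your own (the class and method names here are illustrative, not from your code):

```java
import java.util.HashSet;
import java.util.Set;

public class Spider {
    // Every URL we have already visited, so we never re-enter a loop.
    private final Set<String> visited = new HashSet<>();

    public void crawl(String url) {
        // add() returns false if the URL was already in the set.
        if (!visited.add(url)) {
            return; // already spidered, skip it
        }
        // ... fetch the page, extract its links, call crawl() on each ...
    }
}
```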
The best option would be for your program to always keep track of the site it's currently on and prepend that site string to any relative path it finds. That requires some parsing — not that heavy, really, your parser just has to understand the meaning of ".." — but I think it would be worth it, because then every site would only be spidered once.
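Here's a short sketch of that resolution using java.net.URI, whose resolve() already handles ".." and "." segments for you (the URLs are made up):

```java
import java.net.URI;

public class LinkResolver {
    public static void main(String[] args) {
        // The page the spider is currently on.
        URI base = URI.create("http://example.com/a/b/page.html");
        // A relative link found on that page; resolve() collapses the "..".
        URI absolute = base.resolve("../c/other.html");
        System.out.println(absolute); // http://example.com/a/c/other.html
    }
}
```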
BTW, be sure to read robots.txt with your spider and follow the instructions in there.
Then that's what I meant by "your parser has to understand the meaning of '..'": whenever you encounter a "..", cut the base path back to the slash before the last one (i.e. drop its last path segment), and you'll have no more problems with those links.
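A hand-rolled sketch of that ".." handling, purely as an illustration (the method name and URLs are made up; in practice java.net.URI.resolve() does this for you):

```java
public class DotDotResolver {
    // baseDir is the directory of the current page, without a trailing slash.
    static String resolveDotDot(String baseDir, String link) {
        // For each leading "../", drop the last path segment of the base.
        while (link.startsWith("../")) {
            baseDir = baseDir.substring(0, baseDir.lastIndexOf('/'));
            link = link.substring(3);
        }
        return baseDir + "/" + link;
    }

    public static void main(String[] args) {
        System.out.println(resolveDotDot("http://example.com/a/b", "../c/other.html"));
        // prints http://example.com/a/c/other.html
    }
}
```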