
checking for relative path

Status
Not open for further replies.

pcplod

Programmer
Aug 29, 2004
4
GB
I'm spidering some websites, but my spider gets stuck in a loop because of relative paths.

Can someone help me solve this problem, please?
 
What do you mean by "because of relative paths"?

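I'd start by adding every URL you have already spidered to a cache of some sort (like an ArrayList of all visited URLs) and skipping any link that is already in it. Some pages might still be spidered more than once, because the same page can be reached through differently spelled relative paths, but you wouldn't loop indefinitely.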
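To illustrate the cache idea, here is a minimal sketch. It assumes the links of each page are already available in a map (the `links` parameter is a stand-in for real fetching and parsing), and it swaps the ArrayList for a set so that the "already visited?" check stays cheap. The class and method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VisitedCrawl {
    // Walks the link graph breadth-first and visits each URL string at most once.
    // 'links' maps a page URL to the URLs found on that page (hypothetical input).
    static List<String> crawl(String start, Map<String, List<String>> links) {
        Set<String> visited = new LinkedHashSet<>();   // remembers every URL already handled
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                              // already spidered, skip it
            }
            List<String> found = links.containsKey(url)
                    ? links.get(url)
                    : Collections.<String>emptyList();
            for (String next : found) {
                if (!visited.contains(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);               // URLs in the order they were visited
    }
}
```

Two pages that link to each other no longer cause an endless loop, because the second visit to either URL is skipped. Note that the check compares plain strings, so two different spellings of the same page still count as two URLs.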

The best option would be to make your program always aware of the page it is currently on, and then prepend that page's base URL to each relative path. That requires some parsing, though not too much (your parser has to understand the meaning of ".."), and I think it would be worth it: every page would be spidered only once.
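One way to get an absolute URL from the current page plus a relative link, without writing the parser by hand, is `java.net.URI`. This is only a sketch under the assumption that both strings are syntactically valid URIs; the `LinkResolver` class name is made up for the example.

```java
import java.net.URI;

class LinkResolver {
    // Resolves 'href' (relative or absolute) against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        URI resolved = base.resolve(href).normalize();  // collapses "." and ".." segments
        return resolved.toString();
    }
}
```

For example, `LinkResolver.resolve("http://example.com/a/b/page.html", "../c/other.html")` gives `http://example.com/a/c/other.html`. Storing the resolved form in the visited cache means that different relative spellings of the same page map to one entry.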

BTW, be sure to read robots.txt with your spider and follow the instructions in there.

haslo@haslo.ch - www.haslo.ch​
 
Then that's what I meant when I said your parser has to understand "..": whenever you encounter a "..", take the substring of the current path from the start to the / before the last one, and you'll have no more problems with these.
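The substring trick could look roughly like this. It is a sketch that assumes you have already split off the host part and are working on the directory path only (starting and ending with "/"); the class and method names are invented for the example, and it only handles ".." at the start of the link.

```java
class DotDotCollapser {
    // 'dir' is the directory of the current page, e.g. "/a/b/" (must end with "/").
    // 'relative' is the link as written in the page, e.g. "../c/other.html".
    static String apply(String dir, String relative) {
        String rest = relative;
        while (rest.startsWith("../")) {
            // Find the "/" before the trailing one and cut the path there.
            int cut = dir.lastIndexOf('/', dir.length() - 2);
            if (cut >= 0) {
                dir = dir.substring(0, cut + 1);
            }
            // At the root ("/") there is nothing left to cut, so the ".." is just dropped.
            rest = rest.substring(3);
        }
        return dir + rest;
    }
}
```

With `dir` set to "/a/b/" and the link "../c/other.html", the result is "/a/c/other.html".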

haslo@haslo.ch - www.haslo.ch​
 
Please remember in the future that this forum is meant for J2EE issues, not J2SE (forum269).

Cheers

