I'd try just adding every site you have already spidered to a cache of some sort (like an ArrayList of all already-visited URLs — a HashSet would make the lookups faster). The same page might still get spidered more than once if it shows up under different URL strings, but at least you wouldn't loop indefinitely.
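A minimal sketch of that visited-URL cache, assuming a crawl(String url) entry point of your own (the class and method names here are illustrative, not from your code):

```java
import java.util.HashSet;
import java.util.Set;

public class Spider {
    // Every URL we have already visited, so we never re-enter a loop.
    private final Set<String> visited = new HashSet<>();

    public void crawl(String url) {
        // add() returns false if the URL was already in the set.
        if (!visited.add(url)) {
            return; // already spidered, skip it
        }
        // ... fetch the page, extract its links, call crawl() on each ...
    }
}
```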
The best option would be for your program to always keep track of the site it's currently on and prepend that site string to any relative path it finds. That requires some parsing — not that heavy, really, your parser just has to understand the meaning of ".." — but I think it would be worth it, because then every site would only be spidered once.
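Here's a short sketch of that resolution using java.net.URI, whose resolve() already handles ".." and "." segments for you (the URLs are made up):

```java
import java.net.URI;

public class LinkResolver {
    public static void main(String[] args) {
        // The page the spider is currently on.
        URI base = URI.create("http://example.com/a/b/page.html");
        // A relative link found on that page; resolve() collapses the "..".
        URI absolute = base.resolve("../c/other.html");
        System.out.println(absolute); // http://example.com/a/c/other.html
    }
}
```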
BTW, be sure to read robots.txt with your spider and follow the instructions in there.
Then that's what I meant by "your parser has to understand the meaning of '..'": whenever you encounter a "..", cut the base path back to the slash before the last one (i.e. drop its last path segment), and you'll have no more problems with those links.
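A hand-rolled sketch of that ".." handling, purely as an illustration (the method name and URLs are made up; in practice java.net.URI.resolve() does this for you):

```java
public class DotDotResolver {
    // baseDir is the directory of the current page, without a trailing slash.
    static String resolveDotDot(String baseDir, String link) {
        // For each leading "../", drop the last path segment of the base.
        while (link.startsWith("../")) {
            baseDir = baseDir.substring(0, baseDir.lastIndexOf('/'));
            link = link.substring(3);
        }
        return baseDir + "/" + link;
    }

    public static void main(String[] args) {
        System.out.println(resolveDotDot("http://example.com/a/b", "../c/other.html"));
        // prints http://example.com/a/c/other.html
    }
}
```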