Robot files 3

Status
Not open for further replies.

robert89

Technical User
Nov 19, 2003
125
CA
I have a bilingual splash page. I'd like to know whether it is possible to create a robots.txt file that makes the spiders start at the index page for each of the languages. In other words, have the robots skip the splash page.

Any thoughts would be appreciated.

Thanks,
Bob
 
As mentioned in your previous thread, you don't need to make the spider skip your splash page.

Just make sure there are plain HTML links through to your content pages from the splash page.
The spider will then crawl to them and index them.

Search engines index and list pages and not sites as a whole.
If the search engine doesn't find content on your splash page then the page won't show up in search results.

Just make sure that the spider can get to your other pages.
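
To illustrate the point, here is a rough Python sketch of how a simple spider discovers pages by following plain href links outward from the splash page. The paths (`/en/index.html`, `/fr/index.html` and so on) and the in-memory `SITE` dictionary are made up for the example; a real crawler fetches pages over HTTP and does a great deal more.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical site held in memory: path -> HTML source.
SITE = {
    "/": '<a href="/en/index.html">English</a> <a href="/fr/index.html">Francais</a>',
    "/en/index.html": '<a href="/en/products.html">Products</a>',
    "/fr/index.html": '<a href="/fr/produits.html">Produits</a>',
    "/en/products.html": "<p>Blue widgets</p>",
    "/fr/produits.html": "<p>Widgets bleus</p>",
}


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start):
    """Breadth-first walk over SITE, following href links only."""
    seen = [start]
    queue = deque([start])
    while queue:
        path = queue.popleft()
        parser = HrefCollector()
        parser.feed(SITE.get(path, ""))
        for href in parser.hrefs:
            if href in SITE and href not in seen:
                seen.append(href)
                queue.append(href)
    return seen


print(crawl("/"))
```

Starting at the splash page, the walk reaches both language index pages and then the pages behind them, so nothing has to be skipped for the content to be found.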

 
foamcow,

By plain HTML links, do you mean the href in anchor tags, or does the URL need to appear in the visible content?

More specifically, will all four of the following be crawled (and, if so, what is an example of something that WOULDN'T be crawled)?

(1)
Code:
<a href="http://www.mydomain.com/page1.html">http://www.mydomain.com/page1.html</a>

(2)
Code:
<a href="http://www.mydomain.com/page1.html">Page 1</a>

(3)
Code:
<a href="#" onclick="document.location='http://www.mydomain.com/page1.html'">Page 1</a>

(4)
Code:
<a href="#" onclick="window.open('http://www.mydomain.com/page1.html')">Page 1</a>

Thanks.

--Dave
 
1 & 2 will be crawled and followed by the spiders. For a text link it is better to use a descriptive phrase or keyphrase as the anchor text, like so:
Code:
<a href="http://www.mydomain.com/page1.html">Blue Widgets</a>

3 & 4 would be crawled but not followed, because crawlers do not trigger JavaScript events or run scripts.

robots.txt is an exclusion protocol and only tells the bots not to do something.
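
As a rough illustration of that exclusion-only behaviour, here is a small Python sketch using the standard library's `urllib.robotparser`. The rules and the `/private/` path are invented for the example; the point is only that every rule says what not to fetch, and anything not excluded stays open to the bots.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: the only rule is an exclusion.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask, for any bot, whether each path may be fetched.
for path in ("/", "/en/index.html", "/private/notes.html"):
    print(path, parser.can_fetch("*", path))
```

Only the path under `/private/` comes back as disallowed. The splash page and the index page are both allowed simply because nothing excludes them.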



Chris.

 
Very informative, Chris. Thanks.

When you say "crawled but not followed," do you mean that, in my examples (3) and (4), page1.html would be recognized as a related site, but links ON page1.html would not be included (since the spider doesn't follow the link to that page)?

Thanks again.

--Dave
 
What he means is - the crawler will read the <a> tags, but it's only going to look at the href attribute. It won't execute the code in an onclick attribute, so it won't follow any links that are embedded in that attribute. So the crawler will interpret (3) and (4) as links back to the same page (with a # tacked on the end).
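
A quick way to see this is a Python sketch that reads the four example links the way a simple spider would, looking at href and nothing else. The `HrefOnly` class and `href_seen` helper are names made up for this illustration, not part of any real crawler.

```python
from html.parser import HTMLParser

# The four link styles from earlier in the thread.
LINKS = {
    1: '<a href="http://www.mydomain.com/page1.html">http://www.mydomain.com/page1.html</a>',
    2: '<a href="http://www.mydomain.com/page1.html">Page 1</a>',
    3: """<a href="#" onclick="document.location='http://www.mydomain.com/page1.html'">Page 1</a>""",
    4: """<a href="#" onclick="window.open('http://www.mydomain.com/page1.html')">Page 1</a>""",
}


class HrefOnly(HTMLParser):
    """Records only the href of each <a> tag and ignores onclick."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def href_seen(html):
    """Return the list of href values found in the given HTML."""
    parser = HrefOnly()
    parser.feed(html)
    return parser.hrefs


for number, html in LINKS.items():
    print(number, href_seen(html))
```

For (1) and (2) the href is the page1.html URL. For (3) and (4) it is just "#", so the URL inside onclick never reaches the list of links to follow.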


-- Chris Hunt
 
That makes more sense. Thanks, Chris!

Thanks both of you. *'s!

--Dave
 
Yeah.. was away yesterday so I didn't get to answer.

But by plain HTML links I meant normal HREFs without any JavaScript. You should use descriptive text for the link as Chris said.

 