Add Hidden OCR Text from Memo File to HTML Report/Page

Status
Not open for further replies.

stanlyn

Programmer
Sep 3, 2003
945
US
Hi,

I need to create SEO friendly web pages that can be indexed by the bots. I'm creating these pages from a report process where each record is a single page within VFP9. I need to get 2 more issues resolved for this to work.

I need to add contents that came from OCR to the page as hidden text and I need to add a clickable link to the outputted page/report.

Any ideas on how to do this, or should I be doing this a different way?

Thanks,
Stanley Barnett



 
and no one will pay for all those manhours, all because the search bots can't keep a secret...

Why should the search bots keep a secret (your secret)? Google's job is to help its visitors find the web pages they want to see. It is not in their interest to present pages in their search results that users can't access.

The trouble is that you want it both ways. You want to exploit this powerful tool known as Google to promote your content - a tool that didn't cost you an ounce of effort to create and doesn't cost you a penny to use. But you also want to take all of the profit from the fees that your users pay to access the content. Maybe you need to review your business model.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
> I'm creating public set of pages with summary info on them that leads to the full secured page via a login, if they are not already logged in.
Fine, then that should be sufficient, shouldn't it?

>Looks like I'll need to manually visit each of the 3+ million pages and growing, then create an "ok to show" keywords list to be shown on the public summary page all for google...
Well, looks like you need to write code extracting key words from text. Why would anyone do that manually?

Some automatic ways of creating a "summary" are: only show the first 1000 chars, or the first 100 words...

Some mechanisms for extracting keywords: Remove the very general words, take the most frequent words from the rest.

If all your documents are about some common topic, you have a few key words to start with, don't you?
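A minimal sketch of the two ideas above (truncated summary, frequency-based keywords), written in Python rather than VFP purely for illustration. The function names, the tiny stop-word list, and the default limits are assumptions made for this example, not anything from the thread; a real stop-word list would be much longer and tuned to the documents.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; extend it for real data).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
              "is", "are", "was", "were", "be", "by", "with", "at", "as",
              "it", "this", "that", "from"}


def make_summary(text, max_words=100):
    """Return the first max_words words of text as a public teaser."""
    words = text.split()
    summary = " ".join(words[:max_words])
    # Mark the summary as truncated only when something was cut off.
    return summary + " ..." if len(words) > max_words else summary


def extract_keywords(text, top_n=10, min_len=3):
    """Return the top_n most frequent words after dropping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens
                     if len(t) >= min_len and t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

Run over each record's OCR memo, something like this would give a public summary page its visible text and a short keyword list without anyone visiting the pages by hand. How well plain word frequency works on ad-hoc OCR text is something you would have to check on your own data.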

And last but not least, you can build your own index and provide a search function that lists documents containing such info without displaying them fully. Your own google is not so hard to do, see thread184-1553613
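To make the "own index" idea concrete, here is a minimal sketch of an inverted index with an AND search that returns only document ids, never the text itself. It is written in Python for illustration and is not the approach from the linked thread; the function names and the dict-of-texts input are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each word to the set of ids of the documents containing it.

    documents: dict of doc_id -> full text (for example the OCR memo).
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    result = None
    for word in words:
        ids = index.get(word, set())
        result = ids if result is None else result & ids
    return sorted(result)
```

The search page would then list titles or summaries for the returned ids and keep the full documents behind the login.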

Bye, Olaf.
 
>if you are a expert
You don't have to be an expert or automate a browser; you could simply look into your browser's cached files.

If you only put part of the OCRed texts in the summaries/page description, you don't give everything away for free anyway. So why want google to only secretly index that? You don't have a problem if you are at that level already. Anyway, you talk about OCR, so you scanned in something. What about the authors of that? Is that all your company's written-down knowledge?

Bye, Olaf.
 
Mike said:
Why should the search bots keep a secret (your secret)?

Yes, because the site is mine and google is indexing it. At this point google and I are providing a service to the visitors... What we share is our business. In a way google and I are in business together, where google gets to use our content to play ads to the visitor while we may get a subscription.


Mike said:
It is not in their interest to present pages in their search results that users can't access.

If we have absolute content (hidden or not) that is relevant to the search terms, IT IS in google's best interest to let their visitors know about it. In fact, they would be doing their visitors a disservice if they did not let them know about the relevant content.


Olaf said:
So why want google to only secretly index that?

So google can inform their user that a positive match containing their search terms was found. They could even disclose the fact the indexed data came from a secured page if they choose.


Mike said:
The trouble is that you want it both ways. You want to exploit this powerful tool known as Google to promote your content - a tool that didn't cost you an ounce of effort to create and doesn't cost you a penny to use. But you also want to take all of the profit from the fees that your users pay to access the content. Maybe you need to review your business model.

Yes, the same as everyone else, plus we are helping google by giving them a platform to play their ads with our content. And our business model is very similar to most subscription sites including the New York Times...


Olaf said:
thread184-1553613: VFP like GOOGLE…

I solved that one 18 years ago with PHDBase coupled with another unique word (i.e. category checkbox) and results were instant, even on the 650 MB PHDBase index file. When PHDBase searches, it first searches for the non-wildcard matches, then performs the additional wildcard search on the record set retrieved by the first search. Lightning fast, and we still use it today on Win10 systems. I have several copies left in inventory...


Olaf said:
Why would anyone do that manually?

It has to be done manually because of its diversity. About 25 different unrelated categories and no uniform style of form... all ad-hoc.


Olaf said:
Only show the first 1000 chars, first 100 words...

This is what I'll probably end up doing, but maybe I could show the text in a light gray color on a white background. Not invisible, but close. I'm reading that one could get banned by the spiders for putting white text on a white background. But then why not create the summary page with the normal summary info in black print on a white background, then add the OCR text as white print on the white background...

Thanks,
Stanley


 
You expect google to blindly believe you have the content you advertise via description/keywords. Short side note: If you think google will index the full OCR text without the visible page showing anything from it, then aside from the technological problem of not being able to really hide it from anyone else, google will not index that. Keywords and descriptions are meant to be kept short. Google results will only show a short description. Google will not show a result for a certain keyword just because you give that keyword in your meta keywords or description tag. It will cross-check whether that content is there. Google is no indexing service for your product catalog. You think along the wrong lines if you think this is a win-win situation.

To let users know what info you have, you have to cooperate with google as it is and provide attractive content already in the public non-member area. Only that will go into the google index.

Then you have to provide your own search engine. In that you can have the full-text index with all words from your texts and still list results only as far as you want, knowing the content inside the documents stays reserved for those who pay or subscribe.

Bye, Olaf.
 