Add Hidden OCR Text from Memo File to HTML Report/Page

Status
Not open for further replies.

stanlyn

Programmer
Sep 3, 2003
945
US
Hi,

I need to create SEO friendly web pages that can be indexed by the bots. I'm creating these pages from a report process where each record is a single page within VFP9. I need to get 2 more issues resolved for this to work.

I need to add contents that came from OCR to the page as hidden text and I need to add a clickable link to the outputted page/report.

Any ideas on how to do this, or should I be doing this a different way?

Thanks,
Stanley Barnett



 
"SEO friendly" is a wide field, so how should we know whether you should do "this" in a different way, when we don't even know what "this" actually means?

Very generally speaking, google "seo friendly content". There are many more topics, like SEO-friendly URLs, but you seem to be concerned with search engine indexing and listing, and with being found for many keywords.

> add contents that came from OCR to the page as hidden text
Add it outside of the html <body></body> and it is invisible to the end user. To be text that search engine bots take seriously, though, it had better be visible text. The only other place would be a meta tag about the page description. So such a tag would be the place for this text, if the OCR text describes the page:

Code:
<meta name="description" content="This is an example of a meta description. This will often show up in search results.">

Such a tag has to go into the area of the head: [tt]<head>....somewhere in here...</head>[/tt]

>I need to add a clickable link to the outputted page/report.
Well, clickable is nothing special about a link. Do you know HTML?

From what you're saying, you need the link somewhere else, on an index page or table of contents. You know where you save your html page or its content data and by which URL it should reach the end user; at least you should know, as you create and define the page. As this is about generated content, I assume you have a template HTML and the content goes into the html body, or even just into a special place in that body, for example a div tag with a certain class. A URL could then, very generally, contain an /id/ part, which via URL rewriting becomes a parameter to a script that loads the content data of that id and merges it into the HTML template in place of that div. Besides writing that script to create the HTML page, that URL is the URL of the link you want to put somewhere else.
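To make the template and link idea a bit more concrete, here is a minimal sketch. It is written in Python rather than VFP, purely as an illustration; the template text, the https://example.com/page/<id> URL pattern and all function names are assumptions for the example, not anything taken from the actual site.

```python
import html
from string import Template

# Hypothetical page template; $title and $content are the placeholders.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    '<body><div class="content">$content</div></body></html>'
)


def page_url(record_id: int, base: str = "https://example.com") -> str:
    """URL whose id part a rewrite rule would pass on to the page script."""
    return "{}/page/{}".format(base, record_id)


def link_to(record_id: int, text: str) -> str:
    """Anchor tag to place on an index page or table of contents."""
    return '<a href="{}">{}</a>'.format(page_url(record_id), html.escape(text))


def render(title: str, content: str) -> str:
    """Merge one record into the template, escaping the data first."""
    return PAGE.substitute(title=html.escape(title), content=html.escape(content))
```

With these assumptions, link_to(42, "Deed & plat") gives <a href="https://example.com/page/42">Deed &amp;amp; plat</a>, i.e. a link shown as readable text rather than as the naked URL.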

Bye, Olaf.
 
Hi Olaf,

Olaf said:
The only other place would be a meta tag about the page description. So such a tag would be the place for this text, if the OCR text describes the page:

So how is that added when generating the report, which is actually the html page? I have 3 million pages, and each page has associated ocr text that needs to be added to the page where the user cannot see it and google bots can index it. The process of printing each page as html output (1 file per page) will be used to create the pages in an automated way. If I'm reading your suggestions correctly, I believe you are talking about doing them manually or semi-manually via a template... Correct me if that is not what you mean...

As far as the hyperlinks, the actual urls and the page's filesystem name will be generated as each page is created from the report process to html.

Thanks,
Stanley
 
Stanley,

It's not clear if you want the text to be hidden from the search engines or from the pages' visitors.

If the former, the usual method is to use either robots.txt or a "robots" meta tag. It should be easy enough for you to find out how to do that, but just to get you started, you could place this line in your page header (between the <head> and </head> tags):

[tt]<meta name="robots" content="noindex,nofollow" />[/tt]

Note that this applies to the entire page. You cannot specify individual bits of text within the page in this way.
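The robots.txt route, by contrast, works per path on the site. A rough sketch of how such rules behave, using Python's standard urllib.robotparser to evaluate a hypothetical robots.txt; the /members/ and /summary/ paths and the crawlable helper are invented for the example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of /members/, allow the rest.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawlable(path: str, agent: str = "Googlebot") -> bool:
    """True if the rules above allow the given crawler to fetch the path."""
    return parser.can_fetch(agent, path)
```

Here crawlable("/members/deed-123") is False and crawlable("/summary/deed-123") is True. Keep in mind that robots.txt only asks well-behaved crawlers to stay away; it does not protect anything.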

If, however, you want to hide the text from the user, then simply format the block of text with a [tt]display:none[/tt] style. Any CSS reference will tell you how to do that.

But what's that got to do with SEO? If you are thinking of putting hidden text on a page in the hope of fooling the search engines in some way, forget it. They are all wise to those tactics, and a hundred other so-called SEO techniques.

In any case, this has got nothing to do with VFP. There is another forum here on Tek Tips for SEO.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Why does the mentioning of a template indicate (half) manual interaction to you?

Of course you have to generate a template. In your case, that is the report itself. You have to find out how to make the HTML Listener also write into the HTML header, to generate the page description or other meta tags that are not in the visible body section of the HTML files.

You could also think of postprocessing the HTML output to insert the OCR text afterwards. I don't know about the complexity of the report, but I would use TEXTMERGE() or TEXT...ENDTEXT to merge content with a single template for all pages, which has the placeholders to put in OCR Text and anything else.

The article and code you point to only ensure that each text matching the pattern of a URL is converted to a link (which in HTML is an a tag). This does not generate the link to the reports themselves. On a good site you'd have links which don't show the naked URL but link to the target URL via human-readable text, or even an image. So this won't help you, I fear. Overall I would do this differently, via Textmerge; HTML is merely text, not a traditional document like a report.
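The postprocessing route could look roughly like this. It's a sketch in Python rather than VFP, and it assumes the listener's output contains an ordinary </head> tag; add_description is a made-up name, and putting the OCR text into a description meta tag is just one possible target.

```python
import html
import re


def add_description(page_html: str, ocr_text: str) -> str:
    """Insert a description meta tag just before the closing </head>."""
    # Collapse line breaks and runs of whitespace left over from OCR.
    text = " ".join(ocr_text.split())
    # Escape &, <, > and quotes so the text cannot break the attribute.
    tag = '<meta name="description" content="{}">\n'.format(
        html.escape(text, quote=True)
    )
    # count=1 touches only the first </head>; using a function as the
    # replacement keeps backslashes in the text from being read as escapes.
    new_html, n = re.subn(
        r"(?i)</head>", lambda m: tag + m.group(0), page_html, count=1
    )
    if n == 0:
        raise ValueError("no </head> found in the generated page")
    return new_html
```

Run over each generated file after the report has been written, this leaves the visible body untouched and only extends the head.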

Bye, Olaf.

 
Hi Mike and Olaf,

Yes, the report itself is the template. Before outputting the report to html, the code will create the url for the hyperlink on the page as well as the os filename to save it to.

What I don't get is how to create the elements for the report that contain the tags and html text. Do I create report fields and assign html text values to them, and where do I tell it to render them?

Where can I get instructions and examples to wire all this up?

As far as the hidden ocr text, I need it to be hidden from the user and google needs to index it. The user searches google for keywords found in the hidden ocr text, and google displays a link to the page in its search results. When clicked, the user is taken to the page (the one I'm creating now) that contains user viewable text with a summary of the page's content and a hyperlink. Clicking that hyperlink takes them to the protected page that has full detail, including an image of the page, if the user is already logged in; otherwise they are directed to a login page. I say protected because protected pages are only viewable by users with active subscriptions. The page google lands them on (the page I'm creating now) is only a summary page and can be seen by anyone.

It would be better if we could expose the indexable stuff to google while removing it from the page's source via some sort of linked data, where the bots would get the ocr text from a hidden link. This way may be too much work, so we'd probably leave that for another day. Ideas are welcome...

Hope that explains it a bit better,
Stanley
 
Mike,

Mike said:
If, however, you want to hide the text from the user, then simply format the block of text with a display:none style. Any CSS reference will tell you how to do that.

Where does the text go on the report? What holds it? Is it a report variable or field? How is the CSS associated with the report? Is it associated at the report level or variable level?

The one article I referenced above does describe some of this, and that is why I asked if anyone had the example download for the FoxTalk March 2005 article, so I can pull it apart and see how they did it...

Thanks,
Stanley

 
Very generally speaking, the HTML listener, including the one coming with VFP itself, is nothing very special. It renders what you see in a normal report output, just as HTML, without the report user needing any HTML knowledge, just like all the other output formats of reports such as DOCX, XLSX, PDF or even BMP/PNG/TIFF. On the other hand, you don't have much control over the html used, but you can parse and read the output file. Unlike TIFF or other formats, HTML output is no riddle: it's HTML. Text. And on this end you can change it as you like. If you want to do so already during the generation of the HTML, well, then dig into the reportlistener code. I have no intention to do so. Reporting is one of the topics I don't like so much.

As I already said, to me it would make much more sense to start with HTML itself. I would only look into a reportlistener generating HTML if I already had a report outputting what I want to output. That seems to be the case here. But then I would probably just run it once, take its output HTML file, and expand on that: remove any specific data from that HTML to create an HTML text with placeholders to use with Textmerge, for example. That's just my preference, because I do have good experience with HTML and CSS itself.

Bye, Olaf.
 
Stanley said:
As far as the hidden ocr text, I need it to be hidden from the user and google needs to index it.

That is unlikely to work. Google goes out of its way only to index text that is visible to the user. If that was not so, Google would present pages in its search results that didn't match users' search term - and it is not in Google's interests to do that.

Regarding where to put [tt]display:none[/tt], it goes in the style attribute of the block of text that you want to hide. For example:

Code:
<p>This is normal text - completely visible.</p>
<p style="display:none;">But this text won't be visible.</p>

If this is all new to you, I recommend you find a tutorial on HTML and CSS before you go much further. It could open the doors to other possibilities as well.

But do keep in mind what I said above about Google not indexing hidden text.

Mike



 
If you aim for the web site/page description in Google's result list to show your ocr text, the meta description is the right place. And it's not visible on the page itself, but will be displayed on google results. Google is clever though: you can't just cram popular search terms into the meta description of a page and then have nothing relating to that description in the visible page text. In the simplest case that'll lead to bad page ranking. SEO is a topic of its own and nothing you learn casually while doing the HTML. Whole companies concentrate on SEO and do only that, without doing any web design, illustration, templating etc., just core SEO.

Maybe just as a starter: But before you go into that, HTML is the more basic knowledge to know.

And some tips from Google itself about best practices for the web page description to reflect in search results:
Bye, Olaf.
 
Olaf said:
If you aim for the web site/page description in Google's result list to show your ocr text, the meta description is the right place. And it's not visible on the page itself, but will be displayed on google results.

That is almost completely correct. But it's worth keeping a couple of points in mind:

1. The contents of the Description meta tag do not themselves affect the search results. Even if the user searched for an exact phrase from the Description, that won't itself influence whether the page will appear in the search results - unless that same phrase happened to appear in the body of the page as well.

2. Although the Description is often used for the text that appears below the page title in the search results (the so-called snippet text), that is not always the case and cannot be guaranteed. Google will choose whatever text to display that it thinks will be most helpful to the user. In many cases, that will be an extract from the page itself that contains the search term(s). (This is very easy to verify. Just run a few sample searches, and compare the snippet text with the Description meta tag.)

Mike


 
OK, so from a high flyover, how would you get the ocr text indexed by google for displaying results while keeping it private for subscribers? If the ocr text is available for everyone, then why would anyone pay for a subscription?

Can viewable text be layered under an image to prevent its view from a non subscriber? Or hidden somehow in plain sight?

I had already intended to show summary info on the viewable page, and I want the ocr text indexed so that terms like "red oak tree" or "property line of John Doe" could be found, which were not part of the summary but could be important terms known by the researcher. John Doe in this example is not involved in the document and is therefore not mentioned in the summary, but is mentioned/referenced in the document.

Any suggestions?
Stanley
 
So you want to be found by a text passage you intend to sell? You ask for the impossible. You should have descriptions and valuable info in your webpages for google to index, and not offer anything you don't want public to the google bots. See robots.txt.
Of course what you show public should be enough to attract subscribers.

But for that matter, you don't integrate the OCR text into the html at all, even if it would not be indexed by google (or other search engines), because whatever is in the HTML can be seen: right click, view source. You can't hide anything that is in the HTML from the end user. The display:none style Mike explained earlier only hides the text visually in the browser canvas; it's still there. It's nonsense to put something "hidden" into the HTML, want Google to react to search keywords in this part of the HTML, and then not have it in the pages end users surf to by clicking the search result. That's not how things work.

Bye, Olaf.
 
If you want to sell electronic documents, media files, PDFs, whatever, users need to see a short summary/abstract or thumbnail or sample, then buy to get the full download. You then put up the file with a link only given to the paying member and only valid for a limited time, so it can't simply be given to others. Besides that, the URL should lead to a script that first checks the user's login and permission before returning a file; it should not be a link to the file itself. That's how for example Amazon Music MP3 downloads work.
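One common way to build such a time-limited link is to sign it on the server. The following is only a sketch in Python, not VFP, and not a description of how Amazon does it; the /download path, the parameter names, the secret and the function names are all assumptions for the example.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

# Assumption: this secret lives only on the server, never in the page.
SECRET = b"replace-with-a-real-server-side-secret"


def sign(doc_id: str, user: str, expires: int) -> str:
    """HMAC over document, user and expiry time; any change breaks it."""
    msg = "{}|{}|{}".format(doc_id, user, expires).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def make_link(doc_id: str, user: str, ttl: int = 3600, now=None) -> str:
    """Link for one paying member, valid for ttl seconds."""
    expires = int(now if now is not None else time.time()) + ttl
    query = urlencode(
        {"doc": doc_id, "user": user, "exp": expires,
         "sig": sign(doc_id, user, expires)}
    )
    return "/download?" + query


def link_is_valid(doc_id: str, user: str, expires, sig: str, now=None) -> bool:
    """What the download script would check before returning the file."""
    now = int(now if now is not None else time.time())
    if now > int(expires):
        return False  # link has expired
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(doc_id, user, int(expires)), sig)
```

The download script would still check the user's login on top of this; the signature only ensures that a link cannot be altered or reused after it expires.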

If you provide links to downloads themselves, there is nothing really hindering anonymous access to the file by the mere URL.

The age of urls ending in .htm or .html is over. Nobody has any static web content anymore. Even URLs not ending in .php, .aspx, .cgi or a similar extension lead to some server side dll, script, or anything else capable of interacting, for example redirecting you to a login if you try to access something protected. And such things obviously are not reachable by google or other search engines either, and so can't be indexed or searched. The reasoning for description and keywords and such is not to keep secrets from users; they are there to give exactly what the name says, a description or keywords.

Mike is right, and I also already said you can't force google to react to your description if it says something that has nothing at all to do with the content. Mike says you can't force google to take your description, and even less can you tell google to treat it as viable search terms but keep it secret nevertheless.

Bye, Olaf.
 
Olaf said:
That's not how things work.

I get that, and that is why I said "high flyover" as at this stage we need to discuss "how it can be done". I was asking how to make it work with whatever magic we need.

I am aware that years ago many developers did not like the fact that all their source and techniques could be discovered and copied by "viewing source". That has been addressed since then, so maybe some of that magic can be used. I don't know, as I'm fishing for a way to do it...

Thanks,
Stanley
 
Stanley said:
how would you get the ocr text indexed by google for displaying results while keeping it private for subscribers? If the ocr text is available for everyone, then why would anyone pay for a subscription?

If your aim is to give access only to subscribers, then that's an entirely different situation than simply wanting to hide certain text. Using CSS ([tt]display:none[/tt]) is not part of that solution. You presumably have some sort of password control, so that only subscribers will be allowed into that part of your site.

So your problem now is: How to get Google to index that part of your site. Basically, that's not possible. Google is not going to show a site in its search results if the searcher cannot access that site.

The usual workaround is to create separate pages which contain a summary of each of the articles in the subscriber-only part of the site. Those summary pages are normal pages which anyone can access and which Google can crawl. Provided those pages contain the appropriate keywords, Google will show them in its search results. The user can then read the summary, and then has the option of clicking on a link that will lead to a page inviting them to become a subscriber.

Many newspapers work that way, including the New York Times.

Mike

 
>That has been addressed since then
No it hasn't. But hiding your source code is not necessary for a success of a site at all.
If you think of js code hiding that context menu option, well you automate InternetExplorer.Application, navigate to a URL (also done via code) and finally save the document.innerhtml text, and there you have it.

The secret also isn't in doing the secret things server side; more code has travelled to the client side in the form of JS to be executed there. The server side code still is the only guard, the only way to manage access to resources and enforce authentication. You don't let client side JS code decide whether a hidden element is shown, or whether the result of a request JS makes via Ajax (XMLHttpRequest) is shown in the browser or not; that's putting a lock made of paper on a cardboard box. Code switching visibility:hidden to visibility:visible is there to make popups possible, for example; it is not usable as a safe. You can easily unlock that.

You're fishing for something that doesn't exist. As I said, you have to make public what also needs to sink into the search engines' index to make your site visible and attractive. And no, you can't have a secret with end users and not with google. That's it on that part.

Take an analogy: radio stations play music, and people can not only listen to it but also record it, yet money is still made selling CDs, DVDs or downloads. The part of a song made publicly available by shop sites like amazon starts with the searchable list of artists and their album and song titles and also giving samples, but it doesn't end there; it even goes as far as playing the whole song in the media. You may sell other information, which once revealed is revealed. Well, then you can only describe your items, just like names and titles of songs or books or movies.
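As a minimal sketch of what "the server decides" means, here is the idea in Python rather than VFP or any particular web framework. The PAGES store, the User type, the status codes chosen and the /login and /full paths are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    subscription_active: bool


# Hypothetical store: summaries are public, full text is subscriber-only.
PAGES = {
    "deed-123": {"summary": "Deed, 1952, two parties.", "full": "Full OCR text"},
}


def respond(page_id: str, user: Optional[User]) -> tuple:
    """Return (status, body); the decision is made entirely on the server."""
    page = PAGES.get(page_id)
    if page is None:
        return 404, "Not found"
    if user is None:
        # Not logged in: redirect to login, send no protected text at all.
        return 302, "/login?next=/full/" + page_id
    if not user.subscription_active:
        # Logged in without a subscription: only the public summary.
        return 403, page["summary"]
    return 200, page["full"]
```

The point is that the full text never leaves the server unless the check passes, so there is nothing in the response for view source or a script to uncover.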

Bye, Olaf.
 
Olaf said:
If you think of js code hiding that context menu option, well you automate InternetExplorer.Application, navigate to a URL (also done via code) and finally save the document.innerhtml text, and there you have it.

Yes, if you are an expert; however, the cost of admission is too low, making it not worth the hassle for the researcher.


Olaf said:
and also giving samples
And that is what the hidden ocr text is all about: letting the researcher know that we have a match based on their search terms, which may have come from the ocr text. If I give them the whole ocr text, then why should they pay? I'm trying to give the researcher as much as possible without giving away the farm...


Mike said:
The usual workaround is to create separate pages which contain a summary of each of the articles in the subscriber-only part of the site. Those summary pages are normal pages which anyone can access and which Google can crawl. Provided those pages contain the appropriate keywords, Google will show them in its search results. The user can then read the summary, and then has the option of clicking on a link that will lead to a page inviting them to become a subscriber.

This is exactly what I'm doing. I'm creating a public set of pages with summary info on them that leads to the full secured page via a login, if they are not already logged in. Looks like I'll need to manually visit each of the 3+ million pages and growing, then create an "ok to show" keywords list to be shown on the public summary page, all for google... I tremble at the thought of that... and no one will pay for all those manhours, all because the search bots can't keep a secret...

Thanks,
Stanley


 