Parsing documents

Status
Not open for further replies.

audiopro

Programmer
Apr 1, 2004
3,165
GB
I am re-writing a Church website where certain documents need to be uploaded to the server by a number of different people. All of them have login rights, so uploading the files is not the issue.

The files are created to be used as paper copies and are Newsletters, Sermons, Bulletins, Diary items etc. All the documents are basic text with some limited formatting - bold, underline, centered, left aligned etc. (i.e. no tables, images or any other complications) and are in RTF format, although they could be saved as any required type of file.
They are to be uploaded to the server without any further editing being required; the uploaders are neither tech savvy enough to format them as HTML, nor do they have the time.
I need to automate the parsing process and have been looking at RTF parsers, but all the ones I have found are beta versions, which suggests that the RTF format is a hard one to decode.
I was wondering if anyone out there has similar requirements and how you handle such documents.

Keith
 
If they can save the documents as PDF, have a look at this. I make no claims about it, but it looks like a 3 step process to find a document, convert it to HTML and save it to a file (which I am assuming could include uploading to a server). All for 49.95... but wait, there's more.


Paul
 
Probably one of the simplest ways is to install a free PDF printer driver on each of the uploaders' PCs. After authoring their RTF document they simply print, select the installed PDF printer driver, and the RTF is saved as PDF. They can then upload these in the same fashion as before.

Alternatively, if the formats are fixed and it's just the data that changes, you could provide HTML forms for the users to enter data and generate the HTML or PDF output yourself.

Clive
 
I have done some experiments with RTF and have decided that it is a bad format to work with - bleeding rubbish actually, but I am not a highly paid computer geek, so what do I know! Why are control codes and actual text not specifically separated by a standard separator? Umm, too simple? Why do the people who have influence in development always choose a complicated solution to what is a simple problem?
My scenario:-
The files are created in WordPerfect and the paper copies are printed from that. WordPerfect documents in their raw state are in effect RTF files but with a different extension.
At the moment, for web use, my clients save them using the HTML option which produces bloated but otherwise useable code.
I am looking at converting these large bloated things to a web friendly document. I am getting there, but what a ball ache!
I am using the resulting documents within dynamic web pages which is why I am trying to avoid PDF docs if at all possible. All I need is the text from the original, with a limited amount of formatting so that the documents display in a pretty format.

Keith
 
Why not let them copy and paste the document into a large textarea, then you can format it as you please (like convert newlines to </p><p>, etc.)? Give them a text input at the top for the title or header, and everything else is the body. You'd lose bold, underline, and colors, but it'd be easier than writing something to convert the RTF document itself to HTML.
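
A minimal sketch of that textarea idea, written in Python purely for illustration. The function name `text_to_html` and the choice of `<h1>` for the title are assumptions, not anything from the site in question. Blank lines become paragraph boundaries and single newlines become line breaks:

```python
import html
import re


def text_to_html(title: str, body: str) -> str:
    """Turn a pasted title and plain-text body into simple HTML."""
    # Normalise Windows and old Mac line endings first.
    body = body.replace("\r\n", "\n").replace("\r", "\n")
    parts = [f"<h1>{html.escape(title.strip())}</h1>"]
    # A blank line (possibly containing spaces) separates paragraphs.
    for block in re.split(r"\n\s*\n", body.strip()):
        if not block.strip():
            continue
        # Single newlines inside a paragraph become line breaks.
        lines = [html.escape(line.strip()) for line in block.split("\n")]
        parts.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(parts)
```

Escaping the pasted text before wrapping it in tags keeps stray `<` or `&` characters from breaking the page.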

Here's a link to a free RTF to HTML converter you might check out:


Lee
 
Thanks for the suggestion Lee but I want them to be able to upload the documents to the site with no further human intervention required.
I only require bold, underline, paragraph and line breaks but it is proving difficult. I am parsing the HTML doc. produced by WordPerfect and it seems to be doing the trick.
I will look into the link on Monday.
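
For anyone trying the same route, here is a rough sketch of stripping a bloated HTML export down to bold, underline, paragraphs and line breaks. It is in Python using only the standard library, and it is not the script described above. The names `TagFilter` and `slim_html`, and the assumption that the export uses `<strong>` as well as `<b>`, are illustrative guesses:

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the output; other tags are dropped but their text is kept.
KEEP = {"b", "u", "p", "br"}
# Map equivalent tags onto the kept set (an assumption about the export).
ALIASES = {"strong": "b"}
# Elements whose text content is discarded entirely.
SKIP_CONTENT = {"style", "script", "head", "title"}


class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        tag = ALIASES.get(tag, tag)
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        tag = ALIASES.get(tag, tag)
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def slim_html(source: str) -> str:
    """Return source with everything but the whitelisted tags removed."""
    parser = TagFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

A whitelist is used rather than a blacklist because the export's proprietary tags and attributes are open-ended, while the wanted formatting is a short fixed list.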

Keith
 
Unless your web host allows you to run executables on the server, you're going to be stuck with either converting the document before uploading, or learning enough about RTF codes to do the conversion with server-side scripting.

Lee
 
The old site was just static web pages but the new site, created by me, is dynamic and written in Perl, so running a conversion script is not a problem. The RTF format is a strange beast. I have studied the full documentation and have managed to get a parser working, but with a few issues. I have gathered a number of RTF documents generated by different programs and I am having some problems with consistency.
I think that this is worth pursuing, as quite a few of my clients have websites where they upload files, and being able to upload a basic RTF document and forget about it would be ideal. I find it hard to believe that a foolproof method has not been developed, as I would have thought that this is quite a common function in websites, yet no-one seems to have a definitive answer.

I will crack this eventually and who knows, I may become famous and be nominated for a Nobel Peace Prize - somehow I doubt it.

Keith
 
Looks like that might be useful for a copy and paste from the RTF document, converting it to HTML. It appears to use the older <font> tags, but those could be changed to <span>s with the font attributes converted to style sheet values.
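
As a hedged illustration of that <font> to <span> change, here is a small Python sketch. The function name `fonts_to_spans` is made up, and the regular expressions assume reasonably tidy markup (no `>` inside attribute values). The size table follows the usual mapping of legacy font sizes 1 to 7 onto CSS keywords:

```python
import re

# Legacy <font size="1".."7"> values mapped to CSS font-size keywords.
FONT_SIZES = {"1": "x-small", "2": "small", "3": "medium", "4": "large",
              "5": "x-large", "6": "xx-large", "7": "xxx-large"}

# name="value", name='value' or name=value
ATTR_RE = re.compile(r'([a-zA-Z]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))')
FONT_OPEN_RE = re.compile(r"<font\b([^>]*)>", re.IGNORECASE)
FONT_CLOSE_RE = re.compile(r"</font\s*>", re.IGNORECASE)


def _font_to_span(match):
    """Build a <span> opening tag from one matched <font ...> tag."""
    styles = []
    for name, dq, sq, bare in ATTR_RE.findall(match.group(1)):
        value = dq or sq or bare
        name = name.lower()
        if name == "color":
            styles.append(f"color: {value}")
        elif name == "face":
            styles.append(f"font-family: {value}")
        elif name == "size" and value in FONT_SIZES:
            styles.append(f"font-size: {FONT_SIZES[value]}")
    if not styles:
        return "<span>"
    return '<span style="' + "; ".join(styles) + '">'


def fonts_to_spans(html_text):
    """Replace every <font> element with an equivalently styled <span>."""
    html_text = FONT_OPEN_RE.sub(_font_to_span, html_text)
    return FONT_CLOSE_RE.sub("</span>", html_text)
```

Relative sizes such as `size="+1"` are simply ignored here; handling them would need to know the surrounding font size.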

Lee
 
How will that help?
The format is already RTF and needs to be converted to an HTML layout in real time.
 
Because that is exactly what most web based rich text editors allow you to do. You paste in your text (copying it from an application such as MS Word, for example), and it spits out the converted HTML.


-------------------------------------------------------

Mark,
http://aspnetlibrary.com
http://mdssolutions.co.uk - Delivering professional ASP.NET solutions
http://weblogs.asp.net/marksmith
 
Quote:
"You paste in your text"

This is not part of the project design.

This is the planned mode of operation:-

The document is created on a local machine by an operator. The document is in rich text format unfortunately but I am working with what the client uses now to save them learning new technology.
The document is automatically uploaded via a scheduled task each evening. Once on the server, each time the document is required, it is converted from RTF into CSS based HTML code for use on the website. All this is carried out automatically.
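
One way to keep the convert-on-request step cheap is to cache the converted HTML and rebuild it only when the uploaded RTF is newer. The sketch below is Python rather than the site's Perl, and `cached_html`, its parameters and the `.tmp` suffix are all hypothetical. The `convert` argument stands in for whatever RTF to HTML routine is used:

```python
import os


def cached_html(rtf_path, html_path, convert):
    """Return HTML for rtf_path, reconverting only when the RTF is newer."""
    try:
        fresh = os.path.getmtime(html_path) >= os.path.getmtime(rtf_path)
    except OSError:
        fresh = False  # no cached copy yet, or one of the files is missing
    if fresh:
        with open(html_path, encoding="utf-8") as fh:
            return fh.read()
    # Assumption: the RTF is ANSI text, so cp1252 is a reasonable decoding.
    with open(rtf_path, encoding="cp1252", errors="replace") as fh:
        html_text = convert(fh.read())
    # Write to a temporary name, then swap, so readers never see a partial file.
    tmp_path = html_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.write(html_text)
    os.replace(tmp_path, html_path)
    return html_text
```

Comparing modification times means the nightly upload automatically invalidates the cached copy without any extra bookkeeping.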

Once developed, tested and running bug free, the documents will then be uploaded, converted into an HTML template file and stored for access from the controlling Perl script.
Most of this chain of events is working, the only exception to this is the RTF parsing, which is coming along but the file format is an awkward one to work with.
All of the conversion programs I have seen are beta versions so my approach has been to write a Perl module to do the job. I have read the specs for RTF files (all 229 pages) and I am slowly adding the control codes I need, my intention being to just discard all of the rest.
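
To make the "handle a few control codes, discard the rest" idea concrete, here is a minimal sketch in Python. It is not the Perl module described above, and the name `rtf_to_html`, the list of skipped destinations and the cp1252 assumption for `\'xx` bytes are illustrative choices. It tracks bold and underline per group, treats `\par` and `\line` as breaks, and ignores every other control word:

```python
import re
from html import escape

TOKEN_RE = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"   # control word, optional number, one optional space
    r"|\\'([0-9a-fA-F]{2})"   # hex-escaped byte such as \'e9
    r"|\\([^a-z])"            # control symbol such as \{ \} \\ \* \~
    r"|([{}])"                # group open or close
    r"|([^\\{}]+)"            # run of plain text
)

# Groups starting with these control words hold metadata, not body text.
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info",
                        "pict", "header", "footer"}


def rtf_to_html(rtf):
    stack = [{"b": False, "u": False, "skip": False}]  # one state per open group
    paragraphs, runs = [], []
    fallback = 0  # characters still to drop after a \uN escape

    def emit(text):
        state = stack[-1]
        if state["skip"] or not text:
            return
        piece = escape(text, quote=False)
        if state["u"]:
            piece = f"<u>{piece}</u>"
        if state["b"]:
            piece = f"<b>{piece}</b>"
        runs.append(piece)  # each run carries its own tags, so nesting stays valid

    def end_paragraph():
        if runs and not stack[-1]["skip"]:
            paragraphs.append("<p>" + "".join(runs) + "</p>")
            runs.clear()

    for word, num, hexbyte, symbol, brace, text in TOKEN_RE.findall(rtf):
        state = stack[-1]
        if text:
            text = text.replace("\r", "").replace("\n", "")  # raw newlines are not content
            if text:
                emit(text[fallback:])
                fallback = 0
        elif hexbyte:
            if not fallback:
                emit(bytes([int(hexbyte, 16)]).decode("cp1252", "replace"))
            fallback = 0
        elif brace == "{":
            stack.append(dict(state))  # formatting is scoped to the group
        elif brace == "}":
            if len(stack) > 1:
                stack.pop()
        elif symbol:
            if symbol == "*":
                state["skip"] = True   # optional destination: ignore the group
            elif symbol in "{}\\":
                emit(symbol)
            elif symbol == "~":
                emit("\u00a0")         # non-breaking space
            elif symbol in "\r\n":
                end_paragraph()        # backslash-newline is treated like \par
        elif word in SKIPPED_DESTINATIONS:
            state["skip"] = True
        elif word == "par":
            end_paragraph()
        elif word == "line":
            if not state["skip"]:
                runs.append("<br>")
        elif word == "b":
            state["b"] = num != "0"
        elif word == "ul":
            state["u"] = num != "0"
        elif word == "ulnone":
            state["u"] = False
        elif word == "plain":
            state["b"] = state["u"] = False
        elif word == "u" and num:
            code = int(num)
            emit(chr(code + 65536 if code < 0 else code))
            fallback = 1  # assumes the default \uc1: one fallback character
        # every other control word is deliberately discarded
    end_paragraph()
    return "\n".join(paragraphs)
```

Wrapping each text run in its own tags, instead of opening and closing tags as the control words arrive, avoids overlapping elements when different programs switch bold and underline in different orders. Alignment, fonts and anything else outside the short list above are dropped silently.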

It is a work in progress but I thought someone would have invented this wheel before as it is such a useful function.

Keith
 
You may be able to utilise the ability of one of these RTF editors to do this for you. If they have the ability to convert RTF to HTML when a user pastes in the text, then they may also expose this functionality to allow you to do this via server-side code.


Mark
 
If your web server is set up for ASP, you should be able to pretty easily change the Javascript to JScript ASP.

Lee
 
Quote:
"If your web server is set up for ASP, you should be able to pretty easily change the Javascript to JScript ASP."
Sorry Lee - I don't understand what you mean.

Keith
 
You can use VBScript and JScript (which I almost always use) to write ASP code. If you usually use VBScript, you can use the runat="server" attribute in a script tag to run the Javascript server-side. You'll have to change a few things to use server objects (like Response) rather than browser objects (like document), but the conversion is usually pretty simple.

If you change the page language to JScript or Javascript, you won't need script tags, either.

Lee
 