Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations derfloh on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Stripping RTF tags using PL/SQL

Status
Not open for further replies.

slicendice

Programmer
Jun 28, 2002
164
GB
Hi
I have a field that is storing text in RTF format and I need to transfer the actual text into a plain text field - in other words I need to strip out all the RTF tags from the original field.

Does anyone have any idea how to do this in PL/SQL?

Thanks very much
 
Slice,

Sure...I'm just not familiar with RTF tags. Can you please offer either a reference of the full set of RTF tags or a description of RTF formats so that we can create the code necessary to strip the tags?

[santa]Mufasa
(aka Dave of Sandy, Utah, USA)
[ Providing low-cost remote Database Admin services]
Click here to join Utah Oracle Users Group on Tek-Tips if you use Oracle in Utah USA.
 
Thanks for the response Mufasa...

I found a full specification of RTF on the web and to be honest, it looked pretty scary, as there are a LOT of tags that could potentially be embedded into the text. What seems like it would be very awkward to me is that whereas some tags are fixed (e.g. italics is \i), others can change (e.g. for a specific font it's \fN, where N is the number of the font in the font table!).
The full spec can be found at:

In all reality, it's probable that the only tags actually used in the text will be basic formatting tags (e.g. underline \u, bold \b, italic \i, font \fN, font size \fsN, font colour )...BUT there may be others in there too. Plus I think there is RTF header information prepended to all text.

The other thing that also occurred to me is that there could possibly be backslash characters in the text (i.e. ones that do not indicate any kind of RTF attribute)

I'd be very interested to hear your thoughts on this, as I have to admit it looks like a scary task to me! :)

Thanks again...
 
Slice,

Some follow-on questions:

1) How many rows of RTF-tagged text are there?
2) Are you doing a 1-time RTF-to-text conversion or must you retain the RTF tagging and do the RTF-to-text conversions "on the fly" whenever someone needs it converted?
3) What are the ranges in size of the RTF-tagged text pieces?
4) Since you are posting in the Oracle 8i forum, I'm concerned about the data type of the column in which you are storing the RTFs...What is the RTF-column data type?

[santa]Mufasa
(aka Dave of Sandy, Utah, USA)
[ Providing low-cost remote Database Admin services]
Click here to join Utah Oracle Users Group on Tek-Tips if you use Oracle in Utah USA.
 
Answers as follows:

1) At this stage I'm not sure exactly how many but it will be in the thousands
2) Yes this is a one-time RTF-to-text conversion; once it's done, that's it
3) Again, at the moment I'm not sure as I don't currently have access to the actual data. I believe I will be able to access it at some point over the next few days (or so I'm told!)
4) The RTF data is being stored in a LONG RAW field; I'm converting this to text and then storing it (hopefully minus the RTF tags!) in a VARCHAR2 field. The size of the VARCHAR2 field is 2000, so as well as removing the RTF tags I'm trimming the output to 2000 characters.

Thanks for your time on this...
 
Slice,

I've done some checking and here is what I have found:

1) RTF-to-text conversion software can be pretty pricey.
2) I cannot locate any existing database/procedural-language algorthms to strip RTF tags.
3) Stripping RTF tags can be an algorithmic nightmare (ergo the pricey software).

Since this is a one-time shot (to convert the RTF-tagged text into plain text), and if I were in your shoes, I would seriously consider the following:

1) write a PL/SQL routine to read the RTF-tagged text,
2) write the text to individual flat files (using your PK value for each row as the "<PK-file name>.rtf") using Oracle's "utl_file" packaged procedures,
3) Open the resulting flat files in something like Word,
4) Save the content to a non-RTF, plain-text file named "<PK-file name>.txt"
5) Use a PL/SQL routine to read the flat, plain-text files back into their respective rows.

You can probably assign some lower-wage-per-hour administrative person to do the busy work of opening the files in Word, then saving them back out to plain "*.txt" files.

Anyone out there with a better suggestion than this one? I'd be very interested, myself.

Let us know your thoughts, Slice.

[santa]Mufasa
(aka Dave of Sandy, Utah, USA)
[ Providing low-cost remote Database Admin services]
Click here to join Utah Oracle Users Group on Tek-Tips if you use Oracle in Utah USA.
 
I'd been coming to similar conclusions myself actually! The more I look at the RTF formatting codes, the bigger the job seems to be. Just looking at a text field which, via the interface, only has a few words in it can reveal underlying RTF data that is hundreds and hundreds of characters long.
I've done some *very* simple text stripping based purely on removing the RTF header (using the '}' as a postional marker), removing any CR/LF tags ('\par' in RTF speak) within the text and then removing any trailing tags. This works OK up to a point, but if there's any more complex RTF formatting within the actual text (as opposed to at the start or end of the text) then it falls flat on it's face!

I think at this point it will be up to the customer to decide how they best want to proceed, as this is something they have requested. They can either have an automated process that is very simple and not very clever (along the lines of what I've already done) or go for something along the lines that you suggest of using an editor to load the data and then save it as plain text.

I still haven't had access to the actual data that needs converting so I still do'nt know for certain just how much RTF formatting I'm going to have to deal with. I'll keep you posted...!

Thanks very much for your help and suggestions
 
Actually....I'm guessing it would be possible to invoke an RTF reader (such as Word) programatically using something like VBA? If so then could possibly use PL/SQL to generate the RTF text files from the database (using UTL_FILE as you describe above), then run some VBA that reads all the files and saves each as plain text, then run some more PL/SQL to load the plain text back into the database?
Just thinking out loud...!
 

My 2 cents:

Look into winbatch to automate windoze programs. [3eyes]




----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Yes, Slice, between your idea and LK's link to winbatch, this could provide you with the non-labour-intensive options that would make viable a level of automation that you and the customer could afford.

Let us know what you come up with.

[santa]Mufasa
(aka Dave of Sandy, Utah, USA)
[ Providing low-cost remote Database Admin services]
Click here to join Utah Oracle Users Group on Tek-Tips if you use Oracle in Utah USA.
 
I do a lot of automation with standard dos command files. What advantages does WinBatch offer? Just curious.
 


You can emulate any input to a windows program (like point, click double-click, enter data in a field, etc...)

In other words, treat a windows program (like Word or Wordpad)as a batch program, allowing you to loop through a folder, get a file name, invoke wordpad, click-n-all to open the file and save as text.

[2thumbsup]

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
Well, we are starting to get a little bit off topic from the original post, but I will definately take a look at WinBatch. Sounds like it could be useful.
 
Just as some kind of closure to this, a decision has been made to do this by writing a Delphi utlity rather than trying to do it directly in the database; consequently I will play no further part in this....knowing nothing about Delphi! (Basically, the data will be read into an RTF text control using Delphi, and then written back out as plain text)

Thanks to all for the input and help
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top