Removing tags from HTML code

Stretchwickster · Apr 16, 2003

I have some HTML code in a TRichEdit and I want to strip out all the tags to leave the text I am interested in. For example, filtering this HTML code:

Code:

<HTML>
<HEAD>
  <TITLE> My Site </TITLE>
</HEAD>
<BODY>
  <B> Lots of useful information </B>
  <H1> And some more </H1>
</BODY>
</HTML>

would give the following as output:

Code:

  My Site
  Lots of useful information
  And some more

Here is the Delphi code I have so far:

Code:

  startPos := 0;
  lineNo := 0;
  with richEditHTML do
  begin
    textLen := Length(richEditHTML.Text);
    repeat
      beginFound := richEditHTML.FindText('<', startPos, textLen, []);
      if beginFound <> -1 then
      begin
        startPos := beginFound;
        textLen := textLen - startPos;
        endFound := richEditHTML.FindText('>', startPos, textLen, []);
        SelStart := beginFound;
        SelLength := (endFound - beginFound) + 1;
        SelText := '' + #13#10;
        SelStart := SelStart + 1;
        Inc(lineNo);
      end;
    until (beginFound = -1) or (lineNo = 189);
  end;

Unfortunately, I had to put a limit on how many tags it removes because it seems to mess up when it finds the 190th tag! The code works as required up to this point. Another problem is that lots of whitespace is still floating around after doing this. Btw, the text is about 61,000 characters over 1300 lines.

Any help would be much appreciated!

Clive [infinity]

http://www.kucu.co.uk

Ex nihilo, nihil fit (Out of nothing, nothing comes)
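
A likely cause of the failure around the 190th tag, assuming the usual `FindText(SearchStr, StartPos, Length, Options)` signature of the rich edit control: `textLen := textLen - startPos` subtracts the absolute position again on every pass, and the text also gets shorter with each replacement, so the search length keeps shrinking until the search window no longer reaches the next tag. Recomputing the remaining length from the current text on each pass should avoid that. A simpler route is to read the text out of the control once and filter it as a plain string. The sketch below does that and also deals with the leftover whitespace. `StripTags` and `CollapseBlankSpace` are made-up names, not VCL routines, and the code makes no attempt to handle comments, scripts, or a `>` inside an attribute value.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF} // only needed to build under Free Pascal

// Hypothetical helper: copies every character that is not inside <...>.
// Each removed tag is replaced by a line break, as in the loop above.
function StripTags(const Html: string): string;
var
  i: Integer;
  C: Char;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for i := 1 to Length(Html) do
  begin
    C := Html[i];
    if InTag then
    begin
      if C = '>' then
      begin
        InTag := False;
        Result := Result + #13#10;
      end;
    end
    else if C = '<' then
      InTag := True
    else
      Result := Result + C;
  end;
end;

// Hypothetical helper: trims each line and drops the lines that end up empty.
function CollapseBlankSpace(const S: string): string;
var
  i, First, Last: Integer;
  Src, Line: string;
begin
  Result := '';
  Line := '';
  Src := S + #10; // sentinel so the last line is flushed too
  for i := 1 to Length(Src) do
    if (Src[i] = #13) or (Src[i] = #10) then
    begin
      // Skip leading and trailing blanks (anything up to and including space).
      First := 1;
      Last := Length(Line);
      while (First <= Last) and (Line[First] <= ' ') do
        Inc(First);
      while (Last >= First) and (Line[Last] <= ' ') do
        Dec(Last);
      if First <= Last then
      begin
        if Result <> '' then
          Result := Result + #13#10;
        Result := Result + Copy(Line, First, Last - First + 1);
      end;
      Line := '';
    end
    else
      Line := Line + Src[i];
end;
```

With the rich edit from the question, usage would be along the lines of `richEditHTML.Text := CollapseBlankSpace(StripTags(richEditHTML.Text));`. Working on a string also avoids moving the selection around the control once per tag.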

earlrainer · Apr 16, 2003

try this

http://www.torry.net/vcl/internet/html/nzhtmlparser.zip

LucieLastic · Apr 16, 2003

hi Clive

You could use a TDomDocument and read each section by tag element, but some codes will fall through the net; in my case, I just remove those with StringReplace.

e.g., here's a snippet of my code
:
var
  NodeList: IXMLDomNodeList;
  XMLDoc: TDomDocument;
:
:

Status := XMLDoc.load(sFile);
if not Status then
  raise Exception.Create('Could not load the XML file');
:
:

NodeList := XMLDoc.getElementsByTagName('title');
StoryTitle := NodeList.item[0].Get_Text;
StoryTitle := Form_main.CheckForChars(StoryTitle);

// Set up rich edit formatting.
RichEdit.SelStart := 0;
RichEdit.SelLength := Length(StoryTitle);
RichEdit.SelAttributes.Color := clMaroon;
RichEdit.SelAttributes.Style := [fsBold];

RichEdit.Lines.Add(StoryTitle);
RichEdit.SelAttributes.Style := [];
RichEdit.SelAttributes.Color := clBlack;

// Get story body.
NodeList := XMLDoc.getElementsByTagName('fulltext');
for ii := 0 to NodeList.length - 1 do
begin
  sline := NodeList.item[ii].Get_text;
  sline := StringReplace(sline, '<P>', '', [rfReplaceAll]);
  :
  etc ...

hth
lou
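
The StringReplace clean-up Lou describes could be gathered into a single helper, roughly as sketched below. `CleanLeftovers` is a made-up name, and the list of codes is an assumption for illustration: only the `<P>` replacement appears in the snippet above, while the closing tag and the character entities are guesses at what else might slip through.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$H+}{$ENDIF} // only needed to build under Free Pascal
uses
  SysUtils; // StringReplace, rfReplaceAll, rfIgnoreCase

// Hypothetical helper: removes stray paragraph tags and decodes a few
// common entities. The list is illustrative, not exhaustive.
function CleanLeftovers(const S: string): string;
begin
  Result := S;
  Result := StringReplace(Result, '<P>', '', [rfReplaceAll, rfIgnoreCase]);
  Result := StringReplace(Result, '</P>', '', [rfReplaceAll, rfIgnoreCase]);
  Result := StringReplace(Result, '&nbsp;', ' ', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  // Decode '&amp;' last so that '&amp;lt;' ends up as '&lt;', not '<'.
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
end;
```

In the loop above it would be called as `sline := CleanLeftovers(sline);` in place of the individual StringReplace lines.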

Stretchwickster · Apr 16, 2003

Cheers for the suggestions peeps...
Lou, what do I need to stick in my "uses" clause to get access to a TDomDocument and an IXMLDomNodeList?

Clive [infinity]


LucieLastic · Apr 17, 2003

hi Clive

Ah, yes, one minor detail... you need to import the MSXML type library and put MSXML_TLB in your uses clause. Do you have this file?

lou

Stretchwickster · Apr 17, 2003

Hey Lou,

I'm afraid I don't have this file...where can I get it from?

Clive [infinity]


LucieLastic · Apr 17, 2003

hi Clive

You need IE5 or newer on your machine. Have a look at the link below and search the page for MSXML_TLB, or here's the relevant snippet:

"Select Project/Import Type Library. This will display the Import Type Library dialog. Select "Microsoft XML, Version 2.0 (version 2.0)" from the list box and click the "Create Unit" button. This will add MSXML_TLB to your project."

http://bdn.borland.com/article/0,1410,26882,00.html

Or, another example:

http://delphi.about.com/library/bluc/text/uc050601a.htm

lou

LucieLastic · Apr 17, 2003

hi

Just fyi, if you search on t'internet (northern lass) for MSXML_TLB you'll find a lot of examples of the parser.

lou
