
Removing tags from HTML code

Status
Not open for further replies.

Stretchwickster

Programmer
Apr 30, 2001
1,746
GB
I have some HTML code in a TRichEdit and I want to strip out all the tags to leave the text I am interested in. For example, filtering this HTML code:
Code:
<HTML>
<HEAD>
  <TITLE> My Site </TITLE>
</HEAD>
<BODY>
  <B> Lots of useful information </B>
  <H1> And some more </H1>
</BODY>
</HTML>
would give the following as output:
Code:
  My Site
  Lots of useful information
  And some more

Here is the Delphi code I have so far:
Code:
  startPos := 0;
  lineNo := 0;
  with richEditHTML do
  begin
    textLen := Length(richEditHTML.Text);
    repeat
      beginFound := richEditHTML.FindText('<', startPos, textLen, []);
      if beginFound <> - 1 then
      begin
        startPos := beginFound;
        textLen := textLen - startPos;
        endFound := richEditHTML.FindText('>', startPos, textLen, []);
        SelStart := beginFound;
        SelLength := (endFound - beginFound) + 1;
        SelText := '' + #13#10;
        SelStart := SelStart + 1;
        Inc(lineNo);
      end;
    until (beginFound = -1) OR (lineNo = 189);
  end;
Unfortunately, I had to put a limit on how many tags it removes because it seems to mess up when it finds the 190th tag! The code works as required up to this point. Another problem is that lots of whitespace is still floating around after doing this. Btw, the text is about 61,000 characters over 1300 lines.

Any help would be much appreciated!

Clive [infinity]
Ex nihilo, nihil fit (Out of nothing, nothing comes)
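
A possible way around the position bookkeeping, offered as a sketch rather than a drop-in fix: work on the plain string instead of calling FindText repeatedly. One likely culprit in the loop above is that `textLen := textLen - startPos` subtracts an absolute position on every pass, so the search window shrinks much faster than the text does. `StripTags` and `TidyLines` are hypothetical helper names, not VCL routines. The scan is naive: it treats everything between `<` and `>` as a tag, so comments, scripts, or a `>` inside an attribute value will trip it up.

```pascal
uses
  SysUtils, Classes;

// Copies every character that lies outside a <...> pair.
function StripTags(const S: string): string;
var
  i, n: Integer;
  inTag: Boolean;
begin
  SetLength(Result, Length(S)); // the output can never be longer than the input
  n := 0;
  inTag := False;
  for i := 1 to Length(S) do
  begin
    if S[i] = '<' then
      inTag := True
    else if (S[i] = '>') and inTag then
      inTag := False
    else if not inTag then
    begin
      Inc(n);
      Result[n] := S[i];
    end;
  end;
  SetLength(Result, n);
end;

// Trims each line and drops the lines that end up empty.
function TidyLines(const S: string): string;
var
  src, dst: TStringList;
  i: Integer;
  line: string;
begin
  src := TStringList.Create;
  try
    dst := TStringList.Create;
    try
      src.Text := S;
      for i := 0 to src.Count - 1 do
      begin
        line := Trim(src[i]);
        if line <> '' then
          dst.Add(line);
      end;
      Result := dst.Text;
    finally
      dst.Free;
    end;
  finally
    src.Free;
  end;
end;
```

With the control from the question, usage would be something like `richEditHTML.Text := TidyLines(StripTags(richEditHTML.Text));`. Because the whole text is processed in one pass, there is no tag limit to work around, and the trimming step takes care of the leftover whitespace.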
 
hi Clive

You could use a TDomDocument and read each section by tag name (getElementsByTagName), but some codes will fall through the net; in my case, I just clean those up with StringReplace.

eg, here's a snippet of my code
:
var NodeList : IXMLDomNodeList;
XMLDoc : TDomDocument;
:
:

Status := XMLDoc.load(sFile);
if (Status = False) then
  raise Exception.Create('Could not load the XML file');
:
:

NodeList := XMLDoc.getElementsByTagName('title');
StoryTitle := NodeList.item[0].Get_Text;
StoryTitle := Form_main.CheckForChars(StoryTitle);

//setup Rich edit formatting.

RichEdit.SelStart := 0;
RichEdit.SelLength := length(Storytitle);
RichEdit.SelAttributes.Color := clMaroon;
RichEdit.SelAttributes.Style := [fsBold];

RichEdit.lines.Add(Storytitle);
RichEdit.SelAttributes.Style := [];
RichEdit.SelAttributes.Color := clBlack;
//Get Story Body
NodeList := XMLDoc.getElementsByTagName('fulltext');
for ii := 0 to NodeList.length - 1 do
begin
  sline := NodeList.item[ii].Get_text;
  sline := StringReplace(sline, '<P>', '', [rfReplaceAll]);
:
etc ...

hth
lou
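
The StringReplace clean-up mentioned above could be gathered into one helper, roughly like this. It is only a sketch: `CleanLeftovers` is a made-up name, and the list of leftovers (a stray `<P>` plus a few common character entities) is an assumption about what typically slips through, not something taken from lou's code.

```pascal
uses
  SysUtils;

// Removes stray <P> tags and decodes a handful of common entities.
function CleanLeftovers(const S: string): string;
begin
  Result := StringReplace(S, '<P>', '', [rfReplaceAll, rfIgnoreCase]);
  Result := StringReplace(Result, '&quot;', '"', [rfReplaceAll]);
  Result := StringReplace(Result, '&lt;', '<', [rfReplaceAll]);
  Result := StringReplace(Result, '&gt;', '>', [rfReplaceAll]);
  // Decode &amp; last so that '&amp;lt;' becomes '&lt;' and not '<'.
  Result := StringReplace(Result, '&amp;', '&', [rfReplaceAll]);
end;
```

In the loop above it would be called as `sline := CleanLeftovers(sline);`. Each call rescans the whole string, which is fine for short strings but worth replacing with a single pass if the text is large.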

 
Cheers for the suggestions, peeps...
Lou, what do I need to stick in my "uses" clause to get access to a TDomDocument and an IXMLDomNodeList?



Clive [infinity]
Ex nihilo, nihil fit (Out of nothing, nothing comes)
 
hi Clive

Ah, yes, one minor detail... you need to import the MSXML_TLB type library and put MSXML_TLB in your uses clause. Do you have this file?

lou

 
hi Clive

You need IE5 or newer on your machine. Have a look at this link and search the page for MSXML_TLB, or here's the snippet:

"Select Project/Import Type Library. This will display the Import Type Library dialog. Select 'Microsoft XML, Version 2.0 (version 2.0)' from the list box and click the 'Create Unit' button. This will add MSXML_TLB to your project."


OR, another eg


lou
 
hi

Just fyi, if you search on t'internet (northern lass) for MSXML_TLB you'll find a lot of examples of the parser.

lou

 