
Remove Html Tag Tool

Status
Not open for further replies.

Bubbler

IS-IT--Management
Dec 14, 2003
135
US
Not sure where else to ask this, but I am hoping that, given the amount of experience in this forum, I can get some leads on tools.
Basically I have a 32 MB HTML page and I want to remove the tags. It is over 400,000 lines, so doing it manually is OUT!! I have tried several tools on the net, but all of them crash.
I am simply trying to convert a list of links like this
Code:
<a href="http://www.website.com/">http://www.website.com/</a>
to this
Code:
http://www.website.com/
 

You can do this via the WebBrowser Control in several different ways...
Code:
MyString = WB.Document.Body.innerText
or
Code:
Dim A As Object
For Each A In WB.Document.Links
    Debug.Print A.innerText & ", " & A.HREF
Next
and there are a few other ways...

Use Advanced Search on this site, in these forums, for...

webbrowser
webbrowser object

(and anything else you can think of to find your answer)

using all words.

Good Luck

 
Or you could open it in Internet Explorer (or whatever)

Edit>Select All
Edit>Copy

Open NotePad

Edit>Paste

;-)

Have Fun, Be Young... Code BASIC
-Josh


PROGRAMMER: (n) Red-eyed, mumbling mammal capable of conversing with inanimate objects.
 
Thanks for the responses. Two points:

1. The lines differ, like this:
Code:
<a href="http://www.website.com/page1.html">http://www.website.com/</a>
<a href="http://www.website.com/page2.html">http://www.website.com/</a>

2. Keep in mind that the file is over 400 thousand lines and is a 32 MB page. The last time I tried it with VB, it either did not load the whole file or crashed.

And to CubeE101: that method does not work. The file is too big and locks up my system, and my system ain't exactly weak (WinXP Pro, 2.4 GHz P4, 768 MB RAM). You can't copy 400 thousand plus lines to the clipboard, at least my machine won't.

If I could loop through the lines of the file like this
Anything between < and > remove

So again this:
Code:
<a href="http://www.website.com/page2.html">http://www.website.com/</a>
Would become
Code:
http://www.website.com/
Because it removed
Code:
<a href="http://www.website.com/page2.html">
and also
</a>
 
2 Quick questions

1) Did you try vb5prgrmr's solution?

2) Are you saying that your 32 MB page merely consists of a list of HTML links, or is it an HTML page (with header etc.) that has a bunch of these links on it?
 
strongm, yes I tried vb5prgrmr's suggestion, 400 thousand + links crash it. (I think :)) Yes it is a full webpage.

Here is what I tried
Code:
Dim A As Object
For Each A In WB.Document.Links
  Text1.Text = A.innerText & vbNewLine & A.HREF
Next
I think it is the text box's inability to handle this much input. What should I use?
 

Well, with the code you have cited you would only have one result in the textbox, which I guess is not a copy-and-paste job but just a misprint. So if you are having that many problems with the textbox, use a RichTextBox instead and see if that crashes your system/program. Also, if you had followed the advice in FAQ222-2244 item 14 about reporting errors, code examples, and where the error occurs, we might have a better idea of how or where your secondary problem is.

Good Luck
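
To illustrate the point about only one result ending up in the textbox: the assignment inside the loop overwrites the previous value each time. A minimal sketch that writes every link to a file instead of a control might look like this. It assumes the same WB WebBrowser control as in the posts above, that the page has finished loading, and a placeholder output path (c:\links.txt); the SaveLinks name is made up for illustration. Whether the control can load a 32 MB page at all is a separate question.

```vb
' Sketch only: assumes WB is a WebBrowser control on this form and that
' the document has finished loading. The output path is a placeholder.
Private Sub SaveLinks()
    Dim A As Object
    Dim f As Integer

    f = FreeFile
    Open "c:\links.txt" For Output As #f
    For Each A In WB.Document.Links
        ' One line per link: visible text, a tab, then the target URL
        Print #f, A.innerText & vbTab & A.href
    Next
    Close #f
End Sub
```

Writing to a file sidesteps any size limit of the control, and nothing is lost between iterations.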

 
Bubbler- you already know the answer..
"If I could loop through the lines of the file like this
Anything between < and > remove"

here is the pseudo code:

Open the source file for binary input
Open a target file for binary output
while not inputfile.eof
    read a byte from input into VAR
    if VAR <> "<" then
        write the VAR to outputfile
    else
        while VAR <> ">" and not inputfile.eof
            'eat letters until we reach the >
            read a byte from inputfile into VAR
        wend
    end if
wend
close inputfile
close outputfile

I think that
Get #<filenum>, , VAR
will read a byte from the input file if VAR is set up in this way:
Dim VAR As String * 1
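
The pseudo code above could be sketched in VB6 roughly as follows. Reading the file in 64 KB chunks rather than one byte at a time is a substitution made here for speed, not part of the original suggestion; the StripTags name, the chunk size, and the paths passed in are all assumptions. The sketch treats the file as single-byte (ANSI) text and treats every "<" as the start of a tag, so a literal "<" in page text or script would be mishandled.

```vb
Option Explicit

' Sketch only: copies srcPath to dstPath, dropping everything between
' "<" and ">". Names, paths and the chunk size are illustrative.
Public Sub StripTags(ByVal srcPath As String, ByVal dstPath As String)
    Const CHUNK As Long = 65536
    Dim fIn As Integer, fOut As Integer
    Dim remaining As Long, n As Long, i As Long, outLen As Long
    Dim buf As String, outBuf As String, ch As String
    Dim inTag As Boolean            ' carries over between chunks

    ' Binary mode does not truncate, so remove any old target first
    If Len(Dir$(dstPath)) > 0 Then Kill dstPath

    fIn = FreeFile
    Open srcPath For Binary Access Read As #fIn
    fOut = FreeFile
    Open dstPath For Binary Access Write As #fOut

    remaining = LOF(fIn)
    Do While remaining > 0
        If remaining < CHUNK Then n = remaining Else n = CHUNK
        buf = Space$(n)             ' Get fills exactly Len(buf) bytes
        Get #fIn, , buf
        remaining = remaining - n

        outBuf = Space$(n)          ' output can never exceed the input
        outLen = 0
        For i = 1 To n
            ch = Mid$(buf, i, 1)
            If inTag Then
                If ch = ">" Then inTag = False
            ElseIf ch = "<" Then
                inTag = True
            Else
                outLen = outLen + 1
                Mid$(outBuf, outLen, 1) = ch
            End If
        Next i

        If outLen > 0 Then
            outBuf = Left$(outBuf, outLen)
            Put #fOut, , outBuf
        End If
    Loop

    Close #fOut
    Close #fIn
End Sub
```

A call such as StripTags "c:\demo.htm", "c:\demo.txt" never holds more than one chunk in memory, so the size of the source file should not matter.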

 
By the way:
A non-programming method suggests itself:

Set up a new printer driver of type GENERIC/TEXT ONLY.
Set the output device to FILE

Open the HTM in Internet Explorer, and PRINT to this printer driver.
You will be asked for a filename.
The resulting file will be what you wanted!
 
So...

Do you want this...
<a href=" <a href="
to become this...

or this:

Have Fun, Be Young... Code BASIC
-Josh


PROGRAMMER: (n) Red-eyed, mumbling mammal capable of conversing with inanimate objects.
 
So something like the following might work:

Option Explicit
' Requires project references to "Microsoft VBScript Regular Expressions 5.5"
' and "Microsoft Scripting Runtime"
Private re As RegExp ' Speed reasons. Keep the object instantiated for lifetime of trimming

Private Sub Command1_Click()
    Dim fso As FileSystemObject
    Dim tsMyFile As TextStream
    Dim Trimmed As String

    Set re = New RegExp
    re.Pattern = "<a.*?>(.*)</a>"   ' set once, reused for every line
    re.IgnoreCase = True            ' also match <A HREF=...>
    Set fso = New FileSystemObject
    Set tsMyFile = fso.OpenTextFile("c:\demo.htm", ForReading)
    Do Until tsMyFile.AtEndOfStream
        Trimmed = TrimTag(tsMyFile.ReadLine)
        If Trimmed <> "" Then Debug.Print Trimmed
        DoEvents
    Loop
    tsMyFile.Close
    Set re = Nothing
    Set fso = Nothing
End Sub


Private Function TrimTag(strSource As String) As String
    ' Uses the module-level re; returns "" for lines without a link
    If re.Test(strSource) Then TrimTag = re.Replace(strSource, "$1")
End Function
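
Since Debug.Print only goes to the Immediate window, which keeps a limited amount of output, a variant that writes the results to a file may be more practical for 400,000 lines. The following is a sketch under assumptions, not strongm's code: the ExtractLinkText name, the paths passed in, and the slightly tighter pattern are illustrative, and it uses late binding (CreateObject) so no project references are needed. It takes only the first link on each line, which matches the sample lines posted earlier.

```vb
Option Explicit

' Sketch only: writes the text of the first <a ...>...</a> on each line
' of srcPath to dstPath. Names and paths are placeholders.
Public Sub ExtractLinkText(ByVal srcPath As String, ByVal dstPath As String)
    Dim fso As Object, tsIn As Object, tsOut As Object, re As Object
    Dim strLine As String

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "<a\b[^>]*>(.*?)</a>"   ' capture the text between the tags
    re.IgnoreCase = True

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set tsIn = fso.OpenTextFile(srcPath, 1)         ' 1 = ForReading
    Set tsOut = fso.CreateTextFile(dstPath, True)   ' True = overwrite

    Do Until tsIn.AtEndOfStream
        strLine = tsIn.ReadLine
        If re.Test(strLine) Then
            ' SubMatches(0) is the first captured group, i.e. the link text
            tsOut.WriteLine re.Execute(strLine)(0).SubMatches(0)
        End If
    Loop

    tsOut.Close
    tsIn.Close
End Sub
```

Reading and writing one line at a time keeps memory use flat regardless of file size.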
 
Thank you all for your input. I have been able to compile a workable solution from this thread; that's why I came here. Thanks again.
 
Bubbler, can I see your code here? I want to do something similar and would be interested in your final product code. Thanks!
RLB2
 
Hello,
Strongm's solution will probably be the best. I use this very same method for picking out links (i.e. Hrefs and Mailto:) in my webspider program.

LF
 