VBScript Reg Exp help 1

Status
Not open for further replies.

JimFL

Programmer
Jun 17, 2005
131
GB
Hi,

I am trying to build a simple search function in VBScript that reads in a text file (e.g. HTML or ASP) and extracts the page content by stripping the tags with the function below.

<%

Function clearAllTags(s)
Dim re
Set re = New RegExp
re.Pattern = "(<[^>]*>)"
re.Global = True
re.IgnoreCase = True
clearAllTags = re.Replace(s, "")
End Function

%>
However, it works for some pages but not all. It doesn't seem to remove the JavaScript, and it fails to remove some of the ASP and HTML on certain pages.
Does anybody have another solution, or know how to adapt the code to remove HTML, JavaScript and ASP?

Can anybody help?

 
Investigating this further, I don't think it likes the following code.

<code>
<a href="#Image_<%=streamID%>" onMouseOut="MM_swapImgRestore()" onMouseOver="MM_swapImage('Image_<%=streamID%>','','img/matrix_misc/streams_substreams/streams/<%=strmGraphic%>_off_over.gif',1)" title="<%=thisSubStreamText%>">
</code>

This is where I output ASP values inside the HTML. Can anyone help rewrite the regular expression to cater for this?
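
To illustrate why the original pattern trips on a line like this, here is a minimal sketch. It assumes the script is run standalone under Windows Script Host (for example with cscript), and the shortened sample string is made up for illustration. Because [^>]* stops at the first ">", the match ends at the "%>" embedded in the href attribute rather than at the real end of the <a> tag.

```vbscript
Option Explicit

Dim re, sample
Set re = New RegExp
re.Pattern = "<[^>]*>"   ' same idea as the pattern in clearAllTags
re.Global = True

' Shortened, hypothetical version of the problem line. The ASP delimiters
' are built by concatenation so they never appear literally in the source.
sample = "<a href=""#Image_<" & "%=streamID%" & ">"" title=""x"">link</a>"

' The first match stops at the ">" that belongs to the embedded ASP block,
' so it covers only part of the <a> tag.
WScript.Echo re.Execute(sample)(0).Value

' The rest of the tag (the title attribute and its ">") is therefore
' left behind in the output along with the link text.
WScript.Echo re.Replace(sample, "")
```

With this sample the second line printed should be `" title="x">link`, which shows the tail of the tag surviving the replace.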

 
try this




 
Thanks for your response. I have tried both methods, and the first regular expression gives me the best results (i.e. better than the expression I had before). However, it doesn't give me a completely clean result: some ASP is still left in there. Do you know if this has been covered anywhere?

 
Is there a way to remove the ASP after the HTML has been cleaned?

Then I could write another regular expression to remove the ASP tags and everything within them.

ie.

<% --all code %>

??

 
This is my little invention.
[1] It strips the tags from well-matched documents whose angle brackets are perfectly balanced, even when nested, such as:
[tt]<...<...>..>xxx<..<..<..>..>..>yyy...[/tt]
leaving basically
[tt]xxxyyy[/tt]
It is suitable for well-formed XHTML with server-side code that contains no arithmetical/logical (>) or (<) operators. This is its limitation.
[2] HTML <script></script> sections have the same problem, because of arithmetical/logical operators.
[3] For an XML document, the parsed string will not have that defect.
[4] It contains a little routine for cleaning up redundant empty lines and spaces. (It may not be visually perfect yet.)
[5] This is the function striptags(s), taking a well-matched string as its argument.
[tt]
function striptags(s)
dim rx, rx_count, rx_clean, n, stmp
set rx=new regexp
with rx
.pattern="<(.*?)>"
.global=false
end with
set rx_count=new regexp
with rx_count
.pattern="<"
.global=true
end with
stmp=s
do while rx.test(stmp)
n=rx_count.execute(rx.execute(stmp)(0).submatches(0)).count
stmp=rx.replace(stmp,string(n,"<"))
loop
set rx_clean=new regexp
with rx_clean
.pattern="^\s*(.*?)\s*$"
.global=true
end with
stmp=rx_clean.replace(stmp,"$1")
with rx_clean
.pattern="^\s*$"
.global=true
.multiline=true
end with
stmp=rx_clean.replace(stmp,"")
striptags=stmp
end function
[/tt]
 
Thanks tsuji. It again seems better than the others, but I need a solution for ASP that uses the > and < operators. Can this be done?

Is there a way for the regular expression to identify the difference between an ASP operator and a closing ASP/HTML tag?

 
For server-side script interspersed with HTML, the nesting of <<>> is simpler. You can use a regexp to find each matching <% and %> pair and strip it off right away, without fear of arithmetical/logical operators intervening. Then proceed with the remaining <> HTML. For a <script></script> section, you can again strip it wholesale using its characteristic <script></script> signature. I would foresee success. The above function is at the same time slightly more general (<<<...>>> is acceptable) and more restricted (no unbalanced arithmetical/logical operators).

In sum, strip <%...%> first, then strip <...>, with special attention to <script>...</script>. That should end up all right.
 
For the first stage, removing

<% and everything between the tags %>

how would this be done with a regular expression? Can you put me on track, as I'm not that hot with the Perl syntax?

 
Such as this.
[tt]
function striptags_asp(s)
dim rx, stmp
set rx=new regexp
with rx
.pattern="<%.*?%>"
.global=true
end with
set rx_count=new regexp
stmp=s
stmp=rx.replace(stmp,"")
with rx
.pattern="<script(.|\n)*?script>"
.global=true
end with
stmp=rx.replace(stmp,"")
with rx
.pattern="<(.|\n)*?>"
end with
stmp=rx.replace(stmp,"")
'the rest is just cleaning up
with rx
.pattern="^\s*(.*?)\s*$"
.global=true
end with
stmp=rx.replace(stmp,"$1")
with rx
.pattern="^\s*$"
.global=true
.multiline=true
end with
stmp=rx.replace(stmp,"")
striptags_asp=stmp
end function
[/tt]
There might be some more aberrations. If you find any, tell the forum.
 
Thanks again tsuji.


I get the following error:

Unterminated string constant

line 5

.pattern="<%.*?
---------------^

I don't understand this error, because there is a closing " on that line.

 
Further note:
Add ignorecase=true to the regexp for the script section. Also, some minor redundant or missing lines are due to my editing offhand here...
[tt]
'etc...
with rx
.pattern="<script(.|\n)*?script>"
.global=true
[red].ignorecase=true[/red]
end with
stmp=rx.replace(stmp,"")
with rx
.pattern="<(.|\n)*?>"
[blue].global=true[/blue] 'redundant in any case
end with
stmp=rx.replace(stmp,"")
'etc...
[/tt]
 
>.pattern="<%.*?[tt][highlight]%>"[/highlight][/tt]
It shouldn't error out, should it?
 
I have tried different combinations, with and without ignorecase, but all give the same "Unterminated string constant" error, even after resetting the .pattern for the first replace. Can you post the full function so that I know what should go where?

 
I don't have much more than what I posted. Here is a copy and paste of it with minor improvements.
[tt]
function striptags_asp(s)
dim rx, stmp
[blue]stmp=s[/blue] 'moved here
set rx=new regexp
with rx
.pattern="<%[blue](.|\n)[/blue]*?%>"
.global=true
end with
[red]'[/red]set rx_count=new regexp 'sure should edit out
[red]'[/red]stmp=s 'moved up
stmp=rx.replace(stmp,"")
with rx
.pattern="<script(.|\n)*?script>"
.global=true
[blue].ignorecase=true[/blue]
end with
stmp=rx.replace(stmp,"")
with rx
.pattern="<(.|\n)*?>"
end with
stmp=rx.replace(stmp,"")
'the rest is just cleaning up
with rx
.pattern="^\s*(.*?)\s*$"
.global=true
end with
stmp=rx.replace(stmp,"$1")
with rx
.pattern="^\s*$"
.global=true
.multiline=true
end with
stmp=rx.replace(stmp,"")
striptags_asp=stmp
end function
[/tt]
 
I really appreciate your help here, tsuji. I have copied the code exactly but still get the same "Unterminated string constant" error:

.pattern="<%(.|\n)*?
--------------------^

So I'm a bit lost again. Can you help?

 
I ran some tests on it and it works standalone. If you test it in an ASP page, make sure it is a server-side function. I'm calling it a day.
 
ok thanks for your help - much appreciated.

 
Unfortunately, you will get weird issues with <% or %> even if they are inside a string. Just separate them out so your pattern looks like:
Code:
.pattern="<" & "%.*?%" & ">"

-T

 
Cheers, that has solved the issue above. The additional function has also helped to clear away the JavaScript, but it fails to remove the ASP where I am using > and < signs in the code. Are there any other possible solutions?
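
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the VBScript RegExp object "." does not match line breaks, so a one-line pattern such as "<" & "%.*?%" & ">" skips ASP blocks that span several lines. Those blocks then fall through to the generic tag pass, which cuts them at the first ">" operator. The sketch below combines the suggestions made in this thread (split delimiters, ASP first, then script sections, then the remaining tags) and uses [\s\S] to cross line breaks. The function name StripMarkup and the sample string are hypothetical, and the demo lines assume Windows Script Host.

```vbscript
Option Explicit

' Hypothetical helper: a sketch combining the steps discussed in the
' thread, not a confirmed fix.
Function StripMarkup(s)
    Dim rx, result
    Set rx = New RegExp
    rx.Global = True
    rx.IgnoreCase = True

    ' 1. ASP blocks first. [\s\S] matches any character, including line
    '    breaks, so > and < operators inside the block cannot end the
    '    match early. The delimiters are split so that the ASP parser
    '    does not treat them as the end of the server-side block.
    rx.Pattern = "<" & "%[\s\S]*?%" & ">"
    result = rx.Replace(s, "")

    ' 2. Script elements together with their contents.
    rx.Pattern = "<script\b[\s\S]*?</script\s*>"
    result = rx.Replace(result, "")

    ' 3. Whatever ordinary tags remain.
    rx.Pattern = "<[^>]*>"
    result = rx.Replace(result, "")

    StripMarkup = result
End Function

' Usage with a made-up sample: a multi-line ASP block containing ">".
Dim sample
sample = "<p>a</p><" & "% If x > 1 Then" & vbCrLf & _
         "y = 2" & vbCrLf & "End If %" & "><b>b</b>"
WScript.Echo StripMarkup(sample)   ' prints: ab
```

In an ASP page, Response.Write would replace WScript.Echo. A "%>" inside a string literal in the ASP code would still end the match early, and the whitespace clean-up from the earlier functions is left out to keep the sketch short.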

 