RegExp - Removing html div tag and attributes

Status
Not open for further replies.

atsea

Technical User
Feb 27, 2005
51
JP
I'm hoping that a RegExp guru can grace me with their knowledge...

I would like to accomplish the following:

Input:
<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>

Desired Output:
SOME TEXT

currently playing around with something like this:

/<(div)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]+>([^<]+)(<\/div>)/ig

Any suggestions?


Thanks

atsea
 
If you assign an ID to the div, you can get the text using document.getElementById('yourdivIDhere').innerHTML.

Lee
 
trollacious:

Thanks for the suggestion; however, it's not a matter of getting the inner text/html. I want to remove the tags...

I would like to accomplish this with a regular expression (i.e. NOT obj.parentNode.removeChild(obj))

Thanks

atsea
 
You can do this.
[tt]
//s by whatever means
var s='<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'

var rx=new RegExp("<div .*?>(.*?)</div>","i");
var am=rx.exec(s);
//this is the desired data extracted
var sdata=(am)?am[1]:""; //the only submatch or empty
[/tt]

 
Interesting approach...

I was trying to use the replace() function. Is this possible?
Code:
var data ='<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'

data = data.replace(/<(div)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]+>([^<]+)(<\/div>)/ig); 
//this expression does not work (still), but I was able to do something similar when removing simple tags (i.e <b>)
(using replace(), this: /<div .*?>(.*?)<\/div>/ will remove the tag as well as the contents)

For example, to replace the string <u>FOO BAR</u> with the string FOO BAR, I could use this RegExp: data.replace(/<[\/]{0,1}(U|u)[^><]*>/g,"");

Is the same possible with divs? I believe the problem is that <div>s can contain multiple attributes... I would like the tag, with its attributes, removed.

Getting closer... thanks.

Any comments/suggestions on the above?

atsea
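
A side note on the snippet above, offered as a sketch rather than a definitive diagnosis: when replace() is called with no second argument, JavaScript converts the missing replacement to the text "undefined". With the sample input the original pattern does match, and the inner text lands in the sixth parenthesised group, so passing "$6" returns it. The variable names are only for illustration.

```javascript
var data = '<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>';
var pattern = /<(div)([ ]([a-zA-Z]+)=("|')[^"\']+("|'))*[^>]+>([^<]+)(<\/div>)/ig;

// No replacement argument: the whole match is replaced by the text "undefined".
var withoutReplacement = data.replace(pattern);

// "$6" refers to the sixth parenthesised group, ([^<]+), which holds the text.
var withGroup = data.replace(pattern, "$6");

console.log(withoutReplacement); // undefined
console.log(withGroup);          // SOME TEXT
```

Counting groups up to six is fragile; a pattern with a single group, where the text is simply "$1", is easier to maintain.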
 
[tt]
//s by whatever means
var s='<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'
var rx=new RegExp("<div .*?>(.*?)</div>","i");
s=s.replace(rx,RegExp.$1);
[/tt]
 
Hi tsuji,

I tried that already... it not only removes the <div> tags, but everything within them as well.

any more suggestions?

also, being new to regular expressions, could you please explain to me the purpose of "RegExp.$1".

Thanks,

atsea
 
What do you actually want to replace? I would rather not guess between the lines. Could you say it very simply?
 
Sorry if my previous posts seemed a little vague...

Basically I have a string:

string = '<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'

I want the RegEx to take the entire string above and replace it with the contents within the <div> tag, so the output would be:

SOME TEXT


A working example (for more basic tags) is as follows:
Code:
string = "<u>THIS IS A STRING</u>";

string = string.replace(/<[\/]{0,1}(U|u)[^><]*>/g,"");

now, the value of variable string = THIS IS A STRING

is that clearer?

atsea
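
The <u> approach above carries over to tags with attributes. As a minimal sketch, assuming attribute values never contain "<" or ">", the pattern can be built for any tag name; stripTag is a hypothetical helper name, not something from this thread.

```javascript
// Hypothetical helper: remove the opening and closing tags named tagName,
// including any attributes, and keep whatever was between them.
function stripTag(html, tagName) {
  // <\/?    optional slash, so both <div ...> and </div> match
  // \b      word boundary, so "u" does not also match <ul>
  // [^<>]*  any attributes up to the closing bracket
  var rx = new RegExp("<\\/?" + tagName + "\\b[^<>]*>", "gi");
  return html.replace(rx, "");
}

var a = stripTag('<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>', "div");
var b = stripTag("<u>THIS IS A STRING</u>", "u");
var c = stripTag("<ul><li><u>item</u></li></ul>", "u");

console.log(a); // SOME TEXT
console.log(b); // THIS IS A STRING
console.log(c); // <ul><li>item</li></ul>
```

Because only the tags are matched, the replacement is an empty string and no submatch is needed.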
 
Okay, I understand. It was my bad memory of syntax.
[tt]
//s by whatever means
var s='<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'
var rx=new RegExp("<div .*?>(.*?)</div>","i");
s=s.replace(rx,"$1");
[/tt]
Another viable alternative, which I prefer, also handles a div without any attributes.
[tt]
//s by whatever means
var s='<div style="text-align:center;width:inherit;text-color:blue;">SOME TEXT</div>'
//I prefer more
var rx=new RegExp("<div[^>]*>(.*?)</div>","i");
s=s.replace(rx,"$1");
[/tt]
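
The difference between the two patterns above can be sketched as follows. The sample strings and variable names are made up for illustration, and the "g" flag at the end is an addition not used in the posts above.

```javascript
var withSpace = new RegExp("<div .*?>(.*?)</div>", "i");  // needs a space after "div"
var anyAttrs  = new RegExp("<div[^>]*>(.*?)</div>", "i"); // zero or more attribute characters

var styled = '<div style="text-align:center;">SOME TEXT</div>';
var plain  = "<div>SOME TEXT</div>";

var r1 = styled.replace(withSpace, "$1"); // SOME TEXT
var r2 = styled.replace(anyAttrs, "$1");  // SOME TEXT
var r3 = plain.replace(withSpace, "$1");  // <div>SOME TEXT</div> (no match, unchanged)
var r4 = plain.replace(anyAttrs, "$1");   // SOME TEXT

// With "g", every div in the string is unwrapped, not just the first one.
var everyDiv = new RegExp("<div[^>]*>(.*?)</div>", "gi");
var r5 = '<div id="a">ONE</div> <div>TWO</div>'.replace(everyDiv, "$1"); // ONE TWO
```

Note that "." does not match line breaks, so both patterns assume the div's content sits on one line.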
 
SWEET!

That is exactly what I was looking for; specifically, that "$1" was a great help...

My understanding of the $ is that it forces a match at the end of a line. Could you explain a little about its purpose in this syntax?

s=s.replace(rx,"$1");

sorry for the extra question, but that little $1 has fixed some other regexp problems I've been having as well...

I appreciate all the time you spent helping me on this matter.

Thanks

atsea
 
The regexp syntax does look rather unusual, because it evolved from another source (Perl).

>my understanding of the $ is that it forces a match at the end of a line

That is one form of it, when it appears bare (unescaped) in a regexp pattern. But there is also a global RegExp object with properties RegExp.$1...RegExp.$9, which are updated after each successful match. They store the submatches (the sub-patterns grouped in parentheses; here (.*?) is the only one, so its submatch goes into RegExp.$1).

But when a one-liner replace needs the submatch of its own regexp, the syntax is really unusual. You write "$1" (not only do you drop the explicit global RegExp, you must also enclose it in quotes!), and replace substitutes the first submatch of the match it has just made. Passing RegExp.$1 itself does not work there, because its value is read before replace runs its own match.
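
The difference between the two forms can be sketched as follows. This assumes an engine that still supports the legacy RegExp.$1 property (current browsers and Node.js do); the "stale" match is a made-up earlier match, included only to make the effect visible.

```javascript
var s = '<div style="text-align:center;">SOME TEXT</div>';
var rx = new RegExp("<div .*?>(.*?)</div>", "i");

// An unrelated earlier match leaves "stale" behind in RegExp.$1.
/(stale)/.exec("stale");

// RegExp.$1 is read while the arguments are evaluated, before replace()
// performs its own match, so the old submatch is used as the replacement.
var wrong = s.replace(rx, RegExp.$1);

// The string "$1" is interpreted by replace() itself, after it has matched.
var right = s.replace(rx, "$1");

console.log(wrong); // stale
console.log(right); // SOME TEXT
```

If no earlier match has happened, RegExp.$1 is an empty string, which is why the tag and its contents both disappeared.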

As to documentation, I'm sure you can find it on the JavaScript tutorial or documentation sites. I use the MS documentation on JS and the ECMA specification. Most prefer other colorful sites without authority that claim credit for themselves as if they learned from nowhere.
 
atsea, it looks like tsuji has already solved your problem but I thought you might be interested in this code.
I came up with this code to automatically parse through the innerHTML of a form and rewrite the content into a single string that can be used to submit as an email.
In the translation I alter the form tags' properties to make them readonly or disabled as appropriate, so that the submitted version of the email does not allow modification of the submitted form data.
It is a way to easily send a near exact replica of the form page in an HTML email without having to manually create a second email'able version.

In any event, I use a case statement to determine which form field type is currently being looked at, set variables for the tag's beginning and end strings plus a variable for what the new content should be, and pass those into one regular expression. You should be able to easily modify this to use the same regular expression for any type of field you want to parse out, and set the new content to just the former innerText of that field instead of recreating a modified version of the tag as I am doing.
If you want to modify more than one type of tag this could be very useful for you. Though I only modify form fields in this code it should not be hard to modify it for other types of fields.
Code:
function formatform(objForm)
{
  //Grab all embedded and included styles.
  var outstyles = '';
  var arrstyles = document.styleSheets;
  for (var x=0;x<arrstyles.length;x++)
    outstyles = outstyles + '<style type="text/css">' + arrstyles[x].cssText + '</style>';
  var mydiv = escape(document.getElementById('EmailForm').innerHTML);
  //Replace carriage returns with the equivalent character entities.
  var obj = "%0D%0A";
  while (mydiv.match(obj)) { mydiv = mydiv.replace(obj, "&#13;&#10;"); }
  mydiv = unescape(mydiv);
  //Strip out any HTML within span tags containing the word IgnoreMe.
  var oRE = new RegExp('<span IgnoreMe>' + "[^>]*?" + '' + ".*?" + '</span>', "gi");
  mydiv = mydiv.replace(oRE, '');  
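  var sStart='';
  var sMiddle='';
  var sEnd='';      //declared here so it does not leak as an implicit global
  var sOptions;     //result of getOptions() for select fields
  var sNewString='';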
  var elArr = objForm.elements;
  for(var i=0; i<elArr.length; i++)
  { //Loop through and process all form elements then replace the corresponding element in the mydiv HTML string with the modified one.
    var ischecked=(elArr[i].checked)?' CHECKED':'';
    var fldname=(elArr[i].name)?' name="new'+elArr[i].name+'"':'';
    var rawname=(elArr[i].name)?elArr[i].name:'';
    var sMiddle=(elArr[i].name)?' name='+elArr[i].name:'';
    var fldvalue=(elArr[i].value)?' value="'+elArr[i].value+'"':'';
    var rawvalue=(elArr[i].value)?elArr[i].value:'';
    var fldsize=(elArr[i].size)?' size="'+elArr[i].size+'"':'';
    var fldmaxlength=(elArr[i].maxLength)?' maxlength="'+elArr[i].maxLength+'"':'';
    var fldrows=(elArr[i].rows)?' rows="'+elArr[i].rows+'"':'';
    var fldcols=(elArr[i].cols)?' cols="'+elArr[i].cols+'"':'';
    var fldclass=(elArr[i].className)?' class="'+elArr[i].className+'"':'';
    var fldstyle=(elArr[i].style.cssText.length > 0)?' style="'+elArr[i].style.cssText+'"':'';
    switch (elArr[i].type) {
      case 'radio': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="radio" disabled' + ischecked + fldname + fldclass + fldvalue + fldstyle + '>'; break;
      case 'checkbox': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="checkbox" disabled' + ischecked + fldname + fldclass + fldvalue + fldstyle + '>'; break;
      case 'select-one': sOptions = getOptions(objForm, rawname); sStart = '<SELECT'; sEnd = '/SELECT>'; sNewString = '<SELECT disabled' + fldname + fldclass + fldstyle + ' size="' + sOptions.newfldsize + '">' + sOptions.outOptions + '</select>'; break;
      case 'select-multiple': sOptions = getOptions(objForm, rawname); sStart = '<SELECT'; sEnd = '/SELECT>'; sNewString = '<SELECT disabled' + fldname + fldclass + fldstyle + ' size="' + sOptions.newfldsize + '">' + sOptions.outOptions + '</select>'; break;
      case 'text': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="text" readOnly' + fldname + fldvalue + fldsize + fldmaxlength + fldclass + fldstyle + '>'; break;
      case 'textarea': sStart = '<TEXTAREA'; sEnd = '/TEXTAREA>'; sNewString = '<TEXTAREA readOnly' + fldname + fldrows + fldcols + fldclass + fldstyle + '>' + rawvalue + '</TEXTAREA>'; break;
      case 'button': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="button" style="visibility:hidden;"' + fldname + fldclass + '>'; break;
      case 'submit': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="submit" style="visibility:hidden;"' + fldname + fldclass + '>'; break;
      case 'reset': sStart = '<INPUT'; sEnd = '>'; sNewString = '<INPUT type="reset" style="visibility:hidden;"' + fldname + fldclass + '>'; break;
      default: sNewString = '';
    }
    if (sNewString != '' && sMiddle != '')
    {
      var oRE = new RegExp(sStart + "[^>]*?" + sMiddle + ".*?" + sEnd, "i");
      mydiv = mydiv.replace(oRE, sNewString);
    }
  }
  var outhtml='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><META http-equiv=Content-Type content="text/html; charset=iso-8859-1">' + outstyles + '</head><body>' + mydiv + '</BODY></HTML>';
  document.getElementById('EmailOut').value = outhtml;
  return true;
}

function getOptions(f,e)
{ //Returns a formatted select field showing the selected options. Accommodates multiple selections.
  var arrOptions = f[e].options;
  var outsel = '';
  var outunsel = '';
  var outblank='';
  var cntsel = 0;
  for (var x=0;x<arrOptions.length;x++)
  {
    var outoption = '<option value="' + arrOptions[x].text + '">'+arrOptions[x].text+'</option>';
    if (arrOptions[x].selected) {
      outsel = outsel + outoption;
      cntsel++;
    }
    else {
      outunsel = outunsel + outoption;
    }
  }
  if (cntsel >= f[e].size) {
    var newfldsize = cntsel;
    var blanks = 0;
  }
  else {
    var newfldsize = f[e].size;
    for (var x=0;x<(f[e].size-cntsel);x++)
      outblank=outblank+'<option value=" "> </option>';
  }
  var outOptions = outsel + outblank + outunsel;
  return {outOptions : outOptions, newfldsize : newfldsize};
}

There is a bit of code in there to cause the function to skip any data stored in a span beginning <span IgnoreMe> as well so if you automate the parsing of an entire block of code you can flag pieces to be left untouched.
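
The core idea of the function above, one regular expression assembled from a start string, a name and an end string, can be tried on a plain string without a page. This is a minimal sketch, not the original code: replaceNamedTag and escapeRegExp are hypothetical helpers, and the optional quotes around the name are an assumption so that both name=first and name="first" are found.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace the first tag that begins with sStart, carries
// the given name attribute, and ends with sEnd.
function replaceNamedTag(html, sStart, fieldName, sEnd, sNewString) {
  // The lookahead keeps "first" from also matching name="firstname".
  var sMiddle = ' name="?' + escapeRegExp(fieldName) + '"?(?=[\\s>/])';
  var oRE = new RegExp(sStart + "[^>]*?" + sMiddle + ".*?" + sEnd, "i");
  // A function replacement keeps any "$" in sNewString from being interpreted.
  return html.replace(oRE, function () { return sNewString; });
}

var html = '<p>Name: <INPUT type="text" name="first" value="Ann"></p>';
var out = replaceNamedTag(html, "<INPUT", "first", ">",
  '<INPUT type="text" readOnly name="newfirst" value="Ann">');

console.log(out); // <p>Name: <INPUT type="text" readOnly name="newfirst" value="Ann"></p>
```

The same call with "<TEXTAREA" and "/TEXTAREA>" as start and end strings would cover tags that have a separate closing tag.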


It's hard to think outside the box when I'm trapped in a cubicle.
 
*gasp*

wow, thanks thenightowl.

That's actually something I've been trying to accomplish for a while now... although I have recently been spending time fooling around with RegExps.

your function is WAY better than the one I came up with, it makes the email look a lot cleaner.

I've been diving in to RegExp's in order to format some of the data as it is saved to a DB.

My thanks to all who contributed to this thread... it was a great help.

atsea
 
My goal was to render the email version as completely true to the original as possible. The biggest problem was with select boxes as their size will always vary based on the longest option in the box.
I was originally converting the box into a text field displaying the selected option but the original select width could not be readily determined and used to set the width of the text box. If I left it as a select box with only the selected option the width could still be off if the length of the selected option is not as great as the longest option that was originally in that select.
My reason for worrying about this detail is that I intended the script to be a plug in script for others to use and if they have badly formatted pages a change in width of one field might throw everything else off.
I had the further difficulty of having to deal with multiple select boxes. My eventual solution if I remember correctly was to include ALL the options originally there so width would not be affected but to put all the selected options together and position them viewable with the height of the box appropriate to the number of options selected.
The only problem is that a multiple-select that was originally only one line high could be multiple lines high in the output and throw off the layout of the email version.
In the rare circumstance that someone has such badly designed HTML AND they are using a multiple select box, then they just have to live with the problem. I can't accommodate EVERYTHING. :)

For button and submit type fields I replicate them but set them to hidden so they still take up the same screen space as original to keep from throwing off any page formatting.

I have considered searching for and removing any script tags in the code as well and have been considering what to do about any images on the page.
Also, I planned on researching the use of the multipart/alternative MIME type so that I can generate a text AND an HTML version of the form that would hopefully display only text or only HTML depending on the recipient's setup.

Let me know how it works for you and if you have any thoughts on improvements. I have not had time to work on it in a while though it is pretty usable as it sits.

Here is the documentation I keep in the script file when I give it to others to use at work.
Code:
<!--  This code will read any form data from within the EmailForm div tag and process it to become the output for an email
version of the form, thus removing the need to maintain two copies of the form code.
NOTE: If form is passed through email to a system that does not support HTML (like our secure email system)
the values will not show through. Might need to setup an alternate output.

USAGE:
1. Put the function into the head of your page.
2. Encapsulate the area of the form that you want to send as an email in a div tag with ID EmailForm. <div id="EmailForm">
3. Add this hidden field into your form: <input type="hidden" id="EmailOut" name="EmailOut">
4. Prevent sections of code from passing to email by enclosing it in a span tag: <span IgnoreMe>...code you do not want to show</span>

The function should be called prior to submitting the form.  You could call it directly from an onsubmit function in your
form tag or at the end of your validation code.  
You MUST pass the form object to the function.
Example: <form id="testform" method="post" action="mypage.asp" onsubmit="return formatform(testform)">
The name of the form passed to the formatform function should NOT be enclosed in quotes.  This passes a reference to
the form object, not just the form name.

In your ASP page you can retrieve the value of the hidden field EmailOut to be used as the body of your email message.

NOTE: In order for a field to be processed by this script it MUST have a name attribute.  
If you have a type="button" or type="submit" but no name attribute with it, it will be ignored by the script.

What the code does:
Text and Textarea fields are altered to include the readOnly property.
Select fields: If the number of selected options (in a multiple select) exceeds the defined size then the
field size is altered to accommodate the number of selections. The non-selected options are included in the output
so that the field will maintain the same width as the original form but blank options are added in between the selected
and non-selected options to push them below the viewable window so they do not actually show in the output. The disabled property is set.
Radio and Checkbox fields have the Disabled property set.
Button and Submit types have their visibility style set to hidden.  This keeps them in the page so formatting is not disturbed,
but it removes the possibility that they could be clicked.

NOTE: A future revision might need to loop through the entire page to remove embedded code in <script, <object, <A tags and any action events.
-->

It's hard to think outside the box when I'm trapped in a cubicle.
 