Hello friends,
I usually program in VB6 and am still very new to C#, but I feel that I can no longer avoid it.
Here's my dilemma (please forgive the amount of detail):
We receive XML files. The node contents vary from plain text to HTML. The HTML can be in CDATA sections, escaped into entities, half escaped, or doubly escaped, sometimes with partially invalid tags, and often rather...umm...just bad!
This means they can be any of these:
<seg>yadda</seg>
<seg><p>yadda...</seg>
<seg>&lt;p&gt;yadda...</seg>
<seg><![CDATA[<html><p>yadda...]]></seg>
Get an idea what I'm dealing with?
What I'm trying to do is process this XML before production and normalize every node that has HTML content into a CDATA section containing proper HTML.
This is how I achieve this (htm contains the unprocessed content of that node):
Code:
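// Decode twice so that doubly escaped content is also unescaped.
htm = System.Web.HttpUtility.HtmlDecode(System.Web.HttpUtility.HtmlDecode(htm));
// Clear the node before writing the new content.
tuv.SelectSingleNode("myns:seg", nsmgr).InnerText = "";
// Treat the content as HTML if it contains at least one "<".
int pos = htm.IndexOf("<");
if (pos >= 0)
{
    System.Xml.XmlCDataSection cd = tmx.CreateCDataSection(htm);
    tuv.SelectSingleNode("myns:seg", nsmgr).AppendChild(cd);
}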
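As you can see, I decode twice in order to also cover doubly escaped stuff like "<seg>&amp;lt;p&amp;gt;yadda" (yes, we also get stuff like that).
This works, but it also unescapes ampersands and < > within HTML text, which is bad. To remedy this, I thought of loading the string into an HTML object: when I read back the innerHTML I just loaded into the object, it is properly HTML escaped.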
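The side effect is easy to reproduce in isolation. The sketch below is only an illustration, not the production code: it assumes `System.Net.WebUtility.HtmlDecode` as a stand-in for `System.Web.HttpUtility.HtmlDecode` (so it runs without a System.Web reference), and the sample string is made up. It decodes a singly escaped fragment once and then a second time.

```csharp
using System;
using System.Net;

class DecodeTwiceDemo
{
    static void Main()
    {
        // Hypothetical sample: "<p>Fish &amp; chips</p>" escaped once into entities.
        string escapedOnce = "&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;";

        // First pass restores the markup; the ampersand in the text stays escaped.
        string first = WebUtility.HtmlDecode(escapedOnce);
        Console.WriteLine(first);   // <p>Fish &amp; chips</p>

        // Second pass also unescapes the ampersand inside the text.
        string second = WebUtility.HtmlDecode(first);
        Console.WriteLine(second);  // <p>Fish & chips</p>
    }
}
```

So the second pass is what repairs doubly escaped input, and it is also what damages input that was escaped only once.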
I first did this in VB6 and it works just fine. Alas, it doesn't in C#.
Here's what I tried:
Code:
if (pos >= 0)
{
    HtmlAgilityPack.HtmlDocument htmobj = null;
    //htmobj.LoadHtml("<html></html>");
    //htmobj.OptionReadEncoding = false;
    htmobj.LoadHtml(htm); // this line throws the NullReferenceException
    htm = htmobj.ToString();
The htmobj.LoadHtml(htm) line throws a NullReferenceException, and I don't know why!
I've already tried with the two commented lines, as well as with:
Code:
HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
That, however, tries to locate HtmlDocument.cs, which cannot be found.
I've switched to HAP (HTML Agility Pack) because MSHTML couldn't hack it for me either (too restrictive).
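For comparison, here is a minimal sketch of the instantiate-then-load pattern, assuming the HtmlAgilityPack package is referenced by the project. `Normalize` is a hypothetical helper name, and reading `DocumentNode.OuterHtml` instead of calling `ToString()` is an assumption about how to get the markup back out. Whether HAP re-escapes text the way the VB6 HTML object did is not verified here.

```csharp
using System;

class HapRoundTrip
{
    // Hypothetical helper: parse an HTML fragment and serialize it back to a string.
    static string Normalize(string htm)
    {
        // The variable must hold an instance; calling a method on a null
        // reference throws NullReferenceException.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(htm);

        // Assumption: OuterHtml of the root node is the serialized markup.
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        // Sample input taken from the <seg> examples: an unclosed paragraph.
        Console.WriteLine(Normalize("<p>yadda..."));
    }
}
```

The type name is fully qualified, as in the snippets above, so it cannot be confused with another `HtmlDocument` class that may be in scope.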
Can you give me a hint on what I'm doing wrong?
I've already googled my fingers off.
Thanks for any help!
MakeItSo
“Knowledge is power. Information is liberating. Education is the premise of progress, in every society, in every family.” (Kofi Annan)
Oppose SOPA, PIPA, ACTA; measures to curb freedom of information under whatever name whatsoever.