C# Noob: HTML object NullReferenceException error - out of clues...

MakeItSo

Programmer
Oct 21, 2003
DE
Hello friends,

I usually program in VB6 and am still very new to C#, but I feel that I can no longer avoid it.
Here's my dilemma (please forgive the amount of detail):
We receive XML files. The node contents vary from plain text to HTML. The HTML can be in CDATA sections, escaped into entities, half escaped or doubly escaped, sometimes with partially invalid tags, and often rather...umm...just bad!
This means they can be any of these:
<seg>yadda</seg>
<seg>&lt;p&gt;yadda...</seg>
<seg>&lt;p>yadda...</seg>
<seg><![CDATA[<html><p>yadda...]]></seg>

Get an idea what I'm dealing with?

What I'm trying to do is process this XML before production and unify all nodes with HTML content to CDATA sections with proper HTML content.
This is how I achieve this (htm contains the unprocessed content of that node):
Code:
htm = System.Web.HttpUtility.HtmlDecode(System.Web.HttpUtility.HtmlDecode(htm));
tuv.SelectSingleNode("myns:seg", nsmgr).InnerText = "";
int pos = htm.IndexOf("<");
if (pos >= 0)
{
	System.Xml.XmlCDataSection cd = null;
	cd = tmx.CreateCDataSection(htm);
	tuv.SelectSingleNode("myns:seg", nsmgr).AppendChild(cd);
}
As you can see, I decode twice - in order to also cover doubly escaped stuff like "<seg>&amp;lt;p&amp;gt;yadda" - yes, we also get stuff like that.
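
For anyone who wants to try the approach above in isolation, here is a minimal self-contained sketch. The namespace URI urn:example:myns, the class name SegNormalizer and the sample XML in the note below are made up for illustration, and System.Net.WebUtility.HtmlDecode stands in for System.Web.HttpUtility.HtmlDecode so that no System.Web reference is needed. It is not the production code, just the same idea under those assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Xml;

public static class SegNormalizer
{
    // Hypothetical namespace URI; the real files use their own.
    public const string Ns = "urn:example:myns";

    // Decodes every <seg> twice and wraps it in CDATA if it contains markup.
    public static string Normalize(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("myns", Ns);

        // Copy the matches first so the tree is not modified while iterating.
        var segs = new List<XmlNode>();
        foreach (XmlNode seg in doc.SelectNodes("//myns:seg", nsmgr))
            segs.Add(seg);

        foreach (XmlNode seg in segs)
        {
            // Two passes cover doubly escaped input such as &amp;lt;p&amp;gt;
            string htm = WebUtility.HtmlDecode(WebUtility.HtmlDecode(seg.InnerText));
            if (htm.IndexOf('<') < 0)
                continue; // plain text, leave the node as it is

            // Drop the old children, then append one CDATA section.
            while (seg.HasChildNodes)
                seg.RemoveChild(seg.FirstChild);
            seg.AppendChild(doc.CreateCDataSection(htm));
        }
        return doc.OuterXml;
    }
}
```

Calling SegNormalizer.Normalize on a document whose seg nodes hold "yadda", "&lt;p&gt;yadda" and "&amp;lt;p&amp;gt;yadda" should leave the first node alone and turn the other two into CDATA sections containing "<p>yadda".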

This works - but it also unescapes ampersands and < > within HTML text, which is bad. To remedy this, I thought of loading the string into an HTML object. When I read back the innerHTML I just loaded into the object, it is properly HTML-escaped.
I first did this in VB6 and it works just fine there. Alas, it doesn't in C#.
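
To make the side effect concrete, here is a small sketch of what the double decode does to text inside the HTML. The class name DecodeDemo and the sample strings are invented for illustration, and WebUtility.HtmlDecode is again used as a stand-in for HttpUtility.HtmlDecode:

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    // Same two-pass decode as in the snippet above.
    public static string DecodeTwice(string s)
    {
        return WebUtility.HtmlDecode(WebUtility.HtmlDecode(s));
    }
}
```

DecodeDemo.DecodeTwice("&amp;lt;p&amp;gt;yadda") gives "<p>yadda", which is what I want. But DecodeTwice("<p>Fish &amp; chips, 1 &lt; 2</p>") gives "<p>Fish & chips, 1 < 2</p>", so the entities that belonged to the text are gone and the result is no longer proper HTML.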

Here's what I tried:
Code:
if (pos >= 0)
{
	HtmlAgilityPack.HtmlDocument htmobj=null;
	//htmobj.LoadHtml("<html></html>");
	//htmobj.OptionReadEncoding = false;
	htmobj.LoadHtml(htm); // <-- this line throws
	htm = htmobj.ToString();
}
The line marked "this line throws" raises a NullReferenceException - and I don't know why!
I've already tried with the two commented lines, as well as with:
Code:
HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
That however tries to locate HtmlDocument.cs which cannot be found.
I've switched to HAP because MSHTML couldn't hack it for me either (too restrictive).
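
A side note on the first attempt: in C#, a variable that is set to null does not refer to any object, so calling an instance method on it throws a NullReferenceException no matter which library the type comes from. The sketch below shows this with System.Text.StringBuilder as a stand-in (an assumption made only so that it runs without HAP; the class name NullDemo is made up):

```csharp
using System;
using System.Text;

public static class NullDemo
{
    // Calls a method on a null reference, like htmobj = null followed by LoadHtml.
    public static string CallOnNull()
    {
        StringBuilder sb = null; // declared, but no object was ever created
        try
        {
            sb.Append("x");
            return "no exception";
        }
        catch (NullReferenceException)
        {
            return "NullReferenceException";
        }
    }

    // Same call on a real instance created with new.
    public static string CallOnInstance()
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("x");
        return sb.ToString();
    }
}
```

So the object has to be created with new before LoadHtml can be called on it, whatever else is going on with the DLL.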

Can you give me a hint on what I'm doing wrong?
I've already googled my fingers off.

Thanks for any help!
MakeItSo

“Knowledge is power. Information is liberating. Education is the premise of progress, in every society, in every family.” (Kofi Annan)
Oppose SOPA, PIPA, ACTA; measures to curb freedom of information under whatever name whatsoever.
 
Aaaargh!

Sorry guys, solved.
Guess what: I do have .NET Framework 4, but I needed the HAP files built for .NET 2.
With the .NET 2 DLL, this works without any hiccups:
Code:
HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
htmobj.LoadHtml(htm);
The .ToString() part doesn't work yet...

VB6 is so beautifully simple in comparison!

Anyway, maybe it helps someone else in the future...

Cheers,
MiS

Last update to conclude this thingy:
Code:
HtmlAgilityPack.HtmlDocument htmobj = new HtmlAgilityPack.HtmlDocument();
htmobj.LoadHtml(htm);
htm = htmobj.DocumentNode.InnerHtml;
Works like a charm now and even converts all HTML tags to lower case, so I don't even have mixed case anymore. What a beaut!
