HTML Chars in XML

Status
Not open for further replies.

Sarky78

Programmer
Oct 19, 2000
878
GB
Hi guys,

I have an XML file that was created using PHP, and I am now trying to read it back in. The problem is that wherever I replaced HTML special characters using htmlspecialchars, I get what appears to be a line break when I read the data back in.

So if in the XML document I have got something like this:

Liverpool's Players

When I come to read it back out from the XML document I get:

Liverpool
'
s Players


I'm using the usual XML reading code:

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}//if
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $ErrorMessage = "An error has occurred while importing the XML data. ";
        $ErrorMessage = $ErrorMessage . "The XML file that was being imported was $file";
    }//if
}//while
xml_parser_free($xml_parser);

And this is the code that I am using to output the ThemeName:

} else if (($NodeName == 'THEMENAME') && (Trim($data) != '')) {
    echo nl2br($data) . "<BR>";
    $ThemeName = AddSlashes($data);

At the moment this is the only element that I am looking at, but I am assuming that all of the other elements in the XML document are going to have the same problem. I am using PHP 4.3.1 on a win2k server.

Any help would be appreciated.

Tony
 
Did you try the entity name instead of the single quote "sign"?

The entity name is &rsquo;. Maybe this will do the trick?

Or, within PHP, like this:

Code:
$data = htmlentities($data, ENT_QUOTES);
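
For reference, a minimal sketch of what that call produces for the string from the first post. The variable names are illustrative only. One caveat worth keeping in mind: the numeric reference emitted for the apostrophe is valid in XML, but XML itself predefines only five named entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;apos;), so the named HTML entities that htmlentities emits for other characters may not be accepted by an XML parser. htmlspecialchars leaves those characters alone.

```php
<?php
// Illustrative input taken from the first post.
$data = "Liverpool's Players";

// ENT_QUOTES converts both double and single quotes.
$encoded = htmlentities($data, ENT_QUOTES);
echo $encoded . "\n";   // Liverpool&#039;s Players

// htmlspecialchars with the same flag escapes only the characters
// that are special in markup and leaves everything else untouched.
$special = htmlspecialchars($data, ENT_QUOTES);
echo $special . "\n";   // Liverpool&#039;s Players
```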

Regards


Jakob
 
I can tell I wasn't clear.

Consider this rdf item:

Code:
<item rdf:about="http://www.printondemand.com/MT/archives/004624.html">
<title>PrintMedia Conference and Expo in Philly March 7-9</title>
<link>http://www.printondemand.com/MT/archives/004624.html</link>
<description>Conference to feature &quot;Semantic Technologies Workshop&quot; developed by IDEAlliance on March 9....</description>
<dc:subject>Conferences &amp; Expos</dc:subject>
<dc:date>2005-01-06T12:39:20-08:00</dc:date>
</item>

When I parse this with PHP (I can show you the code, if you like), the content of the description tag is split into multiple pieces, breaking at the HTML entities.

So, if my code is set to output the contents of description within HTML paragraph tags, I will get

Code:
<p>Conference to feature</p>
<p>"</p>
<p>Semantic Technologies Workshop</p>
<p>"</p>
<p> developed by IDEAlliance on March 9....</p>

This happens whether or not I use the html_entity_decode function.

I'm wondering why HTML entities break the description into multiple parts.

My PHP:

Code:
<?php

  $file = "http://www.printondemand.com/MT/index.rdf";

  $articleMax = 10;
  $articleCount = 1;
  $currentTag = "";
  $flag = "";

  $myRSS = "";

  $xp = xml_parser_create();
  xml_set_element_handler($xp, "elementBegin", "elementEnd");
  xml_set_character_data_handler($xp, "characterData");
  xml_parser_set_option($xp, XML_OPTION_CASE_FOLDING, TRUE);
  xml_parser_set_option($xp, XML_OPTION_TARGET_ENCODING, "UTF-8");


  function myRSSfunc()
  {
    global $file, $xp, $myRSS, $articleCount, $articleMax;

    if (!($fp = fopen($file, "r")))
    {
      die("Could not read $file");
    }

    while ($xml = fread($fp, 4096))
    {

      if ($articleCount >= $articleMax)
      {
        break;
      }

      if (!xml_parse($xp, $xml, feof($fp)))
      {
        die("XML parser error: " .xml_error_string(xml_get_error_code($xp)));
      }

    }

    xml_parser_free($xp);
    return $myRSS;
  }


  function elementBegin($parser, $name, $attributes)
  {
    global $currentTag, $flag, $myRSS;

    $currentTag = $name;
    // if within an item block, set a flag
    if ($name == "ITEM")
    {
      $flag = 1;

      $myRSS = $myRSS."<div>";
    }
  }


  function elementEnd($parser, $name)
  {
    global $currentTag, $flag, $myRSS, $articleCount;
    $currentTag = "";
    if ($name == "ITEM")
    {
      $myRSS = $myRSS."</div>";
      $articleCount++;
      $flag = 0;
    }
  }


  function characterData($parser, $data)
  {
    global $currentTag, $flag, $myRSS;
    // if within an item block, print item data
    if (($currentTag == "TITLE" || $currentTag == "LINK" || $currentTag == "DESCRIPTION") && $flag == 1)
    {
      if ($currentTag == "TITLE")
      {
        $myRSS = $myRSS."<span style='font-family:Verdana; font-size:8pt; font-weight:normal; padding:0px; color:#066;'>".$data."</span>";
      }

      if ($currentTag == "LINK")
      {
        $myRSS = $myRSS."&nbsp;&nbsp;<a style='font-family:Verdana; font-size:8pt; font-weight:normal; color:#C00; text-decoration: underline;' target='_blank' href='".$data."'>view</a>";
      }

      if ($currentTag == "DESCRIPTION")
      {
        $myRSS = $myRSS."<p style='font-family:Verdana; font-size:8pt; font-weight:normal;'>".html_entity_decode($data)."</p>";
      }

    }
  }

  $x = myRSSfunc();
  echo $x;

?>

Thomas D. Greer

Providing PostScript & PDF
Training, Development & Consulting


Haiku workshop and community.
 
Hi,

Well... strange. Very strange!

The exact same thing seems to be happening with arperry's code.

Did you try to escape the quotes, like \" and \'?

I should say that I don't use XML yet! However, have a look at this bug/fix report:


Does this explain it? If not I will look further with you to solve the problem! There's nothing like a good challenge...

Regards


Jakob

PS. What PHP version are you guys using?
 
What I think is happening here is that the character data handler is called "again" each time the parser reaches an HTML entity, so the text of a single tag arrives in several pieces.
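
A minimal sketch that makes this visible, assuming a one-element test document and a hypothetical helper named collectChunks (neither is from the feed above). It records every string passed to the character data handler. How many pieces you get depends on the underlying parser library and PHP version, so the sketch prints the count rather than assuming it. What can be relied on is that the pieces, joined in order, give the full text.

```php
<?php
// Hypothetical helper: returns every chunk the character data handler receives.
function collectChunks($xml)
{
    $chunks = array();

    $xp = xml_parser_create();
    xml_set_character_data_handler(
        $xp,
        function ($parser, $data) use (&$chunks) {
            $chunks[] = $data;   // one entry per handler call
        }
    );
    xml_parse($xp, $xml, true);

    return $chunks;
}

// Illustrative input modelled on the description element in this thread.
$chunks = collectChunks(
    '<description>Conference to feature &quot;Semantic Technologies Workshop&quot; developed</description>'
);

echo count($chunks) . " handler call(s)\n";
echo implode("", $chunks) . "\n";
```

If the count is greater than one, wrapping each call's data in its own paragraph tag gives exactly the broken output shown earlier in the thread.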

What I've done is modify my code to set a flag when it first encounters a LINK, TITLE, or DESCRIPTION tag. If the handler is then called AGAIN for that tag with the flag set, I assume it's parsing an entity within the same tag, and I concatenate the string without adding the HTML tags. This "fixes" the problem.

It makes the code a bit more convoluted, but that's the life of a programmer.

Keep in mind that I'm not producing the XML, I'm consuming it. So I can't alter the XML, I can only write code to parse it.
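
For anyone who wants a starting point, here is a rough sketch of a variation on the same idea. It is not the flag-based code described above: instead of a flag, it appends everything the character data handler receives to a buffer and only emits the paragraph when the closing tag arrives. The function name extractDescriptions and the sample input are made up for illustration, and the htmlspecialchars call on output is an added assumption (it treats the description as plain text), so drop it if the feed's descriptions are meant to carry markup.

```php
<?php
// Sketch: buffer character data per element, emit once at the end tag.
// Assumes simple (non-mixed) content, as in the RDF item shown earlier.
function extractDescriptions($xml)
{
    $buffer = "";             // text collected for the element currently open
    $descriptions = array();  // finished <p> blocks

    $xp = xml_parser_create();
    xml_parser_set_option($xp, XML_OPTION_CASE_FOLDING, true);

    xml_set_element_handler(
        $xp,
        function ($parser, $name, $attributes) use (&$buffer) {
            $buffer = "";     // a new element starts: begin with an empty buffer
        },
        function ($parser, $name) use (&$buffer, &$descriptions) {
            if ($name == "DESCRIPTION") {
                $descriptions[] = "<p>" . htmlspecialchars(trim($buffer)) . "</p>";
            }
            $buffer = "";
        }
    );

    xml_set_character_data_handler(
        $xp,
        function ($parser, $data) use (&$buffer) {
            $buffer .= $data; // may run several times for one element
        }
    );

    if (!xml_parse($xp, $xml, true)) {
        throw new RuntimeException(
            "XML parser error: " . xml_error_string(xml_get_error_code($xp))
        );
    }

    return $descriptions;
}

// Illustrative input only.
$rdf = '<item><title>T</title><description>Conference to feature '
     . '&quot;Semantic Technologies Workshop&quot; developed by IDEAlliance</description></item>';

foreach (extractDescriptions($rdf) as $paragraph) {
    echo $paragraph . "\n";
}
```

Because the output is built at the end tag, it does not matter how many times the handler fires in between.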




Thomas D. Greer
