simpleXML parser error. Char 0xDC39 out of allowed range 1

Status
Not open for further replies.

beltmanjr

Technical User
Oct 29, 2007
333
NL
Hi all,
after searching Google for a quick fix, I came to the conclusion that either I can't search well or there isn't a quick fix.

I'm getting the following errors:
Entity: line 477: parser error : Char 0xDC39 out of allowed range in
Entity: line 477: parser error : PCDATA invalid Char value 56377 in

I'm reading XML from an external source and think the part that contains the characters �� is causing the problem.

I'm putting the external source into a new SimpleXMLElement.

How can I fix these issues?
 
these errors tend to be caused by invalid xml.

if you post the xml you are working with we can try to diagnose further.
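
To narrow down which part of the document is invalid, libxml's errors can be collected instead of emitted as warnings. The following is only a minimal sketch: the helper name `xmlParseErrors` and the sample document are made up for illustration, and the byte sequence `\xED\xB0\xB9` is assumed to be how the lone surrogate U+DC39 (decimal 56377) ends up in the input. The exact message text depends on the libxml2 version.

```php
<?php
// Hypothetical helper: parse $doc and return libxml's messages instead of emitting warnings.
function xmlParseErrors(string $doc): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $xml = simplexml_load_string($doc);
    $messages = [];
    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
        }
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Assumption: "\xED\xB0\xB9" is what a lenient encoder writes for the lone surrogate U+DC39.
$broken = "<body>\n<p>bad: \xED\xB0\xB9</p>\n</body>";
foreach (xmlParseErrors($broken) as $message) {
    echo $message, PHP_EOL;
}
```

The reported line and column point at the first offending character, which makes it easier to see what the source page actually contains at that spot.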
 
the url is a webpage with normal web-type data on it.
what is the data you are trying to extract, and what is the format of the xml you are trying to output?
 
Hi jpadie,
Thanks for responding.

The script tries to extract the complete body.
The code is below.

Code:
             $config1 = array(
                        'show-body-only' => TRUE,
                        'quote-ampersand' => FALSE,
                        'quote-nbsp' => FALSE,
                        'output-encoding' => 'UTF8',
                        'quiet' => TRUE,
                        'show-warnings' => FALSE,
                        'tidy-mark' => FALSE,
                        'indent' => 0,
                        'wrap' => 0,
                        'clean' => TRUE,
                        'bare' => TRUE,
                        'drop-font-tags' => TRUE,
                        'drop-proprietary-attributes' => TRUE,
                        'hide-comments' => TRUE,
                        'numeric-entities' => FALSE,
                        'write-back' => TRUE);
            // create an object from the string
            $tidy = new tidy;
            $tidy->parseString($body, $config1, 'utf8');
            $tidy->cleanRepair();
            // and do it all again, to add an XML header...
            $config2 = array(
                        'add-xml-decl' => TRUE,
                        'bare' => TRUE,
                        'clean' => TRUE,
                        'drop-font-tags' => TRUE,
                        'drop-proprietary-attributes' => TRUE,
                        'force-output' => TRUE,
                        'hide-comments' => TRUE,
                        'indent' => 0,
                        'numeric-entities' => FALSE,
                        'output-xml' => TRUE,
                        'output-encoding' => 'UTF8',
                        'quiet' => TRUE,
                        'quote-ampersand' => FALSE,
                        'quote-nbsp' => FALSE,
                        'show-warnings' => FALSE,
                        'tidy-mark' => FALSE,
                        'wrap' => 0,
                        'write-back' => TRUE);
            $tidy2 = new tidy;
            $tidy2->parseString((string)$tidy, $config2, 'utf8');
            $tidy2->cleanRepair();

            $simple_xml_element = new SimpleXMLElement((string)$tidy2);
 
I cannot see the point of the exercise you are undertaking, so I am finding it difficult to foresee what the intended result will be. I don't see any practical value in turning structured html into structured xml.

perhaps if you can explain the why of what you are doing it might help.

also this line
Code:
$tidy2->parseString((string)$tidy, $config2, 'utf8');
looks wrong. surely you are wishing to get the output of the $tidy class? in which case why not use
Code:
$tidy2->parseString($tidy->html(), $config2, 'utf8');
$tidy2->cleanRepair();
$simple_xml_element = simplexml_import_dom($tidy2->html());
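
One caveat on the snippet above: simplexml_import_dom() expects a DOM node, whereas tidy::html() returns a tidyNode, so the two do not plug together directly. A rough sketch of the DOM route, skipping tidy altogether, might look like the following. The helper name `htmlToSimpleXml` and the sample markup are assumptions for illustration, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround for telling loadHTML() that the input is UTF-8; its behaviour may vary between libxml2 versions.

```php
<?php
// Hypothetical helper: parse an HTML fragment leniently and expose it as SimpleXML.
function htmlToSimpleXml(string $html): SimpleXMLElement
{
    // Collect parser warnings about sloppy HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Assumption: the input is UTF-8; the prefix hints that to the HTML parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // documentElement is the <html> element that the parser builds around the fragment.
    return simplexml_import_dom($dom->documentElement);
}

$xml = htmlToSimpleXml('<p class="text_plain_1">Fish and Chips</p>');
$nodes = $xml->xpath('//p[@class="text_plain_1"]');
echo (string) $nodes[0], PHP_EOL;
```

The resulting object supports the same node lookups as one built with new SimpleXMLElement(), so code that searches for specific nodes afterwards should not need to change much.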
 
The truth of the matter is that I inherited this code. I’m not planning to rewrite the whole thing; the time simply isn’t there, so I will try to fix the code as it stands.

The script puts the HTML body into an XML format. It then uses the XML to find specific nodes and take their contents, which are stored in a database for later use.

Unfortunately the current code reports invalid XML characters because the content contains some weird characters.
 
ok. now let's focus on what data you want. once I understand this it will be a minute or two's work to extract the data in a more cogent fashion.

alternatively, you might consider running the output of the content grab through a text converter to make sure that it is all in proper utf-8. the current charset looks more like an ISO variant, but it is difficult to tell as the content provider makes no effort to specify the output charset in either the response header or the <head> information.
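
As a rough illustration of that conversion step, the sketch below leaves input that is already valid UTF-8 untouched and otherwise converts it from an assumed source charset. The helper name `toUtf8` and the ISO-8859-1 default are guesses (the site does not declare a charset), and the mbstring extension is assumed to be installed.

```php
<?php
// Hypothetical helper: return $raw as UTF-8, converting only when it is not already valid UTF-8.
function toUtf8(string $raw, string $assumedCharset = 'ISO-8859-1'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already well-formed UTF-8, leave untouched
    }
    // Assumption: anything that is not valid UTF-8 is in $assumedCharset.
    return mb_convert_encoding($raw, 'UTF-8', $assumedCharset);
}

$latin1 = "ingredi\xEBnten"; // "ingrediënten" encoded as ISO-8859-1
echo toUtf8($latin1), PHP_EOL;
```

Running the fetched page through something like this before handing it to tidy avoids feeding tidy bytes it has been told to treat as UTF-8 when they are not.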
 
The part I'm trying to take from the site is
Code:
Klaassen Horecamakelaardij biedt aan: Gunstig gelegen in het centrum van Harlingen. Aan de Grote Bredeplaats van deze historische havenstad vindt u “Fish and Chips”. Een keurig en kompleet ingericht cafetariabedrijf met, zoals de naam al doet vermoeden, vis en eigengemaakte friet als specialiteit. 

Aan de counter en aan tafeltjes is plaats voor ca. 30 personen. Het terras is gelegen aan de zonnige kant van het plein en biedt plaats aan ca. 40 personen. 

De gemeente Harlingen heeft besloten om de Grote Bredeplaats in te richten als horeca/evenementenplein. De zonzijde wordt tijdens het hoogseizoen gesloten voor verkeer en zodoende volledig beschikbaar voor terrassen. Hierdoor is de mogelijkheid ontstaan om de zitplaatsen op het terras uit te breiden tot meer dan 100 in plaats van 40. 

Het bedrijf is in 2009 compleet nieuw ingericht. Achter de counter worden, in het zicht van de gasten, de gerechten bereid. Achter in de zaak bevindt zich nog een aparte spoelkeuken en de toiletgroepen. 

De huidige ondernemer exploiteert het bedrijf sinds juni 2009. De omzet is veelbelovend. Er is een volledige vergunning. Vanwege dubbele zaken van de eigenaar wordt dit bedrijf nu ter verkoop aangeboden. 

Deze zaak beschikt over alle ingrediënten voor een succesvolle onderneming. Er mist nog een nieuwe, enthousiaste ondernemer of ondernemersstel. 

De gunstige ligging, smaakvolle inrichting en complete uitrusting brengen niet alleen nu al een veelbelovende omzet maar bieden nog volop kansen voor de toekomst. 

?? Compleet bedrijf! 
?? Gunstige ligging! 
?? Nieuwe inrichting! 

Voor meer informatie: Horecamakelaar René Ausema 06 51 981 524 
Vraagprijs exploitatie: Bod gevraagd! 
Huurprijs registergoed: € 1700 pm.
 
Code:
function getExcerpt($url){
 $cH = curl_init($url);
 curl_setopt($cH, CURLOPT_RETURNTRANSFER, true);
 $data = curl_exec($cH);
 $pattern = '/<P CLASS="text_plain_1">(.*?)<\/p>/ims';
 $result = preg_match($pattern, $data, $match);
 if ($result) return $match[1];
 return false;
}
$myText = getExcerpt('http://www.horecasite.nl/aanbod_3.php?id=18562');
 
Hi jpadie,
this is only one of the sites we're looking at. However, your method is a lot easier than what is being used in the current code.

For now I will simply replace any 'bad' characters as they come along.

For the future, your method seems a lot better.

Thanks for your help.
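
For anyone landing here with the same error, one possible way to do that replacement in a single pass is sketched below. XML 1.0 only allows tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF, and 0xDC39 falls in the surrogate range that is excluded. The helper name `stripInvalidXmlChars` and the sample string are made up for illustration, and the input is assumed to be meant as UTF-8; anything that does not match an allowed sequence is dropped rather than replaced.

```php
<?php
// Hypothetical helper: keep only well-formed UTF-8 sequences whose code points
// are legal in XML 1.0; everything else (including encoded surrogates) is dropped.
function stripInvalidXmlChars(string $text): string
{
    $allowed = [
        '[\x09\x0A\x0D\x20-\x7F]',         // tab, LF, CR and U+0020..U+007F
        '[\xC2-\xDF][\x80-\xBF]',          // U+0080..U+07FF
        '\xE0[\xA0-\xBF][\x80-\xBF]',      // U+0800..U+0FFF
        '[\xE1-\xEC\xEE][\x80-\xBF]{2}',   // U+1000..U+CFFF and U+E000..U+EFFF
        '\xED[\x80-\x9F][\x80-\xBF]',      // U+D000..U+D7FF (surrogates excluded)
        '\xEF[\x80-\xBE][\x80-\xBF]',      // U+F000..U+FFBF
        '\xEF\xBF[\x80-\xBD]',             // U+FFC0..U+FFFD
        '\xF0[\x90-\xBF][\x80-\xBF]{2}',   // U+10000..U+3FFFF
        '[\xF1-\xF3][\x80-\xBF]{3}',       // U+40000..U+FFFFF
        '\xF4[\x80-\x8F][\x80-\xBF]{2}',   // U+100000..U+10FFFF
    ];
    // No /u modifier: the pattern works on raw bytes, so malformed input is skipped, not rejected.
    $pattern = '/(?:' . implode('|', $allowed) . ')+/';
    preg_match_all($pattern, $text, $matches);
    return implode('', $matches[0]);
}

// Assumption: "\xED\xB0\xB9" is the byte form of the offending U+DC39.
$dirty = "<p>\xED\xB0\xB9 Compleet bedrijf!</p>";
$xml = new SimpleXMLElement(stripInvalidXmlChars($dirty));
echo (string) $xml, PHP_EOL;
```

In the code posted earlier this would be applied to (string)$tidy2 just before it is passed to new SimpleXMLElement().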
 