strip out data of xml file

Status
Not open for further replies.

dbase77 (Technical User)
Hi,

I have an XML file about 3 MB in size, with a lot of data in it. I would like to extract some of that data and import it into another application, preferably as a comma-delimited file. Here are some lines from the XML file:

============================
<?xml version="1.0" encoding="UTF-8" ?>
- <WHOLE Date="17/07/2008">
- <ENTITY Id="1" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE" remark="">
- <NAME Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<LASTNAME>Mugabe</LASTNAME>
<FIRSTNAME>Robert</FIRSTNAME>
<MIDDLENAME>Gabriel</MIDDLENAME>
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION>President</FUNCTION>
<LANGUAGE />
</NAME>
- <BIRTH Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<DATE>1924-02-21</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="2" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE" remark="">
- <NAME Id="2" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>President’s office and former Minister of State for Special Affairs responsible for Land and Resettlement Programmes</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="2572" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State in the Vice-President’s office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5715" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State for the Land Reform in the President's Office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5716" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Bhuka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="2" Entity_id="2" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<DATE>1968-02-25</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="928" Type="P" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA" remark="(1) Correction published in OJ L 232, 25.8.2006, p. 42, (2) Case against him in Germany dismissed, (3) Convicted in Italy on 2002-12-11 (six year sentence), (4) Professor of Chemistry.">
- <NAME Id="1943" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME>Ben Hani</LASTNAME>
<FIRSTNAME>Al As’ad</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4910" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME>Ben Heni</LASTNAME>
<FIRSTNAME>Lased</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4911" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME />
<FIRSTNAME />
<MIDDLENAME />
<WHOLENAME>Mohamed Abu Abda</WHOLENAME>
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="323" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<DATE>1969-02-05</DATE>
<PLACE>Tripoli</PLACE>
<COUNTRY>LBY</COUNTRY>
</BIRTH>
</ENTITY>
</WHOLE>
======================

What I'm interested in:

* whole date
* entity id
* type
* lastname
* firstname
* middlename
* wholename

End result should be in form of:

whole date,entity id,type,lastname,firstname,middlename,wholename

Some of the fields are empty, so maybe they can be left blank. Example:

17/07/2008,1,P,Mugabe,Robert,Gabriel,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Bhuka,Flora,,
17/07/2008,928,P,Ben Hani,Al As'ad,,
17/07/2008,928,P,Ben Heni,Lased,,
17/07/2008,928,P,,,,Mohamed Abu Abda


I have no clue how to achieve this, and maybe it is not possible. I would appreciate it if somebody could have a look. The actual XML file can be found here
Thank you in advance.
 
Hmm... I thought I would try this in perl rather than using my usual weapon of choice, awk. It wasn't easy! But the experience may be useful in future... try this:

Perl:
#!/usr/bin/perl

use XML::Parser;

sub handle_start {
        $expat = shift;
        $element = shift;
        if ($element eq "WHOLE") {
                while ($attr = shift) {
                        $val = shift;
                        if ($attr eq "Date") { $wholedate=$val; }
                }
        } elsif ($element eq "ENTITY") {
                while ($attr = shift) {
                        $val = shift;
                        if    ($attr eq "Id") { $entity_id=$val }
                        elsif ($attr eq "Type") { $entity_type=$val }
                }
        }
}

sub handle_end {
        $expat = shift;
        $element = shift;
        if ($element eq "NAME") {
                print "$wholedate,$entity_id,$entity_type,$lastname,$firstname,$middlename,$wholename\n";
                $lastname=$firstname=$middlename=$wholename="";
        }
}

sub handle_char {
        $expat = shift;
        $text = shift;
        # expat may deliver the text of one element in several pieces, so
        # append to the name fields; they are reset in handle_end.
        if    ($expat->current_element eq "DATE") { $date=$text }
        elsif ($expat->current_element eq "LASTNAME") { $lastname.=$text }
        elsif ($expat->current_element eq "FIRSTNAME") { $firstname.=$text }
        elsif ($expat->current_element eq "MIDDLENAME") { $middlename.=$text }
        elsif ($expat->current_element eq "WHOLENAME") { $wholename.=$text }
}

$p1 = new XML::Parser(Handlers => {
                Start => \&handle_start,
                End => \&handle_end,
                Char  => \&handle_char
        });

$p1->parsefile('inputfile.xml');

Annihilannic.
 
Hi,

Thanks Annihilannic for a great piece of work. You are awesome, I must say.

It worked, but for some of the output I got a "Wide character in print at ./xml.pl line 26" message.

Also, some of the data that the script pulls out doesn't make any sense to me. It contains weird characters. Example:

17/07/2008;5291;E;;;;Organiz\xc3\xa1cia \xc5\xa1t\xc3\xa1tneho n\xc3\xa1kupu
17/07/2008;5291;E;;;;Organiza\xe7\xe3o de Aquisi\xe7\xf5es do Estado
etc
etc

Any reason? Thanks again.

 

Those that do not make sense are words containing special characters, i.e. accented letters.

Look up the hex values (a1, c3, c5, etc.) in an extended ASCII table; they lie above the 7-bit ASCII range.



----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
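
To make the hex values concrete, here is a minimal sketch (show_bytes is a hypothetical helper name, not a standard command) that prints the bytes of a string. It suggests why the two sample lines look different: \xc3\xa1 is what a-acute looks like when encoded as UTF-8 (two bytes), whereas a single byte such as \xe7 or \xe1 is how ISO-8859-1 (Latin-1) stores an accented letter.

```shell
#!/bin/sh
# Print the bytes of a string as space-separated hex values.
# show_bytes is a hypothetical helper name used only for this illustration.
show_bytes() {
    # Unquoted on purpose: word splitting collapses the padding and line
    # breaks that od puts in its output.
    echo $(printf '%s' "$1" | od -An -tx1)
}

# Octal escapes keep the bytes exact, whatever encoding this file is saved in.
show_bytes "$(printf '\303\241')"   # a-acute as UTF-8: prints c3 a1
show_bytes "$(printf '\341')"       # a-acute as ISO-8859-1: prints e1
```

So the same letter can show up either as one byte or as two, depending on which encoding the program wrote.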
 
Another way is to write a simple XSL stylesheet to transform the XML document
Code:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="WHOLE">
  <xsl:apply-templates select="ENTITY"/>
</xsl:template>

<xsl:template match="ENTITY">
  <xsl:apply-templates select="NAME"/>
</xsl:template>

<xsl:template match="NAME">
  <xsl:value-of select="/WHOLE/@Date"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="../@Id"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="../@Type"/>
  <xsl:value-of select="','"/>
  <xsl:for-each select="LASTNAME | FIRSTNAME | MIDDLENAME | WHOLENAME">
     <xsl:value-of select="."/>
     <xsl:if test="position() != last()">
        <xsl:value-of select="','"/>
     </xsl:if>
  </xsl:for-each>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

</xsl:stylesheet>
and use xsltproc to do the transformation, e.g.
Code:
$ xsltproc xslfile xmlfile
17/07/2008,1,P,Mugabe,Robert,Gabriel,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Bhuka,Flora,,
17/07/2008,928,P,Ben Hani,Al As'ad,,
17/07/2008,928,P,Ben Heni,Lased,,
17/07/2008,928,P,,,,Mohamed Abu Abda
$
 
Wow, interesting, I had never heard of xsltproc before. I must try this. I hope I can run it on FreeBSD or Solaris.

If I use xsltproc, am I going to get the same weird/strange characters as with the Perl program above? Did you run it against the full XML file?

Thanks.
 
Hi,

OK. I tested it myself against the full file and I still get the weird/strange characters. I guess there is nothing we can do about this, as the characters are converted while the data is being extracted from the XML file.

I was wondering if there is another way that does not translate these characters and leaves them as they appear when you view the XML file.

Thanks.
 
Don't you see the weird/strange characters when you run something like this?
more xmlfile

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Hmmm, you are right. I had viewed the file in Windows and not in Unix, so I missed the weird characters.

Is there any way I can overcome this? Thanks.
 
If you're using Linux you can use the recode command to convert from any character set to another. What character set are you using on Unix (check locale output)?

Annihilannic.
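
Where recode is not installed, iconv can do the same kind of conversion. The following is only a sketch: it assumes an iconv binary is available and that the whole input really is in one encoding (ISO-8859-1 here), neither of which the thread confirms. latin1_to_utf8 is a hypothetical name.

```shell
#!/bin/sh
# Convert ISO-8859-1 (Latin-1) text on stdin to UTF-8 on stdout.
# Assumes iconv is installed; latin1_to_utf8 is a hypothetical name.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Example: each single Latin-1 byte above 0x7f becomes a two-byte UTF-8 sequence.
printf 'Organiza\347\343o\n' | latin1_to_utf8 | od -An -tx1
```

If the input mixes encodings, converting everything in one pass will garble the lines that were already UTF-8, so it is worth checking the bytes first.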
 
Hi,

I'm running FreeBSD 6.3. Output of locale:

LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=


Thanks.
 
What do you get if you run:

Code:
awk 'BEGIN { print "\xc3 \xa1 \xc5"}'


Annihilannic.
 
Hi,

This is what I got:

freebsd# awk 'BEGIN { print "\xc3 \xa1 \xc5"}'
à ¡ Å
freebsd#


Thanks.
 
Well that confirms that your terminal and system support those characters to some degree (I can only presume those characters look right - not knowing the language I can't say!).

Try changing the print statement in my script as follows:

Perl:
                eval "print \"$wholedate,$entity_id,$entity_type,$lastname,$firstname,$middlename,$wholename\\n\"";

Probably not very good practice (I believe eval can be risky to use in some circumstances), but may work for you.

Annihilannic.
 
Have you tried xml2txt available on sourceforge?

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 
Annihilannic,

I still get the strange characters in my output, plus a couple of errors. I guess there is nothing we can do about this, since the characters also turn into something weird when I view the output in Windows.


mrn,

I haven't tried xml2txt yet. Have you tried it? You can always download the full XML file from the link in my first post.
 
OK, I had a play and came up with this:

vi xml.ksh

Code:
awk '
BEGIN{
        RS="<"
}
!/^\//{
        r=split($0,t,">");
        if (r==2)
                printf("%s: %s\n",t[1],t[2]);
        else if (r==1)
                printf("%s",t[1])
}' filename

run as xml.ksh | paste -d, - - - - - -

Produces

LASTNAME: Mugabe,FIRSTNAME: Robert,MIDDLENAME:Gabriel,GENDER: M,FUNCTION: President,DATE: 1924-02-21




Mike
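
Building on the same RS="<" idea, here is a sketch that emits the comma-separated format asked for in the first post. It is plain text splitting, not a real XML parser: it assumes each tag's text sits directly after the opening tag, that Id and Type are the attribute names on ENTITY as in the sample, and that no field contains a comma or an escaped entity such as &amp;. names_to_csv is a hypothetical name. Bytes are passed through untouched, so accented characters stay exactly as they are in the file.

```shell
#!/bin/sh
# Emit "date,entity id,type,lastname,firstname,middlename,wholename" for
# every NAME element. Reads the XML on stdin.
# names_to_csv is a hypothetical name for this sketch.
names_to_csv() {
    awk '
    # Return the value of attribute "name" inside the tag text, or "" if absent.
    function attr(tag, name,    s) {
        if (match(tag, " " name "=\"[^\"]*\"")) {
            s = substr(tag, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", s)    # drop everything up to the opening quote
            sub(/"$/, "", s)         # drop the closing quote
            return s
        }
        return ""
    }
    BEGIN { RS = "<"; FS = ">" }     # one record per tag: $1 = tag, $2 = text after it
    $1 ~ /^WHOLE /      { date = attr($1, "Date") }
    $1 ~ /^ENTITY /     { id = attr($1, "Id"); type = attr($1, "Type") }
    $1 == "LASTNAME"    { last = $2 }
    $1 == "FIRSTNAME"   { first = $2 }
    $1 == "MIDDLENAME"  { middle = $2 }
    $1 == "WHOLENAME"   { whole = $2 }
    $1 == "/NAME" {
        print date "," id "," type "," last "," first "," middle "," whole
        last = first = middle = whole = ""
    }'
}
```

Run it as names_to_csv < filename after sourcing the function. Empty elements such as <MIDDLENAME /> never match the exact tag comparisons, so their fields simply stay blank.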

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 