strip out data of xml file

dbase77 · Sep 29, 2008

Hi,

I have an XML file about 3 MB in size, with a lot of data in it. I would like to strip some of that data out and import it into another application, preferably as a comma-delimited file. Here are some lines from the XML file:

============================
<?xml version="1.0" encoding="UTF-8" ?>
- <WHOLE Date="17/07/2008">
- <ENTITY Id="1" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE" remark="">
- <NAME Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link="

http://eur-lex.europa.eu/lex/LexUriServ/site/en/oj/2005/l_153/l_15320050616en00090014.pdf"

programme="ZWE">
<LASTNAME>Mugabe</LASTNAME>
<FIRSTNAME>Robert</FIRSTNAME>
<MIDDLENAME>Gabriel</MIDDLENAME>
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION>President</FUNCTION>
<LANGUAGE />
</NAME>
- <BIRTH Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link="

http://eur-lex.europa.eu/lex/LexUriServ/site/en/oj/2005/l_153/l_15320050616en00090014.pdf"

programme="ZWE">
<DATE>1924-02-21</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="2" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE" remark="">
- <NAME Id="2" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>President’s office and former Minister of State for Special Affairs responsible for Land and Resettlement Programmes</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="2572" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State in the Vice-President’s office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5715" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State for the Land Reform in the President's Office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5716" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2007/l_173/l_17320070703en00030015.pdf"

programme="ZWE">
<LASTNAME>Bhuka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="2" Entity_id="2" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link="

http://eur-lex.europa.eu/lex/LexUriServ/site/en/oj/2005/l_153/l_15320050616en00090014.pdf"

programme="ZWE">
<DATE>1968-02-25</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="928" Type="P" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_219/l_21920060810en00140019.pdf"

programme="TAQA" remark="(1) Correction published in OJ L 232, 25.8.2006, p. 42, (2) Case against him in Germany dismissed, (3) Convicted in Italy on 2002-12-11 (six year sentence), (4) Professor of Chemistry.">
- <NAME Id="1943" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_219/l_21920060810en00140019.pdf"

programme="TAQA">
<LASTNAME>Ben Hani</LASTNAME>
<FIRSTNAME>Al As’ad</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4910" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_219/l_21920060810en00140019.pdf"

programme="TAQA">
<LASTNAME>Ben Heni</LASTNAME>
<FIRSTNAME>Lased</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4911" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_219/l_21920060810en00140019.pdf"

programme="TAQA">
<LASTNAME />
<FIRSTNAME />
<MIDDLENAME />
<WHOLENAME>Mohamed Abu Abda</WHOLENAME>
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="323" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link="

http://eur-lex.europa.eu/LexUriServ/site/en/oj/2006/l_219/l_21920060810en00140019.pdf"

programme="TAQA">
<DATE>1969-02-05</DATE>
<PLACE>Tripoli</PLACE>
<COUNTRY>LBY</COUNTRY>
</BIRTH>
</ENTITY>
</WHOLE>
======================

What I'm interested in:

* whole date
* entity id
* type
* lastname
* firstname
* middlename
* wholename

End result should be in form of:

whole date,entity id,type,lastname,firstname,middlename,wholename

Some of the fields are empty, so maybe we can leave them blank. Example:

17/07/2008,1,P,Mugabe,Robert,Gabriel,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Bhuka,Flora,,
17/07/2008,928,P,Ben Hani,Al As'ad,,
17/07/2008,928,P,Ben Heni,Lased,,
17/07/2008,928,P,,,,Mohamed Abu Abda

I have no clue how to achieve this. Maybe it is not possible; I don't know. I would appreciate it if somebody could have a look. The actual XML file can be found here:

http://www.yonez.net/donation/global.xml

Thank you in advance.

elgrandeperro · Sep 29, 2008

perl can do it.

Annihilannic · Sep 29, 2008

Hmm... I thought I would try this in perl rather than using my usual weapon of choice, awk. It wasn't easy! But the experience may be useful in future... try this:

Perl:

#!/usr/bin/perl

use XML::Parser;

sub handle_start {
        $expat = shift;
        $element = shift;
        if ($element eq "WHOLE") {
                while ($attr = shift) {
                        $val = shift;
                        if ($attr eq "Date") { $wholedate=$val; }
                }
        } elsif ($element eq "ENTITY") {
                while ($attr = shift) {
                        $val = shift;
                        if    ($attr eq "Id") { $entity_id=$val }
                        elsif ($attr eq "Type") { $entity_type=$val }
                }
        }
}

sub handle_end {
        $expat = shift;
        $element = shift;
        if ($element eq "NAME") {
                print "$wholedate,$entity_id,$entity_type,$lastname,$firstname,$middlename,$wholename\n";
                $lastname=$firstname=$middlename=$wholename="";
        }
}

sub handle_char {
        $expat = shift;
        $text = shift;
        # expat may deliver one element's text in several chunks, so append
        # to the name fields (they are reset after each NAME is printed)
        if    ($expat->current_element eq "DATE") { $date=$text }
        elsif ($expat->current_element eq "LASTNAME") { $lastname.=$text }
        elsif ($expat->current_element eq "FIRSTNAME") { $firstname.=$text }
        elsif ($expat->current_element eq "MIDDLENAME") { $middlename.=$text }
        elsif ($expat->current_element eq "WHOLENAME") { $wholename.=$text }
}

$p1 = XML::Parser->new(Handlers => {
                Start => \&handle_start,
                End => \&handle_end,
                Char  => \&handle_char
        });

$p1->parsefile('inputfile.xml');

Annihilannic.
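
For comparison, here is a minimal Python sketch of the same extraction. It is not from the thread; it assumes the structure shown in the sample above (a WHOLE root with a Date attribute, ENTITY children with Id and Type, and NAME children holding the four name elements). The function names names_to_rows and to_csv are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

NAME_FIELDS = ("LASTNAME", "FIRSTNAME", "MIDDLENAME", "WHOLENAME")


def names_to_rows(xml_text):
    """Yield one row per NAME element: whole date, entity id, type, name fields."""
    root = ET.fromstring(xml_text)  # root is assumed to be the WHOLE element
    whole_date = root.get("Date", "")
    for entity in root.iter("ENTITY"):
        for name in entity.findall("NAME"):
            row = [whole_date, entity.get("Id", ""), entity.get("Type", "")]
            # Empty elements such as <MIDDLENAME /> become empty strings
            row.extend(name.findtext(field, "") or "" for field in NAME_FIELDS)
            yield row


def to_csv(xml_text):
    """Return the extracted rows as comma-separated text."""
    out = io.StringIO()
    # csv.writer quotes a field only if it contains a comma or a quote
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(names_to_rows(xml_text))
    return out.getvalue()
```

To run it on a real file, one could read the file's bytes and pass them to to_csv, then write the result to an output file opened with an explicit encoding such as UTF-8.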

dbase77 · Sep 30, 2008

Hi,

Thanks Annihilannic for a great piece of work. You are awesome, I must say.

It worked, but on some output I got a "Wide character in print at ./xml.pl line 26" message.

And some of the data that the script pulls out doesn't make any sense to me; there are some weird characters. Example:

17/07/2008;5291;E;;;;Organiz\xc3\xa1cia \xc5\xa1t\xc3\xa1tneho n\xc3\xa1kupu
17/07/2008;5291;E;;;;Organiza\xe7\xe3o de Aquisi\xe7\xf5es do Estado
etc
etc

Any reason? Thanks again.

LKBrwnDBA · Sep 30, 2008

Those that do not make sense are words (letters) with special characters, kind of like "accented" letters.

Look up hex a1, c3, c5, etc. in an extended character table (plain ASCII stops at 7f).
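
To make that concrete, here is a small Python sketch (not from the thread) that decodes the two sample lines quoted above. It assumes the first line is UTF-8 (two bytes per accented letter) and the second is ISO 8859-1 (one byte per accented letter). That mix would be consistent with Perl printing a string as single bytes when every character fits in one byte, and as UTF-8 with a "Wide character" warning otherwise, but treat that explanation as a likely cause rather than a confirmed one. The function name decode_samples is made up for illustration.

```python
def decode_samples():
    """Decode the two byte sequences shown in the post above."""
    # First sample line: assumed UTF-8 (c3 a1 is one letter, c5 a1 another)
    utf8_bytes = b"Organiz\xc3\xa1cia \xc5\xa1t\xc3\xa1tneho n\xc3\xa1kupu"
    # Second sample line: assumed ISO 8859-1 (e7, e3, f5 are one letter each)
    latin1_bytes = b"Organiza\xe7\xe3o de Aquisi\xe7\xf5es do Estado"
    return utf8_bytes.decode("utf-8"), latin1_bytes.decode("iso-8859-1")


first, second = decode_samples()
```

Under those assumptions, first comes out as "Organizácia štátneho nákupu" and second as "Organização de Aquisições do Estado", so the data itself is intact and only the output encoding is inconsistent.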

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb

fpmurphy · Sep 30, 2008

Another way is to write a simple XSL stylesheet to transform the XML document

Code:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="WHOLE">
  <xsl:apply-templates select="ENTITY"/>
</xsl:template>

<xsl:template match="ENTITY">
  <xsl:apply-templates select="NAME"/>
</xsl:template>

<xsl:template match="NAME">
  <xsl:value-of select="/WHOLE/@Date"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="../@Id"/>
  <xsl:value-of select="','"/>
  <xsl:value-of select="../@Type"/>
  <xsl:value-of select="','"/>
  <xsl:for-each select="LASTNAME | FIRSTNAME | MIDDLENAME | WHOLENAME">
     <xsl:value-of select="."/>
     <xsl:if test="position() != last()">
        <xsl:value-of select="','"/>
     </xsl:if>
  </xsl:for-each>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

</xsl:stylesheet>

and use xsltproc to do the transformation i.e.

Code:

$ xsltproc xslfile xmlfile
17/07/2008,1,P,Mugabe,Robert,Gabriel,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Bhuka,Flora,,
17/07/2008,928,P,Ben Hani,Al As'ad,,
17/07/2008,928,P,Ben Heni,Lased,,
17/07/2008,928,P,,,,Mohamed Abu Abda
$

dbase77 · Sep 30, 2008

Wow, interesting, I had never heard of xsltproc before. I must try this. I hope I can run it on FreeBSD or Solaris.

If I use xsltproc, am I going to get all these weird/strange characters like I did with the Perl program above? Did you run it against the full XML file?

Thanks.

dbase77 · Sep 30, 2008

Hi,

OK, I tested it myself against the full file. I still got the weird/strange characters. I guess there is nothing we can do about this, as they get converted in the process of getting the data out of the XML file.

I was wondering if there is another way, without translating these characters: leave them as they are when you view the XML file.

Thanks.

PHV · Sep 30, 2008

Don't you see weird/strange characters when you execute something like this?
more xmlfile

Hope This Helps, PH.

dbase77 · Sep 30, 2008

Hmmm, you are right. I viewed it in Windows and not in Unix, hence I missed these weird characters.

Any way I can overcome this? Thanks.

Annihilannic · Sep 30, 2008

If you're using Linux you can use the recode command to convert from any character set to another. What character set are you using on Unix (check locale output)?

Annihilannic.
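
If recode is not available, the same kind of conversion can be sketched in Python. This is only an illustration, not something from the thread; the function name transcode and the default character sets are assumptions, and characters that do not exist in the target set are replaced with "?".

```python
def transcode(data, source="utf-8", target="iso-8859-1"):
    """Re-encode bytes from one character set to another.

    Characters that the target character set cannot represent
    are replaced with '?' instead of raising an error.
    """
    return data.decode(source).encode(target, errors="replace")
```

For example, the UTF-8 bytes c3 a1 become the single ISO 8859-1 byte e1, while a letter such as "š", which ISO 8859-1 does not contain, becomes "?". Keeping the output in UTF-8 avoids that loss.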

dbase77 · Sep 30, 2008

Hi,

I'm running FreeBSD 6.3. Output of locale:

LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

Thanks.

Annihilannic · Sep 30, 2008

What do you get if you run:

Code:

awk 'BEGIN { print "\xc3 \xa1 \xc5"}'

Annihilannic.

dbase77 · Oct 1, 2008

Hi,

This is what I got:

freebsd# awk 'BEGIN { print "\xc3 \xa1 \xc5"}'
Ã ¡ Å
freebsd#

Thanks.
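
A note on reading that output, as a Python sketch that is not from the thread: "Ã ¡ Å" is what the three bytes look like when each one is shown as its own ISO 8859-1 character. Read as UTF-8 instead, c3 a1 together form a single letter. This only illustrates the two interpretations; it makes no claim about how this particular terminal is configured.

```python
raw = bytes([0xC3, 0xA1, 0xC5])

# One character per byte, as an ISO 8859-1 display would show them
as_latin1 = raw.decode("iso-8859-1")

# The first two bytes taken together as one UTF-8 sequence
as_utf8 = raw[:2].decode("utf-8")

# The last byte on its own is an incomplete UTF-8 sequence
try:
    raw[2:].decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
```

Here as_latin1 is "Ã¡Å" and as_utf8 is "á", so a display that shows three separate characters is treating the bytes as a single-byte character set.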

Annihilannic · Oct 2, 2008

Well that confirms that your terminal and system support those characters to some degree (I can only presume those characters look right - not knowing the language I can't say!).

Try changing the print statement in my script as follows:

Perl:

                eval "print \"$wholedate,$entity_id,$entity_type,$lastname,$firstname,$middlename,$wholename\\n\"";

Probably not very good practice (I believe eval can be risky to use in some circumstances), but may work for you.

Annihilannic.

mrn · Oct 3, 2008

Have you tried xml2txt available on sourceforge?

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."

dbase77 · Oct 3, 2008

Annihilannic,

I still got the strange characters in my output, and a couple of errors. I guess there is nothing we can do about this issue, since when I view the file in Windows it also changes to some weird characters.

mrn,

I haven't tried xml2txt yet. Have you tried it? You can always download the full XML file from the link in my first post.

mrn · Oct 6, 2008

OK, I had a play and have got:

vi xml.ksh

Code:

awk '
BEGIN{
        RS="<"
}
!/^\//{
        r=split($0,t,">");
        if (r==2)
                printf("%s: %s\n",t[1],t[2]);
        else if (r==1)
                printf("%s",t[1])
}' filename

Run as xml.ksh | paste -d, - - - - - -

Produces

LASTNAME: Mugabe,FIRSTNAME: Robert,MIDDLENAME: Gabriel,GENDER: M,FUNCTION: President,DATE: 1924-02-21

Mike

