Hi,
I have an XML file of about 3 MB containing a large amount of data. I would like to extract some of the data and import it into another application, preferably as a comma-delimited file. Here are some lines from the XML file:
============================
<?xml version="1.0" encoding="UTF-8" ?>
- <WHOLE Date="17/07/2008">
- <ENTITY Id="1" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE" remark="">
- <NAME Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<LASTNAME>Mugabe</LASTNAME>
<FIRSTNAME>Robert</FIRSTNAME>
<MIDDLENAME>Gabriel</MIDDLENAME>
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION>President</FUNCTION>
<LANGUAGE />
</NAME>
- <BIRTH Id="1" Entity_id="1" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<DATE>1924-02-21</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="2" Type="P" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE" remark="">
- <NAME Id="2" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>President’s office and former Minister of State for Special Affairs responsible for Land and Resettlement Programmes</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="2572" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State in the Vice-President’s office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5715" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Buka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION>Former Minister of State for the Land Reform in the President's Office</FUNCTION>
<LANGUAGE />
</NAME>
- <NAME Id="5716" Entity_id="2" legal_basis="777/2007 (OJ L 173)" reg_date="2007-07-03" pdf_link=" programme="ZWE">
<LASTNAME>Bhuka</LASTNAME>
<FIRSTNAME>Flora</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>F</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="2" Entity_id="2" legal_basis="898/2005 (OJ L 153)" reg_date="2005-06-16" pdf_link=" programme="ZWE">
<DATE>1968-02-25</DATE>
<PLACE />
<COUNTRY />
</BIRTH>
</ENTITY>
- <ENTITY Id="928" Type="P" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA" remark="(1) Correction published in OJ L 232, 25.8.2006, p. 42, (2) Case against him in Germany dismissed, (3) Convicted in Italy on 2002-12-11 (six year sentence), (4) Professor of Chemistry.">
- <NAME Id="1943" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME>Ben Hani</LASTNAME>
<FIRSTNAME>Al As’ad</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER>M</GENDER>
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4910" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME>Ben Heni</LASTNAME>
<FIRSTNAME>Lased</FIRSTNAME>
<MIDDLENAME />
<WHOLENAME />
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <NAME Id="4911" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<LASTNAME />
<FIRSTNAME />
<MIDDLENAME />
<WHOLENAME>Mohamed Abu Abda</WHOLENAME>
<GENDER />
<TITLE />
<FUNCTION />
<LANGUAGE />
</NAME>
- <BIRTH Id="323" Entity_id="928" legal_basis="1210/2006 (OJ L 219)" reg_date="2006-08-10" pdf_link=" programme="TAQA">
<DATE>1969-02-05</DATE>
<PLACE>Tripoli</PLACE>
<COUNTRY>LBY</COUNTRY>
</BIRTH>
</ENTITY>
</WHOLE>
======================
What I'm interested in is:
* whole date
* entity id
* type
* lastname
* firstname
* middlename
* wholename
The end result should be in the form:
whole date,entity id,type,lastname,firstname,middlename,wholename
Some of the fields are empty, so maybe they can simply be left blank. Example:
17/07/2008,1,P,Mugabe,Robert,Gabriel,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Buka,Flora,,
17/07/2008,2,P,Bhuka,Flora,,
17/07/2008,928,P,Ben Hani,Al As'ad,,
17/07/2008,928,P,Ben Heni,Lased,,
17/07/2008,928,P,,,,Mohamed Abu Abda
I have no clue how to achieve this. Maybe it is not possible; I don't know. I would appreciate it if somebody could have a look at it. The actual XML file can be found here
Thank you in advance.
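
For reference, here is a minimal sketch of one way the extraction described above could be done in Python, using only the standard library. It rests on a few assumptions that are not confirmed by the post: that the real file is well-formed XML (without the "- " markers a browser's XML viewer adds, and with the pdf_link attributes properly closed), that one output row per NAME element is wanted, and that empty fields are left blank. The SAMPLE text and the function names names_to_rows and xml_to_csv are hypothetical, made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, cleaned-up excerpt of the data shown above. Attributes that
# are not needed for the output (legal_basis, pdf_link, ...) are omitted.
SAMPLE = """<WHOLE Date="17/07/2008">
  <ENTITY Id="1" Type="P">
    <NAME Id="1" Entity_id="1">
      <LASTNAME>Mugabe</LASTNAME>
      <FIRSTNAME>Robert</FIRSTNAME>
      <MIDDLENAME>Gabriel</MIDDLENAME>
      <WHOLENAME />
    </NAME>
  </ENTITY>
  <ENTITY Id="928" Type="P">
    <NAME Id="4911" Entity_id="928">
      <LASTNAME />
      <FIRSTNAME />
      <MIDDLENAME />
      <WHOLENAME>Mohamed Abu Abda</WHOLENAME>
    </NAME>
  </ENTITY>
</WHOLE>
"""

# Child elements of NAME to export, in output column order.
FIELDS = ("LASTNAME", "FIRSTNAME", "MIDDLENAME", "WHOLENAME")


def names_to_rows(root):
    """Yield one row per NAME: whole date, entity id, type, then name parts."""
    date = root.get("Date", "")
    for entity in root.iter("ENTITY"):
        for name in entity.iter("NAME"):
            # findtext returns "" for empty elements and None for missing ones.
            parts = [(name.findtext(tag) or "").strip() for tag in FIELDS]
            yield [date, entity.get("Id", ""), entity.get("Type", "")] + parts


def xml_to_csv(xml_text):
    """Parse the XML text and return the rows as comma-delimited text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    # csv.writer quotes any value that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(names_to_rows(root))
    return out.getvalue()


if __name__ == "__main__":
    print(xml_to_csv(SAMPLE), end="")
```

For the actual file, ET.parse("yourfile.xml").getroot() (file name hypothetical) could be passed to names_to_rows instead of parsing an inline string. A 3 MB file is small enough to load into memory in one go.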