Stripping text out of an XML file

Status
Not open for further replies.

IMAUser

Technical User
May 28, 2003
121
CH
I have an XML file, a sample of which is below.

<?xml version="1.0"?>
<data>
<header>
<name>Red Entity Download 2</name>
<version>1.1a</version>
<date>2004-07-27</date>
</header>
<entity>
<name>Democratic and Popular Republic of Algeria</name>
<shortname>Dem &amp; Pop Rep Algeria</shortname>
<ticker>ALGERI</ticker>
<yourentity>1225</yourentity>
<red>VZ5ACN</red>
<cusip>V4193K</cusip>
<type>Sov</type>
<jurisdiction>Algeria</jurisdiction>
<liquidity>Low</liquidity>
</entity>
<entity>
<name>Argentine Republic</name>
<shortname>Argentine Rep</shortname>
<ticker>ARGENT</ticker>
<yourentity>849</yourentity>
<red>PP7D7E</red>
<cusip>P0761D</cusip>
<type>Sov</type>
<jurisdiction>Argentina</jurisdiction>
<liquidity>Low</liquidity>
<updated>1999-04-07</updated>
</entity>
</data>

I'm trying to get a comma-separated file out of this, which should look like:

Democratic and Popular Republic of Algeria,Dem &amp; Pop Rep Algeria ,ALGERI , 1225, VZ5ACN, V4193K, Sov ,Algeria,Low

Argentine Republic,Argentine Rep,ARGENT, 849, PP7D7E, P0761D, Sov, Argentina,Low, 1999-04-07

So basically, each record is everything between the tags <entity> and </entity>.


Any pointers, folks?

Thanks.
 
Try

/<\/entity>/ {print ""; flg=0;next}
/<entity>/ {flg=1;next}
flg{
  if (flg>1) printf ","
  gsub(/<[^>]*>/,"")
  printf "%s", $0
  flg = 2
}

CaKiwi
 
Hi CaKiwi,
Thanks for the response.

If you tell me what the script does, maybe I can debug it, because at the moment it reports a syntax error on line 3 and I can't see anything wrong there.

Thanks
 
What system are you on? Use nawk on Solaris.

CaKiwi
 
Thanks, nawk seems to have worked, but the output is not what I expected :-(
It is as below:

<data>
, <header>
, <name>Red Entity Download 2</name>
, Red Entity Download 2 <version>1.1a</version>
, 1.1a <date>2004-07-27</date>
, 2004-07-27 </header>
,
, <name>Democratic and Popular Republic of Algeria</name>
Democratic and Popular Republic of Algeria <shortname>Dem &amp; Pop Rep Algeria</shortname>
, Dem &amp; Pop Rep Algeria <ticker>ALGERI</ticker>
, ALGERI <yourentity>1225</yourentity>
, 1225 <red>VZ5ACN</red>
, VZ5ACN <cusip>V4193K</cusip>
, V4193K <type>Sov</type>
, Sov <jurisdiction>Algeria</jurisdiction>
, Algeria <liquidity>Low</liquidity>
, Low
<name>Argentine Republic</name>
Argentine Republic <shortname>Argentine Rep</shortname>
, Argentine Rep <ticker>ARGENT</ticker>
, ARGENT <yourentity>849</yourentity>
, 849 <red>PP7D7E</red>
, PP7D7E <cusip>P0761D</cusip>
, P0761D <type>Sov</type>
, Sov <jurisdiction>Argentina</jurisdiction>
, Argentina <liquidity>Low</liquidity>
, Low <updated>1999-04-07</updated>
, 1999-04-07
 
/<\/entity>/ {print ""; flg=0;next}  # found </entity>: print linefeed, turn off printing flag
/<entity>/ {flg=1;next}              # found <entity>: turn on printing flag
flg{                                 # if printing flag is not zero, perform the following action
  if (flg>1) printf ","              # if this is the second or a subsequent line, print a comma
  gsub(/<[^>]*>/,"")                 # remove tags
  printf "%s", $0                    # print line; "%s" keeps a % in the data from being read as a format
  flg = 2                            # set flg to indicate second or subsequent line
}

CaKiwi
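
For anyone wanting to try the commented script, here is a minimal sketch of one way to run it from a shell. The sample is cut down to two fields per entity and is not indented, and the variable name csv is only for illustration. On Solaris, substitute nawk for awk.

```shell
#!/bin/sh
# Run the flag-based script on a trimmed, unindented sample.
csv=$(awk '
/<\/entity>/ { print ""; flg = 0; next }   # end of record: newline, stop printing
/<entity>/   { flg = 1; next }             # start of record: start printing
flg {                                      # brace on the same line as the pattern
    if (flg > 1) printf ","                # comma before every field but the first
    gsub(/<[^>]*>/, "")                    # strip the tags
    printf "%s", $0                        # "%s" prints a % in the data literally
    flg = 2
}
' <<'EOF'
<data>
<header>
<name>Red Entity Download 2</name>
</header>
<entity>
<name>Argentine Republic</name>
<ticker>ARGENT</ticker>
</entity>
<entity>
<name>Democratic and Popular Republic of Algeria</name>
<ticker>ALGERI</ticker>
</entity>
</data>
EOF
)
printf '%s\n' "$csv"
```

The header block is skipped because flg is still zero there. If the real file is indented, the leading whitespace stays in each field, since this script only strips the tags.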
 
/<\/entity>/ {print ""; flg=0;next}
/<entity>/ {flg=1;next}
flg
{
if (flg>1) printf (",")
gsub(/<[^>]*>/,"")
printf $0
flg = 2
}
 
Replace this:
flg
{
By this:
flg {


Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
something like this should get you "closer".

Code:
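BEGIN {
  FS="[<>]"   # split on < and >, so $2 is the tag name and $3 is its text
  OFS=","
}

# strip leading and trailing spaces
function trim(str)
{
    sub("^[ ]*", "", str);
    sub("[ ]*$", "", str);
    return str;
}

# range pattern: every line from <entity> through </entity>
$2 ~ /^entity/,$2 ~ "^/entity" {
      if ( $2 ~ /^entity/ ) next;                  # skip the opening tag itself
      if ( $2 ~ "^/entity" ) { print ""; next };   # closing tag: end the output line

      printf("%s%s", trim($3), OFS);               # field text followed by the delimiter
}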

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
You beat me again PHV. BTW, congrats on being tipmaster of the week AGAIN. (Doesn't it get boring? :) )

Elegant code as usual, Vlad

CaKiwi
 
congrats to PHV and thanks to CaKiwi.

vlad
 
Thanks Guys,

I didn't realise that putting the curly brace on the next line would make so much of a difference. Could someone explain why it does, please? Thanks again.
 
Anyway: man awk
1) flg{
The action code is executed only if flg is not equal to zero.
2-a) flg
If flg is nonzero, the default action (i.e. print $0) is executed.
2-b) {
The action code is executed for ALL input lines because there is no pattern.
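
The difference shows up on a tiny input. This is only an illustrative sketch: the two input lines and the marker text "action:" are made up, and flg is set in a BEGIN block so it has the same value for every line.

```shell
#!/bin/sh
input='a
b'

# Case 1: pattern and brace on one line. With flg zero, nothing is printed.
same_line=$(printf '%s\n' "$input" | awk 'BEGIN { flg = 0 }
flg { print "action: " $0 }')

# Case 2: brace on the next line. awk now sees two separate rules:
#   flg      pattern only, default action (print $0) when flg is nonzero
#   { ... }  action only, runs for every input line
next_line=$(printf '%s\n' "$input" | awk 'BEGIN { flg = 0 }
flg
{ print "action: " $0 }')

# Case 3: same two rules with flg nonzero, so each line comes out twice.
next_line_flg1=$(printf '%s\n' "$input" | awk 'BEGIN { flg = 1 }
flg
{ print "action: " $0 }')

printf 'same line, flg=0: [%s]\n' "$same_line"
printf 'next line, flg=0: [%s]\n' "$next_line"
printf 'next line, flg=1: [%s]\n' "$next_line_flg1"
```

Case 3 is what produced the unexpected output earlier in the thread: the bare flg rule printed each raw line with its tags, and the unconditional block then printed the stripped text for every line, header included.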

Hope This Helps, PH.
 
Hi,
I'm using the code given by vgersh99, but have hit a slight problem. Some entities have a sub-tag called <companynumber>. I can extract the data within this tag by making a small change to the script:
$2 ~ /^entity/,$2 ~ "^/entity" {
if ( $2 ~ /^entity/ ) next;
if ( $2 ~ /^companynumber/ ) next;
if ( $2 ~ "^/companynumber" ) next;

if ( $2 ~ "^/entity" ) { print ""; next };
printf("%s%s", trim($3), OFS);
}

But if there isn't a companynumber tag, I still need to output two delimiters, because sqlLoader expects a column there. So the output for the examples below should look like:


Q B E Insurance Group Limited|Q B E Ins Gp Ltd|QBEAU|7BB9AO|74728G|||Corp|Australia|Low|
LEND LEASE CORPORATION LTD|Lend Lease Corp Ltd|LLC|363|5H8625|526023|Australian Company Number|000 226 228|Corp|Australia|Med|2004-07-21|

I have used a pipe as the field separator rather than a comma.

<entity>
<name>Q B E Insurance Group Limited</name>
<shortname>Q B E Ins Gp Ltd</shortname>
<ticker>QBEAU</ticker>
<red>7BB9AO</red>
<cusip>74728G</cusip>
<type>Corp</type>
<jurisdiction>Australia</jurisdiction>
<liquidity>Low</liquidity>
</entity>
<entity>
<name>LEND LEASE CORPORATION LTD</name>
<shortname>Lend Lease Corp Ltd</shortname>
<ticker>LLC</ticker>
<yourentity>363</yourentity>
<red>5H8625</red>
<cusip>526023</cusip>
<companynumber>
<type>Australian Company Number</type>
<value>000 226 228</value>
</companynumber>
<type>Corp</type>
<jurisdiction>Australia</jurisdiction>
<liquidity>Med</liquidity>
<updated>2004-07-21</updated>
</entity>

I struggled with it for a while, setting flags and things like that, but haven't been able to get the right output.

Help Please.
 
Don't worry about it, I think I have solved it. Unless I hit another issue ;-)

Phew !!
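
The fix that was found was not posted, so for later readers here is one possible approach, offered only as a sketch. It assumes that <companynumber>, when present, always comes before the entity-level <type> tag, as in the sample above. The flag names in_entity, in_cn and seen_cn are made up for this example. When the entity-level <type> is reached and no <companynumber> has been seen, two empty columns are written first.

```shell
#!/bin/sh
out=$(awk '
BEGIN { FS = "[<>]"; OFS = "|" }          # $2 is the tag name, $3 its text

function trim(str) {
    sub(/^[ \t]+/, "", str)
    sub(/[ \t]+$/, "", str)
    return str
}

$2 == "entity"         { in_entity = 1; seen_cn = 0; in_cn = 0; next }
!in_entity             { next }           # ignore everything outside a record
$2 == "/entity"        { print ""; in_entity = 0; next }
$2 == "companynumber"  { in_cn = 1; seen_cn = 1; next }
$2 == "/companynumber" { in_cn = 0; next }

# Entity-level <type> with no <companynumber> before it: pad two empty columns.
$2 == "type" && !in_cn && !seen_cn { printf "%s%s", OFS, OFS }

{ printf "%s%s", trim($3), OFS }          # field text followed by the delimiter
' <<'EOF'
<entity>
<name>Q B E Insurance Group Limited</name>
<shortname>Q B E Ins Gp Ltd</shortname>
<ticker>QBEAU</ticker>
<red>7BB9AO</red>
<cusip>74728G</cusip>
<type>Corp</type>
<jurisdiction>Australia</jurisdiction>
<liquidity>Low</liquidity>
</entity>
<entity>
<name>LEND LEASE CORPORATION LTD</name>
<shortname>Lend Lease Corp Ltd</shortname>
<ticker>LLC</ticker>
<yourentity>363</yourentity>
<red>5H8625</red>
<cusip>526023</cusip>
<companynumber>
<type>Australian Company Number</type>
<value>000 226 228</value>
</companynumber>
<type>Corp</type>
<jurisdiction>Australia</jurisdiction>
<liquidity>Med</liquidity>
<updated>2004-07-21</updated>
</entity>
EOF
)
printf '%s\n' "$out"
```

On the two sample entities this gives the two pipe-delimited rows requested above. Other optional tags such as <yourentity> and <updated> are not padded, which matches the requested output but still leaves rows with different column counts.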

 