Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Extract data from an XML file 2

Status
Not open for further replies.

IMAUser

Technical User
May 28, 2003
121
0
0
CH
Hi ,
I m relatively sure people have done this before but I couldnt find any pointers on Google. The issue is I have a xml data file with all kinds of tags, a sample is as below


<ROWSET>
<ROW>
<DOCID> 91000 </DOCID>
<SUBJECT> Bond Inserted</SUBJECT>
<TYPE> PROBLEM </TYPE>
<CONTENT_TYPE> TEXT/PLAIN </CONTENT_TYPE>
<STATUS> PUBLISHED </STATUS>
<CREATION_DATE> 14-DEC-1999 </CREATION_DATE>
<LAST_REVISION_DATE> 05-JUN-2000 </LAST_REVISION_DATE>
<LANGUAGE> USAENG </LANGUAGE>
</ROW>

<ROW>
<DOCID> 92000 </DOCID>
<SUBJECT> Bond Updated </SUBJECT>
<TYPE> PROBLEM </TYPE>
<CONTENT_TYPE> TEXT/PLAIN </CONTENT_TYPE>
<STATUS> PUBLISHED </STATUS>
<CREATION_DATE> 04-DEC-2003 </CREATION_DATE>
<LAST_REVISION_DATE> 14-DEC-2003 </LAST_REVISION_DATE>
<LANGUAGE> USAENG </LANGUAGE>
</ROW>
</ROWSET>

I need a script which can extract all the data in a comma seperated file so for the above two records I would have

91000,Bond Inserted,Problem,Text/plain,Published,14-dec-1999,05-Jun-2000,usaeng
92000,Bond Updated,Problem,Text/plain,Published,04-dec-2003,14-dec-2003,usaeng

Any ideas where I can start, First of all whether it is doable using AWK ? Or maybe someone has already done something similar.

Any help appreaciated .
Thanks
 
Try...
[tt]
awk '/DOCID/,/LANGUAGE/{
sub($1 &quot; &quot;,&quot;&quot;);
sub(&quot; &quot; $NF,&quot;&quot;);
print $0;
}' example.xml | paste -d',' - - - - - - - -

[/tt]
 
Or

NF>1{if (s) s=s &quot;,&quot;; s=s $2}
/^<\/ROW>/{print s; s=&quot;&quot;}

CaKiwi

&quot;I love mankind, it's people I can't stand&quot; - Linus Van Pelt
 
CaKiwi,

Neat - but I don't think it will work on <SUBJECT>
 
how 'bout, not perfect yet - but a start.....

nawk -f ima.awk myFile.txt

#------------------ ima.awk
BEGIN {
RS=FS=&quot;&quot;
OFS=&quot;,&quot;
}

{
for(i=1; i <= NF; i++) {
cfN=split($i, cfA, &quot; &quot;);
if ( cfN != 3) continue;
printf(&quot;%s%s&quot;, cfA[2], (i+1 != NF) ? OFS : &quot;\n&quot;);
}
}


vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Oh right. Maybe

NF>1{if (s) s=s &quot;,&quot;; gsub(/ *<[^>]*> */,&quot;&quot;); s=s $0}
/^<\/ROW>/{print s; s=&quot;&quot;}


CaKiwi

&quot;I love mankind, it's people I can't stand&quot; - Linus Van Pelt
 
Thanx guys,

Any chance you can include some explanation as to how and what these scripts are doing. For the above mentioned datafile there is no issue but I have some more xml docs to be extracted. So maybe if I can understand how these scripts are working, I can carry out some changes on my own.

Thanx a ton.
 
#------------------ ima.awk
BEGIN {
# records are separated by blank/empty lines and
# fields are one per line.
RS=FS=&quot;&quot;

# the Output Field Separator [OFS] is ','
OFS=&quot;,&quot;
}

{
# iterate through all the field for a given record
# as described above the FIELD is a LINE.
for(i=1; i <= NF; i++) {
# split a 'field/line' on 'space' - 'cfA' array will
# contain 'words' on current line/field
cfN=split($i, cfA, &quot; &quot;);

# if the number of 'words' is not 3 [assumption]
# go to the next field/line
if ( cfN != 3) continue;

# print the SECOND word from a 'split' array:
# it assumed that every 'valid' field/line is of
# the form: TAG value endTAG
# your 'value' is always a SECOND word in a field/line
printf(&quot;%s%s&quot;, cfA[2], (i+1 != NF) ? OFS : &quot;\n&quot;);
}
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
A more complicated solution ....
Accept multiple tags by lines.

#!/usr/bin/awk -f

# ============================================================
# F u n c t i o n s . . .
# ============================================================

# ------------------------------------------------------------
# GetNextTag text tagid - Get tag text and remove tag
# Input: text[&quot;VAL&quot; ] = Text to analyze
# Output: text[&quot;VAL&quot; ] = Text after tag
# text[&quot;TAG_ID&quot; ] = Identification tag
# text[&quot;TAG_VAL&quot;] = Text ogf tag
# Return: 0 = No tag found
# 1 = Tag found
# ------------------------------------------------------------

function GetNextTag(text, tagid ,textval, tagpos, tagval, sts) {

sts = 1 ;
textval = text[&quot;VAL&quot;] ;
if (tagid == &quot;&quot;)
tagstart = &quot;<[^>]*>&quot; ;
else
tagstart = &quot;<&quot; tagid &quot;>&quot; ;

tagpos = match(textval, tagstart) ;
if (tagpos != 0) {
tagid = substr(textval, tagpos+1, RLENGTH-2) ;
textval = substr(textval, tagpos+RLENGTH) ;
tagpos = match(textval, (&quot;</&quot; tagid &quot;>&quot;)) ;
if (tagpos != 0) {
tagval = substr(textval, 1, tagpos-1) ;
textval = substr(textval, tagpos+RLENGTH) ;
} else {
tagval = textval ;
textval = &quot;&quot; ;
}
} else {
sts = 0 ;
tagid = &quot;&quot; ;
tagval = &quot;&quot; ;
textval = &quot;&quot; ;
}

text[&quot;VAL&quot; ] = textval ;
text[&quot;TAG_ID&quot; ] = tagid ;
text[&quot;TAG_VAL&quot;] = tagval ;
return sts ;

}

# ------------------------------------------------------------
# ProceedFile filetext - Analyze file
# ------------------------------------------------------------

function ProceedFile(filetext ,file) {
file[&quot;VAL&quot; ] = filetext ;
while (GetNextTag(file, &quot;ROWSET&quot;))
ProceedRowSet(file[&quot;TAG_VAL&quot;]);
}


# ------------------------------------------------------------
# ProceedRowSet rowsettext - Analyze ROWSET tag
# ------------------------------------------------------------

function ProceedRowSet(rowsettext ,rowset) {
rowset[&quot;VAL&quot; ] = rowsettext ;
while (GetNextTag(rowset, &quot;ROW&quot;))
ProceedRow(rowset[&quot;TAG_VAL&quot;]);
}

# ------------------------------------------------------------
# ProceedRow rowtext - Analyze ROW tag
# ------------------------------------------------------------

function ProceedRow (rowtext , row, result, firsttag) {
firsttag = 1;
row[&quot;VAL&quot; ] = rowtext;
while (GetNextTag(row, &quot;&quot;)) {
result = result (firsttag ? &quot;&quot; : &quot;,&quot;) row[&quot;TAG_VAL&quot;] ;
firsttag = 0 ;
}
print result ;
}


# ============================================================
# P a t t e r n s / A c t i o n s . . .
# ============================================================

{
filetext = filetext &quot; &quot; $0 ; # Memorize file in var
}

END {
gsub(&quot;[[:space:]]+&quot;, &quot; &quot;, filetext) ;
ProceedFile(filetext);
}


Jean Pierre.
 
A generic version might be...

awk 'BEGIN{FS=&quot;[<>]&quot;}NF>3{print $3}' example.xml|paste -d, - - - - - - - -

...which changes the field separator to < or > {FS=&quot;[<>]&quot;}so that the line...

<SUBJECT> Bond Inserted </SUBJECT>

...is split into fields....

$1 < $2 > $3 < $4 > $5

Only data lines have more than three fields (NF>3) where...

$2 = opening tag
$3 = data
$4 = closing tag

The number of dashes of the paste command should be set to the number of data fields.


P.S. Vlad, I tried your code and got errors in the result...

91000,Bond,PROBLEM,TEXT/PLAIN,PUBLISHED,14-DEC-1999,05-JUN-2000,USAENG
92000,PROBLEM,TEXT/PLAIN,PUBLISHED,04-DEC-2003,14-DEC-2003,USAENG,
 
ooops, sorry 'bout that.

nawk -f ima.awk myFile.txt

#------------------ ima.awk
BEGIN {
RS=FS=&quot;&quot;
OFS=&quot;,&quot;
}

function trim(str)
{
nsub1=sub(&quot;^[ ]*&quot;, &quot;&quot;, str);
nsub2=sub(&quot;[ ]*$&quot;, &quot;&quot;, str);
return str;
}

{
for(i=1; i <= NF; i++) {
cfN=split($i, cfA, &quot;[<>]&quot;);
if ( cfN <= 3) continue;
printf(&quot;%s%s&quot;, trim(cfA[3]), (i+1 != NF) ? OFS : &quot;\n&quot;);
}
}


vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
I have an xml file (given below) which lists out installs, changes and removes for each product. The xml also provides totals at the end.
I need to extract the totals (system_change_msg/totals/total_installs, system_change_msg/totals/total_changes and system_change_msg/totals/total_changes) using awk... anybody has any idea on how to do this?

Thanks
Nishant

<system_change_msg
transaction="1"
changed_by="SW$ADMIN"
device_id="0"
object_name="device_counts"
batch="1"
changed_by_timestamp="05/19/2005 01:45:09"
event_type="install">
<product_abc>
<total_installs>13</total_installs>
<total_changes>2</total_changes>
<total_removes>3</total_removes>
</product_abc>
<product_xyz>
<total_installs>4</total_installs>
<total_changes>51</total_changes>
<total_removes>42</total_removes>
</product_xyz>
<product_pqr>
<total_installs>0</total_installs>
<total_changes>300</total_changes>
<total_removes>22</total_removes>
</product_pqr>
<totals>
<total_installs>17</total_installs>
<total_changes>353</total_changes>
<total_removes>67</total_removes>
</totals>
</system_change_msg>
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top