Extract data from an XML file 2

IMAUser · Dec 15, 2003

Hi,
I'm relatively sure people have done this before, but I couldn't find any pointers on Google. I have an XML data file with all kinds of tags; a sample is below.

<ROWSET>
<ROW>
<DOCID> 91000 </DOCID>
<SUBJECT> Bond Inserted</SUBJECT>
<TYPE> PROBLEM </TYPE>
<CONTENT_TYPE> TEXT/PLAIN </CONTENT_TYPE>
<STATUS> PUBLISHED </STATUS>
<CREATION_DATE> 14-DEC-1999 </CREATION_DATE>
<LAST_REVISION_DATE> 05-JUN-2000 </LAST_REVISION_DATE>
<LANGUAGE> USAENG </LANGUAGE>
</ROW>

<ROW>
<DOCID> 92000 </DOCID>
<SUBJECT> Bond Updated </SUBJECT>
<TYPE> PROBLEM </TYPE>
<CONTENT_TYPE> TEXT/PLAIN </CONTENT_TYPE>
<STATUS> PUBLISHED </STATUS>
<CREATION_DATE> 04-DEC-2003 </CREATION_DATE>
<LAST_REVISION_DATE> 14-DEC-2003 </LAST_REVISION_DATE>
<LANGUAGE> USAENG </LANGUAGE>
</ROW>
</ROWSET>

I need a script that can extract all the data into a comma-separated file, so for the above two records I would have

91000,Bond Inserted,Problem,Text/plain,Published,14-dec-1999,05-Jun-2000,usaeng
92000,Bond Updated,Problem,Text/plain,Published,04-dec-2003,14-dec-2003,usaeng

Any ideas where I can start? First of all, is it doable using AWK? Or maybe someone has already done something similar.

Any help appreciated.
Thanks

Ygor · Dec 15, 2003

Try...
[tt]
awk '/DOCID/,/LANGUAGE/{
sub($1 " ","");
sub(" " $NF,"");
print $0;
}' example.xml | paste -d',' - - - - - - - -

[/tt]

CaKiwi · Dec 15, 2003

Or

NF>1{if (s) s=s ","; s=s $2}
/^<\/ROW>/{print s; s=""}

CaKiwi

"I love mankind, it's people I can't stand" - Linus Van Pelt

Ygor · Dec 15, 2003

CaKiwi,

Neat - but I don't think it will work on <SUBJECT>

vgersh99 · Dec 15, 2003

how 'bout, not perfect yet - but a start.....

nawk -f ima.awk myFile.txt

#------------------ ima.awk
BEGIN {
RS=""; FS="\n"
OFS=","
}

{
for(i=1; i <= NF; i++) {
cfN=split($i, cfA, " ");
if ( cfN != 3) continue;
printf("%s%s", cfA[2], (i+1 != NF) ? OFS : "\n");
}
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

CaKiwi · Dec 15, 2003

Oh right. Maybe

NF>1{if (s) s=s ","; gsub(/ *<[^>]*> */,""); s=s $0}
/^<\/ROW>/{print s; s=""}

CaKiwi

"I love mankind, it's people I can't stand" - Linus Van Pelt

IMAUser · Dec 16, 2003

Thanx guys,

Any chance you can include some explanation of how these scripts work and what they are doing? For the above-mentioned data file there is no issue, but I have some more XML docs to extract. So if I can understand how these scripts work, I can make some changes on my own.

Thanx a ton.

vgersh99 · Dec 16, 2003

#------------------ ima.awk
BEGIN {
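# records are separated by blank/empty lines and
# fields are one per line.
# (an empty FS is not portable: some awks, e.g. gawk,
# would make every character a separate field)
RS=""; FS="\n"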

# the Output Field Separator [OFS] is ','
OFS=","
}

{
# iterate through all the fields of a given record;
# as described above, each FIELD is a LINE.
for(i=1; i <= NF; i++) {
# split a 'field/line' on 'space' - 'cfA' array will
# contain 'words' on current line/field
cfN=split($i, cfA, " ");

# if the number of 'words' is not 3 [assumption]
# go to the next field/line
if ( cfN != 3) continue;

# print the SECOND word from a 'split' array:
# it is assumed that every 'valid' field/line is of
# the form: TAG value endTAG
# your 'value' is always the SECOND word in a field/line
printf("%s%s", cfA[2], (i+1 != NF) ? OFS : "\n");
}
}

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
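
To see the "record = block of lines, field = line" behaviour described above in isolation, here is a minimal sketch. The function name show_records is made up for illustration, and it assumes an awk that follows POSIX paragraph mode (RS set to the empty string), with FS set to a newline explicitly.

```shell
# show_records (hypothetical helper): print, for each blank-line-separated
# record, how many lines (fields) it has and what its last line is.
show_records() {
  awk 'BEGIN { RS = ""; FS = "\n" }
       { printf "record %d: %d fields, last=%s\n", NR, NF, $NF }' "$@"
}
```

Run as `show_records example.xml` against the sample as posted (with the blank line between the two ROW blocks), the first record should end at the first </ROW> and the second at </ROWSET>. That difference matters for any logic that decides where a row ends by counting back from NF.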

aigles · Dec 16, 2003

A more complicated solution ....
It accepts multiple tags per line.

#!/usr/bin/awk -f

# ============================================================
# F u n c t i o n s . . .
# ============================================================

# ------------------------------------------------------------
# GetNextTag text tagid - Get tag text and remove tag
# Input: text["VAL" ] = Text to analyze
# Output: text["VAL" ] = Text after tag
# text["TAG_ID" ] = Identification tag
# text["TAG_VAL"] = Text of tag
# Return: 0 = No tag found
# 1 = Tag found
# ------------------------------------------------------------

function GetNextTag(text, tagid ,textval, tagpos, tagval, sts) {

sts = 1 ;
textval = text["VAL"] ;
if (tagid == "")
tagstart = "<[^>]*>" ;
else
tagstart = "<" tagid ">" ;

tagpos = match(textval, tagstart) ;
if (tagpos != 0) {
tagid = substr(textval, tagpos+1, RLENGTH-2) ;
textval = substr(textval, tagpos+RLENGTH) ;
tagpos = match(textval, ("</" tagid ">")) ;
if (tagpos != 0) {
tagval = substr(textval, 1, tagpos-1) ;
textval = substr(textval, tagpos+RLENGTH) ;
} else {
tagval = textval ;
textval = "" ;
}
} else {
sts = 0 ;
tagid = "" ;
tagval = "" ;
textval = "" ;
}

text["VAL" ] = textval ;
text["TAG_ID" ] = tagid ;
text["TAG_VAL"] = tagval ;
return sts ;

}

# ------------------------------------------------------------
# ProceedFile filetext - Analyze file
# ------------------------------------------------------------

function ProceedFile(filetext ,file) {
file["VAL" ] = filetext ;
while (GetNextTag(file, "ROWSET"))
ProceedRowSet(file["TAG_VAL"]);
}

# ------------------------------------------------------------
# ProceedRowSet rowsettext - Analyze ROWSET tag
# ------------------------------------------------------------

function ProceedRowSet(rowsettext ,rowset) {
rowset["VAL" ] = rowsettext ;
while (GetNextTag(rowset, "ROW"))
ProceedRow(rowset["TAG_VAL"]);
}

# ------------------------------------------------------------
# ProceedRow rowtext - Analyze ROW tag
# ------------------------------------------------------------

function ProceedRow (rowtext , row, result, firsttag) {
firsttag = 1;
row["VAL" ] = rowtext;
while (GetNextTag(row, "")) {
result = result (firsttag ? "" : ",") row["TAG_VAL"] ;
firsttag = 0 ;
}
print result ;
}

# ============================================================
# P a t t e r n s / A c t i o n s . . .
# ============================================================

{
filetext = filetext " " $0 ; # Memorize file in var
}

END {
gsub("[[:space:]]+", " ", filetext) ;
ProceedFile(filetext);
}

Jean Pierre.

Ygor · Dec 17, 2003

A generic version might be...

awk 'BEGIN{FS="[<>]"}NF>3{print $3}' example.xml|paste -d, - - - - - - - -

...which changes the field separator to < or > {FS="[<>]"} so that the line...

<SUBJECT> Bond Inserted </SUBJECT>

...is split into fields....

$1 < $2 > $3 < $4 > $5

Only data lines have more than three fields (NF>3) where...

$2 = opening tag
$3 = data
$4 = closing tag

The number of dashes in the paste command should be set to the number of data fields.
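
If counting dashes for paste is inconvenient, the same FS="[<>]" idea can join the values inside awk and print one line per </ROW>. This is only a sketch: the function name rows_to_csv is made up, and it assumes one element per line, as in the sample, and values that contain no commas (nothing is quoted).

```shell
# rows_to_csv (hypothetical helper): one CSV line per <ROW> element.
rows_to_csv() {
  awk -F'[<>]' '
    # data lines split into more than three fields; $3 is the value
    NF > 3 {
      v = $3
      sub(/^ +/, "", v)                    # trim leading spaces
      sub(/ +$/, "", v)                    # trim trailing spaces
      row = row (row == "" ? "" : ",") v
    }
    # a closing </ROW> line ends the current row
    $2 == "/ROW" { print row; row = "" }
  ' "$@"
}
```

Usage would be `rows_to_csv example.xml`. Because rows are delimited by the closing tag rather than by a fixed count, the number of data fields per row does not have to be known in advance.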

P.S. Vlad, I tried your code and got errors in the result...

91000,Bond,PROBLEM,TEXT/PLAIN,PUBLISHED,14-DEC-1999,05-JUN-2000,USAENG
92000,PROBLEM,TEXT/PLAIN,PUBLISHED,04-DEC-2003,14-DEC-2003,USAENG,

vgersh99 · Dec 17, 2003

ooops, sorry 'bout that.

nawk -f ima.awk myFile.txt

#------------------ ima.awk
BEGIN {
RS=""; FS="\n"
OFS=","
}

function trim(str)
{
nsub1=sub("^[ ]*", "", str);
nsub2=sub("[ ]*$", "", str);
return str;
}

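{
# build the whole row first, so the separator logic does not
# depend on how many non-data lines follow the last value
row="";
for(i=1; i <= NF; i++) {
cfN=split($i, cfA, "[<>]");
if ( cfN <= 3) continue;
row=row (row == "" ? "" : OFS) trim(cfA[3]);
}
if (row != "") print row;
}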

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

nishantu · May 27, 2005

I have an XML file (given below) which lists installs, changes and removes for each product. The XML also provides totals at the end.
I need to extract the totals (system_change_msg/totals/total_installs, system_change_msg/totals/total_changes and system_change_msg/totals/total_removes) using awk. Does anybody have any idea how to do this?

Thanks
Nishant

<system_change_msg
transaction="1"
changed_by="SW$ADMIN"
device_id="0"
object_name="device_counts"
batch="1"
changed_by_timestamp="05/19/2005 01:45:09"
event_type="install">
<product_abc>
<total_installs>13</total_installs>
<total_changes>2</total_changes>
<total_removes>3</total_removes>
</product_abc>
<product_xyz>
<total_installs>4</total_installs>
<total_changes>51</total_changes>
<total_removes>42</total_removes>
</product_xyz>
<product_pqr>
<total_installs>0</total_installs>
<total_changes>300</total_changes>
<total_removes>22</total_removes>
</product_pqr>
<totals>
<total_installs>17</total_installs>
<total_changes>353</total_changes>
<total_removes>67</total_removes>
</totals>
</system_change_msg>
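
One possible starting point, reusing the FS="[<>]" technique from earlier in the thread: set a flag while inside the <totals> element and print only the values seen there. This is a sketch, not a general XML parser. The function name extract_totals is made up, and it assumes the layout shown above, with <totals>, </totals> and each value on a line of its own.

```shell
# extract_totals (hypothetical helper): print name=value for each
# element found between <totals> and </totals>.
extract_totals() {
  awk -F'[<>]' '
    $2 == "totals"      { in_totals = 1; next }   # opening <totals>
    $2 == "/totals"     { in_totals = 0 }         # closing </totals>
    in_totals && NF > 3 { print $2 "=" $3 }       # e.g. total_installs=17
  ' "$@"
}
```

Called as `extract_totals myFile.xml` (file name assumed), it skips the per-product blocks because the flag is only set between the <totals> tags.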
