Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

processing an xml file with awk 2

Status
Not open for further replies.

daula

Programmer
May 29, 2006
38
US
hello everyone;
i'm very new in awk and as i was reading on i wondered if awk be the solution to my problem.

first a sample of my xml file:

-- xml --
<dts:dtservices xmlns:dts="uri:dts_namespace">
<dts:tally_output>
<dts:tally dts:name="test_lrfs">
<dts:tally dts:name="cState">
<dts:tally dts:name="dc_range">
<dts:value>00-06 </dts:value>
<dts:count>1</dts:count>
</dts:tally>
<dts:tally dts:name="dc_range">
<dts:value>07-12 </dts:value>
<dts:count>2</dts:count>
</dts:tally>
<dts:tally dts:name="dc_range">
<dts:value>13-24 </dts:value>
<dts:count>3</dts:count>
</dts:tally>
<dts:value>AK</dts:value>
<dts:count>6</dts:count>
</dts:tally>
<dts:tally dts:name="cState">
<dts:tally dts:name="dc_range">
<dts:value>07-12 </dts:value>
<dts:count>5</dts:count>
</dts:tally>
<dts:tally dts:name="dc_range">
<dts:value>13-24 </dts:value>
<dts:count>6</dts:count>
</dts:tally>
<dts:value>AL</dts:value>
<dts:count>11</dts:count>
</dts:tally>
<dts:tally dts:name="cState">
<dts:tally dts:name="dc_range">
<dts:value>00-06 </dts:value>
<dts:count>3</dts:count>
</dts:tally>
<dts:tally dts:name="dc_range">
<dts:value>07-12 </dts:value>
<dts:count>2</dts:count>
</dts:tally>
<dts:tally dts:name="dc_range">
<dts:value>13-24 </dts:value>
<dts:count>1</dts:count>
</dts:tally>
<dts:value>AR</dts:value>
<dts:count>6</dts:count>
</dts:tally>
</dts:tally>
</dts:tally_output>
</dts:dtservices>
---end---

my plan is to extract data from this file ordered by "cState" producing seven coma separated columns.

col 1 = cState
col 2 = the count for 00-06 dc_range
col 3 = % (value from col 2 divided by sum of all 00-06s multiplied by a 100)
col 4 = the count for 07-12 dc_range + value from col 2
col 5 = % (value from col 4 divided by sum of all 00-06s and 07-12s multiplied by a 100)
col 6 = the count for 13-24 dc_range + value from col 4
col 7 = % (value from col 6 divided by sum of all 00-06s, 07-12s and 13-24 multiplied by a 100).
Finally, list the totals at the end of the file.

Therefore, based on the above xml, the output should look something like:

AK,1,11.11,3,23.08,6,26.09
AL,0,00.00,5,38.46,11,47.83
AR,3,33.33,5,38.46,6,26.09
Total: 9,13,23

is this possible to achieve???

btw, i have tried to process the xml via xslt transform but is very slow.

Thanks in advance for any input and assistance.

daula

 
I have a feeling my main problem right now is to be able to modify the search portion of the script to do the job.

Code:
 /:name="Ccurrent_date">/{++r;next}
/>000-000/{c=2;next}
/>000-005/{c=4;next}
/>005-010/{c=6;next}
/>010-015/{c=8;next}
/>015-020/{c=10;next}
/>020-025/{c=12;next}
/>025-050/{c=14;next}
/>050-075/{c=16;next}
/>075-100/{c=18;next}
/>100-150/{c=20;next}
/>150-200/{c=22;next}

all along i have been assuming the above searches for the 'Ccurrent_date' then searches for the specific 'currdollar' value (e.g 000-000, 000-005 etc) and then go to the next 'Ccurrent_date' until they're all exhausted.

if that's the case, i'm wanting to add a step in between the 'currdollar' values to get the 'Gender's count for the M(ale) value.

honestly, it's shame that i have tried several guesses that didn't work. i have no idea how to fix-up the above to work ... and that's where you guys comes in.

i would really appreciate any help.

thanks again

Code:
***tally.xml snip***

<dps:tally dps:name="test_lrfs">
  <dps:tally dps:name="Ccurrent_date">
    <dps:tally dps:name="currdollar">
      <dps:tally dps:name="Gender">
        <dps:value>M</dps:value>
        <dps:count>57</dps:count>
      </dps:tally>
      <dps:value>000-005</dps:value>
      <dps:count>57</dps:count>
    </dps:tally>
    <dps:tally dps:name="currdollar">
      <dps:tally dps:name="Gender">
        <dps:value>M</dps:value>
        <dps:count>82</dps:count>
      </dps:tally>
      <dps:value>005-010</dps:value>
      <dps:count>82</dps:count>
    </dps:tally>
    .
    .
    . 
  </dps:tally>      
<dps:tally>
 
The way awk scripts work is that there are expressions followed by chunks of code; for each line it reads from the input, if it matches the expression, it executes the chunk of { code } that immediately follows it. So when it finds Ccurrent all it's doing is incrementing r, the record number, then skipping to the next record/line in the input.

When all of the input has been processed this way, it jumps to the END part and executes the code in this section.

So PHV's script works by reading all of the input into the array a, and then processing all of the contents of the array in the END clause. Does that make it clearer?

Annihilannic.
 
yes and no...
i was thinking to be able to get the M(ale) count for instance, for the 150-200 col, i could add a "/>value>M/{c=0;next};/>count/{c=22;next}" after "/>150-200/{c=22;next}" which is just a mere guess work.
please, forgive me for my ignorance.

hopely, you can throw me a hint/idea/assistance... to lead me in the right direction.

thanks

daula
 
please someone, i'm still desperately stuck.

daula
 
Is there any particular reason why you need to do this in awk? Have you tried loading your XML file into Excel or something and manipulating the data that way?

Annihilannic.
 
the main reason is that awk performances way better than xslt and xsltproc combo. i had done some transformation using xslt that were taking nearly 3 min to spit the output (even after rigorous tweaking) but when i did the same transform in awk (my very first post), it took 2 secs or less.

all my xml have the same layout... the one in question is the only one with a different layout.

i have a feeling it's just a small change within the search block that i'll need to do. the only problem is that i still haven't quite understood what i need to change or add to accomodate this new xml layout.

please, is still need some help.

i apologize for my ignorance...

thanks again

daula
 
please, guys don't give up on me... i'm getting desparate here

daula
 
When you posted this sample output:

Code:
Sample output:
--------------
MALE
200602,515,458,376,267,233,102,18,0,0,0,0
200601,1105,1011,861,658,586,363,115,7,0,0,0
200512,.,.,.,.,.,.,

The second line (200601) doesn't seem to match up with the sample XML data you posted at all, is that correct? I can't see how you arrived at these figures.

Annihilannic.
 
I'm very sorry about that... the output was generated using xslt and xsltproc. You're right the output for the second line is all wrong.
However, the data in the xml is correct.

thanks again

daula
 
Just to add... I think the 200601's line should be:

200601,590,553,485,391,353,261,97,7,0,0,0

I trully appreciate your help

daula
 
Try this to get you started:

Code:
awk '
BEGIN {
  FS=" *<[^>]+>"
  f="%s,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n"
  cols["000-000"]=1;
  cols["000-005"]=2;
  cols["005-010"]=3;
  cols["010-015"]=4;
  cols["015-020"]=5;
  cols["020-025"]=6;
  cols["025-050"]=7;
  cols["050-075"]=8;
  cols["075-100"]=9;
  cols["100-150"]=10;
  cols["150-200"]=11;
  debug=0

}
/:name="Ccurrent_date">/{++r;next}
/>[0-9][0-9][0-9]-[0-9][0-9][0-9]</{
        c=cols[$2];
  a[r,c]=savedcount;
  if (debug) print "a["r","c"]="savedcount;
  getline; # skip over the next count as we are ignoring them
  next
}
/:value>M/{next} # this prevents gender being treated as date
/:value>/{a[r,1]=$2;c=0;next} # this just gets the date
/:count>/{savedcount=$2; t[c]+=$2; } # save count until later because
# we do not know which column it belongs to yet
END{

 for(i=1;i<=r;++i)
 {
    dat=a[i,1]
    delete col
    for (m=2; m<=11; m++) {
      for (n=m; n<=11; n++) {
        col[m]+=a[i,n] ;
        if (debug) print "col["m"]+=a["i","n"]",a[i,n],col[m]
      }
    }

    # start from column 2 as there seems to be no 000-000 range in the latest data
    printf f,dat,col[2],col[3],col[4],col[5],col[6],col[7],col[8],col[9],col[10],col[11]

    # untested
    if(i==4)
    {
       dat="0-3"
       printf "%s\n", ""
       printf f,dat,0+a[i,2],a[i,4],a[i,6],a[i,8],a[i,10],a[i,12],a[i,14],a[i,16],a[i,18],a[i,20],a[i,22]
    }

    # untested
    if(i==7)
    {
       dat="4-6"
       k=4
       for (m=1; m<=11; m++) { t[m]=a[i,m*2] }
       t[1]-=a[k,2]
       t[2]-=a[k,4]
       t[3]-=a[k,6]
       t[4]-=a[k,8]
       t[5]-=a[k,10]
       t[6]-=a[k,12]
       t[7]-=a[k,14]
       t[8]-=a[k,16]
       t[9]-=a[k,18]
       t[10]-=a[k,20]
       t[11]-=a[k,22]
       printf f,dat,t[1],t[2],t[3],t[4],t[5],t[6],t[7],t[8],t[9],t[10],t[11]
    }

    # untested
    if(i>7&& ((i-1)%6)==0)
    {
       k=i-6
       dat=sprintf("%d-%d",k,i-1)
       for (m=1; m<=11; m++) { t[m]=a[i,m*2] }
       t1-=a[k,2]
       t2-=a[k,4]
       t3-=a[k,6]
       t4-=a[k,8]
       t5-=a[k,10]
       t6-=a[k,12]
       t7-=a[k,14]
       t8-=a[k,16]
       t9-=a[k,18]
       t10-=a[k,20]
       t11-=a[k,22]
       printf f,dat,t[1],t[2],t[3],t[4],t[5],t[6],t[7],t[8],t[9],t[10],t[11]
    }
 }
}
' tally.xml

They are horrible requirements because each version of the XML you posted differs in subtle ways which completely breaks the previous version of the code. Also I notice that the 000-000 data is now missing, so there is one less column of output.

I haven't updated the special handling of the 4th and 7th rows. Running this against your sample data gives me:

[tt]200602,515,458,376,267,233,102,18,0,0,0
200601,590,553,485,391,353,261,97,7,0,0[/code]

If you do need to post lots of sample data, I recommend you use or something similar to save making the threads here too long.

Annihilannic.
 
Good morning;
Thanks... I really appreciate your help.
Your script works but only and only if there is only one Gender value (i.e value>M)... but in my xml every 'currdollar' can have mutiple Gender values.
for instance;
I could have the following:
Code:
within the 000-005 'currdollar' of 200602 'Ccurrent_date'
 
      <dps:tally dps:name="Gender">
        <dps:value>U</dps:value>
        <dps:count>4</dps:count>
      </dps:tally>
      <dps:tally dps:name="Gender">
        <dps:value>M</dps:value>
        <dps:count>57</dps:count>
      </dps:tally>
      <dps:tally dps:name="Gender">
        <dps:value>I</dps:value>
        <dps:count>5</dps:count>
      </dps:tally>
      <dps:tally dps:name="Gender">
        <dps:value>F</dps:value>
        <dps:count>220</dps:count>
      </dps:tally>
      <dps:tally dps:name="Gender">
        <dps:value>C</dps:value>
        <dps:count>1</dps:count>
      </dps:tally>
      <dps:tally dps:name="Gender">
        <dps:value>A</dps:value>
        <dps:count>18</dps:count>
      </dps:tally>

All I'm interested in (for right now) is the 'value>M's corresponding count.
If I run your script with mutiple gender values like the above, the output changes to:
Code:
200602,0,476,458,376,267,233,102,18,0,0
expected: 200602,515,458,376,267,233,102,18,0,0,0

In that case, I apologize for not clarifying that in the begining.

How can I take care of this problem?

Thanks again

daula
 
Here is a version that should work for Males, Females and Unknowns (note the different parameters specified on the awk command line):

Code:
#awk -v gender="F" '
#awk -v gender="[A-EG-LN-Z]" '
awk -v gender="M" '
BEGIN {
  FS=" *<[^>]+>"
  f="%s,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n"
  cols["000-000"]=1;
  cols["000-005"]=2;
  cols["005-010"]=3;
  cols["010-015"]=4;
  cols["015-020"]=5;
  cols["020-025"]=6;
  cols["025-050"]=7;
  cols["050-075"]=8;
  cols["075-100"]=9;
  cols["100-150"]=10;
  cols["150-200"]=11;
  debug=0
}
/:name="Ccurrent_date">/{++r;next}
/>[0-9][0-9][0-9]-[0-9][0-9][0-9]</{
  c=cols[$2];
  a[r,c]=savedcount;
  if (debug) print "a["r","c"]="savedcount;
  savedcount=0;
  next
}
# get the count from the line following the gender
$0 ~ ":value>" gender {getline; savedcount+=$2; t[c]+=$2; }
/:value>/{a[r,1]=$2;next} # this just gets the date
END{

 for(i=1;i<=r;++i)
 {
    dat=a[i,1]
    delete col
    for (m=2; m<=11; m++) {
      for (n=m; n<=11; n++) {
        col[m]+=a[i,n] ;
        if (debug) print "col["m"]+=a["i","n"]",a[i,n],col[m]
      }
    }

    # start from column 2 as there seems to be no 000-000 range in the latest data
    printf f,dat,col[2],col[3],col[4],col[5],col[6],col[7],col[8],col[9],col[10],col[11]

    # untested
    if(i==4)
    {
       dat="0-3"
       printf "%s\n", ""
       printf f,dat,0+a[i,2],a[i,4],a[i,6],a[i,8],a[i,10],a[i,12],a[i,14],a[i,16],a[i,18],a[i,20],a[i,22]
    }

    # untested
    if(i==7)
    {
       dat="4-6"
       k=4
       for (m=1; m<=11; m++) { t[m]=a[i,m*2] }
       t[1]-=a[k,2]
       t[2]-=a[k,4]
       t[3]-=a[k,6]
       t[4]-=a[k,8]
       t[5]-=a[k,10]
       t[6]-=a[k,12]
       t[7]-=a[k,14]
       t[8]-=a[k,16]
       t[9]-=a[k,18]
       t[10]-=a[k,20]
       t[11]-=a[k,22]
       printf f,dat,t[1],t[2],t[3],t[4],t[5],t[6],t[7],t[8],t[9],t[10],t[11]
    }

    # untested
    if(i>7&& ((i-1)%6)==0)
    {
       k=i-6
       dat=sprintf("%d-%d",k,i-1)
       for (m=1; m<=11; m++) { t[m]=a[i,m*2] }
       t1-=a[k,2]
       t2-=a[k,4]
       t3-=a[k,6]
       t4-=a[k,8]
       t5-=a[k,10]
       t6-=a[k,12]
       t7-=a[k,14]
       t8-=a[k,16]
       t9-=a[k,18]
       t10-=a[k,20]
       t11-=a[k,22]
       printf f,dat,t[1],t[2],t[3],t[4],t[5],t[6],t[7],t[8],t[9],t[10],t[11]
    }
 }
}
' tally.xml

Annihilannic.
 
daula-

No offense intended, but I don't think you're trying hard enough. I had a similar situation, extracting data from an xml file. I knew nothing about awk. I asked somebody to point me to a command and they tossed awk (gawk for me) in front of me. I read up on it only looking for my specific needs. I was able to solve my problem in half a day starting from NOTHING, knowing NOTHING about awk. My problem was not as complex as yours, but there's been enough information posted on here to point you to what you need, if you take the time to go and look for it. I did have a small problem with a variation on my original script that I needed help with, and I got that here. I had already researched and the solution I found didn't work. It turns out that the documentation was wrong for what I'm using (listed FILE-NAME instead of FILENAME). I think you need to google for some documentation on how to use awk, and I'm sure if you're still having difficulties, people will help you.

Good luck,
Kevin
 
Thanks, you've been alot of help...
lots of blessing to you.

daula
 
hello everyone;
i'm back again...
as for right now the script works like a champ with M(ales) and F(emales) params but not with the Unknowns.
actually, the Unknowns are just alittle off because there're some few Unknowns that have numeric 'value's. so, i tried to modify the regex for Unknowns to gender="[A-EG-LN-Z0-9]".

this new change tend to produce some strange output for unknowns. i'm assuming it's a conflict caused by the following line:
/>[0-9][0-9][0-9]-[0-9][0-9][0-9]</

any idea??

thanks a bunch

daula
 
Just change this regex:

[tt]$0 ~ ":value>" gender [/tt]

So that it matches only the gender lines and not the category lines. Something like:

[tt]$0 ~ ":value>" gender "<"[/tt]

Annihilannic.
 
phew! works like a champ...
thanks for all your help. you've been great.

daula
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top