locating and reporting text from within a free format text file 2

alfie002 · Mar 3, 2004

Hello all,

I would appreciate help with the following. The aim is to extract XML (tagged) data from within an xml file. The data itself is free format and potentially could exist in the file in various positions. The tags are known, eg <nedn/> being one of the tags.

I need to read the tagged data, i.e. the data that sits between the XML tags, as in <nedn>test_data</nedn>, and write the output to an external text file.

I have copied an extract of the data (XML) file into the thread for your perusal.

Any help would be appreciated.

Thanks

Alf

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE MeasDataCollection SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="http://www.w3c.org">

<mfh>
<ffv>1</ffv>
<sn>System=MS, ManagementServer=ManagementServer</sn>
<st></st>
<vn>Ericsson</vn>
<cbt>200402180000</cbt>
</mfh>
<md>
<neid>
<neun>SK1</neun>
<nedn>System=MS, Sitekeeper=SK1</nedn>
</neid>
<mts>200402191600</mts>
<gp>900</gp>
<mt>Resource,CPU load</mt>
<mt>Resource,RAM used</mt>
<mt>Resource,Disk used</mt>
<mt>Accounting,Radius events</mt>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>

The tags that I am interested in are as follows:

nedn
mts
gp
moid
r

thanks in advance

Regards

Alf

CaKiwi · Mar 3, 2004

This assumes that the tagged data does not extend over more than 1 line and that there are no nested tags.

/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");print}
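
A minimal sketch of how the one-liner might be run, assuming a POSIX shell and a few lines taken from the extract posted above. The `sample` variable is only an illustration; normally the file name would be passed to awk as an argument.

```sh
# A few lines from the posted extract, held in a shell variable for the demo.
sample='<neid>
<neun>SK1</neun>
<nedn>System=MS, Sitekeeper=SK1</nedn>
</neid>
<mts>200402191600</mts>
<gp>900</gp>
<mt>Resource,CPU load</mt>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>'

# Select lines holding one of the wanted tags, strip every tag, print the rest.
result=$(printf '%s\n' "$sample" |
  awk '/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");print}')

printf '%s\n' "$result"
```

Lines carrying other tags, such as <neun> or <mt>, are skipped. Adding a shell redirection such as `> out.txt` would send the result to a file instead of the screen.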

CaKiwi

alfie002 · Mar 4, 2004

Thanks CaKiwi,

One other thing: I also need to extract data from within a line of text. I used to be able to do this but have forgotten the syntax.

Basically, using the above example, I want to be able to extract the data between the tags;

<nedn>extract_this_text</nedn> and assign it to one or more variables for printing to an external file. So, as the script reads the text file, it locates the tags, extracts the text between them and stores it in variables.

Any help would be appreciated.

Thanks

Alf

CaKiwi · Mar 4, 2004

I don't really understand what you want. The

gsub(/<\/?[^>]*>/,"")

above removes all tags from the current line and the print prints the line to standard out. You can save it in a variable with something like

var1 = $0
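
If the goal is one variable per tag, a possible sketch is to keep the text in an array indexed by tag name. This is not from the thread: the array name `val` and the variables `tag`, `rest` and `endpos` are made up for illustration, and it assumes, as above, that each element starts and ends on the same line.

```sh
result=$(printf '%s\n' \
  '<nedn>System=MS, Sitekeeper=SK1</nedn>' \
  '<mts>200402191600</mts>' \
  '<gp>900</gp>' |
awk '
# Find the first opening tag on the line and remember its name.
match($0, /<[^\/>]+>/) {
    tag    = substr($0, RSTART + 1, RLENGTH - 2)
    rest   = substr($0, RSTART + RLENGTH)
    endpos = index(rest, "</" tag ">")
    # Store the text between <tag> and </tag>; a repeated tag overwrites.
    if (endpos > 0)
        val[tag] = substr(rest, 1, endpos - 1)
}
END { print val["nedn"] "|" val["mts"] "|" val["gp"] }
')

printf '%s\n' "$result"
```

The END rule here just prints three of the stored values separated by `|`; any other use of `val[...]` would work the same way.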

CaKiwi

aigles · Mar 4, 2004

The following 'awk' script extracts tag data.
The tagged data can span lines.

The list of tags to process (delimited by ,) is specified in the TAGS variable.

The data of each selected tag is written to a file named tag.dat, where tag is the tag name (for example mt.dat).

Code:

#!/usr/bin/awk -f

# ============================================================
# F u n c t i o n s . . .
# ============================================================

# ------------------------------------------------------------
# GetTagList
# Input:        Var TAGS   : list of tags, delimited by ","
# Output:       Var Tags[] : Array of tags to search (index=tag)
# Return:       N/A
# ------------------------------------------------------------

function GetTagList(     i,t,cnt) {
   if (split(TAGS, t, ",") == 0) {
      print "Missing tag list (TAGS)" | "cat 1>&2";
      exit;
   }
   for (i in t) 
      Tags[t[i]] = i;
}

# ------------------------------------------------------------
# ReadXmlText
# Input:        Input file
# Output:       Var XmlText : text read from input file 
# Return:       1 = Text read, 0 End of input file
# ------------------------------------------------------------

function ReadXmlText(    sts,line) {
   if ((getline line) > 0) {
      XmlText = XmlText line;
      sts = 1;
   }
   return sts;
}

# ------------------------------------------------------------
# GetNextTag
# Input/Output: var XmlText
# Return:       Tag id
# ------------------------------------------------------------

function GetNextTag(    tpos, tag) {
   while (1) {
      if (length(XmlText) == 0)
         if (ReadXmlText() == 0)
            break;
      tpos = match(XmlText, /<[^\/][^>]*>/);
      if (tpos != 0)
         break;
      XmlText = "";
   }
   if (tpos != 0) {
      tag = tolower(substr(XmlText, tpos+1, RLENGTH-2));
      # Keep only the text that follows the tag
      XmlText = substr(XmlText, tpos+RLENGTH) ;
   }
   return tag

}

# ------------------------------------------------------------
# GetTagDatas
# Arg:          Tag Id
# Input/Output: var XmlText
# Output:       N/A
# Return:       Tag datas
# ------------------------------------------------------------

function GetTagDatas(tag    ,tpos,datas) {
   while (1) {
      tpos = match(tolower(XmlText), "</" tolower(tag) ">" );
      if (tpos != 0)
         break;
      # Closing tag not in the buffer yet: append the next input line
      if (ReadXmlText() == 0)
         break;
   }
   if (tpos != 0) {
      datas = substr(XmlText, 1, tpos-1);
      XmlText = substr(XmlText, tpos+RLENGTH) ;
   }
   return datas

}


# ============================================================
# P a t t e r n s / A c t i o n s . . .
# ============================================================

BEGIN { 
   GetTagList()
   tag = GetNextTag();
   while (tag != "") {
      if (tag in Tags) {
         print GetTagDatas(tag) >> (tag ".dat") ;
      }
      tag = GetNextTag();
   }
}

An example of execution with your data:
[tt]
/home/jp> xml.awk -v TAGS='mt,nedn' xml.txt
/home/jp> cat mt.dat
Resource,CPU load
Resource,RAM used
Resource,Disk used
Accounting,Radius events
/home/jp>
[/tt]
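
The core step of GetNextTag, finding a tag with match() and cutting it off the front of the buffer, can be tried in isolation. This is a sketch under the assumption that the remainder should start immediately after the tag, at tpos + RLENGTH; the input line is invented for the demo.

```sh
result=$(printf '%s\n' 'xx<NEDN>data</nedn>' |
awk '{
    tpos = match($0, /<[^\/][^>]*>/)                   # position of first opening tag
    tag  = tolower(substr($0, tpos + 1, RLENGTH - 2))  # name without < and >
    rest = substr($0, tpos + RLENGTH)                  # text that follows the tag
    print tag "|" rest
}')

printf '%s\n' "$result"
```

Because the offset is taken from tpos, the cut lands in the right place even when other text precedes the tag on the line.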

Jean Pierre.

alfie002 · Mar 8, 2004

Hello CaKiwi,

In your example script to locate and extract the tag data, the tag data is printed line by line. What change is necessary to print the tag data to a file in a comma-separated format?

Thanks

Alf

alfie002 · Mar 8, 2004

aigles,

Firstly, thanks for your help. Secondly, would it be possible to get your script to print all the tag entries to one file, CSV formatted, named using the first tag;

file name : sk1_<this section can be appended>

The format should look something like the following;

Sitekeeper=SK1,200402191600,900,0,0,0 etc !!!!

Thanks in advance.

Regards

Alf

PS: I used to be able to do this sort of thing but haven't been working with scripts and parsing for some time. I have underestimated just how rusty I am.

CaKiwi · Mar 8, 2004

Try

/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");printf "%s%s",cma,$0;cma=","} END{print ""}
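
A sketch of the comma-separated output on a few of the sample lines, assuming a POSIX shell. It passes the data through an explicit "%s" format so that a % in the data is printed literally, and the END rule supplies the final newline.

```sh
result=$(printf '%s\n' \
  '<nedn>System=MS, Sitekeeper=SK1</nedn>' \
  '<mts>200402191600</mts>' \
  '<mt>Resource,CPU load</mt>' \
  '<gp>900</gp>' \
  '<r>0</r>' |
awk '
/<nedn>|<mts>|<gp>|<moid>|<r>/ {
    gsub(/<\/?[^>]*>/, "")       # strip the tags
    printf "%s%s", cma, $0       # cma is empty before the first field
    cma = ","
}
END { print "" }                 # finish the line
')

printf '%s\n' "$result"
```

Note that the nedn value itself contains a comma, so a strict CSV consumer would see it as two fields unless the value is quoted or trimmed first.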

CaKiwi

CaKiwi · Mar 8, 2004

Which tag is the first tag and how do you want the file to be named from it?

CaKiwi

futurelet · Mar 8, 2004

The following will print the data in the order nedn,mts,gp,moid,r --- regardless of its order in the file.
A tag can start on one line and end on another.
[tt]
{whole=whole $0}

END { lineout=""
# If "neun" isn't the tag from which file should
# be named, change next line.
split( whole, temp, /<neun>|<\/neun>/ )
fileout=temp[2] ".txt"
split( "", tags, /,/ )
for (i=1;i in tags;i++)
{ tag=tags[i]
c=split( whole, temp, "<"tag">|</"tag">" )
for (j=2;j<=c;j=j+2)
lineout=lineout temp[j] ","
}
sub(/,$/, "", lineout )
print "Write to file? " fileout "\n" lineout
# Delete preceding line and
# uncomment next line if the filename suits you.
# print lineout >fileout
}
[/tt]

futurelet · Mar 8, 2004

Oops. In the preceding program, change
[tt]
split( "", tags, /,/ )
[/tt]
to
[tt]
split( "nedn,mts,gp,moid,r", tags, /,/ )
[/tt]
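
The split() idea used in the program can be seen on a tiny input. This is only a sketch with invented data and a single hard-coded tag, mt, one occurrence of which is spread over two lines:

```sh
result=$(printf '%s\n' '<mt>a</mt>' '<gp>900</gp>' '<mt>' 'b</mt>' |
awk '
{ whole = whole $0 }                     # join all lines into one string
END {
    # Splitting on the open and close tags leaves the data
    # in the even-numbered elements of temp.
    c = split(whole, temp, "<mt>|</mt>")
    for (j = 2; j <= c; j += 2)
        out = out temp[j] ","
    sub(/,$/, "", out)                   # drop the trailing comma
    print out
}')

printf '%s\n' "$result"
```

The odd-numbered elements hold whatever lies between occurrences of the tag (here the gp element), which is why the loop steps by two.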
