locating and reporting text from within a free format text file 2

Status
Not open for further replies.

alfie002

Technical User
Mar 3, 2004
121
GB
Hello all,

I would appreciate help with the following. The aim is to extract tagged data from within an XML file. The data itself is free format and could appear at various positions in the file. The tags are known; <nedn> is one example.

I need to read the tagged data, that is, the data between the XML tags, e.g. <nedn>test_data</nedn>, and write the output to an external text file.

I have copied an extract of the XML data file into the thread for your perusal.

Any help would be appreciated.

Thanks

Alf


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<!DOCTYPE MeasDataCollection SYSTEM "MeasDataCollection.dtd">
<mdc xmlns:HTML="<mfh>
<ffv>1</ffv>
<sn>System=MS, ManagementServer=ManagementServer</sn>
<st></st>
<vn>Ericsson</vn>
<cbt>200402180000</cbt>
</mfh>
<md>
<neid>
<neun>SK1</neun>
<nedn>System=MS, Sitekeeper=SK1</nedn>
</neid>
<mts>200402191600</mts>
<gp>900</gp>
<mt>Resource,CPU load</mt>
<mt>Resource,RAM used</mt>
<mt>Resource,Disk used</mt>
<mt>Accounting,Radius events</mt>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>

The tags that I am interested in are as follows:

nedn
mts
gp
moid
r

thanks in advance

Regards

Alf
 
This assumes that the tagged data does not extend over more than 1 line and that there are no nested tags.

/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");print}
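
As a quick check, the one-liner can be run against a cut-down copy of the sample data. This is only a sketch: the file name sample.xml and the shell variable result are assumptions made for illustration.

```shell
# Build a small test file (sample.xml is a hypothetical name).
cat > sample.xml <<'EOF'
<neid>
<neun>SK1</neun>
<nedn>System=MS, Sitekeeper=SK1</nedn>
</neid>
<mts>200402191600</mts>
<gp>900</gp>
<mt>Resource,CPU load</mt>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>
EOF

# For every line holding one of the wanted opening tags,
# strip all tags from the line and print what is left.
result=$(awk '/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");print}' sample.xml)
printf '%s\n' "$result"
```

With this input the five wanted values come out one per line, in file order; the <neun> and <mt> lines are skipped because they are not in the pattern.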

CaKiwi
 
Thanks CaKiwi,

One other thing: I need to be able to extract data from within text. I used to be able to do this but have forgotten the syntax.

Basically, using the above example, I want to be able to extract the data between the tags:

<nedn>extract_this_text</nedn> and assign it to one or more variables for printing to an external file. So, as the script reads the text file, it locates the tags, extracts the text between them and stores it in variables.

Any help would be appreciated.

Thanks

Alf
 
I don't really understand what you want. The

gsub(/<\/?[^>]*>/,"")

above removes all tags from the current line and the print prints the line to standard out. You can save it in a variable with something like

var1 = $0
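
If the goal is to keep each value in its own variable, one option is an array indexed by tag name. The sketch below is a guess at what is wanted rather than a finished solution: it handles only tags that open and close on the same line, keeps the last value seen for each tag, and writes to a hypothetical file tags.out. The names want, val, sample.xml and tags.out are all assumptions.

```shell
# Build a small test file (sample.xml is a hypothetical name).
cat > sample.xml <<'EOF'
<nedn>System=MS, Sitekeeper=SK1</nedn>
<mts>200402191600</mts>
<gp>900</gp>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>
EOF

awk '
BEGIN { n = split("nedn,mts,gp,moid,r", want, ",") }

# For each wanted tag found on the current line, store the text
# between <tag> and </tag> in val[tag].
{
    for (i = 1; i <= n; i++) {
        t = want[i]
        if (match($0, "<" t ">[^<]*</" t ">")) {
            # The match is "<t>" + data + "</t>", so skip length(t)+2
            # characters at the front and drop length(t)+3 at the back.
            val[t] = substr($0, RSTART + length(t) + 2, RLENGTH - 2 * length(t) - 5)
        }
    }
}

# Write the stored values to the output file, one "tag=value" per line.
END {
    for (i = 1; i <= n; i++)
        print want[i] "=" val[want[i]] > "tags.out"
}' sample.xml

cat tags.out
```

Because the values sit in val[] until the END block, they can be printed in any order or format at that point.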

CaKiwi
 
The following 'awk' script extracts tag data.
The tagged data can span lines.

The list of tags to process (delimited by ,) is specified in the TAGS variable.

The data of each selected tag is written to a file named after the tag, with a .dat extension (for example mt.dat).

Code:
#!/usr/bin/awk -f

# ============================================================
# F u n c t i o n s . . .
# ============================================================

# ------------------------------------------------------------
# GetTagList
# Input:        Var TAGS   : list of tags, delimited by ","
# Output:       Var Tags[] : Array of tags to search (index=tag)
# Return:       N/A
# ------------------------------------------------------------

function GetTagList(     i,t,cnt) {
   if (split(TAGS, t, ",") == 0) {
      print "Missing tag list (TAGS)" | "cat 1>&2";
      exit;
   }
   for (i in t) 
      Tags[t[i]] = i;
}

# ------------------------------------------------------------
# ReadXmlText
# Input:        Input file
# Output:       Var XmlText : text read from input file 
# Return:       1 = Text read, 0 End of input file
# ------------------------------------------------------------

function ReadXmlText(    sts,line) {
   if ((getline line) > 0) {
      XmlText = XmlText line;
      sts = 1;
   }
   return sts;
}

# ------------------------------------------------------------
# GetNextTag
# Input/Output: var XmlText
# Return:       Tag id
# ------------------------------------------------------------

function GetNextTag(    tpos, tag) {
   while (1) {
      if (length(XmlText) == 0)
         if (ReadXmlText() == 0)
            break;
      tpos = match(XmlText, /<[^\/][^>]*>/);
      if (tpos != 0)
         break;
      XmlText = "";
   }
   if (tpos != 0) {
      tag = tolower(substr(XmlText, tpos+1, RLENGTH-2));
      # Keep only the text that follows the matched tag.
      XmlText = substr(XmlText, tpos+RLENGTH) ;
   }
   return tag

}

# ------------------------------------------------------------
# GetTagDatas
# Arg:          Tag Id
# Input/Output: var XmlText
# Output:       N/A
# Return:       Tag datas
# ------------------------------------------------------------

function GetTagDatas(tag    ,tpos,datas) {
   while ((tpos = match(tolower(XmlText), "</" tolower(tag) ">")) == 0) {
      # Closing tag not in the buffer yet: read another line, stop at end of file.
      if (ReadXmlText() == 0)
         break;
   }
   if (tpos != 0) {
      datas = substr(XmlText, 1, tpos-1);
      XmlText = substr(XmlText, tpos+RLENGTH) ;
   }
   return datas

}


# ============================================================
# P a t t e r n s / A c t i o n s . . .
# ============================================================

BEGIN { 
   GetTagList()
   tag = GetNextTag();
   while (tag != "") {
      if (tag in Tags) {
         print GetTagDatas(tag) >> (tag ".dat") ;
      }
      tag = GetNextTag();
   }
}

An example run with your data:
[tt]
/home/jp> xml.awk -v TAGS='mt,nedn' xml.txt
/home/jp> cat mt.dat
Resource,CPU load
Resource,RAM used
Resource,Disk used
Accounting,Radius events
/home/jp>
[/tt]


Jean Pierre.
 
Hello CaKiwi,

In your example script to locate and extract the tag data, the script prints the tag data line by line. What change is necessary to print the tag data to a file in a comma-separated format?

Thanks

Alf
 
aigles,

Firstly, thanks for your help. Secondly, would it be possible to get your script to print all the tag entries to one file, CSV formatted, named using the first tag:

file name : sk1_<this section can be appended>

The format should look something like the following:

Sitekeeper=SK1,200402191600,900,0,0,0 etc !!!!

Thanks in advance.

Regards

Alf

PS: I used to be able to do this sort of thing but haven't been working with scripts and parsing for some time. I have underestimated just how rusty I am.
 
Try

/<nedn>|<mts>|<gp>|<moid>|<r>/{gsub(/<\/?[^>]*>/,"");printf "%s%s",cma,$0;cma=","} END{print ""}

CaKiwi
 
Which tag is the first tag and how do you want the file to be named from it?

CaKiwi
 
The following will print the data in the order nedn,mts,gp,moid,r --- regardless of its order in the file.
A tag can start on one line and end on another.
[tt]
{whole=whole $0}

END { lineout=""
# If "neun" isn't the tag from which file should
# be named, change next line.
split( whole, temp, /<neun>|<\/neun>/ )
fileout=temp[2] ".txt"
split( "", tags, /,/ )
for (i=1;i in tags;i++)
{ tag=tags[i]
c=split( whole, temp, "<"tag">|</"tag">" )
for (j=2;j<=c;j=j+2)
lineout=lineout temp[j] ","
}
sub(/,$/, "", lineout )
print "Write to file? " fileout "\n" lineout
# Delete preceding line and
# uncomment next line if the filename suits you.
# print lineout >fileout
}
[/tt]
 
Oops. In the preceding program, change
[tt]
split( "", tags, /,/ )
[/tt]
to
[tt]
split( "nedn,mts,gp,moid,r", tags, /,/ )
[/tt]
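
One thing to watch with comma-separated output is that some values, such as the <moid> data, already contain commas. A possible way round that, sketched below with the correction above applied, is to wrap such fields in double quotes. The csvq helper, the sample.xml file name and the quoting convention itself are assumptions; whether quoting is right depends on what will read the file.

```shell
# Build a small test file (sample.xml is a hypothetical name).
cat > sample.xml <<'EOF'
<neid>
<neun>SK1</neun>
<nedn>System=MS, Sitekeeper=SK1</nedn>
</neid>
<mts>200402191600</mts>
<gp>900</gp>
<moid>Calls=1,Orig calls=1,SK release - no route</moid>
<r>0</r>
EOF

csv=$(awk '
# Hypothetical helper: quote a field that holds a comma or a
# double quote, doubling any embedded double quotes.
function csvq(s) {
    if (s ~ /[",]/) {
        gsub(/"/, "\"\"", s)
        s = "\"" s "\""
    }
    return s
}

# Collect the whole file into one string so tags may span lines.
{ whole = whole $0 }

END {
    n = split("nedn,mts,gp,moid,r", tags, /,/)
    for (i = 1; i <= n; i++) {
        tag = tags[i]
        # Even-numbered pieces are the text between <tag> and </tag>.
        c = split(whole, temp, "<" tag ">|</" tag ">")
        for (j = 2; j <= c; j += 2) {
            lineout = lineout sep csvq(temp[j])
            sep = ","
        }
    }
    print lineout
}' sample.xml)
printf '%s\n' "$csv"
```

On the sample above, the nedn and moid values come out wrapped in double quotes and the other three fields are left bare.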
 