XML Parser

Status
Not open for further replies.

ricgamch

Programmer
Jan 25, 2007
Hi People:

Could anybody tell me how I can do this with awk?
I have a file with some XML syntax:

#cat file
<SENT>word1 <ENT> ent1</ENT> word2 word3 word4<ENT>ent2</ENT> </SENT>
<SENT>word5 word6 word7 <ENT>ent3</ENT> word8 word9 word10<ENT>ent4</ENT></SENT>

I need a script that gets all the entities (entX) from a text file, together with the 2 words (wordX) before and the 2 words (wordX) after each entity.

I need to get this output:

# -----WORD WORD ENT WORD WORD------
word1 ent1 word2 word3
word3 word4 ent2
word6 word7 ent3 word8 word9
word9 word10 ent4

Thanks in advance and regards! =)

-ric
 
I am more familiar with Perl than awk, and can see a method for doing this, but I'll let you write the code. Here is what I would probably do:

Open the file and read it line by line in a loop
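On each line substitute <[^>]+> with ! (where [^>]+ means one or more characters other than '>', so each tag is matched on its own instead of everything from the first '<' to the last '>')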
"split" the line into an array using [! ] as the word separator (where [! ] means '!' or 'space' characters)
Go through the array looking for 'entX'
When found, output the contents of the 5 locations in the array (if they exist) around & including 'entX'
Repeat the loop
Close the files


I hope that helps to get you started.

Mike
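
The steps Mike lists can be sketched in awk roughly as follows. This is only an illustration under stated assumptions: the function name ent_context is made up for this example, and entities are recognised purely because they start with "ent", as they do in the sample data. Real entity text would need a different test.

```sh
# ent_context: hypothetical helper name, not from the thread.
# Reads lines from the named files (or stdin) and prints each entity
# with up to 2 words on either side.
ent_context() {
    awk '
    {
        gsub(/<[^>]+>/, "!")          # every tag becomes "!"
        sub(/^[! ]+/, "")             # trim leading separators
        sub(/[! ]+$/, "")             # trim trailing separators
        n = split($0, w, /[! ]+/)     # words and entities, in order
        for (i = 1; i <= n; i++) {
            if (w[i] !~ /^ent/) continue    # assumption: entities start with "ent"
            out = ""
            for (j = i - 2; j <= i + 2; j++)
                if (j >= 1 && j <= n)       # skip neighbours that do not exist
                    out = (out == "" ? "" : out " ") w[j]
            print out
        }
    }' "$@"
}

# Usage with the two sample lines from the question
printf '%s\n' \
  '<SENT>word1 <ENT> ent1</ENT> word2 word3 word4<ENT>ent2</ENT> </SENT>' \
  '<SENT>word5 word6 word7 <ENT>ent3</ENT> word8 word9 word10<ENT>ent4</ENT></SENT>' |
ent_context
```

On the sample input this produces the four lines asked for in the question. Note that a neighbouring entity inside the 5-word window is printed like any other word.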
 

Thanks for writing, Mike042.

Well, I can get all the ENT values from the file now:

#awk -f script.awk file
ent1
ent2
ent3
ent4

This is my code:

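#cat script.awk
BEGIN {
    FS = "<ENT>"    # every field after the first starts with an entity
}
{
    for (i = 1; i <= NF; i++) {
        # position of the closing tag in this field, 0 if there is none
        FIN = match($i, "</ENT>")
        if (FIN > 0) {
            ent = substr($i, 1, FIN - 1)    # text before </ENT>
            gsub(/^ +| +$/, "", ent)        # trim surrounding spaces
            print ent
        }
    }
}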



And I need it to return this:

pal1 ent1 pal2 pal3
pal3 pal4 ent2
pal6 pal7 ent3 pal8 pal9
pal9 pal10 ent4 pal11


Well, thanks!

-ric
 
Hi ricgamch,

This is awkward with a single awk field separator. From your example you need several input field separators (FS) at the same time:
<SENT> <ENT> </SENT> and </ENT>

How about replacing these with a single space for ease of processing? Then when you find a match, print the previous 2 words (if they exist), the current word ($i) and the next 2 words (if they exist).

I hope that helps.

Mike
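
A minimal sketch of this space-replacement idea in awk, under some assumptions that are not in the thread: the function name ent_context_marked and the "@@" marker are invented for the example, each entity is a single word, and no ordinary word starts with "@@". The marker is there because once the tags are replaced by spaces, nothing else distinguishes an entity from a plain word.

```sh
# ent_context_marked: hypothetical helper name, not from the thread.
ent_context_marked() {
    awk '
    {
        gsub(/<ENT>[ \t]*/, " @@")    # tag the word that follows <ENT>
        gsub(/<[^>]+>/, " ")          # every remaining tag becomes a space
        n = split($0, w, " ")         # split on runs of whitespace
        for (i = 1; i <= n; i++) {
            if (substr(w[i], 1, 2) != "@@") continue    # not an entity
            out = ""
            for (j = i - 2; j <= i + 2; j++) {
                if (j < 1 || j > n) continue    # neighbour does not exist
                word = w[j]
                sub(/^@@/, "", word)            # drop the marker before printing
                out = (out == "" ? "" : out " ") word
            }
            print out
        }
    }' "$@"
}

# Usage: the file from the question is passed as an argument
# ent_context_marked file
```

Unlike a test on the literal text "ent", this version does not depend on how the entities are spelled, only on the <ENT> tag in front of them.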
 