
awk- remove some words

Status
Not open for further replies.

ricgamch

Programmer
Jan 25, 2007
3
CR
Hi people,

Could anybody help me with this awk code?

I have a file with this syntax (in the real file each whole sentence is on one single line):

#cat file.xml

<instance id="bass.v.bnc.001" docsrc="BNC">
<context>
I went fishing for some sea <head>bass</head> .
</context>
</instance>


<instance id="bass.v.bnc.002" docsrc="BNC">
<context>
The <head>bass</head> part of the song is very moving.
</context>
</instance>


<instance id="program.v.bnc.001" docsrc="BNC">
<context>
he proposed an elaborate <head>program</head> of
public works . This
information was taken
</context>
</instance>


<instance id="program.v.bnc.002" docsrc="BNC">
<context>
the <head>program</head> required several hundred
lines of code .
</context>
</instance>


<instance id="smell.v.bnc.001" docsrc="BNC">
<context>
It 's making me annoyed .I did n't want to stay there
and I did n't want to go to Combe Court , cos I hate it and it <head>smells</head> and the Captain slobbers in his food and Christmas ishorrible with no good prezzies and Annie not there .Why did n't you visit me ? Why not ?
</context>
</instance>



I need a script that gets the 3 words before and the 3 words after the tag (<head> </head>) from a file, EXCEPT words of 2 characters or fewer (for example: of, a, an).


Returning this:
#----------------------------------
for some sea bass
The bass part the song
proposed elaborate program public works
the program required several hundred
cos hate and smells and the Captain


This is my current code, but it needs a fix
(I need to remove the "/context", "/instance" and "context" words from the output)

#cat solution.awk

/context/{flag=1} /\/context/{flag=0} !/context/{
if (flag==1)

gsub (/[,;:]/, " ", $0) ;
gsub (/[.]/, " . ", $0) }
/<.*>/ { for (i = 1; i <= NF; i++)
if ($i~/<.*>/) { s = substr ($i, 2, length($i)-2)
c = 0
for (j = i-1; j > 0 && c != 3 && $j != "." ; j--)
if (length($j)>2) { s = $j FS s ; c++ }
c = 0
for (j = i+1; j <= NF && c != 3 && $j != "." ; j++)
if (length($j)>2) { s = s FS $j ; c++ }
}
print s
}


I'm getting some extra text. This is the actual output:

#cat output
context
for some sea bass
/context
/instance
/instance
context
The bass part the song
/context
/instance
/instance
context
proposed elaborate program public works
/context
/instance
/instance
context
the program required several hundred
/context
/instance
/instance
context
cos hate and smells and the Captain
/context
/instance
#

Best Regards,
-ric


 
The /<.*>/ block runs on every line that contains a tag, not just the lines inside the context block. Maybe you should make it /<.*>/ && flag so that it only processes the lines between the context delimiters.
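A minimal sketch of that kind of flag guard, in case it helps. The function name context_lines is made up for illustration, and it assumes <context> and </context> each sit on their own line, as in the sample. The next on the opening tag matters: with the flag rules as posted, flag is already 1 on the <context> line itself, so a plain /<.*>/ && flag would still match that line.

```shell
# Hypothetical helper: print only the lines between <context> and </context>.
context_lines() {
    awk '
        /<context>/   { flag = 1; next }   # opening tag: start, but skip the tag line itself
        /<\/context>/ { flag = 0 }         # closing tag: stop
        flag          { print }            # only lines inside the block reach this rule
    ' "$@"
}
```

Called as context_lines file.xml, this only echoes the sentence lines, which is a quick way to check that the guard selects what you expect before adding the word-picking logic.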

Incidentally, your code is very difficult to read. Try indenting it more consistently so that you can see which block each statement belongs to, especially if you are going to use if and for statements without braces around their bodies, e.g.

Code:
/context/{flag=1}
/\/context/{flag=0}
!/context/{
    if (flag==1)
        gsub (/[,;:]/, " ", $0)
    gsub (/[.]/, " . ", $0)
}
/<.*>/ {
    for (i = 1; i <= NF; i++)
        if ($i~/<.*>/) {
            s = substr ($i, 2, length($i)-2)
            c = 0
            for (j = i-1; j > 0 && c != 3 && $j != "." ; j--)
                if (length($j)>2) { s = $j FS s ; c++ }
            c = 0
            for (j = i+1; j <= NF && c != 3 && $j != "." ; j++)
                if (length($j)>2) { s = s FS $j ; c++ }
        }
    print s
}
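Putting it together, here is one possible reworking, offered as a sketch rather than a drop-in fix. The names extract_context and emit are hypothetical. It assumes the tag appears exactly as <head>word</head> inside a single whitespace-delimited token, and it joins all lines of a context block before splitting, so a sentence wrapped over several lines is still handled. The word-length and sentence-boundary rules are copied from the posted code.

```shell
# Hypothetical helper: for each <context> block, print the <head> word with up to
# 3 words (longer than 2 characters) on either side, stopping at a "." boundary.
extract_context() {
    awk '
        /<context>/   { inside = 1; text = ""; next }   # start a new block, skip the tag line
        /<\/context>/ { inside = 0; emit(text); next }  # block finished: process what was collected
        inside        { text = text " " $0 }            # accumulate the lines of the block

        function emit(text,    n, w, i, j, c, s) {
            gsub(/[,;:]/, " ", text)      # drop minor punctuation
            gsub(/[.]/, " . ", text)      # make each full stop its own token
            n = split(text, w, " ")
            for (i = 1; i <= n; i++) {
                if (w[i] !~ /^<head>.*<\/head>$/) continue
                s = w[i]
                gsub(/<\/?head>/, "", s)  # keep only the tagged word
                c = 0
                for (j = i - 1; j > 0 && c < 3 && w[j] != "."; j--)
                    if (length(w[j]) > 2) { s = w[j] " " s; c++ }
                c = 0
                for (j = i + 1; j <= n && c < 3 && w[j] != "."; j++)
                    if (length(w[j]) > 2) { s = s " " w[j]; c++ }
                print s
            }
        }
    ' "$@"
}
```

Run it as extract_context file.xml. On the first sample instance it prints for some sea bass, and because nothing is printed outside emit, the stray context, /context and /instance lines do not appear.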

Annihilannic.
 