Hi people,
Could anybody help me with this awk code?
I have a file with this syntax (the whole sentence is on one single line):
#cat file.xml
<instance id="bass.v.bnc.001" docsrc="BNC">
<context>
I went fishing for some sea <head>bass</head> .
</context>
</instance>
<instance id="bass.v.bnc.002" docsrc="BNC">
<context>
The <head>bass</head> part of the song is very moving.
</context>
</instance>
<instance id="program.v.bnc.001" docsrc="BNC">
<context>
he proposed an elaborate <head>program</head> of
public works . This
information was taken
</context>
</instance>
<instance id="program.v.bnc.002" docsrc="BNC">
<context>
the <head>program</head> required several hundred
lines of code .
</context>
</instance>
<instance id="smell.v.bnc.001" docsrc="BNC">
<context>
It 's making me annoyed .I did n't want to stay there
and I did n't want to go to Combe Court , cos I hate it and it <head>smells</head> and the Captain slobbers in his food and Christmas ishorrible with no good prezzies and Annie not there .Why did n't you visit me ? Why not ?
</context>
</instance>
I need a script that gets the 3 words before and the 3 words after the tag (<head> </head>) from a file, skipping words of 2 characters or fewer (for example: of, a, an).
Returning this:
#----------------------------------
for some sea bass
The bass part the song
proposed elaborate program public works
the program required several hundred
cos hate and smells and the Captain
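To make the selection rule concrete, here is a minimal sketch for sentences read from standard input. The function name `context_window` is made up for illustration. It assumes the whole sentence is on one line, that only "." ends a sentence, and that the tagged word appears as `<head>word</head>` in a single field.

```shell
# Hypothetical helper: for each input line, print up to 3 words longer than
# 2 characters on each side of the <head>...</head> word.
context_window() {
  awk '
    {
      gsub(/[,;:]/, " ")             # drop minor punctuation
      gsub(/[.]/, " . ")             # make each "." its own field
      for (i = 1; i <= NF; i++) {
        if ($i !~ /<head>.*<\/head>/) continue
        s = $i
        gsub(/<\/?head>/, "", s)     # strip the tags, keep the word
        c = 0
        # walk left until 3 words are kept or a "." is reached
        for (j = i - 1; j > 0 && c < 3 && $j != "."; j--)
          if (length($j) > 2) { s = $j " " s; c++ }
        c = 0
        # walk right with the same limits
        for (j = i + 1; j <= NF && c < 3 && $j != "."; j++)
          if (length($j) > 2) { s = s " " $j; c++ }
        print s
      }
    }'
}
```

For example, `printf '%s\n' 'The <head>bass</head> part of the song is very moving.' | context_window` should print `The bass part the song`.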
This is the current code, but it needs some fixes
(I need to remove the "/context", "/instance" and "context" words from the output)
#cat solution.awk
/context/ { flag = 1 }
/\/context/ { flag = 0 }
!/context/ {
    if (flag == 1)
        gsub(/[,;:]/, " ", $0);
    gsub(/[.]/, " . ", $0)
}
/<.*>/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /<.*>/) {
            s = substr($i, 2, length($i) - 2)
            c = 0
            for (j = i - 1; j > 0 && c != 3 && $j != "."; j--)
                if (length($j) > 2) { s = $j FS s; c++ }
            c = 0
            for (j = i + 1; j <= NF && c != 3 && $j != "."; j++)
                if (length($j) > 2) { s = s FS $j; c++ }
        }
    print s
}
I'm getting some extra text; this is the actual output:
#cat output
context
for some sea bass
/context
/instance
/instance
context
The bass part the song
/context
/instance
/instance
context
proposed elaborate program public works
/context
/instance
/instance
context
the program required several hundred
/context
/instance
/instance
context
cos hate and smells and the Captain
/context
/instance
#
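Judging from the script, the extra lines seem to come from two things. The `/<.*>/` rule fires on every tagged line, so `<context>` and `</instance>` end up as values of `s`. And `print s` runs even when no field matched (as on the `<instance ...>` line), which reprints the previous value. A possible guard is sketched below; `head_words` is a hypothetical name, and it assumes the tagged word always appears as `<head>word</head>` inside one field. It only prints the tagged word, to keep the filtering visible.

```shell
# Hypothetical helper: print only the word wrapped in <head>...</head>,
# ignoring <instance>, <context> and their closing tags.
head_words() {
  awk '
    /<head>/ {                        # only lines that contain a head tag
      for (i = 1; i <= NF; i++)
        if ($i ~ /<head>.*<\/head>/) {
          s = $i
          gsub(/<\/?head>/, "", s)    # strip the tags, keep the word
          print s                     # print only when a head word was found
        }
    }' "$@"
}
```

Called as `head_words file.xml`, this should emit one line per tagged word and nothing for the other tag lines.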
Best Regards,
-ric