searching patterns

Status
Not open for further replies.

Guest_imported

The following script matches a pattern specified on the command line (e.g. gawk -f file.awk pattern="the" test.txt):

$0 ~ pattern {
    gsub(pattern, "|&|")
    print $0
}

It matches sentences like:

John is the greatest.
He's up there.

and turns this into:

John is |the| greatest.
He's up |the|re.

But in fact, I only want it to match the first sentence (John is |the| greatest.). Can someone tell me what I should add to my script so that only the word "the" is found, and not the sequence 't' 'h' 'e' inside words like "there", "them", "ether", etc.?

Thanks



 
{
    $0 = tolower($0)
    for (i = 1; i <= NF; i++) {
        if ($i ~ /the/ && length($i) == 3) {
            # G and x are never set: $x is $0 and gawk replaces only the first match
            print gensub(/the/, "|&|", G, $x)
        }
    }
}

Output:
|the| red man ran up the back stair
|the| red man ran up the back stair
a man and his honey in absconded in |the| night
"failed and failed again", said |the| reverend bitter.

You can see from the program output the problems with
any of these approaches.
#1: it doesn't get all matches of "the".
#2: it doesn't perform the substitution (magic "&")

comp.lang.awk is full of jerks, but they may be able to
help you out if no one here can come up with anything.
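
For what it's worth, a field-by-field version that does mark every exact match could look roughly like the sketch below. The helper name mark_fields and the variable word are made up for the example. It assumes words are separated by whitespace, so "the." with a trailing period is not matched, and assigning to a field makes awk rebuild the line with single spaces.

```shell
# Hypothetical helper: reads text on stdin, takes the word to mark as $1.
mark_fields() {
    awk -v word="$1" '
    {
        hits = 0
        for (i = 1; i <= NF; i++) {      # fields are numbered from 1
            if ($i == word) {            # exact comparison, so "there" is skipped
                $i = "|" $i "|"
                hits++
            }
        }
        if (hits) print                  # print only lines that were changed
    }'
}

printf '%s\n' 'the red man ran up the back stair' 'there is no match here' |
    mark_fields the
```

On that input only the first line is printed, with both occurrences marked.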
 
Hi,

You can adapt the following awk script for your definition of a word (see matchEre variable)

awk -v pattern=the -f SearchWord.awk test.txt

--- test.txt ---
there is two lines in the input file
three substitutions, one in the fist line and two in the last line
----------------

--- Output ---
there is two lines in |the| input file
three substitutions, one in |the| fist line and two in |the| last line
--------------


--- SearchWord.awk ---
BEGIN {
    # A word is the pattern bounded by whitespace or a line edge
    matchEre = "(^|[ \t]+)" pattern "($|[ \t]+)"
}
{
    if (pattern != "") {
        # A marked word no longer matches, so the loop ends once all are marked
        while (match($0, matchEre)) {
            wrk = substr($0, RSTART, RLENGTH)
            gsub(pattern, "|&|", wrk)
            $0 = substr($0, 1, RSTART-1) wrk substr($0, RSTART+RLENGTH)
        }
    }
    print $0
}
----------------------
Jean Pierre.
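
As an example of adapting the definition of a word, here is a rough sketch in which any character other than a letter, digit or underscore (or a line edge) ends a word, so "the." and "the," are also found. The function name mark_word and the variables out, rest and start are invented for the sketch, and it assumes the pattern is a plain literal word rather than a general regular expression.

```shell
# Hypothetical helper: reads text on stdin, takes a literal word as $1.
mark_word() {
    awk -v pattern="$1" '
    BEGIN {
        # Any character outside [A-Za-z0-9_], or a line edge, delimits a word.
        matchEre = "(^|[^A-Za-z0-9_])" pattern "([^A-Za-z0-9_]|$)"
    }
    {
        out = ""; rest = $0
        while (pattern != "" && match(rest, matchEre)) {
            start = RSTART
            # Step over the leading delimiter if the match included one.
            if (substr(rest, start, 1) !~ /^[A-Za-z0-9_]$/) start++
            out = out substr(rest, 1, start - 1) "|" pattern "|"
            # Leave the trailing delimiter in rest so it can open the next match.
            rest = substr(rest, start + length(pattern))
        }
        print out rest
    }'
}

printf '%s\n' 'John is the greatest.' 'the the end, the.' | mark_word the
```

Building the result in a separate string, instead of rewriting $0 in place, means two adjacent words that share a single delimiter are both marked.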
 
Aigles' suggestion is about the best you can do in awk (IMHO). Another option would be to use ex, which has the \< \> syntax to match the start and end of a word. Perl also has the \b word-boundary anchor (\w matches a single word character).
CaKiwi
 
This should work as well:

BEGIN { pattern = "the" }
$0 ~ pattern {
    gsub(" " pattern " ", " |" pattern "| ")
    print $0
}
 
Sorry Jescat, with your solution the pattern is not substituted when it is at the beginning or the end of the line.
Jean Pierre.
 
Oh, yes, you're right. I verified that the solution will work if you also check for the beginning and end of the line. However, the solution given above is cleaner.

Thanks for the correction
 
Hi,

CaKiwi's solution using ex works fine:

ex -s -c '%s/\<the\>/|the|/g' -c 'wq' test.txt
Jean Pierre.
 
So, is it not possible to use the \< \> syntax (cf. aigles' ex command line) in an AWK script to find a particular word like 'the'?

Stold
 
You can use the \< \> syntax with Gawk (the GNU Project's implementation of the AWK programming language).

BEGIN {
    matchEre = "\\<" pattern "\\>"
}
{
    if (pattern != "") {
        gsub(matchEre, "|&|", $0)
        print $0
    }
}
Jean Pierre.
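
A quick way to try this on the two sentences from the original question is sketched below. It assumes gawk is installed, since \< and \> are GNU extensions that other awk implementations may not accept (gawk also offers \y, which matches at either edge of a word). The wrapper name mark_word_gawk is made up for the example, and unlike the script above it prints only the lines that contain the whole word.

```shell
# Hypothetical helper; requires gawk because \< and \> are GNU regex extensions.
mark_word_gawk() {
    gawk -v pattern="$1" '
    BEGIN { matchEre = "\\<" pattern "\\>" }
    $0 ~ matchEre {                  # skip lines without the whole word
        gsub(matchEre, "|&|")
        print
    }'
}

printf '%s\n' 'John is the greatest.' "He's up there." | mark_word_gawk the
```

Only "John is |the| greatest." comes out; "He's up there." is skipped because "the" is not a whole word in it.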
 