extracting more than one word...

cyphrix · Nov 17, 2006

Thanks to some help from PaulTEG and others, I have finally gotten my extraction program to extract the words I want, with the exception that I cannot extract as many words as I want after the word I want is found.

Basically, I am matching the word 'located' in a text document and I want to extract up to 10 words after it. I have used a (\w+) capture group; however, it only gives me one word after 'located'. When I try putting in a second (\w+) it actually takes away the first word. If anyone knows how to capture up to ten words after the matched word, please help. This is my code thus far:

Code:

open(INFILE, "utah.txt") or die "The file cannot be found.";
open(OUTFILE, ">>utah_locations.txt");
$count = 0;

while(<INFILE>) {
     chomp($_);
     if(/located (\w+)/g) {
          print OUTFILE "located $1\n";
          $count++;
     }
}

print "There were $count matches";
close INFILE;
close OUTFILE;

Now, this pulls and prints the word 'located' and one word after it. I would like to be able to pull up to ten words after it so instead of getting:

located at
located in
located around
located near
located beside

I would get:

located at the corner of Marbach and Maple
located in the deli inside the Supermarket on Broadway.
located around the corner at the bakery on Patterson Avenue.

Any help is greatly appreciated.

Kirsle · Nov 17, 2006

Try:

Code:

if(/located (\w+\b){1,10}/g) {

Looks for \w (word character) followed by a word boundary (a space or other character), and looks for 1 to 10 of this pattern.

ishnid · Nov 17, 2006

A word boundary is slightly different to what you think. It's a "zero-width assertion", which means that it doesn't match any characters. It checks that there is a word character (something that \w would match) on one side of it and a non-word (\W) on the other. You can think of it as matching the place between two characters.

For that reason, the regexp you supply won't work, because there's nothing in there that will match spaces.
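To make the zero-width point concrete, here is a small sketch (the sample sentence and variable names are made up for illustration). It runs the pattern with \b against one line and shows that the repetition stops after a single word, because nothing in the group can consume the space that follows it:

```perl
use strict;
use warnings;

# Hypothetical sample line; not taken from the real utah.txt.
my $line = "The shop is located at the corner of Marbach and Maple";

# \b matches the position between "t" and " " but consumes neither,
# so the second pass of the group meets a space, \w+ fails, and the
# {1,10} repetition ends after one word.
my ($with_boundary) = $line =~ /located (\w+\b){1,10}/;

print "captured: '$with_boundary'\n";
```

On this input the capture should be just 'at', which is the same one-word result as the original (\w+).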

This might work, though it's untested (I'm assuming there are just spaces between the words, as in the examples):

Code:

if(/located((?: \w+){1,10})/g) {
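For anyone wanting to try this without the input file, here is a rough, self-contained sketch of the same idea. The sample lines and the @found array are assumptions for illustration only; the real program would read from utah.txt and print to OUTFILE as in the original post.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of utah.txt.
my @lines = (
    "The cafe is located at the corner of Marbach and Maple",
    "It is located in the deli inside the Supermarket on Broadway.",
    "Nothing relevant on this line",
);

my @found;
for my $line (@lines) {
    # (?: \w+) groups "space + word" without capturing it; the outer ()
    # captures the whole run of 1 to 10 such words as $1, leading space
    # included, so "located$1" needs no extra space.
    if ($line =~ /located((?: \w+){1,10})/) {
        push @found, "located$1";
    }
}

print "$_\n" for @found;
print "There were ", scalar(@found), " matches\n";
```

The non-capturing group matters here: a repeated capturing group such as ( \w+){1,10} would only keep its last repetition in $1. Note also that \w does not match punctuation, so the match stops at the first comma, period, or apostrophe, and the trailing period in the second sample line is not captured.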
