extracting more than one word...

Status
Not open for further replies.

cyphrix

Programmer
Nov 16, 2006
27
US
Thanks to some help from PaulTEG and others, I have finally gotten my extraction program to extract the words I want, with the exception that I cannot extract as many words as I want after the word I want is found.

Basically, I am matching the word 'located' in a text document and I want to extract up to 10 words after it. I have used a (\w+) capture group, but it only gives me one word after 'located'. When I try adding a second (\w+), it actually takes away the first word. If anyone knows how to capture up to ten words after the matched word, please help. This is my code thus far:

Code:
open(INFILE, "utah.txt") or die "The file cannot be found.";
open(OUTFILE, ">>utah_locations.txt") or die "The output file cannot be opened.";
$count = 0;

while(<INFILE>) {
     chomp($_);
     if(/located (\w+)/g) {
          print OUTFILE "located $1\n";
          $count++;
     }
}

print "There were $count matches";
close INFILE;
close OUTFILE;


Now, this pulls and prints the word 'located' and one word after it. I would like to be able to pull up to ten words after it so instead of getting:


located at
located in
located around
located near
located beside

I would get:

located at the corner of Marbach and Maple
located in the deli inside the Supermarket on Broadway.
located around the corner at the bakery on Patterson Avenue.


Any help is greatly appreciated.

 
Try:

Code:
if(/located (\w+\b){1,10}/g) {

Looks for \w (word character) followed by a word boundary (a space or other character), and looks for 1 to 10 of this pattern.

-------------
Kirsle.net | Kirsle's Programs and Projects
 
A word boundary is slightly different to what you think. It's a "zero-width assertion", which means that it doesn't match any characters. It checks that there is a word character (something that \w would match) on one side of it and a non-word (\W) on the other. You can think of it as matching the place between two characters.

For that reason, the regexp you supply won't work, because there's nothing in there that will match spaces.
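To see the zero-width behaviour concretely, here is a small sketch that runs both styles of pattern against one of the sample lines from the question. The variable names are made up for illustration, and the first pattern is a non-capturing variant of the one suggested above so that the whole repeated run is captured.

```perl
use strict;
use warnings;

my $text = "located at the corner of Marbach and Maple";

# \b consumes no characters. After "at", the next repetition needs
# another \w, but the next character is a space, so the repeat stops.
my ($with_boundary) = $text =~ /located ((?:\w+\b){1,10})/;

# Putting the space inside the repeated group lets each repetition
# consume " word", so the match can keep going.
my ($with_space) = $text =~ /located((?: \w+){1,10})/;

print "boundary only:  '$with_boundary'\n";
print "space in group: '$with_space'\n";
```

The first capture stops after a single word, while the second picks up every following word (with a leading space, because the space is part of the group).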

This might work, though it's untested (I'm assuming there are just spaces between the words, as in the examples):
Code:
if(/located((?: \w+){1,10})/g) {
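
As a rough sketch of how that pattern could be wired into the extraction, the snippet below wraps it in a helper and runs it over a few in-memory lines instead of utah.txt. The helper name words_after_located and the sample lines are hypothetical, and \s+ is used in place of a literal space on the assumption that tabs or repeated spaces may occur.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "located" plus up to $max following words,
# or undef when "located" is not followed by at least one word.
sub words_after_located {
    my ($line, $max) = @_;
    $max = 10 unless defined $max;
    return unless $line =~ /\blocated((?:\s+\w+){1,$max})/;
    return "located$1";    # $1 already starts with the whitespace
}

my @lines = (
    "The shop is located at the corner of Marbach and Maple",
    "It is located in the deli inside the Supermarket on Broadway.",
    "Nothing to see here",
);

my $count = 0;
for my $line (@lines) {
    my $hit = words_after_located($line);
    next unless defined $hit;
    print "$hit\n";
    $count++;
}
print "There were $count matches\n";
```

Note that \w does not match punctuation, so a trailing full stop is left out and a comma in the middle of the phrase would end the match early. In the original loop, the same idea amounts to printing "located$1" rather than "located $1", since the capture begins with the space.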
 