Including new lines when extracting strings

cyphrix · Nov 29, 2006

The code below searches for the word 'located' in a text file. When it finds the word, it looks for 1 to 10 words after it, extracts the string, and writes the string to a file. The problem is that the current script stops collecting words when it reaches a line break, even if it has not reached the 10th word yet. I would like it to continue on to the next line, up to the 10th word, and then extract. Right now I am getting results that look like this:

located around

located near the corner of West 86 and

located in the vicinity of

I would like to be able to get all ten words after 'located'. I am not sure if I'm using the '\n' operator correctly in the pattern or if that's even how I should do it. Any help is appreciated.

Code:

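open(INFILE, "TheTextFile.txt") or die "Cannot open TheTextFile.txt: $!";
open(OUTFILE, ">>locations.txt") or die "Cannot open locations.txt: $!";

# Process the file one line at a time
while(<INFILE>) {
     chomp;
     if (/located((?: \w+\n?|\W+\n?\w+\n?){1,10}\n?)+/g) {
          $loc = "$1\n\n";
          print OUTFILE "located $loc";
     }
}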

KevinADC · Nov 29, 2006

see if this works:

Code:

if (/located((\b\B\b){1,10})/mg) {

uses the 'm' modifier (multiline) and \b (word boundary) and \B (non-word boundary).

- Kevin, perl coder unexceptional!
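
For readers following along, here is a small sketch of what those pieces do on their own. The sample string is made up for illustration. As far as I know, the 'm' modifier only changes where ^ and $ are allowed to match, and \b and \B match positions rather than characters, so neither is what lets a match run past a newline.

```perl
use strict;
use warnings;

# Made-up two-line sample string.
my $text = "located near\nthe corner";

# /m only lets ^ and $ match at the start and end of every line.
my @with_m    = $text =~ /^(\w+)/mg;   # first word of each line
my @without_m = $text =~ /^(\w+)/g;    # first word of the string only

# \b and \B are zero-width and exclude each other at any one position,
# so a pattern that asks for both in a row cannot match.
my $both = ($text =~ /\b\B/) ? 1 : 0;

# A newline is ordinary whitespace to \s; no modifier is needed,
# as long as the newline is actually inside the string being searched.
my $spans = ($text =~ /near\s+the/) ? 1 : 0;

print "with /m:    @with_m\n";
print "without /m: @without_m\n";
print "\\b\\B matches: $both\n";
print "\\s+ crosses the newline: $spans\n";
```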

cyphrix · Nov 29, 2006

thanks Kevin. Will try this out and see what happens.

cyphrix · Nov 29, 2006

Kevin,

I tried the word boundary and non-word boundary. It ended up shortening the results that I have somehow.

The pattern that I have right now works, it just doesn't include newline breaks and I don't know how to include newlines in a pattern so that it will go to the next line and read up to the 10th word. I tried \n, \n+, \n*...to no avail.

stevexff · Nov 29, 2006

I don't think it will work, because $_ is only as long as the current line. You'd have to slurp the whole file into a single scalar...

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)
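
To illustrate Steve's point, here is a minimal sketch that runs the original pattern two ways over the same made-up text: once line by line, and once against the whole text as a single scalar. The sample text and variable names are assumptions for the demo; an in-memory filehandle stands in for TheTextFile.txt.

```perl
use strict;
use warnings;

# Made-up sample; the phrase after 'located' wraps onto a second line.
my $text = "It is located near the corner of West 86 and\n"
         . "Broadway by the old post office.\n";

# The pattern from the original post, without the \n? additions.
my $pattern = qr/located((?: \w+|\W+\w+){1,10})/;

# 1) Line by line: $_ holds one line, so the match stops at the line end.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @per_line;
while (<$fh>) {
    chomp;
    push @per_line, $1 if /$pattern/;
}
close($fh);

# 2) Whole text in one scalar: \W+ matches the newline like any other
#    non-word character, so the match continues onto the next line.
my ($whole) = $text =~ /$pattern/;
$whole =~ s/\n/ /g;    # turn the captured line break into a space

print "line by line: located$per_line[0]\n";
print "whole text:   located$whole\n";
```

The pattern itself is the same in both cases; only the amount of text it is given changes.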

KevinADC · Nov 29, 2006

how big is the file you are searching through?

- Kevin, perl coder unexceptional!

cyphrix · Nov 29, 2006

It's a 3.5 MB file.

I was thinking that if I could just turn the whole file into one long string that it could work that way...unfortunately, I'm learning as I go and don't know how to do that exactly. Hope I'm not being too much of a pest here.

KevinADC · Nov 29, 2006

Well, 3.5 MB is a little large but should be doable as a string (untested code):

Code:

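my @matches = ();
open(INFILE, '<', "TheTextFile.txt") or die "Cannot open TheTextFile.txt: $!";
# Undefine the input record separator so <INFILE> returns the whole file
my $text  = do {local $/; <INFILE>};
close(INFILE);
# \W+ also matches newlines, so the ten words may span several lines
@matches = $text =~ /located\W+(\w+(?:\W+\w+){0,9})/g;
if (@matches) {
   s/\n/ /g for @matches;
   open(OUTFILE, '>>', "locations.txt") or die "Cannot open locations.txt: $!";
   print OUTFILE "located $_\n" for @matches;
   close(OUTFILE);
}
else {
   print "No matches found.\n";
}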

- Kevin, perl coder unexceptional!

cyphrix · Nov 29, 2006

Thanks again Kevin. Will try this out and report back.

cyphrix · Nov 30, 2006

OK, so this is my "semi-final" script. It does what I need: it collects every document from the input file in @data and writes it to an output file, closes that file, reopens it, and runs the extraction against it. I know it may be a little inefficient, but it works. The drawback is that I end up creating two files instead of the one I'd like: the intermediate output file, which has to be closed and reopened in the format it was saved in, and the final results file. If anyone has ideas on how to make this look or work a little better, please let me know. Thanks to you all for your help.

Code:

# Locations Extractor
use strict;
use warnings;

my @data;

print "\n\n\n";
print "Which file would you like to process?:  ";
chomp(my $File = <STDIN>);     # drop the newline typed after the name

print "\n\n";
print "What would you like to name the processed file?:  ";
chomp(my $Output = <STDIN>);

open(INFILE, '<', $File) or die "Cannot open $File: $!";
open(OUTFILE, '>>', $Output) or die "Cannot open $Output: $!";

# Each document in the text file starts with *** Document
# and ends with ===EOD===. For each document found, put
# it all in @data and print it to the output file.
while(<INFILE>) {
     chomp;
     if(/\*\*\* Document/ .. /===EOD===/) {
          push @data, $_;
          # The trailing space keeps the last word of one line
          # from fusing with the first word of the next.
          print OUTFILE "$_ ";
     }
}
close (INFILE);
close (OUTFILE);

print "\n\n";
print "Name the file to be processed.:  ";
chomp(my $locfile = <STDIN>);

# Reopen the file that was created above.
open(INPUT, '<', $Output) or die "Cannot open $Output: $!";
open(OUTFILE2, '>>', $locfile) or die "Cannot open $locfile: $!";

# For each instance of the word 'located' and ten words
# after it, push it to @data and print it to output file.
while(<INPUT>) {
     chomp;
     foreach(/located((?: \w+|\W+\w+){1,10})/g) {
          push @data, $_;
          print OUTFILE2 "located$_\n";
     }
}

print "\n\n\n";
print "Location Extraction Complete.";
print "\n\n\n";
close (INPUT);
close (OUTFILE2);

Now there are a few things that I know I still have to add like a warning that a file wasn't processed correctly and a counter, but for now I'm just trying to figure out a better way of writing this. Again, thanks to those who helped get me this far...
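
One possible way to avoid the intermediate file, offered only as a sketch: read the whole input into one scalar (as with the `do {local $/; <INFILE>}` idiom earlier in the thread), pull out each document between the markers, and search each document directly. The helper name `extract_locations` and the sample text are hypothetical, and this version leaves the marker lines out of the searched text, which is an assumption about what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole input file as one string and
# returns one "located ..." phrase per match found inside a document.
sub extract_locations {
    my ($text) = @_;
    my @phrases;

    # Capture the text between "*** Document" and "===EOD===".
    # /s lets . match newlines, so a document may span many lines.
    while ($text =~ /\*\*\* Document(.*?)===EOD===/sg) {
        my $doc = $1;

        # \W+ matches spaces, punctuation and newlines alike, so the
        # words after 'located' may continue onto following lines.
        while ($doc =~ /located\W+(\w+(?:\W+\w+){0,9})/g) {
            my $words = $1;
            $words =~ s/\s+/ /g;    # collapse line breaks into single spaces
            push @phrases, "located $words";
        }
    }
    return @phrases;
}

# Made-up sample input, standing in for the slurped file.
my $sample = <<'END';
*** Document 1
The store is located near the corner of West 86 and
Main Street next to the old mill.
===EOD===
ignored text: located outside any document
*** Document 2
It is located around
the bend.
===EOD===
END

print "$_\n" for extract_locations($sample);
```

With a real file, the returned list could be printed straight to the single results file, so nothing has to be written out and read back in.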
