Pattern matching: keywords in contexts

jupiler · Dec 4, 2003

Hello,

I'm trying to retrieve the contexts of each keyword in an array. The problem is that I only get the right output for the last keyword in the array. Example:

FILE1:

tree
apple

FILE2:

This is a tree.
The aPple is green.
It looks like a tree.

should become:

**
<KEYWORD>tree
<CONTEXT>This is a tree.
<CONTEXT>It looks like a tree.
**
<KEYWORD>apple
<CONTEXT>The aPple is green.

I have tried to get this format with the following script, but as I said, it refuses to do the job properly ;-):

open(FILE1, "FILE1.txt") ||die;
@Keywords=<FILE1>; #makes list of keywords
close(<FILE1>);
open(FILE2, "FILE2.txt") ||die;
@Contexts=<FILE2>; # makes list of contexts
for($i=0; $i<=scalar(@Keywords); $i){ # for each keyword:
@foo=grep(/$Keywords[$i]/i, @Contexts); # get contexts
print "**\n"; # print FS
print "<KEYWORD>$Keywords[$i]\n"; # print the keyword
for($j=0; $j<=scalar(@foo); $j++){ # print the contexts
print "<CONTEXT>$foo[$j]";
delete $foo[$j]; # remove it from list
}
}

What should I change about this script in order to make it work? Thanks in advance. J

mikevh · Dec 5, 2003

You were close. The biggest problem was that all of the elements in your @Keywords array had
line-termination chars at the end (newline on Unix, return-newline on Windows), but the matching words
in @Contexts didn't, so your grep didn't work and thus @foo was empty.

I'd recommend that you always chomp files when reading, and put the "\n" back on when writing or
printing. Leaving the line-terminators on is almost never a good idea.
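
A minimal sketch of the chomp point, with the data inlined instead of read from files so it runs anywhere (the lowercase variable names are illustrative, not from the original script):

```perl
use strict;
use warnings;

# Simulate reading both files without chomp: every element keeps its "\n".
my @keywords = ("tree\n", "apple\n");
my @contexts = ("This is a tree.\n", "The aPple is green.\n", "It looks like a tree.\n");

# The pattern is "tree\n", but in the text "tree" is followed by ".",
# so nothing matches.
my $raw    = $keywords[0];
my @before = grep { /$raw/i } @contexts;

chomp @keywords;   # strip the trailing newline from every element
chomp @contexts;

my $clean = $keywords[0];
my @after = grep { /$clean/i } @contexts;

printf "before chomp: %d match(es)\n", scalar @before;   # before chomp: 0 match(es)
printf "after chomp:  %d match(es)\n", scalar @after;    # after chomp:  2 match(es)
```

If the last line of FILE1 happened to have no trailing newline, that would also explain why only the last keyword produced the right output.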

Other points:

You had "close(<FILE1>)" which gave a compile error. <> around a filehandle means "read from it"; close just takes the handle itself, i.e. "close(FILE1)".
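
To make the read/close distinction concrete, here is a small sketch. It assumes a lexical filehandle and an in-memory string standing in for FILE1.txt (both are choices made for this illustration, not part of the original script):

```perl
use strict;
use warnings;

# An in-memory "file" stands in for FILE1.txt so the sketch runs anywhere.
my $data = "tree\napple\n";

open(my $fh, '<', \$data) or die "Cannot open keyword list: $!";
my @keywords = <$fh>;   # <$fh> in list context reads every remaining line
close($fh);             # close takes the handle itself, no angle brackets
chomp @keywords;

print join(',', @keywords), "\n";   # tree,apple
```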

You explicitly close FILE1, but don't close FILE2. It isn't really necessary to close either, though it's
good practice, but it's nice to be consistent.

The termination tests on your for loops both said "<=scalar(@array)". An array in scalar context returns the
number of elements in the array, but array indices start at 0, so when the index is one less than that
count, you're at the last element. Your termination test should be "< @array", not "<= @array". It also
isn't necessary to use the scalar function here, as using an array name with a numeric operator causes
the array to be evaluated in scalar context. Not wrong, just unnecessary. Also, you never
incremented $i in your "for $i" loop.
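
The loop-bound point can be checked with a short sketch (the three-element @foo is made-up sample data):

```perl
use strict;
use warnings;

my @foo = ('a', 'b', 'c');

my $count = @foo;    # array in scalar context: number of elements
my $last  = $#foo;   # index of the last element, always $count - 1

# "<" stops at the last element; "<=" would also visit index 3,
# which does not exist and yields undef.
my @visited;
for (my $j = 0; $j < @foo; $j++) {
    push @visited, $foo[$j];
}

print "count=$count last=$last visited=@visited\n";   # count=3 last=2 visited=a b c
```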

By the way, C-style for loops like this aren't used so much in Perl. I left them as I figure you have your reasons for using them, e.g., you want practice using them or your instructor (if this is homework) wants you to use them.
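
For comparison, a sketch of the same nested loops written with foreach, which avoids index arithmetic altogether. The data is inlined and the output is collected in $report (a name introduced here) purely so the example is self-contained:

```perl
use strict;
use warnings;

my @keywords = ('tree', 'apple');
my @contexts = ('This is a tree.', 'The aPple is green.', 'It looks like a tree.');

my $report = '';
foreach my $keyword (@keywords) {
    $report .= "**\n<KEYWORD>$keyword\n";
    # grep returns the matching contexts; iterate over them directly
    foreach my $context (grep { /$keyword/i } @contexts) {
        $report .= "<CONTEXT>$context\n";
    }
}
print $report;
```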

Why delete $foo[$j]? Deleting array elements while looping over them is not a good idea.
It also served no purpose that I could see.

Things to consider:

Your code makes no provision for the possibility that a keyword in FILE1 might not appear anywhere
in FILE2, or that a keyword might appear more than once in FILE1. Can these things happen?
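
If they can, one possible way to handle both cases is sketched below. The helper name keyword_report is hypothetical, and skipping keywords that have no context is only one option (you might prefer to print them with an empty context list). The \Q...\E escape is an extra precaution so a keyword containing regex metacharacters is matched literally:

```perl
use strict;
use warnings;

# Hypothetical helper: builds the report as a string from two array refs.
sub keyword_report {
    my ($keywords, $contexts) = @_;
    my %seen;
    my $report = '';
    foreach my $keyword (@$keywords) {
        next if $seen{lc $keyword}++;                      # skip repeated keywords
        my @hits = grep { /\Q$keyword\E/i } @$contexts;    # match the keyword literally
        next unless @hits;                                 # skip keywords with no context
        $report .= "**\n<KEYWORD>$keyword\n";
        $report .= "<CONTEXT>$_\n" for @hits;
    }
    return $report;
}

# "tree" appears twice and "pear" has no context; both are handled quietly.
print keyword_report(
    ['tree', 'apple', 'tree', 'pear'],
    ['This is a tree.', 'The aPple is green.', 'It looks like a tree.'],
);
```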

I've made the minimal modifications to your code to make it give the output you asked for with the
data you provided. Note that I changed the names of the input files for my convenience.
Change them back to match what you've got.

Here's the output I got running the program:

**
<KEYWORD>tree
<CONTEXT>This is a tree.
<CONTEXT>It looks like a tree.
**
<KEYWORD>apple
<CONTEXT>The aPple is green.

Code follows

Code:

open(FILE1, "kw1.txt")||die;
@Keywords=<FILE1>;
chomp @Keywords; #remove line-terminators

close(FILE1);

open(FILE2, "kw2.txt")||die;
@Contexts=<FILE2>;
chomp @Contexts;   #remove line-terminators

for($i=0; $i< @Keywords; $i++) {   # for each keyword:
  @foo=grep(/$Keywords[$i]/i, @Contexts); # get contexts
  print "**\n";                         # print FS
  print "<KEYWORD>$Keywords[$i]\n";     # print the keyword
  for($j=0; $j< @foo; $j++) {    # print the contexts
    print "<CONTEXT>$foo[$j]\n";
  }
}

jupiler · Dec 5, 2003

Thanks mikevh,

Great feedback, I appreciate that!

>>I'd recommend that you always chomp files when reading, and put the "\n" back on when writing or
printing. Leaving the line-terminators on is almost never a good idea.

J: ok, set to automatic pilot from now on ;-)

>>Also you never incremented $i in your "for $i" loop.

J: Was a typo, my mistake (sorry)

>>By the way, C-style for loops like this aren't used so much in Perl. I left them as I figure you have your reasons for using them, e.g., you want practice using them or your instructor (if this is homework) wants you to use them.

J: It is not homework. I used to do some programming in Gawk (which may explain the style of the for loops), but I'm learning Perl (on my own) now.

>>I've made the minimal modifications to your code to make it give the output you asked for with the
data you provided. Note that I changed the names of the input files for my convenience.
Change them back to match what you've got.

J: Thanks, it works perfectly ok now. And, more importantly, I can see where my script went wrong.

Cheers. J
