creating co-occurrence matrix from raw text

Status
Not open for further replies.

lillyth

Programmer
Aug 11, 2008
17
DE
Hi!

I need to create a co-occurrence matrix from a text file. So far I have a term extractor that, given the file (data.txt), returns a file with the relevant terms (term.txt). From these two I would now like to create a co-occurrence matrix using a window of size w. I am guessing that the algorithm will look something like:

for every term t in data.txt
    if (t co-occurs with a term s from term.txt within w terms)
        count(t, s)++;

The output should be a text file with one line for each pair of terms i and j:
term i, term j, count(term i, term j)
I'm guessing some kind of stemming is necessary?
Any ideas?

/lillyth
 
Sorry, the pseudocode above can be simplified further (and made faster). I'll write the new version in Perl (untested), since by now it is not much different from pseudocode.
The main change is that the hash %terms_in_window is no longer used.
Before running the code, the following data must be prepared:
- a hash %term containing all the terms as keys, with a unique index ranging from 0 to t-1 as values
- an array @data containing all the words in data.txt (in the same order)
- the array @term_values (as above) is not needed by the code below, but should be prepared for output purposes
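As a rough illustration of the preparation listed above, here is a minimal sketch. The inline strings are stand-ins for the contents of term.txt and data.txt, and the tokenising rule (lowercase, split on non-word characters) is an assumption, since the thread does not say how words are normalised.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for term.txt (assumed: one term per line) and data.txt.
my $term_text = "the\nbrown\nfox\n";
my $data_text = "The quick brown fox jumps over the lazy dog.";

# @term_values: index -> term, kept for printing results later
my @term_values = grep { length } split /\n/, $term_text;

# %term: term -> unique index in 0 .. t-1
my %term;
@term{@term_values} = (0 .. $#term_values);

# @data: every word of the text in order, lowercased
# (assumption: case-insensitive matching is wanted)
my @data = map { lc($_) } ($data_text =~ /(\w+)/g);

print scalar(@data), " words, ", scalar(keys %term), " terms\n";
```

With real files, the two strings would be read from disk instead; everything after that stays the same.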
Code:
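# assumes %term, @data and $window_size have already been set up
my $m = 0;      # current slot in the circular window buffer
my @window;     # term index for each of the last $window_size words (-1 = not a term)
my @cooccur;    # lower-triangular counts: $cooccur[$i][$j] with $i >= $j
for my $word (@data) {
  $window[$m] = -1;                 # clear the slot for the current word
  if (exists $term{$word}) {
    my $i = $term{$word};
    my $n = $m + 1;
    while (1) {                     # visit every other slot in the window
      $n = 0 if $n > $#window;
      last if $n == $m;
      my $j = $window[$n];
      if ($j >= 0) {                # slot holds a term: count the pair once
        if ($i >= $j) {
          $cooccur[$i][$j]++;
        } else {
          $cooccur[$j][$i]++;
        }
      }
      $n++;
    }
    $window[$m] = $i;
  }
  $m++;
  $m = 0 if $m >= $window_size;
}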
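
To get from the matrix to the "term i, term j, count" file asked for in the original question, something like the following might do. It is only a sketch: the sample counts are made up for illustration, and it assumes the counts live in the lower triangle ($i >= $j), as in the loop above. The `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the results of the counting loop.
my @term_values = ('the', 'brown', 'fox');
my @cooccur;
$cooccur[1][0] = 2;   # brown / the
$cooccur[2][1] = 4;   # fox / brown

# Only the lower triangle (i >= j) is filled, so walk just that half.
my @lines;
for my $i (0 .. $#term_values) {
    for my $j (0 .. $i) {
        my $count = $cooccur[$i][$j] // 0;   # unset cells count as zero
        next unless $count;                  # skip pairs that never co-occur
        push @lines, "$term_values[$i], $term_values[$j], $count";
    }
}
print "$_\n" for @lines;
```

Printing to a file rather than STDOUT is just a matter of opening a filehandle and printing to that instead.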

Franco
 
I thought I'd try to make a working example. My 'terms' are single words, the assumption being that you can use your term parser to extract them if they are more complex than that. The terms we are interested in are loaded from __DATA__. The 'data file' is faked by replicating an array. The words in this 'data file' are mapped into an integer array. Fnords are words we aren't interested in, but we still need a placeholder for them so that the window positions stay correct.
Perl:
#!/usr/bin/perl
use strict;
use warnings;

my $wsize = 2;                 # window size
my $fnord = -1;

my @data = qw{the quick brown fox jumps over the lazy dog};

push @data, @data for (0..10); # fake about 18K entries

my @terms = <DATA>;            # get the terms list
chomp @terms;

my %terms;

for (my $i = 0; $i < @terms; $i++) {
   $terms{$terms[$i]} = $i;    # load the terms mapping hash
}

my @matrix;

for my $i (0..$#terms) {
   for my $j (0..$#terms) {
      $matrix[$i][$j] = 0;     # initialise result matrix
   }
}

# convert terms into integer values once, for speed

my @map = map{exists $terms{$_} ? $terms{$_} : $fnord} @data;

my @fnords;
push @fnords, $fnord for (1..$wsize);  

@map = (@fnords, @map, @fnords);  # pad with fnords at either end

for my $i ($wsize .. $#map - $wsize) {
   # skip non-terms: a row index of -1 would silently update the LAST row
   next if $map[$i] == $fnord;
   my @slice = (@map[$i - $wsize .. $i - 1], @map[$i + 1 .. $i + $wsize]);
   foreach (@slice) {
      $matrix[$map[$i]][$_]++ unless ($_ == $fnord);
   }
}

# print it

print "Window size: $wsize\n\n";
print "\t", join("\t", @terms), "\n";
for my $i (0 .. $#terms) {
   print "$terms[$i]\t";
   for my $j (0 .. $#terms) {
      print "$matrix[$i][$j]\t";
   }
   print "\n";
}

__DATA__
the
brown
fox
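
As a sanity check on the windowing logic, here is a sketch that runs the same idea on the single, un-replicated sentence, where the counts are small enough to verify by hand. The helper name count_cooccurrences is hypothetical, and it clamps the window at the ends of the data instead of padding with fnords; only words that are terms get a row.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_cooccurrences is a hypothetical helper name, not from the thread.
# Returns a reference to a terms x terms matrix of counts.
sub count_cooccurrences {
    my ($data, $terms, $wsize) = @_;
    my %index;
    @index{@$terms} = (0 .. $#$terms);
    my @matrix = map { [ (0) x @$terms ] } @$terms;   # zero-filled

    for my $i (0 .. $#$data) {
        my $row = $index{ $data->[$i] };
        next unless defined $row;            # centre word is not a term
        my $lo = $i - $wsize < 0       ? 0       : $i - $wsize;
        my $hi = $i + $wsize > $#$data ? $#$data : $i + $wsize;
        for my $k ($lo .. $hi) {
            next if $k == $i;                # skip the centre word itself
            my $col = $index{ $data->[$k] };
            $matrix[$row][$col]++ if defined $col;
        }
    }
    return \@matrix;
}

my @data  = qw{the quick brown fox jumps over the lazy dog};
my @terms = qw{the brown fox};
my $m = count_cooccurrences(\@data, \@terms, 2);
print join("\t", @$_), "\n" for @$m;
# rows/columns are the, brown, fox:
#   0 1 0
#   1 0 1
#   0 1 0
```

The result is symmetric, as expected: "the" and "brown" are two words apart, "brown" and "fox" are adjacent, and the second "the" has no terms within two words of it.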

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 