
creating co-occurrence matrix from raw text


lillyth

Programmer
Aug 11, 2008
Hi!

I need to create a co-occurrence matrix from a text file. So far I have a term extractor that, given the file (data.txt), returns a file with the relevant terms (term.txt). From these two I would now like to create a co-occurrence matrix using a window of size w. I am guessing that the algorithm will look something like

for every term t in data.txt
    if (t co-occurs with a term s from term.txt within w terms)
        count(t, s)++;

The output should be a text file with a line for every pair of terms i and j:
term i, term j, count(term i, term j).
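In Perl, I imagine it looking roughly like this (only an untested sketch of the idea; the file handling and names are my guesses):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $w = 5;   # window size, to become a parameter later

# read the extracted terms into a hash for fast membership tests
open my $tfh, '<', 'term.txt' or die "term.txt: $!";
my %is_term = map { chomp; $_ => 1 } <$tfh>;
close $tfh;

# read the raw text as one flat list of words, in document order
open my $dfh, '<', 'data.txt' or die "data.txt: $!";
my @words = map { split ' ' } <$dfh>;
close $dfh;

my %count;   # $count{$t}{$s} = co-occurrence count
for my $p (0 .. $#words) {
    next unless $is_term{ $words[$p] };
    for my $q ($p + 1 .. $p + $w) {
        last if $q > $#words;
        next unless $is_term{ $words[$q] };
        $count{ $words[$p] }{ $words[$q] }++;
    }
}

# print "term i, term j, count(term i, term j)" lines
for my $t (sort keys %count) {
    print "$t, $_, $count{$t}{$_}\n" for sort keys %{ $count{$t} };
}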
I'm guessing some kind of stemming is necessary?
Any ideas?

/lillyth
 
Sounds like school work. Quite hard school work admittedly, but school work nonetheless.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Well, you are a bit right... It is PhD work. The actual aim is to extract semantic information from the graphs built from the co-occurrence matrix, using mathematics. So, because this is not what I specialize in, I was hoping someone would help, even though it is not very far from school work.
 
Can you expand on what you mean by "co-occurrence" exactly? If you have only one text file, it doesn't mean what I usually take it to mean (I would expect two words to co-occur when they appear in the same document, but that's from an Information Retrieval context)! Do you mean that they occur in the document within w terms of one another?
 
I am looking for the semantic meaning of words, so occurring in the same document is too wide a notion. I need to say that two words co-occur if they appear within w words of each other, and I want to be able to set w as a parameter.
 
Just to check my understanding then, would this scenario work?[ol][li]Read the term.txt file into a hash, to map each term string onto a sequential integer starting with zero.[/li][li]Using your term parser, parse data.txt into an array.[/li][li]Iterate over the array and, using the hash to look up the indices of the found terms, increment the appropriate values in the matrix for any terms within the window.[/li][li]Use the hash to translate the matrix values into something more legible for printing.[/li][/ol]One question: assuming we have something like
Code:
term termA term term termB term term
When we hit termA and discover that termB is within w terms from termA, we increment the termA/termB point in the matrix. When we get to termB and increment the termB/termA point, would this be deemed to be double counting?
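For reference, steps 1 and 4 in code form, as an untested sketch (the variable names are purely illustrative):
Code:
my %index;      # term string -> sequential integer
my @term_of;    # integer -> term string, for printing later
my $n = 0;

open my $tfh, '<', 'term.txt' or die "term.txt: $!";
while (my $term = <$tfh>) {
    chomp $term;
    $index{$term}  = $n;
    $term_of[$n++] = $term;
}
close $tfh;

# steps 2-3 would fill this, e.g. $matrix[ $index{$s} ][ $index{$t} ]++
my @matrix;

# step 4: translate the matrix back into legible output
for my $i (0 .. $#matrix) {
    next unless $matrix[$i];
    for my $j (0 .. $#{ $matrix[$i] }) {
        print "$term_of[$i], $term_of[$j], $matrix[$i][$j]\n"
            if $matrix[$i][$j];
    }
}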

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
@steve

We should increment both termA/termB and termB/termA. This is because the matrix will later map to a graph where the edges have direction, and we want both directions to be valid. If termA co-occurs with termB, then it also holds the other way around.
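So at the point where a co-occurrence between termA (index i) and termB (index j) is found, I picture simply (using the @matrix from your sketch):
Code:
$matrix[$i][$j]++;   # edge termA -> termB
$matrix[$j][$i]++;   # edge termB -> termA as well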
 
Something like:

1) divide the document into an array of terms (in order), with whatever preprocessing you want already applied (stemming, stopword removal etc.)
2) then (pseudocode):

for each position 'a' in the array {
    for each position 'b' from 'a - w + 1' to 'a - 1' {
        increment the count for the terms at 'a' and 'b'
    }
}

(obviously if 'a - w + 1' is less than zero, you would begin at element zero instead)
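In Perl, those loops might look something like this (a sketch only, assuming the preprocessed terms are already in @terms and the window size is in $w):
Code:
my %count;
for my $a (0 .. $#terms) {
    my $start = $a - $w + 1;
    $start = 0 if $start < 0;            # clamp at the start of the array
    for my $b ($start .. $a - 1) {
        $count{ $terms[$a] }{ $terms[$b] }++;
    }
}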
 
@ishnid

It seems to me that you lose the distances between terms when you remove everything else and leave only the terms in the array. Terms that are within w positions of each other in the array would then be counted as co-occurring, even though they may not be within a distance of w from each other in the original text.

Am I making sense, and did I understand you correctly?
 
True. In that case you have two options:
1) Leave the stopwords in the array and check for them within the inner loop. Only do the increments if neither 'a' nor 'b' holds a stopword.
2) Replace stopwords in the array with 'undef'. It'll be marginally quicker to check for undefined elements than to check whether a term occurs in a stopword list. They'll still occupy space in the correct places in the array, so it won't skew the distances.
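For option 2, the replacement could be as simple as this (a sketch; @words and %stopword are assumed to exist already):
Code:
# blank out stopwords but keep their slots, so distances are preserved
for my $word (@words) {
    $word = undef if exists $stopword{$word};
}

# ... and then inside the counting loops:
# next unless defined $words[$a] && defined $words[$b];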

 
Instead of removing them, just replace them with a fnord.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
Thank you both for your valuable insights. If I am not mistaken, the algorithm proposed here will have a complexity of O(N*t*w), where N is the number of words in the data file, t is the number of terms and w is the window size. The algorithm that I chose, I believe, only needs O(N*t) to run. Thank you again!

Bests,
lillyth
 
Not N*t*w anyway. Is the number of words in the data file not equal to the number of terms? O(N*w), I would think.
 
@ishnid,

No, the number of words in the data file is larger than the number of terms. It is not that every word (except stop words) becomes a term. We are only interested in nouns and noun phrases, and hence those are the only words in our term list.

The algorithm proposed is:

for every word a in data.txt
    for every word j = a+1, ..., a+w
        check if j is a term
        if so, add to the count of (a, j)


The look-up should take as many operations as there are terms, right? It could be that the look-up is constant time in Perl, in which case I am wrong; otherwise I think this should yield O(N*t*w).
 
To be O(N*t*w) would require an additional loop - that would essentially be:
Code:
for every word in data.txt {
    for every term in term.txt {
        for every word within w of the current position {
            do some operation
        }
    }
}
But that second loop isn't in there. I believe it's O(N*w).
 
@ishnid

You are right, the second loop is not there, but I claim that the third loop, where you wrote (do some operation), will itself need t look-ups. That is, to check whether a word j within w of the current position is a term, you have to scan through the term list and see whether j is in it.

Here is where I am not certain, but normally, to see whether an object is in an array, one needs to step through the array, potentially the entire array. This yields O(t) per look-up. If this were Java, we would store the terms in a HashMap and the look-up would be constant time; then you are right, and the algorithm would be only O(N*w).

I claim anyway that you only need to scan through the words in the document once. If you keep some memory of where you saw terms in the document, you don't have to revisit words. An example:
when a = 1, we loop through j = 2, ..., w+1
when a = 2, we loop through j = 3, ..., w+2
which means we are revisiting words 3, ..., w+1, because we already saw these in the first loop. Hence, this is unnecessary.
Also, if we know from the first pass that the first term is at position k, where k > 2, we can let a skip positions 2, ..., k-1 and start at k instead.

Does this make sense to you?
 
lillyth said:
It could be that the look-up is constant time in Perl
It is, if we use a hash to hold the terms. We can see whether a word is a term with [tt]exists $hash{$word}[/tt], without having to iterate over an array of terms to find it.
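i.e. something like this (with @terms standing in for your term list):
Code:
my %is_term = map { $_ => 1 } @terms;    # built once: O(t)

# each check is then constant time, no scan over the term list
if (exists $is_term{$word}) {
    # ... count the co-occurrence here
}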

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
In that case, I claim the algorithm has complexity O(N).
 
You'd use a hash for the lookups, as you would in Java (though if I were using Java, I'd use a HashSet rather than a Map). Much faster than searching an array.

You're right, there are some efficiencies to be had with regard to identifying which words are 'terms' (I didn't understand the distinction initially), which would probably reduce it to O(t*w) (assuming 't' is the number of terms present in the document, rather than the size of the list of available terms).

 
Well, lillyth, perl does have hash lookup tables![mad] They are a native structure of the language, are called hashes and are distinguished from the other types of structures (scalars, references, subs, list arrays) by prefixing their name with a % sign.
If I understand your problem correctly, this seems to me the way to go:
-with your term extractor, create a hash table [tt]%term[/tt] containing the terms (this is equivalent to your term.txt file) as the keys; the value of each key will be an incremental number ranging from 0 to the total number of unique terms minus 1, or t-1 in your symbols (this will be used as an index, called i or j below, identifying each unique term);
-an auxiliary array [tt]@term_values[/tt] will contain all the unique terms, the index into the array being i as above: this allows retrieving a term from its index;
-a two-dimensional array [tt]@cooccur[/tt] will contain, for each pair of indexes i,j, the count as defined by you; note that it is not necessary to increment the count for both i,j and j,i, as the matrix will be symmetrical; however, you need to decide whether to count one or two when two equal terms occur within the window (the option 'one' is adopted below);
-a second hash table [tt]%terms_in_window[/tt] will contain all the terms occurring within the window; this is an evolving table (see below) where the key is the term and the value is the number of occurrences of that term in the window;
-another (!) auxiliary (linear) array [tt]@window[/tt], handled as a circular list, will contain the i indexes of the terms occurring within the window, in the same order as they appear in the text, with the positions occupied by words that are not terms containing -1.
Now the pseudocode using the above data structures would be as follows (m is the index in the circular list and w is the window size):
Code:
m=0;
window[k]=-1 for k=0..w-1;
foreach word a in data.txt
  if window[m]>=0
    terms_in_window{term_values[window[m]]}--;
    if terms_in_window{term_values[window[m]]}==0
      delete terms_in_window{term_values[window[m]]};
    endif
  endif
  if a is a term
    i=term{a};
    foreach key b in %terms_in_window
      j=term{b};
      if i<=j
        cooccur[i,j]+=terms_in_window{b};
      else
        cooccur[j,i]+=terms_in_window{b};
      endif
    endfor
    window[m]=term{a};
    terms_in_window{a}++;
  else
    window[m]=-1;
  endif
  m++;
  m=0 if m>=w;
endfor
This is of course not perl code! If you need it, come back after checking the above (though as you didn't know of the existence of hash tables in perl, I wonder what use you could make of it...)[smile]
However, note that:
-the [tt]delete[/tt] operation above erases a key from a hash table;
-executing something like [tt]terms_in_window{a}++;[/tt] (in perl) will create a hash element with a value of one if that element was nonexistent before, or increment its value by one if it existed already.
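For what it's worth, here is a rough Perl rendering of the pseudocode above, just to make the data structures concrete (an untested sketch; [tt]%term[/tt], [tt]@term_values[/tt], [tt]@words[/tt] and [tt]$w[/tt] are assumed to be set up as described above, and every slot of [tt]@window[/tt] starts at -1 so the first w iterations read clean values):
Code:
my $m = 0;
my @window = (-1) x $w;     # circular list of term indexes; -1 = not a term
my (%terms_in_window, @cooccur);

for my $word (@words) {     # data.txt split into words, in order
    # the word falling out of the window is expired first
    if ($window[$m] >= 0) {
        my $old = $term_values[ $window[$m] ];
        delete $terms_in_window{$old} unless --$terms_in_window{$old};
    }
    if (exists $term{$word}) {
        my $i = $term{$word};
        # count this term against every term still inside the window
        for my $b (keys %terms_in_window) {
            my $j = $term{$b};
            if ($i <= $j) { $cooccur[$i][$j] += $terms_in_window{$b} }
            else          { $cooccur[$j][$i] += $terms_in_window{$b} }
        }
        $window[$m] = $i;
        $terms_in_window{$word}++;
    }
    else {
        $window[$m] = -1;
    }
    $m = 0 if ++$m >= $w;
}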


Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 