10 % of the word frequency

kognitio · Apr 4, 2004

Hello again,
thanks a lot for the recommendation about the threads... I still have one problem. I am more or less a newbie at Perl, so I have no idea how to solve it: our app vectorizes words from text files. The problem is that we get too many words. Since the app already counts the words within a file, the idea is to keep only the most frequent 10 % of the words in each file.
But how do I do that?
The relevant part of the code is:

Code:

$textcnt = 0; #counter variable for the name of the output file
####################################
# absolute word list is created    #
####################################

foreach $datei (@dir){
    next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));  
    chomp $datei;

    open (F, "AusgabenTagger/$datei") || die ("cant open $datei");      #the directory must be specified here!
    while ($line = <F>) {
	
	&lemmabereinigen;
	$wortliste{$lemma} = 0;
		
    }    
}	
$xcnt = 0;   
foreach $wort(keys(%wortliste)){
	$absowortliste[$xcnt] = $wort;                                  #builds a word array from the word hash for comparison
	$xcnt++;
}
open (WORTLISTE,">Wortliste.txt");
chomp (@absowortliste);
@sortwortliste = sort(@absowortliste);
foreach $wort (@sortwortliste){                                     #writes the word list to Wortliste.txt
    print WORTLISTE "$wort\n";
}
###############################################
# word list is created for each text file     #
###############################################
mkdir ("Vergleichshashes");	
foreach $datei (@dir){
    next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));  
    chomp $datei;
    open (F, "AusgabenTagger/$datei") || die ("cant open $datei");      #the directory must be specified here!
    open (AUSGABE1,">>Ausgabevektoren");
    open (AUSGABE2,">Vergleichshashes/$datei");
    $lcnt = 0;
    while ($line = <F>) {
	
	&lemmabereinigen;
	$lemmaliste[$lcnt] = $lemma;
	$lcnt++;

    }
    $lcnt = 0;
    unshift @lemmaliste, @absowortliste; 

    @sortlemmaliste = sort (@lemmaliste);

    $vergleichslemma = $lemma;
    $wortanzahl = -1;
    $vcnt = 0;
    undef(%vergleichshash);
    $rcnt = 0;

    foreach $lemma (@sortlemmaliste){
	$vergleichshash{$lemma}++;
	if ($lemma eq $vergleichslemma){
	    $wortvektor[$vcnt] = ++$wortanzahl;
	    if ($wortanzahl == 1){
		if ($labelhash{$lemma}){
		    $relliste[$rcnt] = $lemma;
		    $rcnt++;
		}
	    }
	}
	else {		    
	    $vcnt++;
	    $wortanzahl = -1;
	    $wortvektor[$vcnt] = ++$wortanzahl;
	    $vergleichslemma = $lemma;
	}
    }

                         
    foreach $anzahl (@wortvektor){   

	print AUSGABE1 "$anzahl ";                 #writes the vector
    }
    foreach $rellemma(@relliste){   

	print AUSGABE1 "$rellemma ";               #writes the relevant words as labels
    }
    print AUSGABE1 "$datei\n";
    foreach $haeufigkeit(sort keys(%vergleichshash)){
	$vergleichshash{$haeufigkeit}-=1;
	print AUSGABE2 "$vergleichshash{$haeufigkeit}\;";   #writes the frequency list to the respective file
    }
   

    undef(@lemmaliste);
    undef(@sortlemmaliste);
    undef(%vergleichshash);
    undef(@wortvektor);
    undef(@relliste);

and the whole code is:

Code:

#! /usr/bin/perl

opendir (DR,"Dia-o-Kom") || die ("cant open Texte");
mkdir ("AusgabenTagger");
chdir ("Dia-o-Kom");
foreach (<*>){
    $textdatei = $_;
    next if (($textdatei eq ".") || ($textdatei eq "..") || ($textdatei eq ".directory"));
    $tmp=join('_', split(/ /,$textdatei)); 
    rename($textdatei, $tmp);
    $textdatei=$tmp;
    open (T,"$textdatei") || die ("cant open $textdatei");
    system ("../Tagger/cmd/tree-tagger-english $textdatei > ../AusgabenTagger/$textdatei");
	    print "$textdatei\n";
    
}
chdir ("..");
#opendir (DIR, $eingabe) || die ("cant open indirectory");
opendir (DIR,"AusgabenTagger") || die ("cant open AusgabenTagger");
while ($datei = readdir(DIR)){                                         #builds an array from the directory entries
    $datei = $datei;
    $dir[$cnt] = $datei;
    $cnt++;
}

chomp @dir;

open(R, "<Relevante2") || die ("Cannot open Relevante\n"); 
@labelliste = <R>;
chomp (@labelliste);
foreach $label (@labelliste) {
    $labelhash{$label} = 1;
}

open(S, "<stoppwort.txt") || die ("Cannot open stoppwort_txt\n");       #builds the stop word list from stoppwort.txt
@stoppwortliste = <S>;
chomp (@stoppwortliste);
foreach $stoppwort (@stoppwortliste) {
    $stoppworthash{$stoppwort} = 1;
}


$textcnt = 0; #counter variable for the name of the output file
####################################
# absolute word list is created    #
####################################

foreach $datei (@dir){
    next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));  
    chomp $datei;

    open (F, "AusgabenTagger/$datei") || die ("cant open $datei");      #the directory must be specified here!
    while ($line = <F>) {
	
	&lemmabereinigen;
	$wortliste{$lemma} = 0;
		
    }    
}	
$xcnt = 0;   
foreach $wort(keys(%wortliste)){
	$absowortliste[$xcnt] = $wort;                                  #builds a word array from the word hash for comparison
	$xcnt++;
}
open (WORTLISTE,">Wortliste.txt");
chomp (@absowortliste);
@sortwortliste = sort(@absowortliste);
foreach $wort (@sortwortliste){                                     #writes the word list to Wortliste.txt
    print WORTLISTE "$wort\n";
}
###############################################
# word list is created for each text file     #
###############################################
mkdir ("Vergleichshashes");	
foreach $datei (@dir){
    next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));  
    chomp $datei;
    open (F, "AusgabenTagger/$datei") || die ("cant open $datei");      #the directory must be specified here!
    open (AUSGABE1,">>Ausgabevektoren");
    open (AUSGABE2,">Vergleichshashes/$datei");
    $lcnt = 0;
    while ($line = <F>) {
	
	&lemmabereinigen;
	$lemmaliste[$lcnt] = $lemma;
	$lcnt++;

    }
    $lcnt = 0;
    unshift @lemmaliste, @absowortliste; 

    @sortlemmaliste = sort (@lemmaliste);

    $vergleichslemma = $lemma;
    $wortanzahl = -1;
    $vcnt = 0;
    undef(%vergleichshash);
    $rcnt = 0;

    foreach $lemma (@sortlemmaliste){
	$vergleichshash{$lemma}++;
	if ($lemma eq $vergleichslemma){
	    $wortvektor[$vcnt] = ++$wortanzahl;
	    if ($wortanzahl == 1){
		if ($labelhash{$lemma}){
		    $relliste[$rcnt] = $lemma;
		    $rcnt++;
		}
	    }
	}
	else {		    
	    $vcnt++;
	    $wortanzahl = -1;
	    $wortvektor[$vcnt] = ++$wortanzahl;
	    $vergleichslemma = $lemma;
	}
    }

                         
    foreach $anzahl (@wortvektor){   

	print AUSGABE1 "$anzahl ";                 #writes the vector
    }
    foreach $rellemma(@relliste){   

	print AUSGABE1 "$rellemma ";               #writes the relevant words as labels
    }
    print AUSGABE1 "$datei\n";
    foreach $haeufigkeit(sort keys(%vergleichshash)){
	$vergleichshash{$haeufigkeit}-=1;
	print AUSGABE2 "$vergleichshash{$haeufigkeit}\;";   #writes the frequency list to the respective file
    }
   

    undef(@lemmaliste);
    undef(@sortlemmaliste);
    undef(%vergleichshash);
    undef(@wortvektor);
    undef(@relliste);
    
}	

sub lemmabereinigen {

	if ($line =~ /(<.*>)/){
		$line =~ s/<.*>//g;		#deletes SGML tags
	}	
	if ($line =~ /\[.*\]/){				
	    $line =~ s/\[.*\]//g;				#deletes trl tags
	}		
	if ($line =~ /[^\d\w"\+üöäÜÖÄ?\s]/){
    		$line =~ s/[^\d\w"\+üöäÜÖÄ?\s]//g;			#deletes special characters
	}
	if ($line =~ /[\w"+ü?äöÜÄÖ\d-]*[\s]*[\w"+üüöä?ÜÖÄ\d-]*[\s]*(.*)/ ){     #matches only the lemma
		$lemma = $1;
		chomp ($lemma);
		next if (length("".$lemma."") == 1);                   #removes single letters
		next if (($lemma =~ /\d/) && ((length("".$lemma."") != 4)));#removes numbers except years
		  #  next if ($labelhash{$lemma} != 1);
		    next if ($stoppworthash{$lemma} == 1);
		}
    }

Thanks a lot for helping and best wishes,
Stephan

uida1154 · Apr 4, 2004

I can't quite get through your code (mainly because the variables are in German), but the question is clear.

What I would do is create a hash like this:

Code:

my %wordcount=();
# count each word; @wordlist holds all words of one file
foreach my $currentword (@wordlist)
{
    $wordcount{$currentword}++;
}

I am still thinking about a proper way to determine the top 10%, though...
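
For what it's worth, a self-contained version of that counting step might look like the sketch below. The sample text and the split on whitespace are only assumptions for illustration; in the real app the words would be the lemmas taken from the tagger output.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the original app.
my $text     = "the cat saw the dog and the dog saw the cat";
my @wordlist = split ' ', $text;

# Count how often each word occurs.
my %wordcount;
$wordcount{$_}++ for @wordlist;

# Print the counts, most frequent first (ties broken alphabetically).
for my $word (sort { $wordcount{$b} <=> $wordcount{$a} || $a cmp $b } keys %wordcount) {
    print "$word: $wordcount{$word}\n";
}
```

Once the counts are in a hash like this, picking the top 10 % is only a matter of sorting the keys by their counts.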

icrf · Apr 4, 2004

I'm afraid I'm in the same boat. With as much German as I've had in school, I should really do a lot better, but... Can you point out the parts/loops that actually count/report the words and maybe describe how it's doing it now? I'm going to take a guess, but I don't know. Here goes:

Code:

my @sorted_key_list = sort { $vergleichshash{$b} <=> $vergleichshash{$a} } keys(%vergleichshash);
# int(size/10) - 1 is the last index of the top 10 %
foreach $haeufigkeit (@sorted_key_list[0 .. int(@sorted_key_list / 10) - 1]) {
    print "$haeufigkeit\n";
}

That would give you the words in descending order of frequency (I think). You could then slice off the first ten percent of them (the first size/10 entries) and use those as keys.
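
Put together as something you can run on its own, the sort-and-slice idea could look roughly like this. It is only a sketch: the sub name `top_percent`, the sample counts, and the "keep at least one word" rule are my assumptions, not anything from the original script, where the counts would come from %vergleichshash.

```perl
use strict;
use warnings;

# Return the most frequent $percent percent of the keys of a count hash.
# Hypothetical helper; keeps at least one word if the hash is not empty.
sub top_percent {
    my ($counts, $percent) = @_;
    # Descending by count, ties broken alphabetically so the result is stable.
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
    return () unless @sorted;
    my $n = int(@sorted * $percent / 100);
    $n = 1 if $n < 1;
    return @sorted[0 .. $n - 1];
}

# Made-up counts for 20 distinct words.
my %count = (
    the   => 40, of    => 31, and   => 25, house => 12, tree => 9,
    car   => 8,  road  => 7,  river => 6,  bird  => 5,  stone => 5,
    cloud => 4,  leaf  => 4,  rain  => 3,  wind  => 3,  snow => 2,
    hill  => 2,  lake  => 1,  moss  => 1,  fern  => 1,  dust => 1,
);

my @top = top_percent(\%count, 10);
print "@top\n";    # prints "the of" (2 of the 20 words)
```

Note that words with equal counts at the cut-off are split arbitrarily unless you add a tie-breaker like the alphabetical one here.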

Viel Glueck.

________________________________________
Andrew - Perl Monkey

Coderifous · Apr 4, 2004

I would go w/ icrf's solution to determine the top 10% most used.

--jim

kognitio · Apr 14, 2004

Thanks a lot for helping, and sorry that I forgot to write the keywords in English... I was on a trip, which is why my answer is so late. I will try to implement your recommendations now.
As I said: thanks a lot
