Threads to Read Data

kognitio · Apr 3, 2004

Hello,
We are doing a project at our university, and one part of it is to vectorize data. The app works fine, but it is too slow, so I had the idea of using threads to speed up the processing. However, I do not know how to do that. The idea is that the app creates as many threads as there are files in the directory. Maybe someone has an idea how to do that. The code of the app is:

#!/usr/bin/perl -w

opendir (DR,"Dia-o-Kom") || die ("cant open Dia-o-Kom");
mkdir ("AusgabenTagger");
chdir ("Dia-o-Kom");
foreach (<*>){
$textdatei = $_;
next if (($textdatei eq ".") || ($textdatei eq "..") || ($textdatei eq ".directory"));
$tmp=join('_', split(/ /,$textdatei));
rename($textdatei, $tmp);
$textdatei=$tmp;
open (T,"$textdatei") || die ("cant open $textdatei");
system ("../Tagger/cmd/tree-tagger-english $textdatei > ../AusgabenTagger/$textdatei");
print "$textdatei\n";

}
chdir ("..");
#opendir (DIR, $eingabe) || die ("cant open indirectory");
opendir (DIR,"AusgabenTagger") || die ("cant open AusgabenTagger");
while ($datei = readdir(DIR)){ # builds an array from the directory entries
$datei = $datei;
$dir[$cnt] = $datei;
$cnt++;
}

chomp @dir;

open(R, "<Relevante2") || die ("Cannot open Relevante\n");
@labelliste = <R>;
chomp (@labelliste);
foreach $label (@labelliste) {
$labelhash{$label} = 1;
}

open(S, "<stoppwort.txt") || die ("Cannot open stoppwort_txt\n"); # builds the stop-word list from stoppwort.txt
@stoppwortliste = <S>;
chomp (@stoppwortliste);
foreach $stoppwort (@stoppwortliste) {
$stoppworthash{$stoppwort} = 1;
}

$textcnt = 0; # counter for the name of the output file
####################################
# build the global word list       #
####################################

foreach $datei (@dir){
next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));
chomp $datei;

open (F, "AusgabenTagger/$datei") || die ("cant open $datei"); # the directory must be given here!
while ($line = <F>) {

&lemmabereinigen;
$wortliste{$lemma} = 0;

}
}
$xcnt = 0;
foreach $wort(keys(%wortliste)){
$absowortliste[$xcnt] = $wort; # builds a word array from the word hash for comparison
$xcnt++;
}
open (WORTLISTE,">Wortliste.txt");
chomp (@absowortliste);
@sortwortliste = sort(@absowortliste);
foreach $wort (@sortwortliste){ # writes the word list to Wortliste.txt
print WORTLISTE "$wort\n";
}
###############################################
# build the word list for each text file      #
###############################################
mkdir ("Vergleichshashes");
foreach $datei (@dir){
next if (($datei eq ".") || ($datei eq "..") || ($datei eq ".directory"));
chomp $datei;
open (F, "AusgabenTagger/$datei") || die ("cant open $datei"); # the directory must be given here!
open (AUSGABE1,">>Ausgabevektoren");
open (AUSGABE2,">Vergleichshashes/$datei");
$lcnt = 0;
while ($line = <F>) {

&lemmabereinigen;
$lemmaliste[$lcnt] = $lemma;
$lcnt++;

}
$lcnt = 0;
unshift @lemmaliste, @absowortliste;

@sortlemmaliste = sort (@lemmaliste);

$vergleichslemma = $lemma;
$wortanzahl = -1;
$vcnt = 0;
undef(%vergleichshash);
$rcnt = 0;

foreach $lemma (@sortlemmaliste){
$vergleichshash{$lemma}++;
if ($lemma eq $vergleichslemma){
$wortvektor[$vcnt] = ++$wortanzahl;
if ($wortanzahl == 1){
if ($labelhash{$lemma}){
$relliste[$rcnt] = $lemma;
$rcnt++;
}
}
}
else {
$vcnt++;
$wortanzahl = -1;
$wortvektor[$vcnt] = ++$wortanzahl;
$vergleichslemma = $lemma;
}
}

foreach $anzahl (@wortvektor){

print AUSGABE1 "$anzahl "; # writes the vector
}
foreach $rellemma(@relliste){

print AUSGABE1 "$rellemma "; # writes the relevant words as labels
}
print AUSGABE1 "$datei\n";
foreach $haeufigkeit(sort keys(%vergleichshash)){
$vergleichshash{$haeufigkeit}-=1;
print AUSGABE2 "$vergleichshash{$haeufigkeit}\;"; # writes the frequency list to the file for this text
}

undef(@lemmaliste);
undef(@sortlemmaliste);
undef(%vergleichshash);
undef(@wortvektor);
undef(@relliste);

}

sub lemmabereinigen {

if ($line =~ /(<.*>)/){
$line =~ s/<.*>//g; # removes SGML tags
}
if ($line =~ /\[.*\]/){
$line =~ s/\[.*\]//g; # removes trl tags
}
if ($line =~ /[^\d\w"\+üöäÜÖÄ?\s]/){
$line =~ s/[^\d\w"\+üöäÜÖÄ?\s]//g; # removes special characters
}
if ($line =~ /[\w"+ü?äöÜÄÖ\d-]*[\s]*[\w"+üöä?ÜÖÄ\d-]*[\s]*(.*)/ ){ # captures only the lemma
$lemma = $1;
chomp ($lemma);
next if (length("".$lemma."") == 1); # drops single letters
next if (($lemma =~ /\d/) && ((length("".$lemma."") != 4))); # drops numbers except four-character ones (years)
# next if ($labelhash{$lemma} != 1);
next if ($stoppworthash{$lemma} == 1);
}
}

Thanks a lot for helping...
Stephan

kognitio · Apr 3, 2004

I forgot to mention that I use Perl 5.8.1.

icrf · Apr 3, 2004

The reality is really as simple as the idea (for the most part). Put all the processing for a file into a subroutine, then where you loop through the files, just create a new thread that calls that sub for the current file. After all have been spawned, you'd need to create a little loop to wait for them all to return, something like this:

Code:

use threads;

# Wait for every thread except the main one (tid 0) and the current one.
foreach my $thr (threads->list)
{
        if ($thr->tid && !threads::equal($thr, threads->self))
        {
                $thr->join;
        }
}
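
To make the "one sub, one thread per file" idea concrete, here is a minimal sketch. The sub name process_file, the file list, and the placeholder return value are all made up for illustration; the real per-file tagging and vectorizing work would go inside the sub. It assumes a perl built with ithreads support.

```perl
use strict;
use warnings;
use threads;

# Hypothetical per-file worker: stands in for whatever is done for one
# file in the original script (tagging, counting, writing output).
sub process_file {
    my ($file) = @_;
    # ... real work for $file would go here ...
    return length $file;    # placeholder result so there is something to collect
}

# Placeholder list; the real script would use the directory entries instead.
my @files = ('a.txt', 'bb.txt', 'ccc.txt');

# Spawn one thread per file. push gives list context, so join returns a list.
my @workers;
foreach my $file (@files) {
    push @workers, threads->create(\&process_file, $file);
}

# Wait for each thread in turn and collect its return value.
my @results;
foreach my $thr (@workers) {
    my ($result) = $thr->join;
    push @results, $result;
}

print "$files[$_]: $results[$_]\n" for 0 .. $#files;
```

Keeping the thread objects in your own array means you can join exactly the threads you started and match each result to its file.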

The only real issue comes with having the child threads report back and share information. It's pretty easily done with threads::shared or one of the specific Thread:: data types, so no real worry, just something to add in. Take a look through perlthrtut for the newer ithreads details.
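
As a rough sketch of the sharing part with threads::shared: every thread updates one shared hash, taking a lock around each update. The sub name count_words and the sample texts are invented for the example; they are not from the script above.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Shared hash: all threads see the same data. Declare it before spawning.
my %wordcount :shared;

# Hypothetical worker: counts the words of one text into the shared hash.
sub count_words {
    my ($text) = @_;
    for my $word (split ' ', $text) {
        lock(%wordcount);      # held until the end of this loop iteration
        $wordcount{$word}++;
    }
    return;
}

# Placeholder data standing in for the contents of two files.
my @texts = ('der hund bellt', 'der hund schlaeft');

my @workers = map { threads->create(\&count_words, $_) } @texts;
$_->join for @workers;

print "$_ => $wordcount{$_}\n" for sort keys %wordcount;
```

Without the lock, two threads could read and write the same counter at the same time and lose an increment.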

Multi-threading doesn't always make things faster. Some operating systems don't support threads, either (if so, try fork()ing processes instead). If you have a multi-CPU machine and your job is CPU-limited, then sure, it'll make a difference. It'll probably help if the bottleneck changes over the course of execution, too. You might also want to limit the number of concurrent operations if you have many to do at once. Rarely is more than ten at once a good idea. Thread::Queue is a good tool for keeping their count down.
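
One possible way to cap the thread count with Thread::Queue, sketched under a few assumptions: the limit of 4 workers, the file names, and the sub name worker are arbitrary choices for the example. All file names are queued before any worker starts, so each worker can simply stop when a non-blocking dequeue finds the queue empty.

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $max_workers = 4;                          # assumed cap, tune for your machine
my @files = map { "file$_.txt" } 1 .. 10;     # placeholder file names

# Queue all the work up front, before any worker thread exists.
my $queue = Thread::Queue->new;
$queue->enqueue(@files);

my $done :shared = 0;    # shared counter of processed files

# Hypothetical worker: pulls file names until the queue is empty.
sub worker {
    # dequeue_nb returns undef when nothing is left, which ends the loop
    while (defined(my $file = $queue->dequeue_nb)) {
        # ... per-file processing for $file would go here ...
        lock($done);
        $done++;
    }
    return;
}

# Only $max_workers threads ever run, however many files there are.
my @pool = map { threads->create(\&worker) } 1 .. $max_workers;
$_->join for @pool;

print "processed $done of ", scalar(@files), " files\n";
```

If files were added while the workers are already running, a blocking dequeue plus some end marker would be needed instead of dequeue_nb.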

________________________________________
Andrew - Perl Monkey
