
Script optimization


BigBadDave

Programmer
May 31, 2001
1,069
EU
I need help in optimising the following script:

Code:
use strict;
use 5;
use CGI qw(:standard);
use POSIX qw(strftime mktime);
use DBI;

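# Convert the dd/mm/yyyy CGI date parameters into MySQL's yyyy-mm-dd format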
my $from = main::param ("from");
my $to   = main::param ("to");
my @from = split (/\//, $from);
$from = $from[2] . "-" . $from[1] . "-" . $from[0];
my @to = split (/\//, $to);
$to = $to[2] . "-" . $to[1] . "-" . $to[0];

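# Build an optional `virtual host` filter for the WHERE clause; choosing 'all' disables it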
my $vh = "";
my @vh = main::param ("vh");
foreach my $tmp (@vh) {
    $vh .= "`virtual host` = '$tmp' AND";
}
$vh = "" if ($vh =~ m/all/i);

my $data   = "";
my @dates  = (0);
my @urls   = (0);
my @times  = (0);
my @entry  = (0);
my @exit   = (0);
my @comps  = (0);
my @compsd = (0);

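# Copy the matching access_log rows into a temporary table, then read them back into parallel arrays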
my $dbh = DBI->connect ("DBI:mysql:host=******;db=webstats;", "***", "", {PrintError => 0, RaiseError => 0});
$dbh->do ("CREATE TEMPORARY TABLE `tmp` SELECT `date`, `host name`, `url`, `time` FROM `access_log` WHERE $vh`date` >= '$from' AND `date` <= '$to' ORDER BY `date`, `host name`, `time`");
my $sth = $dbh->prepare ("SELECT DATE_FORMAT(`date`, '%y-%c-%e'), `url`, `time` FROM `tmp` ORDER BY `date`, `host name`, `time`");
$sth->execute ();
while (my @val = $sth->fetchrow_array ()) {
    $dates[$#dates+1] = $val[0];
    $urls[$#urls+1]   = $val[1];
    $times[$#times+1] = $val[2];
}
$sth->finish ();
$dbh->disconnect ();

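# Walk the hits in order; when the gap since the previous hit looks big enough (roughly over 20 minutes),
# record this URL as an entry and the previous URL as an exit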
my $i = 0;
foreach my $tmp (@urls) {
    my @str  = split (/:/, $times[$i]);
    my @str4 = split (/-/, $dates[$i]);
    my @str2 = split (/:/, $comps[$#comps]);
    my @str3 = split (/-/, $compsd[$#compsd]);

    my $tmpm = (localtime (mktime ($str[2],$str[1],$str[0],$str4[2],$str4[1],88) - mktime ($str2[2],$str2[1],$str2[0],$str3[2],$str3[1],88)))[1];
    my $tmph = (localtime (mktime ($str[2],$str[1],$str[0],$str4[2],$str4[1],88) - mktime ($str2[2],$str2[1],$str2[0],$str3[2],$str3[1],88)))[2];
    my $tmpd = (localtime (mktime ($str[2],$str[1],$str[0],$str4[2],$str4[1],88) - mktime ($str2[2],$str2[1],$str2[0],$str3[2],$str3[1],88)))[3];
    my $tmpt = (localtime (mktime ($str[2],$str[1],$str[0],$str4[2],$str4[1],88) - mktime ($str2[2],$str2[1],$str2[0],$str3[2],$str3[1],88)))[4];
    if ($tmpt > 1 || $tmpd > 1 || $tmph > 1 || $tmpm > 20) {
        $entry[$#entry+1] = $tmp;
        $exit[$#exit+1]   = $urls[$i-1] if ($urls[$i-1]);
    }
    $comps[$#comps+1]   = $times[$i];
    $compsd[$#compsd+1] = $dates[$i];
    $i++;
}

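# Load the entry/exit URLs into temporary tables so MySQL can produce the top-ten counts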
my $dbh = DBI->connect ("DBI:mysql:host=******;db=webstats;", "***", "", {PrintError => 0, RaiseError => 0});
$dbh->do ("CREATE TEMPORARY TABLE `entry` (`url` VARCHAR(255))");
$dbh->do ("CREATE TEMPORARY TABLE `exit` (`url` VARCHAR(255))");
$i = 0;
foreach my $tmp (@entry) {
    next if (!$tmp);
    $dbh->do ("INSERT INTO `entry` VALUES ('$tmp')");
    $dbh->do ("INSERT INTO `exit` VALUES ('" . $urls[$i-1] . "')");
    $i++;
}

print"Top 10 Entry URLS <table><tr><td>URL</td><td>Number</td></tr>";
my $sth = $dbh->prepare ("SELECT `url`, COUNT(`url`) AS `num` FROM `entry` GROUP BY `url` ORDER BY `num` DESC LIMIT 0, 10");
$sth->execute ();
while (my @val = $sth->fetchrow_array ()) {
    print "<tr><td>$val[0]</td><td>$val[1]</td></tr>\n";
}
$sth->finish ();
print "</table><br />\n\n";

print"Top 10 Exit URLS <table><tr><td>URL</td><td>Number</td></tr>";
my $sth = $dbh->prepare ("SELECT `url`, COUNT(`url`) AS `num` FROM `exit` GROUP BY `url` ORDER BY `num` DESC LIMIT 0, 10");
$sth->execute ();
while (my @val = $sth->fetchrow_array ()) {
    print "<tr><td>$val[0]</td><td>$val[1]</td></tr>\n";
}
$sth->finish ();
$dbh->disconnect ();
print "</table>";

There must be better ways of looping over the data and matching the criteria.

Regards
David Byng

davidbyng@hotmail.com
 
Well, my only comment is that I would just organize the data into a hash reference and ditch the temporary tables and the subsequent GROUP BY lookups. That feels like an extra step that isn't really required, and you can't predict execution time, since you are not in total control of how many DB accesses are required. I see that it's probably easier, but a well-defined hash structure can do this just as easily and a lot faster.

 
Can you please indicate what should be optimized? Performance, readability, portability, ...
 
Sure. Basically you're invoking a database backend to store counts, and there is no need to do that.

I would just represent the data in a hashref and keep my counts there. Then you can directly access the hashref for output without having to use the database.

In the solution above you can never predict how many database hits you will make. This means your performance is entirely unpredictable. With a hashref it's still unpredictable, but in-memory manipulation is an order of magnitude faster than database manipulation.

So basically I would parse the logfile and do something like this (warning: pseudo-code ahead, please interpret for your own needs; @visits stands for whatever per-visit records you end up building)

Code:
foreach my $visit (@visits) {
    $hash_ref->{entry}{ $visit->{entry_url} }++;   # count this visit's entry URL
    $hash_ref->{exit}{ $visit->{exit_url} }++;     # count this visit's exit URL
}
That gives you a hash with two keys, entry and exit. Each of those holds 'n' keys, one per unique URL, and each URL key's value is the number of hits.
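For illustration, with made-up URLs and counts, the structure looks something like this:

Code:
# what $hash_ref might hold after processing a few visits -- values are hit counts
$hash_ref = {
    entry => { '/index.html' => 42, '/products.html'  => 17 },
    exit  => { '/index.html' => 12, '/checkout.html'  => 30 },
};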

Now you can do all sorts of sorting and manipulation on this raw, in-memory data structure and never touch the database.
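For example, the top-10 entry report could come straight out of the hash instead of the temporary `entry` table. This is just a sketch against the $hash_ref structure above, printing the same HTML rows as your current loop:

Code:
# sort entry URLs by hit count, highest first, and keep the first ten
my @top = ( sort { $hash_ref->{entry}{$b} <=> $hash_ref->{entry}{$a} }
            keys %{ $hash_ref->{entry} } )[0 .. 9];
foreach my $url (@top) {
    next unless defined $url;    # fewer than ten unique entry URLs
    print "<tr><td>$url</td><td>$hash_ref->{entry}{$url}</td></tr>\n";
}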

You'll easily see a 10x improvement in performance, I think, as long as you have enough memory to pull the logfile into RAM... If you don't, then the database approach may be better; you didn't really comment on your dataset size.

 
I just need to speed up the processing: the test set has only 20,000 records and already takes over 20 seconds to process, and the live implementation will have over 8 million records.

Regards
David Byng

davidbyng@hotmail.com
 
8 million records, but what is the unique distribution of those records?

See, in your implementation the execution time is directly proportional to the size of the dataset, since EVERY record hits the database.

In my proposed solution the speed is far greater and depends on how many UNIQUE pages there are since you are only incrementing a counter for each page!

So, if you have 500 pages my solution will probably run incredibly fast and eat nearly zero memory.

If you have 5,000,000 records it will still run incredibly fast but will eat more memory.

Does this make sense?

Your code is slow because your database is the chokepoint. Try my way, it should only take 5 minutes and you'll see great execution times.
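Putting that together with your existing query, a rough sketch of what I mean might look like this (the %count, $prev_url, and $prev_epoch names are just placeholders, the 20-minute gap stands in for your entry/exit test, $dbh is the handle from your earlier connect, and Time::Local from the core distribution does the timestamp math; adjust the year handling and per-host grouping for your own data):

Code:
use Time::Local qw(timelocal);

# same SELECT as in your script, but tally counts as the rows stream past
my $sth = $dbh->prepare ("SELECT DATE_FORMAT(`date`, '%y-%c-%e'), `url`, `time` FROM `tmp` ORDER BY `date`, `host name`, `time`");
$sth->execute ();

my (%count, $prev_url, $prev_epoch);
while (my ($date, $url, $time) = $sth->fetchrow_array ()) {
    my ($yy, $mon, $day) = split /-/, $date;
    my ($hh, $min, $sec) = split /:/, $time;
    my $epoch = timelocal ($sec, $min, $hh, $day, $mon - 1, $yy + 2000);

    # a gap of more than 20 minutes since the previous hit starts a new visit
    if (!defined $prev_epoch || $epoch - $prev_epoch > 20 * 60) {
        $count{entry}{$url}++;
        $count{exit}{$prev_url}++ if defined $prev_url;
    }
    ($prev_url, $prev_epoch) = ($url, $epoch);
}
$sth->finish ();

Then the top-ten output comes from sorting %count in memory, as in the earlier sketch, with no temporary tables at all.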

 