Matching

carg1 · Sep 4, 2003

Hello everyone. I have this code, originally courtesy of FireMyst, and I modified it. Originally it counted instances of user IP addresses from the records in log files. Now I want to get the addresses the users visited. I've got it pulling out the addresses and counting them, but I don't want the entire address. With the entire address it counts every different page visited on the same site. What I want is just everything from http:// up to the first slash after it. For example, if the address was

http://us.a1.yimg.yahoo.com/TestImage/Folder3/babybottle.jpg,

I want it to be able to match the "http://us.a1.yimg.yahoo.com" portion so it can count that instead. That way it counts the occurrences of sites instead of individual pages on the sites. If I could match it that way, correct me if I'm wrong, but it would also print that way, right? Here's the code:

Code:

my %IPs;   #Added in this HASH
my $User;
my $Dd;
my $Tt;
my $Ap;
my $Dest;
my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst);
#my $Choices;
#my $PickNumber;

print "Which file: ";
$TheDB = <STDIN>;
chomp($TheDB);

# Open the database file but quit if it doesn't exist
open(INDB, $TheDB) or die "The database $TheDB could " .
  "not be found.\n";

  while(<INDB>) {
    $TheRec = $_;
    chomp($TheRec);
    ($User, $Dd, $Tt, $Ap, $Dest) = split(/\s+/, $TheRec, 5);
    $SuccessCount++;
    if (exists $IPs{$Dest}) { #If the key exists, add 1
      $IPs{$Dest}++;
    } #End of if
    else { #Otherwise, basically initialize the count to 1
      $IPs{$Dest} = 1;
    } #End of else
  } # End of while(<INDB>)
  
  if($SuccessCount == 0) { print "No records found.\n" }
  else { 
  print "$SuccessCount records found.\n" 
  }

  
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
$RMonth = $mon + 1; #Adds one to month to obtain the correct month
$SYear = ($year % 100); #Runs modulo arithmetic on the year to get the correct short year
@DateMod = ($RMonth, $mday, $SYear, $hour, $min);

#Stores the value of Ops, converted into a readable format, into the variable $Stuff
$Filename = sprintf("%02d%02d%02d%02d%02d", @DateMod);


open(OUTDB, ">>d:\\hits\\countsite${Filename}.txt") or die "Can't do it.";
$key =~ s/http:\/\/*.*\///ig; 

    foreach $key(sort keys %IPs)  #Print out each key
    {print ("IP Address: " .$key. "\t\t\tNumber of instances: " . $IPs{$key} . "\n");
     print OUTDB ("IP Address: " .$key. "\t\t\tNumber of instances: " . $IPs{$key} . "\n");}

     print "Program finished.\n";

Any ideas folks?

raklet · Sep 4, 2003

The way you would match urls is like this:

=~ m#(http://.+)/#;
$url = $1;

carg1 · Sep 4, 2003

I'm afraid it didn't work, raklet.

raklet · Sep 4, 2003

Where in your code are you trying to match URLs? The only regex I see in the entire script you provided is a substitution that replaces http://something with nothing.

$key =~ s/http:\/\/*.*\///ig;

carg1 · Sep 4, 2003

Sorry for the linguistic confusion, I really meant substituting. Oh and that $key =~ s/http:\/\/*.*\///ig; line was an early, unsuccessful attempt that I forgot to delete. Also, I just tried your first method again, I was implementing it wrong before. It sort of worked. On some of the URLs it did cut off everything after the first slash, but on some it didn't.

raklet · Sep 4, 2003

Would you mind posting your latest piece of relevant code along with an explanation of what you are trying to substitute? Sorry, I don't quite follow what you are up to yet.

carg1 · Sep 4, 2003

I'm sorry, I'm pretty much new at Perl, I've only been at it for about 2 weeks.

Code:

my %IPs;   #Added in this HASH
my $User;
my $Dd;
my $Tt;
my $Ap;
my $Dest;
my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst);
#my $Choices;
#my $PickNumber;

print "Which file: ";
$TheDB = <STDIN>;
chomp($TheDB);

# Open the database file but quit if it doesn't exist
open(INDB, $TheDB) or die "The database $TheDB could " .
  "not be found.\n";

  while(<INDB>) {
    $TheRec = $_;
    chomp($TheRec);
    $TheRec =~ m/(http:\/\/*.*)\/{1,1}.+/g;
    $Addy = $1;
    ($User, $Dd, $Tt, $Ap, $Dest) = split(/\s+/, $TheRec, 5);
    $SuccessCount++;

    if (exists $IPs{$Dest}) { #If the key exists, add 1
      $IPs{$Addy}++;
    } #End of if
    else { #Otherwise, basically initialize the count to 1
      $IPs{$Addy} = 1;
    } #End of else
  } # End of while(<INDB>)
  
  if($SuccessCount == 0) { print "No records found.\n" }
  else { 
  print "$SuccessCount records found.\n" 
  }


($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime(time);
$RMonth = $mon + 1; #Adds one to month to obtain the correct month
$SYear = ($year % 100); #Runs modulo arithmetic on the year to get the correct short year
@DateMod = ($RMonth, $mday, $SYear, $hour, $min);

#Stores the value of Ops, converted into a readable format, into the variable $Stuff
$Filename = sprintf("%02d%02d%02d%02d%02d", @DateMod);


open(OUTDB, ">>d:\\hits\\countsite${Filename}.txt") or die "Can't do it.";


    foreach $key(sort keys %IPs)  #Print out each key
    {print ("Website: " .$key. "\t\t\tNumber of instances: " . $IPs{$key} . "\n");
     print OUTDB ("Website: " .$key. "\t\t\tNumber of instances: " . $IPs{$key} . "\n");}

     print "Program finished.\n";

So that's my latest code. My code takes a line from a log file:

19.5.18.2 08/19/03 0009:01:22 PASSED http://a772.g.akamai.net/7/772/51/9cb36b97c77474/www.apple.com/t/2002/us/en/i/1.1h.gif

It then reads that line, prints out only the URLs and ignores the rest. From there it counts every time that URL appeared in the log, so if the above address appeared 50 times, it would output:

Website: http://a772.g.akamai.net/7/772/51/9cb36b97c77474/www.apple.com/t/2002/us/en/i/1.1h.gif    Number of instances: 50

Now the problem comes when people visit several different pages on the same site. The person may have only been on that one site, but gone to 20 different pages. It counts every page as 1 instance. In order to fix that, what I am trying to do is take a URL, like the abovementioned:

http://a772.g.akamai.net/7/772/51/9cb36b97c77474/www.apple.com/t/2002/us/en/i/1.1h.gif

and remove everything after the third slash, so the final output would be:

http://a772.g.akamai.net

That way, it would keep its count by each site instead of by each page. I originally figured the easiest way to do it would be with a substitution. In pseudocode: read the URL and, when you reach the third slash, replace that slash and everything after it with nothing. Basically a quick and dirty delete. I don't know if that's the easiest way. I already see that matching would've been far easier, but somehow substituting was my first inclination. I hope this makes the goal of my script clearer. And thank you for your help.
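For reference, the substitution described in that pseudocode could be sketched roughly like this. It assumes the URL has already been isolated in a scalar; $url and $site are illustrative names, not variables from the script above.

```perl
use strict;
use warnings;

# Sample URL taken from the log line shown above.
my $url = 'http://a772.g.akamai.net/7/772/51/9cb36b97c77474/www.apple.com/t/2002/us/en/i/1.1h.gif';

# Copy the URL, then keep "http://" plus the host and drop the third
# slash and everything after it. [^/]+ cannot run past a slash.
(my $site = $url) =~ s{^(http://[^/]+)/.*}{$1};

print "$site\n";   # prints http://a772.g.akamai.net
```

The copy-then-substitute form leaves the original URL untouched in case it is still needed elsewhere.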

raklet · Sep 4, 2003

Okay, I am guessing that this is the relevant section of code you are trying to puzzle out.

while(<INDB>) {
$TheRec = $_;
chomp($TheRec);
$TheRec =~ m/(http:\/\/*.*)\/{1,1}.+/g;
$Addy = $1;
($User, $Dd, $Tt, $Ap, $Dest) = split(/\s+/, $TheRec, 5);
$SuccessCount++;

if (exists $IPs{$Dest}) { #If the key exists, add 1
$IPs{$Addy}++;

That being the case, there are several things you want to change.

First, you are using the pattern match operators incorrectly.

*.* says match zero or more of the preceding character, followed by zero or more of any character. So \/*.* matches any number of "/" (including none), followed by any number of any character. Because .* is greedy, it grabs as much of the line as it can. You then add a bunch of stuff that isn't necessary: \/{1,1} is the same as a plain \/. Also, you don't need to specify "g" at the end. The global modifier is for finding or replacing every occurrence, and here you only want one match per line.
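To see the greedy behaviour concretely, here is a small sketch that runs the pattern from the loop above against the sample URL posted earlier. The variable names are made up for illustration, and the non-greedy .+? version is only one possible way to stop at the first slash after the host.

```perl
use strict;
use warnings;

my $url = 'http://a772.g.akamai.net/7/772/51/9cb36b97c77474/www.apple.com/t/2002/us/en/i/1.1h.gif';

# Pattern from the posted loop: the greedy .* runs as far as it can while
# still leaving one "/" and at least one character after it.
my ($greedy) = $url =~ m/(http:\/\/*.*)\/{1,1}.+/;

# Non-greedy .+? takes as little as possible, so it stops at the first
# "/" that follows "http://".
my ($lazy) = $url =~ m/(http:\/\/.+?)\//;

print "greedy: $greedy\n";   # everything up to the last slash
print "lazy:   $lazy\n";     # http://a772.g.akamai.net
```

This would also explain why the earlier attempt seemed to work on some URLs: when a URL has only one slash after the host, the greedy and non-greedy captures are the same.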

The next problem I see is with

if (exists $IPs{$Dest})

This will never evaluate to true because you have not defined $IPs{$Dest} anywhere previously. It doesn't exist. And because it does not exist, you will never be able to build a hash out of $IPs{$Addy}. I am assuming that $Dest is the url you are looking for. Just evaluate to see if $Dest exists instead of $IPs{$Dest}. Here is how I would recommend you set up this portion of the code:

while(<INDB>) {
$TheRec = $_;
chomp($TheRec);
$TheRec =~ m/(http:\/\/.+)\//;
$Addy = $1;
($User, $Dd, $Tt, $Ap, $Dest) = split(/\s+/, $TheRec, 5);
$SuccessCount++;

if (exists $Dest) { #If the key exists, add 1
$IPs{$Addy}++;

Hope that helps.

duncdude · Sep 5, 2003

$_ = "http://us.a1.yimg.yahoo.com/TestImage/Folder3/babybottle.jpg";
m/(http:\/\/[^\/]+)/;
print "the http part is: $1\n";

carg1 · Sep 5, 2003

Actually, if I tell it to use if (exists $Dest) it tells me that it's not part of a hash or array. duncdude's matching expression works exactly as I need it to. The thing is, when it counts, I infer that it's telling me every individual instance of an address instead of how many times it was visited. As in it says every site appeared once, which I know for 110% fact is not the case. Thanks for your help guys, I appreciate it.
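One likely cause, assuming the loop is still shaped like the code posted earlier in the thread: the test is exists $IPs{$Dest}, but the key being stored is $Addy. $Dest is the full URL and is never used as a key, so the else branch runs on almost every record and resets the count to 1. Below is a minimal sketch of a counting step that increments the same key it stores. The sample records and the names @lines and %sites are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical records in the log format shown earlier in the thread.
my @lines = (
    '19.5.18.2 08/19/03 0009:01:22 PASSED http://a772.g.akamai.net/7/772/51/a.gif',
    '19.5.18.2 08/19/03 0009:01:23 PASSED http://a772.g.akamai.net/7/772/51/b.gif',
    '19.5.18.3 08/19/03 0009:01:24 PASSED http://us.a1.yimg.yahoo.com/TestImage/Folder3/babybottle.jpg',
);

my %sites;
for my $rec (@lines) {
    # Capture "http://" plus everything up to, but not including, the next "/".
    next unless $rec =~ m{(http://[^/]+)};

    # ++ on a key that is not there yet starts it at 1,
    # so no separate exists test is needed.
    $sites{$1}++;
}

for my $site (sort keys %sites) {
    print "Website: $site\tNumber of instances: $sites{$site}\n";
}
```

With these three sample records, the akamai host is counted twice and the yahoo host once.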
