Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Looking for suggestions on logging sys messages.

Status
Not open for further replies.

whn

Programmer
Oct 14, 2007
265
0
0
US
I inherited a Perl program that makes sure a remote host is not ping-able within 60 seconds. A piece of simplified source codes is list below:
Code:
my $pingFailures = 0;
my $maxWaitTime = 60; # In reality, this is passed in
while ($pingFailures < 4) { # to make sure ping() failed 4 times
    # The return value of ping() is either 1 (failed) or 0 (successful)
    if($self->ping()) { 
        ++$pingFailures;
        $self->logMsg("$pingFailures Ping Failure[s] Detected\n");
    }
    else {
        $pingFailures = 0;
    }
    my $now = time();
    my $delta = $now - $startTime; # $startTime is passed in
    if($delta >= $maxWaitTime) {
        $self->logMsg("Error: Deadline Expired: still pinging after $delta seconds.\n"); 
        die;
    }
    sleep 1; # $sleep is passed in
}
In one of our nightly run logs, we see something like this:
Code:
Oct  1 02:55:48 Max waiting time for no ping: 60 seconds
Oct  1 02:55:51 1 Ping Failure[s] Detected
Oct  1 02:55:55 2 Ping Failure[s] Detected[COLOR=red]
Oct  1 02:55:59 3 Ping Failure[s] Detected
Oct  1 02:58:48 Error: Deadline Expired : still pinging after 180 seconds.[/color]
And we don't understand why there is such a huge time gap in between those two red lines.

The actual ping cmd used in ping() is:
Code:
ping -c 1 -W 3 $ip

The ping command will wait 3 seconds at the most plus 1 second sleeping time. That's why we see the gap of ping failures is 4 seconds, which makes perfect sense.

To some extent, I managed to duplicate the problem with following log info:
Code:
Nov  1 02:55:48 Max waiting time for no ping: 60 seconds
Nov  1 02:55:51 1 Ping Failure[s] Detected
Nov  1 02:55:55 2 Ping Failure[s] Detected
Nov  1 02:55:59 3 Ping Failure[s] Detected
Nov  1 02:56:50 Error: Deadline Expired : still pinging after 62 seconds
The above message also makes sense -- the process died as soon as $delta is larger than 60.

So the question remains: why didn't the process die sooner on Oct 1st? How could this process wait for so long to die?

My guess is that the host was too busy. But how can I prove that? How can I write a program that is smart enough to log every piece of info whenever this kind of error happens again?

I know I may get some kind of info from sys logs under /var. However, this kind of error happens about once a month. The logs under /var are not preserved well.

I hope I have made myself clear and your help is highly appreciated.
 
my $maxWaitTime = 60; # In reality, this is passed in


if($delta >= $maxWaitTime) {
$self->logMsg("Error: Deadline Expired: still pinging after $delta seconds.\n");
die;
}

Based on above, if $maxWaitTime passed in was not 60, then $delta was not reached, if so die did not happen.

Oct 1 02:55:59 3 Ping Failure Detected
Oct 1 02:58:48 Error: Deadline Expired : still pinging after 180 seconds.

Also, 58-55 (rounding off) is ~ 3 min, which would 3x60 = 180 secs, so it seems value of $maxWaitTime or $delta could be the possible culprits.
 
Thank you, Max.

my $maxWaitTime = 60;

The above passed-in value is right. I have made sure that.

The 180 seconds time difference is coming from

Oct 1 02:55:48 Max waiting time for no ping: 60 seconds
......
Oct 1 02:58:48 Error: Deadline Expired : still pinging after 180 seconds

which is also correct.

These two numbers together don't make sense at all. This just happened last Friday and I checked logs under /var. Did not find anything wrong.

I want to log something here (in red):
Code:
if($self->ping()) {
  ++$pingFailures;
  $self->logMsg("$pingFailures Ping Failure[s] Detected\n");    
}
else {
  [b][COLOR=red]# log some system/network info here[/color][/b]
  $pingFailures = 0;
}

My guess is that the testing host must have some kind of hiccups. But I don't know how to approve it.
 
I would recommend priting out the values of $now, $starttime and $delta. This will ensure that their is truly a $delta.
Also, the value of $now, does it require a chomp?
 
Thank you, Max. I have tried all your suggestions before I came here posting the question. Everyting looks fine and I simply have no way to reproduce it. However, during the nightly run, this problem does show up once or twice a month.

Thanks again.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top