I inherited a Perl program that makes sure a remote host is not ping-able within 60 seconds. A piece of simplified source codes is list below:
In one of our nightly run logs, we see something like this:
And we don't understand why there is such a huge time gap in between those two red lines.
The actual ping cmd used in ping() is:
The ping command will wait 3 seconds at the most plus 1 second sleeping time. That's why we see the gap of ping failures is 4 seconds, which makes perfect sense.
To some extent, I managed to duplicate the problem with following log info:
The above message also makes sense -- the process died as soon as $delta is larger than 60.
So the question remains: why didn't the process die sooner on Oct 1st? How could this process wait for so long to die?
My guess is that the host was too busy. But how can I prove that? How can I write a program that is smart enough to log every piece of info whenever this kind of error happens again?
I know I may get some kind of info from sys logs under /var. However, this kind of error happens about once a month. The logs under /var are not preserved well.
I hope I have made myself clear and your help is highly appreciated.
Code:
my $pingFailures = 0;
my $maxWaitTime = 60; # In reality, this is passed in
while ($pingFailures < 4) { # to make sure ping() failed 4 times
# The return value of ping() is either 1 (failed) or 0 (successful)
if($self->ping()) {
++$pingFailures;
$self->logMsg("$pingFailures Ping Failure[s] Detected\n");
}
else {
$pingFailures = 0;
}
my $now = time();
my $delta = $now - $startTime; # $startTime is passed in
if($delta >= $maxWaitTime) {
$self->logMsg("Error: Deadline Expired: still pinging after $delta seconds.\n");
die;
}
sleep 1; # $sleep is passed in
}
Code:
Oct 1 02:55:48 Max waiting time for no ping: 60 seconds
Oct 1 02:55:51 1 Ping Failure[s] Detected
Oct 1 02:55:55 2 Ping Failure[s] Detected[COLOR=red]
Oct 1 02:55:59 3 Ping Failure[s] Detected
Oct 1 02:58:48 Error: Deadline Expired : still pinging after 180 seconds.[/color]
The actual ping cmd used in ping() is:
Code:
ping -c 1 -W 3 $ip
The ping command will wait 3 seconds at the most plus 1 second sleeping time. That's why we see the gap of ping failures is 4 seconds, which makes perfect sense.
To some extent, I managed to duplicate the problem with following log info:
Code:
Nov 1 02:55:48 Max waiting time for no ping: 60 seconds
Nov 1 02:55:51 1 Ping Failure[s] Detected
Nov 1 02:55:55 2 Ping Failure[s] Detected
Nov 1 02:55:59 3 Ping Failure[s] Detected
Nov 1 02:56:50 Error: Deadline Expired : still pinging after 62 seconds
So the question remains: why didn't the process die sooner on Oct 1st? How could this process wait for so long to die?
My guess is that the host was too busy. But how can I prove that? How can I write a program that is smart enough to log every piece of info whenever this kind of error happens again?
I know I may get some kind of info from sys logs under /var. However, this kind of error happens about once a month. The logs under /var are not preserved well.
I hope I have made myself clear and your help is highly appreciated.