
Repeating cron jobs after network outage


jhumkey

Programmer
Sep 16, 2005
AIX 5.2 (not sure about the ML#; I'm the programmer wasting time trying to prove there IS a problem, not in the Tech group that should be fixing it.)

I'm in the USA, with an AIX box in Central America (so comms between the sites go down frequently).

Occasionally, when networking has been down (country to country; the network was fine internally at the remote site, and the remote box itself never went down), upon reestablishing communications the remote box seems to "repeat" all the cron jobs skipped while networking was down.

Instead of just getting the 6:37pm run, I seem to get the 20 runs skipped while the lines were down for 20 minutes, all executed at once at 6:37pm. (NO, NOTHING in the jobs uses networking.) These are '* * * * * /usr/local/bin/foo' type jobs that should run once a minute, complete in under 10 seconds, and have problems when they run overlapping. We know multiple copies are running at once, since their temp files are colliding, and it only happens at network restoration time. (Yes, I could use flags to block simultaneous runs, but that's evading the problem. Should cron be running "skipped" jobs and trying to "play catchup"??? I haven't seen this since very old Sys V Unix days.)
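(Just so we're talking about the same thing, the "flag" I mean is a lock-file guard at the top of the job, roughly like this. The paths and names are made up, not our real ones:)

#!/bin/sh
# hypothetical lock-file guard; /usr/local/bin/foo and this lock path are examples
LOCK=/tmp/foo.lock
if [ -f "$LOCK" ]; then
    exit 0    # a previous run is still going (or died without cleaning up), so skip this one
fi
touch "$LOCK"
trap 'rm -f "$LOCK"' 0 1 2 15
# ... the real once-a-minute work goes here ...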

Has anyone else experienced "repeated" cron jobs, or cron jobs not running (or running all at once) around network outages on AIX 5.2?

Sorry I can't give more info, some is proprietary, and without root, I'm stumbling in the dark a bit.

Thanks,

jkh
 
And FYI, it's one of a pair of HACMP boxes. Not sure how to check for the HA version number.
 
I know you said the jobs don't do any networking, but have you checked to see if any of the filesystems they access are NFS mounted?


Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
They're not (accessing NFS mounts). The SAs use incoming NFS mounts, manually invoked, for program updates and such, but the core application does not. There is a "shared array" that HA controls (only one of the two hosts can utilize it at a time). Why do you ask? What would NFS mounts have to do with cron? Or were you just suggesting they'd interrupt the running job?

The only thing I can find on the web (on duplicate cron jobs) is the case where you use ntpdate (which we're not; we're using ntp) to update the time and step back across a minute boundary. So with ntp, that shouldn't be an issue.

Never mind. I'll figure out something. Maybe the network hiccup is tanking the box (CPU-utilization-wise) and our normally <5 second process is stretching its execution time out enough to overlap with the next activation.

Thanks though.

jkh
 
Possibly you're just getting the notifications of the jobs it ran while the comms were out.
 
I was suggesting that reads, writes, or launches from an NFS mount would behave the way you described, and might not appear at first glance to have anything to do with networking. If 20 minutes of network down equals 20 minutes of delay on these jobs, and it's reproducible, then clearly the jobs, or something they depend on, IS using the network.

Temp file collisions are easily avoided by using the process id in the file name (e.g. /tmp/$$.temp). You should make this standard practice to avoid race conditions like the one you've encountered.

If there are some scripts that depend on others running first, you should make yet another script that runs them in the correct order, then cron that script instead.
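Something along these lines, where the script names are just placeholders for yours:

#!/bin/sh
# run_minute_jobs.sh - hypothetical wrapper: runs the dependent jobs in order
# and gives each run its own temp file ($$ is this shell's PID, so runs can't collide)
TMP=/tmp/minute_jobs.$$.tmp

/usr/local/bin/first_job  > "$TMP" 2>&1
/usr/local/bin/second_job >> "$TMP" 2>&1

rm -f "$TMP"

Then the crontab entry becomes a single line:

* * * * * /usr/local/bin/run_minute_jobs.sh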


Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
Isn't it more likely that the cron jobs do get launched, but hang while the link is down, and then all complete at the same time?

Look at either NFS, as suggested before, or at having a YP (NIS) client bound to a server on the other side of the link.

Cheers
 
Some additional knowledge. One of the scripts that seems to "repeat" is a "find" across a local filesystem for "core" files; it moves/renames found cores and sends an email. The find/move/rename has NOTHING (not even NFS) to do with networking. Now . . . with the sendmail service not running . . . I wonder how long that single email hangs on before giving up when networking is down??? That could explain some of it.
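(The real script is proprietary, but roughly, with made-up paths and addresses, it does something like this:)

#!/bin/sh
# hypothetical reconstruction of the core-collector job; paths and address are examples
STAMP=`date +%Y%m%d%H%M%S`
N=0
find /app/data -name core -type f 2>/dev/null | while read CORE
do
    N=`expr $N + 1`
    NEW=/app/cores/core.$STAMP.$N
    mv "$CORE" "$NEW"
    echo "core found: $CORE -> $NEW" | mail -s "core on `hostname`" support@example.com
done

It's that mail step at the end I'm now suspicious of when the link is down.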

And (using oslevel as suggested before wasn't enough; using 'oslevel -r') I see my patch level is 5200-04. So we're not completely at the base release.

Thanks for all the suggestions. I'll look into the "email not going through" hanging the initial job as the primary culprit at the moment.
 
How is that find coded?

You can limit the find to one filesystem (if others are mounted 'downhill') with -xdev:

find / -xdev

will only scan root FS, not /usr, /home, /tmp, ...

-or-

use -fstype to scan only certain types of filesystems:

find /var -fstype jfs -o -fstype jfs2

will only scan files in jfs or jfs2 filesystems from /var downward.
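So for your core hunt, something like this should keep the find on local filesystems only (your starting directory will differ):

find / \( -fstype jfs -o -fstype jfs2 \) -name core -type f -print

Note the escaped parentheses: without them the -o would bind only the second -fstype to the -name test.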


HTH,

p5wizard
 
Does it send an email to a local or remote user? Do all of the other "repeaters" send emails?



Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
Sure you're not running ypbind, connecting via TCP/IP to a remote host?
 
No. I'm sure I'm not crossing filesystems, nor using networking via ypbind to get to any remote box. (We're not using ANY part of the "yp" suite as far as I can tell.) I would point out that someone (one of the SAs) left a hard NFS mount from that box to another remote box. I use NOTHING to do with that mount. But on AIX with NFS, we've proven in the past that ALL scripts have problems executing and fail mysteriously if a hard NFS mount is lost, EVEN IF YOU'RE NOT USING ANY FILE/PATH ASSOCIATED WITH THE MOUNT.
So #1, we're reactivating the sendmail daemon, to stop individual sendmail processes from backing up; #2, we've broken the offending NFS mount; and #3, we've reminded the SAs to use soft NFS mounts (yet again). That could very well explain all our problems. (I've hated NFS for years. It's been nothing but grief for me since Unix SysV 15 years ago, and it seems not much better on modern AIX boxes.)
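(For the record, the "soft mount" reminder amounts to something like this; the host and paths are just examples:)

# soft: NFS operations return an error once the retries run out, instead of
# retrying forever and wedging the client when the server or link goes away
mount -o soft,intr nimserver:/export/images /mnt/images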
Thanks for all the suggestions.
 
Well, then change your find to use -fstype jfs so someone else's stray NFS mount won't bother your script...


HTH,

p5wizard
 
Multiple scripts will fail once the NFS mount goes bad, and none of them actually uses the NFS mount. The NFS mount is used by the SAs only to transfer upgrade files (NIM or something), so none of the scripts (mine or the other app-devs') touches anything about it. But once NFS dies . . . all scripts go kablooey. Not sure why, other than that the default hard mount with 1000 retries (or 1000 seconds, whichever it is) tanks performance on the machine and ends up killing seemingly unrelated things. Either way, dropping the NFS mount seems to have fixed all the problems. So remember . . . "Hard NFS mounts BAD! Soft mounts are your friend." Thanks for all the suggestions.
 