
Repeating cron jobs after network outage


jhumkey

Programmer
Sep 16, 2005
AIX 5.2 (not sure about the ML#; I'm the programmer wasting time trying to prove there IS a problem, not in the Tech group that should be fixing it.)

I'm in the USA, with an AIX box in Central America (so comms between the sites go down frequently).

Occasionally, when networking has been down (country to country; the network was fine internally at the remote site, and the remote box itself never went down), upon reestablishing communications the remote box seems to "repeat" all the cron jobs skipped while networking was down.

Instead of just getting the 6:37pm run, I seem to get the 20 runs skipped while the lines were down for 20 minutes, all executed at once at 6:37pm. (NO, NOTHING in the jobs uses networking.) These are '* * * * * /usr/local/bin/foo' type jobs that should run once a minute, complete in under 10 seconds, and have problems when they run overlapping. We know multiple copies are running at once, since their temp files are colliding, and it only happens at network restoration time. (Yes, I could use flags to block simultaneous runs, but that's evading the problem. Should cron be running "skipped" jobs and trying to "play catchup"??? I haven't seen this since very old Sys V Unix days.)
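(Just so we're talking about the same thing, the "flag" I mean is a lock-file guard at the top of the job, roughly like this. The paths and names are made up, not our real ones:)

#!/bin/sh
# hypothetical lock-file guard; /usr/local/bin/foo and this lock path are examples
LOCK=/tmp/foo.lock
if [ -f "$LOCK" ]; then
    exit 0    # a previous run is still going (or died without cleaning up), so skip this one
fi
touch "$LOCK"
trap 'rm -f "$LOCK"' 0 1 2 15
# ... the real once-a-minute work goes here ...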

Has anyone else experienced "repeated" cron jobs, or cron jobs not running (or running all at once) around network outages on AIX 5.2?

Sorry I can't give more info, some is proprietary, and without root, I'm stumbling in the dark a bit.

Thanks,

jkh
 
And FYI, it's one of a pair of HACMP boxes. Not sure how to check for the HA version number.
 
I know you said the jobs don't do any networking, but have you checked to see if any of the filesystems they access are NFS mounted?


Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
They're not (accessing NFS mounts). The SAs use incoming NFS mounts, manually invoked, for program updates and such, but the core application does not. There is a "shared array" that HA controls (only one of the two hosts can utilize it at a time). Why do you ask? What would NFS mounts have to do with cron? Or were you just suggesting they'd interrupt the running job?

The only thing I can find on the web (on duplicate cron jobs) is the case where you use ntpdate (which we're not; we're using ntp) to update the time and step back across a minute boundary. So with ntp, that shouldn't be an issue.

Never mind. I'll figure out something. Maybe the network hiccup is tanking the box (CPU-utilization-wise) and our normally <5 second process is stretching its execution time out enough to overlap with the next activation.

Thanks though.

jkh
 
Possibly you're just getting the notifications of the jobs it ran while the comms were out.
 
I was suggesting that reads, writes, or launches from an NFS mount would behave the way you described, and might not appear at first glance to have anything to do with networking. If 20 minutes of network down equals 20 minutes of delay on these jobs, and it's reproducible, then clearly the jobs, or something they depend on, IS using the network.

Temp file collisions are easily avoided by using the process id in the file name (e.g. /tmp/$$.temp). You should make this standard practice to avoid race conditions like the one you've encountered.

If there are some scripts that depend on others running first, you should make yet another script that runs them in the correct order, then cron that script instead.
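Something along these lines, where the script names are just placeholders for yours:

#!/bin/sh
# run_minute_jobs.sh - hypothetical wrapper: runs the dependent jobs in order
# and gives each run its own temp file ($$ is this shell's PID, so runs can't collide)
TMP=/tmp/minute_jobs.$$.tmp

/usr/local/bin/first_job  > "$TMP" 2>&1
/usr/local/bin/second_job >> "$TMP" 2>&1

rm -f "$TMP"

Then the crontab entry becomes a single line:

* * * * * /usr/local/bin/run_minute_jobs.sh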


Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
Isn't it more likely that the cron jobs do get launched, but hang while the link is down, and then all complete at the same time?

Look at either NFS, as suggested before, or at having a YP (NIS) client bound to a server on the other side of the link.

Cheers
 
Some additional knowledge. One of the scripts that seems to "repeat" is a "find" across a local filesystem for "core" files; it moves/renames found cores and sends an email. The find/move/rename has NOTHING (not even NFS) to do with networking. Now . . . with the sendmail service not running . . . I wonder how long that single email hangs on before giving up when networking is down??? That could explain some of it.
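(The real script is proprietary, but roughly, with made-up paths and addresses, it does something like this:)

#!/bin/sh
# hypothetical reconstruction of the core-collector job; paths and address are examples
STAMP=`date +%Y%m%d%H%M%S`
N=0
find /app/data -name core -type f 2>/dev/null | while read CORE
do
    N=`expr $N + 1`
    NEW=/app/cores/core.$STAMP.$N
    mv "$CORE" "$NEW"
    echo "core found: $CORE -> $NEW" | mail -s "core on `hostname`" support@example.com
done

It's that mail step at the end I'm now suspicious of when the link is down.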

And (using oslevel as suggested before wasn't enough; using 'oslevel -r') I see my patch level is 5200-04. So we're not completely at the base release.

Thanks for all the suggestions. I'll look into the "email not going through" hanging the initial job as the primary culprit at the moment.
 
How is that find coded?

You can limit the find to one filesystem (if others are mounted 'downhill') with -xdev:

find / -xdev

will only scan root FS, not /usr, /home, /tmp, ...

-or-

use -fstype to scan only certain types of filesystems:

find /var -fstype jfs -o -fstype jfs2

will only scan files in jfs or jfs2 filesystems from /var downward.
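So for your core hunt, something like this should keep the find on local filesystems only (your starting directory will differ):

find / \( -fstype jfs -o -fstype jfs2 \) -name core -type f -print

Note the escaped parentheses: without them the -o would bind only the second -fstype to the -name test.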


HTH,

p5wizard
 
Does it send an email to a local or remote user? Do all of the other "repeaters" send emails?



Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
Sure you're not running ypbind, connecting via TCP/IP to a remote host?
 
No. I'm sure I'm not crossing filesystems, nor using networking via ypbind to get to any remote box. (We're not using ANY part of the "yp" suite as far as I can tell.) I would point out that someone (one of the SAs) left a hard NFS mount from that box to another remote box. I use NOTHING to do with that mount. But on AIX with NFS, we've proven in the past that ALL scripts have problems executing and fail mysteriously if a hard NFS mount is lost, EVEN IF YOU'RE NOT USING ANY FILE/PATH ASSOCIATED WITH THE MOUNT.
So #1, we're reactivating the sendmail daemon, to stop individual sendmail processes from backing up; #2, we've broken the offending NFS mount; and #3, we've reminded the SAs to use soft NFS mounts (yet again). That could very well explain all our problems. (I've hated NFS for years. It's been nothing but grief for me since Unix SysV 15 years ago, and it seems not much better on modern AIX boxes.)
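(For the record, the "soft mount" reminder amounts to something like this; the host and paths are just examples:)

# soft: NFS operations return an error once the retries run out, instead of
# retrying forever and wedging the client when the server or link goes away
mount -o soft,intr nimserver:/export/images /mnt/images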
Thanks for all the suggestions.
 
Well, then change your find to use -fstype jfs so someone else's stray NFS mount won't bother your script...


HTH,

p5wizard
 
Multiple scripts will fail once the NFS mount goes bad, and none of them actually uses the NFS mount. The NFS mount is used by the SAs only to transfer upgrade files (NIM or something), so none of the scripts (mine or the other app-devs') touches anything about it. But once NFS dies . . . all scripts go kablooey. Not sure why, other than that the default hard mount with 1000 retries (or 1000 seconds, whichever it is) tanks performance on the machine and ends up killing seemingly unrelated things. Either way, dropping the NFS mount seems to have fixed all the problems. So remember . . . "Hard NFS mounts BAD! Soft mounts are your friend." Thanks for all the suggestions.
 