Tek-Tips is the largest IT community on the Internet today!

Odd host file corruption happening....

Status
Not open for further replies.

pmcmicha

Technical User
May 25, 2000
Using SCO UnixWare 7.1.1.

The problem happens during an overnight reboot at 01:05. I am getting the following error messages in the syslog:

Dec 15 01:05:45 <nodename> rpcbind: netdir_getbyname failed on tcp for host <nodename>
Dec 15 01:05:46 <nodename> rpcbind: netdir_getbyname failed on udp for host <nodename>
Dec 15 01:05:46 <nodename> rpcbind: Found 2 errors with network configuration files. Continuing.

When the system comes back up, the hosts file is corrupted. The odd thing is that this error appears in the syslog only on that day; once the hosts file is restored, it does not happen again. Also, the /etc/netconfig file is correct: it has the ownership, permissions, and file size it should have.

Does anyone have any ideas on this? I'm out of them, and I don't have much experience tracking down this kind of issue. Thanks in advance.
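The rpcbind messages say that netdir_getbyname could not translate <nodename>, which fits a hosts file that has lost the node's own entry. With many boxes involved, a post-boot check for that entry could flag the damaged ones. Below is a minimal Python sketch, assuming the standard hosts format (an address followed by one or more names, with # starting a comment); `hosts_has_entry` is a hypothetical helper, not an SCO utility.

```python
def hosts_has_entry(text: str, nodename: str) -> bool:
    """Return True if some line of hosts-format `text` maps an address to `nodename`."""
    wanted = nodename.lower()  # host names compare case-insensitively
    for line in text.splitlines():
        # Everything after '#' is a comment in the hosts format.
        fields = line.split("#", 1)[0].split()
        # A usable entry has an address followed by at least one name.
        if len(fields) >= 2 and wanted in (name.lower() for name in fields[1:]):
            return True
    return False
```

On a box it could be called as `hosts_has_entry(open("/etc/hosts").read(), socket.gethostname())`; a False result suggests the file needs restoring.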
 
First, I'd run fsck -ofull in single-user mode just in case. Maybe /etc/badtrk too.

My next guess would be that it's happening as you go multi-user. Let's hope so anyway, because that's easier to find.

First question: when the hosts file is corrupted, does it have a date and time stamp? If so, look in /etc/rc2.d/messages for a similar time stamp. A match might mean that the rc2.d script that ran just before that time is what corrupted your hosts file. That still doesn't say HOW, but at least it narrows the field.
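Once the timestamps are in hand, the matching step is just "find the last script that started at or before the moment hosts changed". A Python sketch, assuming the log has already been parsed into (start time, script name) pairs; the layout of /etc/rc2.d/messages isn't described here, so that parsing is left out, and `script_before` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence, Tuple


def script_before(events: Sequence[Tuple[datetime, str]], changed: datetime) -> Optional[str]:
    """Return the last script that started at or before `changed`, or None if none did."""
    ordered = sorted(events)             # chronological order of script starts
    times = [t for t, _ in ordered]
    i = bisect_right(times, changed)     # number of starts at or before `changed`
    return ordered[i - 1][1] if i else None
```

The change time of the hosts file itself can be taken from `datetime.fromtimestamp(os.path.getmtime("/etc/hosts"))`.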

If you can't see anything obvious there, the next thing I'd do is boot to single-user mode, confirm that hosts is OK, and then run the rc2.d scripts manually, checking hosts after each one.
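The script-by-script check can also be automated, which helps when the boxes are remote. This Python sketch assumes the scripts are run as `<script> start` under /bin/sh, the usual SVR4 rc convention, and compares a SHA-256 digest of the hosts file after each one. `first_script_that_changes` is a hypothetical helper, and because it really executes the start scripts it should only be run where that is safe, such as a single-user session.

```python
import glob
import hashlib
import os
import subprocess
from typing import Optional


def digest(path: str) -> str:
    """SHA-256 of a file's contents, used to detect any change to it."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def first_script_that_changes(rc_dir: str, hosts: str) -> Optional[str]:
    """Run rc_dir/S* in sorted order with 'start'; return the first one that alters hosts."""
    baseline = digest(hosts)
    for script in sorted(glob.glob(os.path.join(rc_dir, "S*"))):
        # Assumed convention: each start script is invoked as "<script> start".
        subprocess.run(["/bin/sh", script, "start"], check=False)
        if digest(hosts) != baseline:
            return os.path.basename(script)
    return None
```

Usage would be `first_script_that_changes("/etc/rc2.d", "/etc/hosts")`; None means no start script touched the file.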

If it's already corrupt in single-user mode, see whether one of the shutdown scripts is damaging it; one of the K scripts could be at fault.

See if you are not already familiar with rc2.d scripts.

Tony Lawrence
SCO Unix/Linux Resources tony@pcunix.com
 
I have been able to narrow it down somewhat since my post. The main problem is that I can't boot these boxes to single-user mode, because they are scattered across the US, and only one or maybe two boxes a day show the issue. We do update the boxes regularly, but since all of them receive the same updates, that rules out the updates as the cause. What I have been able to find, though, is this:

When the box comes up, the hosts file is copied to /var/tmp, and I found that this copy was correct. It had a timestamp about one minute earlier, so I don't think one of the shutdown scripts is at fault. I have been through all of the /etc/rc2.d/S* scripts, and they appear to be working correctly.
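Since the /var/tmp copy is known to be good, diffing it against the damaged /etc/hosts would show exactly which entries changed, which may hint at what wrote them. A Python sketch using the standard `difflib.ndiff`; `hosts_changes` is a hypothetical helper, and the name of the copy under /var/tmp isn't given in the thread, so it appears only as a placeholder in the usage note.

```python
import difflib
from typing import List


def hosts_changes(saved: str, current: str) -> List[str]:
    """List lines removed ('- ') from or added ('+ ') to the saved hosts text."""
    diff = difflib.ndiff(saved.splitlines(), current.splitlines())
    # ndiff also yields unchanged ('  ') and hint ('? ') lines; keep only real changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

For example, `hosts_changes(open("/var/tmp/<copy>").read(), open("/etc/hosts").read())`, where `<copy>` stands for whatever file the boot process writes there. Lines starting with `- ` were lost; lines starting with `+ ` are new.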
 
So this is happening on more than one box?

So I think you are now saying that you know when this happens. If that time is consistent, it ought to be the most important clue.

Have you run fsck on these boxes yet?
Tony Lawrence
 
Yes, this is happening on more than one box, but I can't find out what the issue is or why it affects only certain boxes and not others.

Yes, that is very important. We are also having problems with our DHCP server, which are being corrected. Since the rpc script that starts /usr/sbin/rpcbind doesn't run until after the DHCP connection attempt, could something in the DHCP step be corrupting the hosts files?

An fsck runs at every reboot. Mind you, this isn't a full fsck, since each box has about 43 GB on it.
 
Oh, DHCP is undoubtedly the key here: it has to modify /etc/hosts, so that's where things are going wrong. Why, of course, is a different problem.


Are all patches in place here? Is there no way to just give these poor machines one address they can keep?
Tony Lawrence
 
To the best of my knowledge, all patches are in place, since we update each machine regularly, usually once or twice a month.

We could give each machine one address to keep, but there are a lot of machines: about 5000, to be exact. And once DHCP is fixed, they would all have to be changed back.

Thanks for the input though, it has really helped.
 