Help - nodes wrongly seen as 'down'

Status
Not open for further replies.

MelchiorKD

IS-IT--Management
Sep 19, 2001
US
I am running NNM 6.2 on W2K on a test server and NNM 6.1 on Solaris on a production server. The W2K server is to replace the Solaris server. Each has an Internet container that houses our Internet access routers and policy-based Internet routers. The test server sees these routers go 'down' when in fact they are not. The production server reports the status of these routers accurately (I know that the production server is the accurate one). These are the only devices this is happening with; everything else within NNM is reported accurately. The time at which this occurs varies, but it is mostly at night, stopping at around 8:30 AM, and it may happen once or twice during the work day. It can happen anywhere from a handful of times to over 50 times in a day. There are differences between the two servers, since production is NNM 250 and the test server is NNM Enterprise with many more interfaces to monitor, so the polling, discovery, etc. are slightly different. I would think that if this were the cause, it wouldn't be the same devices going 'down' every time, and it would be happening to other devices as well. I am at a loss, and this is the only issue I have with the box before it can go into production. Any thoughts would be appreciated.
 
I have seen this problem before, and it is usually due to DNS lookups taking too long. If the netmon polling queue gets too backed up, it falls behind in status polling and reports nodes as down when it simply has not had time to ping them. I had this happen at my last job every night when network backups were running, because they were loading down the DNS server. Start checking there.
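
To make the backlog argument above concrete, here is a minimal sketch of the arithmetic. It assumes a simplified single-worker model, which is not netmon's real scheduler, and the node count, polling interval, and per-poll times are hypothetical numbers chosen only for illustration.

```python
def polling_backlog(nodes, seconds_per_poll, interval_seconds, cycles):
    """Return the number of polls still waiting after each polling cycle.

    Simplified single-worker model (an assumption, not netmon's real
    scheduler): every cycle adds `nodes` polls to the queue, and the worker
    can finish interval_seconds / seconds_per_poll of them in that cycle.
    """
    capacity = int(interval_seconds // seconds_per_poll)
    backlog = 0
    history = []
    for _ in range(cycles):
        # Polls that do not fit into this cycle carry over to the next one.
        backlog = max(0, backlog + nodes - capacity)
        history.append(backlog)
    return history


# Hypothetical numbers: 500 nodes, status poll every 300 seconds.
fast = polling_backlog(500, 0.25, 300, 3)  # name lookups answer quickly
slow = polling_backlog(500, 2.0, 300, 3)   # lookups stall behind a busy DNS server
print("fast DNS backlog per cycle:", fast)
print("slow DNS backlog per cycle:", slow)
```

In this toy model the backlog stays at zero while each poll is cheap, and grows every cycle once a slow lookup pushes the per-poll time past what the interval can absorb.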
 
I have a local hosts file on the box that contains all the devices within NNM. I ran checkdns.ovpl, and it returns devices that are in the local hosts file, but not the routers being seen as 'down'. Then again, it is the middle of the day. How can I improve the DNS resolution?
 
Two ways I use.

1) Set up a cache-only DNS server on your management server and use that as your primary DNS server.

2) I wrote a script that runs ovtopodump, parses the output, and writes it to the /etc/hosts file. I can send it to you, but it will need to be modified. Do you know Perl?
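
The Perl script mentioned in option 2 is not shown in the thread. As a rough illustration of the last step only, here is a Python sketch that turns (IP, hostname) pairs into hosts-file lines. Parsing the ovtopodump output is deliberately left out, because its format is not shown here; the function name `hosts_entries` and the sample addresses are hypothetical.

```python
import ipaddress


def hosts_entries(pairs):
    """Turn (ip, hostname) pairs into hosts-file lines.

    Rows with an invalid IP, an empty name, or an IP that was already
    emitted are skipped, so the first name seen for an address wins.
    """
    seen = set()
    lines = []
    for ip, name in pairs:
        try:
            addr = ipaddress.ip_address(ip.strip())
        except ValueError:
            continue  # not an IP address, ignore the row
        name = name.strip()
        if not name or addr in seen:
            continue
        seen.add(addr)
        lines.append(f"{addr}\t{name}")
    return lines


# Hypothetical sample data (documentation address range), standing in for
# whatever the topology dump would yield after parsing.
sample = [
    ("192.0.2.1", "policy1"),
    ("192.0.2.2", "policy2"),
    ("192.0.2.1", "policy1-dup"),
    ("not-an-ip", "junk"),
]
print("\n".join(hosts_entries(sample)))
```

Printing the lines rather than writing them straight to the hosts file makes it easy to review the result before replacing anything on the management server.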
 
Wait a minute. Running the cache-only DNS server will only work for Solaris, which leads me to the question: why are you switching to W2K?
 
We are switching to W2K because of the lack of Unix support within my company. We only have two Unix guys, and they are already loaded with work, so for less downtime we went with W2K.
 
Well then, the question arises of why the Solaris server is OK with DNS and the W2K server is not. Do an nslookup on the Solaris system and note the response time, then do the same on the W2K system and see if there is a difference. I do not know W2K well enough to troubleshoot DNS problems, other than to ensure they are set up correctly for the TCP protocol. If none of that works, I may be able to port my autohosts script to W2K, but unless you know a little Perl (or know someone who does), I will not be able to test it.
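
One way to compare lookup times on the two servers with the same tool is a small timing script run on each box. This is a sketch, not something from the thread: `time_lookup` is a hypothetical helper, and the host list is a placeholder to be replaced with the router names in question.

```python
import socket
import time


def time_lookup(name, resolver=socket.gethostbyname, clock=time.perf_counter):
    """Return (result_or_None, elapsed_seconds) for a single name lookup."""
    start = clock()
    try:
        result = resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are OSError subclasses.
        # A failed lookup is still timed: slow failures are the worst case.
        result = None
    return result, clock() - start


if __name__ == "__main__":
    # Placeholder list: substitute the routers that are reported as 'down'.
    for host in ["localhost"]:
        address, seconds = time_lookup(host)
        print(f"{host}: {address} in {seconds * 1000:.1f} ms")
```

Passing `resolver=socket.gethostbyaddr` with an IP address times the reverse lookup instead. Running the script at night, when the false 'down' events cluster, is more telling than a daytime check.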
 
Thanks for your help, I'll check out the DNS on the Solaris box. There is one guy here that knows Perl pretty well.
 
I created a file on the test server (ipNoLookup, a feature of NNM 6.2) that tells NNM not to resolve names for certain IPs or blocks of IPs. Got the same result.
 
OK...by any chance are the routers in question running HSRP with a virtual address?
 
Here's our basic Internet setup:
                         Internet
                             |
          +------------------+------------------+
          |                                     |
+-------------------+                 +-------------------+
|    Internet 1     |----+       +----|    Internet 2     |
+-------------------+     \     /     +-------------------+
          |                \   /                |
          |                 \ /                 |
          |                  X                  |
          |                 / \                 |
          |                /   \                |
+-------------------+     /     \     +-------------------+
|     Policy 1      |----+       +----|     Policy 2      |
+-------------------+                 +-------------------+
          |            +-----------+            |
          |            |Policy HSRP|            |
          |            +-----------+            |
+-------------------+                 +-------------------+
|       Pix 1       |----Failover-----|       Pix 2       |
+-------------------+                 +-------------------+
     |         |                           |         |
    DMZ     Inside                      Inside      DMZ


I am not 100% sure, but I think it is the policy routers that are being seen as 'down', and subsequently the Internet routers are too.
 
I see. HSRP routers need to be configured in a certain way, as follows:

1) Place the virtual addresses in the netmon.noDiscover file.

2) Unmanage all but the loopback interfaces on the HSRP router pair and ensure communication to the management server is via the loopback. This can be done via a command on the router.

This might be configured correctly on the production system and not on the test system. Also, I have seen posts about setting a bogus SNMP community name for the routers. See if that has been done as well.
 
I'll give that a shot. Sorry my diagram didn't come out as well when posted as it did when I finished it.
 