MAC address inconsistency after failover

breadknife · Apr 17, 2003

I have a W2K Advanced Server cluster running MSX 2K and file and print sharing. There are two virtual IPs, one homed on each node. When either fails over, remote sites lose connectivity, and random machines on the local site also lose connectivity. I can understand the remote sites to some extent, as there is an issue with gratuitous ARP requests, but locally I cannot see why some machines can still connect and others cannot. I looked at the ARP table on a machine that could not connect; it had the incorrect MAC address for the IP, so I deleted the entry. You can then ping the IP a few times before it stops responding again; check the ARP table and it has pulled an incorrect MAC again! Could this be the cluster sending out invalid info, or a case of the switches being too clever for my own good? On the local LAN there is a Cisco 3550 which the servers all connect to (on the public side), and this uplinks to a bank of five Cisco 2950 switches (not clustered switches; in fact they have never had a console cable near them). Please help, because I've run out of ideas.

Thanks

victorv · Apr 23, 2003

Hi,

I have a similar problem with two IBM 235 servers with Broadcom
Gigabit Ethernet adapters, a Nortel switch, a Cisco router, etc.

I need two virtual servers (file servers), and they
are on two different networks:
192.168.1.0 255.255.255.0
1.4.0.0 255.255.252.0

Since the NIC above supports VLANs, and so does
the switch, we implemented the two virtual servers
on the same adapter on two different VLANs.

After some days of testing, failovers, etc.,
we put the servers into production.
Just one of the servers shows the following problem:

1) a client accesses the share on the cluster;
entering arp -a shows the address of the virtual
server associated with MAC address A

2) a failover happens: the virtual server moves to server 2

3) the client can still access the share;
arp shows the virtual server at MAC address B

..... 5 minutes ....

4) the client can no longer access the share;
entering arp -a shows that in its ARP table the
virtual server is back at MAC address A (the original)
(this is wrong, since the virtual server is now on
the second node of the cluster).
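
The sequence in steps 1 to 4 can be confirmed from the client side by saving the output of arp -a at intervals and checking whether the MAC for the virtual IP goes back to an earlier value. The following is only a rough sketch: it assumes the usual Windows arp -a layout (IP, hyphen-separated MAC, type), and the function names are made up for illustration.

```python
import re

# Matches one entry line of Windows `arp -a` output, for example:
#   192.168.1.50          00-09-6b-aa-aa-aa     dynamic
# (the exact column layout is an assumption about the client OS)
ENTRY = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+(\w+)\s*$"
)


def parse_arp_table(text):
    """Return {ip: mac} for every entry line found in the arp -a output."""
    table = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            ip, mac, _entry_type = match.groups()
            table[ip] = mac.lower()
    return table


def mac_history(snapshots, ip):
    """Collapse time-ordered snapshots into the sequence of MACs seen for
    `ip`, dropping consecutive repeats."""
    history = []
    for text in snapshots:
        mac = parse_arp_table(text).get(ip)
        if mac is not None and (not history or history[-1] != mac):
            history.append(mac)
    return history


def flapped_back(history):
    """True if some MAC reappears after a different MAC had replaced it."""
    return len(history) != len(set(history))
```

A history of A, B, A for the virtual IP matches step 4 above; a history of A, B is what a clean failover should leave behind.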

Who is to blame???

I explained the problem to the network manager and,
obviously, he said that it is not the network.

I asked IBM and they told me that it is an OS problem.

I searched the MS sites for days, without success.

I concluded that I am the one to blame!

I installed the latest driver for the NIC and SP3,
but nothing changed. I downloaded a sniffer
and analyzed the packets travelling on the network
during the sequence above.

I reached this conclusion: the switch uses the MAC addresses
of the NICs to forward packets quickly to the right port, but it
does not broadcast them to the network.

I disabled VLANs on the Broadcom NIC, giving
it just one IP address. I installed an additional
NIC (a normal 10/100) in each server, and everything is working
without problems.

I don't know how similar our situations are;
tell me your configuration. I hope my conclusions
are useful to you.

bye

GiaBetiu · Apr 23, 2003

Well guys, this is about something called a "gratuitous ARP request".
When a failover happens for a virtual server, the MS cluster sends such a packet (as a broadcast).
Many switches do not allow ARP broadcasts to cross them. Check the documentation of your switch and enable the passing of "gratuitous ARP requests".

Then, it will work.

Gia Betiu
giabetiu@chello.nl
Computer Eng. CNE 4, CNE 5, MCSE Win2K
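
For anyone who wants to recognise such a packet in a sniffer trace, here is a small sketch of what a gratuitous ARP request looks like on the wire: an ordinary ARP request sent to the Ethernet broadcast address in which the sender IP and the target IP are the same. The function names and the all-zero target hardware address are choices made for this example, not something taken from the cluster software; the code only builds the bytes and does not send anything.

```python
import socket
import struct

BROADCAST = b"\xff" * 6  # Ethernet broadcast destination


def mac_to_bytes(mac):
    """Accept aa:bb:cc:dd:ee:ff or aa-bb-cc-dd-ee-ff and return 6 bytes."""
    return bytes(int(part, 16) for part in mac.replace("-", ":").split(":"))


def build_gratuitous_arp(sender_mac, ip):
    """Ethernet frame carrying an ARP request with sender IP == target IP."""
    smac = mac_to_bytes(sender_mac)
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth_header = BROADCAST + smac + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode 1 = request
        smac,         # sender hardware address (the new owner's MAC)
        ip_bytes,     # sender protocol address (the virtual IP)
        b"\x00" * 6,  # target hardware address (zeroed here by assumption)
        ip_bytes,     # target protocol address, same as the sender's
    )
    return eth_header + arp
```

Hosts that already have the virtual IP in their ARP cache are expected to overwrite the entry with the sender MAC from this frame, which is why it has to reach every segment where clients live.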

victorv · Apr 23, 2003

In my case, the "gratuitous ARP request" passes through the
switches: right after failover, the client receives
the change. Some minutes later "someone" tells it
that the address has moved back, which is false.
Moreover, using just one NIC for each IP address,
the switch passes the "gratuitous ARP request" fine and
"nobody" sends "wrong" packets around the network.

bye

gbaker · Apr 29, 2003

A workaround suggested by Microsoft: place a hub between the cluster nodes and the switch. This solves the gratuitous ARP problem, as the hub will pass broadcasts, while the switch keeps directing packets to the port that the hub is attached to. The hub cares nothing about MAC addresses and will continue passing packets to the proper node.
This worked well for me, and I've seen no decrease in performance.
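
To illustrate why the hub helps, here is a toy model of a learning switch with a hub on one port. It is a simplification with invented names (real switches also age out entries and handle VLANs), but it shows the point: once both nodes sit behind the hub, the switch learns both MACs on the same port, so a failover never requires the switch to move an entry.

```python
class LearningSwitch:
    """Toy MAC-learning switch: remembers on which port each source MAC
    was last seen and forwards known destinations to that port only."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC -> port where it was last seen as a source

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source, then return the list of egress ports."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown or broadcast destination: flood to all other ports
            return [p for p in self.ports if p != in_port]
        # Never send a frame back out of the port it arrived on
        return [] if out_port == in_port else [out_port]


def hub_repeat(ports, in_port):
    """A hub repeats every frame to all ports except the one it came in on."""
    return [p for p in ports if p != in_port]
```

With the hub uplink on switch port 1 and a client on port 2, frames addressed to either node's MAC leave the switch on port 1, and the hub then repeats them to both nodes; the NIC that owns the destination MAC picks them up.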

breadknife · May 30, 2003

Thanks chaps. It turns out the actual problem was with the grouped network card drivers (BASC) on one of the nodes. Once I reinstalled them it all worked fine again.

Cheers

YannL · Aug 12, 2003

Can you please be more specific about "grouped network card drivers (BASC)"? What was the problem?

jmigueldelrio · Oct 31, 2003

Hi, everybody:
I have the same symptoms victorv described, but I'm not using VLANs:
1) after a resource group failover, the ARP caches on the nodes of the LAN are updated correctly (with the MAC of the resource group's new owner)
2) some minutes later, "someone" tells the nodes that the resource group's virtual IP belongs to the original node (the ARP caches are updated with the previous owner's MAC).
Any ideas?
Regards.

flaurijssens · Dec 18, 2003

What I've seen is that the failed node still broadcasts ARP requests for the client over the network. When a client receives those requests, it immediately updates its ARP cache with the wrong MAC address for the cluster IP. Our first workaround for this issue was to reboot the failed node. After a while, we found out that disabling and re-enabling the cluster network card did the trick as well.

I've done quite a few searches on the internet on this; the only thing I've found so far that those cases have in common is the network card: Broadcom gigabit NICs.

I suspect the driver of those NICs (or the teaming driver, perhaps) causes this behaviour.
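
If you have a sniffer capture, the stale requests can be spotted by looking at the sender fields of every ARP packet and listing the IPs that are claimed by more than one MAC. The sketch below assumes you already have the raw 28-byte ARP payloads (Ethernet header stripped); the helper names are invented for the example, and make_arp is only there to build sample data.

```python
import socket
import struct
from collections import defaultdict

ARP_FMT = "!HHBBH6s4s6s4s"  # IPv4-over-Ethernet ARP payload, 28 bytes


def make_arp(op, sender_mac, sender_ip, target_ip):
    """Build a sample ARP payload (illustration and testing only)."""
    sha = bytes(int(part, 16) for part in sender_mac.split(":"))
    return struct.pack(
        ARP_FMT, 1, 0x0800, 6, 4, op,
        sha, socket.inet_aton(sender_ip),
        b"\x00" * 6, socket.inet_aton(target_ip),
    )


def parse_arp(payload):
    """Extract opcode, sender MAC/IP and target IP from an ARP payload."""
    _ht, _pt, _hl, _pl, op, sha, spa, _tha, tpa = struct.unpack(
        ARP_FMT, payload[:28]
    )
    return {
        "op": op,
        "sender_mac": ":".join(f"{byte:02x}" for byte in sha),
        "sender_ip": socket.inet_ntoa(spa),
        "target_ip": socket.inet_ntoa(tpa),
    }


def conflicting_claims(payloads):
    """Return {ip: set of MACs} for every sender IP seen with more than one
    sender MAC. Packets with sender IP 0.0.0.0 claim nothing and are skipped."""
    claims = defaultdict(set)
    for payload in payloads:
        arp = parse_arp(payload)
        if arp["sender_ip"] != "0.0.0.0":
            claims[arp["sender_ip"]].add(arp["sender_mac"])
    return {ip: macs for ip, macs in claims.items() if len(macs) > 1}
```

After a failover, the cluster IP showing up here with both the old and the new owner's MAC would point at the old node still advertising itself in the sender fields of its requests.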

jmigueldelrio · Jan 7, 2004

Thanks, flaurijssens:
we are having some other unpleasant surprises with Broadcom Gigabit NICs in our Dell systems. In fact, according to the Dell forums, we are not alone: many people are having connectivity problems with these NICs.
I have opened a support case with Dell.

flaurijssens · Jan 8, 2004

We were able to solve the problem.

On this particular cluster, both nodes had two Broadcom Gigabit Fiber NICs. On each node, both NICs were teamed in a failover configuration.

We removed the teaming software entirely (uninstalled all Broadcom software in Add/Remove Programs) and removed the teaming protocol driver from the Network Connection settings. Then, we disabled one NIC on each node.

Failover works as expected now.

jmigueldelrio · Jan 8, 2004

Thanks for your input.
We did the same: disabled teaming and uninstalled the Control Suite. This solved some of the problems but not all.
In fact, we are having problems with one machine (no cluster, no dual NIC, just one Intel NIC and one Broadcom NIC): some minutes or hours after a reboot, pings to this machine fail.
