Slow file access on Clustered volumes - 6.5


ITsmyfault

Hi:

NW 6.5 SP1.1
3 node cluster using an IBM FastT600
Servers are IBM x345's w/Xeon 2.8GHz
QLogic 2340 HBA's
McData Sphereon 4300 switches
Novell MPIO driver
Each server has 2 HBA's; the switches are meshed via "E" ports.

Workstations are Win2k, all patched.. using Client 4.9 SP1

Clustering and multipath appear to be working very well. I can see all 8 paths to data from each server and I can pull out the fiber in the middle of a file xfer and not have it be interrupted. HBA's fail over, life is good.
Or is it?
The only two things on this new cluster that are troublesome are the file access and the amount of time it takes to join the cluster when a node comes up. My 5.1 cluster has almost instant file access and nodes can join the cluster in about 10-15 seconds. The 6.5 cluster needs around a minute or more.
But file access time is the bigger issue. Opening folders is slow, and when I select files, sometimes they select right away and sometimes it takes up to 7 seconds (after I click on them). I have disabled Hyperthreading on the processors in the BIOS. While using the server migration tool I got throughput of >12GB per hour on our 100Mb LAN, which I think all but rules out duplex issues on the servers.. Server NIC's are Intel e1000's, set 100/full. Switches are Baystack 450's, also set 100/full.
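(Rough math on that, for what it's worth: 12 GB/hour is about 12,288 MB / 3,600 s ≈ 3.4 MB/s, call it 27-28 Mbit/s sustained, and a hard duplex mismatch would normally choke a sustained transfer well below that.)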

I can access files on my 5.1 servers nearly instantly. Some of our older NT hosts running client 4.83 SP1 have better file "selection" but are still slow opening files and folders. I can also access files on the mirrored (local) SYS volume on any of the cluster members quickly. The slow response seems to be SAN based.

Any ideas what I should be looking at?

TIA - Joe
 
Joe, this is a pretty complex problem, and I wasn't sure what to say at first.. But I had a brainstorm late last night about how to go about troubleshooting it. I have a suspicion that your hardware is not configured right, but to confirm this you need to eliminate some of the components and see if the slowness problem still exists.

For example.. take it out of the cluster and go straight from the NW6 server to the storage box. Is it still slow?

Take out the redundant HBA's. Is it still slow?

Try using a traditional volume instead of NSS. Does the slowness go away?

Put the data on the local server. Is accessing the data still slow?

I think that by doing this you might be able to at least determine whether it's a hardware problem or something in the software, and at what point the problem appears. Then you'll know where to start looking for a solution.

Marvin Huffaker MCNE, CNE
Marvin Huffaker Consulting
 
Hi Marvin -

Thanks for the thoughts! It is an ugly problem to be sure.. none of the sysops on the support forums will touch it! I think it is a config issue too, but I am not sure which piece. The most obvious thing to me (but I know very little about storage) is that NSS wants to use 4k blocks; the FastT is set for 64k per IBM and cannot go as low as 4k. I'm not sure how that affects reads and writes.
I have found lots of stuff on enabling/disabling cache on the client & server, and issues with oplocks etc., but I have not had much luck there either. I did find a TID about slow file access due to oplocks (10091561), a general performance tuning piece (10012765), one on monitoring NSS performance (10068489), and several others, but the more I think about the degree of slowness, the more it feels like a hardware/config issue. I will try simplifying the whole thing and see if I can ID where the problem starts.
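For anyone following along, the server-side switches those oplock/caching TIDs revolve around are the ones below. Parameter names are from memory, so double-check them in MONITOR first; OFF is just the test value, not a recommendation:

## NCP-level settings the oplock/caching TIDs suggest toggling while testing
SET CLIENT FILE CACHING ENABLED = OFF
SET LEVEL 2 OPLOCKS ENABLED = OFF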
Thanks, - Joe
 
FWIW
Disabled the 2nd port on the FastT after reviewing the device logs on it.. lots of entries showing the controllers running diagnostics (which, if I'm not mistaken, causes failover). I brought all LUNs onto one controller and offlined the other. Suddenly my speed problems are gone. I am opening a hardware case w/IBM this morning.
As an aside, once I started really working the clusters (prior to going to one controller) the servers were abending left and right.. just doing basic junk like setting up DNS, configuring things in iManager. It was scary! Once I went to one port on the storage server, the Netware servers have been rock solid.
 
Since it seems many folks are quick to post problems but you seldom hear the solution, I'm replying to my own thread again. ;) There is a happy ending.

The SAN issue is fixed. We found that the controllers on the main LUN were failing back and forth non-stop.. this was the cause of our slowness and abends, as you can imagine it would be.
The IBM hardware group had me go *back* to the IBMSAN.CDM and their latest QL2300.HAM driver. (The IBM software group had advised me to use the latest QL2300 from Novell and Netware's MPIO driver.) Anyhow, I config'd all that and reset the servers, but the controller kept failing over and back.. so I powered off everything for maybe 10 minutes. Brought up the storage server by itself. Left it up for 15 minutes. When it looked fine I added one server. It came up, the storage was fine.. and I added the other servers at 15-minute intervals. Now the whole thing is fine.. access is fast, servers are stable and life is good.
 
Just curious, did IBM come up with a solution for you? I have almost the exact same configuration and very similar problems. IBM had me go back to 6.0 from 6.5 as there are issues with the IBMSAN driver. I'm still fighting this, any info would be appreciated.

 
IBM basically did come up with our solution, although I'm not sure anyone really knew what they were doing. The software guys seemed to keep confusing Netware with Unix.. chief difference here being that Unix boxes can't have all paths visible or they freak. Netware boxes are fine with it. As a result the IBM hardware guys kept telling me the config could not work as shown (which got me steamed as they supposedly QA'd my design before selling me the gear..)
Anyhow the 2 main issues we had were:
1 - I could not initialize a LUN from NSSMU. I got stuck on that until I called Novell TS, who just had me create a pool from iManager.. which worked fine. iManager (and NSSMU, as it turns out) will go back and init the drive for you, so you can start by making either a pool or a volume.. and it will create the other pieces automatically.
2 - the multipathing bit.. we went from IBMSAN to Novell's MPIO and then back to the IBMSAN.CDM. The Novell MPIO driver appears to trip up the controllers in the FastT and the combo was the source of our problems. We went back to the latest IBMSAN and QL2300.HAM from IBM's support site and then did a cold start of the whole system and now we're solid. Below are the key bits of our config:
Driver versions we're using:
IBMSAN.CDM v1.06.05 Nov 4, 2002
QL2300.HAM v6.50.18 July 11, 2002

Startup.ncf (typical)
## MPIO is off by default, just making sure here
SET MULTI-PATH SUPPORT=OFF
## PSM Files
LOAD ACPIDRV.PSM
## load IBM MPIO driver
LOAD IBMSAN.CDM
##
LOAD IDECD.CDM
LOAD SCSI.CDM
## load ide & LSI onboard SCSI drivers
LOAD IDEATA.HAM SLOT=10006
LOAD LSIMPTNW.HAM SLOT=10015
LOAD LSIMPTNW.HAM SLOT=10016
## config per IBM docs - IBMSAN readme.
LOAD QL2300.HAM SLOT=3 /LUNS /ALLPATHS /PORTNAMES /XRETRY=400 /CONSOLE
LOAD QL2300.HAM SLOT=4 /LUNS /ALLPATHS /PORTNAMES /XRETRY=400 /CONSOLE

*Important: do not load QLCSFTE.CDM, per IBM docs.
IBMSAN.CDM should be the 2nd file to load, per IBM docs.
-----------
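After loading all of that, a couple of standard console commands are handy for confirming the server actually sees the LUNs and adapters (nothing exotic, just the usual sanity checks):

## rescan the fibre channel adapters for newly presented LUNs
SCAN FOR NEW DEVICES
## list every storage device the server can see - the FastT LUNs should all show up
LIST DEVICES
## list the HBAs and the devices hanging off each one
LIST STORAGE ADAPTERS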

Switches are zoned as 'one big zone' so all paths are visible, and we configured storage partitioning on the FastT to group the HBA's and allow/block LUN access.
We have the new clusters nearly in production. Migrated our production printing environment over to them last night as a MOF. Clusters are running very well.

Hope this helps!

- Joe
 
Would you be willing to provide me with your IBM case #? I've had IBM on site for a week and they can't figure this out. Also, did IBM indicate to you any potential problems using 6.5? Have you moved to production yet and is your performance still good?
Appreciate the help.

Thanks.

Dustin Temple
dtemple@state.mt.us
 
Hi Dustin:

It was case #54291L6Q. Mail me offline if you need something. joe(AT)ardsley.com

- Which version of Storage Manager are you on? We're at 8.3.
- Do you have your LUNs configured as "IBMSAN" (as opposed to Netware failover)?
- If you're using QLogic HBA's, give them a call and see if someone will go over their BIOS settings with you. They were very nice on the phone. The trouble is their monitoring app only runs over IPX, and of course I have gotten rid of IPX.. some of the settings that come by default from IBM are, well, not optimal in many cases.
- If the CPU's are Xeon, make sure that Hyperthreading is disabled in the BIOS. This *kills* performance on Netware.
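A quick way to confirm HT is really off once the box is back up is DISPLAY PROCESSORS at the console (standard command; the expected count assumes two physical CPUs per x345):

## with Hyperthreading disabled, a two-way box should list 2 processors, not 4 logical ones
DISPLAY PROCESSORS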

To answer your other Q's -
1 - Not in production yet, but only because I cannot back them up yet. ;) Installing SyncSort on Friday. :) Hope to be in production by the 20th-21st. EOM for sure.
2 - Have migrated Zen 4.0.1, printing, and DNS/DHCP over already. You need a beta TCP/IP patch to fix PXE on 6.5.. that was a fun one to find! ;) Had some issues with printing; see TID 10080373 concerning LPR names. If you print over IP, Netware basically uses Unix-style LPR printing, so LPR names matter ('passthrough' is not a good default for many popular models..). NDPS fails over very well. On DNS, do *not* use the older DNS/DHCP console to set it up; re-install it from the new 6.5 server's PUBLIC directory.. what else?
3 - The performance I can test has been very good. Migrating data to the new servers over a 100Mb LAN I got up past 12MB/sec, which is *flying*. Client 4.9 SP1a helps speed and function noticeably, as well as remapping after failovers. Drive mapping is quick, and file opening and saving to the SAN appear quick (I make a point of checking non-cached files). Once backup is in and happy I'm just going to alter the login scripts, move the cert authority over along with the SLPDA, and be done. (I hope!)

hth - Joe
 
We have gone to 6.5 and have lost failover on our FastT600. I think it's because our arrays are still set to the Netware-IBMSAN host type, not Netware-Failover. Did IBM mention the differences between these? We also use MULTI-PATH SUPPORT=ON and LSIMPE.CDM (not IBMSAN) in our startup.ncf. I am now going to change the host type to Netware-Failover, but I'm trying to determine whether this will break anything before I do. Do you have any comments on this?

Thanks, Cam.
 
I had this exact issue. Make sure hyper-threading is disabled on all of your cluster nodes.
 
Hi Cam:
We have all of our LUNs on Netware-Failover as the host type and are using LSIMPE. Failover is fine. We're on 6.5.2 currently. IBMSAN is not really supported on 6.5 - it was a stopgap in NW6. IBM recommends using the Netware MPIO.
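If you want to double-check which multipath module a server is actually running before you flip the host type, MODULES with a wildcard at the console will tell you (standard command, nothing 6.5-specific):

## shows whether LSIMPE.CDM (or IBMSAN.CDM) is what's actually loaded
MODULES LSIMPE*
MODULES IBMSAN*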
I would second the suggestion to disable hyperthreading.

Since I wrote all that stuff over a year ago, a lot has happened.
- We zoned single initiator, using an HBA for disk and another for tape. The IBM GS SAN consultant we work with considers this a serious "best practice".
- We implemented Syncsort Backup Express, which gave us LAN-free backup with amazing reliability and great speed. (The tape library is on the fabric.)

We have since added another shelf and moved to Storage Manager 9, along with the matching firmware and driver revs.

File versions we're running:
LSIMPE.CDM v1.00.04, May 7, 2004
LSIMPTNW.HAM v3.04.25, Dec 12, 2003
QL2300.HAM v6.80.03, March 22, 2004

If it helps, our startup.ncf looks like this:
----
### SERVER MYSERVER
### UPDATED JSP 12/04/04
### ENABLE NOVELL MPIO DRIVER
SET MULTI-PATH SUPPORT=ON
###
LOAD ACPIDRV.PSM
######## End PSM Drivers ########
### LOAD LSI MPIO DRIVER
LOAD LSIMPE.CDM
### LOAD IDE & SCSI DRIVERS
LOAD IDECD.CDM
LOAD SCSIHD.CDM AEN
## LOAD SYNCSORT PREFERRED TAPE DRIVER
LOAD NWTAPE.CDM
######## End CDM Drivers ########
LOAD IDEATA.HAM SLOT=10006
LOAD LSIMPTNW.HAM SLOT=10015
LOAD LSIMPTNW.HAM SLOT=10016
### THESE SETTINGS PER IBM README ON QL2300.HAM
LOAD QL2300.HAM SLOT=3 /LUNS /ALLPATHS /PORTNAMES /XRETRY=400 /QUALIFIED /CONSOLE
LOAD QL2300.HAM SLOT=4 /LUNS /ALLPATHS /PORTNAMES /XRETRY=400 /QUALIFIED /CONSOLE
######## End HAM Drivers ########
# set parameters
Set maximum physical receive packet size = 2048
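Once that loads, the easy way to confirm the Novell MPIO driver actually sees both paths to each LUN is from the server console. LIST FAILOVER DEVICES is the standard multipath command in 6.5; the MM SET FAILOVER PRIORITY syntax below is from memory, so check the online help before using it:

## shows each shared device with all of its visible paths and which path is active
LIST FAILOVER DEVICES
## optional - set the preferred path so the right HBA is used first
## (the path IDs come from the LIST FAILOVER DEVICES output; syntax from memory)
MM SET FAILOVER PRIORITY <pathid> = 1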
 