
High failure rate for SCSI hot-swappables?

Status
Not open for further replies.
Oct 22, 2001
431
US
I'm working with an organization that has roughly 30 HP LPr rackmount servers, and thus 60-70 hot-swappable HP-brand hard drives, a mix of 9GB and 18GB SCSI UW 10K RPM. The 18GBs are listed with a million-hour MTBF, which translates to roughly 114 years. About 15 have failed here in the last three years. My first question: does anyone know of a good (and of course preferably free) drive fault analysis utility to use on them to try to nail down the specific fault? My second question: does anyone have any ideas why the failure rate would be so high?
-Steve
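
For a sense of scale, here is a rough back-of-the-envelope sketch of what a million-hour MTBF would predict for a fleet like this. It assumes a constant failure rate and a fleet of 65 drives (simply the midpoint of the 60-70 quoted above, not a figure from the post); the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days


def expected_failures(drives: int, years: float, mtbf_hours: float) -> float:
    """Expected number of failures, assuming a constant failure rate."""
    return drives * years * HOURS_PER_YEAR / mtbf_hours


def annualized_failure_rate(failures: int, drives: int, years: float) -> float:
    """Observed failures per drive per year."""
    return failures / (drives * years)


def implied_mtbf_hours(failures: int, drives: int, years: float) -> float:
    """MTBF that the observed failures would imply for this fleet."""
    return drives * years * HOURS_PER_YEAR / failures


# Assumed fleet size: 65 drives (midpoint of the quoted 60-70).
mtbf_years = 1_000_000 / HOURS_PER_YEAR            # about 114 years
expected = expected_failures(65, 3, 1_000_000)     # about 1.7 failures
observed_afr = annualized_failure_rate(15, 65, 3)  # about 7.7% per year
implied = implied_mtbf_hours(15, 65, 3)            # about 114,000 hours

print(f"MTBF in years:       {mtbf_years:.0f}")
print(f"Expected failures:   {expected:.1f}")
print(f"Observed AFR:        {observed_afr:.1%}")
print(f"Implied MTBF, hours: {implied:,.0f}")
```

Under those assumptions the rated MTBF predicts fewer than two failures in three years, so 15 is roughly nine times the expected count. That gap is what points toward something systematic (environment, backplane, cabling, firmware or a bad batch) rather than ordinary wear.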
 
FYI - All the MTBF numbers are just estimates by the manufacturers and usually aren't very accurate. Even different drives from different manufacturers (OEMs like Seagate, Quantum, Western Digital) have different life spans. For instance, one Seagate drive model might be a workhorse while a different Seagate drive is troubled with high failures. You didn't indicate which HP drive you are using, so I can't venture an opinion on your particular drive without that info. So.....

My guess on the failure rate is possibly high heat or power loads on the servers. Are the drives in a temp controlled environment? Most of my clients keep their server room at about 60 F. Just a guess.

There is software that will let you know what is wrong with the drives, but if no one there is experienced with this type of software, the data would probably be misinterpreted, and then it would be of no use anyway. Probably not what you wanted to hear, but I hope this helps shed some light on your problem.

Sue Krautbauer
HP Channel Manager
C-Tech Incorporated
Compaq / Dell / HP / IBM Servers and Parts
131 Cheshire Lane, Suite 100, Minnetonka, MN 55305
952.249.6807 (Direct) | 888.551.3633 (Toll-free) | 952.249.6859 (Fax)
suek@c-techonline.com
AOL IM: HPsuek
 
HPSuek,
Thanks for the response. The space is temp/humidity controlled (70 on the temp, don't know what the humidity level stays at). I'm not sure what you mean about a high power load on the server - heavy disk I/O, or that the power supply is getting maxed out?
-Steve
 
Ah, ok. No, I can't see where the power supply would have a high load; there are no additional expansion cards, only the one or two drives, and really nothing beyond a "vanilla" server configuration.
-Steve
 
Ok then, when you get a chance, pull out the hot swap drive and let me know what all the p/n are....both the HP, but more importantly, the OEM model#. Seagates usually start with ST, IBM with two numerals and then an alpha character, etc. It is often listed as the bare#. Maybe that will shed some light on the subject.

Sue Krautbauer
HP Channel Manager
C-Tech Incorporated
suek@c-techonline.com
AOL IM: HPsuek
 
One thing that you might look at is the SCSI backplane that the drives connect to. We had lots of problems here that appeared to be drive related but turned out to be failures in the backplane of the lp2000r servers. One HP tech told me that they had a "silent recall" on exactly that problem and said to pull the drives and look at the left side of the backplane from the front of the server. If you see a version 6, get it replaced.
 
I ran into the same thing and had to get the backplane replaced on about 10 lp2000r's.

Just so you know, the main fan over the memory on the lp2000's was also a silent recall. If the center of the fan is not metal then you have an older fan and can get it replaced by just calling HP. We did this on about 32 servers.

Dana Hallenbeck
 
I just had another drive go bad (now being detected during SCSI detection, but an "inaccessible boot device" STOP error during Win2k OS loading) so I've got a few p/n for you now:

ST318203LC, lot A-01-0021-1, 18GB drive, p/n 9L8006-037, s/n LR746911
ST318203LC, lot A-01-0021-1, 18GB drive, p/n 9L8006-037, s/n LR743292
ST39102LC, lot A-01-9952-7, 9GB drive, p/n 9J8006-047, s/n LJW23119
ST318203LC, lot A-01-0021-2, 18GB drive, p/n 9L8006-037, s/n LR709357
ST318203LC, lot A-01-0016-2, 18GB drive, p/n 9L8006-037, s/n LR558625

And for the servers, all are HP NetServer LPr PIII/550, p/n D9133-60200 / D9133A. Four of the 32 are currently out of the rack where I can see the serial numbers:
US94800001
US93900734
US94700702
US94401179

There's a variety of symptoms (which is what makes this problem so frustrating). I've got one drive that doesn't get detected by the controller at all. Most are either non-system drives that get detected by the controller and by the OS but that the user cannot open, or system drives that simply hang during OS bootup. There's also a 9GB that gets detected as a 4.3GB, and just now one that produces an inaccessible boot device STOP error on boot. At least three servers have developed drive problems, and I think several others did so before I assumed this job.

Regarding the SCSI backplane, can you be more specific as to where I should look for the version 6? The screwiness of this problem could easily be caused by something other than the HDs themselves.

-Steve
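
One quick way to look for a bad batch is to tally the failed drives by model and lot. A minimal sketch using only the five drives listed above (the tuple layout and variable names are just for illustration):

```python
from collections import Counter

# (model, lot, serial) for the failed drives listed in the post above.
failed = [
    ("ST318203LC", "A-01-0021-1", "LR746911"),
    ("ST318203LC", "A-01-0021-1", "LR743292"),
    ("ST39102LC", "A-01-9952-7", "LJW23119"),
    ("ST318203LC", "A-01-0021-2", "LR709357"),
    ("ST318203LC", "A-01-0016-2", "LR558625"),
]

by_model = Counter(model for model, _lot, _serial in failed)
by_lot = Counter(lot for _model, lot, _serial in failed)

for model, count in by_model.most_common():
    print(f"{model}: {count}")
for lot, count in by_lot.most_common():
    print(f"{lot}: {count}")
```

Four of the five are ST318203LC, and two share lot A-01-0021-1. With only five samples and no count of how many of each model are installed, that is a hint at most. The tally would need to be divided by the installed population of each model before it says anything about relative failure rates.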
 
Something else you may want to look at is the cables. I noticed with the lp2000r's that some of my drive problems were the result of a single strand of the SCSI 3 cable being severed. There is a lot of sheet metal and sharp edges in the HP server boxes...
Take a close look at the cables, or just try new ones to see if you get the same results. I have had it happen on brand new machines...just out of the box. The 9 Gig drive showing up as a 4 Gig sounds like a cable problem.

Dana Hallenbeck
 
Dana,
Can you tell me what to look for as far as a backplane serial or part number? I ran all the drives above through a single server, so I'd doubt that it's an ongoing problem. To me, the backplane sounds like more of a culprit.
-Steve
 
Hi,

We have deployed about 400 LP2kR's over the past 18 months and have been through all the pain that people here are describing.

A quick summary of what happened with us:

1) Backplane problems - HP identified a problem with one of the components on the LP2000R backplanes. This problem caused false reports of failures and occasional container corruption (if using NetRAID-4M cards). The 'dodgy' backplanes can be identified by looking through the chassis at the long serial number; the faulty backplanes will have a '-01-' just after the 1st number.

2) SCSI cabling - The original LP2000r's had incorrect 'flat' scsi cabling between the two backplanes and the controller, HP replaced this with twisted cabling.

3) Cooling Fans - As somebody has mentioned above, there's an issue with the processor cooling fan. This only affected < 1GHz models and not faster machines. If the motor housing on the fan is plastic and not metal, HP will swap it out.

4) NetRAID-4M BIOS - All our LP2000r's came with NetRAID-4M controllers. HP made us aware that BIOS revision 4584 had issues when installed in the LP2000r platform, which meant that false failures were being reported and in extreme cases containers were being lost. Unfortunately, due to how HP did our original build, we had to redeploy the OS when we did the BIOS upgrade, as the Windows drivers and FAST (Flexible Array Storage Tool) must be of matching versions.

5) Faulty drives - There was a batch of 36GB drives released by IBM that had been contaminated during production. HP ran a serial number checking tool against our servers and replaced the faulty drives.

Hope this helps, I think we've seen all the major problems there are with the LP2000r platform but have been lucky to have a great HP account manager.
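
If you have a lot of chassis to check for the backplane issue in point 1, the '-01-' test can be scripted once the serials are written down. This is only a sketch: the serial format is not documented here, so it assumes "the 1st number" means the first run of digits in the serial, and the sample serials in the usage lines are invented.

```python
import re

# Assumption: the suspect marker "-01-" immediately follows the first run
# of digits in the backplane serial. The real serial layout is not known.
SUSPECT_PATTERN = re.compile(r"^\D*\d+-01-")


def is_suspect_backplane(serial: str) -> bool:
    """Return True if the serial has '-01-' right after its first number."""
    return bool(SUSPECT_PATTERN.match(serial.strip()))


# Hypothetical serials, made up purely to exercise the check.
for serial in ["5-01-123456", "5-02-123456", "12-34-01-5"]:
    status = "suspect" if is_suspect_backplane(serial) else "ok"
    print(f"{serial}: {status}")
```

Anything flagged this way should still be confirmed with HP before requesting a swap, since the pattern is a reading of the description above rather than an official rule.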
 
Drive firmware is also suspect...especially on the 9 and 18 gig drives
 
Any serial numbers or version information to check out on the 9's and 18's to determine that?
-Steve
 
Here's a link. You will need to know the product number of the drives (D6019A). You can see the firmware in NetRaid Assist or Ctrl-M by clicking on Drive and then selecting the correct F key for info...I think F3, maybe? I can't remember exactly which F key it is. There is also a good article on mass storage that was released recently that you might want to check out.

Also make sure your BIOS and NetRAID controllers are up to the latest revs. I would also look for any service notes for your model of server. Also, regarding what one person mentioned, "SCSI cabling - The original LP2000r's had incorrect 'flat' scsi cabling between the two backplanes and the controller, HP replaced this with twisted cabling": this wasn't a matter of incorrect cabling. The cable was updated to a redesigned type to eliminate noise problems.
 