vintagedon
IS-IT--Management
For a couple of weeks now, I've been chasing PCIe RAID port resets on my LSI 84016E RAID controller that I cannot seem to solve. The SAN performed great for several weeks, and then began logging regular but erratic Event 129 "Reset to device" errors.
Any time the RAID is under load of any kind, drive activity to the RAID volume locks up for 60 seconds and then resumes, seemingly at random. Its only current function is to serve media files to the XBMC HTPCs around the house via Windows File Sharing; prior to this, it served as an iSCSI mount for VMware ESXi nodes. Total free space is around 40%.
The event log always shows an Event ID 129 with the message "Reset to device, \Device\RaidPort1, was issued" with the provider as megasas. The RAID card logs show NO ERRORS when this happens. The last troubleshooting I did was to unhook the drives from the RAID card and do a surface scan of each one with Hitachi's tool (around 6 hours per drive) and all came back clean.
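To pull just these events out of the System log and line them up against load, something like the following sketch could be used. It assumes the stock Windows `wevtutil` tool and simply wraps it from Python; `build_query` and its parameters are hypothetical names chosen for illustration, not part of any tool.

```python
import subprocess
import sys


def build_query(provider="megasas", event_id=129, max_events=50):
    """Build a wevtutil command line that selects events by provider and ID.

    The function name and defaults are illustrative only.
    """
    xpath = f"*[System[Provider[@Name='{provider}'] and (EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",  # query events from the System log
        f"/q:{xpath}",               # XPath filter on provider and event ID
        "/f:xml",                    # emit each event as XML
        "/rd:true",                  # newest events first
        f"/c:{max_events}",          # cap the number of events returned
        "/e:Events",                 # wrap the output in one root element
    ]


if __name__ == "__main__" and sys.platform == "win32":
    # Only attempt the query on Windows, where wevtutil exists.
    result = subprocess.run(build_query(), capture_output=True, text=True, check=True)
    print(result.stdout)
```

The timestamps in the returned XML can then be compared against when the file copies or benchmarks were running.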
This is a custom built SAN with the following specs:
[ul]
[li]CPU: FX-6100[/li]
[li]Motherboard: ASUS M5A97 (current), MSI 970A-G43 (prior)[/li]
[li]RAM: 32GB DDR3-1600[/li]
[li]RAID Card: LSI 84016E in PCIe x16 slot[/li]
[li]Power Supply: Corsair Professional Series HX 750[/li]
[li]OS Drive: 128GB Crucial M4 SSD[/li]
[li]RAID Drives: 16 x 2TB Hitachi 7200RPM (3Gbps/6Gbps mixed; 14 drives in RAID6, 2 drives in RAID1)[/li]
[li]OS: Win7 Ultimate (current), Server 2008 R2 (prior)[/li]
[/ul]
Drive models:
[ul]
[li]HDS5C302 Deskstar 6Gbps 32MB cache = x4[/li]
[li]HDS72302 Deskstar 6Gbps 64MB cache = x4[/li]
[li]HDS72202 Deskstar 3Gbps 32MB cache = x2[/li]
[li]HUA72202 Ultrastar 3Gbps 64MB cache = x3[/li]
[/ul]
History and Troubleshooting:
[ul]
[li]RAM Tests come back clean[/li]
[li]Drives unhooked from the RAID card and connected directly to the motherboard; all SMART tests come back clean[/li]
[li]Cables swapped on RAID card with new cables[/li]
[li]Motherboard replaced[/li]
[li]RAID card replaced with identical model[/li]
[li]RAID card Firmware updated (both cards)[/li]
[li]Fan attached to heatsink on RAID card for better temperature regulation[/li]
[li]OS Changed from Server2008R2 to Win7 Ultimate[/li]
[li]Power supply tested with a PSU tester and a multimeter. All rails hold steady voltage, even under drive load[/li]
[li]Can replicate error/reset by using CrystalDiskMark3. Lockup/reset SEEMS to happen on the write cycle[/li]
[li]Cannot replicate error/reset using HDTunePro or IOMeter, even allowing them to run 1 hour+[/li]
[li]IOMeter does not cause the error even on write cycles (see the CrystalDiskMark3 entry above)[/li]
[li]Have tried DirectIO and Cached IO on the RAID card[/li]
[li]Have tried NCQ on and off[/li]
[li]Errors happen to both the RAID1 and RAID6 virtual drives, suggesting it's not limited to a single virtual drive or set of physical drives[/li]
[li]RAID card consistency check comes back clean[/li]
[li]RAID card Patrol Read comes back clean[/li]
[li]Chkdsk on both virtual drives comes back clean[/li]
[li]sfc /scannow comes back clean (See above: OS replaced)[/li]
[li]Virus checks come back clean (See above: OS replaced)[/li]
[li]RAID card log shows no errors of any kind: no correctable errors, no other errors, no alarms[/li]
[li]MegaCLI shows no errors or SMART errors[/li]
[/ul]
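Since the stall seems tied to writes, a crude way to catch it outside CrystalDiskMark3 is to time individual flushed writes and record any that take far longer than normal. This is only a minimal sketch, assuming Python is available on the box; `find_write_stalls`, the block size, and the threshold are illustrative choices, and it makes no attempt to reproduce CrystalDiskMark3's actual I/O pattern.

```python
import os
import time


def find_write_stalls(path, block_size=1024 * 1024, blocks=256, threshold_s=5.0):
    """Write `blocks` blocks to `path`, flushing each one to the device.

    Returns a list of (block_index, seconds) for every write that took
    longer than `threshold_s`. All names and defaults are illustrative.
    """
    payload = os.urandom(block_size)  # incompressible data
    stalls = []
    with open(path, "wb") as f:
        for i in range(blocks):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
            elapsed = time.perf_counter() - start
            if elapsed > threshold_s:
                stalls.append((i, elapsed))
    return stalls
```

Called as, say, `find_write_stalls(r"R:\stall_test.bin", blocks=4096)` (the `R:` path is hypothetical), any entry close to 60 seconds would line up with the lockups described above and give a timestamp to compare against the Event 129 entries.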
Full text, including the details tab from the Windows Event Viewer:
Code:
Reset to device, \Device\RaidPort1, was issued.
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="megasas" />
    <EventID Qualifiers="32772">129</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" />
    <EventRecordID>21077</EventRecordID>
    <Channel>System</Channel>
    <Computer>SAN.xxxxxxxx.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F001800010000000000000081000480040000000000000000000000000000000000000000000000000000000000000001000000810004800000000000000000</Binary>
  </EventData>
</Event>
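One thing that can be checked in the event itself is whether the `Binary` blob carries anything beyond the event's own code. The sketch below is a rough check rather than a decoder for the driver's data: it combines `Qualifiers` (32772 = 0x8004) and `EventID` (129 = 0x81) into one 32-bit value and looks for that value, little-endian, in the blob. The blob is copied from the event above and split across lines for readability; `summarize` and the trimmed `EVENT_XML` sample are hypothetical names used only for illustration.

```python
import struct
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Binary field from the event above, split into 16-byte rows.
BINARY = (
    "0F001800010000000000000081000480"
    "04000000000000000000000000000000"
    "00000000000000000000000000000000"
    "01000000810004800000000000000000"
)

# Trimmed copy of the event, keeping only the fields used below.
EVENT_XML = f"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="megasas" />
    <EventID Qualifiers="32772">129</EventID>
  </System>
  <EventData>
    <Data>\\Device\\RaidPort1</Data>
    <Binary>{BINARY}</Binary>
  </EventData>
</Event>"""


def summarize(event_xml):
    """Extract provider and event code, and find the code inside the blob."""
    root = ET.fromstring(event_xml)
    provider = root.find("e:System/e:Provider", NS).get("Name")
    event_id_el = root.find("e:System/e:EventID", NS)
    event_id = int(event_id_el.text)
    qualifiers = int(event_id_el.get("Qualifiers"))
    # Qualifiers in the high 16 bits, EventID in the low 16 bits.
    error_code = (qualifiers << 16) | event_id
    blob = bytes.fromhex(root.find("e:EventData/e:Binary", NS).text)
    needle = struct.pack("<I", error_code)  # little-endian 32-bit value
    # Scan on 4-byte boundaries for the combined code.
    offsets = [i for i in range(0, len(blob) - 3, 4) if blob[i:i + 4] == needle]
    return {"provider": provider, "error_code": error_code, "offsets": offsets}


if __name__ == "__main__":
    info = summarize(EVENT_XML)
    print(info["provider"], hex(info["error_code"]), info["offsets"])
```

The combined value is 0x80040081, and its byte pattern `81000480` is visible in the blob, so at least part of the binary data just repeats the event's own code. That by itself does not point at a failing component.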
So at this point, I'm out of ideas. The only things I haven't replaced are the CPU, the RAM (though it passes a RAM check), the drives, and the power supply. I'm loath to keep replacing parts indiscriminately. Google isn't much help either.
Note: I did break down and order a new power supply this weekend: a 1000W Gold Certified unit. It'll be here Tuesday.
Any ideas that I may have missed?