
Event ID 129: Reset to device, \Device\RaidPort1, MegaSAS (RAID 84016E)


vintagedon (IS-IT--Management)
Oct 27, 2013
For a couple of weeks now, I've been chasing some PCIe RAID port resets issued to my LSI 84016E RAID controller that I cannot seem to solve. For several weeks the SAN performed great, and then it began throwing frequent but erratic Event 129 "Reset to device" errors.

Any time the RAID is under load of any kind (it serves media for my XBMC HTPCs in the house, among other things), drive activity to the RAID volume will lock up for 60 seconds and then resume, seemingly at random. Its only current function is to serve media files out to HTPCs around the house via Windows file sharing; prior to this, it was serving as an iSCSI target for VMware ESXi nodes. Total free space is around 40%.

The event log always shows an Event ID 129 with the message "Reset to device, \Device\RaidPort1, was issued" and megasas as the provider. The RAID card logs show NO ERRORS when this happens. The last troubleshooting I did was to unhook the drives from the RAID card and do a surface scan of each one with Hitachi's tool (around 6 hours per drive); all came back clean.
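
For background (and this is just my understanding of the mechanism, not anything LSI documents for this card): Event 129 is the storage port driver saying an outstanding I/O didn't complete within the disk timeout, so it reset the port. The timeout is commonly 60 seconds, which would line up with every freeze lasting almost exactly a minute. If a custom TimeoutValue is set in the registry, it shows up here (if the query comes back empty, the default is in effect):

Code:
rem Check whether a custom disk I/O timeout is configured (value is in seconds)
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue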

This is a custom built SAN with the following specs:

[ul]
[li]CPU: FX-6100[/li]
[li]Motherboard: ASUS M5A97 (current), MSI 970A-G43 (prior)[/li]
[li]RAM: 32GB DDR3-1600[/li]
[li]RAID Card: LSI 84016E in PCIex16 slot[/li]
[li]Power Supply: Corsair Professional Series HX 750[/li]
[li]OS Drive: 128GB Crucial M4 SSD[/li]
[li]RAID Drives: 16 x 2TB Hitachi 7200RPM (3Gbps/6Gbps mixed w/14 drives in RAID6, 2 drives in RAID1)[/li]
[li]OS: Win7 Ultimate (current) Server2008R2 (prior)[/li]
[/ul]

Drive models:

[ul]
[li]HDS5C302 Deskstar, 6Gbps, 32MB cache = x4[/li]
[li]HDS72302 Deskstar, 6Gbps, 64MB cache = x4[/li]
[li]HDS72202 Deskstar, 3Gbps, 32MB cache = x2[/li]
[li]HUA72202 Ultrastar, 3Gbps, 64MB cache = x3[/li]
[/ul]

History and Troubleshooting:
[ul]
[li]RAM Tests come back clean[/li]
[li]Drives unhooked from RAID and connected directly to motherboard and all SMART tests come back clean[/li]
[li]Cables swapped on RAID card with new cables[/li]
[li]Motherboard replaced[/li]
[li]RAID card replaced with identical model[/li]
[li]RAID card Firmware updated (both cards)[/li]
[li]Fan attached to heatsink on RAID card for better temperature regulation[/li]
[li]OS Changed from Server2008R2 to Win7 Ultimate[/li]
[li]Power supply tested via a tester and multimeter. All rails holding steady voltage, even under drive load[/li]
[li]Can replicate error/reset by using CrystalDiskMark3. Lockup/reset SEEMS to happen on the write cycle[/li]
[li]Cannot replicate error/reset using HDTunePro or IOMeter, even allowing them to run 1 hour+[/li]
[li]IOMeter does not cause the error even on write cycles (see the CrystalDiskMark3 entry above)[/li]
[li]Have tried DirectIO and Cached IO on the RAID card[/li]
[li]Have tried NCQ on and off[/li]
[li]Errors happen to both the RAID1 and RAID6 virtual drives, suggesting it's not limited to a single virtual drive or set of physical drives[/li]
[li]RAID card consistency check comes back clean[/li]
[li]RAID card Read Patrol comes back clean[/li]
[li]Chkdsk on both virtual drives comes back clean[/li]
[li]sfc /scannow comes back clean (See above: OS replaced)[/li]
[li]Virus checks come back clean (See above: OS replaced)[/li]
[li]No errors in RAID card log[/li]
[li]RAID card log shows no correctable errors, or other errors or alarms[/li]
[li]MegaCLI shows no errors or SMART errors (the commands I ran are sketched just below this list)[/li]
[/ul]
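
For anyone wanting to repeat the MegaCLI checks, these are roughly the commands I used (a sketch; adjust the binary name for the 64-bit build and your install path):

Code:
rem Controller state, firmware level, and BBU status
MegaCli -AdpAllInfo -aALL

rem State of the virtual drives (the RAID6 and RAID1 volumes)
MegaCli -LDInfo -Lall -aALL

rem Per-drive media error / other error / predictive failure counts
MegaCli -PDList -aALL

rem Dump the controller's own event log to a text file
MegaCli -AdpEventLog -GetEvents -f events.txt -aALL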

Full text, including the details tab from the Windows Event Viewer:

Code:
Reset to device, \Device\RaidPort1, was issued.

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="megasas" />
    <EventID Qualifiers="32772">129</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" />
    <EventRecordID>21077</EventRecordID>
    <Channel>System</Channel>
    <Computer>SAN.xxxxxxxx.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F001800010000000000000081000480040000000000000000000000000000000000000000000000000000000000000001000000810004800000000000000000</Binary>
  </EventData>
</Event>
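
To see how often this fires, I've just been filtering the System log for the megasas 129s; something like this from an elevated prompt pulls the most recent ones (a sketch, not specific to this card):

Code:
rem Last 20 Event ID 129 entries from the megasas provider, newest first
wevtutil qe System /q:"*[System[Provider[@Name='megasas'] and (EventID=129)]]" /f:text /rd:true /c:20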

So at this point, I'm out of ideas. The only things I haven't replaced are the CPU, the RAM (though it comes back clean on memory tests), the drives, and the power supply. I'm loath to keep replacing parts indiscriminately. Google isn't much help either.

Note: I did break down and order a new power supply this weekend: a 1000W Gold-certified unit. It'll be here Tuesday.

Any ideas that I may have missed?
 
Baffling...
Have you pathpinged from the workstation to the server? You should have very few, if any, lost packets. It's extremely rare for LSI-based adapters to die or malfunction, and yours seems to be functioning, as it does "unfreeze" after a given time. If the adapter were faulty, it would generally freeze completely, cause a reboot, or fail during boot. A number of times I have seen temporary freezes that turned out to be network issues, or software issues on the workstation side.
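
For example, from one of the HTPCs (substitute the server's real name; you want zero or near-zero loss at every hop):

Code:
rem Trace the path and per-hop packet loss from the workstation to the SAN
pathping SAN.xxxxxxxx.local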

As to the power supply, a 750W should do it, but it might be close depending on which video card you have, so an upgrade to the 1000W is a reasonable effort.

Have you tried going down to a bare-bones setup: disconnect the CD-ROM/DVD and any other devices?

Have you swapped in a very basic video card (non-NVIDIA), using the standard Windows video driver?

"Drives unhooked from RAID and connected directly to motherboard and all SMART tests come back clean" From past experience, testing raid drives off LSI based controllers often does not find faults which only show up when they are connected to the raid adapter. That said, if it was the raid controller or drives, the adapter logs should show it, but I have been blessed with a few drives which had issues with the drive electronics which did not show up in the raid adapter logs. Never tried it but I wonder if the Dell diagnostic iso would run on the machine, it tests drives on their LSI based Perc adapters.



 
Although I neglected to mention it, the SAN doesn't have a CD/DVD drive, uses an old Matrox VGA card (just enough to get local video if I want it), and the only other card in it is a quad-port Intel Pro/1000 gigabit NIC (which I have taken out, and it still does the random freeze-ups). So it's very bare-bones to begin with and runs headless 99% of the time. Any diagnostic tools like MemTest86+, I boot off USB.

As for pathpinging, this happens not only over the network, but I can reproduce it locally with benchmarking software (putting the volume under load), so I ruled out network issues early on.

The new power supply will arrive tomorrow, and it's a 1000W, but I have the SAN hooked up to a UPS, and it's not drawing over 500W, even running full tilt. The power supply is a Corsair that doesn't spin the fan unless it's over 50% load, and it rarely spins, so I know it's not drawing much. Still, could be some line fluctuation that I'm not seeing.

It definitely is baffling, as I've replaced the RAID card with an identical model, and it does the same thing (now I have two 16 port raid cards, an 8 port, and a 4 port). My last three thoughts are:

[ol 1]
[li]the power supply[/li]
[li]the OS drive, the Crucial SSD, is running as AHCI, and I'm wondering if it's somehow interfering. Googling doesn't turn up a lot, but there are some reports[/li]
[li]those four HDS5C302 Deskstar 6Gbps are 'Coolspin' variable-speed drives (I had no idea when I stuck them in there ... I found this out a couple of days ago)[/li]
[/ol]

On (3), the drives say 7200RPM in the specs, but when you get into the description, these are variable-speed drives with power-consumption management. I'm wondering if they're somehow spinning down, running slower, or simply choking. Regardless, I ordered four more 7200RPM Ultrastar drives today, which will give everything a fixed 7200RPM spindle.

Anyway, I'm still baffled. The most maddening thing is that there are no errors on the RAID card, and read patrols and consistency checks come back ... a pun if I may ... consistently good. But narrowing it down, there's not much left.

If all else fails, I'm going to create a JBOD disk, copy the 12TB of data off to it, destroy the RAID, recreate it, test the crap out of it, and then move the data back. *shrug*

Another note here is that although the SMART tests came back clean (and those were done by hooking the drives directly to the motherboard SATA ports), remember I also did surface scans straight from the motherboard SATA ports, bypassing the RAID. I'll see if the Dell Diagnostic CD will do anything.
 
You have done your frustrating homework, all right; if I had this issue at a client, I would not have any hair left...

What gets me is that you're getting the event from the Windows OS, and not from the RAID adapter.

I suppose you did it already, but is the SSD firmware current?

As to the SSDs, I had one go bad; every test I threw at it passed, but I still had issues until it was replaced. You could try pulling a RAID 1 member drive, let it degrade, test, see if the issue happens, then reinstall it, let it rebuild, and do the same to the other RAID 1 member. It's odd that it happens to both arrays: if it were some issue with the "CoolSpin" drives or a firmware issue on the RAID 6, it should not show up on the RAID 1, which kind of points to the OS, or to the RAID 1 drives, as something that would cause the issue on both arrays.

I wonder, if you ran Process Explorer, would you find anything related to the issue?

 
Technome:

Thanks for the continued ideas. And I'm *so* glad this isn't a client, and is rather my home solution. I'd probably have bitten the bullet and rebuilt the whole SAN at this point. Definitely getting some grey hairs with this.

The SSD firmware is an interesting note, and I hadn't considered it. Since the power supply arrives tonight, I'll replace it first, do some testing, and then, if that doesn't solve it, move on to flashing the SSD firmware. Another possibility might be changing the SSD to IDE mode instead of AHCI.

RAID1: I've done better than pulling the RAID1 drives; I've pulled the RAID1 volume altogether, since the drives are just mirrored and can be read normally. Unfortunately, same thing. Sigh.

RAID Rebuilds: I actually did this on the RAID6 when I pulled the one 5400RPM drive that had snuck into the array a few days back (stupid Samsung drive that didn't have the RPM on the label), thinking that was it. I replaced it with a Hitachi Ultrastar, and that hasn't seemed to fix it.

Process Explorer: Checked with this, and I can't see anything around the time of the device reset, although I could be missing it. Seems weird though that it would persist even though I moved from Server2008R2 to Win7Ultimate (trying to rule out an OS issue). This re-install of the OS is only a couple of weeks old and is very sparse. I haven't added much.

Last night while watching movies via XBMC (the XBMC boxes pull video from the SAN), it happened 3 times in a 10-minute period (video freezes for 60 seconds, then restarts; I check the error log and see the same "Reset to device" error), and then it didn't happen again all night. Never the same video, never the same spot.

So this is my current plan, and unfortunately it's just an elimination list (if one item doesn't help, move to the next):

[ol 1]
[li]Replace the power supply tonight with the new 1000w. Benchmark the array to try to replicate the device reset (tonight)[/li]
[li]Update the firmware on the SSD. Benchmark to replicate again (tonight)[/li]
[li]Backup the OS with Acronis for easy rollback and reinstall Windows with the SSD set to IDE instead of AHCI. Benchmark to replicate (tonight)[/li]
[li]Swap out the SSD with a replacement, image the OS to it from the backup in #3. Benchmark to replicate (tomorrow)[/li]
[li]A drive at a time, replace the Coolspin drives with Ultrastar fixed 7200RPM drives, letting the RAID rebuild each time. Benchmark again at each drive replacement (this week).[/li]
[li]Move all the data off the RAID to a JBOD array, destroy the RAID, re-create and benchmark to try to replicate, move data back.[/li]
[li]Flush SAN down the toilet[/li]
[/ol]

 
I thought maybe there was a compatibility issue between the SSD model and the RAID adapter, considering there have been many between SSDs and RAID cards. I checked Google, and there seem to be no issues with this combo, aside from older firmware on the drives causing problems.

"Flush SAN down the toilet"
Do not do that; then you will have to call in a sewer guy to find the blockage... and with your luck it will cause a baffling blockage that can't be found, making you replace the entire sewer line, trees, lawn, and sidewalk.
This issue is becoming your new (unpaid/costly) career; you're supposed to save those careers until you retire. I have a friend who takes offending hardware upstate and performs capital punishment with multiple calibers; buckshot does wonders.

Please report the results.


........................................
Chernobyl disaster..a must see pictorial

"Computers in the future may weigh no more than 1.5 tons."
Popular Mechanics, 1949
 
So, it's late, and I'm tired, but I thought I'd give an update.

New power supply arrived. Replaced, benchmarked, same issue.
Image backup of the system, then set SATA to IDE on the motherboard, and re-installed Windows 7. Tested fresh out of the box ... same issue.

Interestingly enough, the error changed, although the source (megasas) is the same: "Reset to device, \Device\RaidPort2, was issued". Not sure why it incremented by one (it's late, my brain isn't working). I wonder if this is actually port 2 on the controller, a drive on port 2, or the device itself?

Tomorrow, just for fun, I'm going to try installing Windows to a SATA drive and take the SSD completely out of the equation.

Long Term Plan
The four new drives arrive tomorrow (the 31st). The plan is to swap out the 'Coolspin' drives one at a time, allowing the RAID to rebuild each time (at a 75% rebuild rate, it only takes ~11 hours). Since it's RAID6 I could do two at a time, but I'm a little worried since I'm having issues anyway. I'll replace each one with an Ultrastar, which will leave me with all fixed 7200RPM drives. I don't know that the Coolspin drives are causing the issues, but *shrug*

Finally, over the weekend, I intend to create a 12TB JBOD and RichCopy everything to it, destroy the RAID volume, recreate it, benchmark, and see if that fixes it. If not, I'll sell the 84016Es on eBay and get a 9650SE-16 or maybe even an Adaptec 71605E.
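
If RichCopy gives me trouble on a copy that size, I'll probably fall back to something like this (drive letters are placeholders; the log goes to a file so I can verify nothing was skipped):

Code:
rem Copy the whole media volume to the JBOD, multithreaded, with minimal retry delays
robocopy D:\ E:\ /E /COPY:DAT /DCOPY:T /R:1 /W:1 /MT:8 /LOG:C:\jbod-copy.log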
 
"I wonder if this is actually Port 2 on the controller"
As the card was replaced and the same issue occurs, it would be astronomically unlikely for a second card to have the same fault; again, the card appears to be functioning as it should, with the freeze being traffic congestion: the card is waiting on whatever is causing the bottleneck.

"or the device itself?"
What make/model?



 
Just an update, as this seems to be fixed.

This week, I spent most of the week swapping out the few "Coolspin" Hitachis in the array for Hitachi Ultrastars, making all 16 drives fixed-speed 7200RPM. This did not fix the timeout problems.

At my wits' end, I took the drives I had removed, made a large JBOD, and copied all the SAN data out to it, and was preparing tonight to destroy the array, test it thoroughly, and then move everything back, assuming there was some underlying issue with the volume.

However, last week I won some SFF-8087 to SATA breakout cables on eBay for like $2 each. Considering they are normally expensive, I always try to pick up a spare set when I catch them on auction at a decent price. On a whim (although I have replaced the RAID cables *twice* already), I swapped out the cables again for these new ones, because they were sleeved, heat-wrapped, and had latches on the SATA ends, so everything was cleaner in the case.

The problem has disappeared, and no matter what I throw at the array I can't replicate the timeouts any more.

It appears that I had a bad RAID cable in BOTH of the sets of cables I purchased for this SAN. What are the odds of two sets of 4 breakout cables having an issue? LOL. Of course, there's no real way to test cables except to throw them into a system and put them under load.
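
One thing I wish I'd tried earlier: I believe some MegaCLI builds can dump per-PHY error counters, which might catch a marginal cable without a full load test (I haven't verified the option against this card's firmware, so treat it as a maybe):

Code:
rem Per-PHY error counters (invalid DWORDs, disparity errors), if the MegaCLI build supports it
MegaCli -PhyErrorCounters -aALL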

Too soon to tell, but it's promising. I'll continue to test the array over the next week, but I may have found the issue.

As Always,
Don
 
"What are the odds of two sets of 4 breakout cables having an issue? "
Wow, let's hope this does it!

 
Unfortunately, after 2 days of no issues, the problem has returned.

I'm at a loss, so my last step is to destroy the RAID, recreate it, and then copy everything back from the JBOD.

At this point, I have no idea what the problem could even be.
 
Another update, as I had somewhat of a possible epiphany.

As I stated, I had already moved all the data off to a JBOD array and was planning on destroying the virtual disk and doing some extended testing, which I did tonight. Not only did re-creating the virtual drive have no effect, I was even able to replicate the Event 129s with a virtual drive made from a single hard drive in RAID0, basically a pass-through drive. MBR, GPT, stripe sizes, RAID cache: nothing had any effect or got rid of the Event 129s.

Going back over all the troubleshooting, I thought: the only thing I haven't replaced is the CPU. What if the problem is some odd incompatibility between the RAID card and the AMD chipset? After backing up the SAN, I reinstalled Windows really quickly and didn't install the chipset drivers. Still the 129s. But I knew Windows would install its own chipset drivers, so this didn't mean much.

Well, I happened to have a little mATX motherboard with an i3-2100 sitting around, so I swapped this in, and sure enough ... not only could I not replicate the 129s, even after running CrystalDiskMark in a loop on sequential (a "sure fire" way to trigger the 129s) for an hour, but my RAID throughput jumped about 30%.

Hesitant to say "this might be it", I'm still encouraged. If nothing else, I can truly say, at this point, that I have swapped EVERYTHING.

Copying everything back to the new virtual drive, and will report back for more in the RAID CARD SAGA.

At least I get points for being tenacious.

VintageDon
 
Donald...

"At least I get points for being tenacious."
I gave you a star; it will not make up for all the work, but at least it is something. [dazed]

Check this out as to power management settings....
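
In particular I would rule out PCI Express link-state power management and make sure the plan isn't throttling anything. A sketch of what I'd set (the aliases below should exist on a stock Win7 install; confirm with powercfg /aliases):

Code:
rem Switch to the High performance plan
powercfg /setactive SCHEME_MIN

rem Turn off PCI Express Link State Power Management (AC) on the active plan, then re-apply it
powercfg /setacvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setactive SCHEME_CURRENT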

 