If you have multiple SSA adapters, or multiple hosts, accessing the same drawer, make sure the SSA diags run from cron do not run at the same time; offset them by, say, 5 minutes. This is the sort of false error you get if you run multiple instances of diags on the same loop at the same time.
If you can rule that out, use the link speed SSA service aid to find the affected disk, or count 12 devices from port A1 on the card logging the error to find the suspect link.
There is a bypass card between every 4 disks, so the problem could be a disk, the bypass card, or the cables from that bypass card if another adapter / host is connected at that point.
No, sorry, my last post assumed (obviously wrongly) that your SSA was in a 7133 drawer.
D6PAA is a link speed problem (link running at 20MBps rather than 40MBps). This could be caused by a bad disk but is more likely to be a cable or disk seating / noise problem.
As you have a 7025 you will have 6-pack backplanes; the slow link is 12 disks from the A1 port on the adapter that reported the problem.
In 7025 machines people tend to have the internal (first) backplane as SCSI for the rootvg, then SSA on the other 6-packs.
So you probably have 2 full backplanes and then the SSA loop goes back to the adapter (port A2).
If you have 2 adapters connected to the disks it is probably an SSA cron diag problem (see my last post: separate them by 5 minutes). If you only have the one adapter then the disks were probably just busy when the diags ran. They normally run every hour; check the cron. Do you get the error every hour, or at least every time the SSA diags run?
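As a quick check (assuming the SSA diag entries are in root's crontab, which is where a default install puts them), list the schedules and compare them with the timestamps of the errors:
crontab -l | grep -i ssa       # when do the SSA diag jobs run?
errpt -N ssa0 | head           # recent error-log entries against the adapter; compare the timestamps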
If link verification works but you keep getting the D6012 errors, reseat the disk that is 12 hops from A1 and / or reseat the cable after that disk. Noise on the loop can cause these errors, so reseating the disk / cable may fix it.
Make sure your SSA filesets, adapter firmware, and disk firmware are up to date; the newer code resolves some problems the old code suffered from.
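To see what levels you are on now, something like this works (a sketch; lslpp and lscfg are standard, but the fileset pattern and device names are just the usual ones, so adjust to match your box):
lslpp -l "devices.ssa*"        # SSA driver / diagnostic fileset levels
lscfg -vl ssa0                 # adapter VPD, including the ROS (microcode) level
lscfg -vl pdisk0               # disk VPD, including the disk microcode level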
Adapter code:
#SSA warning : Deleting the next two lines may cause errors in redundant
# SSA warning : hardware to go undetected.
01 5 * * * /usr/lpp/diagnostics/bin/run_ssa_ela 1>/dev/null 2>/dev/null
0 * * * * /usr/lpp/diagnostics/bin/run_ssa_healthcheck 1>/dev/null 2>/dev/null
# SSA warning : Deleting the next line may allow enclosure hardware errors to go undetected
30 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null
# SSA warning : Deleting the next line may allow link speed exceptions to go undetected
30 4 * * * /usr/lpp/diagnostics/bin/run_ssa_link_speed 1>/dev/null 2>/dev/null
#
From this I see the enclosure healthcheck runs at 30 minutes past every hour and the link speed check runs at 4:30 am, so whenever the speed check runs the enclosure healthcheck is trying to run at the same time. This may be the cause of the problem. If your SSA disks are all internal to the 6F1 I'm not sure you need to run the enclosure healthcheck (because you don't have an enclosure), but it may still serve some purpose, so I'd move the time rather than commenting (hashing / pounding / #ing) it out. The cron entries seem to be the default, but the enclosure healthcheck could be the issue if it is taking longer than normal because of the lack of an enclosure.
Try moving one or the other by 5 minutes (change one of the 30s to 35).
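For example, editing root's crontab (crontab -e) and moving just the enclosure healthcheck gives:
# before
30 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null
# after - offset by 5 minutes so it no longer collides with the 4:30 link speed check
35 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null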
What does the Link Speed Service Aid show (diag, task selection, SSA Service Aids, Link Speed, ssa0)? If it shows 40 for all links you do not have a real / solid problem.
The error message generated every day at 4:30 am is gone, but the 5:01 am message is still there. That is when the run_ssa_ela cron job runs. What is the use of this job?
Ah, so you changed the cron timing and now only get the error at 5:01. So it probably was a cron / SSA diag / timing problem.
At 5:01 the cron runs ela: error log analysis. This does not report real-time errors; it just 'reminds' you of old ones. It analyses the error report and reminds you if there are any old errors that you might still have to deal with.
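Before you clear anything you can see exactly which old entries ela keeps picking up. A minimal check (the resource names are from your listing; I believe errpt takes a comma-separated list with -N, but check the man page on your level):
errpt | head -20                    # summary of the most recent error-log entries
errpt -a -N ssa0,pdisk8,pdisk10     # full detail for the adapter and the suspect disks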
So the next step is to 'remove' the old errors and then check that you don't get any more reported.
There 'should' be two ways to do this...
1. Clear the error report: this, I am sure, will work. Copy the error report somewhere safe, then clear it with:
errclear 0
(To copy the error report, just in case you need to refer to it in the future, use:
errpt -a > errpt.old
or similar. There is a combined copy-then-clear example after option 2 below.)
2. Log a repair action. This should work OK, but if AIX or the SSA filesets are very old it may not. To do this run diag, advanced diags, system verification, and select all SSA resources (ssa0, ssa1, enclosure0, etc.) by highlighting them and hitting Enter (a plus '+' sign appears on the left), then hit F7 to commit. When it is finished and has reported all the old problems it should give you the option to log a repair action. If it does not (early 4.3.3 or earlier), hit F10 to exit, then run diag, task selection, log a repair action, and select all SSA resources (highlight and hit Enter, then F7 to commit). This will add an errpt entry called REPLACED_FRU for each resource, and from that point on ela should ignore all 'old' errors.
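If you go with option 1, a minimal sketch of the copy-then-clear sequence (the output file name is just an example) is:
errpt -a > /tmp/errpt.before_clear.$(date +%Y%m%d)   # keep a full copy of the old error log
errclear 0                                           # then clear every entry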
Good luck, let me / us know how you get on.
I have tried to adjust the time of the crontab line and the 5:01 am job is still giving me the error. The confusing thing is that it now gives me a pdisk error, but it does not point to the right pdisk if we refer to the previous D6012 error. It should be pdisk8. Please see the error and link verification output below. All disks are still in a good state.
tnx.
glhostprddisk6 D61407C8 0 11 Good
glhostprddisk1 94D4EAC3 1 10 Good
glhostprddisk9 D61457DD 2 9 Good
glhostprddisk0 AA74FC1C 3 8 Good
glhostprddisk11 D6151891 4 7 Good
glhostprddisk4 D612F46D 5 6 Good
glhostprddisk3 D612EA69 6 5 Good
glhostprddisk7 D6145061 7 4 Good
glhostprddisk5 D6137370 8 3 Good
glhostprddisk2 94D4F153 9 2 Good
glhostprddisk10 D6147F92 10 1 Good
glhostprddisk8 D61457BE 11 0 Good
New errors:
Message 1:
From root Thu Mar 24 05:15:04 2005
Date: Thu, 24 Mar 2005 05:15:03 +1000
From: root
To: ssa_adm
Subject: ssa0
Thu Mar 24 05:15:03 EET 2005
Error Log Analysis has detected error(s) that may require your attention.
ssa0 SRN 49000 IBM SSA 160 SerialRAID Adapter (14109100)
Message 2:
From root Thu Mar 24 05:15:05 2005
Date: Thu, 24 Mar 2005 05:15:05 +1000
From: root
To: ssa_adm
Subject: pdisk10
Thu Mar 24 05:15:05 EET 2005
Error Log Analysis has detected error(s) that may require your attention.
pdisk10 SRN 31000 SSA160 Physical Disk Drive
49000 Description: A RAID array is in the Degraded state because a disk drive is not available to the array, and a write command has been sent to that array.
Action: Refer to the SSA Adapters: User's Guide and Maintenance Information.
It could be that a disk was taken out of the system without the required actions being performed. This could be an old error that you don't need to worry about.
Are any of the disks part of a RAID array?
31000 Description: The disk drive has been reset by the adapter. The disk drive might be going to fail.
Action: Refer to the user's or service guide for the unit containing the disk drive.
31000 can be caused by all sorts of things, but it is not necessarily a real error.
The hop count of 12 (D6012) is the link between pdisk10 and pdisk8: hop 1 is from the adapter to pdisk6, and so on, so hop 12 is between pdisk10 and pdisk8.
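If you want to double-check the hop counts outside the diag menus, the ssaconn command reports the hop count from each adapter port to a given pdisk (a sketch; verify the syntax against your level of the SSA filesets):
ssaconn -l pdisk8 -a ssa0      # hop counts from ports A1 / A2 (and B1 / B2) to pdisk8
ssaconn -l pdisk10 -a ssa0     # pdisk10 should be one hop nearer A1 than pdisk8
ssaxlate -l pdisk8             # translate the pdisk to the hdisk AIX is using, if you need it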
At least you have sorted out the cron jobs so you should be seeing real errors now.
Have you updated the filesets and firmware? If not, you will never stand a good chance of finding where the problem really is. The old filesets and firmware were not able to pinpoint errors, so you could be chasing a "false" error for ages.
If you can shut down the system, then (after getting the filesets and firmware up to date) remove and replace pdisk8 and pdisk10 a couple of times to make sure they are making a good connection with the backplane. You should then either have a solid disk fault or the errors will go away.
As always, take a couple of backups first ;-)
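For the backups, a minimal sketch (the tape device and volume group names are examples, change them to suit):
mksysb -i /dev/rmt0            # bootable image backup of rootvg
savevg -if /dev/rmt0 datavg    # backup of the volume group that lives on the SSA disks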
SSA problems are not easy to diagnose over a forum, but I don't think your data is at risk, so far... take a backup anyway.
You MUST update the filesets and firmware to stand a chance of finding a loop problem. See the above links for 4.3.3 to 5.3 updates. If your system is pre-4.3.3 you have a problem and must contact IBM for older filesets. The SSA stuff is not an RS6k / pSeries product, so you will need to contact the storage / SSA support people in IBM to get help on pre-AIX 4.3.3 problems.
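Once you have the updates downloaded, applying the filesets is the usual installp job (the directory is just an example):
installp -agXd /tmp/ssa_updates all    # apply everything in the update directory, with prereqs
lslpp -l "devices.ssa*"                # confirm the new levels afterwards
The adapter and disk microcode itself is normally loaded afterwards through diag (task selection, Download Microcode), if I remember right, not through installp.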