If you have multiple SSA adapters, or multiple hosts, accessing the same drawer, make sure the SSA diags run from cron do not run at the same time; offset them by, say, 5 minutes. This is the sort of false error you get if you run multiple instances of diags on the same loop at the same time.
If you can rule that out, use the link speed SSA service aid to find the affected disk, or count 12 devices from port A1 on the card logging the error to find the suspect link.
There is a bypass card between every 4 disks, so the problem could be a disk, the bypass card, or the cables from that bypass card if another adapter / host is connected at that point.
No, sorry, my last post assumed (obviously wrongly) that your SSA was in a 7133 drawer.
D6PAA is a link speed problem (link running at 20MBps rather than 40MBps). This could be caused by a bad disk but is more likely to be a cable or disk seating / noise problem.
As you have a 7025 you will have 6-pack backplanes; the slow link is 12 disks from the A1 port on the adapter that reported the problem.
In 7025 machines people tend to have the internal (first) backplane as SCSI for the rootvg, then SSA on the other 6-packs.
So you probably have 2 full backplanes and then the SSA loop goes back to the adapter (port A2).
If you have 2 adapters connected to the disks it is probably an SSA cron diag problem (see my last post: separate them by 5 minutes). If you only have the one adapter then the disks were probably just busy when the diags ran. They normally run every hour; check the cron. Do you get the error every hour, or at least every time the SSA diags run?
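As a quick check (assuming the SSA diag entries are in root's crontab, which is where a default install puts them), list the schedules and compare them with the timestamps of the errors:
crontab -l | grep -i ssa       # when do the SSA diag jobs run?
errpt -N ssa0 | head           # recent error-log entries against the adapter; compare the timestamps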
If link verification works but you keep getting the D6012 errors, reseat the disk that is 12 hops from A1 and / or reseat the cable after that disk. Noise on the loop can cause these errors, so reseating the disk / cable may fix it.
Make sure your SSA filesets, adapter firmware, and disk firmware are up to date; the newer code resolves some problems the old code suffered from.
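To see what levels you are on now, something like this works (a sketch; lslpp and lscfg are standard, but the fileset pattern and device names are just the usual ones, so adjust to match your box):
lslpp -l "devices.ssa*"        # SSA driver / diagnostic fileset levels
lscfg -vl ssa0                 # adapter VPD, including the ROS (microcode) level
lscfg -vl pdisk0               # disk VPD, including the disk microcode level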
Adapter code:
#SSA warning : Deleting the next two lines may cause errors in redundant
# SSA warning : hardware to go undetected.
01 5 * * * /usr/lpp/diagnostics/bin/run_ssa_ela 1>/dev/null 2>/dev/null
0 * * * * /usr/lpp/diagnostics/bin/run_ssa_healthcheck 1>/dev/null 2>/dev/null
# SSA warning : Deleting the next line may allow enclosure hardware errors to go undetected
30 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null
# SSA warning : Deleting the next line may allow link speed exceptions to go undetected
30 4 * * * /usr/lpp/diagnostics/bin/run_ssa_link_speed 1>/dev/null 2>/dev/null
#
From this I see the enclosure healthcheck runs at 30 minutes past every hour and the link speed check runs at 4:30 am, so whenever the speed check runs the enclosure healthcheck is trying to run at the same time. This may be the cause of the problem. If your SSA disks are all internal to the 6F1 I'm not sure you need to run the enclosure healthcheck (because you don't have an enclosure), but it may still serve some purpose, so I'd move the time rather than commenting (hashing / pounding / #ing) it out. The cron entries seem to be the default, but the enclosure healthcheck could be the issue if it is taking longer than normal because of the lack of an enclosure.
Try moving one or the other by 5 minutes (change one of the 30s to 35).
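For example, editing root's crontab (crontab -e) and moving just the enclosure healthcheck gives:
# before
30 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null
# after - offset by 5 minutes so it no longer collides with the 4:30 link speed check
35 * * * * /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null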
What does the Link Speed Service Aid show (diag, task selection, SSA Service Aids, Link Speed, ssa0)? If it shows 40 for all links you do not have a real / solid problem.
The error message generated every day at 4:30 am is gone, but the 5:01 am message is still there. That is when the run_ssa_ela cron job runs. What is the use of this job?
Ah, so you changed the cron timing and now only get the error at 5:01. So it probably was a cron / SSA diag / timing problem.
At 5:01 the cron runs ela: error log analysis. This does not report real-time errors; it just 'reminds' you of old ones. It analyses the error report and reminds you if there are any old errors that you might still have to deal with.
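Before you clear anything you can see exactly which old entries ela keeps picking up. A minimal check (the resource names are from your listing; I believe errpt takes a comma-separated list with -N, but check the man page on your level):
errpt | head -20                    # summary of the most recent error-log entries
errpt -a -N ssa0,pdisk8,pdisk10     # full detail for the adapter and the suspect disks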
So the next step is to 'remove' the old errors and then check that you don't get any more reported.
There 'should' be two ways to do this...
1. Clear the error report: this, I am sure, will work. Copy the error report somewhere safe, then clear it with:
errclear 0
(To copy the error report, just in case you need to refer to it in the future, use:
errpt -a > errpt.old
or similar. There is a combined copy-then-clear example after option 2 below.)
2. Log a repair action. This should work OK, but if AIX or the SSA filesets are very old it may not. To do this run diag, advanced diags, system verification, and select all SSA resources (ssa0, ssa1, enclosure0, etc.) by highlighting them and hitting Enter (a plus '+' sign appears on the left), then hit F7 to commit. When it is finished and has reported all the old problems it should give you the option to log a repair action. If it does not (early 4.3.3 or earlier), hit F10 to exit, then run diag, task selection, log a repair action, and select all SSA resources (highlight and hit Enter, then F7 to commit). This will add an errpt entry called REPLACED_FRU for each resource, and from that point on ela should ignore all 'old' errors.
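If you go with option 1, a minimal sketch of the copy-then-clear sequence (the output file name is just an example) is:
errpt -a > /tmp/errpt.before_clear.$(date +%Y%m%d)   # keep a full copy of the old error log
errclear 0                                           # then clear every entry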
Good luck, let me / us know how you get on.
I have tried to adjust the time of the crontab line and the 5:01 am job is still giving me the error. The confusing thing is that it now gives me a pdisk error, but it does not point to the right pdisk if we refer to the previous D6012 error. It should be pdisk8. Please see the error and link verification output below. All disks are still in a good state.
tnx.
glhostprddisk6 D61407C8 0 11 Good
glhostprddisk1 94D4EAC3 1 10 Good
glhostprddisk9 D61457DD 2 9 Good
glhostprddisk0 AA74FC1C 3 8 Good
glhostprddisk11 D6151891 4 7 Good
glhostprddisk4 D612F46D 5 6 Good
glhostprddisk3 D612EA69 6 5 Good
glhostprddisk7 D6145061 7 4 Good
glhostprddisk5 D6137370 8 3 Good
glhostprddisk2 94D4F153 9 2 Good
glhostprddisk10 D6147F92 10 1 Good
glhostprddisk8 D61457BE 11 0 Good
New errors:
Message 1:
From root Thu Mar 24 05:15:04 2005
Date: Thu, 24 Mar 2005 05:15:03 +1000
From: root
To: ssa_adm
Subject: ssa0
Thu Mar 24 05:15:03 EET 2005
Error Log Analysis has detected error(s) that may require your attention.
ssa0 SRN 49000 IBM SSA 160 SerialRAID Adapter (14109100)
Message 2:
From root Thu Mar 24 05:15:05 2005
Date: Thu, 24 Mar 2005 05:15:05 +1000
From: root
To: ssa_adm
Subject: pdisk10
Thu Mar 24 05:15:05 EET 2005
Error Log Analysis has detected error(s) that may require your attention.
pdisk10 SRN 31000 SSA160 Physical Disk Drive
49000 Description: A RAID array is in the Degraded state because a disk drive is not available to the array, and a write command has been sent to that array.
Action: Refer to the SSA Adapters: User's Guide and Maintenance Information.
It could be that a disk was taken out of the system without the required actions being performed. This could be an old error that you don't need to worry about.
Are any of the disks part of a RAID array?
31000 Description: The disk drive has been reset by the adapter. The disk drive might be going to fail.
Action: Refer to the user's or service guide for the unit containing the disk drive.
31000 can be caused by all sorts of things, but it is not necessarily a real error.
The hop count of 12 (D6012) is the link between pdisk10 and pdisk8: hop 1 is from the adapter to pdisk6, and so on, so hop 12 is between pdisk10 and pdisk8.
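If you want to double-check the hop counts outside the diag menus, the ssaconn command reports the hop count from each adapter port to a given pdisk (a sketch; verify the syntax against your level of the SSA filesets):
ssaconn -l pdisk8 -a ssa0      # hop counts from ports A1 / A2 (and B1 / B2) to pdisk8
ssaconn -l pdisk10 -a ssa0     # pdisk10 should be one hop nearer A1 than pdisk8
ssaxlate -l pdisk8             # translate the pdisk to the hdisk AIX is using, if you need it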
At least you have sorted out the cron jobs so you should be seeing real errors now.
Have you updated the filesets and firmware? If not, you will never stand a good chance of finding where the problem really is. The old filesets and firmware were not able to pinpoint errors, so you could be chasing a "false" error for ages.
If you can shut down the system, then (after getting the filesets and firmware up to date) remove and replace pdisk8 and pdisk10 a couple of times to make sure they are making a good connection with the backplane. You should then either have a solid disk fault or the errors will go away.
As always, take a couple of backups first ;-)
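For the backups, a minimal sketch (the tape device and volume group names are examples, change them to suit):
mksysb -i /dev/rmt0            # bootable image backup of rootvg
savevg -if /dev/rmt0 datavg    # backup of the volume group that lives on the SSA disks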
SSA problems are not easy to diagnose over a forum, but I don't think your data is at risk, so far... take a backup anyway.
You MUST update the filesets and firmware to stand a chance of finding a loop problem. See the above links for 4.3.3 to 5.3 updates. If your system is pre-4.3.3 you have a problem and must contact IBM for older filesets. The SSA stuff is not an RS6k / pSeries product, so you will need to contact the storage / SSA support people in IBM to get help on pre-AIX 4.3.3 problems.
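Once you have the updates downloaded, applying the filesets is the usual installp job (the directory is just an example):
installp -agXd /tmp/ssa_updates all    # apply everything in the update directory, with prereqs
lslpp -l "devices.ssa*"                # confirm the new levels afterwards
The adapter and disk microcode itself is normally loaded afterwards through diag (task selection, Download Microcode), if I remember right, not through installp.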