Quorum lost

Status
Not open for further replies.

J1gh2

MIS
Jul 7, 2004
109
GB
Hi Folks

Being new to AIX disk management, I changed a failed SSA disk, but when I added the new disk I didn't add it as the same hdisk that I removed (I thought it didn't matter as long as a new disk was added to the mirror). I am led to believe that, as a result of this, I couldn't see it to enable fast write cache. A few days later we started receiving the SSA_DEGRADED_ERROR from the adapter. I then removed the new disk and re-added it under the same name as the hdisk that had failed originally, but the error is still occurring.

Also, quorum was lost and I need to varyoff and varyon the volume group to re-enable it. However, I can't varyoff because the file systems in the volume group are mounted, and I can't unmount the file systems because they are busy.

Does anyone know how I can enable quorum without having to varyoff the volume group (I have tried running varyonvg but the quorum is still 1)? Is this likely to get rid of the performance problem and the error?

Thanks a lot

Jj
 
changing quorum requires varyoff/varyon of the vg, period.

It should be possible to add the disk back into whatever sort of arrangement you had without altering quorum, but it takes some knowledge of your arrangement (mirrored jbod, some sort of raid array), etc.

If the data was on mirrored disk, it'll take some cleanup of the LVM to get rid of the defunct PV, and the LVs that were on it, but is still totally possible without offlining anything.

If the disk was a member of a raid, it's a piece of cake.
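
For anyone wondering why a two-disk volume group can lose quorum when one particular disk goes away, here is a rough Python sketch of the VGDA majority rule as it is usually described for AIX LVM: two VGDA copies on a single-disk group, two plus one on a two-disk group, one per disk for three or more. The function names are made up for illustration, the layout rule is stated as an assumption, and nothing here queries a real system.

```python
def vgda_layout(num_pvs):
    """VGDA copies per disk, following the commonly documented AIX layout (assumption)."""
    if num_pvs < 1:
        raise ValueError("a volume group needs at least one physical volume")
    if num_pvs == 1:
        return [2]
    if num_pvs == 2:
        return [2, 1]
    return [1] * num_pvs


def has_quorum(layout, missing):
    """True if more than half of all VGDA copies are still reachable.

    layout  -- VGDA copies per disk, e.g. [2, 1]
    missing -- set of disk indexes that cannot be reached
    """
    total = sum(layout)
    reachable = sum(copies for i, copies in enumerate(layout) if i not in missing)
    return 2 * reachable > total


two_disk = vgda_layout(2)
print(has_quorum(two_disk, missing={1}))  # the disk holding one copy is gone
print(has_quorum(two_disk, missing={0}))  # the disk holding two copies is gone
```

On a two-disk group the result depends on which disk is missing, which is why quorum is often switched off for two-disk mirrors.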
 
The SSA_DEGRADED_ERROR indicates that at least one physical disk (pdisk) of a raid array has failed and there is no hot spare available to replace it.


When you say that you didn't add it in as the same hdisk, do you mean pdisk? Or was the disk in use as a system disk (hdisk)?

If the former, you can use the tools (as root) in "smitty ssaraid" to identify the pdisk you add to the drawer, and add it to the failing array. "List/Identify SSA Physical Disks->List Array Candidate Disks" to find the unassigned pdisk, "List All Defined SSA RAID Arrays" to find the degraded array (status in column 4), and "Change/Show Use of an SSA Physical Disk" to change the unassigned pdisk to a hot spare. "List/Identify SSA Physical Disks->List Rejected Array Disks" should tell you which pdisk has failed, causing the degraded array state.

If the latter, it sounds like you have TWO problems now. The "smitty ssaraid" tools above, along with others in the menu, can help you find out at least some of what's going on with regard to the new error.

Can you give a more detailed explanation of your configuration, and the changes you described in your original post?
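
If you save the "List All Defined SSA RAID Arrays" output to a file, the "status in column 4" check can be scripted. Below is a minimal Python sketch, assuming whitespace-separated lines of the form quoted in this thread. Only the meaning of column 4 comes from the post above; the other field names, the choice of "good" as the only healthy state, and the second sample row are my own assumptions for illustration.

```python
def parse_array_line(line):
    """Split one array listing line into named fields (names are assumptions)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"unexpected array line: {line!r}")
    keys = ("name", "address", "use", "state", "size", "type")
    return dict(zip(keys, fields[:6]))


def arrays_needing_attention(listing, healthy=("good",)):
    """Return the arrays whose state (column 4) is not in the healthy set."""
    arrays = [parse_array_line(l) for l in listing.splitlines() if l.strip()]
    return [a for a in arrays if a["state"] not in healthy]


# First row is the line posted in this thread; the second row is hypothetical.
sample = """\
8DAD3DB21EF54CK 8DAD3DB21EF54CK free unknown 0 raid_5
hdisk99 0123456789ABCDE system good 72.8GB raid_5
"""

for array in arrays_needing_attention(sample):
    print(array["name"], array["state"])
```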


Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L

 
Hi Rod

Thanks for your time and suggestions and sorry for the delay in getting back to you to thank you.

I am new to this environment and the outgoing AIX administrator left no documentation so I basically have nothing to go by. I was told that we only use raid 0 but when I look at the "list of defined SSA raid arrays" I see "8DAD3DB21EF54CK 8DAD3DB21EF54CK free unknown 0 raid_5".

This is the raid array on the adapter that is showing the SSA_DEGRADED_ERROR, and this is the state that it's in:

SSA RAID Manager ssa0
SSA RAID Array 8DAD3DB21EF54CK
Connection Address / Array Name 8DAD3DB21EF54CK
RAID Array Type raid_5
State unknown
Size of Array 0
Member Disks 0090D612D45C00D
Percentage Rebuilt Not Rebuilding
Enable Use of Hot Spares no +
Choose Hot Spare only from Preferred Pool no +
Allow Page Splits no +
Current Use Unused

However, when I try to change the current use to System disk and enable use of hot spares, it says:

ssachg: Object 8DAD3DB21EF54CK is broken
ssaraid: Change method failed for 8DAD3DB21EF54CK


As to your question "When you say that you didn't add it in as the same hdisk, do you mean pdisk? or was the disk in use as a system disk (hdisk).", yes, I meant pdisk. I removed pdisk8, which was mapped to hdisk10. However, when I added hdisk10 back, it was mapped to pdisk39, and it's now on the secondary adapter (ssa1 instead of ssa0, which has the problem).

"smitty ssaraid -> List/Identify SSA Physical Disks->List Array Candidate Disks" does not return anything even though pdisk8 is not assigned.

Any further thoughts will be much appreciated.

Thanks very much

Jj

 
That looks pretty broken. The Member Disks entry is normally a list of pdisks. I'll bet that long string is the serial identifier of the removed disk, and I hope that there were pdisks listed off screen to the right (raid 5 requires at least 3 pdisks).

What does "List/Identify SSA Physical Disks->List Disks in an SSA RAID Array" have to say about array 8DAD3DB21EF54CK?



Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L

 
An array of any kind should be showing up with an hdisk name, not the serial number. That sounds like the hdisk lvm entry was deleted, but the array itself wasn't deleted in the SSA array manager.

From the perspective of getting the thing back to being online and correct the fastest, I would seriously suggest backing everything up, then killing everything (delete the raid volume, the pdisks, everything), running cfgmgr, building a new array, and then restoring.

It might be possible to salvage what is there, but, and no offense intended to J1gh2, it's not a job for someone without knowledge and experience, imao.
 
Chapter11,

One point: An array can be defined and not have an hdisk, that's what the Current Use of "Unused" means.

But....

J1gh2,

Chapter11's right about the need for someone more qualified than yourself to work on this. AIX disk management is quite a topic to get up to speed with, even BEFORE you add in SSA RAID management, so if this is a production server with data at risk you should probably hire this job out.

As I said before, this looks pretty broken. It might even require some ODM manipulation to straighten out, which would give us some sort of "Arcane AIX" trifecta. Go ahead and post the output I asked for, but I've really only got one possible fix that would be safe for you to try yourself, and I'm not even going to mention it unless the output looks like I'm hoping it will.

Backing it up, tearing it all down, and rebuilding is probably going to be the best approach, and it's best that you hire someone to do that for you. When (if) you do, be sure it's understood that they are to explain what they're doing as well, so you can get a little further up the learning curve.



Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L

 
Guys

Thanks very much for your help. Now I know the size of the challenge. I am working on a live system so the last thing I want is to take the system down inadvertently.

Rod:
List/Identify SSA Physical Disks->List Disks in an SSA RAID Array shows:

8DAD3DB21EF54CK 8DAD3DB21EF54CK free unknown 0 raid_5

and executing that shows:

pdisk8 0090D612D45C00D member unknown 36.4GB Physical disk


Chapter11:
"If the data was on mirrored disk, it'll take some cleanup of the LVM to get rid of the defunct PV, and the LVs that were on it, but is still totally possible without offlining anything."

Any tips on how I can do this without a great risk of making the situation worse?

Thanks


 
J1gh2,

As I stated before, a raid 5 array requires at least 3 disks. Since you had no Array Candidate Disks, it sounds like there's no hope for this one. Not that anything would be lost. Whatever may have been on pdisk8 has already ceased to be accessible.
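
To make the three-disk minimum concrete, here is a small Python sketch that counts the member disks in a saved "List Disks in an SSA RAID Array" listing and compares the count with that minimum. It assumes whitespace-separated lines shaped like the one posted above; the field names and function names are illustrative only.

```python
RAID5_MIN_MEMBERS = 3  # minimum number of pdisks, as stated in this thread


def parse_member_line(line):
    """Split one disk listing line into name, serial, use and state."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected disk line: {line!r}")
    name, serial, use, state = fields[:4]
    return {"name": name, "serial": serial, "use": use, "state": state}


def raid5_viable(listing):
    """Return (enough_members, members) for a saved array disk listing."""
    disks = [parse_member_line(l) for l in listing.splitlines() if l.strip()]
    members = [d for d in disks if d["use"] == "member"]
    return len(members) >= RAID5_MIN_MEMBERS, members


# The single line posted for array 8DAD3DB21EF54CK.
sample = "pdisk8 0090D612D45C00D member unknown 36.4GB Physical disk"

ok, members = raid5_viable(sample)
print(ok, [m["name"] for m in members])
```

With only pdisk8 listed, the check fails, which matches the conclusion that this array cannot be brought back.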

I'll leave it to Chapter11 to describe the LVM cleanup, if (s)he can. I've been fortunate enough (knock wood) not to have to deal with it so far.



Rod Knowlton
IBM Certified Advanced Technical Expert pSeries and AIX 5L

 
A few more questions:

is pdisk8 the "new" disk that you added into the system? If it is, that's where that raid serial number w/o an hdisk entry is coming from - that disk was a member of a raid5 in its previous use, and wasn't properly wiped prior to you receiving it.

second, you're doing mirroring if I'm reading things right: are you doing this at the LVM level, or as an SSA RAID1 setup?

I guess, to start over mentally, post the following outputs from the relevant volume group:

lsdev -Cc pdisk

lsvg -p ${VG}

lsvg -l ${VG}
 
Thanks Chapter 11

I think you hit the nail right on the head when you said that the serial number of pdisk8 is coming from its previous usage (as a member of a raid 5 array) because, first of all, I understand that our raid should be 0, not 5. I think that's exactly what has happened, because I followed the standard procedure to change the disk....

Anyway, here goes the requested info.

dev2:/>lsdev -Cc pdisk
pdisk0 Available 20-58-7133-12-P SSA160 Physical Disk Drive
pdisk3 Available 20-58-7133-10-P SSA160 Physical Disk Drive
pdisk4 Available 20-58-7133-03-P SSA160 Physical Disk Drive
pdisk5 Available 20-58-7133-02-P SSA160 Physical Disk Drive
pdisk6 Available 20-58-7133-08-P SSA160 Physical Disk Drive
pdisk7 Available 20-58-7133-04-P SSA160 Physical Disk Drive
pdisk9 Available 20-58-7133-05-P SSA160 Physical Disk Drive
pdisk10 Available 20-58-7133-01-P SSA160 Physical Disk Drive
pdisk12 Available 20-58-7133-14-P SSA160 Physical Disk Drive
pdisk13 Available 20-58-7133-15-P SSA160 Physical Disk Drive
pdisk14 Available 20-58-7133-16-P SSA160 Physical Disk Drive
pdisk1 Available 20-58-7133-09-P SSA160 Physical Disk Drive
pdisk19 Available 30-58-7133-04-P SSA160 Physical Disk Drive
pdisk39 Available 30-58-7133-12-P SSA160 Physical Disk Drive
pdisk16 Available 30-58-7133-01-P SSA160 Physical Disk Drive
pdisk17 Available 30-58-7133-02-P SSA160 Physical Disk Drive
pdisk18 Available 30-58-7133-03-P SSA160 Physical Disk Drive
pdisk20 Available 30-58-7133-05-P SSA160 Physical Disk Drive
pdisk22 Available 30-58-7133-07-P SSA160 Physical Disk Drive
pdisk23 Available 30-58-7133-08-P SSA160 Physical Disk Drive
pdisk24 Available 30-58-7133-09-P SSA160 Physical Disk Drive
pdisk25 Available 30-58-7133-10-P SSA160 Physical Disk Drive
pdisk26 Available 30-58-7133-11-P SSA160 Physical Disk Drive
pdisk27 Available 30-58-7133-12-P SSA160 Physical Disk Drive
pdisk60 Available 30-58-7133-06-P SSA160 Physical Disk Drive
pdisk21 Available 30-58-7133-07-P SSA160 Physical Disk Drive
pdisk2 Available 20-58-7133-06-P SSA160 Physical Disk Drive
pdisk28 Available 30-58-7133-06-P SSA160 Physical Disk Drive
pdisk29 Available 30-58-7133-04-P SSA160 Physical Disk Drive
pdisk30 Available 30-58-7133-03-P SSA160 Physical Disk Drive
pdisk31 Available 30-58-7133-13-P SSA160 Physical Disk Drive
pdisk32 Available 30-58-7133-15-P SSA160 Physical Disk Drive
pdisk33 Available 30-58-7133-10-P SSA160 Physical Disk Drive
pdisk34 Available 30-58-7133-01-P SSA160 Physical Disk Drive
pdisk35 Available 30-58-7133-08-P SSA160 Physical Disk Drive
pdisk36 Available 30-58-7133-05-P SSA160 Physical Disk Drive
pdisk37 Available 30-58-7133-14-P SSA160 Physical Disk Drive
pdisk38 Available 30-58-7133-16-P SSA160 Physical Disk Drive
pdisk40 Available 30-58-7133-02-P SSA160 Physical Disk Drive
pdisk41 Available 30-58-7133-11-P SSA160 Physical Disk Drive
pdisk43 Available 30-58-7133-09-P SSA160 Physical Disk Drive
pdisk44 Available 20-58-7133-01-P SSA160 Physical Disk Drive
pdisk45 Available 20-58-7133-02-P SSA160 Physical Disk Drive
pdisk46 Available 20-58-7133-03-P SSA160 Physical Disk Drive
pdisk47 Available 20-58-7133-04-P SSA160 Physical Disk Drive
pdisk48 Available 20-58-7133-05-P SSA160 Physical Disk Drive
pdisk49 Available 20-58-7133-06-P SSA160 Physical Disk Drive
pdisk50 Available 20-58-7133-07-P SSA160 Physical Disk Drive
pdisk51 Available 20-58-7133-08-P SSA160 Physical Disk Drive
pdisk52 Available 20-58-7133-09-P SSA160 Physical Disk Drive
pdisk53 Available 20-58-7133-10-P SSA160 Physical Disk Drive
pdisk54 Available 20-58-7133-11-P SSA160 Physical Disk Drive
pdisk55 Available 20-58-7133-12-P SSA160 Physical Disk Drive
pdisk56 Available 20-58-7133-13-P SSA160 Physical Disk Drive
pdisk57 Available 20-58-7133-14-P SSA160 Physical Disk Drive
pdisk58 Available 20-58-7133-15-P SSA160 Physical Disk Drive
pdisk59 Available 20-58-7133-16-P SSA160 Physical Disk Drive
pdisk11 Available 20-58-7133-13-P SSA160 Physical Disk Drive


dev2:/>lsvg -p vg18
vg18:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
hdisk10 active 543 34 04..00..00..00..30
hdisk41 active 543 34 04..00..00..00..30

Please note that hdisk10=pdisk39 but was previously pdisk8.

dev2:/>lsvg -l vg18
vg18:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
lv1501 jfs 128 256 2 open/syncd /u1501
loglv17 jfslog 1 2 2 open/syncd N/A
lv3501 jfs 128 256 2 open/syncd /u3501
lv5201 jfs 71 142 2 open/syncd /u5201
lv5301 jfs 71 142 2 open/syncd /u5301
lv5401 jfs 71 142 2 open/syncd /u5401
lv5202 jfs 32 64 2 open/syncd /u5202
lv3502 jfs 7 14 2 open/syncd /u3502


Thanks a lot
 
Ok, I think I'm beginning to see what needs to be done:

smitty - devices - ssa raid arrays

select "list all defined ssa raid arrays"

You should see an array that is operating in a degraded state. If not, go down to "CHECKVGS" and we'll have to go down another path.

If there *is* an array that is degraded, do this:

change/show use of an ssa physical disk - you'll need to select which adapter it is connected to, select the new disk (pdisk39), and make it an array candidate.

then, change member disks of an array - add a disk to an array. if you get a dialog that says "there are no items of this type", then all of your ssa arrays are fine.

"CHECKVGS"
For all of your vgs, do an "lsvg -l ${VG}", and look for mirrored LVs that are stale. If every mirrored LV is syncd, then you've got no live data that isn't mirrored. If you *do* have a stale lv somewhere, stop and post the specifics, otherwise proceed to check on hotspares.
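
If there are a lot of volume groups, the stale check can be done on saved "lsvg -l" output rather than by eye. The Python sketch below assumes the column layout shown in the vg18 output earlier in this thread (LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT). It treats PPs divided by LPs as the number of copies, which is my assumption, and the function names are made up.

```python
def parse_lsvg_l(text):
    """Parse saved `lsvg -l` output into one dict per logical volume."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the "vg18:" line and the column header.
        if len(fields) < 7 or fields[:2] == ["LV", "NAME"]:
            continue
        name, lv_type, lps, pps, pvs, state, mount = fields[:7]
        rows.append({
            "name": name,
            "type": lv_type,
            "copies": int(pps) // int(lps),  # assumption: PPs / LPs = copies
            "state": state,
            "mount": mount,
        })
    return rows


def lvs_at_risk(rows):
    """Names of LVs that are stale or have fewer than two copies."""
    return [r["name"] for r in rows if r["copies"] < 2 or "stale" in r["state"]]


SAMPLE = """\
vg18:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
lv1501 jfs 128 256 2 open/syncd /u1501
loglv17 jfslog 1 2 2 open/syncd N/A
lv3501 jfs 128 256 2 open/syncd /u3501
lv5201 jfs 71 142 2 open/syncd /u5201
lv5301 jfs 71 142 2 open/syncd /u5301
lv5401 jfs 71 142 2 open/syncd /u5401
lv5202 jfs 32 64 2 open/syncd /u5202
lv3502 jfs 7 14 2 open/syncd /u3502
"""

print(lvs_at_risk(parse_lsvg_l(SAMPLE)))
```

An empty list means every LV in that output is mirrored and syncd.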

"HOTSPARE"
Now, if you saw a list of arrays earlier (from "list all defined ssa raid arrays"), and they were all "good", the next thing to do is to see whether they are configured to use hot spares: under "change/show characteristics of an array", check the "Enable use of hot spares" option for each array. If any are "no", then stop here, and we'll have to figure out what's next. If any *do* have hot spare use enabled, select "change use of multiple ssa disks" and, for each adapter, check the output to see if you have any hot spare disks. If you have none, I would suspect that your previous disk failure failed over to an existing hot spare, and now all you have to do is make the newly added disk the new hot spare - you can do that the same way you made it an array candidate earlier.



I know there's a lot here, if you get confused or overwhelmed, just post what you've done and we'll go from there.
 
Chapter 11
The one array defined is in an unknown state, as shown below.

8DAD3DB21EF54CK 8DAD3DB21EF54CK free unknown 0 raid_5

I went on to change the use of pdisk39 to Array Candidate Disk, but use of hot spares is set to no and it won't change to yes because the array is reported as broken: "ssachg: Object 8DAD3DB21EF54CK is broken".

Below are the disks that we have to play with at the moment:
# SSA physical disks that are members of arrays.
# Disks in Loop B are:
pdisk15 0090D612D45C00D member good 36.4GB disk
################################################
# SSA physical disks that are hot spares.
# Disks in Loop B are:
pdisk8 00609441BF2100D spare good 18.2GB disk

#################################################
# SSA physical disks that are free.
# Disks in Loop A are:
pdisk39 0040AA7431AB00D free good 36.4GB disk

It's looking more and more like I will have to follow your earlier suggestion to back up, tear down and start again...

Thanks a lot
 
I forgot I should have been more clear - working under the assumption that that broken array is being imported by pdisk39's previous incarnation as a raid5 member, that particular entry should have been ignored.

Without that one entry, you have no raid arrays defined, correct?

Did you do a check to see if you had any stale logical volumes?

I'm beginning to wonder if you've actually got any data at risk right now, and if you're dealing with a spare disk.
 
"Without that one entry, you have no raid arrays defined, correct?"

Yes, you are correct.

"Did you do a check to see if you had any stale logical volumes?"

Yes, as far as I can see, there are no stale logical volumes.

"I'm beginning to wonder if you've actually got any data at risk right now, and if you're dealing with a spare disk."

Well, the vg that I am dealing with right now is usually mirrored and has two disks. I had to break the mirror (unmirrorvg vg -> reducevg) prior to removing the disk. So the vg is currently running on one disk and, of course, the adapter is throwing SSA_DEGRADED_ERROR every hour (even when I add another disk to the vg and mirror it back), so I am just concerned that if the other disk pops, things might get a little hairier for me....

Thanks a lot anyway
 
Aha! You're not using raid mirroring, you're using LVM mirroring.

Change the use of the SSA disk to AIX System Disk, add the disk to the volume group, and do mirrorvg.
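
For reference, the LVM half of that sequence could be written out as below. This is only a dry-run sketch in Python: it builds the command lines and prints them, and runs nothing. The volume group and hdisk names are taken from this thread purely as placeholders, the function name is made up, and the "AIX System Disk" change is left to the smitty menu because I am not going to guess at its command-line form.

```python
def remirror_plan(vg, new_hdisk):
    """Return the LVM commands, in order, as argv lists. Nothing is executed."""
    return [
        ["extendvg", vg, new_hdisk],  # add the disk to the volume group
        ["mirrorvg", vg, new_hdisk],  # mirror the group's LVs onto that disk
        ["lsvg", "-l", vg],           # afterwards, check for stale LVs
    ]


# Placeholder names from this thread; substitute your own before use.
for argv in remirror_plan("vg18", "hdisk10"):
    print(" ".join(argv))
```

Review the printed commands against your own lsvg output before typing any of them as root.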

The SSA_DEGRADED_ERROR is going to point to something else - your SSA loop is probably broken someplace.

Go into "diag", "task selection", "SSA service aids"

The "Link Verification", "Configuration Verification", and "Physical Link Verification" options should be able to point you to where the problem might be. Remember that SSA is a loop, so each disk should have two connection paths.

Example output of link verification:
[tt]
Physical Serial# Adapter Port
A1 A2 B1 B2 Status

icebox:pdisk12 9401EEDA 0 15 Good
icebox:pdisk11 9401E43C 1 14 Good
icebox:pdisk10 9401E3B7 2 13 Good
icebox:pdisk14 94BE4383 3 12 Good
icebox:pdisk13 9401EF1C 4 11 Good
icebox:pdisk9 9401A2BD 5 10 Good
icebox:pdisk8 94016755 6 9 Good
icebox:pdisk7 94012AA1 7 8 Good
icebox:pdisk4 AA1D1549 8 7 Good
icebox:pdisk5 AA1D1552 9 6 Good
icebox:pdisk1 AA1D021B 10 5 Good
icebox:pdisk3 AA1D1533 11 4 Good
icebox:pdisk2 AA1D14E6 12 3 Good
icebox:pdisk6 AA74064C 13 2 Good
icebox:pdisk0 35E5EB02 14 1 Good
icebox:pdisk15 D61364F0 15 0 Good
[/tt]

The physical link verification output will be a little more cryptic, but if you have any lines of question marks appearing, that's a bad thing.
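
If you copy the link verification screen into a text file, the "two connection paths per disk" rule can be checked mechanically. The Python sketch below assumes rows shaped like the example output above (name, serial number, the port hop counts, then the status); the function names are made up, and the last sample row is hypothetical, added only to show what a flagged disk looks like.

```python
def parse_link_verification(text):
    """Parse saved link verification rows into dicts, skipping header lines."""
    disks = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or ":pdisk" not in fields[0]:
            continue  # header, blank line, or something unrecognised
        paths = [int(f) for f in fields[2:-1] if f.isdigit()]
        disks.append({
            "name": fields[0],
            "serial": fields[1],
            "paths": paths,
            "status": fields[-1],
        })
    return disks


def suspect_disks(disks):
    """Disks that lack two connection paths or are not reported as Good."""
    return [d["name"] for d in disks
            if len(d["paths"]) != 2 or d["status"] != "Good"]


SAMPLE = """\
Physical Serial# Adapter Port
A1 A2 B1 B2 Status

icebox:pdisk12 9401EEDA 0 15 Good
icebox:pdisk11 9401E43C 1 14 Good
icebox:pdisk7 94012AA1 7 8 Good
icebox:pdisk99 00000000 3 Good
"""

print(suspect_disks(parse_link_verification(SAMPLE)))
```

Only the hypothetical pdisk99 row, which has a single path, is reported.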
 
Hi Chapter11

I have been here a few times already. Link verification shows that all the disks are good and have two connection paths.

However, I have found that hdisk10 (pdisk39) is attached to ssa1 loop A instead of ssa0 loop B (ssa0 is the adapter with the SSA_DEGRADED_ERROR). So, essentially, both disks are attached to ssa1, as shown below. I have tried several times to get hdisk10 to attach to ssa0 but have yet to succeed. I remove the disk, but once I add it back in and run the configuration manager, the new disk automatically attaches to ssa1.

Disk hdisk10
Disk type hdisk
Disk interface ssa
Description SSA Logical Disk Drive
Status Available
Location 30-58-L
Location Label []
Parent ssar
Size in Megabytes 36446
adapter_a ssa1
adapter_b none
primary_adapter adapter_a +
Connection address 0040AA7431AB00D


........................................................................................................................
Disk hdisk41
Disk type hdisk
Disk interface ssa
Description SSA Logical Disk Drive
Status Available
Location 30-58-L
Location Label []
Parent ssar
Size in Megabytes 36446
adapter_a ssa1
adapter_b none
primary_adapter adapter_a +
Connection address 0040AA74B99200D
Physical volume IDENTIFIER 0008833454c8fe460000000000000000


Perhaps you may know how I can change the adapter that hdisk10 is attached to.

Best regards
 
Unless you're using redundant looping, that's a matter of cabling the drive to the correct adapter.

At this point I'm going to have to recommend getting an experienced admin to get some hands-on time with your machine.
 
OK. I will source further help from management.

I greatly appreciate your help and support.

Thanks very much
 