rootvg SSA disk error

Status
Not open for further replies.
Nov 6, 2001
Hi

I'm getting Temp disk errors on an SSA disk that is part of my rootvg.

rootvg consists of hdisk0 and hdisk4 and rootvg is not mirrored. / , /var, /usr and an application filesystem are on hdisk4. In the SSA drawer I have 2 disks not being used, hdisk5 and hdisk6. Can I add hdisk5 to rootvg, migrate the data from hdisk4 to hdisk5 then reducevg removing hdisk4?

Thanks for any help
 
Why don't you mirror rootvg if you have unused disks? If your OS disk goes bad your server will crash.

To answer your question, sure you can. Temp disk errors are usually not a big deal. When you start taking perm disk errors you need to worry more.

I would mirror rootvg if that server is important.


Jim Hirschauer
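
For reference, the add/migrate/remove sequence asked about above could be sketched roughly as follows. This is only a sketch, assuming hdisk4 is the failing disk and hdisk5 the spare as described in the question. The `run` wrapper and the `DRY_RUN` variable are hypothetical helpers added here so the commands are printed rather than executed by default.

```shell
#!/bin/sh
# Sketch only: move rootvg data off a failing disk onto a spare disk.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

migrate_rootvg_disk() {
    old=$1   # failing disk (hdisk4 in this thread)
    new=$2   # unused spare disk (hdisk5 in this thread)
    run extendvg rootvg "$new" || return 1    # add the spare to rootvg
    run migratepv "$old" "$new" || return 1   # move all partitions off the old disk
    run reducevg rootvg "$old" || return 1    # drop the emptied disk from rootvg
    # Assumption: the boot logical volume (hd5) is NOT on the old disk.
    # If it is, extra boot-image steps (bosboot, bootlist) are needed
    # and are not shown here.
}

migrate_rootvg_disk hdisk4 hdisk5
```

Set DRY_RUN=0 only after checking with lspv and lsvg that the disk names match your system.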
 
Thanks,

That's my other option: mirror rootvg. I just started this job 2 months ago, and don't know why it's not mirrored. I'm just trying to figure out the way with the least impact, since they don't want to shut down the server.
 
Mirroring will probably be the best way to go. Remember to turn off quorum checking when you mirror. Disabling quorum checking on rootvg requires a reboot, but the reboot can be done at your convenience. No need to reboot the server right away.


Jim Hirschauer
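
A rough sketch of the mirroring route, assuming the two spare SSA disks (hdisk5 and hdisk6) are both added to rootvg. The `run` wrapper and `DRY_RUN` variable are hypothetical helpers so that nothing is executed by default. mirrorvg normally disables quorum itself unless -Q is given, so the explicit chvg line is just a belt-and-braces step.

```shell
#!/bin/sh
# Sketch only: mirror rootvg onto spare disks with quorum checking off.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mirror_rootvg() {
    # Arguments: the spare disks that will hold the mirror copies.
    run extendvg rootvg "$@" || return 1   # add the spares to rootvg
    run mirrorvg rootvg "$@" || return 1   # create the second copy of every LV
    run chvg -Qn rootvg || return 1        # quorum off; takes effect for rootvg after a reboot
    run bosboot -a || return 1             # rebuild the boot image
    # Not shown: bootlist -m normal <disks holding a copy of hd5>.
    # Which disks those are depends on where the boot LV copies land.
}

mirror_rootvg hdisk5 hdisk6
```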
 
I would first create a temp VG with a huge FS in it and run a job for a day or so to fill/clean the FS. Disks that have been sitting idle for some time can start behaving badly. No use starting a rootvg mirror with disks of uncertain quality.

If the disks turn out bad during the stress test, it is easier to have them replaced if they are still out of any productive VG...



HTH,

p5wizard
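
One possible shape for such a fill/clean job is sketched below. The VG name, mount point and sizes are made up for illustration, and writing zeros with dd is only one crude way to exercise a disk. Watch errpt for new entries against the spare disks while it runs.

```shell
#!/bin/sh
# Sketch: repeatedly fill and clean a scratch filesystem to exercise idle disks.
# Assumed one-time setup on AIX (hypothetical names, not executed here):
#   mkvg -y tempvg hdisk5 hdisk6
#   crfs -v jfs -g tempvg -m /stress -A no -a size=<number of 512-byte blocks>
#   mount /stress

stress_fs() {
    dir=$1      # mount point of the scratch filesystem
    passes=$2   # number of fill/clean cycles
    blocks=$3   # 1 MiB blocks to write per cycle
    i=1
    while [ "$i" -le "$passes" ]; do
        # Write the file; stop the whole job on the first I/O failure.
        dd if=/dev/zero of="$dir/fill.$$" bs=1048576 count="$blocks" 2>/dev/null || {
            echo "write failed on pass $i" >&2
            rm -f "$dir/fill.$$"
            return 1
        }
        rm -f "$dir/fill.$$"   # clean up so the next pass rewrites the same space
        echo "pass $i ok"
        i=$((i + 1))
    done
}
```

Something like `stress_fs /stress 1000 1024` would write and delete a 1 GB file a thousand times. Pick the numbers to suit the size of the filesystem and how long you want the test to run.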
 
Run errpt -a and paste the stanza for the error you're seeing into this thread, please.
 
The last error we received was on June 3, and all the errors are Temp, but there have been 9 since the middle of April. I will be adding a new disk to the vg today and migrating the data off, so hopefully it will be fixed by the end of the week.

LABEL: SSA_DISK_ERR3
IDENTIFIER: 8BDD5B42

Date/Time: Fri Jun 3 08:17:33
Sequence Number: 116045
Machine Id: 00026197A400
Node Id: openhub
Class: H
Type: TEMP
Resource Name: pdisk1
Resource Class: pdisk
Resource Type: 4000mbC
Location: 00-01-P
VPD:
Manufacturer................IBM
Machine Type and Model......DFHCC4B1
Part Number.................89H4941
ROS Level and ID............9590
Serial Number...............681AAD0D
EC Level....................488651
Device Specific.(Z2)........RAMSC095
Device Specific.(Z3)........89H4941
Device Specific.(Z4)........97260
Probable Causes
DASD MEDIA

Failure Causes
DASD MEDIA

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
F000 0100 24B8 7A18 0000 0000 1709 0080 0009 0000 01D9 0000 05BA 0476 0076 0000
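
To keep an eye on how often a disk logs errors like the one above, a small filter over `errpt -a` output can count stanzas by type and resource. This is a sketch, not an IBM tool: `count_errors` is a made-up helper name, and it assumes the Type line comes before the Resource Name line in each stanza, as it does in the stanza pasted here.

```shell
#!/bin/sh
# Sketch: count errpt -a stanzas of a given type for a given resource.
count_errors() {
    # stdin: errpt -a output; $1: error type (e.g. TEMP); $2: resource name
    awk -v type="$1" -v res="$2" '
        /^LABEL:/         { t = "" }      # a new stanza starts here
        /^Type:/          { t = $2 }      # remember the type of this stanza
        /^Resource Name:/ { if (t == type && $3 == res) n++ }
        END               { print n + 0 }
    '
}

# Demo using a few lines from the stanza in this thread.
count_errors TEMP pdisk1 <<'EOF'
LABEL: SSA_DISK_ERR3
IDENTIFIER: 8BDD5B42
Class: H
Type: TEMP
Resource Name: pdisk1
EOF
```

On the live system the input would come from the error log itself, for example `errpt -a | count_errors TEMP pdisk1`.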


 
Note: SSA disks have both a pdisk number and an hdisk number, and the two are not necessarily the same, certainly not if you also have SCSI disks. So be careful when replacing SSA disks and modifying VGs (migratepv, reducevg, ...) to accommodate the replacement procedure. Your CE *should* know about that...

I guess you already know about that - but...


HTH,

p5wizard
 
pdisk and hdisk numbers aren't usually the same even if you don't have SCSI disks. There's a way to make dummy disks if you absolutely have to have the pdisks and hdisks be the same number. It's a pain in the rear end though, and it's faster to just run ssaxlate and find out which hdisk is mapped to which pdisk. The stanza above has this:
Resource Name: pdisk1

So run ssaxlate -l pdisk1 and see which hdisk is associated with pdisk1. (If you have an SSA RAID, you will not have hdisks associated with the individual pdisks, however; you'll have just one RAID disk.)

When replacing SSA disks, just like any other disks, rmdev the hdisk and the pdisk, pull the bad drive, put in a new one and run cfgmgr. It'll create the pdisk and the hdisk, AND they will have the same numbers as the old ones had unless you put other disks in too. AIX will assign the disks the lowest available number when cfgmgr runs, and those numbers should be the ones that the bad disk had.
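
As a rough sketch, the replacement steps could be scripted in two halves, one before and one after the physical swap. It assumes the hdisk has already been emptied and removed from its VG (migratepv, reducevg), and that pdisk1 maps to hdisk4, which is only a guess used for illustration. Confirm the mapping with ssaxlate first. The `run` wrapper and `DRY_RUN` variable are hypothetical helpers that print the commands by default.

```shell
#!/bin/sh
# Sketch only: remove and rediscover an SSA disk around a physical swap.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

remove_ssa_disk() {
    pdisk=$1
    hdisk=$2
    run ssaxlate -l "$pdisk" || return 1   # show which hdisk this pdisk maps to
    run rmdev -l "$hdisk" -d || return 1   # delete the logical disk definition
    run rmdev -l "$pdisk" -d || return 1   # delete the physical disk definition
}

rediscover_ssa_disk() {
    pdisk=$1
    run cfgmgr || return 1                 # recreate pdisk and hdisk for the new drive
    run ssaxlate -l "$pdisk"               # confirm the new mapping
}

remove_ssa_disk pdisk1 hdisk4
echo "# pull the bad drive and insert the new one here"
rediscover_ssa_disk pdisk1
```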

ssa disk err3 isn't necessarily critical. It's a temp error and could be caused by a lot of things. If you have a HARDWARE support contract, you might want to call in and ask the center to decode the sense data and let you know if the disk needs to be replaced or not.

 
We've already put the call in to check, and since we've received 9 Temp errors, they do recommend replacing the disk. Better to do it at our convenience than to have it actually take the system down, since it's in rootvg. I did do the relationship between hdisk and pdisk and this turns out to be pdisk4. Yesterday, I added another disk to rootvg and migrated the original disk to the new disk. The migration was successful, or so I thought, but there were 2 errors in the errpt:
LABEL: JFS_FSCK_REQUIRED
IDENTIFIER: CD546B25

Date/Time: Mon Jun 13 13:51:18
Sequence Number: 116048
Machine Id: 00026197A400
Node Id: openhub
Class: O
Type: INFO
Resource Name: SYSPFS

Description
FILE SYSTEM RECOVERY REQUIRED

Recommended Actions
PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY


and

LABEL: JFS_META_WRITE_ERR
IDENTIFIER: D2A1B43E

Date/Time: Mon Jun 13 13:51:18
Sequence Number: 116047
Machine Id: 00026197A400
Node Id: openhub
Class: U
Type: PERM
Resource Name: SYSPFS
Resource Class: NONE
Resource Type: NONE
Location: NONE
VPD:

Description
FILE SYSTEM CORRUPTION

Probable Causes
I/O ERROR ON FILE SYSTEM CONTROL DATA

Recommended Actions
PERFORM FULL FILE SYSTEM RECOVERY USING FSCK UTILITY
CHECK ERROR LOG FOR ADDITIONAL RELATED ENTRIES


We're having trouble getting the OK from the apps owner to umount the fs to run the fsck, since this fs will greatly affect many other applications and systems. We've spoken to IBM, and they think it's more than likely caused by the bad disk.
 
It might be caused by the bad disk. I'd still like you to run fsck on it. Has the app owner said why they won't allow you to unmount? If you need firepower, you might try telling them that you are getting critical errors in the errpt, and that if they don't allow you to unmount and do the necessary maint, they stand to lose their data.

 
I work in a hospital and this fs is for the interfaces to other areas such as the lab, admitting, surgery, etc., so they can't just bring it down during the day. The other issue on my end is that this OS is 4.3.3 and therefore no longer supported. We have contracted for support in case we run into any issues, but the support person is only available during certain hours.
 
Alright, that makes sense why they won't bring it down. Can you get maint time out of them in the middle of the night?

If so, all you need to do is unmount the filesystem and run fsck -y on the filesystem. If you get errors, run it again. If you get errors the 3rd time, the fs structure isn't fixable and you need to make sure you have a good backup, because you're probably going to have to recreate it before it becomes unreadable.
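
The unmount and repeat-fsck routine above could be wrapped up roughly like this. It is a sketch under two assumptions: that a non-zero exit status from fsck means it still reported problems, and that /interfaces is the mount point (a made-up name, since the thread never gives one). The `run` wrapper and `DRY_RUN` variable are hypothetical helpers that print the commands by default.

```shell
#!/bin/sh
# Sketch only: unmount a filesystem and run fsck -y up to a fixed number of times.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fsck_with_retries() {
    fs=$1          # mount point of the damaged filesystem
    max=${2:-3}    # give up after this many fsck passes
    pass=1
    run umount "$fs" || return 1
    while [ "$pass" -le "$max" ]; do
        # Assumption: exit status 0 means fsck found nothing left to fix.
        if run fsck -y "$fs"; then
            run mount "$fs"
            return 0
        fi
        pass=$((pass + 1))
    done
    echo "fsck still reports errors after $max passes; check your backup before recreating $fs" >&2
    return 1
}

fsck_with_retries /interfaces
```

The filesystem is deliberately left unmounted if every pass fails, so nothing keeps writing to a structure that fsck could not repair.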
 