any help to recover a bad hdisk

cawxy · Jul 22, 2003

Hi,

One hdisk is damaged. I am looking for help to recover the data on the bad hdisk before going to a data recovery company. From diag, I can do "Display/Alter Sector" but couldn't run "Disk to Disk Copy" since some sectors of the bad hdisk are unreadable. Is there any AIX tool that can copy the GOOD sectors of the disk to another disk (skipping the bad ones)?

I assume the answer is 99.99% NO but maybe some genius will help. Thanks.

arvibm · Jul 23, 2003

Hi,

You can recover the data in case of multiple disk failure in a RAID-5 array. Search for Document id: "MIGR-39144" on the IBM site. But if it is a case of single disk failure, then sorry, there is no chance to recover the data.

Regards

arvind

cawxy · Jul 23, 2003

Arvind,

Thanks for your help. My hard disk is an IBM 2104 TL1 model connected to an RS6000 machine running AIX 4.3.3. I am not sure whether it's a RAID-5 array with multiple disks or not. Is there any info/doc I can find for help?

York

bonsky · Jul 23, 2003

It depends on whether you are using a SCSI RAID adapter or a plain SCSI adapter to connect to your 2104. You can check with lsdev, or with lsattr if you can identify all your SCSI adapters.
Also, have you tried copying file system by file system instead of the whole disk? It might work, and it would at least separate the bad data from the good, so you can still salvage the good file systems once you determine which ones they are. Good luck!

arvibm · Jul 23, 2003

Hi York,

Here is the procedure to recover data in case of multiple disk failure in RAID-5.

ServeRAID - Recovering from multiple disk DDD failures

Recovering from multiple disks defunct in a ServeRAID environment
The ServeRAID controller is designed to tolerate a single disk failure when configured with redundant RAID levels. There is no guarantee that any multiple disk failure can be recovered with the data intact. IBM is providing these steps since they offer the highest possibility of a successful recovery if, under rare circumstances, multiple disk drives are marked defunct within the same array.

Resource requirements

IBM ServeRAID Support CD(1)
ServeRAID Command Line Diskette (available on the IBM ServeRAID Support CD(2) or can be downloaded)
ServeRAID-3x or ServeRAID-4x Controller(s)
These procedures depend on certain logging functions enabled in the BIOS/Firmware of the ServeRAID controller, which were first implemented in version 4.0 of the IBM ServeRAID Support CD. The ServeRAID controller must have been flashed using any 4.x version of the IBM ServeRAID Support CD or diskette(s) prior to the failure.
DUMPLOG.BAT and CLEARLOG.BAT for DOS/Windows and/or the version of DUMPLOG and CLEARLOG appropriate for your operating system.

(1) The IBM ServeRAID Support CD software is backward compatible. If you boot to a newer version of the IBM ServeRAID Support CD which prompts you to upgrade BIOS/Firmware, you should cancel out of the BIOS/Firmware update until the system is recovered. Upgrading software levels while in a failed state is not recommended unless otherwise directed by IBM support.
(2) Diskette images of the ServeRAID downloadable diskettes are also on the IBM ServeRAID Support CD in the IMAGES directory. You can build a diskette of the ServeRAID Command Line utilities from this image. For more information see the README.TXT file in the IMAGES directory on the IBM ServeRAID Support CD.

Recovery steps for ServeRAID systems with multiple disk failures

Capture the ServeRAID logs using DUMPLOG.BAT. There are two methods of capturing these logs, depending on the situation. The first method is used when the operating system is located on the failed logical drive and the second is used when any other logical drive has failed and the operating system is still accessible. These logs should be sent to your IBM Support Center as necessary for root cause analysis. This is the best evidence to determine what caused the failure.
Use Method #1 if the operating system logical drive is off-line. Copy the DUMPLOG.BAT and CLEARLOG.BAT files to the root of the ServeRAID Command Line diskette or the A:\ directory. Boot to the ServeRAID Command line diskette and run the DUMPLOG command using the following syntax:

DUMPLOG <FILENAME.TXT> <Controller#>

Use Method #2 when a data logical drive is off-line and the operating system is online. Copy or extract the DUMPLOG utility appropriate for your operating system to a local directory or folder. Run the DUMPLOG command following the instructions on the above Website (under Resource Requirements) appropriate for the operating system to capture ServeRAID logs.
Use ServeRAID Manager to determine the first disk in the failed array to be marked Defunct. Follow (a) below if the operating system is accessible, or (b) below if it is not.
(a) When the operating system is accessible, determine the order the disks were marked Defunct using the following steps:
Open ServeRAID Manager and note the hard disk drive(s) that are defunct
In ServeRAID Manager highlight the system hostname with Defunct (DDD) drives
Right-click the system hostname and choose Save printable configuration and event logs. These logs are saved into the installation directory of ServeRAID Manager, usually "Program Files\RAIDMAN." The log files are saved as RAIDx.LOG where x is the controller number.
Open the RAIDx.LOG text file for the controller with Defunct drives in any standard text editor. Go to the last page of the RAIDx.LOG file and you will see a list called ServeRAID defunct drive event log for controller x. This portion of the log lists, by date and time stamp, all the disk drives marked Defunct in the order they went off-line, with the most recent failure shown at the bottom of the list. Determine which disk failed first. The first drive marked Defunct should be at the top of the list.

IMPORTANT NOTICE: Since the "ServeRAID defunct drive event log" is not cleared, there may be entries from earlier incidents of defunct drives that do not pertain to the problem currently being worked. Review the list of defunct drives carefully from the bottom of the list to the top and identify the point where the first drive associated with this incident is logged. The date and time stamps are usually the strongest indicators of where this point is.

There is no guarantee that the "ServeRAID defunct drive event log" will list the drives in the exact order the disks failed under certain circumstances. One example is when an array is setup across multiple ServeRAID channels. In this configuration, the ServeRAID controller issues parallel I/O's to disk devices on each channel. In the event of a catastrophic failure, disks could also be marked defunct in parallel. This could impede the reliability of the date and time stamps as the ServeRAID controller writes events from multiple channels operating in parallel to a single log.

Detach or remove the first hard disk marked Defunct from the backplane or cable (after the system is powered off). This hard disk will need to be replaced.
Exit ServeRAID Manager and power the server off
(b) When the operating system is not accessible, determine the order the disks were marked Defunct using the following steps:
Boot to the IBM ServeRAID Support CD
Highlight Local in ServeRAID Manager, right click and select Save printable configuration and event logs. You will be prompted for a blank floppy diskette to be inserted into Drive A:.
Insert a diskette and ServeRAID Manager will save a RAIDx.LOG file to the diskette. The log files are saved as RAIDx.LOG where x is the controller number.
Take the diskette from Drive A: to another working system and open the RAIDx.LOG text file in any standard text editor. Go to the last page of the RAIDx.LOG file and you will see a list called ServeRAID defunct drive event log for controller x. This portion of the log lists, by date and time stamp, all the disk drives marked Defunct in the order they went off-line, with the most recent failure shown at the bottom of the list. Determine which disk failed first. The first drive marked Defunct should be at the top of the list.

IMPORTANT NOTICE: Since the "ServeRAID defunct drive event log" is not cleared, there may be entries from earlier incidents of defunct drives that do not pertain to the problem currently being worked. Review the list of defunct drives carefully from the bottom of the list to the top and identify the point where the first drive associated with this incident is logged. The date and time stamps are usually the strongest indicators of where this point is.

There is no guarantee that the "ServeRAID defunct drive event log" will list the drives in the exact order the disks failed under certain circumstances. One example is when an array is setup across multiple ServeRAID channels. In this configuration, the ServeRAID controller issues parallel I/O's to disk devices on each channel. In the event of a catastrophic failure, disks could also be marked defunct in parallel. This could impede the reliability of the date and time stamps as the ServeRAID controller writes events from multiple channels operating in parallel to a single log.

Detach or remove the first hard disk marked Defunct from the backplane or cable (after the system is powered off). This hard disk will need to be replaced.
Exit ServeRAID Manager and power the server off
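
The bottom-to-top review of the defunct drive event log described above can be sketched in Python as follows. This is only an illustration, not part of the IBM procedure: it assumes the log entries have already been transcribed by hand into (timestamp, drive) pairs in the order they appear in the log, the drive labels are made up, and the one-hour gap used to separate the current incident from older ones is an arbitrary assumption.

```python
from datetime import datetime, timedelta


def split_current_incident(events, max_gap=timedelta(hours=1)):
    """Find the first drive marked Defunct in the most recent incident.

    events: list of (timestamp, drive) pairs in log order, oldest at the top.
    Walks upward from the bottom entry and stops at the first gap larger
    than max_gap. Returns (first_defunct, other_drives_in_incident).
    """
    if not events:
        raise ValueError("no defunct events to examine")
    start = len(events) - 1
    while start > 0 and events[start][0] - events[start - 1][0] <= max_gap:
        start -= 1
    incident = events[start:]
    first = incident[0][1]
    # The remaining drives are the candidates to set back to Online later.
    others = [drive for _, drive in incident[1:] if drive != first]
    return first, others


# Made-up example: one old, unrelated entry followed by the current incident.
log = [
    (datetime(2003, 3, 2, 9, 15), "ch1-id3"),
    (datetime(2003, 7, 21, 22, 40), "ch1-id1"),
    (datetime(2003, 7, 21, 22, 41), "ch1-id4"),
]
first, others = split_current_incident(log)
print("replace:", first, "| bring online:", others)
```

The sketch trusts the list order within an incident, which matches the statement that the first drive marked Defunct should be at the top. As the notice above warns, arrays spanning multiple channels can log failures out of order, so the result is a starting point to check by eye rather than a verdict.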
While the server is powered off, reseat the PCI ServeRAID controller(s). Reseat the SCSI cable(s) and the disks against the backplanes. Reseat the power cables to the backplane and SCSI backplane repeater options if they are present. As you are reseating the components, visually inspect each piece for bent pins, nicks, crimps, pinches or other signs of damage. Take extra time to ensure that each component snaps or clicks into place properly.
Power on the system and boot to the IBM ServeRAID Support CD
Using ServeRAID Manager, undefine all Hot Spare drives to avoid an accidental rebuild from starting
Using ServeRAID Manager, set each Defunct hard drive in the failed array to an "Online" state (except the first drive marked Defunct, as listed in the "ServeRAID defunct drive event log"). The failed logical drives should change to a critical state. If there are problems bringing drives online, or if a drive initially goes online then fails to a Defunct state soon after, see the Hardware Troubleshooting section below before proceeding. The logical drives must be in a critical state before proceeding.
In this step we will attempt to access the critical logical drive(s). If you are still in ServeRAID Manager, exit and restart the system.
If the operating system logical drive is now critical, attempt to boot into the installed operating system. (If you are prompted to perform any CHKDSK activities or file system integrity tests, choose to skip these tests)
If the data is on the critical logical drive, boot into the operating system and attempt to access the data. (If you are prompted to perform any CHKDSK activities or file system integrity tests, choose to skip these tests)
If the system boots into the operating system, run the appropriate command to do a READ-ONLY file system integrity check on each critical logical drive. If the file system checker determines the file system does not have any data corruption, the data should be good.

For Windows NT/2K systems, run CHKDSK in READ ONLY MODE at a command prompt (NO PARAMETERS) for each critical logical drive. If CHKDSK does not report data corruption, the data should be intact.

Run the IPSSEND GETBST command to determine if the bad stripe table has incremented on any logical drive. If the Bad Stripe Table has incremented to one or more, the array should be removed and recreated. The IPSSEND.EXE executable is located on the IBM ServeRAID Support CD or the command line diskette.

You can attempt to back up the data; however, you may encounter "file corrupted" messages for any files that had data on the stripe that was lost. This data is unfortunately lost forever from the current logical drive.

Plan to restore or rebuild the missing data on each critical logical drive if any of the following problems persist:
The critical logical drive remains inaccessible
Data corruption is found on the critical logical drives
The system continuously fails to boot properly to the operating system
Partition information on the critical logical drives is unknown
If the data is good, initiate a backup of the data.
After the backup completes, replace the remaining physical hard drive that is still Defunct. An auto-rebuild should initialize when the Defunct disk is replaced.
Redefine hot spares as necessary.
Capture another set of ServeRAID logs using DUMPLOG.
Clear the ServeRAID logs using the following CLEARLOG.BAT command available on the DUMPLOG website:

CLEARLOG <Controller#>

If a case was opened with your IBM Support Center for this problem, complete steps #14 and #15, otherwise you have completed the recovery process.
Plan to capture the ServeRAID logs again using DUMPLOG within 2-3 business days of normal activity (after the ServeRAID logs were cleared in step #12) to confirm the ServeRAID subsystem is fixed. These logs should be emailed to your IBM support center to ensure the ServeRAID controller and SCSI bus activities are operating within normal parameters.
If additional issues are observed, an action plan will be provided with corrective actions and steps #12 and #13 should be repeated until the system is running optimally.

Hardware Troubleshooting
If you continue to experience problems, such as drives being marked DDD again or a disk that will not change to an online state, review the following guidelines to assist in identifying the configuration or hardware problem.

The most common cause of multiple disk failures is poor signal quality across the SCSI Bus. Poor signal quality will result in SCSI protocol overhead as it tries to recover from these problems. As the system becomes busier and demand for data increases, the corrective actions of the SCSI protocol increase and the SCSI bus becomes closer to saturation. This overhead will eventually limit the normal device communications bandwidths and if left unchecked, one or more SCSI devices may not be able to respond to the ServeRAID controller in a timely manner resulting in the ServeRAID controller marking the disk drives Defunct. These types of signal problems can be caused by improper installation of the ServeRAID controller in a PCI slot, poor cable connections, poor seating of hot swap drives against the SCSI backplane, improper installation or seating of backplane repeaters, and improper SCSI bus termination.

There are many possible reasons for multiple disk failures; however, you should be able to isolate most hardware problems using the following isolation techniques:

Check error codes within the ServeRAID Manager when a device fails to respond to a command. Research these codes in the publicly available Hardware Maintenance Manuals.
In non-hot swap systems, make sure the disk devices are attached to the cable starting from the connector closest to the SCSI terminator, working your way forward to the connectors closest to the controller. Also examine each SCSI device for the proper jumper settings.
While the server is powered off, reseat the ServeRAID controller in its PCI slot and all cables and disk devices on the SCSI bus.
Examine the cable(s) for bent or missing pins, nicks, crimps, pinches or over stretching.
Temporarily attach the disks to an integrated Adaptec controller (or PCI version as available) and boot into the Adaptec BIOS using CTRL-A. As the Adaptec BIOS POSTs, you should see all the expected devices and the negotiated data rates. You can review this information and determine if this is what you should expect to see.
Once you proceed into the BIOS, choose the SCSI Scan option which will list all the devices attached to the controller. Highlight and select one of the disks and initiate a "Media Test" (this is NOT destructive to data). This will test the device and the entire SCSI bus. If you see errors on the Adaptec controller, try to determine if it is the device or the cable by initiating a Media Test on other disks. Test both Online and Defunct disks, to determine if the test results are consistent with the drive states on the ServeRAID controller. You can also move Hot Swap disks to a different position on the backplane and re-test to see if the results change.

If the problem persists, swap out the SCSI cable and retry the Media Tests on the same disks. If the disks test okay, the previous cable is bad. This is a valuable tool to use as you isolate for a failing component in the SCSI path.

IMPORTANT NOTICE: The Adaptec controller can produce varying results from the ServeRAID controller because of the way Adaptec controllers negotiate data rates with LVD/SE SCSI devices. If an Adaptec controller detects errors operating at LVD speeds, it can downgrade the data rates to single-ended speeds and continue to operate with no reported errors. The ServeRAID controller will not necessarily change data rates under the same conditions.
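
The cable-versus-disk reasoning behind the Media Test steps above can be written down as a small Python sketch. It is an interpretation, not IBM's logic: only the "cable" outcome comes directly from the text (disks that failed now test okay after the cable swap), while the "disk" outcome (the same subset of disks keeps failing with either cable) and the other labels are assumptions added for illustration.

```python
def suspect_component(before_swap, after_swap):
    """Suggest which component to suspect from two rounds of Media Tests.

    before_swap / after_swap: dicts mapping a disk label to True (test passed)
    or False (test failed), recorded with the old and the new SCSI cable.
    """
    failed_before = {d for d, ok in before_swap.items() if not ok}
    failed_after = {d for d, ok in after_swap.items() if not ok}
    if not failed_before:
        return "nothing reproduced"
    if not failed_after:
        # Failures disappeared with the new cable: the previous cable is bad.
        return "cable"
    if failed_after == failed_before and len(failed_before) < len(before_swap):
        # Assumption: the same subset failing with both cables points at the disks.
        return "disk"
    return "inconclusive"


# Made-up example: one of two disks fails with the old cable only.
print(suspect_component({"disk0": False, "disk1": True},
                        {"disk0": True, "disk1": True}))
```

Because the Adaptec controller may quietly drop to single-ended speeds, as the notice above explains, a clean result here does not prove the bus is healthy at LVD speeds.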

Use the System Diagnostics in F2 Setup to test the ServeRAID subsystem. If the test fails, disconnect the drives from the ServeRAID controller and reset the controller to factory default settings. Retry the Diagnostic tests. If Diagnostics pass, attach the disks to the ServeRAID controller one channel at a time and retry the tests to isolate the channel of the failing component. If the controller continues to fail diagnostic tests (when using the latest available diagnostics for the server), call your IBM support center for further assistance.

Disconnect or detach the first drive in the array to be marked defunct from the cable or backplane. Restore default settings to the ServeRAID controller. Attempt to import the RAID configuration from the disk drives. Depending on how the failure occurred, this technique may have mixed results. There is a reasonable chance that all the drives will return to an online state, except the first disk which is disconnected.

Move the ServeRAID controller into a different PCI slot or PCI bus and re-test.

When attaching an LVD ServeRAID controller to legacy storage enclosures, set the data rate for the channel to the appropriate data rate supported by the enclosure.

Mixing LVD SCSI devices with Single Ended SCSI devices on the same SCSI bus will result in switching all devices on the channel to Single Ended mode and data rates.

Open a case with your IBM Support Center and submit all the sets of ServeRAID logs captured on the system for interpretation to isolate a failing component.

Applicable countries and regions
Worldwide

Document id: MIGR-39144
Last modified: 2003-05-28
Copyright © 2003 IBM Corporation

I have worked on 2104-TU3 and DU3 storage units. I don't know the 2104-TL1. Are there any special features in this model?

Regards

arvind

cawxy · Jul 28, 2003

Hi Bonsky & Arvind,

My RS6000 machine has an integrated Wide/Ultra2 SCSI controller to which the 2104 is connected, so it's not a RAID adapter. Copying the file system is impossible as I cannot even mount it. Unfortunately, it seems the data cannot be recovered, as Arvind has pointed out.

Anyway, thanks so much for your kind help.

York
