
Solstice Disk Suite 4.2.1 disk failover problem 2


scripter99

Technical User
Sep 30, 2003
Platform Type = V240
x2 1GHz CPUs, 4GB RAM

Hi,

During testing of our Solstice Disk Suite configuration, which comprises 2 disks configured as a mirror (using 2 submirrors), we found that when we remove the master disk the mirror takes over seamlessly, but the CPU then sits at 100% wait-I/O (WIO). The system never recovers from this and requires a reboot, with the original master reinserted, in order to restore normal service.
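
For reference, the failover behaviour was observed with the standard tools (just a sketch of the commands used; nothing platform-specific is assumed):

Code:
# watch per-device I/O and %wait while the master disk is pulled
iostat -xn 5
# check the mirror component states while the failover happens
metastat | grep -i state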

Any ideas?

Also, something slightly different about this platform is that root is configured on slice 1 (the EEPROM has been updated to reflect this), although I do not see how this could cause a problem.

Here is a copy of my /etc/lvm/md.cf:

Code:
d30 -m d10 d20 1
d10 1 1 c1t0d0s0
d20 1 1 c1t1d0s0
d31 -m d11 d21 1
d11 1 1 c1t0d0s1
d21 1 1 c1t1d0s1
d32 -m d12 d22 1
d12 1 1 c1t0d0s3
d22 1 1 c1t1d0s3
d33 -m d13 d23 1
d13 1 1 c1t0d0s4
d23 1 1 c1t1d0s4
d34 -m d14 d24 1
d14 1 1 c1t0d0s5
d24 1 1 c1t1d0s5
hsp001

(Please note the hsp is empty. Also, root is on d31).

Any assistance would be appreciated.

Regards,

Chris
 
There are a few points:

* How many metadbs do you have on disk 2? (A minimum of 3 in total is required.) Note that you need (n/2)+1 metadbs to boot the system; in OBP you boot c1t0d0s0, which loads the kernel, the kernel reads /etc/system, and there metadevice d31 is configured as / (a quick check is sketched below).

* If you want to boot disk 2 you need an OBP device alias so you can boot from c1t1d0s1, e.g. "ok boot disk2"; add disk2 to the list of boot devices.

* 100% CPU load: which OS? Which patch level? Did you install the Recommended Patches (and from which date)?
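
As a quick check (only a sketch; the c1t0d0/c1t1d0 names are taken from the md.cf you posted and may differ):

Code:
# count the state database replicas on each disk
metadb | grep -c c1t0d0
metadb | grep -c c1t1d0
# to boot, more than half of all configured replicas must be available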

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
 
Any news or info?

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
 
Hi Franz,

thanks for your reply. The answers to the questions that you raise are as follows:

- I have 4 metadb replicas configured, two each on slices 6 & 7.

- The OS installed is Solaris 8 05/03 (apologies for leaving this out of the original post).

- The box is at Solaris Generic patch level 21. The reason for this is that the application software we are running was proved at this level, although I imagine that at some point we will upgrade to a later patch level.

- I configured an OBP devalias for the standby disk, as follows (see also the check sketched below this list):

eeprom nvramrc="devalias altboot /pci@1c,600000/scsi@2/sd@1,0:b"

- As before, root is on slice 1 (hence the 0:b above) and swap is on slice 0.
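
As an aside, the PROM only evaluates nvramrc when use-nvramrc? is set to true, so that setting is worth confirming as well (a minimal check from the running system):

Code:
# show whether the PROM evaluates nvramrc at reset
eeprom "use-nvramrc?"
# enable it if it is still set to false
eeprom "use-nvramrc?=true"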

I double-checked my configuration against example configurations and it appeared OK. However, I am currently rebuilding from scratch in case I overlooked something.

Any more assistance would be appreciated.

Thanks,

Chris


 
If you have "only" 4 metadbs in total, booting from disk 2 won't work, since you do not have enough dbs for the quorum check (you need at least 51% of the dbs, i.e. [n/2]+1, in your case 3); create another metadb on disk 2:
Code:
metadb -a -c 2 /.../....s7
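
Afterwards you can confirm that the new replicas are in place (a minimal check only):

Code:
# list all replicas together with a legend explaining the status flags
metadb -i
# the new replicas should carry the 'a' (active) flag and no error flags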

I recommend installing the latest Recommended Patches.

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
 
Thanks Franz,

however, I have 4 metadbs configured on each of the two disks, as follows:

# metadb -a -f -c2 /dev/dsk/c1t0d0s6 /dev/dsk/c1t1d0s6
# metadb -a -f -c2 /dev/dsk/c1t0d0s7 /dev/dsk/c1t1d0s7

My understanding is that you just need at least 2 on each disk. Is that correct?

Also, based on the calculation that you have provided, that means I would need an odd number of replicas on disk 2 (but a number greater than that on disk 1). Is that correct?

That being the case, if disk 1 failed (disk 2 now master) and was then replaced, does disk 1 assume the role of master once more? If not, and disk 2 stayed master, then if disk 2 failed the system would fail (i.e. not enough replicas on disk 1).

Please feel free to correct my understanding (as I have probably confused myself!)

Thanks,

Chris
 
SDS/SVM needs to make a "majority decision" when booting; it needs (n/2)+1 state databases to continue, and at least three dbs. SDS was designed to run on larger systems with lots of disks (at least 3 disks, each hosting 3 databases); when SDS was designed, Sun disks were very expensive and nobody mirrored small servers...

With 2 disks you cannot make the system really failsafe using SDS (the system will not boot if one disk is not available), but you can make it at least a little bit stronger (if one disk fails, the system will continue running). Since most disks are hot-swappable, you can replace the failed disk, repartition it to your needs, and resync the metadevices and the state databases (a rough outline of that replacement is sketched below).
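
Only a sketch, assuming the failed disk is c1t0d0 and the survivor is c1t1d0, and using the metadevice names from the md.cf posted earlier; adjust to your own layout:

Code:
# copy the partition table from the surviving disk to the replacement disk
prtvtoc /dev/rdsk/c1t1d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
# drop the replicas that lived on the failed disk and recreate them
metadb -d /dev/dsk/c1t0d0s6 /dev/dsk/c1t0d0s7
metadb -a -c 2 /dev/dsk/c1t0d0s6 /dev/dsk/c1t0d0s7
# re-enable the replaced components so the mirrors resync
metareplace -e d30 c1t0d0s0
metareplace -e d31 c1t0d0s1
# ...and likewise for d32, d33 and d34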

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
 
Thanks Franz,

I am starting to understand a bit more now. Also, since reconfiguring the platform, my Solstice configuration appears to work fine: I can remove disk 1, disk 2 takes over seamlessly, and the platform performance remains stable.

I think the problem that I had before might have been due to the fact that I removed disk 1 before the Solstice disks had finished resyncing!

However, during my testing I have made a number of observations, as follows:

- After removing disk 1, I noticed from metastat that 2 of my slices still reported their submirror for disk 1 as being 'Okay'. It wasn't until I actually tried some commands on these slices (e.g. ls, vi, etc.) that the metastat output for these slices was updated to show that the submirror for disk 1 'Needs maintenance'. Have you seen this before?

- Also, what is the reason for making disk 2 bootable if the system will not boot with only one disk in (i.e. disk 1 has failed)? Is it so you can simply boot from disk 2 and have disk 1 as the standby?

Thank you for all of your help.

Best Regards,

Chris
 
1) 'Okay' vs. 'Needs maintenance': this works "as designed". SDS only tries to access a metamirror (or device) in the case of a write (since you always write to every mirror); if you set up a read-only mirror (e.g. a software pool) and configure it to read only from the first disk (man metaparam; see the sketch below), and disk 2 then fails, its status will stay "Okay" forever. In Solaris 9 (I think it is SDS 4.2.1, too) there is a new daemon that checks these devices and sets them to 'Needs maintenance' in case of a failure...

2) Booting disk 2: do you get an error message? It should work.
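
For point 1, a minimal sketch of what I mean (d30 is simply the mirror name from the md.cf earlier in this thread):

Code:
# show the current read/write policy of the mirror
metastat d30 | grep -i option
# read only from the first submirror instead of round-robin
metaparam -r first d30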

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
 
Hi Franz,

apologies for the late reply, I have been on holiday. However, I have a couple more questions for you, if that is OK.

Basically, I have my SDS configuration working now. It is made up of 2 disks; each disk has 2 slices that store the metadbs, and each of those slices holds 2 replicas, so in total there are 4 replicas on each disk. My testing has shown that I can remove one of the disks (say, disk 1) and the system will stay running. This behaviour is in line with the majority consensus algorithm on which SDS is based, as follows:

"The majority consensus algorithm accounts for the following: the system will stay running with exactly half or more replicas; the system will panic when less than half the replicas are available; the system will not reboot without one more than half the total replicas."

However, if I were to reboot the system with only 1 disk in, the boot would fail since only half of the replicas are available (as described above). In order to combat this, we have decided that when a disk fails we shall have a procedure in place to add another replica to the remaining disk so that the system should still be able to boot if required (until the failed disk has been replaced). My questions, therefore, are as follows:

1) Is it possible to add replicas on the fly?

When I try to do this (by adding the new replica to an existing slice) I receive the following error:

"Error: c#t#d#s#: has appeared in more than once in the specification of d#"

2) If the answer to (1) is YES, do we need a spare slice for this?

3) Also, should all replicas be put on dedicated slices, or can they also reside on slices where normal filesystems are located?

Thanks for your help.

Regards,

Chris
 
As far as 3) is concerned, I think the usual advice is to use dedicated partitions for the metadbs. However, I have heard of 'borrowing' space from swap to form a new metadb slice, though I've never used it myself.

Generally, adding metadbs to existing slices is likely to lead to, if not guarantee, data loss, and as such should be avoided.
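
If you do go down the route of borrowing space (e.g. from swap), it is worth reviewing the current layout first. Just a sketch, reusing the device name from earlier in the thread:

Code:
# show the VTOC / slice layout of the disk, including any unused slices
prtvtoc /dev/rdsk/c1t0d0s2
# show the swap devices currently configured and their sizes
swap -l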
 
Hi,

I am still having problems booting from my 2nd disk when the 1st disk has been removed. My nvramrc devalias is configured as follows:

ok printenv nvramrc
nvramrc = devalias altboot /pci@1c,600000/scsi@2/sd@1,0:b

The above string corresponds to the 2nd disk, as follows:

# ls -l /dev/rdsk/c1t1d0s1

lrwxrwxrwx 1 root root 47 May 18 10:39 /dev/rdsk/c1t1d0s1 -> ../../devices/pci@1c,600000/scsi@2/sd@1,0:b,raw

However, when I try to boot this I get the following error:

ok boot altboot

"Can't locate boot device"

PS: I installed the bootblk for the 2nd disk as follows:

installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t1d0s1
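
For reference, a few things that could be checked at the OBP prompt (only a sketch; whether the PROM wants disk@1,0 rather than sd@1,0 in the alias path is an assumption that show-disks should confirm or rule out):

Code:
ok devalias                 \ list the aliases the PROM currently knows about
ok printenv use-nvramrc?    \ nvramrc (and the altboot alias) is ignored unless this is true
ok show-disks               \ compare the disk paths the PROM reports with the alias above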

Any ideas?

Regards,

Chris

 