Solaris 8 SCSI Problem 1

dfurm · Jul 19, 2002

I recently installed some extra Sun 9.1 GB disks in an E450 and am now getting some SCSI errors in dmesg. Can anybody advise me on what they mean?

Jul 18 12:13:00 stratus scsi: [ID 107833 kern.warning] WARNING: /pci@6,4000/scsi@4 (glm2):
Jul 18 12:13:00 stratus SCSI bus DATA IN phase parity error
Jul 18 12:13:00 stratus glm: [ID 663555 kern.warning] WARNING: ID[SUNWpd.glm.parity_check.6010]
Jul 18 12:13:00 stratus scsi: [ID 365881 kern.info] <SUN9.0G cyl 4924 alt 2 hd 27 sec 133>
Jul 18 12:13:00 stratus scsi: [ID 107833 kern.warning] WARNING: /pci@6,4000/scsi@4 (glm2):
Jul 18 12:13:00 stratus Target 3 reducing sync. transfer rate
Jul 18 12:13:01 stratus glm: [ID 923092 kern.warning] WARNING: ID[SUNWpd.glm.sync_wide_backoff.6014]
Jul 18 12:13:01 stratus scsi: [ID 107833 kern.warning] WARNING: /pci@6,4000/scsi@4 (glm2):
Jul 18 12:13:01 stratus scsi: [ID 107833 kern.warning] WARNING: /pci@6,4000/scsi@4 (glm2):
Jul 18 12:13:01 stratus Target 3 reverting to async. mode

Thanx in advance

ady2k · Jul 19, 2002

Did you install a new SCSI device in the E450?
Have you checked the SCSI IDs?
Possibly the SCSI ID of one of your Sun 9.1 GB disks conflicts with an existing SCSI ID.
You can check your SCSI devices with this command:
# cfgadm -al
Good luck

dfurm · Jul 19, 2002

thanx ady2k.

Here is the output from cfgadm -al

stratusa:/> cfgadm -al
Ap_Id                Type         Receptacle   Occupant     Condition
c0                   scsi-bus     connected    configured   unknown
c0::dsk/c0t0d0       disk         connected    configured   unknown
c0::dsk/c0t1d0       disk         connected    configured   unknown
c0::dsk/c0t2d0       disk         connected    configured   unknown
c0::dsk/c0t3d0       disk         connected    configured   unknown
c1                   scsi-bus     connected    configured   unknown
c1::dsk/c1t6d0       CD-ROM       connected    configured   unknown
c2                   scsi-bus     connected    configured   unknown
c2::dsk/c2t0d0       disk         connected    configured   unknown
c2::dsk/c2t1d0       disk         connected    configured   unknown
c2::dsk/c2t2d0       disk         connected    configured   unknown
c2::dsk/c2t3d0       disk         connected    configured   unknown
c3                   scsi-bus     connected    configured   unknown
c3::dsk/c3t0d0       disk         connected    configured   unknown
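
Since the warnings name Target 3, it can help to see which controllers list a device at that ID. The following is only a rough Python sketch, assuming the cfgadm -al listing has been saved as text in the format shown above. The function names are made up for illustration, and the listing alone does not say which cN controller corresponds to glm2.

```python
import re
from collections import defaultdict

# The cfgadm -al listing from the post above, whitespace-separated.
SAMPLE = """\
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::dsk/c0t1d0 disk connected configured unknown
c0::dsk/c0t2d0 disk connected configured unknown
c0::dsk/c0t3d0 disk connected configured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t6d0 CD-ROM connected configured unknown
c2 scsi-bus connected configured unknown
c2::dsk/c2t0d0 disk connected configured unknown
c2::dsk/c2t1d0 disk connected configured unknown
c2::dsk/c2t2d0 disk connected configured unknown
c2::dsk/c2t3d0 disk connected configured unknown
c3 scsi-bus connected configured unknown
c3::dsk/c3t0d0 disk connected configured unknown
"""

# Matches device attachment points such as "c2::dsk/c2t3d0" and captures
# the controller number and the SCSI target number.
AP_ID = re.compile(r"^c(\d+)::\w+/c\d+t(\d+)d\d+$")


def targets_by_controller(text):
    """Return {controller number: sorted list of SCSI target IDs}."""
    found = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = AP_ID.match(fields[0])
        if match:
            found[int(match.group(1))].append(int(match.group(2)))
    return {ctrl: sorted(ids) for ctrl, ids in found.items()}


def controllers_with_target(text, target):
    """Return the controllers that list a device at the given target ID."""
    by_ctrl = targets_by_controller(text)
    return sorted(ctrl for ctrl, ids in by_ctrl.items() if target in ids)


if __name__ == "__main__":
    print(targets_by_controller(SAMPLE))
    print(controllers_with_target(SAMPLE, 3))
```

A listing like this only shows what the OS has configured; two devices fighting over one ID would more likely show up as a missing or flaky device than as a duplicate row.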

Igaduma · Jul 19, 2002

Hi all,

With great pain I read your post!
I've had this problem on an E-250, and it never actually went away...
We used an external disk pack, all correctly configured with DiskSuite and working well, getting great speeds & good response when the system was running at load averages of 3, but from time to time these errors popped up and brought the system to its knees.
I couldn't really find any logic to when or why it happened.
All of a sudden the Oracle DB was slow; check /var/adm/messages and, what do you know, these errors are popping up at a lightning-fast tempo. Next thing you know, you've got a read/write failure and doom & gloom is lurking. There goes your beer at 18:00.
I never found a good reason why,
but I think it's related to one faulty disk bringing down the whole SCSI chain.
The errors always started on one particular disk and then propagated through the chain.
Replacing that disk with another didn't help.
In our disk pack the SCSI IDs are assigned automatically, so an ID conflict couldn't have been the problem.
Even with one disk the errors started to appear after a period of time.

Other sun.com reports talked about the external SCSI bus sitting too close to the NIC, so that the two interfered with each other (E250). I never really believed that, but be prepared for a never-ending story.

The errors did go away, though, when I stopped using the external pack and placed all the disks in the internal bays of the E-250.

What SCSI card are you using?

Do the disks conform to the SCSI standard of your card?

Other reports talk about adding SCSI-related parameters to your /etc/system, but that doesn't help in the long run either; the errors always come back... 2 days, 2 weeks, 1 month... when you've completely forgotten about them... they return...
You could of course downgrade your SCSI chain to SCSI-1 speed, but that's hardly useful since your system will become rather slow, and the errors *will* return!

(all these things I checked & checked & checked but never came to a clear solution)

You can patch your system until you can't hit the enter key anymore, and 2 days later the errors reappear, all fresh and ready to shout out to your user: "write FAILURE!"

If you find a solution to this please...please let me know

lars.van.casteren@base.be

PS: I run two E-250s, one of which is an old upgraded machine (new CPUs & RAM) and the other a brand-new E-250.
The errors are on the old E-250, not the new one...

mikeclark · Jul 21, 2002

I have had this sort of problem with a faulty terminator on the scsi bus.
It was a short cable with 4 disks on it and would work the same even without the terminator fitted.

Apricot · Jul 22, 2002

Summary
~~~~~~~
The glm driver detects a DATA IN parity error after WDTR (wide data
transfer request) was backed off because of a parity error.

Problem
~~~~~~~
When a SCSI parity error or SCSI bus hang occurs (caused by the SCSI
cable and so on), the glm driver backs off the transfer rate and width
for the target.

It logs the following messages.

- Target %d disabled wide SCSI mode
- Target %d reducing sync. transfer rate
- Target %d reverting to async. mode
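
To spot these three messages in a saved log, a small scanner along the following lines may help. It is a hedged Python sketch, not part of the bug report: the stage labels and function names are invented here, and the sample lines are modelled on the warnings at the top of the thread.

```python
import re

# The three backoff messages listed above, in escalation order.
# The labels on the right are made-up shorthand for this sketch.
STAGES = [
    (re.compile(r"Target (\d+) disabled wide SCSI mode"), "narrow"),
    (re.compile(r"Target (\d+) reducing sync\. transfer rate"), "reduced-sync"),
    (re.compile(r"Target (\d+) reverting to async\.\s*mode"), "async"),
]

# Sample lines modelled on the log at the top of the thread.
SAMPLE = """\
Jul 18 12:13:00 stratus SCSI bus DATA IN phase parity error
Jul 18 12:13:00 stratus Target 3 reducing sync. transfer rate
Jul 18 12:13:01 stratus Target 3 reverting to async. mode
"""


def backoff_events(lines):
    """Return [(target, stage label)] in the order the messages appear."""
    events = []
    for line in lines:
        for pattern, label in STAGES:
            match = pattern.search(line)
            if match:
                events.append((int(match.group(1)), label))
                break
    return events


def latest_stage(lines):
    """Return {target: label of the last backoff message seen for it}."""
    state = {}
    for target, label in backoff_events(lines):
        state[target] = label
    return state


if __name__ == "__main__":
    log_lines = SAMPLE.splitlines()
    print(backoff_events(log_lines))
    print(latest_stage(log_lines))
```

Any target that ends up at the last stage has been pushed all the way down to asynchronous transfers.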

After that, if a target starts a negotiation, the glm driver does
not agree to the rate the target asks for.
Once the glm driver has decided to run narrow SCSI, it does not
start a negotiation itself.
When the target negotiates for wide SCSI, the glm driver responds
with narrow SCSI.

The problem occurs when the target has multiple LUNs.

For example, there is a Target A that has LUN 0 and 1.

0) The transfer rate and width of Target A is FAST-20 wide SCSI
   at first. The glm driver knows that.
1) A SCSI parity error occurs on Target A LUN 0.
2) The glm driver disables wide SCSI for Target A.
3) A SCSI parity error occurs on Target A LUN 0.
4) The glm driver reduces the synchronous transfer rate of Target A.
5) A SCSI parity error occurs on Target A LUN 0.
6) The glm driver reverts Target A to asynchronous transfer mode.
7) Before the data phase, Target A LUN 0 issues WDTR and SDTR.
   The glm driver responds with narrow async SCSI.
8) Target A LUN 1 does not issue WDTR and SDTR. It operates in
   narrow async SCSI because the transfer rate and width apply to
   the whole target.
9) The glm driver detects DATA IN parity errors on Target A LUN 1
   until the host machine is rebooted.
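
The sequence above can be mimicked with a toy model. This Python sketch rests on an assumption that the report does not state outright but that the proposed fix hints at: the host keeps one transfer-mode record per LUN, while the target really has a single mode. All class and function names here are hypothetical and nothing below is taken from the glm source.

```python
# Backoff ladder from the steps above: wide off, slower sync, then async.
BACKOFF = ["FAST-20 wide", "FAST-20 narrow", "reduced sync narrow", "async narrow"]


class HostView:
    """Hypothetical host-side bookkeeping: one mode record per LUN."""

    def __init__(self, luns, update_all_luns):
        self.mode = {lun: 0 for lun in range(luns)}  # index into BACKOFF
        self.update_all_luns = update_all_luns

    def parity_error(self, lun):
        """Back off one step after a parity error reported on `lun`."""
        new_mode = min(self.mode[lun] + 1, len(BACKOFF) - 1)
        # Assumed bug: only the LUN that saw the error is updated.
        # Assumed fix: every LUN of the target is updated together.
        luns = list(self.mode) if self.update_all_luns else [lun]
        for each in luns:
            self.mode[each] = new_mode


def stale_luns(host, target_mode):
    """LUNs whose host-side record disagrees with the target's real mode."""
    return [lun for lun, idx in sorted(host.mode.items())
            if BACKOFF[idx] != target_mode]


def simulate(update_all_luns):
    host = HostView(luns=2, update_all_luns=update_all_luns)
    for _ in range(3):            # steps 1-6: three parity errors on LUN 0
        host.parity_error(lun=0)
    # Step 7: the target renegotiates through LUN 0 and the host answers
    # with LUN 0's record, so the whole target now runs in that mode.
    target_mode = BACKOFF[host.mode[0]]
    return stale_luns(host, target_mode)


if __name__ == "__main__":
    print("per-LUN update only:", simulate(update_all_luns=False))
    print("all LUNs updated:   ", simulate(update_all_luns=True))
```

Under this assumed model, the LUN that never renegotiates is left with an outdated record, which matches step 9; updating every LUN together leaves nothing stale.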

Platform
~~~~~~~~
Solaris7 8/99
taiho
X6541A
RAID device

How to Re-produce the Problem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use an X6541A (or X6540A) and a RAID device that has multiple LUNs.
Call the RAID device Target A.
The transfer rate and width of Target A is FAST-20 wide SCSI.
Target A initiates WDTR and SDTR itself after a power-on reset.

1) Issue a command to Target A LUN 0.
   The transfer rate and width of Target A is FAST-20 wide SCSI.
2) Generate a SCSI parity error on Target A LUN 0 three or more times.
   The glm driver reverts Target A to asynchronous transfer mode.
3) Power the device off and on.
4) Issue a command to Target A LUN 0.
   Before the data phase, Target A LUN 0 issues WDTR and SDTR.
   The glm driver responds with narrow async SCSI.
5) Issue a command to Target A LUN 1.
6) The glm driver detects DATA IN parity errors on Target A LUN 1
   until the host machine is rebooted.

Justification
~~~~~~~~~~~~~
This problem was found during a RAID device test.

o From a Technical Point of View
  This is serious because ...
  - All processes that issue commands to another LUN (one that does
    not negotiate) terminate abnormally, because the glm driver
    detects DATA IN parity errors until the host server is rebooted.

o From a Business Point of View
  This is serious because ...
  - The other LUNs of the RAID device cannot be used until the host
    server is rebooted.
    The RAID device has dual SCSI controllers and allows a failed
    controller to be replaced while the device is running.
    But in this situation, even if the failed controller has been
    replaced, the RAID device cannot be used until the host server
    is rebooted.

Call Type
~~~~~~~~~~
Please release a patch for Solaris 2.6, Solaris 7, and Solaris 8.

Suggestion for Fixing
~~~~~~~~~~~~~~~~~~~~~
We have checked the glm driver source.

We think the problem may be solved by changing it as follows
(changed lines are marked with |).

-----------------
  static void
  glm_set_wide_scntl3(struct glm *glm, struct glm_unit *unit, uchar_t width)
  {

|     uint16_t target = unit->nt_target;
|     uint16_t lun;

      switch (width) {
      case 0:
|         for (lun = 0; lun < NLUNS_PER_TARGET; lun++) {
|             /* store new i/o parms in each per-target-struct */
|             if ((unit = NTL2UNITP(glm, target, lun)) != NULL) {
                  unit->nt_dsap->nt_selectparm.nt_scntl3 &=
                      ~NB_SCNTL3_EWS;

                  glm->g_dsa->g_reselectparm[unit->nt_target].g_scntl3
                      &= ~NB_SCNTL3_EWS;
|             }
|         }
          break;
      case 1:
          /*
           * The scntl3:NB_SCNTL3_EWS bit controls wide.
           */
|         for (lun = 0; lun < NLUNS_PER_TARGET; lun++) {
|             /* store new i/o parms in each per-target-struct */
|             if ((unit = NTL2UNITP(glm, target, lun)) != NULL) {
                  unit->nt_dsap->nt_selectparm.nt_scntl3
                      |= NB_SCNTL3_EWS;

                  glm->g_dsa->g_reselectparm[unit->nt_target].g_scntl3
                      |= NB_SCNTL3_EWS;
                  glm->g_wide_enabled |= (1<<unit->nt_target);
|             }
|         }
          break;
      }
      ddi_put8(glm->g_datap, (uint8_t *)(glm->g_devaddr + NREG_SCNTL3),
          unit->nt_dsap->nt_selectparm.nt_scntl3);
  }

-----------------

Apricot · Jul 22, 2002

Check whether you have installed this patch:

Latest revision: 109885-09
Keywords: glm PCI power management kadb hang LSI 1010 D1000 UD2S
Synopsis: SunOS 5.8: glm patch
Date: Jul/16/2002

(from 109885-03) <-- there is a bug in this version

4341851 The glm driver detects DATA IN parity error <---
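
Comparing an installed revision against the one named above can be scripted. The following Python sketch assumes patch listings in the `showrev -p` style, where each line begins with `Patch: <id>-<rev>`. The sample text is invented for illustration (the revisions and package names are not from this thread), and using -09 as the minimum simply mirrors the revision quoted above.

```python
import re

# Illustrative input only: these revisions and package names are made up
# for the example and are not taken from the thread.
SAMPLE = """\
Patch: 109885-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWpd
Patch: 123456-01 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWcsu
"""

# Captures the six-digit patch ID and the two-digit revision.
PATCH_LINE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b")


def installed_revision(listing, base_id):
    """Return the highest installed revision of base_id, or None."""
    best = None
    for line in listing.splitlines():
        match = PATCH_LINE.match(line)
        if match and match.group(1) == base_id:
            rev = int(match.group(2))
            if best is None or rev > best:
                best = rev
    return best


def is_at_least(listing, base_id, min_rev):
    """True if base_id is installed at revision min_rev or later."""
    rev = installed_revision(listing, base_id)
    return rev is not None and rev >= min_rev


if __name__ == "__main__":
    print(installed_revision(SAMPLE, "109885"))
    print(is_at_least(SAMPLE, "109885", 9))
```

Taking the highest revision covers listings in which several revisions of the same patch appear.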

regards ph

dfurm · Jul 22, 2002

Thanx Apricot

I'll try this patch and see what happens

Igaduma · Jul 23, 2002

MY!

This must have been the most useful post I've ever read on this forum!

Since when has this patch been available?

If this solves the SCSI problem, my, I don't have anything left to worry about!

Thanks Apricot,
btw, where did you get that info from?

Igaduma

Apricot · Jul 23, 2002

Hi

I support a lot of E450s. This week I did a new install, so I checked the patch level.

That is how I came across this new patch.

regards ph

dfurm · Jul 23, 2002

Hi

I applied this patch and some others and have no more messages, BUT
when running test scsi from the ok prompt I get the following output:

move-memory failed with a result=fd
Device= pci@1f,4000/scsi@3
FRU= motherboard
scsi selftest failed Return code=1

On the face of it it looks OK from the OS.

regards

Apricot · Jul 23, 2002

The device pci@1f,4000/scsi@3 is the internal 4-disk backplane.

Can you please check the patch level of the kernel?

# showrev -p | grep 105181

regards ph
