Memory Error

Status
Not open for further replies.

ponetguy2

MIS
Aug 28, 2002
442
US
Hello everyone, I've been getting this error message on one of our servers every so often. Does anyone have a clue what it means? FYI: I've installed the latest patches from Sun. This machine is running Solaris 9.

I hope it's not a hardware issue. YIKES!!!

Oct 16 14:46:23 xpress10 SUNW,UltraSPARC-II: [ID 942467 kern.info] [AFT0] Corrected Memory Error detected by CPU14, errID 0x0006c63a.30457c17
Oct 16 14:46:23 xpress10 AFSR 0x00000000.00100000<CE> AFAR 0x00000000.41b22708
Oct 16 14:46:23 xpress10 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1002610c
Oct 16 14:46:23 xpress10 UDBL Syndrome 0xf4 Memory Module Board 3 J3801
Oct 16 14:46:23 xpress10 SUNW,UltraSPARC-II: [ID 326970 kern.info] [AFT0] errID 0x0006c63a.30457c17 Corrected Memory Error on Board 3 J3801 is Persistent
Oct 16 14:46:23 xpress10 SUNW,UltraSPARC-II: [ID 620438 kern.info] [AFT0] errID 0x0006c63a.30457c17 ECC Data Bit 14 was in error and corrected
 
A DIMM replacement is likely needed, but it could be a system board or, less likely, a CPU.
 
It's almost certainly a memory board problem (board 3). ECC is correcting memory errors on this memory module.
 
Here is my cediag output:

cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: Analysed System: SunOS 5.8 with KUP 108528-27 (MPR active)
cediag: Pages Retired: 0 (0.00%)
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5


I guess memory is okay. Is this correct? Should I worry?
 
xpressdev1# cediag -v
cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC
cediag: info: cediag directory: /opt/SUNWcest/bin
cediag: info: UltraSPARC Version: 2 (2)
cediag: info: OS Type: SunOS
cediag: info: OS Version: 5.8
cediag: info: Hostname: xpressdev1
cediag: info: Memory size: 262016 (8KB pages)
cediag: info: MPR (deduced) PRL pages: 248 (8KB pages)
cediag: info: MPR-capable OS: true
cediag: info: KUP: 108528-27
cediag: info: MPR-aware kernel in-use: true
cediag: info: MPR enabled: true
cediag: info: MPR disabled in /etc/system: n/a
cediag: info: MPR force mode: n/a
cediag: info: MPR state: active
cediag: info: Rule#3 check: true
cediag: info: Rule#4 check: true
cediag: info: Rule#5 check: false
cediag: info: Rule#5 check via cestat: true
cediag: info: Rule#6 check: false
cediag: Pages Retired: 0 (0.00%)

cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
 
Do you still have the "Corrected Memory Errors" in your current /var/adm/messages?
 
A couple of options: patch 118558-11 is recommended, but it also asks for 117171-17, and you may also need 112233-12 due to patch rejuvenation.

A simpler method is to add
set automatic_page_removal=1
to /etc/system and reboot. The fault will recur one more time and then disappear.

"These errors are caused by the memory scrubbers (which run every 12 hours) finding memory faults. This mechanism reads the whole memory and will report if there is something wrong on that particular memory location. On Solaris 9, memory page retirement is available, which will mark the memory as bad & the O/S will not use that memory anymore."
 
The memory errors are sporadic. I didn't get the memory error today. However, my boss suggested that this might be a CPU problem as well.

I'm confused. I wish we had a Sun contract.
 
This is similar to what we had on a V440 recently. We reported it to Sun and there was no suggestion of a CPU error. The Corrected Memory messages occurred sometimes once a day, sometimes twice, and sometimes not at all, but always at the same times of the day or night. We added the /etc/system parameter and rebooted; the error occurred once more and then not again. Another reboot could reintroduce it, but it will then disappear as the page is released.
This appears to be related to Solaris 9 only. I upgraded another system from 8 to 9 without applying any patches and got the error within a few days. I applied the full Solaris 9 patch set and the errors stopped.
 
ponetguy2;

If you have all software bases covered, you could try moving the simm on board 3, slot J3801, to another slot and see if it errors in the new slot. If it does, I would replace the simm. I agree with marrow that it is unlikely to be a CPU. If you have multiple errors, check whether the CPU being called out has changed; most likely it has.

Thanks

CA
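One rough way to run the check suggested above (has the reporting CPU changed while the module stays the same?) is to tally both from the log. This is only a sketch: the two regular expressions are modelled on the UltraSPARC-II [AFT0] lines quoted earlier in this thread, the helper name summarize_ce is made up for illustration, and it assumes each message sits on a single line in /var/adm/messages.

```python
import re
from collections import Counter

# Patterns modelled on the [AFT0] messages quoted in this thread (an assumption;
# other CPU families word these messages differently).
CPU_RE = re.compile(r"Corrected Memory Error detected by (CPU\d+)")
MODULE_RE = re.compile(r"Memory Module (Board \d+ J\d+)")

def summarize_ce(lines):
    """Count corrected-error reports per detecting CPU and per memory module."""
    cpus, modules = Counter(), Counter()
    for line in lines:
        cpu = CPU_RE.search(line)
        if cpu:
            cpus[cpu.group(1)] += 1
        module = MODULE_RE.search(line)
        if module:
            modules[module.group(1)] += 1
    return cpus, modules

# Two lines taken from the first post, with the wrapped text rejoined.
sample = [
    "Oct 16 14:46:23 xpress10 SUNW,UltraSPARC-II: [ID 942467 kern.info] [AFT0] "
    "Corrected Memory Error detected by CPU14, errID 0x0006c63a.30457c17",
    "Oct 16 14:46:23 xpress10 UDBL Syndrome 0xf4 Memory Module Board 3 J3801",
]
cpus, modules = summarize_ce(sample)
print(dict(cpus))     # {'CPU14': 1}
print(dict(modules))  # {'Board 3 J3801': 1}
```

If many different CPUs show up against one module, that points at the module rather than a CPU. To use it on a real log, pass the open file object instead of the sample list.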
 
I proposed editing the /etc/system file to my boss, but he wasn't too excited about that suggestion. I guess my only option is to apply the patch. He's not convinced that this is just a bug with Solaris 9. I'll try to dig up documentation from Sun regarding this error message.

thank you everyone for your suggestions.
 
Good luck, ponetguy2. It may well be that you do have a board problem, given the message "Board 3 J3801 is Persistent" (I would log the problem with Sun, as I did).

However, FYI, the message our system received was:

Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 929717 kern.info] [AFT2] D$ data not available
Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 335345 kern.info] [AFT2] I$ data not available
Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 596501 kern.info] NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU0 at TL=0, errID 0x00007b35.b88325e8
Sep 18 12:46:51 host2 AFSR 0x00000000.1000024a<FRC> AFAR 0x00000002.022c9ad0 INVALID
Sep 18 12:46:51 host2 Fault_PC 0x103609c Esynd 0x004a J_AID 1
Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 810412 kern.info] [AFT0] errID 0x00007b35.b88325e8 Data Bit 15 was in error and corrected
Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 596501 kern.info] NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU0 at TL=0, errID 0x00007b35.b88325e8
Sep 18 12:46:51 host2 AFSR 0x00000000.1000024a<FRC> AFAR 0x00000002.022c9ad0 INVALID
Sep 18 12:46:51 host2 Fault_PC 0x103609c Esynd 0x004a J_AID 1
Sep 18 12:46:51 host2 SUNW,UltraSPARC-IIIi: [ID 810412 kern.info] [AFT0] errID 0x00007b35.b88325e8 Data Bit 15 was in error and corrected
 
I still think this is a hardware problem rather than a software one. It is most likely a bad DIMM, but it could be a system board or (remotely) a CPU. It could also be, perhaps, a centerplane.

Sun is the only one who can determine this for sure.
 
One thought-

Does cediag disable ECC before it runs?

If it runs with ECC enabled it's probably not going to find any problems.
 
Hello everyone. Here is what I've discovered regarding these error messages. It seems that everyone is correct. For the time being, this error should not cause problems. However, if it persists on a consistent basis, then I have a big problem.

I will need to replace the memory in question, update OBP, and add the latest patch for Solaris 9.

I found a great article on the web regarding this problem, along with other things which might be useful for people. Here is the page:
 
Memory Error Investigation

Page Retirement: is a feature implemented through the fixes to bug IDs 4484338, 4504686, 4880360, and 4915531.

A memory DIMM which is experiencing repeated correctable errors (CE, single-bit) might have an increased PROBABILITY of experiencing an uncorrectable error (UE, multi-bit). Likewise, the probability of a memory error condition that could result in system downtime also increases.

To help address this, new features have been implemented for UltraSPARC II-based, UltraSPARC III-based, and UltraSPARC IV-based systems. These features attempt to PROACTIVELY PREDICT which memory components (DIMMs) have an increased probability of experiencing an uncorrectable error, and subsequently remove this memory from future use when it is no longer used by the kernel or any processes.

CE Categories:
Intermittent/Transient Soft Error: a CE is considered intermittent if the error is not detected upon a reread of the affected memory word.
Persistent/Temporary Soft Error: a CE is considered persistent if the error is detected upon reread, but the scrubbing operation corrected it.
Sticky/Stuck-at Hard Error: a CE is considered sticky if after scrubbing, the error is still present
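
The three categories above come down to two yes/no observations: was the error seen again on reread, and was it still there after scrubbing. A minimal sketch of that decision follows; the function and parameter names are made up for illustration and are not taken from Solaris.

```python
def classify_ce(seen_on_reread: bool, seen_after_scrub: bool) -> str:
    """Classify a correctable error (CE) using the categories listed above.

    Both parameters are hypothetical names for the two observations:
    whether a reread shows the error, and whether it survives scrubbing.
    """
    if not seen_on_reread:
        # Error not detected on reread of the affected memory word.
        return "intermittent"
    if not seen_after_scrub:
        # Detected on reread, but the scrubbing operation corrected it.
        return "persistent"
    # Still present after scrubbing.
    return "sticky"

print(classify_ce(False, False))  # intermittent
print(classify_ce(True, False))   # persistent
print(classify_ce(True, True))    # sticky
```

By this scheme, the "is Persistent" message in the first post would mean the error was seen on reread but cleared by scrubbing.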

NOTE:
The CPU receives notification that a correctable memory error has occurred using the trap mechanism (refer to pg. 14 of Solaris OS Availability Features for detailed information).
These errors are reported by the memory scrubbers, which run every 12 hours by default and find memory faults.
Solaris 8 Kernel Update patch 117000-03 and Solaris 9 Kernel Update patch 112233-12 implement a more aggressive method of page retirement that is successful at retiring pages under a greater range of conditions.

Page Retirement: the page retirement feature enables a page of memory to be removed from use by Solaris in response to repeated ECC errors within a memory page on a DIMM.
The OS distinguishes between pages that have CEs and those that have UEs. A page with a UE that might be able to be cleared is marked as TOXIC. Pages mapped to a DIMM that has experienced multiple correctable errors are marked as FAILING.
If a page is marked as TOXIC, the OS attempts to clean any errors from the page using a SCRUBBING algorithm when page_free() is invoked on that page. If it can verify that there are no errors on the page after it does its SCRUBBING, it allows that page to be returned to the freelist. This ensures that a single error does not cause a page to be removed from the system. If the SCRUBBING is unsuccessful, the page is marked as FAILING and is immediately retired.
If a page is marked as FAILING, no attempt is made to clean the page by SCRUBBING. It is immediately retired if it is no longer in use by other threads (the page is not returned to the freelist, so it will not be used again until reboot, and the amount of available memory is decremented).
Aggressive Page Retirement (Solaris 8 Kernel Update patch 117000-03/Solaris 9 Kernel Update patch 112233-12): a new algorithm which successfully retires pages which are locked, dirty, or in COPY_ON_WRITE status.
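
The TOXIC/FAILING handling described above can be written out as a small decision function. This is an illustrative model of the description only, not kernel code: the Mark enum, the page_outcome helper, and the outcome strings are all invented names.

```python
from enum import Enum

class Mark(Enum):
    TOXIC = "toxic"      # page had a UE that might be clearable
    FAILING = "failing"  # page maps to a DIMM with multiple CEs

def page_outcome(mark: Mark, scrub_clean: bool, still_in_use: bool) -> str:
    """Hypothetical helper: what happens to a marked page, per the text above."""
    if mark is Mark.TOXIC:
        # Toxic pages are scrubbed when freed; a clean scrub puts them
        # back on the freelist, so one error does not cost a page.
        if scrub_clean:
            return "freelist"
        # An unsuccessful scrub marks the page failing and retires it at once.
        return "retired"
    # Failing pages are never scrubbed; they are retired as soon as
    # nothing else is using them.
    return "waiting" if still_in_use else "retired"

print(page_outcome(Mark.TOXIC, scrub_clean=True, still_in_use=False))    # freelist
print(page_outcome(Mark.TOXIC, scrub_clean=False, still_in_use=False))   # retired
print(page_outcome(Mark.FAILING, scrub_clean=True, still_in_use=True))   # waiting
print(page_outcome(Mark.FAILING, scrub_clean=True, still_in_use=False))  # retired
```

The "waiting" case is the one the aggressive retirement patches are described as improving, since they can retire pages that are locked, dirty, or copy-on-write.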


Current Status

cediag/cestat was installed on xpressdev1 (cediag/cestat is a utility from Sun for Solaris which diagnoses system memory errors).
cediag is scheduled to run at midnight on xpressdev1. No significant errors were found by cediag on 10/18/05 and 10/19/05.
cediag will be installed on xpress10 on 10/21/05 after trading hours.
No memory errors have been found on xpress10 since 10/16/05.
No memory errors have been found on xpressdev1 since 10/05/05.


Sources:

1)
2) Solaris OS Availability Features (
3)
 