RAM Failure. Does AIX compensate for ever?

Status
Not open for further replies.

jpor

Technical User
Nov 29, 2000
212
GB
Hi gurus,

I currently look after an H70 Enterprise server with 4 CPUs and what was 4GB RAM.

Recently I took the server down using shutdown -F now and waited for 'OK' to appear on the LCD on the outside of the server. Once it displayed, I pulled the power cable out of the back of the server and left it off for over 50 minutes. This was due to a power outage and low power from the UPS.

When I powered up the server, it showed only 3846MB of RAM instead of 4096MB. I now also have the following errors from the errpt:

BFE4C025 0526171904 P H sysplanar0 UNDETERMINED ERROR
---------------------------------------------------------------------------
LABEL: SCAN_ERROR_CHRP
IDENTIFIER: BFE4C025

Date/Time: Wed May 26 17:20:43
Sequence Number: 9654
Machine Id: 0042C55A4C00
Node Id: h70
Class: H
Type: PERM
Resource Name: sysplanar0
Resource Class: planar
Resource Type: sysplanar_rspc
Location: 00-00

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

Recommended Actions
RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0144 0000 0000 0036 C600 8401 1617 2900 2004 0526 0000 4942 4D2C 6D65 6D6F 7279
2D6D 25C3 8008 0001 0000 0000 0000 0000 4942 4D00 5031 2D4D 322E 3400 0002 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
3276 9278 014D A3B4 0000 0014 0044 3A18 014E 194C 050B 665C B99D 0BB8 3275 D160
050C 6A34 0000 0000 0000 0000 0044 3A28 327C E2F8 327C E000 0017 2009 3275 F580
0000 0000 0000 0000 0000 0000 327C F314 327C F2F8 327C F000 0017 2000 0002 0000
0000 0001 45BF 0000 0000 0000 0000 34E0 014E 1AC8 0000 34C8 0000 0000 0044 3AC8
0000 0000 014E 0470 0000 0000 0000 0000 014E 194C 0000 0000 0000 34C8 0044 3AC8
327D 3050 3276 9000 B09C 05F0 0044 3BB8 4000 000A 0000 0004 014C 9E80 0000 0000
0000 000A 327E 8000 327F 6100 0044 3B18 3281 5100 0001 1696 FFFF FFFC 0026 C97C
B99D 0000 4000 0030 B09C 05F0 B09D 0000 0000 0000 327F 6100 0000 0000 0000 000A
0000 0000 0044 3EB0 327F 6100 0044 3B18 4242 2084 0007 9A64 0000 05F0 3275 F580
0026 C990 0001 1696 FFFF FFFC 0026 C97C B99D 0000 4000 0030 B09C 05F0 B09D 0000
4000 000A 000A 0005 0000 0000 0026 CB2C 014C A168 0000 0000 327F 6100 0044 3B78
2242 2084 014A 1F78 0000 0001 0000 0001 014C A5D0 403D 3938 0000 0001 0000 0014
003D 3938 4000 0000 003D 3938 327E 8000 0000 000A 0000 011F B09C 05F0 327E 8000
4000 000A 000A 0005 0000 0003 0000 000A 0000 0000 0044 3EB0 014C A838 0044 3BD8
0000 0000 014A E264 0000 0001 0044 3C28 0000 0014 403D 3938 003D 3938 0044 3BE8
B09D 5E2C 0000 011F 4000 0000 0044 3C58 2228 2220 0006 0EE4 0000 0001 327F 6100
003C F0D0 0000 0000 0000 0000 3275 F580 0000 000A 0000 011F 3275 F580 000F 9880
0000 4021 0000 90B2 003C F0D0 0044 3C58 0026 C990 0002 3040 FFFF FFFC 0044 3C58
B99D 0000 4000 0030 B09C 05F0 0044 3C58 4000 000A 000E 6100 0000 0003 0044 3C58
0000 0000 2FF3 B400 0000 0008 0040 5044 0000 0001 0044 3C90 0000 0002 4000 0000
B09D 0984 0044 3C90 B09D 5A90 0000 0005 0007 401D 0044 3C90 0000 0114 0044 3CD8
0000 4021 0005 E3A8 0000 0000 0044 3CE8 0000 0000 2FF3 B400 0000 0008 000E 67C0
0000 0001 000E 67C0 000E 67C0 0000 0001 3002 10C0 0044 3CF0 0000 0000 0002 EFC4
0001 161F E600 0000 0038 0218 2FF3 B400 0000 0000 2FF3 B400 6005 0114 0000 0000
B09D 0000 000E 67C0 C00F DC14 0005 0114 0000 0000 40B4 C3DB 0217 24FD 0000 0005
0000 0001 000E 67C0 E600 1000 0000 34E0 3002 10C0 0044 3D80 0000 0003 0002 E7E8
---------------------------------------------------------------------------
LABEL: SCAN_ERROR_CHRP
IDENTIFIER: BFE4C025

Date/Time: Wed May 26 17:19:43
Sequence Number: 9653
Machine Id: 0042C55A4C00
Node Id: h70
Class: H
Type: PERM
Resource Name: sysplanar0
Resource Class: planar
Resource Type: sysplanar_rspc
Location: 00-00

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

Recommended Actions
RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0144 0000 0000 0036 C600 8401 1617 2900 2004 0526 0000 4942 4D2C 6D65 6D6F 7279
2D6D 25C3 8008 0001 0000 0000 0000 0000 4942 4D00 5031 2D4D 322E 3300 0002 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
000E 6100 0000 0048 0002 0028 0043 8A38 0000 0000 0006 24C8 0000 0001 0043 8A68
0000 0000 0005 96C8 0000 0001 0043 8A78 0000 0000 0000 0008 B09C 050C 0043 8A38
0000 0011 0000 0001 0000 0001 0000 C722 0000 0011 B09D 0000 0000 0011 C000 0120
B000 0000 000E 6100 0000 0048 0000 02CE 0000 0008 C000 0000 17E8 2D70 0043 8AA8
0000 0037 0006 268C D000 00FB 0000 0038 0000 0080 D000 00FC 0000 0038 0043 8AC8
0000 0001 0000 0000 D000 0000 0043 8AB8 0043 8BEC 0005 99D0 D000 0000 0043 8AD8
0000 0000 0000 0001 0000 0001 0043 8AC8 0043 8BEC 0000 0000 0000 0002 0001 C067
0000 0000 0000 0000 0000 0001 0000 0001 0000 0011 402C 0D70 0000 0001 0000 0011
002C 0D70 4000 0000 002C 0D70 0043 8B08 B09D 21CC 0000 0067 0000 34C8 0043 8B18
0000 0001 0002 41A0 0000 0001 0000 0000 0000 0001 0000 0012 B09C 050C 0043 8B68
0000 0000 0000 0000 0000 0100 0043 8B68 0000 0011 0025 FE4C 002C 0D70 0043 8B68
B09D 21CC 0001 E710 4000 0000 0043 8BD8 2228 2020 0006 0EE4 0000 0001 5001 2000
002C 5728 4000 0000 002C 5728 326E B000 B09D 21CC 0000 0000 0000 0005 1000 0000
0000 000A 0000 0006 326F A210 0043 8BD8 4242 2022 0025 EDD4 B09C 050C 0040 5044
0000 C962 0000 0000 0000 0001 0000 0001 0043 8EB0 2FF3 B400 000E 6440 1000 0000
0000 9000 0000 0012 000E 6100 0000 000A B99D 0000 4000 0030 B09C 050C B09D 0000
4000 000A 0000 0000 0000 0003 0000 000A 0000 0000 0043 8EB0 326F A268 0043 8C28
4242 2020 0043 8C40 0000 0067 0001 C067 0026 C990 0043 8C60 B09C 050C 0043 8C38
0000 CAB3 0000 0000 0000 0001 0043 8C68 0040 4FE4 0036 F2BC 0000 0003 0043 8C68
0000 0000 2FF3 B400 0000 0008 000E 6440 0000 0000 0000 0002 C000 0000 0043 8C88
3002 1080 0005 8318 0000 0000 0043 8C88 FFFF FFFF 0043 8C90 0000 0000 000F AE00
0000 4021 0000 90B2 0000 0000 0043 8CD8 0000 0000 0002 3040 0000 0008 0043 8CD8
0000 FF3B C00F 0000 B09D 0000 0043 8CE8 3002 1080 0005 7EAC 0038 0218 0002 F0A4
0000 000A 0043 8CF0 0043 8EB0 0040 5004 000E 6440 002C 4E30 4000 0000 0000 0000
0000 CA8A 0000 0002 C00C 84B4 0004 4011 0000 002D B09D 0594 0000 0011 0043 8D68
0000 0000 0006 1A3C 0004 4011 0043 8D68 3002 1080 0007 104C 0036 F018 0043 8D58
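
The PROBLEM DATA words above are raw hex, but some of the bytes are plain ASCII. As a minimal sketch (the helper name `ascii_strings` is made up for illustration, and this is not an official errpt decoder), the following Python pulls printable runs out of the first two rows of the first entry's dump. It suggests that a memory location code is embedded in the data:

```python
import re

# First two rows of PROBLEM DATA from the first errpt entry (sequence 9654).
DUMP = """
0144 0000 0000 0036 C600 8401 1617 2900 2004 0526 0000 4942 4D2C 6D65 6D6F 7279
2D6D 25C3 8008 0001 0000 0000 0000 0000 4942 4D00 5031 2D4D 322E 3400 0002 0000
"""

def ascii_strings(hex_dump: str, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes found in a hex dump."""
    data = bytes.fromhex("".join(hex_dump.split()))
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [match.decode("ascii") for match in pattern.findall(data)]

print(ascii_strings(DUMP))  # ['IBM,memory-m%', 'P1-M2.4']
```

Run against the second entry's rows, the same sketch yields P1-M2.3, since the only differing byte there is 0x33 instead of 0x34. Both match the location codes that the diag output in this post reports.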


When I ran the diag it came back with this:

A PROBLEM WAS DETECTED ON Wed May 26 17:22:19 BST 2004 801014

The Service Request Number(s)/Probable Cause(s)
(causes are listed in descending order of probability):

A10-200: Resource was marked failed by the platform. System is operating
in degraded mode.
n/a FRU: n/a P1-M2.3

A10-200: Resource was marked failed by the platform. System is operating
in degraded mode.
n/a FRU: n/a P1-M2.4

Looks like the 2 RAM modules are either dead or faulty.

Does AIX compensate for this problem forever? The server seems to run fine with the missing 256MB RAM.

I am considering contacting IBM for an engineer's assistance, but due to the outage and lost productivity within the company I am hoping the system will run until the beginning of next week, when I can take the box down for repair.






( "To become Wise, first you must ask Questions")
 
hello jpor,

I'm not sure I follow you. Why do you want to take the server somewhere to have it repaired? If you call the AIX support line and your machine is still under warranty, they will send someone to make the repair.

If it's out of warranty, just ask for a quotation.

Phone numbers are available from here:

This hardware failure *shouldn't* be a problem. Potentially the server could page more (that depends on parameters other than the strict amount of RAM); if you're lucky, it was oversized and you won't see the difference. Of course, prepare yourself for numerous warning messages :)

regards
 
Letis. Thanks for the response.

Sorry if my question was not easy to understand.

The server is under a 24/7 H/W support contract. It is just that we have one server running for the business, and taking it down so soon to have the memory/card replaced would not make the managers here very happy, as they have already lost hours of production time.

I was just wondering whether the bad memory is kept at bay by the O/S so it doesn't get used until the RAM gets replaced?



( "To become Wise, first you must ask Questions")
 
Well, look in smitty chgsys and check the field "Amount of usable physical memory". If you see 3846 MB, that's good news: it means the faulty modules have been taken out of the memory the system will try to use.

Double-check with svmon -G (global memory usage in 4 KB pages); the total memory size it reports should equal (3846*1024)/4 pages.
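
The arithmetic above can be sketched in a few lines of Python. The function name `expected_pages` is hypothetical, and the 4 KB page size is the one stated in this post:

```python
PAGE_KB = 4  # svmon -G reports memory in 4 KB pages, as noted above

def expected_pages(usable_mb: int, page_kb: int = PAGE_KB) -> int:
    """Convert a memory size in MB to the equivalent number of pages."""
    return usable_mb * 1024 // page_kb

print(expected_pages(3846))  # 984576 pages if 3846 MB is usable
print(expected_pages(4096))  # 1048576 pages for the full 4 GB
```

A total reported by svmon that is close to the first figure, rather than the second, would indicate that the failed modules have been excluded.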

I also suggest that you avoid any operation requiring a server reboot, except of course the one needed to replace the faulty RAM.

regards,
 
Letis. Thanks for the commands to try.

In smit it does indeed show 3846MB being seen by the system.

The svmon output shows the following:

svmon -G

               size      inuse       free        pin    virtual
memory       983029     980492       2537      48596     109876
pg space    2097152       1269

               work       pers       clnt
pin           48596          0          0
in use       161350     819142          0


Looks like it's running okay on the reduced RAM.
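
As a rough cross-check of the figures above, the `memory` size column can be converted from 4 KB pages back to MB. This is only a sketch: the function name `memory_size_mb` is hypothetical, and the sample text is just the relevant lines of the svmon -G output from this post.

```python
SVMON_SAMPLE = """\
size inuse free pin virtual
memory 983029 980492 2537 48596 109876
pg space 2097152 1269
"""

def memory_size_mb(svmon_output: str, page_kb: int = 4) -> float:
    """Return the total real memory in MB from the 'memory' line of svmon -G."""
    for line in svmon_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "memory":
            return int(fields[1]) * page_kb / 1024
    raise ValueError("no 'memory' line found in svmon output")

total_mb = memory_size_mb(SVMON_SAMPLE)
print(f"{total_mb:.2f} MB in use by the system")
print(f"{4096 - total_mb:.2f} MB missing from the original 4096 MB")
```

The result is roughly 3840 MB, which is about 256 MB short of 4096 MB and in line with the "missing 256MB" noted earlier in the thread.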

Thanks for your help.


( "To become Wise, first you must ask Questions")
 
It will continue to run without problems until you schedule your repairs with the IBM CE.
 
I was just wondering whether the bad memory is kept at bay by the O/S so it doesn't get used until the RAM gets replaced?
>>>> Yes, you are correct. But even after a reboot the OS is not going to use that memory until it has been replaced, and you will need another downtime for that. A 24/7 IBM service contract is too expensive not to use, so try to chase them right away.
 
Bonsky. That will be the plan next week. Hopefully the backlog of work will have cleared enough for us to take the box down for the IBM engineer to replace the memory board/RAM.

This isn't the first time we have had them out for this problem. We had to get IBM out over the weekend a couple of weeks ago for the same problem, and they replaced the RAM. The error message I am getting is exactly the same as before, so I am thinking we have a problem with the memory card.


( "To become Wise, first you must ask Questions")
 