silicontundra
MIS
While running MemTest86+ v1.70 (latest with Win98SE boot floppy) to check the reliability of used RAM being installed as an upgrade in a popular, redeployed IBM xSeries 345 2U rack server (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era, with latest v1.21 BIOS (9Jun05) and v1.09 ISMP), the screen hung about 30 min into the test. After that the box would not boot: dark screen, no BIOS. It powers on with the green power-on LED lit on the front panel but otherwise appears totally dead. The box has dual IBM 350 watt power supplies, both with green LEDs on at the rear.
Have not seen this IBM xSeries server issue discussed when googling the newsgroups and Tek-tips, so this long solution is described here with additional questions for enhancing reliability of legacy IBM servers. Details follow, regards, Phil
-----
The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS): 4 sticks of 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz, PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) with Samsung SDRAM chips (K4H510638D-TC80), which got quite hot to the touch during this strenuous testing. This was most evident on the older chips with 2002 date codes, compared with those dated 2003.
The previous memory totalled 1GB: 4 sticks of 256MB registered ECC DIMMs FRU 09N4306 / 38L4029 with double-sided MicronT 46V32M4-75A chips (512MB (2x256MB) min). One pair had older late-02 and one pair mid-03 chip date codes. Going by the speed suffixes, the Micron chips are rated at 7.5 ns vs 8.0 ns for the Samsung chips; both appear to have sufficient design margin against the 10 ns clock period of the 100MHz operational spec. The memory sticks are installed in matched pairs for 2-way interleaved operation; factory labels face inwards.
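To put rough numbers on that margin, here is a minimal sketch. It assumes the speed suffixes (-75 on the Micron, 80 on the Samsung) denote minimum clock cycle times of 7.5 ns and 8.0 ns, which is my reading of the part numbers rather than something checked against the datasheets. The function names are made up for illustration.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / freq_mhz


def margin_ns(rated_cycle_ns: float, bus_mhz: float) -> float:
    """Slack between the bus clock period and the chip's rated minimum cycle time."""
    return clock_period_ns(bus_mhz) - rated_cycle_ns


if __name__ == "__main__":
    BUS_MHZ = 100.0  # operational spec quoted for these DIMMs
    # Assumed ratings read off the speed suffixes, not taken from datasheets.
    chips = {"Micron -75": 7.5, "Samsung 80": 8.0}
    for name, rated in chips.items():
        print(f"{name}: rated {rated} ns, margin {margin_ns(rated, BUS_MHZ):.1f} ns")
```

On those assumptions both parts have a couple of nanoseconds of slack at 100MHz, so rated speed alone would not seem to explain the hangs.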
Have any SysEs retrofitted gamer-type RAM chip coolers to servers in heavy production? There is only 3/16" (0.1875 in) between the chips of DIMMs in adjacent slots; the chip array is 4 3/4" long and a double-sided DIMM is 1/4" (0.25 in) thick (1/8" (0.125 in) on the 256MB). So the metal heat sinks can be no more than 1/8" thick, fins included. Does one remove the labels and use thermal paste for better plastic-to-metal heat transfer?
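A quick clearance check on those measurements, as a sketch only. It assumes the 3/16" figure is the free gap between the facing chip surfaces of two adjacent DIMMs, and that a cooler might sit on one or on both of those surfaces. The helper names are hypothetical and no allowance is made for airflow between the sinks.

```python
from fractions import Fraction

GAP_IN = Fraction(3, 16)       # measured gap between chips of adjacent DIMMs
PROPOSED_IN = Fraction(1, 8)   # heat sink thickness under consideration


def max_sink_thickness(gap: Fraction, sinks_in_gap: int) -> Fraction:
    """Largest thickness per sink if `sinks_in_gap` sinks share one gap."""
    return gap / sinks_in_gap


def fits(thickness: Fraction, gap: Fraction, sinks_in_gap: int) -> bool:
    """True if the sinks fit in the gap with no clearance left for airflow."""
    return thickness * sinks_in_gap <= gap


if __name__ == "__main__":
    for n in (1, 2):
        limit = max_sink_thickness(GAP_IN, n)
        ok = fits(PROPOSED_IN, GAP_IN, n)
        print(f'{n} sink(s) per gap: max {limit}" each, 1/8" fits: {ok}')
```

If both facing sides carry a sink, two 1/8" sinks add up to 1/4", which is more than the 3/16" gap, so each sink would have to stay under 3/32" in fully populated adjacent slots.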
When booting the box from cold, it powered on but there was no video from the integrated ATI RAGE XL chipset and no BIOS beeps. The Light Path Diagnostics panel showed nothing (latest Integrated Sys Mgmt Processor firmware). The blinking green LED next to the CMOS battery (ISMP activity) just shows that AC is connected to the box. The LEDs next to the DIMM slots showed nothing. Testing was done in pairs in DIMM slots 1 and 2, which are closest to the edge of the mainboard and case (HW manual p57-8).
IBM's Hardware Maintenance Manual (48P9718, 11th Ed, Feb 04, latest) showed no such diagnosis and no such remedy. See Chap6 Symptom-to-FRU index, p83-113. The closest symptom would be BIOS beep code 1-1-3 (CMOS write/read test failed, p83), but there were no beeps. The "No-beep symptoms" table on p86 was no help. Same with the other manuals, including Options p21-3 (48P9719, 1st Ed, 7/2002) and the User's Guide memory spec p3 and reliability p5-6 (48P9717, 1st Ed, 7/2002). The Installation Guide (48P9714, 2nd Ed, 7/2002) has Chap2 Installing Options, Memory p9-11, and Chap5 Solving Problems p27, where Table 3 says that for Boot Code = No beep one should call IBM Service.
The only real clues to solving the problem were in the HW Maint Manual under "Undetermined problems" p112 near the end of the chapter, which has Notes 1 and 2 on damaged data in CMOS and BIOS.
The undocumented fix was to remove the CMOS CR2032 3v lithium battery (FRU 33F8354), short the leads, and replace the battery, which resets the BIOS to defaults. Since this system board has an upright CMOS battery retainer, one needs to use insulated forceps (napkin) to pull the battery with one hand while holding back the retaining clip with a fingernail of the other hand (the flat positive + side with the mfgr name faces backwards). I advise also removing any ServeRAID board in Slot 2 (PCI-X 100MHz) for ease of access. Use a screwdriver in the black plastic clip to ease opening the Adapter Retainer without breaking the blue plastic pronged snaps. The manual procedures "Replacing the battery" p69-71 and "Installing a ServeRAID-5i adapter" p54-5 have diagrams with the details.
The box will then boot with a "161 Bad CMOS battery" error (p102), and the BIOS needs the time and date reset and the system-error logs cleared.
After several repetitions of this scenario, I was able to get both pairs of Samsung memory to pass at least one complete cycle of the test (about 45 min), but not much further than that before the box hung again and the CMOS battery had to be R&R'd once more. The newer pair had less trouble clearing the test hurdle and felt cooler to the touch.
Even with the chassis cover on and the full array of 8 chassis fans installed, the rear-most memory chips were noticeably hotter than the chips closer to the dual fans and CPUs. The DIMMs closer to the edge of the case also ran hotter, thus the newer pair's final destination was DIMM slots 1 and 2. The BIOS did spin down the fans after the initial startup with all 8 roaring away. At no time during the hour-long memory testing did the fans spin up, with room ambient at about 60F.
We now have 4GB of iffy RAM... any comments from fellow IBM SysEs? Did Samsung change (shrink) their process technology for fabricating half-gigabit (512Mb) DDR DRAM chips in the late 2002 timeframe? I used to think that Samsung set the world standard in memory chips. That was during the late 90s era, when Intel / Rambus RDRAM RIMMs were battling the SDRAM DDR world.
Are there more reliable memory chip mfgrs among those IBM uses as OEM suppliers, such as Hynix (another Korean), Elpida Opt 33L5039 (Japanese), Infineon (higher rated IBM FRU 09N4308 33L5039 CL2 PC2100) (German), and Micron (unbranded IBM compatible to FRU 10K0071) (USA)? Should we be looking at enhanced memory specialists such as Corsair, OCZ, Patriot, etc, which produce high quality DIMMs with integrated chip coolers? Or is IBM's ChipKill technology the best solution to the overheating memory chip issue? Has anyone installed this more expensive memory in xSeries servers?
BTW, this issue was cross-posted to news:comp.os.linux.hardware, news:comp.sys.ibm.pc.hardware.chips and the Tek-Tips.com IBM Server discussion group on 27Apr07.