eServer 325 stalls frequently

Status
Not open for further replies.

RamonS

Technical User
Aug 10, 2008
Hi!
I have an eServer 325 with dual Opterons and 2 GB of RAM. I got it cheap as a used unit and it generally works OK. I use it as a web and FTP server running Windows XP 64-bit (which shares its code base with Server 2003 64-bit).
My problem is that the GUI of Windows stalls after some time. Initially, I can still move the mouse, change applications, and start a few new applications, but soon after that the GUI freezes up and the system appears unresponsive. I can still access the web and FTP server without problems, but having the pizza box become unresponsive to user input is very annoying.
So far I have ditched the stock 120GB Maxtor IDE drive and replaced it with a Seagate SATA drive on a PCI-X SATA controller. I have also reinstalled the OS several times. The medium I install from works perfectly fine, and it is a genuine XP 64-bit installation CD.
I ran MemTest for several rounds and it all came up fine. I also updated the NIC drivers and dug up good 64-bit drivers for the Rage XL.
I also tried running no applications or web/FTP services at all, and conversely cranking up network and disk I/O while maxing out the CPUs. Neither makes a difference to how stable or unstable the system is. I still suspect the RAM or a heat issue, but the internal fans are not screaming the whole time the way they do when the box starts up, and all sensors show fairly low readings even under stress.
I also put in every security measure I could get my hands on (professional virus scanner, frequent SpyBot runs, Guest access disabled, firewall running with only ports 80 and 21 open, plus the box sits behind NAT). I also left the box off the network for some time, and that did not make any difference either.

I'm running out of ideas. Does anyone have a tip as to what to look at next?
 
How hot is the top of the server? It sounds like a CPU/RAM heat problem. Can you remove the top cover, bypass the cover switch, and look for a fan that may have failed? If not, I would download memtest86 (free) and run it for a day, or replace the memory. If that doesn't solve the issue, sell the memory on eBay and try running with a single CPU, testing each one in turn in slot 0. After that I would suspect the motherboard. Is the health light amber at all when it locks up? Does the System log in Event Viewer show anything when it locks up? How long does it run after a reboot before it locks up? I ran across the exact same symptoms today and replaced the memory. We'll see how that goes, but that was a Compaq DL360 G2.

Burt
 
Thank you very much for your reply. There is no warning light and the server is barely lukewarm. There are no relevant entries in the event logs. I do get some DCOM errors once in a while, but those events are hours away from when the system stalls. I can tell when it locks up because the clock in the taskbar stops. I had BOINC running for a while, which also let me tell how many hours the system holds up, and that can be anything from 45 minutes to a day.
I'll let memtest run again for a day and see if it comes up with anything. After that I'll swap the memory and then look at the CPUs.
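For anyone who wants hard numbers instead of watching the taskbar clock or BOINC, a heartbeat file is one option. This is a minimal Python sketch under my own assumptions, not something suggested in the thread: `heartbeat` appends a timestamp every minute, and after the reboot `find_runs` splits the log wherever two beats are more than a few minutes apart. The function names, the log path and the 5-minute gap are all hypothetical. Note that Apache kept serving pages during the stalls, so a user-mode script may keep ticking too; if it does, that itself suggests the hang is confined to the desktop session.

```python
import os
import time
from datetime import datetime, timedelta


def heartbeat(path, interval_s=60):
    """Append the current time to `path` every `interval_s` seconds, forever."""
    while True:
        with open(path, "a") as f:
            f.write(datetime.now().isoformat() + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the beat to disk so it survives a hard reset
        time.sleep(interval_s)


def read_beats(path):
    """Load the timestamps written by heartbeat()."""
    with open(path) as f:
        return [datetime.fromisoformat(line.strip()) for line in f if line.strip()]


def find_runs(beats, max_gap=timedelta(minutes=5)):
    """Split heartbeat timestamps into (first_beat, last_beat) runs.

    A gap longer than `max_gap` between consecutive beats is treated as a
    stall followed by a reboot, so a new run starts there.
    """
    beats = sorted(beats)
    if not beats:
        return []
    runs = []
    start = prev = beats[0]
    for t in beats[1:]:
        if t - prev > max_gap:
            runs.append((start, prev))
            start = t
        prev = t
    runs.append((start, prev))
    return runs
```

After a lockup and reboot, `find_runs(read_beats("heartbeat.log"))` gives one `(start, end)` pair per run, and `end - start` is roughly how long the box held up each time.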
 
Well, memtest didn't report any errors, but I got new RAM anyway. I noticed that the RAM in the server was PC2100 from HP, not the original IBM parts, and it was also not installed in the recommended slots. I got the original parts and tried those in the correct slots, but the flaky behaviour was the same. I then figured that, since it was not the RAM itself, maybe the problem was that each CPU did not have its own RAM bank populated. So I put the HP RAM back in for the second CPU and gave that a try, but uptime was as 'reliable' as before.
I then pulled one CPU and ran each of them in socket 0, but that generated the same results.
So I guess I need to find a replacement board and some new processors. Luckily it isn't a super expensive server board, but it isn't something I can afford at the moment either. I may see if I can get the MSI board instead; since everything is on the mainboard, it doesn't have to be the IBM version.

But before I go to extremes and basically buy a new server, I wonder if there is any issue with running the server without a mouse or keyboard, or rather, running it on a KVM. I have a PS/2 KVM and use a USB-to-PS/2 converter to attach the mouse and keyboard. That works OK when I boot up and use the server, and even switching back to the server works: after moving the mouse a bit, the mouse and keyboard sync back up and I can use them without problems. But often enough the system soon afterwards locks up halfway (I can typically still connect to the Apache web server and load pages). I also tried UltraVNC and connected only remotely, but that eventually fails too, because the VNC service stops accepting the incoming connection.

I'm tempted to rebuild the system and put Server 2003 on it, but I only have the 32-bit version. That should work, although it would hobble the 64-bit CPUs, but I guess at this stage it is worth a try.

Still, any hints, tips, or other helpful advice is greatly appreciated.

David

Burt - How did you make out with the Compaq?
 
hi,
I am not sure I have understood correctly, but if your
network sessions keep working when the server freezes,
I believe the problem is not the memory or the CPU.
If the problem is hardware, it may be related to the graphics adapter.

I am not sure, but I think Windows XP also lets you
enable Remote Desktop, and you could try using it during
the hang.

I believe the problem is related to the graphics system,
hardware or software. Make sure you are using the correct
(64-bit) device driver for your graphics adapter.

I would make this test: run the machine in VGA mode, either
by pressing F8 at boot and choosing VGA mode or Safe Mode with Networking, or by changing the driver (temporarily uninstalling it),
and use the server in this mode to see whether it still hangs.

ciao
vittorio
 
Thanks vittorio!
I had the 64-bit Rage XL driver installed; for now I have uninstalled it and am running the system with the base VGA driver from Windows. Let's see what happens.
 
Well, uninstalling the Rage driver and using only the base VGA driver did not improve things.
I went ahead and installed Server 2003 32-bit on the box, and it has now been running rock solid for almost three days. The server was never 'up' that long before. My guess is that some driver, either in XP 64-bit itself or a third-party driver (the only other one I added was for the Broadcom Ethernet adapters), was crappy.
I increasingly doubt that it was a hardware issue; it looks purely software related. I wonder whether Server 2003 64-bit is as flaky as its XP-labeled clone, but I don't have the money to find out.
I guess the problem got solved by working around it. Not really what I hoped for, but 32-bit is better than nothing at all.
 
Dual-core Opteron? There's a really evil AMD bug that causes timer drift and impacts all sorts of things. Put /usepmtimer in the boot.ini, then reboot. The boot.ini switch won't hurt anything if you don't have the issue, but if you do, the change is amazing. There is a Microsoft KB article that mentions the bug, but it erroneously lists cases where you don't need the switch. My experience has been to use it anyway; the KB article will be updated soon to reflect that.
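If you'd rather script the edit than do it by hand, here is a minimal Python sketch (an assumption on my part, not from the post) that appends `/usepmtimer` to every entry under `[operating systems]` in the text of a boot.ini and leaves entries that already have it alone. The function name `add_boot_switch` is hypothetical, and it only transforms a string: back up boot.ini first and write the result back yourself.

```python
def add_boot_switch(boot_ini, switch="/usepmtimer"):
    """Return boot.ini text with `switch` appended to each OS entry line.

    Only lines inside the [operating systems] section that look like
    entries (contain '=') are touched; entries that already carry the
    switch are left unchanged, so running it twice is harmless.
    """
    newline = "\r\n" if "\r\n" in boot_ini else "\n"  # boot.ini normally uses CRLF
    out = []
    in_os_section = False
    for line in boot_ini.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_os_section = stripped.lower() == "[operating systems]"
        elif (
            in_os_section
            and "=" in stripped
            and not stripped.startswith(";")  # skip comment lines
            and switch.lower() not in stripped.lower().split()
        ):
            line = line.rstrip() + " " + switch
        out.append(line)
    result = newline.join(out)
    if boot_ini.endswith("\n"):  # keep the original trailing newline
        result += newline
    return result
```

The `[boot loader]` section (including its `default=` line) is deliberately skipped, since switches only belong on the OS entries.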

To test for the issue,

1. do a ping. If the response time is negative like

Reply from 10.216.10.10: bytes=32 time=-3ms TTL=128
Reply from 10.216.10.10: bytes=32 time=-3ms TTL=128

Ping statistics for 10.216.10.10:

Packets: Sent = 10, Received = 10, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = -4ms, Maximum = -3ms, Average = 429496726ms


you have the problem.

2. If you see a Userenv event ID 1054 with "unexpected network error" and "group policy processing aborted" in the event text, then you have the issue.
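Both checks above are plain text matches, so they are easy to script. This Python sketch is an illustration under stated assumptions, and the function names are hypothetical: `negative_ping_times` pulls the `time=...ms` values out of saved ping output and returns the negative ones, and `is_userenv_1054_timer_symptom` tests an event's ID, source and text for the two phrases from step 2. `wrapped_average` is only a guess at why the average is so absurd, not confirmed by Microsoft or the poster: if ping summed the times in an unsigned 32-bit integer, ten replies totalling -34 ms (assuming the eight replies not shown were four at -3 ms and four at -4 ms) would wrap to 4294967262, and dividing by 10 gives exactly the 429496726 ms shown.

```python
import re

# Matches "time=12ms", "time=-3ms" and "time<1ms" in Windows ping replies.
REPLY_TIME = re.compile(r"time[=<](-?\d+)ms")


def negative_ping_times(ping_output):
    """Return every negative round-trip time (ms) found in ping output."""
    times = [int(m.group(1)) for m in REPLY_TIME.finditer(ping_output)]
    return [t for t in times if t < 0]


def wrapped_average(times_ms):
    """Average computed with the sum held in an unsigned 32-bit integer.

    Hypothetical model of how ping could print a huge average when the
    individual times are negative; not taken from ping's source.
    """
    total = sum(times_ms) % 2**32  # emulate unsigned 32-bit wrap-around
    return total // len(times_ms)


def is_userenv_1054_timer_symptom(event_id, source, text):
    """True if an event matches the Userenv 1054 symptom from step 2."""
    text = text.lower()
    return (
        event_id == 1054
        and source.lower() == "userenv"
        and "unexpected network error" in text
        and "group policy processing aborted" in text
    )
```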



 
Hi!

No, this is only a dual processor, single core system. I just got a pair of 250 Opterons for cheap. Will see if those go with that box.
 
Hi!

Just for anyone who may stumble over this thread: with BIOS version 1.36 (the latest available) I got the Opteron 250 OSA250CEP5AU to work. I had to put them in one at a time, but eventually both were detected correctly.
One of them runs at 63°C and the other at around 43°C 'core' temperature. I assume the higher temperature is for the one located towards the back. I know lower is better, but are those temperatures considered OK? Hmmmm, I guess that would be better asked in a new thread. Ah well, I'll see how it develops, and maybe it is a moot point.

And again, thanks for all your help!
 
63 at idle? It's getting up there, and will shorten the life of the CPU...

Burt
 
No, that's not at idle, that is under heavy load. I run BOINC on it, as this server is for hobby use only and only a handful of folks make use of the web server. So rather than have it idle, it can do some good deeds.
Right now the temperatures are typically below 60°C / 45°C, which is well within the operating range for the Opterons; according to AMD's specs it goes up to 70°C. I assume the original 246 processors ran at the same temperature, as they consume the same amount of power and have the same package design. Maybe I'll swap the CPUs at some point so that each one gets the cooler spot once in a while. I'm not surprised that the temperature difference is that big: the eServer 325 has CPU0 at the back of the case, behind the memory for CPU1, which sits at the front right behind the bank of fans. It doesn't seem to be the smartest design, but it works.
I did see 'active' heatsinks for 1U servers that contain a fan, but I wonder if the extra airflow would make up for the smaller cooling surface.
 
Probably... How is the server racked? That is very important for proper airflow, meaning front to back with no obstructions, and at least 2-3 feet away from the wall at the back. Also, what thermal compound do you use? I have always found that a good copper-core heatsink and Arctic Silver lower CPU temps.

Burt
 
The server is not in a rack; it sits on a desk on top of an eServer 235. I got the 325 for under $200 and paid just $80 for the 235, which is within my budget. The only used desktop racks I could find were over $500, and then I'd need to get the rails as well. I don't need more than the two servers.
I do wonder if putting some wood bars between them, such as two 1" square rods, would do any good. The servers are about a foot from the wall, so I will see if I can increase that space, but then again both servers have run fine in that spot for over a year now (which doesn't mean much).
None of the air intakes are obstructed, and there aren't any on the bottom or top of the 235 (there are some on the 325). Also, the fans run at low speed and rarely kick up a notch. So the fans are not screaming like they do on startup, which makes me believe that the temperature control doesn't consider the system to be excessively hot. And yes, the control works, as I have heard it kick in a few times.
 
Space the servers apart, move them away from the wall, and post the results. I would say at least 2" apart from each other...

Burt
 