Linux Cluster - Random Node Crash !

Status
Not open for further replies.

Thief

Technical User
Apr 19, 2001
92
0
0
US
Hi,

I have a peculiar problem with my Linux cluster: an application crashes my nodes at random, and I cannot identify the cause of the frequent crashes. Let me start by explaining my setup.
I have a 16-node, 32-CPU Linux cluster running Red Hat 7.2. It runs an application called lsdyna through batch software, and this application is what crashes the nodes. Initially I had 1 GB of swap for 2 GB of RAM. I increased the swap to 2 GB, which I thought would solve the issue, but the nodes keep crashing. I looked through the log files for any signs but couldn't find anything.

Is there any way I can find out the reason for the frequent crashes, such as commands to run or log files to check?

Any advice, suggestion, or comment would be very helpful.
Thanks in advance.

Thief................ ::)
(I think the surest sign that intelligent life exists elsewhere in the universe is that none of it has tried to contact us .)
 

Is the code dumping core? Use the ulimit command to allow core dumps before starting the process. If you can get a core dump, it can be helpful in a post-mortem.
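
As a rough sketch of that suggestion: the core-file limit is per process and inherited by children, so it has to be raised in the same shell (or start script) that launches the application. The function name `enable_core_dumps` and the commented launch line are made up for illustration; the real solver command depends on your batch setup.

```shell
#!/bin/sh
# Raise the soft core-file size limit up to the hard limit for this shell
# and every process started from it. Using the hard limit as the target
# avoids a "cannot modify limit" error where "unlimited" is not permitted.
enable_core_dumps() {
    ulimit -S -c "$(ulimit -H -c)"
}

enable_core_dumps
echo "core file size limit is now: $(ulimit -S -c)"

# Hypothetical launch line; replace with the real solver command:
# exec /path/to/solver input-file
```

If the printed limit is still 0, the hard limit itself is 0 and would need to be raised by root before a dump can be written.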
 
Hello ericbrunson,

I searched the system for core files, and here is the list. Should I be looking for core files with extensions?

/dev/core
/proc/sys/net/core
/usr/src/linux-2.4.9-31/net/core
/opt/gcc32/include/gnu/gcj/protocol/core

/dev/core is a link to /proc/kcore. Can this core be useful for debugging in any way?

Should I set the ulimit in the user's environment or in the start script for the process?



Thief................ ::)
(I think the surest sign that intelligent life exists elsewhere in the universe is that none of it has tried to contact us .)
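
For what it's worth, none of the four paths listed above is an application core dump: three are directories, and /proc/kcore is a view of the running kernel's memory rather than a dump left by a crashed program. A narrower search, sketched below, only matches regular files. The function name `find_core_files` and the example path are hypothetical, and as far as I know the dump may be named either `core` or `core.<pid>` depending on kernel settings, so both patterns are included.

```shell
#!/bin/sh
# List regular files named "core" or "core.<something>" under a directory.
# -type f skips directories, symlinks, and device nodes that merely happen
# to be called "core"; -xdev stays on the starting filesystem, so /proc is
# not descended into when the search starts from /.
find_core_files() {
    find "${1:-/}" -xdev -type f \( -name core -o -name 'core.*' \) 2>/dev/null
}

# Example: search the directory the batch jobs run in (hypothetical path).
# find_core_files /scratch/jobs
```

The `core.*` pattern also matches source files such as `core.c`, so running `file` on each hit is a reasonable way to confirm which ones are real dumps and which program produced them.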
 