Get errlog analysis by diag - offline

Status
Not open for further replies.

MoshiachNow

IS-IT--Management
Feb 6, 2002
1,851
IL
Hi,

I'd like to get "diag" to analyze the errlog for system issues by running diag offline:

diag -Bc

Starting diagnostics
Testing sysplanar0
Testing L2cache0
Testing mem0
Testing oppanel
Testing proc0
Testing proc1
Ending diagnostics.
A problem was detected. Run diagnostics.

diag -Bcv
Starting diagnostics
Testing sysplanar0
Testing L2cache0
Testing mem0
Testing oppanel
Testing proc0
Testing proc1
Ending diagnostics.
A problem was detected. Run diagnostics.

When I then run it interactively, it shows a memory issue,
namely a bad memory chip in P1-M1.9:

diag :

Error log information:
Date: Thu Jun 2 09:13:41 2005
Sequence number: 152
Label: SCAN_ERROR_CHRP
FRU: n/a P1-M1.9


I'd like to get the above info by running "diag" as an offline command with some parameters, like:

diag -xxxx

I'd appreciate ideas.
Thanks

Long live king Moshiach !
 
Have you tried "diag -cs" ? This would test all devices instead of just a base set. I am not sure if it will give the output you want but it is worth a shot.


Jim Hirschauer
 
Actually, not really.

-cs runs a real-time diag, while I'd like diag to analyse the errlog for system issues and report them, as the interactive diag does.

Long live king Moshiach !
 
From looking at the diag command you will probably have to write a script to capture the name of each device and run "diag -ced device" on each one :-( That's my take on it anyway.
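A minimal sketch of such a loop is below. It assumes `lsdev -C -F name` lists the configured device names one per line, and it uses the `-c`, `-e` and `-d` flags as quoted in this thread. The function name `run_ela_per_device` and the `DIAG_CMD` override variable are hypothetical, added only so the loop can be dry-run without touching the system.

```shell
#!/bin/sh
# Sketch only: run diag with error log analysis against each device name
# read from standard input (one name per line).
# DIAG_CMD is a hypothetical override; set it to "echo diag" for a dry run.

run_ela_per_device() {
    while read -r dev; do
        # Skip empty lines in the device list.
        [ -n "$dev" ] || continue
        echo "=== $dev ==="
        # -c: no console interaction, -e: error log analysis, -d: device.
        # stdin is redirected so the command cannot swallow the device list.
        ${DIAG_CMD:-diag} -c -e -d "$dev" </dev/null
    done
}

# Typical use on AIX (not executed here):
#   lsdev -C -F name | run_ela_per_device
```

Running it first with `DIAG_CMD="echo diag"` prints the commands that would be issued, which is a cheap way to check the device list before starting real diagnostics.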


Jim Hirschauer
 
There's a man page for diag...

The -e flag forces error log analysis

I would also use -A (advanced)

base system:
diag -c -e -A -S 1

I/O devices:
diag -c -e -A -S 2

networking devices:
diag -c -e -A -S 7

all devices:
diag -c -e -A -s


HTH,

p5wizard
 
I did check the man pages...
Again, the task is to "analyse the errlog", NOT to run the real-time diag.
The suggested commands above run without any error on the given system.
If I run diag -> Diagnostic Routines -> System Verification -> Problem Determination,

THEN I get the above errors related to the memory, taken from the current errlog:

Error log information:
Date: Thu Jun 2 09:13:41 2005
Sequence number: 152
Label: SCAN_ERROR_CHRP
FRU: n/a P1-M1.9



Long live king Moshiach !
 
Well then, try -v for system verification, but if I read the diag menus correctly, no ELA is performed in that case...

I've written a script to analyse the error log, using the errpt command with the -J, -K, -j and -k flags, but it is still learning new stuff whenever a specific problem occurs... I also use -s and -e to specify the report interval.

It does no analysis of sense data, though.
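To illustrate the idea (this is not the actual script mentioned above), a wrapper around errpt could look something like the sketch below. It assumes `-d H` selects the hardware error class, `-s` and `-e` take start and end dates in mmddhhmmyy form, and `-K` takes a comma-separated list of error labels to exclude. The function name `hw_errpt` and the `ERRPT_CMD` override are hypothetical.

```shell
#!/bin/sh
# Sketch only: report hardware-class error log entries for an interval,
# optionally excluding some error labels.
# Usage: hw_errpt START END [LABEL[,LABEL...]]
#   START and END are in mmddhhmmyy form, as in the errpt example in this thread.
# ERRPT_CMD is a hypothetical override; set it to "echo errpt" for a dry run.

hw_errpt() {
    hw_start=$1
    hw_end=$2
    hw_exclude=${3:-}
    if [ -n "$hw_exclude" ]; then
        # -K excludes the listed error labels from the report.
        ${ERRPT_CMD:-errpt} -d H -s "$hw_start" -e "$hw_end" -K "$hw_exclude"
    else
        ${ERRPT_CMD:-errpt} -d H -s "$hw_start" -e "$hw_end"
    fi
}
```

For example, `hw_errpt 0601000005 0606000005` would cover 1 June to 6 June 2005. The exclusion list is where a script like this "learns": labels that turn out to be noise get added over time.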


HTH,

p5wizard
 
Maybe you just need to run

/usr/lpp/diagnostics/bin/diagela ENABLE

and you'll get all the permanent HW errors in your mailbox. Instead of mailing the notifications, you can even have the system run your own script that perhaps logs the notifications somewhere central and then sends a mail.

See info on this page:

HTH,

p5wizard
 
As you have seen, -B is the only option that includes ELA, and that just prompts you to run diags if there was a problem. Even then it seems to be constrained by the ELA 30 day limit.
I have just tried it on a system with a Perm Hardware error over 30 days old on sysplanar0 (with no repair action logged) and diag -Bc ran clean:
# diag -Bc
Starting diagnostics
Testing sysplanar0
Testing oppanel
Testing mem0
Testing proc0
Testing L2cache0
Testing proc1
Ending diagnostics.
#
But by this time so do advanced diags in problem determination, because the error is over 30 days old.
From some dark corner of my memory I seem to remember that when you run diags in PD, it will only go back 30 days and when it is run by the system at 4 in the morning it only goes back 7 days.
I don't think there is a solution to your request.

I have also found some problems on POWER4 machines when running diags in system verification against real problems after the error report has been cleared. For example, on a 7029: remove a power lead, run diags in SV and it reports a PSU down; but clear the error report and log a repair action against sysplanar0, then run advanced diags in system verification against sysplanar0 (power lead still out) and the diags pass!
 
P.S.
What I have eventually used is the errpt sequence below, which gives me a summary of the HW errors for the last 5 days. That's more or less what I was looking for.
I will also script it to add the missing date.
Cheers.

errpt -AD -d H -s 0601173505 | grep -p 'Diagnostic Analysis'

Diagnostic Analysis
Diagnostic Log sequence number: 595
Resource tested: sysplanar0
Resource Description: System Planar
Location:
SRN: 651-812
Description: System shutdown due to: 1) Loss of AC power, 2) Power
button was pushed without proper system shutdown, 3)
Power supply failure.



Diagnostic Analysis
Diagnostic Log sequence number: 488
Resource tested: sysplanar0
Resource Description: System Planar
Location:
SRN: 25C38002
Description: Refer to the Error Code to FRU Index in the system
service guide.
Possible FRUs:
n/a FRU: n/a P1-M1.9


Long live king Moshiach !
 