Reviewing history logs, how? 1

MCubitt · Nov 15, 2004

Running AIX 5.1, for some reason an Oracle DB shut itself down this morning. Checking the alert log, it is clear that Friday night/Sat Morning we had some unusual activity:

Usually the alerts show:

Code:

Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 352
Fri Nov 12 02:15:45 2004
ALTER DATABASE CLOSE NORMAL
Fri Nov 12 02:15:45 2004
SMON: disabling tx recovery
SMON: disabling cache recovery
Fri Nov 12 02:15:53 2004
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Thread 1 closed at log sequence 15869
Successful close of redo thread 1.
Fri Nov 12 02:16:02 2004
Completed: ALTER DATABASE CLOSE NORMAL
Fri Nov 12 02:16:02 2004
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archiving is disabled
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archiving is disabled
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Fri Nov 12 05:40:09 2004
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.1.0.

but on this ocassion:

Code:

Sat Nov 13 00:13:07 2004
Thread 1 advanced to log sequence 15997
  Current log# 1 seq# 15997 mem# 0: /redolog1/IFSL/redo01.log
  Current log# 1 seq# 15997 mem# 1: /redolog2/IFSL/redo01a.log
Sat Nov 13 00:26:52 2004
Thread 1 advanced to log sequence 15998
  Current log# 2 seq# 15998 mem# 0: /redolog1/IFSL/redo02.log
  Current log# 2 seq# 15998 mem# 1: /redolog2/IFSL/redo02a.log
Sat Nov 13 01:31:50 2004
Thread 1 advanced to log sequence 15999
  Current log# 3 seq# 15999 mem# 0: /redolog1/IFSL/redo03.log
  Current log# 3 seq# 15999 mem# 1: /redolog2/IFSL/redo03a.log
Sat Nov 13 05:40:01 2004
Starting ORACLE instance (normal)
Sat Nov 13 05:45:50 2004
SMON offlining US=11
SMON offlining US=12
SMON offlining US=13
SMON offlining US=14
SMON offlining US=15
SMON offlining US=16
SMON offlining US=17
SMON offlining US=18
Sat Nov 13 06:31:06 2004
Thread 1 advanced to log sequence 16000
  Current log# 4 seq# 16000 mem# 0: /redolog1/IFSL/redo04.log
  Current log# 4 seq# 16000 mem# 1: /redolog2/IFSL/redo04a.log
Sat Nov 13 08:00:06 2004

Is there any log files on AIX to monitor what MAY have taken please at server level? Eg shutdown, power failure, etc.

The errpt does not assist much (nothing reported).

Thanks

Applications Support
UK

MCubitt · Nov 15, 2004

Beg your pardon, logged in as root I see:

Code:

bash-2.05a# errpt -a -s 1110090004
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     C60BB505

Date/Time:       Fri 12 Nov 13:51:39 2004
Sequence Number: 1536
Machine Id:      005D87BA4C00
Node Id:         IFSSERVER
Class:           S
Type:            PERM
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
       12082
FILE SYSTEM SERIAL NUMBER
          23
INODE NUMBER
     2719831
PROGRAM NAME
oracle
ADDITIONAL INFORMATION
??
slcra 3F8
ssexhd 2C0
??
??
Unable to generate symptom string.

Which is really helpful!

Applications Support
UK

KenCunningham · Nov 15, 2004

Do you have a core dump file to examine (initially you could use strings core | pg ) for the readable parts of the core file, which may give you some clues. Were there any jobs being run at 13:51 or so whose log files might give you an idea? Is the database OK now?

alexhu · Nov 15, 2004

'uptime' will tell you when the system last started
also there will probably an interuption to the sar display if it was more than a few minutes

Alex

MCubitt · Nov 15, 2004

Thanks,

The uptime is
bash-2.05a# uptime
03:52PM up 26 days, 23:37, 4 users, load average: 5.56, 5.89, 5.56

Maybe time for a restart.

As for the core, sorry I don't understand (new to AIX still).

Thanks

Applications Support
UK

KenCunningham · Nov 16, 2004

Do a search (find . -name 'core*' | xargs ls -la {} ) for a core file created around the time your problem arose and that should enable you to extract details of the possible cause.

MCubitt · Nov 16, 2004

Ken,

Thanks.. it was rge xargs thing I wanted for something else too, so that was an extra help!

I found a core file for 8 Nov 3.18pm and 16 Nov (today) 9.55am.

The DB shut down at 9.30 on 15 Nov so no clues from that.

Applications Support
UK

alexhu · Nov 16, 2004

Martin

I didn't mean you should retsart the server, they don't need it unless you want to do maintenance. I meant that you could tell when the system was last rebooted i.e your system did not crash, just the database

Alex

MCubitt · Nov 16, 2004

Alex,

Fair enough.

Thanks.

Applications Support
UK

aixweb · Nov 16, 2004

look to /var/adm/messages
or
configure properly your /etc/syslog.conf to log system logs what you want.

DSMARWAY · Nov 16, 2004

hi
are you sure you don't have backups running because the alert log shows at 2:50 dbase was closed and about 5:40 instance was started ? did you start the dbase instance

type # echo $HISTFILE
to see if you have shell history set i.e. when users enter in commands it gets logged to the $HISTFILE if set
otherwise look in /.sh_history file

if you start oracle instance via a user , try logging in as the user , type ls -la and see if any history files are there. you can then see if any commands in there did sqlplus
sessions at the time of the dbase shutdown.

Also have you enough space on your system , in the oracle alert logs are there any ORA errors at the time of the
dbase shutdown ?

MCubitt · Nov 16, 2004

Thank you all, actually, I ventured to /var/adm/cron and reviewed the log file there.

A healthcheck job did run around teh time (0931), but probably coincedence:
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 08:30:00 2004
Cron Job with pid: 61622 Successful
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 09:00:00 2004
Cron Job with pid: 102422 Successful
oracle : CMD ( /home/oracle/keymed/IFSDBCOPY/copy.unix IFST IFSD ) : Mon Nov 15 09:19:00 2004
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 09:30:00 2004
Cron Job with pid: 118148 Successful
Cron Job with pid: 37502 Successful
Cron Job with pid: 166530 Successful
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 10:00:00 2004
Cron Job with pid: 108728 Successful
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_encl_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 10:30:00 2004
Cron Job with pid: 50394 Successful
root : CMD ( /usr/bin/errclear -d S,O 30 ) : Mon Nov 15 11:00:00 2004
root : CMD ( /usr/lpp/diagnostics/bin/run_ssa_healthcheck 1>/dev/null 2>/dev/null ) : Mon Nov 15 11:00:00 2004
Cron Job with pid: 183034 Successful
Cron Job with pid: 118370 Successful

We shutdown DB via cron at 2.10am and restrat at 5.40am so that'll be what you saw there.

the Oracle alert log showed as follows
Mon Nov 15 08:53:39 2004
Thread 1 advanced to log sequence 16026
Current log# 5 seq# 16026 mem# 0: /redolog1/IFSL/redo05.log
Current log# 5 seq# 16026 mem# 1: /redolog2/IFSL/redo05a.log
Mon Nov 15 08:53:54 2004
Thread 1 advanced to log sequence 16027
Current log# 1 seq# 16027 mem# 0: /redolog1/IFSL/redo01.log
Current log# 1 seq# 16027 mem# 1: /redolog2/IFSL/redo01a.log
Mon Nov 15 09:23:40 2004
Created Undo Segment _SYSSMU11$
Undo Segment 11 Onlined
Mon Nov 15 09:31:37 2004
Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 233
Mon Nov 15 09:34:30 2004
Starting ORACLE instance (normal)
License high water mark = 233
Mon Nov 15 09:34:52 2004
USER: terminating instance due to error 1092
Mon Nov 15 09:36:13 2004
PMON: terminating instance due to error 1092
Mon Nov 15 09:36:25 2004
Instance terminated by USER, pid = 110192
Mon Nov 15 09:36:51 2004
Starting ORACLE instance (normal)

Once it became apparent the DB was (attempting to) shut down, I shutdown abort to get it over with and to get in. Maybe this was the wrong action, but I was concerned to get the DB back.

I suppose potentially we could have had a space issue.. though I would have thought the errpt would have shown this?

You will see use and var are quite high in use:

Code:

          Filesystem 512-blocks      Free %Used     Iused %Iused Mounted on
            /dev/hd4     65536      24320  63%       2554    16% /
            /dev/hd2   8060928     402040  96%      69468     7% /usr
         /dev/hd9var     65536       1344  98%        975    12% /var
            /dev/hd3    851968     799992   7%        323     1% /tmp
            /dev/hd1     65536      60808   8%        235     3% /home
               /proc         -          -    -          -      - /proc
        /dev/hd10opt     65536       3904  95%       1214    15% /opt
       /dev/redo1_lv  32768000   15550784  53%         59     1% /redolog1
       /dev/redo2_lv  32768000   16534128  50%         58     1% /redolog2
     /dev/archive_lv  65536000   63478832   4%         26     1% /archivelogs
     /dev/oradata_lv 170393600   55098976  68%        110     1% /oracledata
    /dev/oraindex_lv 170393600  114227648  33%         70     1% /oracleindex
      /dev/ifsdoc_lv 403701760  366398744  10%       3108     1% /ifsdoc
    /dev/oraclefs_lv  32768000   17260696  48%      76613     2% /oracle

Perhaps we will never get to the bottom of this one!

Applications Support
UK

KenCunningham · Nov 16, 2004

/var could be the problem - have you tried clearing down some of the log files (under /var/adm if memory serves, wtmp, wtmpx etc? Processes like lp sometimes write to var as a temporary measure and it might be this that's causing the proble if there's no space on the device. Have you checked /var/adm/messages as suggested by aixweb?

MCubitt · Nov 17, 2004

There is no folder or file "/var/adm/messages" so I was unable to check that.

There was a large "samba.log" so I have cleared that.

I also cleared "smit.log" and "errpt.log" which were fairly large.

I am not too sure about these:
-rw-r--r-- 1 root system 6775609 21 Dec 2003 /var/ct/1569176140/run/mc/IBM.ERRM/core.0
-rw------- 1 root system 262144 21 Dec 2003 /var/ct/1569176140/log/mc/IBM.ERRM/trace.0

Presume I can remove them? They're quite old.

thanks

Applications Support
UK

KenCunningham · Nov 17, 2004

Apparently ERRM is 'Enterprise Rapid Recovery Methodology', whatever that might be. I would have thought it's safe enough to remove the files given their age, but someone with a knowledge of ERRM (quite an appropriate name really!) might be able to advise better than I. How much space have you released with your actions so far?

MCubitt · Nov 17, 2004

Ken, well, quite a lot fo space is recovered, no longer in the 90s but the 80s (fond memories!)

/dev/hd9var 65536 8960 87% 975 12% /var

I'll ask our unix supportish person about
-rw-r--r-- 1 root system 6775609 21 Dec 2003 /var/ct/1569176140/run/mc/IBM.ERRM/core.0
-rw------- 1 root system 262144 21 Dec 2003 /var/ct/1569176140/log/mc/IBM.ERRM/trace.0

Thanks

Applications Support
UK

KenCunningham · Nov 17, 2004

Yes, it's not looking too bad now. Did you clear down/truncate wtmp(x) as well, assuming you don't need them (using last for example). As far as the 80's go, I'm afraid I'm a 70's man myself. Pass me the zimmer!!!

KenCunningham · Nov 17, 2004

Good pice on ERRM here:

http://publib.boulder.ibm.com/infoc...c/rsct_books/rsct_admin_guide/bl503mst74.html

MCubitt · Nov 17, 2004

Shouldnt laugh, but did that and looks a lot, lot better!

/dev/hd9var 65536 28264 57% 976 12% /var

I had forgotten wtmp/x.

Server folder looking a lot healthier, eh!

thansk for all your help, Ken you get the star because as well as continued help you suffered the 70s (just kidding, the xargs tip alone was a revalation for me!)

Applications Support
UK

KenCunningham · Nov 17, 2004

Thanks! Must go for a lie-down now!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Reviewing history logs, how? 1

Programmer

Programmer

Technical User

MIS

Programmer

Technical User

Programmer

MIS

Programmer

Technical User

Technical User

Programmer

Technical User

Programmer

Technical User

Programmer

Technical User

Technical User

Programmer

Technical User

Similar threads

Log in

Part and Inventory Search

Sponsor