Backup/Copy issue (F80 & 4.3)

Status
Not open for further replies.

beardyboy

Technical User
Jul 26, 2002
19
0
0
GB
Each night, as part of the backup script, we copy some large database files from a “live” filesystem to a “copy” filesystem. We then backup (using “tar”) from the “copy” filesystem. This is so the backups can continue whilst we reactivate access to the live databases.

This system has worked perfectly well since its inception about 18 months ago.

About 2 weeks ago, we noticed that the backup script was “tar”’ing a 0 byte file from the “copy” filesystem. On further investigation we could see that the filesystem was being reported as full, yet it only held about half of its capacity in data.
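A guard along the following lines would flag a short or empty copy before tar runs. It is only a sketch: the helper name `copy_and_verify` and the placeholder paths in the usage comment are hypothetical, since the real backup script is not shown in this thread.

```sh
# Copy a file and fail unless the copy has the same byte count as the source.
# Hypothetical helper; the actual backup script is not shown in this thread.
copy_and_verify() (
    src=$1
    dst=$2
    cp "$src" "$dst" || exit 1
    # wc -c reads the whole file: slow on multi-GB files, but portable.
    src_bytes=$(wc -c < "$src" | tr -d ' ')
    dst_bytes=$(wc -c < "$dst" | tr -d ' ')
    if [ "$src_bytes" != "$dst_bytes" ]; then
        echo "size mismatch: $src=$src_bytes $dst=$dst_bytes" >&2
        exit 1
    fi
)

# Usage (placeholder paths):
#   copy_and_verify <live-file> <copy-file> && tar -cvf <tape-device> <copy-file>
```

Running tar only when the function succeeds means a full "copy" filesystem stops the backup with an error instead of silently archiving a 0 byte file.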

More to the point, after making sure it wasn’t in use and nobody else was on the system, we tried to unmount this filesystem, only to receive the message that it was in use.

A system reboot seemed to resolve the problem, allowing us to unmount/mount the “copy” filesystem and it would also show a correct figure on how full it was.

The problem returned after 2 days, however.

Today, we have rebooted the system, unmounted this filesystem and run an fsck on it. No problems were reported (and therefore none fixed!). Similarly, we ran the same command on the originating “live” filesystem while it was unmounted, with the same results.

We’ve run “diag” on the underlying physical disks and all is fine.

We are still not clear as to what could be causing this problem. Any ideas?



The system is an IBM RS6000 F80, about 2 years old, running AIX 4.3

 
Hi,

1. Could you please post the errpt output?
2. Possibly some file is being held open and improperly written to by some software.
Next time you fail to unmount the "copy" FS, run "fuser -uV <Filesystem>" to find out which process keeps it busy.
Then you can run "fuser -k <Filesystem>" to release it.
But first you must discover who the bad guy is ...
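The "look first, kill later" order can be wrapped up roughly as below. This is a sketch, not a tested AIX script: the function name `umount_or_report` is made up, and it uses the POSIX `-c` (whole filesystem) and `-u` (show user) options of fuser rather than the exact flags quoted above, so check them against your fuser man page.

```sh
# Try to unmount; if that fails, list the processes holding the filesystem.
# Deliberately never kills anything: review the list before using "fuser -k".
umount_or_report() (
    mnt=$1
    if umount "$mnt"; then
        echo "unmounted $mnt"
    else
        echo "$mnt is busy; processes using it:" >&2
        # -c: report on the filesystem containing the path, -u: add user names.
        fuser -cu "$mnt"
        exit 1
    fi
)

# Usage: umount_or_report <MountPoint>
```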

"Long live king Moshiach !"
 
Here's the error report that I've just been sent.


# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
369D049B   0903010703 I O SYSPFS         UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
E18E984F   0828160303 P S SRC            SOFTWARE PROGRAM ERROR
2BFA76F6   0828155803 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0828160303 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0828155603 T O errdemon       ERROR LOGGING TURNED OFF
C60BB505   0828105503 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
369D049B   0827010703 I O SYSPFS         UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
E18E984F   0825190603 P S SRC            SOFTWARE PROGRAM ERROR
2BFA76F6   0825190003 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0825190603 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0825185803 T O errdemon       ERROR LOGGING TURNED OFF
E18E984F   0825173403 P S SRC            SOFTWARE PROGRAM ERROR
2BFA76F6   0825172803 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0825173403 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0825172703 T O errdemon       ERROR LOGGING TURNED OFF
# errpt -a | more
---------------------------------------------------------------------------
LABEL: JFS_FS_FULL
IDENTIFIER: 369D049B

Date/Time: Wed Sep 3 01:07:57
Sequence Number: 384
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: INFO
Resource Name: SYSPFS

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

Recommended Actions
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM

Detail Data
MAJOR/MINOR DEVICE NUMBER
0026 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/copylive, /copydata/msm/live
---------------------------------------------------------------------------
LABEL: SRC
IDENTIFIER: E18E984F

Date/Time: Thu Aug 28 16:03:53
Sequence Number: 383
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: PERM
Resource Name: SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
PERFORM PROBLEM RECOVERY PROCEDURES

Detail Data
SYMPTOM CODE
256
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
named
---------------------------------------------------------------------------
LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Thu Aug 28 15:58:09
Sequence Number: 382
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
---------------------------------------------------------------------------
LABEL: ERRLOG_ON
IDENTIFIER: 9DBCFDEE

Date/Time: Thu Aug 28 16:03:19
Sequence Number: 381
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

Recommended Actions
NONE

---------------------------------------------------------------------------
LABEL: ERRLOG_OFF
IDENTIFIER: 192AC071

Date/Time: Thu Aug 28 15:56:52
Sequence Number: 380
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED OFF

Probable Causes
ERRSTOP COMMAND

User Causes
ERRSTOP COMMAND

Recommended Actions
RUN ERRDEAD COMMAND
TURN ERROR LOGGING ON

---------------------------------------------------------------------------
LABEL: CORE_DUMP
IDENTIFIER: C60BB505

Date/Time: Thu Aug 28 10:55:18
Sequence Number: 379
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: PERM
Resource Name: SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

Recommended Actions
CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
0
USER'S PROCESS ID:
0
FILE SYSTEM SERIAL NUMBER
-1
INODE NUMBER
-1
PROGRAM NAME

ADDITIONAL INFORMATION
Unable to generate symptom string.
---------------------------------------------------------------------------
LABEL: JFS_FS_FULL
IDENTIFIER: 369D049B

Date/Time: Wed Aug 27 01:07:59
Sequence Number: 378
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: INFO
Resource Name: SYSPFS

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

Recommended Actions
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM

Detail Data
MAJOR/MINOR DEVICE NUMBER
0026 0007
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/copylive, /copydata/msm/live
---------------------------------------------------------------------------
LABEL: SRC
IDENTIFIER: E18E984F

Date/Time: Mon Aug 25 19:06:38
Sequence Number: 377
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: PERM
Resource Name: SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
PERFORM PROBLEM RECOVERY PROCEDURES

Detail Data
SYMPTOM CODE
256
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
named
---------------------------------------------------------------------------
LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Mon Aug 25 19:00:08
Sequence Number: 376
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
---------------------------------------------------------------------------
LABEL: ERRLOG_ON
IDENTIFIER: 9DBCFDEE

Date/Time: Mon Aug 25 19:06:15
Sequence Number: 375
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

Recommended Actions
NONE

---------------------------------------------------------------------------
LABEL: ERRLOG_OFF
IDENTIFIER: 192AC071

Date/Time: Mon Aug 25 18:58:57
Sequence Number: 374
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED OFF

Probable Causes
ERRSTOP COMMAND

User Causes
ERRSTOP COMMAND

Recommended Actions
RUN ERRDEAD COMMAND
TURN ERROR LOGGING ON

---------------------------------------------------------------------------
LABEL: SRC
IDENTIFIER: E18E984F

Date/Time: Mon Aug 25 17:34:39
Sequence Number: 373
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: PERM
Resource Name: SRC

Description
SOFTWARE PROGRAM ERROR

Probable Causes
APPLICATION PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
PERFORM PROBLEM RECOVERY PROCEDURES

Detail Data
SYMPTOM CODE
256
SOFTWARE ERROR CODE
-9017
ERROR CODE
0
DETECTING MODULE
'srchevn.c'@line:'350'
FAILING MODULE
named
---------------------------------------------------------------------------
LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Mon Aug 25 17:28:28
Sequence Number: 372
Machine Id: 0056BC8A4C00
Node Id: esp
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
1
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
---------------------------------------------------------------------------
LABEL: ERRLOG_ON
IDENTIFIER: 9DBCFDEE

Date/Time: Mon Aug 25 17:34:06
Sequence Number: 371
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED ON

Probable Causes
ERRDEMON STARTED AUTOMATICALLY

User Causes
/USR/LIB/ERRDEMON COMMAND

Recommended Actions
NONE

---------------------------------------------------------------------------
LABEL: ERRLOG_OFF
IDENTIFIER: 192AC071

Date/Time: Mon Aug 25 17:27:11
Sequence Number: 370
Machine Id: 0056BC8A4C00
Node Id: esp
Class: O
Type: TEMP
Resource Name: errdemon

Description
ERROR LOGGING TURNED OFF

Probable Causes
ERRSTOP COMMAND

User Causes
ERRSTOP COMMAND

Recommended Actions
RUN ERRDEAD COMMAND
TURN ERROR LOGGING ON
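
Both JFS_FS_FULL entries in the report above name /dev/copylive and were logged at 01:07. To pull just those entries out of a long "errpt -a" listing, a small awk filter like this one may help. It is a sketch that assumes the field labels look exactly as they do in the output above; the function name `fs_full_events` is made up.

```sh
# Read "errpt -a" output on stdin and print one line per JFS_FS_FULL entry:
#   <date/time>: <device, mount point>
fs_full_events() {
    awk '
        /^LABEL:/      { label = $2 }
        /^Date\/Time:/ { sub(/^Date\/Time:[ \t]*/, ""); when = $0 }
        # The line after the "DEVICE AND MOUNT POINT" heading holds the value.
        want_mount     { if (label == "JFS_FS_FULL") print when ": " $0
                         want_mount = 0; next }
        /^FILE SYSTEM DEVICE AND MOUNT POINT/ { want_mount = 1 }
    '
}

# Usage: errpt -a | fs_full_events
```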

 
beardyboy, did you ever find out what was causing this? How are the files that are being backed up getting from the database to the copy filesystem (through database commands or through AIX commands)?
 
Hi,

Can you please also post the errpt from the copy machine, for the time when the "FS full" message comes up on the original system?
Thanks

"Long live king Moshiach !"
 
Getting any info is a pain: my company isn't directly supporting the machine. The company supporting it uses us as a third party and comes to us with issues they don't know how to solve. Basically, the machine is reporting that the filesystem is full (4 GB) when it actually holds only a 2 GB file. I'll see what other information I have to hand, but I'm at a loss at the moment and lacking any decent info from the customer.


Here's the last of the info I have from them at the moment:

OK, the problem has occurred again. Looking at the /tmp space this time, we
had more than before so I'm not so sure this is a source of the problem
unless you can tell me otherwise

Filesystem     1024-blocks     Used Available Capacity Mounted on
Before (problem)
/dev/hd3             98304    20012     78292      21% /tmp

Recently (no problem)
/dev/hd3             98304    18140     80164      19% /tmp

Now (problem again)
/dev/hd3             98304    15824     82480      17% /tmp

Here is what it looks like at the moment;

# df -kP
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/hd4            196608   101732     94876      52% /
/dev/hd2           1605632   628812    976820      40% /usr
/dev/hd9var          65536    10552     54984      17% /var
/dev/hd3             98304    15824     82480      17% /tmp
/dev/hd1             32768     1832     30936       6% /home
/dev/csalive       2097152  1909140    188012      92% /msm/csadata
/dev/weblive        524288   219532    304756      42% /msm/web
/dev/copydata       819200   299680    519520      37% /copy
/dev/copytest      4096000  3101120    994880      76% /copydata/msm/test
/dev/jnlvol         262144     8268    253876       4% /msm/jnl
/dev/copycsa       2621440  1925596    695844      74% /msm/csacopy
/dev/mactest       4096000  3099344    996656      76% /msm/test
/dev/maclive       4096000  3124560    971440      77% /msm/live
/dev/copylive      4096000  4096000         0     100% /copydata/msm/live
# pwd
/
# cd /copydata
# ls -al
total 24
drwxrwxrwx 3 root system 512 Jun 29 2002 .
drwxrwxrwx 30 bin bin 1536 Sep 04 15:06 ..
drwxrwxrwx 4 root system 512 Aug 21 13:52 msm
# cd msm
# ls -al
total 32
drwxrwxrwx 4 root system 512 Aug 21 13:52 .
drwxrwxrwx 3 root system 512 Jun 29 2002 ..
drwxrwsrwx 2 root system 512 Sep 04 01:07 live
drwxrwsrwx 2 root sys 512 Sep 04 01:10 test
# cd live
# ls -al
total 3934200
drwxrwsrwx 2 root system 512 Sep 04 01:07 .
drwxrwxrwx 4 root system 512 Aug 21 13:52 ..
-rw-r--r-- 1 root system 2014302208 Sep 04 01:07 database.mpk
-rw-r--r-- 1 root system 0 Sep 04 01:07 database2.mpk
#

Once again, we can't unmount the fs.

# umount /copydata/msm/live
umount: 0506-349 Cannot unmount /dev/copylive: The requested resource is
busy.

Either form of the "fuser" command didn't show up anything. An interesting
issue occurred when I added the -f flag to fuser. Below is the command run
on this problem fs and on an OK one that wasn't being accessed.

# fuser -dfV /copydata/msm/live
/copydata/msm/live:
fuser: 0506-084 Extended read failed (sid=3970, off=804503272).

# fuser -dfV /copydata/msm/test
/copydata/msm/test:

#

My AIX documentation doesn't have this 7 digit code in (I hate it when I
can't find the error code in the documentation!! - there are so many missing
ones)



Running an fsck -p /copydata/msm/live produced the following results

/dev/copylive (/copyd): Bad Inode Map (NOT SALVAGED)
/dev/copylive (/copyd): Bad Block Map (NOT SALVAGED)
/dev/copylive (/copyd): Filesystem integrity is not guaranteed
/dev/copylive (/copyd): 9 files 4195256 blocks 3996744 free

As mentioned, the file system is still mounted and although not in use by
anyone (as far as we can tell) I suspect that these errors relate to the
fact that it is still mounted. This is pretty much what it did last time
until we rebooted, after which an fsck on the unmounted filesystem reported
no errors.
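
To put numbers on the mismatch in the listings above: df reports 4096000 KB used on /dev/copylive, while the only non-empty file visible there is 2014302208 bytes. The two helpers below are a sketch (both function names are made up). One picks full filesystems out of "df -kP" output and the other works out how much used space the visible files do not account for. Directory and metadata overhead is ignored, and mount points containing spaces are not handled.

```sh
# Read "df -kP" output on stdin; print mount points at or above a
# capacity threshold in percent (default 100).
full_filesystems() {
    awk -v limit="${1:-100}" '
        NR > 1 { pct = $5; sub(/%/, "", pct)
                 if (pct + 0 >= limit + 0) print $6 }
    '
}

# used_kb: "Used" column from df -kP; visible_bytes: sum of file sizes from ls -l.
# Prints the KB that df counts as used but ls cannot see.
unaccounted_kb() {
    echo $(( $1 - ($2 + 1023) / 1024 ))
}

# Usage:
#   df -kP | full_filesystems 90
#   unaccounted_kb 4096000 2014302208    # prints 2128908
```

With the figures above, about 2 GB is in use but not visible in the directory. That would fit either an unlinked file that some process still holds open, which is what the errpt recommended action hints at, or maps that are inconsistent while the filesystem is mounted.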
 
Hi,

1. Since the FS fails once the target file reaches roughly 2 GB, I would suspect that the ulimit for the user who is creating this file (user="nobody"?) is 2 GB, which is the default.

Find out this user's file size limit:

lsuser -f <UserName> | grep fsize

If it's set to 2 GB, increase it:

chuser fsize=-1 <UserName>

2. Also make absolutely sure that the fsck on the unmounted REMOTE system passes.
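
To check whether a given fsize value would allow a file of a given size, something like the following can be used. It is a sketch under assumptions: the name `fits_fsize` is made up, the limit is taken to be in 512-byte blocks (what AIX documents for fsize, though the units printed by "ulimit -f" vary between shells), and the limit value in the usage line is illustrative rather than taken from the customer's system.

```sh
# Return 0 if a file of size_bytes fits under an fsize limit given in blocks.
# limit: value as printed by "ulimit -f" or lsuser ("unlimited" or -1 = no limit).
# block_bytes: size of one limit unit; 512 is assumed here, check your system.
fits_fsize() (
    limit=$1
    size_bytes=$2
    block_bytes=${3:-512}
    case $limit in
        unlimited|-1) exit 0 ;;
    esac
    [ "$size_bytes" -le $(( $limit * $block_bytes )) ]
)

# Usage (illustrative limit value):
#   fits_fsize 4194303 2014302208 && echo "limit allows this file"
```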


"Long live king Moshiach !"
 
That sounds like a possibility, and it could be why this error has only started to show in the last few weeks, as the file has grown past 2 GB. I'll pass this on to the customer and see if it helps.

Thanks.
 
Also, make sure the filesystem is largefile enabled. (The default is to not be largefile enabled when you create a filesystem.) If it isn't, you'll have to remake it to get it largefile enabled.
 
Yes, correct.
beardyboy - use:

lsfs -q <FSname>

and watch for bf: true/false.
true means big file support is enabled on this FS.
false means bad news - as per "bi" above.
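
For use in a script, the same check can be reduced to an exit status. This is a sketch that assumes the attribute appears literally as "bf: true" in the "lsfs -q" output, as described above; the function name `is_largefile_enabled` is made up.

```sh
# Read "lsfs -q <FSname>" output on stdin.
# Exit status 0 if it reports "bf: true", non-zero otherwise.
is_largefile_enabled() {
    grep 'bf: true' > /dev/null
}

# Usage:
#   lsfs -q <FSname> | is_largefile_enabled || echo "not large file enabled"
```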

"Long live king Moshiach !"
 