Backups not completing but show "End Writing" in Job Details

broaw · Nov 4, 2005

Hello,

I'm running NetBackup 5.0 on Solaris 9 and sporadically see client backups that are still active even though the last entry in the job details is "End Writing" with a time from hours ago. The details also show the backup at 99% complete.

Basically it looks like the backup is complete but the scheduler is not issuing "release" commands to truly complete the backup. At this point the tapes are idle in the drives and no further backups will start.

I end up stopping the software, cleaning up any NB processes, clearing the IPC message queues, and then restarting the daemons. This is happening for both W2K and Unix clients, and it is not the same clients each time. It also happens on both fulls and incrementals.

Thanks for any advice in advance,

Bill

Vela · Nov 7, 2005

Curious....what tape library are you using?

If using STK, do you see a lot of errors like LH_UNMOUNT_ERR in the tail log?

broaw · Nov 7, 2005

It is a STK L700 with LTO 2 drives. I was in error in my previous post. The tapes are actually unmounted from the drives. I thought the tapes were still mounted, but it was actually another backup running. It seems like the scheduler just forgets to finish up.

Vela · Nov 7, 2005

What do you have in the logs directory?

Check out the bpbkar and bpbrm logs. I've seen this before, but not enough to warrant further investigation or a support call. Are these Unix clients the backups are hanging on? Kind of sounds like a Windows deal though...

Ryan

Vela · Nov 7, 2005

ah NM..I see you said Windows AND Unix..

You did what I'd try with stopping then starting services.
But since you say Unix clients too... Unix clients don't have a process that runs 24/7; nothing runs until the client is called upon to back up. Might be an issue with the media server or master then?

This could also be ACSLS not communicating back to Veritas that the tape has been unloaded. Or it could be the problem that I have: Veritas issues the unmount command, ACSLS can't satisfy the request immediately, and Veritas downs the drive...

I'd call both STK and Veritas and see what they have to say, because they hardly ever agree on any issue, but STK is right a lot of the time...

Ryan

broaw · Nov 7, 2005

I'm going to set up the logs as you suggested. One other thing I noticed on the master.

If I look up the backup's job ID using ps, it comes up as defunct. An active backup, by comparison, comes up with a bpsched associated with it. I'm seeing this on the master server.
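A minimal sketch of how the defunct entries could be pulled out of a process listing. It assumes `ps -ef` style columns (UID, PID, PPID, ...) and that defunct processes show `<defunct>` in the command column; the function name `find_defunct` is made up for illustration. The parent PID is printed because a defunct child stays around until its parent collects it.

```shell
#!/bin/sh
# Read `ps -ef` style output on stdin and print the PID and parent PID
# of every defunct (zombie) entry.
# Assumed columns: UID PID PPID ... CMD, with "<defunct>" in CMD.
find_defunct() {
    awk '/<defunct>/ { print "pid=" $2, "ppid=" $3 }'
}

# Usage on a live system (not run here):
#   ps -ef | find_defunct
```

If the parent PID turns out to be a bpsched process, that at least points at the scheduler not finishing up its children.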

DRFranco · Nov 7, 2005

I have seen this before too. Most often the scheduler was hung up on something. We have seen this, for example, when several hundred jobs are queued and clients are shut down for maintenance without the jobs being canceled first. If the jobs are then selected and canceled, the scheduler must time out on each job (300 seconds here) before moving on to the next task (canceling the next job). This can take hours depending on how many jobs are in the scheduler.

The jobs show up as active for hours while the scheduler times out connecting to each client before moving on. Jobs that have finished say "end writing" but stay active until all the other clients that were shut down reach their 300-second timeout and the scheduler moves on. The scheduler processes things in sequential order, so anything can hang it up and keep it from updating the activity monitor! At least the scheduler has been completely redesigned in NB 6.0; it is now a multi-threaded engine. Anyway, this activity can be seen in the bpsched log if you search for "timeout".
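The search for "timeout" could be sketched like this. The helper name `count_timeouts` is hypothetical, and the log path in the usage comment is an assumption about where the bpsched debug log lives on your master; substitute whatever file your installation actually writes.

```shell
#!/bin/sh
# Print how many lines in the given log file mention "timeout"
# (case-insensitive). A high count suggests the scheduler is spending
# its time waiting on unreachable clients.
count_timeouts() {
    grep -ci 'timeout' "$1"
}

# Usage (path is an assumption, adjust to your site):
#   count_timeouts /usr/openv/netbackup/logs/bpsched/log.110705
```

Dropping the `-c` flag shows the matching lines themselves, which is more useful once the count says there is something to look at.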

bswip · Nov 7, 2005

I am experiencing a similar problem. My Master and Media servers are Solaris 9 with NBU MP3.

I started noticing the jobs lagging and not completing. This turns out to be a possible memory leak in bptm (fixed in MP3A and MP4), a bpsched memory leak, or, as I am just discovering, a memory leak caused by the 'ce' drivers on Solaris 9.

My problem was also accelerated by the fact that I kept losing free memory but could never regain it. NBU would hang and the only recourse was to reboot (not something I like to do on Unix). Stopping NBU did not release the memory...

I've been working with both Sun and Symantec/Veritas for a week now. I could never get a dump because the system would never halt, that is, until today...

Now both Sun and Symantec have something to work with. I am currently monitoring the memory utilization of the kernel, and when it gets above 50%, I will force a core dump...

I just hope you are not in the same boat, since once it starts it will occur more and more often and manifest itself as if bpsched is not starting/ending jobs, or is using a lot of CPU...

broaw · Nov 7, 2005

bswip... thanks for the input

That gives me something else to check out. I started thinking today that it may be something locked in the OS. I killed the bpsched message queues (ipcs -aq / ipcrm -q #) and saw more jobs enter the queue. Obviously, I lost contact with the storage device by killing the message queues, but it opened something up. I'll check the memory tomorrow and see if it is hogged up.
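A cautious way to go about the ipcs / ipcrm step is to print the removal commands first and only run them after a look. The sketch below assumes the Solaris `ipcs -q` layout (columns T, ID, KEY, MODE, OWNER, GROUP, with `q` in the first column for message queues); the function name `plan_ipcrm` and the owner filter are illustrative additions, not something NetBackup provides.

```shell
#!/bin/sh
# Read `ipcs -q` output on stdin and PRINT (not run) one `ipcrm -q ID`
# command per message queue owned by the given user.
# Assumed layout: T ID KEY MODE OWNER GROUP, where T is "q".
plan_ipcrm() {
    owner="$1"
    awk '$1 == "q" && $5 == "'"$owner"'" { print "ipcrm -q " $2 }'
}

# Usage on a live system (not run here); review the output before
# pasting any of it into a shell:
#   ipcs -q | plan_ipcrm root
```

Filtering by owner keeps the list away from queues that belong to other applications on the same box.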

I agree, you should not have to reboot. I take care of a couple of SGI IRIX also and rarely had to reboot them for NB or for adding any peripherals for that matter.

bswip · Nov 8, 2005

Here are some helpful settings in /etc/system

*
*
* Debugging parameters to enable us to see
* who is hogging memory
*
*
set kmem_flags=7
set kmem_logging=0
set kmem_transaction_log_size=1
set kmem_content_log_size=1

In addition, you should check that the shared memory size is no larger than half your available memory. My Master has 8GB memory so my setting is :

set shmsys:shminfo_shmmax=4294967296
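The arithmetic behind that value is just half of physical memory expressed in bytes. A small sketch, assuming a shell with 64-bit `$(( ))` arithmetic such as bash (the legacy Solaris /bin/sh will not run it); `half_mem_bytes` is a made-up helper name.

```shell
#!/bin/sh
# Print half of the given memory size (in GB) as a byte count,
# suitable for shmsys:shminfo_shmmax.
half_mem_bytes() {
    gb="$1"
    echo $(( gb * 1024 * 1024 * 1024 / 2 ))
}

# Usage:
#   echo "set shmsys:shminfo_shmmax=$(half_mem_bytes 8)"
#   prints: set shmsys:shminfo_shmmax=4294967296
```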

For your message queue settings, Symantec gave me the following so that you can process about 30,000 jobs (queued and active):

*
* Message Queues settings
*
set msgsys:msginfo_msgmap=65536
set msgsys:msginfo_msgmni=2048
set msgsys:msginfo_msgtql=2048
set msgsys:msginfo_msgmax=2048
set msgsys:msginfo_msgmnb=524288
set msgsys:msginfo_msgssz=2048
set msgsys:msginfo_msgseg=32767

Then, when the system is running, I would set up a script like this:

# Print kernel memory statistics and a timestamp every 7 seconds
while true
do
    echo
    echo "::memstat" | mdb -k
    echo
    date
    sleep 7
done

This will give you a chance to monitor your memory usage. I was told by Sun that the Kernel should never consume more than 40% of memory, and when it does, it should be released back when the process completes.
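To avoid watching the output by eye, the Kernel percentage could be pulled out and compared against a limit. This is a sketch under assumptions: it expects `::memstat` to print a line starting with `Kernel` whose last field is a percentage like `45%`, and it needs an awk with `sub()` (nawk or any POSIX awk). The names `kernel_pct` and `kernel_over` are hypothetical.

```shell
#!/bin/sh
# Read `::memstat` output on stdin and print the Kernel percentage
# as a bare number (assumes the last field looks like "45%").
kernel_pct() {
    awk '$1 == "Kernel" { sub(/%/, "", $NF); print $NF }'
}

# Succeed (exit 0) when the Kernel percentage on stdin exceeds the
# limit given as the first argument.
kernel_over() {
    limit="$1"
    pct=$(kernel_pct)
    [ -n "$pct" ] && [ "$pct" -gt "$limit" ]
}

# Usage on a live system (not run here):
#   if echo "::memstat" | mdb -k | kernel_over 40; then
#       echo "kernel above 40% at `date`"
#   fi
```

The limit is a parameter because the two numbers in this thread differ: 40% is the figure from Sun, 50% is the point at which a dump gets forced.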

If you have dump set up, then when you are running low on memory, I was told to stop NBU and then issue the "reboot -d" command so that a core dump is produced. The kernel parameters above will tell Sun (or you, if you know how to read a core dump) who was using the memory...

My system is now at 2020M free after only 15 hours... I'll be rebooting shortly and sending the core dump to Sun.

I'll share what they find...
