Checkpoint resume not working - no entity found

Status
Not open for further replies.

skramer

IS-IT--Management
Nov 30, 2009
DE
Hi everybody,

For a couple of policies I have checkpoints set to 15 minutes (most of them in conjunction with compression). Some of these policies are not allowed to run during working hours, so they are suspended (Mon-Fri 7:00 am) and resumed (Mon-Fri 7:00 pm) by a crontab entry.
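
A schedule like the one described could be expressed with two crontab entries along these lines. This is only a sketch: the script paths are hypothetical placeholders, since the post does not say which commands the cron job actually runs.

```
# min hour day-of-month month day-of-week command
# Suspend at 07:00, Monday to Friday (hypothetical script path)
0 7 * * 1-5 /usr/local/bin/suspend_backups.sh
# Resume at 19:00, Monday to Friday (hypothetical script path)
0 19 * * 1-5 /usr/local/bin/resume_backups.sh
```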

Depending on the client involved, I sometimes decide to resume a policy manually. Usually this works fine, but it is not reliable: sometimes the jobs are not resumable (error 227, no entity found) and the collected fragments have been cleared from the disk.

I suspect that somewhere the information that the job is not done gets lost, but where and why?

I found some "keywords" which might be connected to this matter, but I am having problems gathering further information, so what about these:

- Pem Job State: PJS_SUSPEND_END (8)
- Task Incomplete, current status PEM_EC_INCOMPLETE (-2)
- Task completion callbacks
- nbpemreq -persisted screen

Both server and client are running NBU 6.5.4.

Thanks in advance,

Steph
 
Interesting, I am currently working on a similar case. When I work out what's going on, I'll report back.

Some things to check.

1. With the policy running, run nbpemreq -subsystem screen 1 - in the output for the given policy/client you should see checkpoint enabled.

2. In nbjm, do you see the jobid reporting as restarting as jobid2? I guess you will. Just above this you will probably see the original jobid reported as not found in the job mapper. This, I think, is where it is 'forgotten'. pem2 keeps a list of jobs in the job map; once a job is no longer in there, I don't think it can be resumed.

3. You can grep in nbjm for 'checkpoint' - do you see any checkpoint for your jobid?

4. The checkpoint is sent from bpbkar. Grep for 'CPR' to check whether the checkpoint is taken.
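
Checks 3 and 4 above come down to filtering a log for lines that mention both the job ID and a keyword. A minimal shell sketch follows. The function name, the log file argument, and the example job ID are illustrative assumptions, and the log first has to be available as a plain text file (where it lives and how it is formatted depends on the installation).

```shell
#!/bin/sh
# Print the lines of a log file that mention a job ID and a keyword.
# Usage: find_job_lines JOBID KEYWORD LOGFILE
# (find_job_lines is a hypothetical helper, not a NetBackup command.)
find_job_lines() {
    jobid=$1
    keyword=$2
    logfile=$3
    # -w matches the job ID as a whole word, so 1234 does not match 12345.
    # -i makes the keyword match case-insensitive.
    grep -w -- "$jobid" "$logfile" | grep -i -- "$keyword"
}
```

For example, `find_job_lines 1234 checkpoint /path/to/nbjm.log` or `find_job_lines 1234 CPR /path/to/bpbkar.log`, where both paths and the job ID are placeholders. If the lines of interest do not carry the job ID, a plain `grep -i` for the keyword is the fallback.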

I will have to investigate the keywords.

Martin
 
Hi Steph,

Think I have the answer for you.

There is a bug in NBU 6.5.4 that means manually run jobs with checkpoints will not resume.

Martin
 
Hi Martin,

thanks for your answer, but a bug cannot be the problem because sometimes it works and sometimes it does not. I have manually run jobs with checkpoints which I can resume one day and not another day.

I'm getting closer though - I guess the problem has something to do with the addresses. In the bpdbm logs I'm getting plenty of "Address already in use" messages.

For my suspended policy I cannot find an entry in the output of:
netstat -a | grep -i wait
The tcp_time_wait_interval is set to 30000, the client port window is from 20000 to 21000 and the server port window is from 20000 to 22000.
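
Note that `grep -i wait` also matches CLOSE_WAIT and FIN_WAIT states. To see how much of a port window is tied up in TIME_WAIT specifically, the netstat output can be counted per port range. The following is a rough sketch under some assumptions: the function name is made up, it expects `netstat -an` style output on stdin with the local address as the first address-like field, and it needs an awk that supports `-v` (nawk, gawk, or similar).

```shell
#!/bin/sh
# Count TIME_WAIT sockets whose local port lies in [LO, HI].
# Usage: netstat -an | count_time_wait LO HI
# (count_time_wait is a hypothetical helper name.)
count_time_wait() {
    awk -v lo="$1" -v hi="$2" '
        /TIME_WAIT/ {
            # The first field ending in ".port" or ":port" is taken
            # to be the local address.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /[.:][0-9]+$/) {
                    n = split($i, parts, /[.:]/)
                    port = parts[n] + 0
                    if (port >= lo + 0 && port <= hi + 0) count++
                    break
                }
            }
        }
        END { print count + 0 }
    '
}
```

With the client port window mentioned above, this would be called as `netstat -an | count_time_wait 20000 21000`. A count close to the size of the window would be consistent with the "Address already in use" messages, though it does not prove the cause.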

Any further suggestions?
 
Check the value set for "Move backup job from incomplete state to done state"
Also maybe "Image cleanup interval"

Bob Stump
VERITAS - "Ain't it the truth?"
 
Hi Steph,

I will have more for you shortly ... Still investigating.

Certainly 6.5.4 will not resume all types of backups. For example, Oracle backups (which are user-initiated) will not resume, but I believe SAP/SQL will.

The log to look in is nbpem, with Debug/Diag at least 4. Just grep for the jobid and the reason for the 'failure' to resume should be clear. bpbkar, bpbrm, bpdbm and nbjm are all involved, but pem makes the decision.

As mentioned, manually initiated jobs (bpbackup -i) seem not to resume no matter what; this is believed to be a bug in 6.5.4.

Martin
 
Firstly, my error: the bug was in 6.5.3.

These are the findings regarding the behavior of checkpoints in NetBackup.

In NBU 6.0, a user backup can go to incomplete state if there is an error, and it could be manually restarted from a checkpoint - but this is risky, because when the backup goes into incomplete state, the Oracle database is still locked.

In NBU 6.5.2 and later there is a new scheduler (PEM2), and with the new scheduler user backups cannot go to incomplete state - they cannot be restarted from a checkpoint. As far as I can tell this is undocumented.

As a side note - only scheduled backups will auto-retry (without operator intervention). A manual (immediate) backup (bpbackup -i) will go to incomplete, but must be manually resumed. Same for a user backup in 5.1 and 6.0.

Does this fit in with your issues?

Martin
 
Well then back to my suggestion:

Check the value set for "Move backup job from incomplete state to done state"



Bob Stump
VERITAS - "Ain't it the truth?"
 
@Martin

I want to manually resume the policies - that is what is not working! I am suspending the policies with a script called by cron because they are not allowed to run during working hours.

@Bob

"Move backup job from incomplete to done state" is set to 72 hours (the maximum I believe) and "Image cleanup interval" is set to 12 hours.
 