Veritas 9 - A communications failure between BE9 & Remote Agents 1

Status
Not open for further replies.

Zebed00

Technical User
Apr 8, 2002
6
0
0
GB
Hi,

I'm wondering if anyone can help on this one please. We've recently installed Veritas Backup Exec 9.0 in a mixed Windows NT/2000 Domain. We are receiving the following failures randomly on a nightly basis.

Completed status: Failed
Final error code: a00084f9 HEX
Final error description: A communications failure has occurred between the Backup Exec job engine and the remote agent.

Final error category: Resource Errors

We've made sure that all the NICs are set to 100 Mbit and Full Duplex (as recommended by Veritas' TID) and we've ensured that NIC Teaming is switched off. The weirdest part is that sometimes it happens on one of the jobs and other times on a completely different backup set.

Please, if anyone has received this error in the past and found a solution, would they please let me know.

Thanks very much in advance,

:)
 
When you set the NICs to 100/Full, did you also do it for your switch? Having the switch set to auto and the NIC set to anything but auto sometimes causes sporadic communication problems.
 
I have had EXACTLY the same issue. NICs and switches have been set up as Veritas recommends and still the errors continue. The strangest thing, as I think was mentioned, is that the error floats. On one backup Server 3, Drive D might show an error; the next night Server 3, Drive D will be fine and Server 4, Drive G will be messed up. I am completely stumped. The most frustrating part is that over the past 7 weeks I have gotten exactly 3 successful backups; the rest have failed with this error.
 
I have also been fighting this same issue for the past few months.
What I've discovered is that it's not so much a communications issue between machines as it is a failure of my tape library to continue the backup onto a second tape once the first tape is full.
I still don't know what's causing this, because sometimes it works just fine and I can span 4-5 tapes no problem, but then other times it fails immediately after it fills one tape.
Right now it has me completely baffled. If anyone else knows what's causing this, please let everyone on here know!!!
 
I rebooted a server once during a backup and I got that same message. Funny thing though, my server was already backed up; not sure if it checks for a heartbeat or not.

Matt
 
What type of tape library are you using? I am using a Dell PowerVault 120T autoloader. I haven't noticed a correlation between the time of failure and loading new tapes. More information would be great. Were you able to solve the problem?
 
I'm using an Exabyte 430M library.
So far I haven't found a solution that I'm confident in. With the 430M, I can change its emulation mode to emulate a different library. I have it set to emulate a 480 right now...things seem to have improved some....but again, I'm not convinced it's solved.
 
I have found that when I get that error, creating a separate backup job for the server that reported it makes the error go away. I started out with one backup job a night for all of my servers, and I used to have the same problem you are having. I am now running 3 different backup jobs per night and I have not had that problem for a month now. I am only using 50% of my tape's capacity (IBM Ultrium 200GB), so the tape was never an issue for me.
 
I tried that as well...split my backup jobs into a number of smaller pieces. Things worked well for me for a couple of nights...until yesterday. Now I'm getting TFLE_PROGRAMMER_ERROR1. **sigh**
I'm ready to dump 9.0 and go back to 8.6
 
We just upgraded from BE 8.5 to 9.0 this weekend. I've got mixed feelings on BE 9 right now. I like some of the new menuing in 9 but in some spots it seems like they made common tasks a lot more difficult. It's working great at home, but at work it's kind of sucking.

We've got our media server connected to 2 autoloaders:
-Compaq Storageworks TL892 autoloader w/ 2 DLT 35/70 drives
-HP MSL5000 SDLT autoloader w/ 2 SDLT 110/220 drives on a SAN

The new MSL5000 we put in this past weekend is working great, however, the legacy TL892 (which worked well for a few years on BE 8.x) is now going offline and failing during jobs. Not to mention, we're starting to see a lot of SCSI and backup related errors in our event logs. The BE 9 failure message is:
CTSHENFPA5 FRIDAY FULL -- The job failed with the following error: TFLE_PROGRAMMER_ERROR1

then..

The drive hardware is offline!

Please confirm that the drive hardware is powered on and properly cabled.

yes, of course the power is turned on and the device is properly cabled - you were using it just fine a few minutes ago before you got stupid and took it offline!


BTW.. the upgrade from 8.5 to 9.0 was completely and thoroughly painful. Document all of your job properties and schedules; chances are you'll end up entering them all back in by hand when all is said and done.

Jas
 
I received a response from Veritas this morning regarding the TFLE_PROGRAMMER_ERROR1 message.
They have released a hotfix and new autoloader drivers for BE9 as of 6/20/03.
I was told that this error message is caused by a hardware problem or a driver issue with BE9. Veritas suggested uninstalling previous versions of the drivers and installing the latest release.
You can get the new drivers here:
You'll need the hotfix installed before installing the new drivers.
So far things look good for me. Compression on my library has improved with these new drivers, but right now I'm waiting to see if the job will finish as it spans multiple tapes.
 
Zebed00 and others:

Did anyone solve this problem themselves or get a definitive solution from Veritas? I'm getting this error only when backing up from one VLAN to another.

Veritas advises that I disable any NIC teams, set the NICs and switch to 100/Full from auto and go to build 4454. We were using teaming so I've made the changes. 4454 is the build we're on but it still occurs without the team. I've run the SGMon util and get:

bengine: 7cc 07/08/2003 13:48:23: OpenListenSocket: Media server IP address: 0

and then the job craps. IP address 0??
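
For anyone collecting SGMon output to attach to a support case, here is a minimal Python sketch that splits a line like the one above into fields and flags the "IP address: 0" case. The field names (process, thread ID, timestamp) are guesses from the one sample line in this post, not a documented SGMon format.

```python
import re

# Pattern inferred from the single sample line in this thread; the meaning of
# the second field (a hex thread ID) is an assumption.
LINE_RE = re.compile(
    r"^(?P<process>\w+):\s+"
    r"(?P<thread>[0-9a-fA-F]+)\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}):\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one debug line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def has_zero_listen_address(line):
    """True when the line reports the media server IP address as 0."""
    record = parse_line(line)
    return record is not None and record["message"].endswith("IP address: 0")


SAMPLE = "bengine: 7cc 07/08/2003 13:48:23: OpenListenSocket: Media server IP address: 0"

if __name__ == "__main__":
    print(parse_line(SAMPLE))
    print(has_zero_listen_address(SAMPLE))
```

Running this over a saved capture only tells you how often the line shows up and at what times; it says nothing about why the engine reports address 0.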

I have an open case with Veritas. I will be opening another due to the 'disable the NIC team and remove the 2nd adapter' instructions - all our mission critical servers are double NIC teamed - V9 won't be much good if it can't access those machines at all and/or run using the team.

I'll keep you posted should I receive anything helpful. If anyone has any suggestions I'd appreciate a reply.

thebigf1
 
Not really a complete response from Veritas; however, using a load of registry keys we now appear to get 98% of our backups working. Then, completely out of the blue, the "communications failure" error appears again. It'll flit from job to job for a few days and then disappear again.

The basics on the registry keys was to double the values (all except one, where in fact you had to halve the value). Veritas suggested we look at TechNote ID: 258159 and apply the registry changes. Even though it didn't seem particularly relevant, it does appear to have improved things slightly.

Hope it helps :-D
 
Is the TechNote ID in your message correct? I tried to find it on the Veritas site and had no luck. Thanks in advance.

 
Here's the article mate

TechNote ID: 258159 Last Updated: June 12 2003 06:38 PM GMT

Caution! The information in this TechNote is based upon certain assumptions, including product, operating system and platform versions. You can review this information in the TechNote Summary portion of this document. This document ( 258159 ) is provided subject to the disclaimer at the end of this document.


Symptom:
The error: "A timeout occurred waiting for data from the agent during operation shutdown" is returned when performing a backup operation with Backup Exec 9.0 for Windows Servers.
Exact Error Message:

a00084f8 HEX - A timeout occurred waiting for data from the agent during operation shutdown.

Solution:
This issue occurs when the timeout period expires for the Remote Agent for Windows Servers (RAWS).

To correct this problem, increase the timeout periods as follows:

1. Open regedit or regedt32 on the Backup Exec media server.
2. Increase the value of the following keys:
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Agent Browser\TCPIP\Expire Time to 1200 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Engine\Agents\Data Connection Flush Timeout Seconds to 1800 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Engine\Agents\NDMP Connect Open Time Out Seconds to 300 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Engine\Agents\Notify Data Halted Time Out Seconds to 300 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Network\TCPIP\Disconnect Delay to 1500 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Network\TCPIP\WorkBufferSize to 32768 (Decimal)
Set the registry value HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec\Engine\NTFS\Restrict Anonymous Support to 1. Create the value if necessary.
3. Stop all Backup Exec Services
4. Start up Backup Exec Services
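
If you have several media servers to change, typing these values into regedit by hand invites mistakes. Below is a minimal Python sketch that writes the TechNote's values out as .reg file text you can review before importing. Two assumptions: every value is treated as a REG_DWORD (the TechNote only says "Decimal"), and the header line is the one used by Windows 2000 and later (regedit on NT 4.0 expects `REGEDIT4` instead). The key paths are written with backslashes, as regedit displays them.

```python
# Values copied from TechNote 258159 as quoted in this thread.
# Assumption: all of them are REG_DWORD; the TechNote does not state the type.
BASE = r"HKEY_LOCAL_MACHINE\Software\VERITAS\Backup Exec"

TIMEOUTS = {
    r"Agent Browser\TCPIP": {"Expire Time": 1200},
    r"Engine\Agents": {
        "Data Connection Flush Timeout Seconds": 1800,
        "NDMP Connect Open Time Out Seconds": 300,
        "Notify Data Halted Time Out Seconds": 300,
    },
    r"Network\TCPIP": {"Disconnect Delay": 1500, "WorkBufferSize": 32768},
    r"Engine\NTFS": {"Restrict Anonymous Support": 1},
}


def build_reg_file(settings, base=BASE):
    """Return .reg file text that sets each value as a 32-bit DWORD."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for subkey, values in settings.items():
        lines.append(f"[{base}\\{subkey}]")
        for name, number in values.items():
            # .reg files store DWORDs as eight hexadecimal digits.
            lines.append(f'"{name}"=dword:{number:08x}')
        lines.append("")
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(TIMEOUTS))
```

Export the existing Backup Exec key first so you can roll back, and remember steps 3 and 4: the services still need a restart after the import.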


Hope it helps :-D
 
Has anyone been able to figure out this issue yet?

Completed status: Failed
Final error code: a00084f9 HEX
Final error description: A communications failure has occurred between the Backup Exec job engine and the remote agent.

Final error category: Resource Errors


In my case I noticed that I started getting this error in late July after I installed the latest hotfixes from Microsoft for NT4. Has anyone else noticed this, or is my experience just coincidence?

Any information would be helpful thanks.

Brandon (brandonk@triadperform.com)
 
I got the error for the first time last night, but have been having issues with the following error sporadically back to BENT 8.6.

Event ID: 11
Source: sonysdx-VRTS
The driver detected a controller error on \Device\Tape0.

0000: 0f 00 18 00 01 00 76 00 ......v.
0008: 00 00 00 00 0b 00 04 c0 .......À
0010: 01 01 00 00 85 01 00 c0 ....…..À
0018: 00 00 00 00 00 00 00 00 ........
0020: 00 00 00 00 00 00 00 00 ........
0028: 00 00 00 00 01 00 00 00 ........
0030: 00 00 00 00 00 00 00 00 ........
0038: 02 c4 00 00 00 44 04 00 .Ä...D..

Anyone else having the issue with your Veritas driver?

After I get the eventid 11 error, I have to restart the server to "unlock" the crash and be able to eject the tape.

Steve
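
The event data dump in the post above can be read as little-endian 32-bit values rather than raw bytes. Here is a minimal Python sketch that does so. The offsets 0x0C (error code) and 0x14 (final status) assume the dump follows the standard Windows I/O error log packet layout, which is an assumption on my part and not something stated in this thread. The ASCII column has been dropped from the copy of the dump.

```python
import struct

# Hex bytes copied from the Event ID 11 record above (ASCII column omitted).
DUMP = """\
0000: 0f 00 18 00 01 00 76 00
0008: 00 00 00 00 0b 00 04 c0
0010: 01 01 00 00 85 01 00 c0
0018: 00 00 00 00 00 00 00 00
0020: 00 00 00 00 00 00 00 00
0028: 00 00 00 00 01 00 00 00
0030: 00 00 00 00 00 00 00 00
0038: 02 c4 00 00 00 44 04 00
"""


def parse_dump(text):
    """Turn 'offset: b0 b1 ... b7' lines into a bytes object."""
    data = bytearray()
    for line in text.splitlines():
        _, _, hex_part = line.partition(":")
        # Only the first eight tokens are data bytes; anything after is ASCII.
        data.extend(int(token, 16) for token in hex_part.split()[:8])
    return bytes(data)


def dword_at(data, offset):
    """Read one little-endian unsigned 32-bit value at the given offset."""
    return struct.unpack_from("<I", data, offset)[0]


if __name__ == "__main__":
    data = parse_dump(DUMP)
    # Assumed offsets: 0x0C = error code, 0x14 = final NTSTATUS.
    print(f"error code:   0x{dword_at(data, 0x0C):08X}")
    print(f"final status: 0x{dword_at(data, 0x14):08X}")
```

For this dump the two values come out as 0xC004000B and 0xC0000185. As far as I know those correspond to a controller error and a generic I/O device error, which fits the event text, but look them up in Microsoft's documentation before drawing conclusions about the hardware.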
 
Hello guys,

I am having the same problem. When I looked in the Job Engine debug log, it said the DeviceManager had become idle. Does anyone know why this is happening? The log shows this twice: the first time it was a one-off and the tape continued writing; the second time it kept trying to resume the process but failed. Then the job sat idle for a very long period, and then the error occurred.

BTW, I have a problem after setting the server's NIC to FULL/100. I cannot connect to the server via Terminal Services once it's been set to FULL/100. Anyone know why?

My network colleague said that the switch is already set to FULL/100 by default, but I don't know why this keeps happening.

Laic
 
If hardware problems are causing this issue, check whether you are getting any Event IDs 7, 9, 11 or 15 in the System event log; these indicate SCSI errors.

I can't remember what each of them means, but I think that MS and Veritas have good Q articles and TechNotes on the subject.
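
If you export the System event log to a text file, picking out those IDs is easy to script. Here is a minimal Python sketch that assumes a CSV export with `EventID`, `Source` and `Message` columns; those column names and the sample rows are hypothetical, so adjust them to whatever your export tool produces.

```python
import csv
import io

# Event IDs named in the post above as indicating SCSI errors.
SCSI_EVENT_IDS = {7, 9, 11, 15}


def scsi_events(csv_text):
    """Return the rows of a CSV event log export whose EventID is a SCSI-related ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["EventID"]) in SCSI_EVENT_IDS]


# Made-up sample export; the first row reuses the event quoted earlier in this thread.
SAMPLE = """\
EventID,Source,Message
11,sonysdx-VRTS,The driver detected a controller error on \\Device\\Tape0.
6005,EventLog,The Event log service was started.
"""

if __name__ == "__main__":
    for event in scsi_events(SAMPLE):
        print(event["EventID"], event["Source"], event["Message"])
```

Comparing the timestamps of any hits against the failed job times is a quick way to see whether the communications failures line up with hardware errors.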
 