I just went through one of those weekends that every systems administrator dreads -- a server disaster that has no explanation.
We are a small domain -- eight subnets, three controllers, about 100 clients. We're W2K all around on our servers, and our primary master also functions as our Exchange box (I know, I know, but we're a non-profit and money is tight). Friday afternoon we had a power failure, but I was able to get both of our in-house servers (primary master and firewall) shut down in an orderly fashion before the UPS ran out. When the power came back on, the firewall server came back up fine, but the DC would give me an lsass error before I ever got to a login screen. I tried booting to Directory Services Restore Mode, but no password that I had would work. Not the DS restore password, not administrator, not anything. Every other diagnostic mode of W2K would return the lsass error.
At this point, we got Microsoft involved, and the first engineer we talked to had us reinstall Windows to a new directory, join as a member server, and promote to a domain controller, replicating the AD from one of our other DCs. That worked fine until we rebooted after running DCPROMO. Same lsass error before a login screen, no joy on the DS restore password (which, by the way, we took great care to make sure we had correct). We're going a little nuts by now, of course. So we reinstalled Windows yet again, and this time we got an lsass error on reboot before we ever promoted to a DC. So the engineer concluded that the SAM database on every reinstall was getting corrupted, which was crapping out our passwords. He recommended that we reinstall just long enough to get the data off the server (we had a backup from Thursday night, but there had been significant work and e-mail activity on Friday before the power failure that we didn't want to lose). Since this is a RAID 5 server, he thought we should blow away the containers, rebuild them, and reinstall W2K from scratch. He said we had either a hardware failure or a nasty boot sector virus. If it was a virus, the rebuild and reformat would take care of it. If it was hardware, it would manifest itself again, and then we could get the manufacturer involved.
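For anyone curious what that "join as a member server, then promote to a replica DC" step looks like when you script it instead of clicking through the wizard, here's a rough sketch. It writes a dcpromo answer file and kicks off an unattended promotion. The [DCInstall] key names are from my memory of the W2K unattended-setup docs, and the domain name, paths, and credentials are made up for illustration, so double-check all of it against Microsoft's documentation before trusting it.

```python
# Rough sketch of an unattended "promote as replica DC" run.
# Assumes dcpromo.exe accepts /answer: (it does on W2K, as far as I recall)
# and that the [DCInstall] key names below match the unattend docs --
# verify both before using. Domain, paths, and credentials are placeholders.
import subprocess
import tempfile

ANSWER_FILE_TEXT = """\
[DCInstall]
ReplicaOrNewDomain = Replica
ReplicaDomainDNSName = example.org
DatabasePath = C:\\WINNT\\NTDS
LogPath = C:\\WINNT\\NTDS
SYSVOLPath = C:\\WINNT\\SYSVOL
SafeModeAdminPassword = PutYourDSRestorePasswordHere
UserName = administrator
Password = PutYourDomainAdminPasswordHere
UserDomain = example.org
RebootOnSuccess = Yes
"""

def promote_replica_dc():
    # Write the answer file to disk, then hand it to dcpromo.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", delete=False
    ) as answer_file:
        answer_file.write(ANSWER_FILE_TEXT)
        path = answer_file.name
    # dcpromo reads the answer file and does the member-server-to-DC
    # promotion without the wizard, pulling AD from an existing DC.
    subprocess.run(["dcpromo", "/answer:" + path], check=True)

if __name__ == "__main__":
    promote_replica_dc()
```

The point of scripting it is repeatability: when you're rebuilding the same box for the third time in a weekend, the answer file at least guarantees the DS restore password and paths are identical on every attempt.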
Fine. But here's the weird part. Our own virus scanner (up-to-date signature files) had found nothing before the crash, and a run of Trend Micro's HouseCall off the web revealed nothing either. Once we rebuilt the containers and started over, though, everything worked perfectly. No fuss, no muss. We got all the Exchange data back on Sunday, had most of the important files copied over Sunday night, and were back up and running as an organization Monday. It's been perfect since. It's hard to explain, but the server just "feels" better now. Response time from clients seems better, and the console on the server itself is more responsive.
So my question now is, what happened? I was under the impression that boot sector viruses could only be propagated via floppy, and we keep our servers in a locked room to which only I have a key. We run CA's eTrust virus scanner and it never saw a thing. I really doubt that's what it was. But if it's hardware, am I just waiting for the other shoe to drop now, or could there have been some sort of hiccup in the initial build of the containers 19 months ago that just deteriorated over time? If anyone has seen a similar problem, I'd love to hear their thoughts.

Me: We need a better backup system.
My boss's boss: Backup? We don't need no stinkin' backup!