
Dell Server - system partition full! Problems!


Yorkshireman2

Programmer
Jan 21, 2005
154
0
0
CA
I have inherited some IT duties since our IT man died recently.
Our oldest server (a Dell PowerEdge running Windows 2000 Server) keeps failing backups, and the system partition (C:) has only a few hundred MB of space left.
The C: (system) partition is only 7 GB anyway!
Another partition has only a few hundred MB left as well.
Backup software is Veritas Backup Exec 9.1, backing up to an external SCSI Quantum DLT drive.

1. Is the lack of space likely the reason for the backup failure? (The failure is always 'directory not found', and the job log shows the unfound directory is a non-existent path with corrupt characters in it. Removing the selected backup folders on either side of the corrupt line from the job just causes the same error in a different place.)

2. The RAID 5 array has four disks of about 37 GB each, but one failed a couple of years ago and was replaced with a drive twice that size - the array does not use the extra capacity because of the other small disks.
Is it safe to replace each of the other small disks with larger drives, one at a time, and allow a rebuild to occur before changing the others?

I have never done this, so it must be done safely - this server has ALL the important live stuff on it: mail server, database, Symantec server etc.

3. How long does a rebuild take?
Am I correct in assuming the server would be off-line during this?
Or will everything keep running?

Will the lack of space on the C: partition cause problems with the rebuild?

4. If step 2 is a safe, 'do-able' solution, will the Paragon partition software mentioned on this forum safely increase the system partition (C:), as someone suggested?
(I hope the backups stop failing then.)

Acronis also seems to have partition software that they say will work on RAID arrays and W2k Server - but does anyone have experience of this or Paragon that shows definitively which one I should use?

5. Any other tips/ideas welcome. No one else at the company seems worried, but I can see the whole thing coming down if nothing is done.




Yorkshireman2
 
1) Most likely. Before doing anything else, I would free up space on the C: partition. Start with a search/delete for *.tmp files, then delete files in \temp directories and temporary Internet files; search Google for "low disk space" for more ways to gain space.
After you gain approximately 1 GB of free space (at least 500 MB), run chkdsk, then back up before proceeding. I have run many Windows 2000 systems with small partitions and can always free a gig; it might take hours, but it can be done.
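For example, something along these lines from a command prompt - a rough sketch only, which assumes the default W2k system folder C:\WINNT; review what the search turns up before deleting anything:

  rem list the temp files first, so you can see what would go
  dir /s C:\*.tmp
  rem delete them once you are happy with the list
  del /s C:\*.tmp
  rem clear the system temp directory (W2k default location assumed)
  del /q C:\WINNT\Temp\*.*
  rem on the system drive, chkdsk /f schedules the check for the next reboot
  chkdsk C: /f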

2)"Is it safe to replace each of the other small disks with larger drives, one at a time, and allow a rebuild to occur before changing the others?"

The procedure you described is correct, there always a possibility something may go wrong, not likely, so do not proceed with out a backup.

3) Time depends on the RAID controller's speed and the disks - it could be a couple of hours for each drive replacement, but this can vary considerably. The server does not have to be off-line, but during the drive replacements background initialization will take place and the server will slow down considerably; doing this off-hours is the better choice. Lack of space will not be a factor, as you should not consider any of this until you free up space.

4) I have used both Acronis and Paragon, primarily Paragon, many times without issues, but I am always nervous playing with partitions. Once the entire disk space is regained, if you have multiple partitions you may need to shrink another partition to increase the system partition - no big deal.

5)"No one else at the company seems worried but I can see the whole thing coming down if nothing is done." You are correct, it will eventually go down and the repair will take hours, they should be worried.



........................................
Chernobyl disaster..a must see pictorial
 
Thanks Technome,

I sent a thanks notification after your reply and I am updating you now.
The system partition now has 1.2 GB of free space, and I replaced one of the RAID disks successfully. This job was moved up in the queue because a disk failed! OMG - the controller made a godawful beep all day while the disk was bad! I had to wear earplugs while I found the problem, changed the disk and left it to rebuild. The rebuild took the rest of the day and probably well into the evening.
It took me a while to establish exactly why the loud beeps were coming from the server, because no error messages appeared, even in the event log.
Then I discovered the array manager software was missing. I later found out from the boss: "Oh, I may have deleted that while clearing files from the drive."

After downloading and installing new array manager software (which wouldn't install without first upgrading the RAID driver) and two reboots later, a nice warning message popped up telling me of the disk failure (it would have been nice to know that at first).

Anyway - I'm on the road to getting it more reliable, so thanks again.
Now I just have to figure out why the DLT Backup Exec job fails with a corrupted log entry, and why the DAT backup job says it was successful but the tape has not ejected.

(I have a thread open on the corrupt log and job failure)


Yorkshireman2
 
Personally I like the array beeping noise, as clients and their employees stay away from the server, which gives me peace, if not quiet, while fixing the array.

1.2 GB is sufficient, but keep an eye on the disk space every couple of weeks; I start to worry at 0.9 GB free.
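A quick way to keep tabs on it from a command prompt - the last line of dir's output shows the bytes free on the volume:

  dir C:\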



........................................
Chernobyl disaster..a must see pictorial
 
Hi Technome,

Disaster has struck!

Following the plan of replacing one drive at a time to get them all to the larger size, I used this long weekend to go into work today (Sunday), shut down the server and replace another of the small drives. When I powered on, it wouldn't boot up! Something about a mismatch between NVRAM and disks.
It also didn't seem to recognise any boot drive.
It reported there were 0 logical drives.

So I thought that maybe that disk (channel 0 on the controller card) is reserved as the boot disk. So I put the original drive back, put the new one in channel 3 instead and tried again - same thing: it said there was a mismatch between the NVRAM and the disks.
Now I was worried. So I put the channel 3 drive back as well and tried again. Same thing.

I read the guide book on the CERC/PERC and it said there was a way to resolve the NVRAM mismatch.
So I entered the WebBIOS utility, as indicated on the start-up screen. It noted the mismatch and asked if I wanted to resolve it by using the configuration in the NVRAM or the configuration on the disks. I reasoned that since I had changed the disks and then put them back, the NVRAM should still have the correct configuration, so I chose to use the NVRAM configuration.

I rebooted from there. It still cannot find a boot drive. It keeps saying press F1 to try again, but no good.

I looked in the WebBIOS utility again and found that one channel of the RAID adapter was not recognised. I shut the server off, looked inside again and found the channel 1 IDE connector on the drive had been pulled out a bit accidentally (probably as I was moving the drive cage around).
I reseated it and tried again - the WebBIOS utility now recognises the drive and channel, but it still won't boot.

If I read the utility correctly, it seems it doesn't even recognise that the drives were in an array.
I tried going part way through adding the drives into an array, but it then began to look like it would reconfigure everything and wipe the data on the drives, so I backed out.

It seems that no logical disk is recognised and so no boot disk is found.

I don't understand how this has happened from just changing a disk drive. When that last disk failed, I just changed the disk, booted up and chose 'rebuild' in the array manager. I thought this would be the same job - done within 1 hour, with the rebuild running overnight during the long weekend.

Now the whole server is dead and our business has no email, no database, no contact with the outside world.
I tried to do a good thing during the long weekend and it seems the whole business is down now. I left a message on the boss's phone - haven't heard back from him.

What can I do to make the drives work again and the server boot up? The drives are all the same as before, with the same data as when it worked, but the computer will not recognise them.


Help!



Yorkshireman2
 
"So I thought that maybe that disk (channel 0 on the controller card) is reserved as the boot disk."
Data is distributed across all the disks in a RAID 5 array, so the entire array is the "boot" disk.

"So I put the original drive back and put the new one in channel 3 instead and tried again - same thing- it said there was a mismatch between the NVRAM and the disks."
Not a good move, as two disk were technically replaced at once. Added to this was the pulled connector. Hopefully the adpater has not failed the original array, but has just put on the brakes.

You need to get the array back to its exact state as of the last successful boot. That is, if a disk was replaced and background initialization started/finished before that boot, leave that disk in. If a disk is replaced in a RAID 5 and background initialization starts after the new disk is added, any "original" disk is then just another disk to the controller and must go through background initialization itself, basically making "original" disks useless for recovery.

Can you manually duplicate the original RAID setup, exactly?
If so, do it, but under NO circumstances do you answer yes to initializing the drives. If you MANUALLY initialize, the data is history, with no chance of recovery (a background initialization by the controller is different). "Configuration on disk" would be the best option, but only if the exact working set of disks can be duplicated - tread carefully here.

I will be out the next few days, so I suggest you go over to the Dell forum; two members, Dev Mgr and At5147, will provide reliable information.


........................................
Chernobyl disaster..a must see pictorial
 
Hi technome,


Panic over- it's up now.

Dell support was worried - it looked bad, but their solution of manually rebuilding was basically what you suggested, I think. Their fellow walked me through the steps and it worked.
Basically I had all four original disks back in there, as when it last booted. The problem was that the controller couldn't read the disks because it had lost the configuration.
When I resolved the mismatch between NVRAM and disk, I chose the wrong option. I should have chosen 'Use the disk configuration'.

This fix is a 50/50 one because data corruption could have occurred, but my only other option was to start again and install the O/S and Pass etc.

So first I entered the config utility and selected Objects -> Adapter. I selected Fast Initialization and set it to OFF. That prevents any auto-initialization of the disks after making the array again.

Next, I backed out of there and went into Configure -> View/Add Configuration. The utility scanned for disks and showed them as READY.
Starting with disk 0, I used the spacebar to set each one to ONLINE. The ONLINE number on each drive had to match the original disks as they were set up.

Then I pressed Enter to end the array and pressed F10 to configure it.
I selected the array (only one in my case) and it showed the configure page. I selected RAID 5, as my original was, and guessed at the stripe size etc. I accepted this, moved on, saved and exited.
I rebooted and it worked!


Several lessons to be learned:
1. To change a disk in a RAID array, first use the array manager to force the disk to a degraded/failed state, then shut down.

2. Only ever change one disk and power up at a time; if you see any errors, find out why and do NOT change another disk.

3. Make sure all connectors are fully seated before powering up - it's easy for a connector to pull out while working on a drive cage in a confined space.

4. If you get an error when you boot up, consider it carefully, especially if you see a mismatch between NVRAM and disk configuration.


Thanks again to technome and to Jay at Dell.

Yorkshireman2
 
Call me old fashioned, but I still like to set my servers up with RAID 1 for the OS and RAID 5 for the data.

Norm
 
Excellent,

"I should have chosen 'Use the disk configuration'."
Actually, if you had chosen this with the incorrect disks in, it might have caused the controller to fail the RAID 5, so by luck you chose the option which did not make the matter worse.

" 1. To change a disk in a raid array, first use the array manager to force the disk to degraded/failed state, then shutdown."
Should not make a difference, I do live pulls/replacements all the time, the critical part is to verify the "new " drive has taken the place of the old drive and background initalization has started/completed between each drive replacement.

Corruption should not be an issue; if the server boots, you should be OK. The only time you get corruption is if the OS is "writing" when an array goes down (freezes)... I have had that a couple of times, but only a few OS files took a dive, and an "over the top" repair install worked well.


You learned the major lesson: if you start to panic, back off. My fear about your RAID was that two disks had technically been removed; if the RAID adapter had recognized that you had arrays, it would also have failed the array. Thankfully the controller was confused and would not proceed.

While you're at it, shut down the server, mark your cable connections, and number your drives according to their slots. I use both Dell's OpenManage and LSI Logic's Power Console. Power Console lets you back up the RAID configuration to a file.

20 years ago I learned that lesson when I had a late-night screw-up. Panicking after being up nearly 24 hours, I pulled all the disks out of a downed server but got the drive order mixed up. At that time there was no drive roaming, so channel placement and drive order had to be maintained exactly... it took me nearly 6 hours of hit and miss to guess the correct order... I finished just as the office staff showed up for work.


........................................
Chernobyl disaster..a must see pictorial
 
Norm, I agree - RAID 1 for the OS and another RAID type for the data is the way to go, but yorkshireman2 can deal with that on his next server setup... he has had enough excitement for now.


........................................
Chernobyl disaster..a must see pictorial
 
I agree technome, he has had more than his share of excitement for one day.

By the way, I enjoyed looking at your Chernobyl web pages - certainly some lessons to be learned from that disaster.

Norm
 
Norm

As to Chernobyl, I'm not sure we have learned the lesson. What struck me was how many people were ordered, or volunteered, to enter the area just after the accident, unprotected from lethal radiation levels... I am sure many knew they would not survive.
Meanwhile, the higher-ups in the US deemed it necessary to place reactors near metro areas.



........................................
Chernobyl disaster..a must see pictorial
 
technome and normntwrk,

Thanks for both inputs - you're right about the limit on excitement - I need wine/beer now before my next step!

I actually didn't sleep well between the server going down and it finally going on line again, yet as soon as it worked I felt a weight fall off my shoulders. :)

I am still confident to begin replacing the next disk though. (Will his foolishness never end?)





Yorkshireman2
 
"I actually didn't sleep well between the server going down and it finally going on line again, yet as soon as it worked I felt a weight fall off my shoulders. :)"
About 6 years ago I had a RAID 5 destroyed by a firmware upgrade; I had 6 hours of sleep in a >72-hour period... it took me days to recover.

"I am still confident to begin replacing the next disk though."
As long as another drive connector does not come loose, you will be OK.<grin>

Do not forget to run chkdsk /f on the array.
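A minimal example, assuming the data volume on the array shows up as D: (swap in your actual drive letter):

  rem /f fixes errors; if the volume is in use, chkdsk offers to schedule the check for the next reboot
  chkdsk D: /f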


........................................
Chernobyl disaster..a must see pictorial
 
Hi Technome,

I will add one last post to this thread, to explain the outcome.
A couple of weeks ago I tried changing that disk and rebuilding. At the end of the rebuild, the server never rebooted. It could not find a bootable disk, despite identifying a logical disk. That was the end of it. Panic at work - one of the busiest couple of weeks began, during which I had been booked to train some paying customers, so I had to hand the server problems over to someone else.
The fellow who took the server away was pretty sure the RAID controller card had an intermittent fault, so he replaced it and installed the disks, but of course he had to start from scratch with blank disks - there was no way to recover the data.
Now, after two weeks, he finds he cannot restore the backup tapes to the server, so we have had to give up on it for now. The business has been running on an emergency mail server, set up for us on our ISP's site, downloading to email clients. Most people have lost all their emails, contacts, tasks etc. going a long way back. Some have had to use other mail clients because their Outlook client wouldn't start up at all; even when I tried changing accounts in the setup and starting them, they cannot find the new remote mail server.
One printer hardly works - it is an HP, connected by a parallel-to-Ethernet device called a ZOT P100s print server. I cannot get a connection to it when I try adding it as a new printer on my PC, and those who were originally connected to it get intermittent operation - sometimes it prints, but it takes about 2 minutes from pressing print to the output appearing.
The Symantec server is gone too, so it is not updating the clients - I have to update each client manually on a regular basis.
Our second server was never set up correctly as a full fail-over, so it has Active Directory replicated but no global catalog - hence problems; it will not promote to primary domain controller and many errors are occurring because of this.

In summary, my first post in this thread turned out to be a prophecy - as you agreed, I was right to worry, and it did all come crashing down.

I was annoyed at myself at first because I thought it was my fault that the disk rebuild failed. However, it seems the underlying problem was the RAID controller all along.
RAID 5 is supposed to be fault tolerant, but it seems the slightest thing wrong stops the disk array operating and it won't boot. Probably a corrupt entry caused by the controller, though.


Anyway, as I said, there is no point in prolonging this thread now - it's over and the server is dead.

Thanks again for all your help though.




Yorkshireman2
 
One last...
Sorry to hear the array did not survive.
RAID card failures, at least on LSI-based cards, are extremely rare. More often a disk in an array - not necessarily one which has failed - has issues. Often the drive's electronics have a failing or out-of-spec component which does not cause the RAID adapter to fail it. Hit a few of these and they are the stuff nightmares are made of.
To the point: be leery of re-using those disks in another array unless they are set up and tested on a non-production machine for a couple of weeks.



........................................
Chernobyl disaster..a must see pictorial
 
