GeodSoft

The Stupidity of Windows NT Installs, Backups and System Recovery - 10/5/00

Due to the structure of the Windows registry, there is no assured way to quickly rebuild a failed machine unless hardware essentially identical to the failed machine is available. Further, whether or not an essentially identical machine is available, rebuilding a failed machine is needlessly complicated by the stupid installation process Micro$oft has given Windows NT. Important aspects of the install process are motivated by the commercial concerns of minimizing production costs and obsolete software inventory, not by providing the customer with a reasonably expeditious means of installing the software they have purchased.

- Hardware Failure
- NT: Multiple Installs and Reboots Required
- NT Install Insurance
- Swapping Disks
- Network Problems
- Registry Backup
- Incompatibilities
- Restoring the Original Server
- Lessons Learned
- Recovering an OpenBSD Machine
- OS Comparisons
- For the Future
- Keep NT Backup Servers Available

Hardware Failure

This discussion is prompted by the fact that my Windows NT server was down for a little over two days in early October 2000. Though I had current backups and multiple available machines to install them on, I was not able to get the server back in service until the original hardware was repaired.

I was nearby when the server failed. Audible alarms alerted me to the fact that the NT server could not be pinged and I looked at it. All lights had gone out. The top of the case was very hot and all checks of the cables, cords and switches revealed nothing amiss. I correctly concluded the power supply had failed.

I was not particularly concerned as I had recent backups and spare PCs with similar hardware. My original plan was to build a replacement server using backups and then get the original one repaired. I had two available Celeron 533s with 128MB RAM and 10GB hard disks. The NICs were the same in all PCs. The original NT server was a PIII 500. The only thing I saw as an incompatibility was the video, but I figured NT would use standard VGA modes.

NT: Multiple Installs and Reboots Required

If you get Windows NT in the fall of 2000, you get the same CD ROM that Microsoft started selling in mid 1996, plus one or more service packs and option packs. Every other vendor I've encountered sells upgraded versions of their products. When you buy the latest, you get the latest, sometimes with minor patches to be applied. Only Microsoft (that I'm aware of) keeps selling the same disks year after year and forcing their customers to perform multiple upgrades to get to a current version of the software.

Starting an install with software that's over four years old can create compatibility issues. The original NT couldn't deal with disk partitions over 4GB (which seemed huge then). You need SP 3 for this. 10GB is about the smallest drive you can buy today. Thus, before you can even deal with your whole disk, unless you want a bunch of little logical drives, you need to install NT 4 and then SP 3. If you're using NTFS it takes 4 reboots to get to that point (including the automated one where the FAT partition is turned into an NTFS partition).

If you're installing NT 4, service pack 4 or later and including IIS from the Option Pack the order of install is as follows. Base OS and then SP 3 to configure today's larger hard disks. 4 reboots. Install Internet Explorer 4.01; another reboot. Install the Option Pack with IIS and Index Server; another reboot. Install SP 4 - 6A; another reboot. Five installs that are order sensitive and 7 reboots; if you know what you're doing and have no problems, you can get through this sequence in about an hour and a half.
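The tally above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the stage names and the `Stage` structure are illustrative, and the reboot counts are the ones given in the paragraph above (base OS plus SP 3 counted together as four reboots).

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    installs: int  # separate install programs run in this stage
    reboots: int   # reboots the stage requires, per the text above


# Order matters: each stage depends on the ones before it.
STAGES = [
    Stage("NT 4 base OS, then SP 3", installs=2, reboots=4),
    Stage("Internet Explorer 4.01", installs=1, reboots=1),
    Stage("Option Pack (IIS, Index Server)", installs=1, reboots=1),
    Stage("SP 4 through 6a", installs=1, reboots=1),
]


def totals(stages):
    """Return (total installs, total reboots) for the sequence."""
    return (sum(s.installs for s in stages),
            sum(s.reboots for s in stages))


installs, reboots = totals(STAGES)
print(f"{installs} installs, {reboots} reboots")
```

Each third party driver that needs its own reboot would add one more entry to the list.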

If you have RAID, use any video over 16 colors or 800 x 600 resolution, or have just about any third party drivers, you can typically add another reboot for each. Sometimes you can stack several installs on one reboot but this can also cause problems. If you have several similar machines and multiple third party drivers to install it's worth trying but it's not worth the complications and risk of having to start over for one or two machines.

NT Install Insurance

For the past year and a half or so, all my NT installs have started with a minimal, non network, standalone install into WINNTbak. The only additional software that goes into this install is the backup software, installed to the same location as it will go in the real NT install.

I've seen NT delete the C: partition and turn what was the D: partition into C: when all I wanted was to do a completely clean install over the C:\WINNT directory. I've been informed by others that this is normal NT behaviour. With a fully functional (if rarely used) WINNTbak already installed, I've been able to quickly boot a machine regardless of how badly the "real" install may be corrupted. I have the option of wiping WINNT clean and reinstalling. At least twice before, doing a full restore or just a registry restore over a corrupted system has produced a correctly working machine in about half an hour. Care needs to be taken to keep passwords synchronized so you can use WINNTbak when you need it.

In this case, my plan was to do a new WINNTbak install on the new machine and then install an uncompress program. My NT server backups are .ZIP files written to writable CD ROMs. I didn't actually plan to install the real copy of NT but simply to uncompress the files off the CD to their appropriate locations. I never completed the basic install due to apparent hardware incompatibilities.

Swapping Disks

After I had SP 3 installed on the first Celeron, I tried partitioning and formatting the space remaining after the basic install on C:. With NT, the boot partition is still limited to 4GB. The first sign of problems was a dialog box that said the event log was full. The system log was stuffed with two alternating messages, several of each per second. One said the PC had "old or out-of-date firmware" and the other that the second partition "has a bad block."

I thought the Celeron had a bad hard disk and then tried an entirely different approach. Everything that the server needed was on the hard disk of the server that failed. Also the Celerons were quite similar to the failed server.

I removed the hard disk from the first Celeron and put the NT server disk into it. At this point the Celeron had what it would have if WINNTbak had been installed followed by a full restore of the original system.

Network Problems

When the Celeron with the original server hard disk started, it booted but came up with a bunch of errors including network errors I did not expect. When NT "successfully" finishes a boot and displays a login prompt, it saves the current settings as the "last known good". I figured the registry had already been changed by the time I knew there were problems. The video was obviously different. I decided to try to fix the problems with the original server's hard disk in the Celeron. Putting the disk back in the original machine was not likely to work at this point even if it had a good power supply.

First I tried upgrading the adapter, hoping that whatever settings were different with the NIC in this machine would get reset by the upgrade. The cards were the same in both machines and known to be good. I figured a PCI slot or IRQ variation was causing the problem. Upgrading did not help so I removed the adapter (software in the networking control panel, not the physical NIC). NT told me I had to reboot before I could re-install the adapter.

I rebooted and tried reinstalling the adapter with no success. I noticed the IP Protocol had no settings so I uninstalled that and the adapter and rebooted again. Subsequently, every attempt at reinstalling TCP/IP resulted in messages that there were old registry entries remaining and that some name space could not be deleted. TCP/IP never installed. It's been my experience that once you hit one of these inconsistent registry states, there are only two ways out: restore from a known good backup that precedes the problem or reinstall Windows and rebuild the system manually.

Registry Backup

Part of the backup that I was counting on was a set of registry backups made with the Resource Kit program, regback. If you don't have an NT supported tape drive, this is your only option for backing up the registry, an absolutely essential piece of system software. The files that hold the registry are always open while the system is up and cannot be accessed as files. To be backed up, they need to be read by a program that can access the registry contents via API calls and then save the contents in a format that a restore program can use to reverse the process. NTBackup and regback are the only Microsoft supplied programs that can do this (that I know of).

I've been making these regback, registry backups, for several years on systems without local tape drives but never actually tried to put a registry back with them. I thought that if I restored the registry (on the Celeron while using the original server disk) I might be able to make a clean uninstall of the networking software so that I could reinstall it. Every attempt at using regrest (the restore counterpart to regback) failed. I tried enough combinations of relative and absolute path names and both full directory and individual file (hive) restores that I was confident it simply wasn't going to work no matter how I varied the command line. I reviewed the instructions between nearly every attempt.

At this point I restored the hard disks to the original PCs and took the NT server and the Celeron with what I thought was a bad hard disk back to the shop where they came from.

Incompatibilities

I then experimented with the second new Celeron PC, and got exactly the same sequence of errors as the first. I then believed the two messages were directly related and that something in the BIOS or disk geometry was giving NT a problem and all it could report was a bad block. While these are low end PCs, they are less than six weeks old. I expect the BIOS is too new for this now old NT 4 to recognize. They are about as generic as you can get today. They are compatible enough that in addition to Windows 98, Linux and OpenBSD install on them.

Over time I experimented with both primary and extended partitions, partition and logical drive sizes less than 4GB and using all remaining space and FAT and NTFS. All seemed to get the same result, except that at some point the event log entries simply stopped appearing. That does not inspire confidence that the resulting system will be usable. The format seemed to hang each time.

After I realized new event log entries were not being created, I returned to a full size primary partition and tried another format. About an hour and a half later, after appearing to have worked but incredibly slowly, NT displayed a dialog box which said NT couldn't format the disk. My recollection is that the second partition format typically only takes a few minutes (varying with size of the partition and speed of the machine).

Finally I tried the quick format which I don't trust and normally do not use. It appeared to complete in a second or two. A subsequent scan of the new disk suggests it's OK. I never got the opportunity to determine if this install would be usable as the shop fixed the power supply and formatted the Celeron disk in only a few hours.

I wanted to experiment with the Celeron to see if a restore could have been accomplished with less than a full manual install. After the original server was working again, as described below, I could think of no way to do this without conflicting with the IP addresses of the original server. I was not willing to take the NT server off-line for such experiments now that it was back.

Restoring the Original Server

With a new power supply, the original NT server booted but as I expected, the registry was damaged and there were errors. I suspected but had never confirmed that the registry backups made by regback were physical duplicates of the original registry files.

I had registry backups from shortly before the power supply failed. As soon as I confirmed that there were boot problems with the current registry, I rebooted to the WINNTbak system. When booted this way the registry files for the "real" system are ordinary files that can be viewed, copied or deleted. I made a system32/config.bak directory and copied all the WINNT registry files into this backup directory. Then I replaced the four registry files with those made by the most recent regback run.

I was both right and lucky. The files made by regback are in fact duplicates of the current registry. When I rebooted again, back to the real system, it was without errors and all the system settings were just as they had been before the power supply died.
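The file swap described above can be sketched in Python. This is an illustration, not the procedure I actually ran (I copied the files by hand), and it only makes sense when the machine is booted from a second install such as WINNTbak so that the hive files are not open. The hive names in `HIVES` are an assumption; the function name and directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# Assumed names of the four hive files; adjust to match your regback output.
HIVES = ("system", "software", "security", "default")


def swap_in_hives(config_dir, regback_dir, safety_dir, hives=HIVES):
    """Save the current registry files, then overwrite the hives with
    the copies made by regback. Run only while the target install is offline."""
    config_dir = Path(config_dir)
    regback_dir = Path(regback_dir)
    safety_dir = Path(safety_dir)

    # Refuse to touch anything unless every replacement hive is present.
    missing = [h for h in hives if not (regback_dir / h).is_file()]
    if missing:
        raise FileNotFoundError(f"regback copies missing: {missing}")

    # Keep a copy of every current registry file (the config.bak step).
    safety_dir.mkdir(parents=True, exist_ok=True)
    for current in config_dir.iterdir():
        if current.is_file():
            shutil.copy2(current, safety_dir / current.name)

    # Replace the hives with the regback copies.
    for h in hives:
        shutil.copy2(regback_dir / h, config_dir / h)
    return [config_dir / h for h in hives]
```

Because the damaged files are saved first, the swap can be undone by copying them back from the safety directory.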

Lessons Learned

The net result of this exercise is that I've confirmed that I am saving current registry settings as well as applications and data with my backups. I've again confirmed that a minimal, second copy of NT on the same partition has significant system recovery value. I now know how to put back the registry as of any date, for which there are backups, if a new install causes problems.

Unless you're a really large shop and buy multiple servers at the same time, including machines specifically as backup machines, the odds of having an identical machine available, down to slot placement and IRQ settings, seem slim. Even something as simple as a failed NIC, if it is not replaced by identical hardware, identically configured, may cause the kind of problems I encountered when I moved the hard disk.

It may do no good to get the software that was on a previous server onto a new server. If NT detects that hardware has changed, as soon as the machine boots, it may trash itself because it's confused by the hardware changes and has no graceful means of reconfiguring the changes.

This experience raises the question of how best to recover from an NT system failure when the original machine is no longer available. Since the only way to be sure the registry is right for a new machine is to install the OS and all the services and applications on that machine, the loss of any component that cannot be replaced by an essentially identical component may force a complete manual reinstall of everything that was running on the original machine.

Since some software configuration settings are stored outside the registry and may have been changed long after the original install, it's probably a good idea to restore application and data directories from the original machine. Depending on where each service or application stores its configuration data, this could create new inconsistencies. You can only hope there were no important settings in the registry that were lost with the failed system. Because of the extraordinary jumble of system and application configuration data that the Windows registry is, the only thing that you can absolutely count on NT backups preserving is your data.

Recovering an OpenBSD Machine

The second Celeron that I tried to rebuild the NT server on had been used as an OpenBSD 2.7 test machine. It was loaded with a variety of network and CGI scanning and probing tools for testing firewalls and computer security. While I didn't consider it a throwaway machine, I hadn't started doing backups on it. Before overwriting this install, I tried to think of all the root directories that I'd made changes in after the initial BSD install. I made a tar of these and ftp'd it to my workstation. That took less than 15 minutes.

To recreate this "attack" machine, I deliberately picked the other Celeron and moved the NIC so they would not be quite identical. I did a basic OpenBSD install following my install notes but with a different host name and root password and had a fully functional UNIX machine in just under ten minutes with one boot. In less than 18 minutes I'd retrieved the tar and restored it and rebooted.

When I logged in everything looked like the attack machine and seemed to work except I could not su to root. su didn't have the SUID bit set. A quick check of man for tar showed that I had not used the -p option. I repeated the restore with the -p option but did not reboot. This fixed the su problem.
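The su problem arises because a tar archive records each file's full mode, including the set-UID bit, but a restore only reapplies it when told to preserve permissions. A minimal Python sketch, using the standard `tarfile` module, lists the archive members that carry the bit so you know in advance which files depend on a permission-preserving restore. The function names and the two sample entries are hypothetical.

```python
import io
import stat
import tarfile


def suid_members(archive_bytes):
    """Return names of archive members whose stored mode has the set-UID bit."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [m.name for m in tar.getmembers() if m.mode & stat.S_ISUID]


def demo_archive():
    """Build a tiny in-memory tar: one set-UID entry and one plain entry."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, mode in (("usr/bin/su", 0o4555), ("etc/motd", 0o644)):
            info = tarfile.TarInfo(name)
            info.mode = mode
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))  # empty file body
    return buf.getvalue()


print(suid_members(demo_archive()))
```

Running a check like this against a real backup before restoring would have flagged su as a file needing the -p option.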

My OpenBSD experience is less than a year compared to four and a half for NT. I have done more BSD installs and administration than either NT or Linux during this past year because my firewall is on OpenBSD. Except for the security related functions, generally similar tasks are performed on Linux and NT, so the NT experience continues to grow along with OpenBSD experience.

I need to spend some more time checking it out, but as near as I can tell, including preliminary testing, I had a machine functionally identical to the original attack machine in less than 40 minutes. This is on my first attempt at restoring any BSD machine. I do have applicable AIX experience so the UNIX similarities certainly contributed to the success.

OS Comparisons

That is less than an hour, compared to significant portions of two days spent on NT systems, during which no meaningful progress was made. The only thing I succeeded in doing was to put the original system back to its original status. I could have done a complete manual NT install in less time, but until I worked through all the things I tried, I really couldn't accept that there simply is no sure way to restore an NT configuration to a non-identical machine.

My conclusion is that if you have good backups of a UNIX like system and compatible hardware to install the same OS as the failed system, you can build a functional equivalent of a failed system relatively quickly. UNIX like systems store their hardware specific settings in separately accessible and modifiable locations. Even if the hardware is different, following the initial install and restore, a knowledgeable administrator should be able to make the necessary adjustments before rebooting.

With NT, on the other hand, unless you have essentially identical hardware, it does not matter how much NT administrative experience you have; you have no assurance that you can successfully transfer anything other than data from your backups. With luck you may get a system back quickly, but until you've tried you can't know. If there are any differences in the hardware configurations of the old and new machines, there is no predicting in advance whether a restore will get you a functioning system or whether the system will have to be built from scratch, installing the OS and every service and application separately.

For the Future

Without identical hardware in reserve you can't plan NT system recovery. The next time I'm faced with getting a replacement NT server up quickly I'd proceed as follows: Start with a WINNTbak install, which should be the first step of any NT install. If the replacement hardware seems very close to the failed hardware, then working from WINNTbak, do a full restore including C:\WINNT from the failed machine. Continue with this only if it looks very close to a fully functional system when it's booted, assuming it does boot.

If it fails to boot or has serious errors, erase the C:\WINNT directory and do a normal NT install to this location. If the hardware is significantly different, don't bother with a restore, just start with a standard NT install to the normal location.

Continue installing IIS and any other software that runs as NT services (daemons). Service configuration information goes in the system file of the registry along with hardware. It does not go in the software file into which most ordinary application configuration information goes. Unfortunately on most servers this means reinstalling most of the software because servers normally run all their key applications as services.

If the failed machine had a significant amount of non service software installed, stop after the services are installed and reboot to the WINNTbak install. Make a complete backup of the C:\WINNT registry and then restore applications and data. Restore the software, security and default files from the previous system's registry. Do NOT restore the system file in the registry.

Hopefully the preceding steps result in a functional equivalent of the replaced server. If not, reboot to the WINNTbak system and put back the saved C:\WINNT registry. Reboot to the standard system and continue manually installing all software necessary to make the replacement machine a functional equivalent of the failed machine. Finally, restore data and application configuration information that's stored outside of the registry.
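The branching in the procedure above is easier to follow written out as a decision function. This Python sketch is one reading of the preceding paragraphs, not a tested recovery tool; the function name, the flag names and the step wording are all illustrative.

```python
def recovery_steps(hardware_close, restore_boots_cleanly=False,
                   has_non_service_software=False):
    """Return the ordered recovery steps for a replacement NT server."""
    steps = ["Install minimal standalone NT into WINNTbak"]

    if hardware_close:
        steps.append("From WINNTbak, do a full restore including C:\\WINNT")
        if restore_boots_cleanly:
            steps.append("Continue with the restored system")
            return steps
        # Restore failed to boot or showed serious errors.
        steps.append("Erase C:\\WINNT")

    steps.append("Do a normal NT install to C:\\WINNT")
    steps.append("Install IIS and all other software that runs as services")

    if has_non_service_software:
        steps += [
            "Reboot to WINNTbak and back up the fresh C:\\WINNT registry",
            "Restore applications and data",
            "Restore the software, security and default hives (NOT system)",
            "If that fails: put back the saved registry, install manually",
        ]

    steps.append("Restore data and configuration stored outside the registry")
    return steps


for step in recovery_steps(hardware_close=False, has_non_service_software=True):
    print("-", step)
```

Note that the full restore is attempted only when the hardware is close, and the system hive is never carried over to different hardware.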

Keep NT Backup Servers Available

Depending on the complexity of the server, the similarity or lack thereof between the failed and replacement server and the quality of the documentation on the failed server's installation, the preceding procedures could take two plus hours or a very long day. If you want to replace an NT server quickly, the replacement must be pre built and largely ready to go. Ideally only one IP address would need to be changed and the backup machine rebooted to replace the failed machine.

Web document trees or other data that would be visible to the server users should be kept in sync with the server(s) to be replaced. If the data is not automatically kept synchronized but needs to be restored, the backup media and software on the two systems must be compatible. Also, the current backups cannot be locked inside the failed server, for example in a tape changer that can't eject its tapes if the machine cannot be powered up.
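One way to verify that a standby's document tree has not drifted from the production server is to compare the two trees periodically. The following is a rough Python sketch using the standard `filecmp` module; the function name is hypothetical, and it assumes both trees are reachable as local or mounted directories. By default `filecmp` treats files with the same type, size and modification time as equal without reading them, so it can miss some changes.

```python
import filecmp
from pathlib import PurePosixPath


def tree_drift(primary, standby):
    """Return sorted relative paths that exist on only one side or differ."""
    drift = []

    def walk(cmp, prefix):
        changed = (cmp.left_only + cmp.right_only
                   + cmp.diff_files + cmp.funny_files)
        for name in changed:
            drift.append(str(prefix / name))
        # Recurse into directories present on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(primary, standby), PurePosixPath(""))
    return sorted(drift)
```

An empty result suggests the standby can serve the same content; anything listed needs to be copied or restored before a swap.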

If they are pre built, replacement servers do not need to match the hardware configuration of the server they are replacing, but the installed software has to duplicate the functionality of the server to be replaced. The backup might be an older or slower machine or a development machine. One server might back up several production servers. For example, production web, list and database servers might be separate machines where the backup is one with all three installed. This likely means doubling license costs if not hardware costs. There may be functional differences if some services are less important than others.

The approaches discussed above get a server back online where users expect to find it, while the live server is repaired. This buys time; it does not result in a server that can be left indefinitely as a production server and does depend on the availability of spare machines. Server swaps may result in data loss such as messages in a list server's message base. Slower backup machines are likely not adequate for peak load times.

These approaches do not provide uninterrupted service. Done very well they could restore partial functionality in as little as ten minutes. Interruptions could be much longer but should still be much less than installing new machines in a pressure situation.

If you have true 7 x 24 up-time requirements and the necessary resources, you have sufficient redundancy already that the loss of any one or two servers won't cripple your operations. In such environments it probably makes less difference what operating system is chosen.

In environments where some down time can be tolerated if it's not long or frequent, or where high availability simply cannot be afforded, it's unfortunate that NT is much more likely to be used than the UNIX like alternatives. It's an unfortunate choice because it's both much more expensive and much harder to restore a failed server quickly.


Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 