The Stupidity of Windows NT Installs, Backups and System Recovery
- 10/5/00
Due to the structure of the Windows registry,
there is no assured way to quickly rebuild a failed machine unless hardware
essentially identical to the failed machine is available. Further, regardless
of whether or not an essentially identical machine is available, rebuilding
a failed machine is needlessly complicated by the stupid
installation process Micro$oft has given Windows NT.
Important aspects of the install process
are motivated by commercial concerns of minimizing production costs and
obsolete software
inventory and not by providing the customer with a reasonably expeditious
means of installing the software they have purchased.
- Hardware Failure
- NT: Multiple Installs and Reboots Required
- NT Install Insurance
- Swapping Disks
- Network Problems
- Registry Backup
- Incompatibilities
- Restoring the Original Server
- Lessons Learned
- Recovering an OpenBSD Machine
- OS Comparisons
- For the Future
- Keep NT Backup Servers Available
Hardware Failure
This discussion is prompted by the fact that my
Windows NT server was down for a little over two days in early October
2000. Though I had current backups and multiple available machines to
install them on, I was not able to get the server back in service
until the original hardware was repaired.
I was nearby when the server failed. Audible alarms alerted me to the
fact that the NT server could not be pinged and I looked at it. All lights
had gone out. The top of the case was very hot and all checks of the cables,
cords and switches revealed nothing amiss. I correctly concluded the power
supply had failed.
I was not particularly concerned as I had recent backups and spare
PCs with similar hardware. My original plan was to build
a replacement server using backups then get the original one repaired. I had
two available Celeron 533s with 128MB RAM and 10GB hard disks. The NICs
were the same in all PCs. The original NT server was PIII 500. The only
thing I saw as an incompatibility was the video but figured NT would
use standard VGA modes.
NT: Multiple Installs and Reboots Required
If you get Windows NT in the fall of 2000, you get the same CD ROM
that Microsoft started selling in mid 1996, plus one or more service packs
and option packs. Every other vendor I've encountered sells upgraded
versions of their products. When you buy the latest, you get the latest,
sometimes with minor patches to be applied. Only Microsoft (that I'm
aware of) keeps selling the same disks year after year and forcing their
customers to perform multiple upgrades to get to a current version of
the software.
Starting an install with software that's over four years old
can create compatibility issues.
The original NT couldn't deal with disk partitions over 4GB (which seemed
huge then). You need SP 3 for this. 10GB is about the smallest
drive you can buy today. Thus, before you can even deal with your whole disk,
unless you want a bunch of little logical drives, you need to install NT 4
and then SP 3. If you're using NTFS it takes 4 reboots to get to that
point (including the automated one where the FAT partition is turned into
an NTFS partition).
If you're installing NT 4 with Service Pack 4 or later, plus IIS
from the Option Pack, the order of install is as follows. Base OS and
then SP 3 to configure today's larger hard disks: 4 reboots. Install
Internet Explorer 4.01; another reboot. Install the Option Pack with
IIS and Index Server; another reboot. Install SP 4 - 6A; another reboot.
Five installs that are order sensitive and 7 reboots; if you know what
you're doing and have no problems, you can get through this sequence
in about an hour and a half.
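
The sequence above can be tallied with a short sketch. The stage names and counts are taken from the preceding paragraph; the `Stage` type and the `totals` helper are illustrative names only, and Python is used purely for illustration since nothing like this ships with NT.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    """One order-sensitive stage of the NT 4 build described in the text."""
    name: str
    installs: int  # separate install programs run in this stage
    reboots: int   # reboots needed before the next stage can start


# Order matters: each stage must finish before the next one starts.
STAGES = [
    Stage("NT 4 base OS, then SP 3 (NTFS boot partition)", 2, 4),
    Stage("Internet Explorer 4.01", 1, 1),
    Stage("Option Pack with IIS and Index Server", 1, 1),
    Stage("Service Pack 4 through 6A", 1, 1),
]


def totals(stages):
    """Return (total installs, total reboots) for a list of stages."""
    return (sum(s.installs for s in stages),
            sum(s.reboots for s in stages))


if __name__ == "__main__":
    for number, stage in enumerate(STAGES, start=1):
        print(f"{number}. {stage.name}: {stage.reboots} reboot(s)")
    installs, reboots = totals(STAGES)
    print(f"{installs} installs, {reboots} reboots")
```

Calling `totals(STAGES)` gives the five installs and seven reboots counted above. Each third party driver would be one more `Stage` with its own reboot.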
If you have RAID, use any video over 16 colors or 800 x 600 resolution, or have
just about any third party drivers, you can typically add another reboot
for each. Sometimes you can stack several installs on one reboot but
this can also cause problems. If you have several similar machines
and multiple third party drivers to install it's worth trying but it's not
worth the complications and risk of having to start over for one or
two machines.
NT Install Insurance
For the past year and a half or so, all my NT installs start with a minimal,
non network, standalone install into WINNTbak. The only additional
software that goes into this install is the backup software, installed to
the same location as it will go in the real NT install.
I've seen NT delete the C: partition and turn what was the D: partition into
C: when all I wanted was to do a completely clean install over the C:\WINNT
directory. I've been informed by others that this is normal NT behaviour.
With a fully functional (if rarely used) WINNTbak already
installed, I've been able to quickly boot a machine regardless of how
badly the "real" install may be corrupted.
I have the option of wiping WINNT clean and reinstalling. At least twice
before, doing a full restore or just a registry restore
over a corrupted system has produced a correctly
working machine in about half an hour. Care needs to be taken to keep
passwords synchronized so you can use WINNTbak when you need it.
In this case, my plan was to do a new WINNTbak install on the new machine
and then install an uncompress program. My NT server backups are .ZIP files
written to writable CD ROMs. I didn't actually plan to install the
real copy of NT but simply to uncompress the files off the CD to their
appropriate locations. I never completed the basic install
due to apparent hardware incompatibilities.
Swapping Disks
After I had SP 3 installed on the first Celeron,
I tried partitioning and formatting the space
remaining after the basic install on C:. With NT, the boot partition
is still limited to 4GB. The first sign of problems was
a dialog box that said the event log was full. The system log was stuffed
with alternating messages, several of each per second. One said the PC
had "old or out-of-date firmware" and the other that the second partition
"has a bad block."
I thought the Celeron had a bad hard disk and then tried an
entirely different approach. Everything that the server needed
was on the hard disk of the server that failed. Also the Celerons
were quite similar to the failed server.
I removed the hard disk from the first Celeron and put the NT server
disk into it. At this point the Celeron had what it would have if
WINNTbak had been installed followed by a full restore of the
original system.
Network Problems
When the Celeron with the original server hard disk started,
it booted but came up with a bunch of errors including
network errors I did not expect. When NT "successfully" finishes
a boot and displays a login prompt, it saves the current settings as the
"last known good". I figured the registry had already
been changed by the time I knew there were problems.
The video was obviously different. I decided to
try to fix the problems with the original server's hard disk in the
Celeron. Putting the disk back in the
original machine was not likely to work at this point even if it had
a good power supply.
First I tried upgrading the adapter, hoping that whatever settings were
different with the NIC in this machine would get reset by the upgrade.
The cards were the same in both machines and known to be good. I figured a PCI
slot or IRQ variation was causing the problem. Upgrading did not help
so I removed the adapter (software in the networking control panel, not
the physical NIC). NT told me I had to reboot before I could
re-install the adapter.
I rebooted and tried reinstalling the adapter with no success. I noticed
the IP Protocol had no settings so I uninstalled that and the adapter
and rebooted again. Subsequently, every attempt at reinstalling TCP/IP
resulted in messages that there were old registry entries remaining and
that some name space could not be deleted. TCP/IP never installed.
It's been my experience that once you hit one of these
inconsistent registry states, there are only two ways out: restore from
a known good backup that precedes the problem or reinstall Windows
and rebuild the system manually.
Registry Backup
Part of the backup that I was counting on was a set of registry backups made
with the Resource Kit program, regback. If you don't have an NT supported
tape drive, this is your only option for backing up the registry, an
absolutely essential
piece of system software. The files that hold the registry
are always open while the system is up and cannot be accessed as ordinary files.
To be backed up they need to be accessed by a program that can access the
registry contents via API calls and can then save the contents in
a format that can be used by a restore program reversing the process. NTBackup
and regback are the only Microsoft supplied programs that can do this
(that I know of.)
I've been making these regback registry backups for several years on
systems without local tape drives but never actually tried to put a registry
back with them. I thought that if I restored the registry (on the Celeron
while using the original server disk) I might be able to make a clean
uninstall of the networking software so that I could reinstall it.
Every attempt at using regrest (the restore counterpart to regback) failed.
I tried enough combinations of relative and absolute path names and both
full directory and individual file (hive) restores that I was confident
it simply wasn't going to work no matter how I varied the command line.
I reviewed the instructions between nearly every attempt.
At this point I restored the hard disks to the original
PCs and took the NT server and the Celeron with what
I thought was a bad hard disk back to the shop where they came from.
Incompatibilities
I then experimented with the second new Celeron PC, and got exactly the same
sequence of errors as the first. I then believed the two messages were
directly related and that something in the BIOS or disk geometry was giving
NT a problem and all it could report was a bad block. While these are low
end PCs, they are less than six weeks old.
I expect the BIOS is too new for this now old NT 4 to recognize. They
are about as generic as you can get today. They are compatible enough
that in addition to Windows 98, Linux and OpenBSD install on them.
Over time I experimented with
both primary and extended partitions, partition and logical drive sizes
less than 4GB and using all remaining space and FAT and NTFS.
All seemed to get the same result, except that at some point the event
log entries simply stopped appearing. That does not inspire confidence
that the resulting system will be usable. The format seemed to hang each time.
After I realized new event log entries were not being
created, I returned to a full size primary partition and tried another
format. About an hour and a half later, after appearing to have worked
but incredibly slowly, NT displayed a dialog box which said NT couldn't format
the disk. My recollection is that the second partition format typically
only takes a few minutes (varying with size of the partition and speed of
the machine).
Finally I tried the quick format which I don't trust and normally
do not use. It appeared
to complete in a second or two. A subsequent scan of the new disk
suggests it's OK. I never got the opportunity to determine if this
install would be usable as the shop fixed the power supply and
formatted the Celeron disk in only a few hours.
I wanted to
experiment with the Celeron to see if a restore could have been
accomplished with less than a full manual install. After the
original server was working again, as described below,
I could think of no way to do this
without conflicting with the IP addresses of the original server.
I was not willing to take the NT server off-line for such experiments
now that it was back.
Restoring the Original Server
With a new power supply, the original NT server booted but as I expected,
the registry was damaged and there were errors. I suspected but had
never confirmed that the registry backups made by
regback were physical duplicates of the original registry files.
I had registry backups from shortly before the power supply failed.
As soon as I confirmed that there were boot problems with the current
registry, I rebooted to the WINNTbak system. When booted this way the
registry files for the "real" system are ordinary files that can be
viewed, copied or deleted. I made a system32/config.bak
directory and copied all the WINNT registry files into this backup directory.
Then I replaced the four registry files with those made by the most recent
regback run.
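
The file swap just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as I actually ran it by hand: it assumes the machine is booted from the second install (WINNTbak) so the hive files are not open, that the four hives are the system, software, security and default files, and that all three directory paths are supplied by the caller. The function name is hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: these are the four registry files that were replaced.
HIVES = ("system", "software", "security", "default")


def swap_in_registry_backup(config_dir, regback_dir, saved_dir, hives=HIVES):
    """Save the current hive files, then overwrite them with backup copies.

    config_dir  -- registry directory of the damaged ("real") install
    regback_dir -- directory holding the most recent regback output
    saved_dir   -- where the damaged files are kept in case the swap fails
    """
    config_dir = Path(config_dir)
    regback_dir = Path(regback_dir)
    saved_dir = Path(saved_dir)

    # Refuse to touch anything unless every backup hive is present.
    missing = [h for h in hives if not (regback_dir / h).is_file()]
    if missing:
        raise FileNotFoundError(f"backup hives missing: {missing}")

    # Step 1: keep a copy of every current registry file.
    saved_dir.mkdir(parents=True, exist_ok=True)
    for entry in config_dir.iterdir():
        if entry.is_file():
            shutil.copy2(entry, saved_dir / entry.name)

    # Step 2: replace the hives with the backup copies.
    for hive in hives:
        shutil.copy2(regback_dir / hive, config_dir / hive)
    return [config_dir / hive for hive in hives]
```

Saving the damaged files first means the swap can be undone by copying them back from `saved_dir` if the restored registry turns out to be worse.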
I was both right and lucky. The files made by regback are in fact
duplicates of the current registry. When I rebooted again, back to
the real system, it was
without errors and all the system settings were just as they had been
before the power supply died.
Lessons Learned
The net result of this exercise is that
I've confirmed that I am saving current registry
settings as well as applications and data with my backups. I've again
confirmed that a minimal, second copy of NT on the same partition
has significant system recovery value.
I now know how to put back the registry as of any date for which there
are backups, if a new install causes problems.
Unless you're a really large shop and buy multiple servers
at the same time, including machines specifically as backup machines,
the odds of having an identical machine available, down to slot
placement and IRQ settings, seem slim. Even something as simple as a failed
NIC, if it is not replaced by identical hardware, identically configured,
may cause the kind of problems I encountered when I moved the hard disk.
It may do no good to get the software that was on a previous server
onto a new server. If NT detects that hardware has changed,
as soon as the machine boots, it may trash itself because it's confused
by the hardware changes and has no graceful means of reconfiguring
the changes.
This experience raises the question of how best to recover from an NT
system failure when the original machine is no longer available.
Since the only way to be sure the registry is right for a new
machine is to install the OS and all the services and applications on
that machine, the loss of any component
that cannot be replaced by an essentially identical component may force
a complete manual reinstall of everything that was running on the
original machine.
Since some software configuration settings are stored outside the
registry and may have been changed long after the original install,
it's probably a good idea to restore application and data directories from
the original machine.
Depending on where each service or application stores its configuration
data this could create new inconsistencies. You can only hope there were
no important settings in the registry that were lost with the failed system.
Because of the extraordinary jumble of system and application configuration
data that the Windows registry is, the only thing that you can absolutely
count on NT backups preserving is your data.
Recovering an OpenBSD Machine
The second Celeron that I tried to rebuild the NT server
on had been used as an OpenBSD 2.7 test machine. It was loaded with a variety
of network and CGI scanning and probing tools for testing firewalls and
computer security. While I didn't consider it a throwaway machine,
I hadn't started doing backups on it. Before overwriting
this install, I tried to think of all the root directories that I'd made
changes in after the initial BSD install. I made a tar of these and
ftp'd it to my workstation. That took less than 15 minutes.
To recreate this "attack" machine, I deliberately picked the other Celeron
and moved the NIC so they would not be quite identical. I did a basic
OpenBSD install following my install notes but with a different host name
and root password and had a fully functional UNIX
machine in just under ten minutes with one boot. In less than 18 minutes
I'd retrieved the tar and restored it and rebooted.
When I logged in
everything looked like the attack machine and seemed to work except I could
not su to root. su didn't have the SUID bit set. A quick
check of man for tar showed that I had not used the -p option. I repeated
the restore with the -p option but did not reboot. This fixed the su
problem.
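
The su problem comes down to one permission bit. The sketch below uses Python's `tarfile` module rather than the tar command I actually ran, and the member name and contents are placeholders. It shows only that the archive itself records the SUID bit; whether that bit reaches the restored file depends on the options given at extraction time, which is where -p came in.

```python
import io
import stat
import tarfile


def build_archive():
    """Return the bytes of a tar archive with one member recorded as mode 4755."""
    payload = b"placeholder, not a real binary\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name="usr/bin/su")  # placeholder member name
        info.size = len(payload)
        info.mode = 0o4755  # rwxr-xr-x plus the SUID bit
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def setuid_members(archive_bytes):
    """List the archive members whose recorded mode includes the SUID bit."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [m.name for m in tar.getmembers() if m.mode & stat.S_ISUID]


if __name__ == "__main__":
    print(setuid_members(build_archive()))
```

A check like `setuid_members` run against a backup before it is needed would show which restored programs depend on permissions being preserved.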
My OpenBSD experience is less than a year compared to four and a half
for NT. I have done more BSD
installs and administration than either NT or Linux during this past year
because my firewall is on OpenBSD. Except for the security related
functions, generally similar tasks are performed on Linux and NT so the
NT experience continues to grow along with OpenBSD experience.
I need to spend some more time checking it out, but as near as
I can tell, including preliminary testing, I had a machine functionally
identical to the original attack machine in less than 40 minutes. This
is on my first attempt at restoring any BSD machine. I do have applicable
AIX experience so the UNIX similarities certainly contributed to the
success.
OS Comparisons
That was less than an hour, compared to significant portions of two days spent on
NT systems during which time no meaningful progress was made. The only thing I
succeeded in doing was to put the original system back to its original
status. I could have done a complete manual NT install in
less time but until I worked through all the things I tried, I really
couldn't accept that there simply is no sure way to restore an NT
configuration to a non-identical machine.
My conclusion is that if you have good backups of a UNIX like system and
compatible hardware to install the same OS as the failed system,
you can build a
functional equivalent of a failed system relatively quickly.
UNIX like systems store their hardware specific settings in
separately accessible and modifiable locations. Even if the
hardware is different, following the initial install and
restore, a knowledgeable administrator should be able to make the necessary
adjustments before rebooting.
With NT on the other hand, unless you have essentially identical hardware,
it does not matter how much NT administrative experience you have,
you have no assurance that you can successfully transfer anything other
than data from your backups. With luck you may get a system back
quickly but until you've tried you can't know. If there are any differences
in the hardware
configurations of the old and new machines, there is no predicting
in advance whether a restore will get you a functioning system or whether
the system will have to be built from scratch installing the OS and
every service and application separately.
For the Future
Without identical hardware in reserve you can't plan NT system recovery.
The next time I'm faced with getting a replacement NT server up quickly
I'd proceed as follows:
Start with a WINNTbak install, which should be the first
step of any NT install. If the replacement hardware seems very close to
the failed hardware, then working from WINNTbak, do a full restore
including C:\WINNT
from the failed machine. Continue with this only if it looks very
close to a fully functional system when it's booted, assuming it does boot.
If it fails to boot or has serious errors, erase the C:\WINNT directory
and do a normal NT install to this location. If the hardware is significantly
different, don't bother with a restore, just start with a standard NT
install to the normal location.
Continue installing IIS
and any other software that runs as NT services (daemons). Service
configuration information goes in the system file of the registry along
with hardware. It does not go in the software file into which most
ordinary application configuration information goes. Unfortunately on
most servers this means reinstalling most of the software because servers
normally run all their key applications as services.
If the failed machine
had a significant amount of non service software installed, stop
after the services are installed and reboot to the WINNTbak install.
Make a complete
backup of the C:\WINNT registry and then restore applications and data.
Restore the software, security and default files from the previous system's
registry. Do NOT restore the system file in the registry.
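
The selective restore in the last two sentences can be expressed as a small dry run. This is a sketch only: the hive file names follow the ones named above, the function and constant names are hypothetical, and nothing is copied; the function just lists which files would be restored so the list can be checked before any file is touched.

```python
from pathlib import Path

# Hive files named in the text; "system" holds hardware and service settings.
ALL_HIVES = ("system", "software", "security", "default")
HARDWARE_BOUND = frozenset({"system"})


def restore_plan(old_registry_dir, new_config_dir,
                 hives=ALL_HIVES, skip=HARDWARE_BOUND):
    """Return (source, destination) pairs for the hives that are safe to restore.

    Hives listed in `skip` are left as the fresh install created them,
    because they describe the new machine's hardware and services.
    """
    old = Path(old_registry_dir)
    new = Path(new_config_dir)
    return [(old / hive, new / hive) for hive in hives if hive not in skip]


if __name__ == "__main__":
    # Both directory names here are placeholders.
    for source, destination in restore_plan("old_registry", "new_config"):
        print(f"restore {source} -> {destination}")
```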
Hopefully the preceding steps result in a functional equivalent of the
replaced server. If not, reboot to the WINNTbak system and put back
the saved C:\WINNT registry. Reboot to the standard system and continue
manually installing all software necessary to make the replacement
machine a functional equivalent of the failed machine. Finally,
restore data and application configuration information that's stored
outside of the registry.
Keep NT Backup Servers Available
Depending on the complexity of the server, the similarity or lack thereof
between the failed and replacement server and the quality of the documentation
on the failed server's installation, the preceding procedures could take
two plus hours or a very long day. If you want to replace an NT server
quickly, the replacement must be pre built and largely ready to go.
Ideally only one IP address would need to be changed and the backup machine
rebooted to replace the failed machine.
Web document trees or other data
that would be visible to the server users should be kept in sync with
the server(s) to be replaced. If the data is not automatically kept
synchronized but needs to be restored, the backup media and software
on the two systems must be compatible. Also the current backups cannot
be locked inside of the failed server, for example a tape changer that
can't be ejected if the machine cannot be powered up.
If they are pre built, replacement servers do not need to match the hardware
configuration of the server they are replacing but the installed software
has to duplicate the functionality of the server to be replaced. The backup
might be an older or slower machine or a development machine. One server
might back up several production servers. For example, production web, list and
database servers might be separate machines where the backup is one with
all three installed. This likely means doubling license costs if not
hardware costs. There may be functional differences if some services
are less important than others.
The approaches discussed above get
a server back online where users expect to find it,
while the live server is repaired. This buys time; it does not result in
a server that can be left indefinitely as a production server
and does depend on the availability of spare machines. Server swaps
may result in data loss such as messages in a list server's message base.
Slower backup machines are likely not adequate for peak load times.
These approaches do
not provide uninterrupted service. Done very well they could restore
partial functionality in as little as ten minutes. Interruptions could be much
longer but should still be much less than installing new machines in a
pressure situation.
If you have true 7 x 24 up-time requirements and the necessary resources,
you have sufficient redundancy already
that the loss of any one or two servers won't cripple your operations.
In such environments it probably makes less difference what operating
system is chosen.
In environments where some down time can
be tolerated if it's not long or frequent, or high availability simply
cannot be afforded, it's unfortunate that NT is much more likely to be
used than the UNIX like alternatives. It's an unfortunate choice
because it's both much more expensive and much harder to restore a
failed server quickly.
Copyright © 2000 - 2014 by George Shaffer. This material may be
distributed only subject to the terms and conditions set forth in
http://GeodSoft.com/terms.htm
(or http://GeodSoft.com/cgi-bin/terms.pl).
These terms are subject to change. Distribution is subject to
the current terms, or at the choice of the distributor, those
in an earlier, digitally signed electronic copy of
http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the
time of the distribution. Distribution of substantively modified
versions of GeodSoft content is prohibited without the explicit written
permission of George Shaffer. Distribution of the work or derivatives
of the work, in whole or in part, for commercial purposes is prohibited
unless prior written permission is obtained from George Shaffer.
Distribution in accordance with these terms, for unrestricted and
uncompensated public access, non profit, or internal company use is
allowed.