New Windows Anomalies and Some UNIX Comparisons - 4/9/01
Here I use "new" to mean anomalies not previously seen by or
commented on by me. From what I know of Windows, it would
surprise me greatly if others had never encountered these or
similar problems.
There was a power outage here yesterday that lasted over 40
minutes. All my systems are on UPSs but generally not with
enough reserve to last that long. I happen to use APC UPSs which
come with the hardware and software necessary to connect Windows
PCs to the UPSs. This allows an automated "managed" shutdown to
be initiated after a set period of time rather than simply
letting the machine crash when the battery runs out. For reasons
of system integrity, a managed shutdown is preferred, especially
on systems that may be engaged in active disk activity such as
writing log files. Due to the fragility of the Windows
registry, it's more important that Windows systems be shut down
properly rather than simply be allowed to crash.
Not only the computers but all routers, switches and other
equipment necessary to maintain functioning web sites are on UPSs.
My sites can and do remain up and available through
short local power outages. This means that the web servers
and firewall are likely to be active and thus writing logs at the
time of a power failure.
My NT server and workstation shut down in response to the UPS
software. The firewall, which will nearly always be active if any
of the web servers are, failed when the UPS battery ran down.
The UPS that the Linux and OpenBSD web servers were on lasted
until power was restored. The Linux and OpenBSD servers had been
up 204 and 53 days respectively. Both continued their long up-time
runs. (The Linux machine was up for 336 days when the reset button
was accidentally pressed. The OpenBSD machine was up until it
was moved.) Neither ever crashes. Except for hardware changes and
operating system upgrades, it's hard to think of a reason for
ever rebooting either. Any daemon (service) can be stopped,
started and, if necessary, upgraded at will, and the networking
setup can be changed, all without rebooting.
The firewall had to go through its file systems checks and thus
took somewhat longer than normal to boot. This is still not much
longer than a typical NT server boot including starting of
services. Everything worked as expected and there were no problems;
the system remained up until it was upgraded a few months later.
The NT workstation was OK but the NT server didn't fare quite so
well. When the NT server rebooted, it first came up with the
wrong login dialog box. This server is a Primary Domain
Controller and the login box is supposed to have three fields for
user name, password and domain. The domain field was not
displayed on the login box, just the user name and password
fields. I made several attempts to log in but knew it would be
futile.
I think this is now the third time I've seen this odd behavior
after a reboot. The first time, I did not recognize the odd
login dialog and tried repeatedly to log on with every username
and password I knew, but none were accepted. I did a
hardware reset and the normal login prompt appeared. I logged in
successfully on the first attempt. The second and third times I
saw the odd login box, I used the hardware reset after only a few
failed login attempts. I think in each case the normal login
appeared after a single extra reboot.
Following the reboot (1) and login, the NT server appeared normal;
there were no error messages or dialog boxes indicating any
failed services. The first sign of a problem was when my
intrusion detection system
started displaying warnings that logs from the other servers were
not available. They were not available because FTP was not
running on the NT server. Neither was the web server as it
turned out. Manual attempts to start both failed, generating the
following error message: "Could not start the World Wide Web
Publishing Service service (sic) on \\NT. Error 2140. An
internal Windows NT error occurred." The FTP message was
identical except for substituting "FTP Publishing Service".
Multiple attempts produced the same results.
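As an illustration only (the actual warning system is not shown here), an external check of this kind can be sketched in Python. The sketch assumes that "service is down" can be approximated by "the TCP port does not accept connections"; the function names are hypothetical, and the ports are simply the standard FTP and HTTP ports.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def down_services(host, services, timeout=3.0):
    """Return the names of the services whose port does not accept connections.

    `services` maps a label to a TCP port, for example {"ftp": 21, "http": 80}.
    """
    return [name for name, port in services.items()
            if not port_is_open(host, port, timeout)]
```

Calling `down_services("nt.example.com", {"ftp": 21, "http": 80})` (a placeholder host name) from a scheduled job on another machine would return the services to warn about. Because the check runs outside the monitored server, it does not depend on that server's own event log reporting the failure.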
Another reboot (2) failed to start either service and manual
attempts generated the same error messages. At no time were
there either system dialog box error messages or event log
entries to indicate that a service had failed to start. It was my
own custom-programmed warning system, not any NT feature, that
alerted me to the problem. Microsoft's unhelpful error messages
provided no useful information that I was not already aware of.
As these messages were displayed only in response to manual
efforts to fix an error condition that was identified by other
means, no Microsoft error message or system log played any useful
role in either identifying or fixing a major system problem.
Even after I was aware of the problem and knew when it occurred,
I could find no event log entries describing the problem.
I guessed that something in the registry had become corrupted. I
rebooted (3) to the alternate copy of NT that I install on all my
NT systems as insurance against just such occurrences. When I
saw that the "system" part of the registry was about 50% larger
than the backup made a few hours before the power failure, even
though no system changes had been made, I thought for sure I'd
found the problem. I restored the backup registry over the
production registry and rebooted (4) once more.
The web and FTP servers did not start and manual attempts produced
the same error messages as before. At this point I was almost out
of ideas and starting to consider a full reinstall. Before doing
so, I decided to check all recently changed files on the system
to see if any other changes might have contributed to the problem.
A search showed that MetaBase.bin had changed a couple of days
before. This file stores most IIS and FTP configuration data. The
change was mine: I'd blocked a computer that was violating the
Terms of Use, and causing lots of error messages, from accessing
the web site. I decided to restore the
earlier version of MetaBase.bin but before I did, I discovered the
real cause of the problem. In the directory where MetaBase.bin
is stored there was also a MetaBase.bin.bak. The .bak file matched
the time stamp and size of the last modifications that I'd made.
The active MetaBase.bin file was one byte smaller and time stamped
during the second reboot following the power outage. For some
reason, during the first reboot that allowed me to log in after
the power failure, NT had replaced the proper MetaBase.bin with
a damaged file and kept the original as a .bak copy.
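The two checks described above, listing recently changed files and comparing a file with its .bak copy, can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually used: the function names are hypothetical, and it assumes the backup sits next to the file with ".bak" appended to the name, as MetaBase.bin.bak did here.

```python
import os
import time


def recently_changed(root, days):
    """Yield (path, mtime) for files under root modified in the last `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                yield path, mtime


def compare_with_backup(path, backup_suffix=".bak"):
    """Compare a file with the backup copy stored next to it.

    Returns (size_delta_bytes, mtime_delta_seconds), both computed as
    file minus backup, or None when no backup file exists.
    """
    backup = path + backup_suffix
    if not os.path.exists(backup):
        return None
    current, saved = os.stat(path), os.stat(backup)
    return current.st_size - saved.st_size, current.st_mtime - saved.st_mtime
```

A result such as a size delta of -1 together with a positive time delta would match what was seen here: an active file one byte smaller and newer than its backup.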
I can't even begin to imagine what caused this NT behaviour. As
soon as I copied the previous MetaBase.bin file into place (the
one with the new IP restrictions), I was able to start both IIS
and FTP without further error messages. I rebooted (5) once more
to see if the system was fixed and consistent. This time the web
and FTP servers started automatically, as they are supposed to.
The NT server now appears to be doing what it was prior to the
power outage.
Before drawing some final conclusions, I want to comment on the
one aspect of this that favors NT. When you buy an APC UPS,
it comes with the software for Windows to force an orderly
shutdown of the machine before the battery runs out. Corresponding
software is available for the more common UNIX variants. To
get it you'll need to go to a web site or send in a form as it's
not included in the packaging with the UPS.
For two reasons this is not significant. First, with most UNIX
variants it simply is no big deal if they experience a hard crash
such as one caused by a power failure. I have yet to see any UNIX
variant fail to come back from such a situation and resume
normal operations afterwards. I know this is not always true and
it's probably a really bad idea for a heavily used database server
to count on this. Still, for many light to moderate use UNIX
servers, it's quite reasonable to let them run on UPS power until
the battery fails.
To use the automated shutdown features wisely, unless the UPS is
high end with software that can shut down at a specified battery
percentage rather than a fixed number of minutes, you must be
conservative in your estimate of the UPS battery life. Thus
there will be a several minute period between the automated
shutdown and the end of useful battery life. It's not very
unusual for power to be restored during that period. Depending on
the computer power switch, a computer may or may not come back on
after power is restored. Switches on newer computers are less
likely to restart a computer following a power outage than on
older computers. Where computers are not tended 24 hours a
day, an automated shutdown in response to a power failure can
turn what would have been a non-event (if the UPS had enough
battery power to outlast the outage) into hours or days of down
time.
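The difference between the two shutdown policies can be made concrete with a small Python sketch. It is illustrative only: the function name and the default thresholds (25 percent, 10 minutes) are assumptions rather than settings of any particular UPS package, and how the readings are obtained from the UPS is left out.

```python
def should_shut_down(on_battery, charge_percent=None, minutes_on_battery=None,
                     min_charge_percent=25.0, max_minutes=10.0):
    """Decide whether to start an orderly shutdown.

    If the UPS reports its remaining charge, shut down only when the charge
    falls to min_charge_percent. Otherwise fall back to a fixed timer, which
    has to be set conservatively because the real battery life is unknown.
    """
    if not on_battery:
        return False  # mains power is present; nothing to do
    if charge_percent is not None:
        return charge_percent <= min_charge_percent
    if minutes_on_battery is None:
        raise ValueError("need charge_percent or minutes_on_battery")
    return minutes_on_battery >= max_minutes
```

With these assumed defaults, a machine 12 minutes into an outage is shut down under the fixed timer no matter how much battery remains, while the percentage policy keeps it running until the charge actually drops to the threshold.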
The other factor is that just having the Windows software and
installing it does not assure the system will shut down as
expected. If the UPS software does not issue the shutdown with
the correct options, application or user dialog boxes may prevent
the shutdown from completing. Ensuring that a UPS is properly
configured, and will both shut a server down as expected and take
full advantage of available battery capacity, requires
time-consuming tests. A server and UPS have not been tested until
they have been run through actual power cuts covering each case:
power removed and restored before the shutdown time; power
removed and kept off until after the shutdown; power restored and
removed again after the shutdown; and power restored and left on.
If a UPS is added to a production server, it's almost certain
that proper tests will not be conducted. Anyone sufficiently
knowledgeable and willing to do the necessary testing will also
be able to get the necessary software from the UPS manufacturer,
if it is available. The same person would make its availability
part of the UPS selection process.
This latest experience is just one more example of the
inadequacies of Windows NT (and 2000) as a server operating
system. A UNIX (OpenBSD) server suffers a hard crash when the
battery power runs out. Pressing the power button is the sum
total of the recovery procedures when power is back. In
contrast, an NT server, following an orderly UPS-initiated
shutdown, trashes itself. No system error messages or logs
announce or reveal any problems. Third party software reveals
the problem. Approximately three hours of investigation, good
recent backups and five reboots are required to get the system
back to where it was prior to the power outage.
Copyright © 2000 - 2014 by George Shaffer. This material may be
distributed only subject to the terms and conditions set forth in
http://GeodSoft.com/terms.htm
(or http://GeodSoft.com/cgi-bin/terms.pl).
These terms are subject to change. Distribution is subject to
the current terms, or at the choice of the distributor, those
in an earlier, digitally signed electronic copy of
http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the
time of the distribution. Distribution of substantively modified
versions of GeodSoft content is prohibited without the explicit written
permission of George Shaffer. Distribution of the work or derivatives
of the work, in whole or in part, for commercial purposes is prohibited
unless prior written permission is obtained from George Shaffer.
Distribution in accordance with these terms, for unrestricted and
uncompensated public access, non profit, or internal company use is
allowed.