NT Server Down, Won't Be Fixed - 8/20/2001
Following the most recent failure of my Windows NT 4, IIS 4 web
server, I've decided to discontinue maintaining my NT web server
mirror. I recently (Oct. 2001) learned that the underlying cause of
the problem was a bad memory chip. For some time I'd thought it was
caused by the Microsoft security "rollup" patch. The server sat
unused for several weeks, then I installed a much larger hard disk and
tried to install Linux. I had to upgrade the BIOS but even after that
Linux installs failed. Subsequently I installed OpenBSD. Within an
hour or so of replacing my live OpenBSD web server, the new server was
displaying "segmentation fault" and "memory fault" messages when I
checked on its performance.
I then remembered I'd added a new memory chip at the same time
the rollup patch was applied to the NT server. I confirmed that
the errors were repeatable across reboots and not present when
the new chip was removed or replaced. Linux also installed
cleanly with the bad chip removed. Microsoft can't be
responsible for the bad memory but it's clear that neither NT nor
Linux protected the system from the bad memory as did OpenBSD.
While OpenBSD displayed meaningful errors that led directly to a
fix of the underlying problem, it also continued to serve web
pages without interruption. It's most unlikely that if Apache had
been executing at bad memory addresses it would have functioned
normally, but it does seem likely that OpenBSD would have
protected the system from the results, and provided useful
messages through the system logging and the console.
Because NT almost totally trashed the disk system, restoring the
NT server would require reinstalling a basic NT 4 system and
restoring from backups. For those interested in more information
about the crash, subsequent troubleshooting and the reasons I'll
not restore this system, a detailed account follows.
Around 3 A.M., Sunday, August 19, 2001, one of my automated
alarms went off, alerting me to the fact that the shared drive on
my NT server was no longer available. I determined that the web site,
which is on that drive, was also not responding. I went to the
console to diagnose the problem and was confronted with a login
dialog box. This was odd since I'd not logged out after last
using this machine. It wasn't the standard login dialog of a
Primary Domain Controller (PDC) but rather the standalone server
dialog box without the domain name field. For the past year or
so, about 30% of the time the NT box reboots, it displays this
inappropriate login dialog which won't let me log in. I have to
press the reset button and normally following the reboot, the PDC
login dialog is displayed letting me log in.
When I pressed the reset button, I was careless and pressed the
reset button on the Linux server which is next to the NT server.
The Linux server had been up for just over 11 months. As I
looked at the front of the machine, I realized what I'd just done
and screamed in rage. This was the longest any machine I've been
responsible for had been up and I had expected it to reach a
year, barring an extended power outage.
When I logged in on the NT machine, I started IE 5 to see if the
web site was up. I was somewhat puzzled, because when the alarm
triggered by the shared drive went off, the NT server was
responding to automated pings but the web site
was down. I don't remember the precise sequence of events but
the site did not come up and a Dr. Watson dialog box appeared.
CPU use went to 100% and stayed there. I tried to invoke task
manager. The hourglass displayed briefly but task manager did
not start. I wanted to kill the runaway process. I tried the
start menu; the task bar would appear but when I clicked start,
the program menu would not appear. I clicked on a few desktop
icons including the command prompt icon but nothing would start.
I could task switch between the few tasks that were running but
otherwise could do nothing.
Then the system spontaneously rebooted. During the boot, chkdsk
started running. While chkdsk was running, it displayed numerous
messages about disk corrections it was making. During this, the
machine spontaneously rebooted again. This time it came up with
the error message "Windows NT could not start because the
following file is missing or corrupt: \WINNT\system32\l_intl.nls.
You can attempt to repair this file by starting Windows NT Setup
using the original Setup floppy or CD-ROM. Select 'r' at the
first screen to start repair." It was then after 4 A.M. I
powered off the NT machine and shut down or reconfigured the
alarms that were going off because the NT server was not
pingable.
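The distinction that puzzled me above, a host that still answers
pings while its web site and shared drive are down, is the kind of
thing a monitoring script can report explicitly. What follows is only
a minimal sketch in Python and not the alarm software I actually ran.
The function names are hypothetical, and because a true ICMP ping
needs raw sockets or an external ping command, a plain TCP connect
stands in for the ping check here.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time.

    Stand-in for an ICMP ping (an assumption of this sketch).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_responding(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code < 500
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # No usable answer at all: DNS failure, refused, timeout, etc.
        return False


def diagnose(host_up, services):
    """Summarize check results.

    host_up  -- result of the reachability check
    services -- dict mapping a service name to True (up) or False (down)
    """
    if not host_up:
        return "host unreachable"
    failed = sorted(name for name, ok in services.items() if not ok)
    if failed:
        return "host up, services down: " + ", ".join(failed)
    return "ok"
```

On the night of the crash, a call such as
diagnose(True, {"web": False, "share": False}) would have reported
that the host was up but both services were down, which points at the
operating system or its services and not at the network.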
The next morning when I powered up the server, I got the same
"could not start" message. I reset the system and booted to the
backup, minimal install system that I keep for system backup and
recovery purposes. I started Zip Central and NT spontaneously
rebooted. Subsequently, the backup system spontaneously rebooted
twice, when the blue-green background appeared. The fourth boot
to the backup system completed but a "Directcd.exe - Entry Point
Not Found" dialog box contained the message "The procedure entry
point CopyAcceleratorTableW could not be located in the dynamic
link library USER32.dll." Starting Zip Central generated a
corresponding message.
As the backup system was clearly not useable, I decided to try to
"repair" the main install in \WINNT. After going through the
three install floppies, I was prompted for the install CD. I
then got a series of messages telling me that specific files did
not match the original install file and asking if I wanted to
restore the file, skip it or restore all files. It was
immediately obvious these were the system files, upgraded by
various service packs since the original NT 4 CD. I skipped
each. At first the order appeared to be alphabetic but soon the
pattern ended. As files in various directories and not in any
logical sequence appeared, I decided there was no meaningful
order.
After approximately two hundred files, l_intl.nls was listed. I
restored this and then ended the install / repair procedure
ignoring messages that the install was not complete. I
successfully rebooted and was able to use the control panel to
change the system's IP address (so I could cover the NT server's
normal IP address, with a working web server). At first, the
system looked OK but when I started Zip Central, the system
spontaneously rebooted again. After completing the reboot, I
logged in again and again tried Zip Central. It started but soon
displayed a "Zip Central - Untitled" dialog box stating "Access
violation at address 00403DF8 in module 'ZIPCENTRAL.EXE' . Read
of Address." I was then interrupted and unable to return to the
machine until the following morning.
The machine had rebooted and was displaying the "Press Ctrl + Alt
+ Delete to login" dialog. I tried, the machine hung, and
eventually I pressed reset. The next reboot completed, and it was
clear from the Event Viewer, which turned out to be one of the
few programs that actually worked, that the machine had
spontaneously rebooted about three times since I left it. Notepad
and several other programs came up with errors similar to the one
Zip Central had displayed. Solitaire froze as soon as I tried to
move a card; I had to terminate sol.exe via the non-responsive
task dialog. Easy CD Creator and one other program caused
immediate spontaneous reboots when selected. The web and FTP
servers as well as NetBIOS services were not functioning.
By this point it was entirely clear that both systems were
thoroughly corrupted, probably as a result of the disk problems
indicated by the numerous correction attempts made by chkdsk. To
restore the systems, I'd need to do a fresh install of the backup
NT system and from that, restore the full system from recent
backups.
A few hours after I had installed the patch, I received a SANS
"Security Alert Consensus" message stating 'Finally, a number of
you wrote in about the Microsoft post-SP6a security "rollup"
patch we discussed in the last issue of SAC. It appears that the
"rollup" crashed a ton of systems and created a fair amount of
general chaos.' They go on to say "whenever possible, test
patches should be tried on nonproduction machines." Most
organizations don't have test servers essentially identical to
their production servers, and testing patches on dissimilar
systems is of little value.
The next paragraph is obviously obsolete as a result of learning
that bad memory was the underlying cause. NT did nothing to help
me identify that problem and I'm leaving the following paragraph
as it was written, based on knowledge just after the crash. The
bad memory was not in the machine for any other problems referred
to.
Despite the time this has already taken, I don't know if, a)
Microsoft's latest security patch resulted in total system
failure, causing damage as severe as a skilled intruder could
inflict, since by comparison an erased disk would be simple to
fix or b) a previously almost stable system (other problems over
the past year and a half are documented elsewhere) spontaneously
self-destructed. If a), there is no way to know, without perhaps
days of experimenting and testing, whether to restore to the
pre-patch state and avoid the patch or to restore to the post-patch
state. If b), then no course I take today can assure me this
won't recur tomorrow or perhaps four months from now.
As I write this, I am on the fourth draft of a very long (over
120 page) comparison of Linux, OpenBSD and Windows server
systems. In large part because of problems like these, plus
Microsoft's increasing licensing costs and extraordinarily poor
security record, I'd already decided not to upgrade my Windows NT
server to Windows 2000 or any successor. I accept the consensus
opinion that 2000 is better than NT but nothing I've seen or read
suggests it comes close to remedying the fundamental
architectural defects that make Windows so clearly inferior as a
server platform for my specific needs.
Since I do not intend to continue with Windows servers in my
business, I can see no reason to spend more time on what is now
an almost obsolete Microsoft product, rehashing the same kinds of
problems that I've extensively documented on this page and
elsewhere. I'll migrate the backup / CD-R functions to another
machine, most likely my NT workstation for the near term. When
time permits, I expect to mirror my web site to a third OS which
will be some as yet to be determined distribution of Linux or
perhaps FreeBSD.
Copyright © 2000 - 2014 by George Shaffer. This material may be
distributed only subject to the terms and conditions set forth in
http://GeodSoft.com/terms.htm
(or http://GeodSoft.com/cgi-bin/terms.pl).
These terms are subject to change. Distribution is subject to
the current terms, or at the choice of the distributor, those
in an earlier, digitally signed electronic copy of
http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the
time of the distribution. Distribution of substantively modified
versions of GeodSoft content is prohibited without the explicit written
permission of George Shaffer. Distribution of the work or derivatives
of the work, in whole or in part, for commercial purposes is prohibited
unless prior written permission is obtained from George Shaffer.
Distribution in accordance with these terms, for unrestricted and
uncompensated public access, non profit, or internal company use is
allowed.