Understanding and Troubleshooting Blue Screens

mikeblas

Understanding and Troubleshooting Reboots and Blue Screens

There are many posts from people trying to diagnose unexplained system reboots and blue screens. I wanted to write a guide to show some techniques for dealing with these problems because when I participate in such threads, I seem to be answering some of the same questions and giving the same background explanations repeatedly.

What Causes a Reboot?

There are a few different events that can cause your machine to spuriously reboot or shut down.

If your system has a flakey or overtaxed power supply or voltage regulator on the motherboard, for example, the machine may load the power supply or regulator enough to force its output below an acceptable voltage, causing the machine to power down. When it powers down, the load is lifted and the voltage goes back to normal, and the machine restarts. Obviously, this happens suddenly and without the knowledge of the operating system, so no record of the shutdown is kept. (In fact, you'll lose data not yet written to the drives.)

An uninitiated shutdown might occur because of a thermal management problem. If the processor detects it is overheating, it will shut itself down without writing any event or error that the operating system captures.

Less careful system builders might find that their reset or power switch isn't correctly wired, or has been damaged and is intermittently shorting to cause a reboot or shutdown.

You can separate these low-level hardware problems from device driver and configuration issues by looking for an entry in the system event log. If Windows has logged an event, it means that Windows knows why the shutdown is happening and is trying to tell you about it. If there's no event, the shutdown happened very suddenly within the hardware and Windows wasn't offered the chance to (or didn't complete the action of) writing something to the event log.

What is the Event Log?

Windows NT introduced the system event log as a place where the operating system, the various services in the OS, and device drivers can write information about an error or a status. There can be a large number of services running on a particular machine, many of which are critical to the operation of the machine or the network it is running on. When these services change important state or have an error to report, they do so in the system event log.

A Windows XP desktop machine has at least three event logs: the Application event log, the Security event log, and the System event log. Applications can create additional log types, too. When your machine encounters a critical error, the system will write information about the error to the system event log.

You can view the event log by right-clicking on your "My Computer" icon and selecting "Manage" from the resulting context menu. That will bring up the Computer Management applet:

ComputerManagement.jpg


You can expand the "System Tools" node, then the "Event Viewer" node to see the event logs your machine has. Double-clicking on the "System" node will show the system event log on the right hand side of the window.

Listed in that log are the most recent system events. Clicking on the headers lets you sort by event type, event name, and so on. If you have suffered a critical error, you'll see an error event in the log. Double-clicking on the log entry will open it in a new window.

A crash will always be reported as a system event with the source "System Error", as shown in this sample:

ErrorEvent.jpg


Double-clicking on the error event entry reveals a properties dialog for the event:

EventProperties.jpg


The error code and parameters all come together to describe the particular event. Clicking on the link in the description will send the event data (which is just the bug check code and its parameters) to Microsoft for identification. If the issue is known, you'll be shown a web page which explains the cause of the error.

You should copy the description to the clipboard by highlighting all the text and pressing CTRL+C. You can now insert the text into email you send to your hardware vendor's support, or into a post you make on a forum asking for help. Providing the complete message is critical, since it contains information that helps identify and diagnose the problem.

What if there are no Error Entries?

Windows allows you to configure how the system will react in response to a critical error. You can change these settings by right-clicking on your "My Computer" icon and choosing "Properties" from the pop-up menu. In the "System Properties" dialog, activate the "Advanced" tab and press the "Settings" button in the "Startup and Recovery" group box.

You'll get a "Startup and Recovery" dialog like this one for your efforts:

StartupRecovery.jpg


You'll want to make sure the "Write an event to the system log" box is marked. If it isn't, the system won't write a log entry at all!

You might want to unmark the "Automatically restart" box so that you'll see a message for the event before the system reboots. With the box unmarked, the system will stay at the blue screen display until someone manually resets it.

Unmarking this option can help diagnose a machine which you're using interactively, but it might not be a setting you wish to leave in place. If the machine is important, you'll probably want it to restart after it records an error message so that it comes back up and is running again, and visible over the network. A system that has halted for a critical error is no longer responding to network requests.

The settings for "Write debugging information" are interesting. If you're skilled with using a debugger yourself, or know someone who is, or want to get a dump to send to a vendor for support, you should create a memory dump. The dump contains some subset of your system's memory contents and information that will help a developer debug the problem.

Typically, the developer will want a "kernel memory dump" but may want a "complete memory dump". A "complete memory dump" is exactly what it sounds like: a copy of every byte of memory on your system. If you have two gigs of RAM, you'll get a two gigabyte file! By comparison, a minidump is very small (exactly 64 kilobytes) and often contains enough information to make a diagnosis, though in some cases it might not have everything required.
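To get a feel for how much disk space each dump setting needs, here is a minimal Python sketch of the arithmetic. The function name expected_dump_size and the dump type labels are made up for illustration, and the kernel dump case rests on my assumption that its size depends on how much kernel memory is in use when the crash happens, so no fixed number is returned for it.

```python
KIB = 1024
GIB = 1024 ** 3


def expected_dump_size(dump_type, ram_bytes):
    """Return a rough dump file size in bytes, or None if it can't be predicted."""
    if dump_type == "complete":
        # A complete dump is a copy of every byte of RAM (ignoring any small header).
        return ram_bytes
    if dump_type == "small":
        # The minidump size quoted in this guide: 64 kilobytes.
        return 64 * KIB
    if dump_type == "kernel":
        # Assumption: varies with the amount of kernel memory in use at crash time.
        return None
    raise ValueError(f"unknown dump type: {dump_type!r}")


# A machine with two gigs of RAM:
print(expected_dump_size("complete", 2 * GIB))
print(expected_dump_size("small", 2 * GIB))
```

Make sure the drive holding the dump file has at least that much free space before you ask for a complete dump.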

Debugging a dump file is beyond the scope of this post (for now).

If you've got your system recovery settings set to write a log event and you're still not finding a log after a reboot, you can deduce that the problem isn't visible to Windows. You'll want to find a lower-level cause for the issue, such as thermal management in the CPU, a flakey power supply, or a shorted reset switch. At this point, removing or swapping components, or running a memory tester is advisable.

Even if you do find that you're getting an error log entry, you might consider consistency. If the address and bug check code are different every time, the problem probably isn't with the drivers involved; the inconsistency instead suggests flakey memory.
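The reasoning in this section boils down to a small decision procedure. Here is a rough Python sketch of it, not a diagnostic tool. The function name triage and the (bug check code, module name) tuples are hypothetical; you'd fill them in by hand from the "System Error" entries you find in the event log. I compare the code and module rather than the address, which is a simplification.

```python
def triage(log_event_enabled, crashes):
    """Suggest a next step.

    crashes is a list of (bug_check_code, module_name) tuples, one per
    "System Error" entry found in the system event log.
    """
    if not crashes:
        if not log_event_enabled:
            # Without this setting, Windows writes no log entry at all.
            return "enable 'Write an event to the system log', then wait for the next crash"
        # Logging is on but nothing was recorded: Windows never saw the problem.
        return "no log entry: suspect hardware (thermal shutdown, power supply, reset switch)"
    if len(crashes) > 1 and len(set(crashes)) == len(crashes):
        # No two crashes look alike, so the drivers named are probably victims.
        return "every crash is different: suspect flakey memory and run a memory tester"
    # Otherwise, start with the crash that shows up most often.
    code, module = max(set(crashes), key=crashes.count)
    return f"investigate {module or 'the reported module'} (bug check 0x{code:08X})"


print(triage(True, []))
print(triage(True, [(0xFE, "usbhub.sys"), (0xFE, "usbhub.sys")]))
```

The module name usbhub.sys above is only sample data.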

What do I do with my Error Entry?

If you do have an error record, you'll want to know how to investigate it.

A critical system error goes by many names: a blue screen, a bug check screen, or a stop screen, for example. The critical error happens when a problem in software running in kernel mode is detected. The highly-trusted code which runs in kernel mode has the potential to cause significant system destabilization, so Windows carefully checks what it can for the code running at this level. It's highly unusual for application code to run in kernel mode; when application code faults, we usually see an application crash, usually in the form of an "unhandled exception" error.

Code in kernel mode includes, however, any device driver running on the system. As we all know, device drivers certainly aren't bug free. An outright bug in the driver might be causing your blue screen, or some action taken by the hardware might be causing events the driver isn't in a state to handle.

The operating system detects a variety of problems in drivers. The cause of the blue screen is given by a particular bug check code. This eight-digit hexadecimal number appears in the error log as well as on the blue screen, plus on the dialog that appears when your system restarts from a blue screen.

For example, a blue screen might have this text on top:

Code:
STOP: 0x00000079 (0x00000002, 0x00000001, 0x00000002, 0x00000000)

The number we are most interested in here is the first one: 0x00000079. On the error report dialog box, like this one:

ErrorSignature.jpg


the most important number we want is the "BCCode", which is an abbreviation for "bug check code". The BCCode is the same as the first number shown on the STOP message; the Bug Check parameters are the same as the four parameters on the STOP message.
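If you're collecting several of these, it can help to pull the numbers out of the STOP line mechanically. This is a minimal Python sketch that splits a STOP line like the example above into the bug check code and its four parameters. The names BugCheck, parse_stop_line, and format_stop_line are my own inventions for illustration; nothing here is a Windows API.

```python
import re
from typing import NamedTuple, Tuple


class BugCheck(NamedTuple):
    code: int                 # the bug check code (BCCode)
    params: Tuple[int, ...]   # the four bug check parameters


# "STOP:", a hex code, then a parenthesized, comma-separated parameter list.
_STOP_RE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(line):
    """Parse the text of a STOP line into a BugCheck."""
    match = _STOP_RE.search(line)
    if match is None:
        raise ValueError("not a STOP line")
    code = int(match.group(1), 16)
    params = tuple(int(p.strip(), 16) for p in match.group(2).split(","))
    if len(params) != 4:
        raise ValueError("expected four parameters")
    return BugCheck(code, params)


def format_stop_line(bug_check):
    """Render a BugCheck back into the eight-digit form used on the blue screen."""
    params = ", ".join(f"0x{p:08X}" for p in bug_check.params)
    return f"STOP: 0x{bug_check.code:08X} ({params})"


bc = parse_stop_line("STOP: 0x00000079 (0x00000002, 0x00000001, 0x00000002, 0x00000000)")
print(hex(bc.code), bc.params)
```

Keeping the code and parameters as numbers rather than text makes later comparisons immune to differences in case or zero padding.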

What the bug check code means is only of interest to someone debugging the problem. For example, the code might indicate that the driver tried to acquire a lock twice in a row without releasing it, or that a thread has entered the driver and not returned from the code in the driver for a long time. Obviously, there's nothing the user can do about these situations aside from trying to repair or replace the driver.

However, the bug check code will also help you identify the problem and describe it to your vendor's support staff, or search for clues about the problem on the web.

This blue screen shot is lifted from the Windows XP Resource kit:

BSOD.jpg


It shows a very valuable piece of information: in the section near the bottom labelled "driver information", Windows is telling us that it believes it knows which module is causing the problem. In this case, the fictitious sample WXYZ.SYS module is fingered as the culprit.

Which Driver?

Since repairing the code in the system is out of the question, our goal is to at least identify what driver is causing the problem. If we know what driver is flakey, we can either upgrade it (hoping the error is fixed in a newer version), adjust the settings of the associated hardware, or deduce that the problem is endemic to the system as a whole.

The report that is in the event log should contain the name of the module causing the fault, as in the blue screen example above. If you don't recognize it by name, you can search for the name on the web. You'll probably find other reports against the same module during your search, and may stumble into information about the problem.

If you can't find the name in a search, try looking at the version information in the file. Find the file on your system, then navigate to it in Windows Explorer. Right click on the file and bring up its properties. In the Properties dialog, activate the "Version" tab.

FileProperties.jpg


The manufacturer name and file description should be included here. Investigating the location and version information for the file might lead you to malware.

When searching for information about your blue screen, don't forget to also check the Microsoft Knowledge Base. Microsoft does listen to reports from customers who send in crash report logs, and if a solution (or identifiable cause) is found, the cause is written up for a KB article. It's worth noting that most of the reports Microsoft receives are caused by bad memory or unstable overclocked machines.

Also, make sure the article you're looking at describes exactly the critical error you've got. Just because the stop code number is the same doesn't mean the error is the same. Any driver can raise any stop code, so an article describing code 0x0000007E only pertains to your problem if it has the same parameters (the numbers after the stop code in the report), and the stop code was reported by the same module as mentioned in the article.

While you'll get a lot of hits searching the web for "critical error" followed by the error number, you'll also get lots of unrelated hits. The name of the driver (if available) and the four parameters that go along with the code will help you make sure you've identified exactly the right issue.
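The comparison I'm describing can be written down as a short check. This is a sketch in Python under my own assumptions: the CrashReport type and the match_level function are hypothetical names, and you'd type in the values yourself from your error entry and from the article you're reading.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CrashReport:
    code: int                      # the stop code
    params: Tuple[int, ...]        # the numbers after the stop code
    module: Optional[str] = None   # the driver named in the report, if any


def match_level(mine, article):
    """Say how closely an article's error matches the one on your machine."""
    if mine.code != article.code:
        return "different stop code"
    if mine.params != article.params:
        # Same number, but any driver can raise any stop code.
        return "same code, different parameters"
    if mine.module is None or article.module is None:
        return "same code and parameters, module unverified"
    if mine.module.lower() != article.module.lower():
        return "same code and parameters, different module"
    return "match"


mine = CrashReport(0x7E, (0xC0000005, 0x1, 0x2, 0x3), "wxyz.sys")
article = CrashReport(0x7E, (0xC0000005, 0x1, 0x2, 0x3), "WXYZ.SYS")
print(match_level(mine, article))
```

The parameter values in the usage lines are invented sample data; only a result of "match" means the article is describing exactly your error.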

But What is the Cause?

The cause of a blue screen is most likely a software bug in a device driver. The driver might simply have a coding error that causes it to do something wrong; Windows detects the mistake and shuts down the system.

But let's assume that the driver is correctly coded. What else could cause a blue screen?

Perhaps the driver is working, but the hardware isn't. Maybe the driver expects the hardware to respond to a signal by throwing a single interrupt. But the interrupt is thrown twice, and the driver ends up re-entering itself. This is a problem with the hardware, I suppose. But one could argue that a robust driver would deal with the eventuality of a doubled interrupt.

It's also possible that the driver is failing because of some other system problem. Say that your memory is flakey: if the driver writes a value to a memory location, then later reads it but the value has changed, it's likely that the driver will get sick and possibly crash.

Determining what to do ends up being similar to fixing any other problem: use good diagnostic logic to figure out what the real cause is. If the device works in another computer, then the device itself probably isn't bad. If a memory test reveals that your memory is working well, even under stress, then bad memory can probably be ruled out.

Outside cases include a flakey CPU, a bad motherboard (or, more accurately, malfunctioning chipset components), or a bad power supply. Recommending replacement of any of these components without first investigating the source of the blue screen is, in my opinion, foolhardy.

Some Specific Bug Check Codes

Some bug check codes could be thrown by any driver at any time. These require investigation; debugging a dump is the easiest way to perform such an investigation. A few other codes, however, are a bit easier to diagnose because they can only be thrown in certain circumstances.

Bug check 0x000000FE is one such example. This check is named BUGCODE_USB_DRIVER, and is thrown when a USB driver fails. The driver is obviously the USB driver on your system; you may need to update your USB drivers or your chipset drivers, depending on what is providing USB services on your machine.

Another common bug check code is 0x0000009C MACHINE_CHECK_EXCEPTION. The Machine Check Architecture, defined by Intel and implemented on both Intel and AMD processors, is a mechanism by which the processor checks the operating consistency of the machine. If the processor detects an unrecoverable error in its own cache memory, for example, it will throw a machine check exception. The processor will also throw exceptions for errors on the system memory or address busses, and a few other reasons. The references section of this document has links to documentation describing the Machine Check Architecture (MCA).

When a machine check exception is triggered for an unrecoverable error, the OS has no choice but to shut down the machine. It will report the errors, and you can look at the values reported and the documentation for the MCA to see if the report is giving an obvious cause.

Odds are, however, that there's no deterministic cause. The errors that the MCA reports are at the lowest level of the machine, and may have any of several causes. A wobbly or electrically noisy power supply may leave the machine running but cause a transient on the bus that's noted by the MCA hardware. Dust can accumulate on the motherboard or in sockets, then get damp, and cause a capacitive load that distorts the timing of high frequency signals. Parts can be on their way to a marginal failure, and so on. All of these issues (and many others) can cause machine check exceptions.

The good news is that MCA bug checks are almost certainly caused by a hardware problem. AMD provides a tool which will analyse the Machine Check record stored in the system event log and expand the information it contains.
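The two codes discussed in this section can be kept in a small lookup table so that a crash record points you at a starting place. This is only a Python sketch: the table holds just the two codes named above, the hint text is my paraphrase of this section, and the names NARROW_CODES and describe are made up.

```python
# Bug check codes that are only thrown in specific circumstances.
# Only the two codes covered in this section are listed.
NARROW_CODES = {
    0x000000FE: ("BUGCODE_USB_DRIVER",
                 "update your USB drivers or chipset drivers"),
    0x0000009C: ("MACHINE_CHECK_EXCEPTION",
                 "almost certainly hardware; read the MCA record in the event log"),
}


def describe(code):
    """Return a one-line starting point for a bug check code."""
    if code in NARROW_CODES:
        name, hint = NARROW_CODES[code]
        return f"0x{code:08X} {name}: {hint}"
    # Anything else could come from any driver at any time.
    return f"0x{code:08X}: could be any driver; investigate, ideally by debugging a dump"


print(describe(0x9C))
print(describe(0x7E))
```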

Summary

I hope this article helps you understand what blue screens are and gives you some insight into diagnosing them. In the future, I may expand the article to explain debugging dumps, though that's a great deal of content to write for free. I wrote this article because I'd like to see users take a methodical approach to debugging blue screens. Many people intend to be helpful, but end up posting guesses about the issue without even understanding which bug check code was involved or the working of the bug check mechanism. By making some observations before recommending a solution, I think we can offer better help to our friends who are having a hard time because of bug check errors.

References

You can read about bug checks in the Windows Device Driver Development Kit (DDK) which is documented in MSDN. These are the interesting sections:

Bug Check Codes - a list of bug check codes by number.

Blue Screen Data - a description of the format of a blue screen.

Other references:

Microsoft Online Crash Analysis - get automated results from crashes you've reported in the past.

The Windows Resource kits have good information about diagnosing and troubleshooting blue screens.

Intel Processors Machine Check Architectures in Microsoft Windows describes the Machine Check architecture as implemented by Intel processors.

The AMD Machine Check Analysis Tool (MCAT) will read an MCA event record from your Windows system event log and present the data it contains in a human-readable format.

More help:

Ranma_Sao has graciously offered to help debug dump files. In his post, he asks for very specific diagnostic information about the problem, and requests that you understand his time limitations and ability to respond. I can make a similar offer; if you can send along the same information that Ranma_Sao asks for, I'll be happy to take a look at the problem, as my schedule allows.

Disclaimer
This posting is provided "AS IS" with no warranties and confers no rights. I do not speak on behalf of Microsoft.
 
Nice write up. Here's my contribution.

Here is the anatomy of the BSOD.
http://www.microsoft.com/resources/...Windows/XP/all/reskit/en-us/prmd_stp_exvb.asp

The most important information is the Bugcheck and Driver information sections. The best course of action is to note down the STOP: 0x000000xx, THE_TEXT_THAT_LOOKS_LIKE_THIS, and the driver information (if any), then google for answers as to what is causing the error and how to fix it. Usually someone else has already had the same issue, and it will be documented all over the web.

Also don't forget to use the MS KB to search for answers to your BSOD. Go to www.microsoft.com and throw your STOP code in the search bar.
 
S1nF1xx said:
The best course of action is to note down the STOP: 0x000000xx, THE_TEXT_THAT_LOOKS_LIKE_THIS, and the driver information (if any), then google for answers as to what is causing the error and how to fix it. Usually someone else has already had the same issue, and it will be documented all over the web.

Thanks for the link. I don't have the XP Resource kit yet; I guess I should order it.

One of my key points is that using the stop code only isn't enough information. If you search only using that (because that's all you recommend writing down), you'll end up looking at unrelated pages and articles.
 
I agree. But the problem I run into with searching google and whatnot with the (0x000000xx, 0x000000xx, 0x000000xx, 0x000000xx) portion of it is that these codes are so computer specific you most of the time don't find anything. Ideally you'd want to search with this whole string first, but if you don't find anything you'll want to search for the generic STOP code, THE_TEXT_LIKE_THIS, and the driver (if listed).
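The search order described here (whole string first, then the generic pieces) could be sketched in Python like this. The function name build_queries and its arguments are made up for illustration, and what a search engine does with these strings is outside the sketch.

```python
def build_queries(code, params, name=None, driver=None):
    """Return search strings, most specific first."""
    stop = f"STOP: 0x{code:08X}"
    param_text = ", ".join(f"0x{p:08X}" for p in params)
    # First try: the whole string exactly as it appears on the blue screen.
    queries = [f"{stop} ({param_text})"]
    # Fallback: the generic STOP code, its symbolic name, and the driver (if listed).
    generic = " ".join(part for part in (stop, name, driver) if part)
    queries.append(generic)
    return queries


for query in build_queries(0x79, (2, 1, 2, 0), name="THE_TEXT_LIKE_THIS"):
    print(query)
```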
 
Nice write up.

I would also add the information S1nF1xx posted. Sometimes you can't see the event viewer, depending on the error message, so all you get is the BSOD itself.

What you guys are calling excess info can be critical to resolving the exact error. Yes, often the stuff in the (0x0000000, 0x0000000, 0x0000000, 0x0000000) is extraneous, but not always. As all of the info we are discussing is relevant, it should all be collected.

Also, I would link to Ranma_Sao's "help me; help you" thread where he offers to actually diagnose the BSOD dump files. VERY helpful. Link.
 
S1nF1xx said:
I agree. But the problem I run into with searching google and whatnot with the (0x000000xx, 0x000000xx, 0x000000xx, 0x000000xx) portion of it is that these codes are so computer specific you most of the time don't find anything. Ideally you'd want to search with this whole string first, but if you don't find anything you'll want to search for the generic STOP code, THE_TEXT_LIKE_THIS, and the driver (if listed).

The point isn't to search with them; it's to compare them to what you find to make sure you're looking at the same error.

The first couple of values are codes that explain more about the error. If the error is an unhandled exception, for example, the first value indicates what exception wasn't handled. This certainly is not computer-specific, and it helps to disambiguate the error message.

What you're calling THE_TEXT_LIKE_THIS is the symbolic name of the stop code.

I guess what you're really saying is that there's more knowledge about the error content that, if people knew it, would make them more effective at searching for information about the codes.
 
I really hate dirtying up a nice thread like this. :eek:

What I'm trying to/should have said, is that the entire STOP code is important, but if you're having trouble finding that exact code (which in my experience is almost always the case), the first part STOP: 0x000000xx is the generic "in a nutshell" error you're getting and to start there in your queries.

I should have just posted my link and STFU. :p
 
I've updated this, applying the good feedback I got. It's very hard to write content for this forum because there's no spell checker and no good WYSIWYG editing, so I hope this post is still readable. Maybe I'll just convert it to a Word document and publish it as a PDF.
 
I have updated this doc with a new section and some small edits for style.
 
I work for *edit, a random university* tech support, and although I am still a student, that guide was truly helpful. We have a laptop university, where every student has a laptop, and most of them are on Windows. The amount of blue screens is unbearable. Thanks a ton for your help on this problem that has been plaguing us for a while. Much appreciated.
 
Thank you for your kind words. I've taken a quick pass through the document to fix up a few typos and clarify a few paragraphs.

This post has been pretty well-received, and I've been thinking about writing something about how Windows memory management works.
 
Because of the arbitrary post size limit in the forum software here, I have to split the post into more than one part. That's pretty awkward, since they can't even be consecutive posts in the thread. For now, I've hacked-out the revision history:


Version 1.0, 22 October, 2005
  • First version.
Version 1.1, 25 October, 2005
  • Respond to feedback
  • Edit for clarity, typos
  • Moar Links!!1!
  • Disclaimer
Version 1.11, 28 October, 2005
  • Fix title
Version 1.20, 12 December, 2005
  • Fix "Bug Check Codes" link
  • Small edits for style
  • Add "Specific Bug Check Codes" section
Version 1.21, 5 March, 2006
  • cleanup, typos
Version 1.22, 19 March, 2006
  • Add link to AMD tool
 
good stuff mikeblas.

maybe a mod can do something about putting another post between your first post and the first reply, so that you can edit for more info?
 
Excellent write-up. Thank you for a well-written, easily understandable article.

Basically, this post alone helped me move from the, "WTF, why is my computer crashing?" stage to the, "I have some idea of what's going on" phase.

A million thanks.
Bryan
 
Correct me if I'm wrong, but if an operating system is properly written, then it should never crash when a program that runs on it, does.

Wasn't that one of the big selling points of the Windows NT kernel vs. Windows 98, which was just a shell on the DOS operating system?

.
 
GoHack said:
Correct me if I'm wrong, but if an operating system is properly written, then it should never crash when a program that runs on it, does.
Just run the checked build of XP from your MSDN subscription set to show yourself the errors that the drivers (and application software!) you use every day cause. The vast majority of these are detected and handled by the retail OS.
 