Analysis of Hardware Failures on 1M Consumer PCs

HardOCP News

How many hardware failures are due to overclocking? Underclocking? Are desktops more reliable than laptops? Name brand versus generic systems? If you have the time, I recommend you give this 14-page report from Microsoft a read. Thanks to pelo for the link.

A machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM, and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age.
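A quick back-of-the-envelope check on the numbers quoted above (taken from the abstract; this is just the arithmetic, not anything from the paper's methodology):

```python
# Baseline crash probability vs. the conditional probability of a
# second crash, using the figures quoted in the abstract.
p_first = 1 / 190    # CPU-subsystem crash, machines with >= 30 days CPU time
p_second = 1 / 3.3   # second crash, given the machine already crashed once

ratio = p_second / p_first
print(f"~{ratio:.0f}x more likely to crash again after the first crash")
```

That works out to roughly 58x, consistent with the paper's "up to two orders of magnitude" framing.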
 
It was interesting, but all I got from it (whether it was meant or not...) was that overclocking caused a tremendous amount of the issues they witnessed.

They said laptops and brand named desktops were more reliable than "white box" machines...well duh, laptops and brand named desktops can't *USUALLY* be overclocked or altered too much.

DRAM and CPU failures all fall into the realm of overclocking, and of course the rates and frequencies would go hand-in-hand with quick failures...that's burn-in/testing!
 
They should filter out results where the CPU is running on non-default clock settings.

Overclocking probably ruins any chance at having usable CPU data.
 
I know I've accidentally hit the "Send error report" button after my box came back up from a failed overclock attempt. I would bet those numbers are very skewed because of overclockers.
 
I thought this had been known for a while, unless Microsoft is just following up on their earlier data. Several years ago the Windows report feature sniffed out a lot of white box dealers selling overclocked systems (advertising a system with a 2.4GHz CPU that's really an overclocked 2GHz part with the stock HSF) because they crashed in ways impossible for software alone.
 
It's a somewhat interesting read, but I think they should have spent more time on disk failures. My shop sees far more disk failures than hardware failures of all other types combined (in the ballpark of 10:1).
 
My shop sees far more disk failures than hardware failures of all other types combined (in the ballpark of 10:1).

I suspect that you see more disk failure at your shop because CPU/Memory failures can usually be "fixed" after a reboot while disk failures are of the permanent kind.
 
I know I've accidentally hit the "Send error report" button after my box came back up from a failed overclock attempt. I would bet those numbers are very skewed because of overclockers.

The whole idea of overclocking is to push the CPU and other systems "too far" in order to find what the limits are. You're repeatedly crashing your system on purpose until you've found good, stable settings.
 
The whole idea of overclocking is to push the CPU and other systems "too far" in order to find what the limits are. You're repeatedly crashing your system on purpose until you've found good, stable settings.

Orly? Please tell me more about this "overclocking" you speak of.
 
The whole idea of overclocking is to push the CPU and other systems "too far" in order to find what the limits are. You're repeatedly crashing your system on purpose until you've found good, stable settings.

I wouldn't say "repeatedly" crashing your machine, but yeah, you're bound to crash a couple times if you're really trying to push it.
 
Orly? Please tell me more about this "overclocking" you speak of.

I can help with that:
[image: pmiller2005tribergclocks.jpg]

The clocks that are higher on the wall are over the lower clocks. They are overclocked and the lower ones are underclocked.
 
So who is CPU Vendor A with the 20x failure rate when OC and who is Vendor B with 4x?
 
It was interesting, but all I got from it (whether it was meant or not...) was that overclocking caused a tremendous amount of the issues they witnessed.

They said laptops and brand named desktops were more reliable than "white box" machines...well duh, laptops and brand named desktops can't *USUALLY* be overclocked or altered too much.

DRAM and CPU failures all fall into the realm of overclocking, and of course the rates and frequencies would go hand-in-hand with quick failures...that's burn-in/testing!

Nononono. Laptops and brand named desktops were more reliable than the "white box" machines while ignoring the overclocked "white box" machines. Basically, whether you overclock or not, the other stuff is more reliable.

The DRAM and CPU failure numbers are provided with OC'd versus not OC'd.

You've got to read a bit more thoroughly.
 
Nononono. Laptops and brand named desktops were more reliable than the "white box" machines while ignoring the overclocked "white box" machines. Basically, whether you overclock or not, the other stuff is more reliable.

The DRAM and CPU failure numbers are provided with OC'd versus not OC'd.

You've got to read a bit more thoroughly.

It's been a long time since I read it, but I recall seeing on Tom's Hardware Guide ~2001 that they used to note the clock speed was set just a tiny bit higher than the correct spec for a particular chip, to make a motherboard seem like a better performer among the competition. It's possible that practice continues to this day, and that the overclock is so small it falls within Microsoft's ±5% tolerance for what isn't considered overclocked. On the other hand, a retail machine might be slightly underclocked (but still within 5%) to reduce the potential for support calls and the expenses associated with failures, since OEMs don't worry as much about performance but are concerned about having to replace parts. If things like that are going on behind the scenes, it might explain why a retail PC has fewer problems. It'd almost have to be that, since pretty much everyone buys the same parts from the same major players.
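The ±5% tolerance described above could be sketched like this (an illustrative guess at the classification rule, not the study's actual code; the function name and threshold default are mine):

```python
def classify_clock(measured_mhz: float, rated_mhz: float, tol: float = 0.05) -> str:
    """Classify a machine's clock relative to its rated speed, using the
    +/-5% tolerance attributed to the Microsoft study. Sketch only."""
    deviation = (measured_mhz - rated_mhz) / rated_mhz
    if deviation > tol:
        return "overclocked"
    if deviation < -tol:
        return "underclocked"
    return "stock"

# A board that ships 2% fast would still count as "stock" under this rule:
print(classify_clock(2040, 2000))  # -> stock
print(classify_clock(2400, 2000))  # -> overclocked
```

Under a rule like this, both the motherboard vendor's tiny factory overclock and the OEM's slight underclock would land in the "stock" bucket, which is the point being made above.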
 
Quality control, quality assessment

We've seen this first hand with some companies over the last few years. There's a reason Asus has been slipping: they're just as prone to motherboard failures as other companies. With big OEMs it's a different story. The machines are packaged together and must work (they preinstall the OS and applications), whereas PC builders buy parts that may offer better features but aren't necessarily as thoroughly tested.

The laptop statistic is what struck me as the most odd. The preconfigured/prepackaged/prebuilt machines being more reliable didn't surprise me, but laptops? I guess that could potentially have to do with the PC sales figures being dominated by laptops, so they're tested more stringently? Who knows. :)
 
The laptop statistic is what struck me as the most odd. The preconfigured/prepackaged/prebuilt machines being more reliable didn't surprise me, but laptops? I guess that could potentially have to do with the PC sales figures being dominated by laptops, so they're tested more stringently? Who knows. :)

I didn't understand this either. Maybe more bottom-of-the-barrel desktops are sold? Because laptops would certainly seem to fail more often/be more failure prone.
 
Did they include stats of how many systems Windows Update hosed because of a driver update they recommended downloading from Microsoft Update?
 
While a crash from overclocking looks to them like a hardware failure, it is not failed hardware. Sure, it crashes but change back to default settings and it is fine. They may actually fail at a higher rate down the road though.

They indicated that they could only detect DRAM errors that were in a certain 1.5% of kernel memory. From my experience, DRAM is the most common hardware failure among new machines, but as the machine ages hard drives become an increasingly common failure. This is not counting laptop hard drives, which I typically just assume are caused by someone dropping said laptop.

Also, the fact that laptops are less likely to fail seems strange to me. The reasons I guess would be that most people buy the cheapest laptop they can find which is usually running an older, slower, cooler CPU on an established process node. I know a lot of people running laptops but don't know anyone who bought one with four cores. Also, people tend to break their cheap laptops physically (hard drive shock, broken screen/hinges, and broken power jack). They treat them like commodity items so the laptop doesn't last long enough to develop a real hardware failure.
 
What a stupid paper.

Look at hard disk drive manufacturers. They are decreasing warranty periods from a standard 5yrs to now 1-2yrs.

These hard drive makers act as if they are working with new technology here. So many years of hdd knowledge and yet drives are getting shorter warranty support, and are failing more often.

I believe they are purposefully making expensive and faulty HDDs to push people towards storing data in the cloud. The cloud will reduce privacy 100%. No physical access to your data. Pathetic. Complete security risk.

Additionally, creating drives/hardware that don't last a long time creates an environmental problem. Their objective should be "let's create something that works and lasts". They just want you to come back and buy another drive when it fails, waste money on RAID systems, and in general unnecessarily re-buy hardware that you'd otherwise keep using.
 
Also, people tend to break their cheap laptops physically (hard drive shock, broken screen/hinges, and broken power jack). They treat them like commodity items so the laptop doesn't last long enough to develop a real hardware failure.
Seems likely, I've seen a lot of this personally. It's easy to confuse "not working" with "hardware failure" in terms of CPU faults, etc. and in that respect laptops might win out. I still maintain desktops have fewer issues over their lifetime for most people, but low-end PCs are definitely commodity items and a cheap laptop makes more sense to most folks than plugging in a tower, monitor, keyboard, mouse etc.
 
Just great, there go my aspirations to OC my next PC. :( :mad:

Microsoft said:
Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
The part that surprised me the most (really, the only thing that surprised me from MS' study) is that laptops were more reliable than desktops. WTF?!
 
Why do you think that would be? How easy is it to overclock a mobile CPU? Maybe there's a correlation...
 
Why do you think that would be? How easy is it to overclock a mobile CPU? Maybe there's a correlation...
On that note, I would've liked to have seen stock clocked PC vs. stock clocked laptop (which would be all laptops).

The reason I claimed I was surprised that laptops were more reliable than PCs is b/c laptops typically run hotter due to their smaller footprint and are more susceptible to being dropped, unlike PCs.
 