Troubleshooting random reboots, Debian 6 + Xeon E5-1650

ctrlbrk

Hi guys,

I've built a new server (specs below) to ship off to my datacenter for co-lo, to replace a rock-solid but older piece of hardware.

The new server is randomly rebooting. I could use some advice on narrowing it down.

I've built hundreds of servers in my former line of work, so this is not my first dance -- but these days I work for myself (trading) and don't have piles of hardware lying around like I used to at the office job. So I can't just swap in a new set of hardware and see if the problem goes away.

Specs:
SuperMicro SuperServer 5027R-WRF
SuperMicro X9SRW-F
Intel Xeon E5-1650
64GB Kingston DDR3 1600 ECC REG CL11 1.5V (2 x KVR1600D3D4R11SK4/32G)
4 x WD 500GB RE4 (WD5003ABYX)

Debian 6
Kernel 3.2.0 (installed from backports)

I have hammered on this system hard with everything passing with flying colors. I was going to ship it off to the datacenter within a few hours, and then out of the blue it rebooted. With no load, nothing going on, zero --- I heard it reboot from the other room.

This is the second time the system has randomly rebooted. The first time was the very first time I ran sysbench on it as a burn-in test. I chalked it up to a sysbench problem because it rebooted within seconds of me pressing enter on the test. I proceeded to hammer on sysbench for days with huge load and it was completely stable.

Then the other day it was just idle with nothing going on, and bam - reboot.

I checked all the logs, nothing. There is no kernel panic. There is absolutely zero to go on. This would seem to point to hardware, yet I have pounded on this system for the last couple of weeks and have never had a single problem.

I can't ship this to the datacenter with a random reboot problem.

It has completed over 5 passes of memtest without errors. I will let it go to ten; that alone will take another couple of days.

Things I have thought of:
1) Reduce to 1 stick of memory and see if the problem goes away. The trouble is that with the problem manifesting only randomly, and only every couple of weeks, trying to narrow it down across 8 sticks of memory could take months. Plus they pass memtest.

2) Back up the current system (clonezilla) and re-install a different kernel or distro, and see if the problem manifests itself. Again, this is really not ideal because I want to run Debian 6. I could revert to the 2.6.32 kernel, but there were some nice speed improvements with kernel 3.2.0. And how long do I test on the 2.6.32 kernel before I feel assured the problem was with 3.2.0? Weeks? Months? All the while, this server is sitting in my house instead of at the co-lo, so it is costing me money.
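To put rough numbers on the "weeks? months?" question for both options, here is a back-of-the-envelope sketch. It rests on a strong assumption that is not established anywhere in this thread: that the reboots behave like a Poisson process with a mean of about 14 days between events ("every couple weeks"). The function names are made up for illustration.

```python
import math


def quiet_days_needed(mean_days_between_reboots, confidence=0.95):
    """Days of clean uptime needed before a quiet run is unlikely to be luck.

    Assumption: reboots arrive as a Poisson process, so the chance that a
    still-faulty box shows no reboot in t days is exp(-t / mean).
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return -mean_days_between_reboots * math.log(1 - confidence)


def bisection_rounds(stick_count):
    """Rounds needed to isolate one bad stick by halving the set each round."""
    return math.ceil(math.log2(stick_count))


if __name__ == "__main__":
    per_round = quiet_days_needed(14)  # "every couple weeks" taken as 14 days
    rounds = bisection_rounds(8)       # 8 sticks installed
    print(f"{per_round:.0f} days per test, {rounds} rounds, "
          f"up to {per_round * rounds:.0f} days in total")
```

Under those assumptions a single test (one kernel, or one half of the memory) needs about 42 quiet days before it means much, and halving 8 sticks three times could take around four months. With only two reboots observed so far, the 14-day figure itself is very uncertain.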

Other suggestions? I checked /proc/sys/kernel/panic and it was 0, which means it should not automatically reboot during a panic (plus there is no log anywhere indicating a panic).
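For anyone wanting to repeat that check, a minimal sketch that reads the panic-related sysctls in one go. It assumes the standard `/proc/sys/kernel/panic` and `/proc/sys/kernel/panic_on_oops` files are present; the helper names are hypothetical, and a clean result here only rules out the kernel deliberately rebooting itself after a panic, not a hardware reset.

```python
from pathlib import Path

# Kernel sysctls that decide what happens after a panic or an oops.
PANIC_KNOBS = ("panic", "panic_on_oops")


def read_panic_settings(base="/proc/sys/kernel"):
    """Return {knob: int or None} for each panic-related sysctl under base."""
    settings = {}
    for knob in PANIC_KNOBS:
        try:
            settings[knob] = int((Path(base) / knob).read_text().strip())
        except (OSError, ValueError):
            settings[knob] = None  # knob missing or unreadable on this kernel
    return settings


def describe(settings):
    """Turn the raw values into one plain-English line per knob."""
    lines = []
    panic = settings.get("panic")
    if panic is None:
        lines.append("panic: could not be read")
    elif panic > 0:
        lines.append(f"panic: kernel reboots {panic} s after a panic")
    elif panic == 0:
        lines.append("panic: 0, kernel hangs on a panic instead of rebooting")
    else:
        lines.append(f"panic: {panic}, check the sysctl docs for this kernel")

    oops = settings.get("panic_on_oops")
    if oops is None:
        lines.append("panic_on_oops: could not be read")
    elif oops:
        lines.append("panic_on_oops: an oops is escalated to a panic")
    else:
        lines.append("panic_on_oops: 0, kernel tries to continue after an oops")
    return lines


if __name__ == "__main__":
    for line in describe(read_panic_settings()):
        print(line)
```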
 
Test the PSU under load, and then check memory and thermal compound. In my experience random reboots are usually PSU related.
 

I have tested everything under load as best I can using a combination of concurrent sysbench passes. Perhaps you can clarify an exact test procedure?

This is also a 1+1 HRP PSU, so I think it reduces the chances that the PSU is at fault. That also reminds me, I checked the logs during POST to check for any errors reported, and there are none. Also, if the PSU tripped an alarm state (low voltage, etc.) the chassis would have thrown an alert on the front panel.
 
You've got to take the PSU out and put it on a tester. A simple trick to try is to load up the BIOS and monitor the voltages; if you see them swinging around, replace the PSU.
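The same voltage watch can be roughed out from the running OS rather than the BIOS screen. This is only a sketch, and it assumes the board's sensor chip is exposed through the Linux hwmon sysfs interface as `/sys/class/hwmon/hwmon*/in*_input` (values in millivolts). On some kernels the files sit one level deeper under `device/`, and on many server boards the voltages are only visible through the BMC, in which case nothing will be found. The function names are hypothetical.

```python
import glob
import os
import time


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {sensor path: volts} for every inN_input file found."""
    readings = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "in*_input")
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                readings[path] = int(f.read().strip()) / 1000.0  # millivolts -> volts
        except (OSError, ValueError):
            continue  # sensor vanished or returned garbage; skip this reading
    return readings


def track_swing(samples):
    """Given a list of {sensor: volts} dicts, return {sensor: (min, max)}."""
    swing = {}
    for sample in samples:
        for sensor, volts in sample.items():
            lo, hi = swing.get(sensor, (volts, volts))
            swing[sensor] = (min(lo, volts), max(hi, volts))
    return swing


def monitor(duration_s=60, interval_s=1.0, hwmon_root="/sys/class/hwmon"):
    """Sample all voltage inputs for duration_s seconds and print the swing."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(read_voltages(hwmon_root))
        time.sleep(interval_s)
    for sensor, (lo, hi) in sorted(track_swing(samples).items()):
        print(f"{sensor}: min {lo:.3f} V, max {hi:.3f} V, swing {hi - lo:.3f} V")
```

Calling `monitor(3600)` once at idle and once during a sysbench run would give two swings to compare. A one-second sampling interval will miss short dips, so a clean result here does not clear the PSU.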
 

And you've seen this type of PSU problem with the symptoms I've described? Running at full load for days on end without a problem, and then sitting perfectly idle for hours on end when all of a sudden it randomly reboots? I was not even logged in to the console, and no tasks were running.
 
It's not the first time it's ever happened and it won't be the last.

Can you recommend a tester that will work with my config - redundant power (HRP)? I guess it's more a test of the module itself; can I test both PSUs installed, as well as each individually, by using the 24-pin header?
 