New server, randomly freezes?

Kainzo · Oct 1, 2012

So I'm hosting a new server for Minecraft and everything seems fine for several days to weeks. Randomly it will just "lose" internet at the host and require a power cycle to come back online. (Edit

To go more in depth - the machine itself does not respond to SSH or any other connections. When I use my IP-KVM (SpiderDuo), the machine does not respond to keyboard commands. For lack of a better word, it's as if it is powered off, but the power management states it has power, and doing a power cycle brings it back (after shutting down and coming back on).

Debian 7.0
100Mbps
3-6TB monthly usage

Update: below is the list of components used in this server (from Newegg)
Antec 4U Case
Intel BOXDX79SI MOBO
i7 3960x CPU
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106015
Low Profile GPU
32GB of G-Skill 1600 DDR3
Intel RAID SATA 8

I've attempted to gather logs, but I don't see anything out of the ordinary. Here's what I've tried so far:
I ran memtest86+ for 90 hours at my house before shipping the machine off to a colo (oplink.net). There were zero errors, and it went over the RAM 6-10 times during that run.
I've attempted to go back to the default kernel on Debian 7.0, thinking that my "dev" kernel may have had a leak or some other unresolved issue.

Edit 1: I'm using the standard Kernel (3.2) for Deb 7 and it is still having issues.
Edit 2: I have attempted to enable SpeedStep and C3/C5 in the BIOS - but it has still crashed/frozen.
Edit 3: I am unable to use cpufreq-info - it tells me the CPU drivers are unknown.
Edit 4: I am currently testing the ram at 1333 instead of 1600.
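
On Edit 3: when no frequency-scaling driver is bound to the CPU, the kernel does not create the `cpufreq` directory under `/sys/devices/system/cpu/cpuN/`, which would be consistent with cpufreq-info reporting an unknown driver. A minimal sketch for checking this directly from sysfs, assuming a Linux host; the function name `cpufreq_status` is made up for illustration:

```python
from pathlib import Path

def cpufreq_status(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the scaling driver and governor for one CPU.

    Returns None when the cpufreq directory is absent, which is what
    you see when no cpufreq driver is loaded for that CPU.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    if not base.is_dir():
        return None

    def read(name):
        # Each attribute is a small text file; report "unknown" if missing.
        p = base / name
        return p.read_text().strip() if p.is_file() else "unknown"

    return {
        "driver": read("scaling_driver"),
        "governor": read("scaling_governor"),
    }
```

Calling `cpufreq_status()` on the server and getting `None` back would point at a missing driver module rather than a BIOS setting, though that is only one possible explanation.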

Having to handle this remotely, I'm not sure where to go next. The datacenter guys have been great, but they aren't paid to troubleshoot my machine. Any thoughts?
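
Since the normal logs show nothing before a freeze, one option is a small heartbeat script that appends a timestamped line to a file and forces it to disk, so the last line written before a hang survives the power cycle. This is only a sketch under assumptions: the log path, the 60-second interval, and the function names are all hypothetical, and load average is just one example of what could be recorded.

```python
import os
import time
from datetime import datetime, timezone

def heartbeat_line(now=None):
    """Build one log line: UTC timestamp plus 1/5/15-minute load averages."""
    now = now or datetime.now(timezone.utc)
    load1, load5, load15 = os.getloadavg()
    return f"{now.isoformat()} load={load1:.2f},{load5:.2f},{load15:.2f}\n"

def write_heartbeat(path, line):
    """Append a line and fsync it so it is on disk before any freeze."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())

def run(path="/var/log/heartbeat.log", interval_s=60):
    """Write a heartbeat forever; path and interval are assumed values."""
    while True:
        write_heartbeat(path, heartbeat_line())
        time.sleep(interval_s)
```

Starting `run()` from a background service and reading the last line of the file after each power cycle gives an approximate freeze time and the load just before it.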

YeOldeStonecat · Oct 1, 2012

Really hard to say. Is this a home built cloner server? Or a tier-1 brand server?
What hardware components?
Look at BIOS resets and BIOS updates. Is there a RAID controller? Any firmware updates for the controller? Is this hardware platform certified to support Debian? Did Debian run on this hardware for a while before you shipped it off, as a good burn-in and compatibility test?

Kainzo · Nov 2, 2012

Sorry for the mega late reply.
We're running an Intel RAID controller and set up a RAID 6. I noticed that the RAID card got very hot, almost burning me when I got near the heatsink.

This machine has been having issues that I thought I had resolved. It went one week without freezing, froze once, and then I changed some settings and it didn't freeze again for another week.

Now it has frozen 5 times at the host, twice in the last few days. It's in a data center, so I'm not sure what I can do to troubleshoot from here without incurring some hefty fees.

RocketTech · Nov 2, 2012

I was thinking cooling or Power Supply. Maybe ask if they can move it to the bottom of a rack?

boss6021 · Nov 2, 2012

What brand of NICs? I had a client that had Realtek NICs and would lose connection frequently. If you have some Intel NICs lying around, I would give them a shot.

Mackintire · Nov 2, 2012

You didn't say what the RAID card was. If it is a PERC you need mojo air flow to keep it stable. That's why they are only certified to run in Dell servers that have the required airflow. I'm sure that it will work in other servers, but only if it has the minimum required airflow.

Again I am grabbing at straws here until you disclose the RAID card brand and model.

green91 · Nov 3, 2012

While not Linux related, I have run into issues with Intel RAID in the past causing 15+ second freezes. It ended up being RAID driver related.

jadams · Jan 5, 2013

Did you resolve this?

I think I always stumble upon your Minecraft threads. I was in your other thread about CPU usage and I saw your more recent one regarding a DDOS.

Just before we shut down our MC server we were also getting freezes. The MC server would not respond, nor would the machine respond to SSH connections. At this point I really don't remember how the machine responded at the console. The only thing I remember seeing was that it was pegging two of the host machine's cores. It was a VM, so I was able to access it remotely via the VM host even without an SSH connection. I just booted it back up. I'll keep an eye on it and see if it happens again.

I don't think it's hardware related though, as none of my other *nix VMs have this problem. We were running ours on Ubuntu.

Now I only host the server, and one of my buddies runs the server and handles the OS. He was never able to figure it out.

Kainzo · Jan 5, 2013

This one was a CRAZY one to resolve. It was actually the video card drivers being "too up to date", and we had to roll back to very stable ones. It was a good catch by my Linux sysadmin.

We do not "use" a GPU, but it's required to get KVM access at our old host.

jadams · Jan 5, 2013

interesting, i wonder what gpu is used in our vm's

Kainzo · Jan 6, 2013

jadams said:
interesting, i wonder what gpu is used in our vm's

We were using Nvidia 8400? I think. It was a low-profile card.

Kainzo · Feb 19, 2013

Alright - I was wrong. The GPU wasn't the only issue (if it was an issue at all); the machine is still constantly freezing. Its most common freeze times seem to be between 1AM and 9AM, but I think it has frozen at all hours.

I have changed settings in the BIOS to re-enable SpeedStep and the ability for the CPU to put cores to sleep - same issue.

I since have re-enabled those settings in the BIOS and downclocked the RAM to 1333 to ensure the RAM wasn't too strained.

I'm up for any suggestions, and I'm just about to drop $4500-5000 on a new machine so I don't have to worry about it anymore.
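
To check whether the 1AM to 9AM pattern is real rather than an impression, the freeze times can be bucketed by hour of day. A rough sketch, assuming the freeze times have been written down somewhere as ISO-format strings; the function names are hypothetical and no real freeze data is included here:

```python
from collections import Counter
from datetime import datetime

def freezes_by_hour(timestamps):
    """Count freezes per hour of day (0-23) from ISO-format timestamps."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

def share_in_window(timestamps, start_hour=1, end_hour=9):
    """Fraction of freezes whose hour falls in [start_hour, end_hour)."""
    if not timestamps:
        return 0.0
    counts = freezes_by_hour(timestamps)
    inside = sum(n for hour, n in counts.items()
                 if start_hour <= hour < end_hour)
    return inside / len(timestamps)
```

If freezes were spread evenly over the day, about a third of them would land in an 8-hour window, so a share well above that would suggest something scheduled (backups, cron jobs, a nightly consistency check) is worth looking at. With only a handful of freezes the number is not conclusive either way.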

Kainzo · Feb 19, 2013

Update: below is the list of components used in this server (from Newegg)
Antec 4U Case

i7 3960x CPU
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106015
Low Profile GPU
32GB of G-Skill 1600 DDR3
Intel RAID SATA 8

Mackintire · Feb 19, 2013

Be specific... is your machine only losing connectivity, or is it hard locking, i.e. no one can log in locally? This is where iLO starts to really make a difference.

Your Intel RAID card is a clone of the LSI SAS9260-8I.

If you do not have the RAID console installed you need to install it to see what's going on with the drives and card. The debug log is quite useful.

You said you ran Memtest86+ for 90 hours, and that is good. But if you didn't run tests 3, 4, 5 and 7 exclusively for a minimum of 20+ passes each, you could miss quite a lot. Some of those tests stress certain features and components, and not immediately repeating a given test can leave those "items/objects" less stressed than they would have been if you ran a single test at a time.

The motherboard you used is also bleeding edge. You could have issues with Linux support on that motherboard.

SpeedStep and dynamic turbo are two of the bigger possibilities. If you are using the factory heatsink on your CPU, there's a possibility that the unit could be thermally heading north and the dynamic turbo feature is going a little crazy. I've seen this occur when using older C++ code compiled on Borland version 4 and 5, or on an older OS like Windows XP.

My suggestions:

Install the RAID console... the LSI one should also work. Force a consistency check run and verify the output in the debug log.

Turn off Turbo, SpeedStep and Hyper-Threading. Re-enable them one at a time, giving it a week between changes.

Rerun Memtest86+ v4.2, running tests 3, 4, 5 and 7.

Outside of what has already been suggested... add in two Intel NIC cards and disable the onboard NICs. I've seen strange issues where the motherboard NICs are going bad and it takes out the entire board's connectivity.

This is just another reason there are desktops and Servers and the hardware is a little different between the two. It gets really expensive when you have to go down to the datacenter to fix things repetitively.
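
On the thermal theory above: where the platform exposes thermal zones, Linux publishes their temperatures under `/sys/class/thermal/thermal_zone*/temp` in millidegrees Celsius. Logging those values periodically would show whether temperature climbs before a hang. A minimal sketch, assuming this board and kernel expose at least one zone (they may not); the function name is hypothetical:

```python
from pathlib import Path

def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            # The kernel reports millidegrees Celsius as a plain integer.
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable; skip it
        temps[zone.name] = millideg / 1000.0
    return temps
```

An empty result means no zones are exposed, in which case a tool built on the hardware monitoring drivers would be the next thing to try.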

Kainzo · Feb 19, 2013

To go more in depth - the machine itself does not respond to SSH or any other connections. When I use my IP-KVM (SpiderDuo), the machine does not respond to keyboard commands. For lack of a better word, it's as if it is powered off, but the power management states it has power, and doing a power cycle brings it back (after shutting down and coming back on).

Going to reply to the rest soon.

Why two Intel NICs? I already have one NIC in addition to the onboard ones. I don't mind getting another, but I'm not sure if it's needed...

Mackintire · Feb 19, 2013

I was referring to turning off the onboard NICs (in the BIOS) and using external NICs.

The quantity does not matter, just as long as the motherboard's onboard resources are not in use.

This is just a suggested test....one of many you may have to perform to figure this all out.

Kainzo · Feb 20, 2013

Mackintire said:
I was referring to turning off the onboard NICs (in the BIOS) and using external NICs.

The quantity does not matter, just as long as the motherboard's onboard resources are not in use.

This is just a suggested test....one of many you may have to perform to figure this all out.

Almost certain I have already turned off the board NICs... I'll confirm that next time it crashes.
I know for a fact I made sure all onboard stuff was turned off.. (sound/etc)

Kainzo · Mar 7, 2013

Just an update - 2 weeks right now with no crash.
I down-clocked my RAM from 1600 to 1333 (it came stock at 1600). I believe the motherboard can't support that speed in XMP.

Wanted to let you guys know in case anyone else had ghost freezing.

Mackintire · Mar 7, 2013

FYI... I have never had an issue like that which I wasn't able to verify in Memtest86+.
