Virtualization lab without ECC

luckylinux · Nov 21, 2012

I got my hands on some new virtualization hardware (I5-2400 and I5-3470) which doesn't support ECC RAM.
I'll be using ESXi or Xenserver (still thinking about that) with storage on a central server (E5-26XX with 64GB ECC RAM - Registered). I think I'm fairly safe from a memory corruption point of view (none of the VM are mission critical or managing critical data) and I can always take backups of the full VM (or even the filesystem using ZFS snapshots) in order to avoid that, as well as planning some memtests overnight to make sure RAM is not faulty.
Or do you think there are still some potential problems concerning data integrity?

I know ECC RAM is always better (most of the other hosts I have are ASUS AM3+ / ECC RAM / X2-X3-X4 or FX-4xxx|FX-6xxx|FX-8xxx which wouldn't be concerned by this problem), still I got a nice deal on these Intel processors I couldn't refuse.

Any opinions?

Child of Wonder · Nov 21, 2012

I've been running non-ECC memory in my ESXi lab for over 2 years. No issues.

I would prefer ECC memory but if one of my hosts out of 4 has an issue I have little beef restoring or rebuilding a couple VMs.

Grentz · Nov 21, 2012

ECC is a bit better, but not like a night and day difference. You will be fine.

It more comes into play in mission critical farms, where you have lots of hosts, TBs of memory, and tons of VMs running on individual hosts.

TCM · Nov 21, 2012

Child of Wonder said:
I've been running non-ECC memory in my ESXi lab for over 2 years. No issues.

How would you know you have no issues?

Grentz · Nov 21, 2012

TCM said:
How would you know you have no issues?

Stability, crashes, etc.

TCM · Nov 21, 2012

Grentz said:
Stability, crashes, etc.

Which are only the worst symptoms of memory errors. Do you checksum your data regularly?

luckylinux · Nov 21, 2012

TCM said:
Which are only the worst symptoms of memory errors.

So basically anything non-ECC is simply BAD, BAD and again BAD?
So under no circumstances should anyone ever use non-ECC RAM?

Too bad I got the deal on a I5/I7. Would've I gotten it on a Xeon I'd have grapped it without thinking twice about it.

TCM said:
Do you checksum your data regularly?

Back to the main subject of the topic: the data itself (personal files, photos, videos, ...) will be stored on a central server (running FreeBSD 9.1 with ZFS v28) having ECC RAM (64GB for now).

On that same server (with ECC RAM) the virtual machines will be shared via NFS (or SMB if possible) to the VM hosts booted from USB (they'll run Esxi or Xenserver, whichever best fullfills my needs). By doing backups on the server side (with ECC) of the VM images served to the NON-ECC VM hosts, I think that data corruption should be somewhat "less important" since I could always revert to a previous state.

I know that by using non-ECC RAM I could get my RAID(z) array corrupted, data corruption, reboots, crashes, ... That's the reason why I bought an ECC central server.
My doubt concerns mostly the deployement of non-ECC VM hosts with storage on ECC Storage server. Wouldn't this configuration (thanks to backups/snapshots) be better than "bare-metal non-ECC PC"?

Thank you all for your answers.

Child of Wonder · Nov 21, 2012

TCM said:
Which are only the worst symptoms of memory errors. Do you checksum your data regularly?

Do I checksum my home lab? Did you just seriously ask me that?

Non-ECC memory is fine for a lab at home with non important data on it. If I have problems I'll test the memory and *gasp* spend 50 whole dollars to get another 16GB.

TCM · Nov 22, 2012

I could reply by asking you if you seriously distinguish important from non-important data.

If it's not important, why store it at all? You might classify data as unimportant because you can reproduce it easily. That doesn't mean this is a reasonable thing to do.

An OS I setup for testing might not be important data because I can easily reinstall it. I still protect it with ECC because for the time it runs, I want it to be reliable.

If you just setup OSes for the fun of it and never do anything with them, go ahead and use normal memory. Better overclock it for best effect.

meatincereal · Nov 22, 2012

so in your logic, why even sell non-ECC memory?

pelo · Nov 22, 2012

non-ECC memory has gotten pretty good over the years at reducing errors, but if you're crunching numbers 24/7 then the benefits of ECC memory can pay off.

Work published between 2007 and 2009 showed widely varying error rates with over 7 orders of magnitude difference, ranging from 10−10–10−17 error/bit·h, roughly one bit error, per hour, per gigabyte of memory to one bit error, per millennium, per gigabyte of memory.[2][4][5] A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance’09 conference.[4] The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 3–10 × 10−9 error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.

I've heard of 1-bit, per hour, per gig of memory as a rule. While that may not seem like much, that's only per gig of memory. If you multiply that times 64 then you've got 64 bit errors per hour. Multiply that by 24 hours a day, 7 days a week and you've got a nightmare on your hands

For non-critical workloads and most consumer workloads (and even most enterprise/business workloads), it's completely unnecessary. But for a lot of HPC stuff and mission-critical servers it's still very much a must-have feature.

Expensive? Yes. But worth it? That ranges from "no" to "absolutely"

TCM · Nov 22, 2012

meatincereal said:
so in your logic, why even sell non-ECC memory?

No, in your logic.

In my logic: "Why even buy non-ECC memory?"

Child of Wonder · Nov 22, 2012

TCM said:
I could reply by asking you if you seriously distinguish important from non-important data.

If it's not important, why store it at all? You might classify data as unimportant because you can reproduce it easily. That doesn't mean this is a reasonable thing to do.

An OS I setup for testing might not be important data because I can easily reinstall it. I still protect it with ECC because for the time it runs, I want it to be reliable.

If you just setup OSes for the fun of it and never do anything with them, go ahead and use normal memory. Better overclock it for best effect.

In my home lab, filled with easily replacable VMs, I don't bother with distinguishing between important and non-important data. It's all non-important. It's for learning purposes. If I was concerned about a VM, I would regularly back it up.

For a home lab, non-ECC memory is fine.

tonyb · Nov 22, 2012

How exactly is non-ECC memory putting your data at risk?

meatincereal · Nov 23, 2012

TCM said:
No, in your logic.

Putting words in my mouth.

tonyb said:
How exactly is non-ECC memory putting your data at risk?

Whenever data passes through your memory and is written to disk, there's a risk this data is corrupted. The risk is higher with non-ECC memory because non-ECC memory can not detect and correct bitflips.

See http://en.wikipedia.org/wiki/Soft_error

Relevant quote:

wikipedia article said:
system-level soft error:
These errors occur when the data being processed is hit with a noise phenomenon, typically when the data is on a data bus. The computer tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. The bad data bit can even be saved in memory and cause problems at a later time.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed, in which case the recovery procedure must include a reboot.

Unless you're very paranoid about your data, in general, you will be fine with non-ECC in non-mission critical environments.

However, if you're buying new, you might aswell go with ECC, seeing as the price on memory has gone down so much.

/dev/null · Nov 23, 2012

TCM said:
I could reply by asking you if you seriously distinguish important from non-important data.

If it's not important, why store it at all? You might classify data as unimportant because you can reproduce it easily. That doesn't mean this is a reasonable thing to do.

An OS I setup for testing might not be important data because I can easily reinstall it. I still protect it with ECC because for the time it runs, I want it to be reliable.

If you just setup OSes for the fun of it and never do anything with them, go ahead and use normal memory. Better overclock it for best effect.

Not sure why you use computers since not every memory error can be detected, even with ECC. You should keep everything on paper.

tonyb · Nov 23, 2012

Since most laptops/desktops use non-ECC memory, you'd think the risks we're talking about here would result in lots of data corruption, all day 'errday.

In reality, we rarely see anything of the sort. Most of the time, a memory problem (which are rare, and tend to be a really, really bad DIMM) causes a system or application to crash (and yeah, you might lose data that way). But it's not like JPGs have a red pixel flipped green, word documents have spelling mistakes put in, and message board posts have spalling errors insarted, or other such data corruption.

meatincereal · Nov 26, 2012

tonyb said:
Since most laptops/desktops use non-ECC memory, you'd think the risks we're talking about here would result in lots of data corruption, all day 'errday.

In reality, we rarely see anything of the sort. Most of the time, a memory problem (which are rare, and tend to be a really, really bad DIMM) causes a system or application to crash (and yeah, you might lose data that way). But it's not like JPGs have a red pixel flipped green, word documents have spelling mistakes put in, and message board posts have spalling errors insarted, or other such data corruption.

I've actually had bad memory cause 2TB of data corruption before it was detected. Luckily I had backups. The corruption would in fact show itself in spelling errors in documents, and random artifacts in movie files.

Danith · Nov 27, 2012

now you guys make me want to get ECC memory for my gaming computer

Grentz · Nov 27, 2012

ECC is not as important as some in this thread make it out to be. Yes in applications where you are processing data 24/7 or applications that are very memory dependent, but not at all in most regular user scenarios. By the logic here, we should all be having issues with our data and systems, where do you think a lot of the data originates? It comes from systems with non-ECC ram.

By the way, if you want to go down this path we should all just be running mainframes, because in comparison...any normal server or computer is extremely prone to corruption and processing issues.

bAMtan2 · Dec 2, 2012

meatincereal said:
so in your logic, why even sell non-ECC memory?

Non-ecc memory is common because Intel restricts their common platforms to using non-ecc memory because it is another controlled feature which allows them to charge higher prices for the higher end hardware

I could put that simpler but that is it. Cars are the same way

koolaidkitten · Dec 4, 2012

Been building servers on non ECC ram for years and for the most part they are solid. However the absolute most solid server I ever built was a Dual AthlonMP and it had ECC ram. My current ESX box in my sig might have a small purple screen of death twice a year maybee? Other than that I don't see any difference.

ClariorHincHonos · Dec 4, 2012

TCM said:
No, in your logic.

In my logic: "Why even buy non-ECC memory?"

You're funny. You should step off the SRS train and relax.

OP, for home lab use, NON-ECC is just fine. For mission critical data, depending on the situation, maybe go ECC.

TCM · Dec 4, 2012

ClariorHincHonos said:
You're funny. You should step off the SRS train and relax.

For mission critical data, depending on the situation, maybe go ECC.

Ad hominem and junk advice. Got anything else? Thought so.

ClariorHincHonos · Dec 4, 2012

TCM said:
Ad hominem and junk advice. Got anything else? Thought so.

That is your opinion, and that is fine. But remember it's only an opinion and not a standard/law/etc. Same goes for my opinion.

I wonder what you do that makes you so keen on ECC. What field do you work in?

Virtualization lab without ECC

Limp Gawd

2[H]4U

Fully [H]

Gawd

Fully [H]

Gawd

Limp Gawd

2[H]4U

Gawd

Weaksauce

2[H]4U

Gawd

2[H]4U

Limp Gawd

Weaksauce

[H]F Junkie

Limp Gawd

Weaksauce

2[H]4U

Fully [H]

[H]ard|Gawd

Gawd

[H]ard|Gawd

Gawd

[H]ard|Gawd