ZFS CPU Choice

Tau · Jul 8, 2014

I am looking for a bit of feedback on what CPU to put in my soon to be rebuilt ZFS storage server.

Old spec was Q6600 8GB ram, 21 spindles in 3x 7 drive raidz60.

I recycled the CPU/mobo/RAM out of it into another project, so it's basically a ground-up rebuild.

I plan on moving it up to 2x 10 drive raidz60 across an LSI 1068e and HP SAS expander that I already have, on FreeBSD 10.

Now to the question: since I have repurposed the mobo/CPU/RAM from it, I plan on putting a spare Intel server board that I have kicking around in it with 32GB ECC RAM. I can't decide if I want to put dual quad-cores in it or dual dual-cores. I have both available and on hand; if memory serves, E5405s, 5050s and 5060s.

I doubt I will ever run into a CPU limit in the box, however I plan to play around with dedupe, and LZ4 compression (most likely compression on the entire pool, and dedupe on a couple datasets).

Just looking for some input on what other people think to help try and make up my mind.

danswartz · Jul 8, 2014

No, no, no! Do NOT mess with dedupe, or you will be back here in a few months asking how to fix your system and be told 'copy everything off and reinitialize the pool from scratch!'. That said, 4 cores should be more than enough unless you will have multiple 10gb enets driving it.

patrickdk · Jul 8, 2014

If you touch dedup, you need lots and lots of ram, DO NOT DO THIS.

As long as you don't use gzip, a dual core 2.5ghz is fine, for 5 gigabit connections. If you want to use gzip, then you will want as much cpu power as you can put into it.

Tau · Jul 8, 2014

danswartz said:
No, no, no! Do NOT mess with dedupe, or you will be back here in a few months asking how to fix your system and be told 'copy everything off and reinitialize the pool from scratch!'. That said, 4 cores should be more than enough unless you will have multiple 10gb enets driving it.

If memory serves dedupe can be enabled on a dataset level... or is it pool wide?

I remember reading somewhere that it was 10GB of ram per 1TB deduplicated, or somewhere along those lines.

It will have 4x trunked gigE for data transfer and it may get a fibre card for faster esxi replication.

ZFS-wise I have deployed quite a few fairly large boxes, but I have yet to toy with dedupe/compression, so this is new territory in that regard. This current box is still running OpenSolaris 06.09.

***EDIT*** FreeNAS docs say 5GB RAM per TB of deduplicated data... so not that heavy. The FreeBSD ZFS tuning guide also quotes 5GB per TB, so I think that's a safe guideline to use, as I'm sure they err on the side of performance a bit as well.

danswartz · Jul 8, 2014

It isn't just ram usage. Google for horror stories about this. Dedup has never really been something to use. Very few use cases unless you have many cloned virtual machines with 95% duplicated blocks. You can get much of the benefit by using compression instead. Yes, dedup is per dataset, but it has very subtle ramifications that make it effectively impossible to turn off again. Unless you don't care about rebuilding your pool or know exactly what you are up to, just don't do it.

omniscence · Jul 9, 2014

The problem is that if your DDT grows above a certain size (in relation to RAM), ZFS can no longer keep it in RAM and basically has to load it from disk for each block, slowing down the pool to a crawl. This can even make it impossible to get your data off in a reasonable amount of time. It also eats away the memory available for the ARC.

Based on the record size of your zvols you may not even benefit from dedup. If your zvol record size is too small while using an ashift of 12, you are wasting a lot of space depending on your vdev layout, possibly even more than you would gain from dedup. On the other hand, a large record size is harder to deduplicate. Only VM backing storage will realistically benefit from this; bulk data like media files should be deduplicated at the file level. By the way, you can use zdb to estimate how well your pool can be deduplicated.

I would say it is something that can be tested with a moderately sized all-SSD pool, those are not that large, there may still be some benefit in cost (not the case for spindles), and they would not basically freeze once the DDT exceeds the available memory (which should not happen as it is not that large with an SSD pool).

patrickdk · Jul 9, 2014

The so-called 5GB per 1TB ratio only works if you use 128k record sizes for all your data, which means all your data files are large and you're not using zvols. Basically only for large-file backup storage.

So if you have a normal system with, say, a 30k to 50k average block size, you will be using almost 4x that amount: 20GB per 1TB.

This can be enabled per dataset, but when deleting items, it has to check per pool. This causes a huge issue when deleting snapshots, as it has to check at the pool level, for every block, whether it was deduped.
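The block-size scaling above can be sketched as a quick back-of-the-envelope estimate. This is only a rough sketch: it takes the 5GB-per-TB rule of thumb at a 128k block size quoted in this thread as its baseline and assumes the DDT grows in inverse proportion to the average block size. The function name and its parameters are made up for illustration.

```python
def ddt_ram_gb(data_tb, avg_block_bytes,
               baseline_gb_per_tb=5.0,
               baseline_block_bytes=128 * 1024):
    """Rough DDT RAM estimate in GB (rule of thumb, not a measurement)."""
    # Smaller blocks mean more blocks per TB, hence more DDT entries.
    scale = baseline_block_bytes / avg_block_bytes
    return data_tb * baseline_gb_per_tb * scale


if __name__ == "__main__":
    for block_kib in (128, 64, 32):
        est = ddt_ram_gb(1, block_kib * 1024)
        print(f"{block_kib:>3}k avg block: ~{est:.0f} GB RAM per TB")
```

A 32k average block size lands on the 4x figure mentioned above; real pools will differ, so treat the result as an order-of-magnitude guide only.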

patrickdk · Jul 9, 2014

omniscence, I have not seen that issue in real life, that larger block sizes are harder to dedup.

with a 64k blocksize, I get a 2.3x dedup rate.
with a 4k blocksize, I get a 2.5x dedup rate.
with a 512 blocksize, I get a 2.9x dedup rate.

At least over several hundred Windows Server 2008 R2 VMs.
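To put those ratios in terms of disk space, here is a minimal sketch that converts a dedup ratio into the fraction of space saved. It assumes the ratio is simply logical size divided by physical size; the function name is hypothetical.

```python
def space_saved_fraction(dedup_ratio):
    """Fraction of logical data that no longer needs physical space."""
    if dedup_ratio < 1.0:
        raise ValueError("dedup ratio is expected to be >= 1.0")
    return 1.0 - 1.0 / dedup_ratio


if __name__ == "__main__":
    # Ratios as reported in the post above.
    for label, ratio in (("64k", 2.3), ("4k", 2.5), ("512", 2.9)):
        saved = space_saved_fraction(ratio)
        print(f"{label:>3} blocksize: {ratio}x -> {saved:.0%} saved")
```

Going from 2.3x to 2.9x moves the saving by roughly nine percentage points, which is why the much smaller block size buys relatively little here.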

Tau · Jul 9, 2014

omniscence said:
The problem is that if your DDT grows above a certain size (in relation to RAM), ZFS can no longer keep it in RAM and basically has to load it from disk for each block, slowing down the pool to a crawl. This can even make it impossible to get your data off in a reasonable amount of time. It also eats away the memory available for the ARC.

Based on the record size of your zvols you may not even benefit from dedup. If your zvol record size is too small while using an ashift of 12, you are wasting a lot of space depending on your vdev layout, possibly even more than you would gain from dedup. On the other hand, a large record size is harder to deduplicate. Only VM backing storage will realistically benefit from this; bulk data like media files should be deduplicated at the file level. By the way, you can use zdb to estimate how well your pool can be deduplicated.

I would say it is something that can be tested with a moderately sized all-SSD pool, those are not that large, there may still be some benefit in cost (not the case for spindles), and they would not basically freeze once the DDT exceeds the available memory (which should not happen as it is not that large with an SSD pool).

I was thinking it would be good for the VM backup dataset, as the core of pretty much all the machines is the same. However, if the gain is not that great, compression may give me a pretty similar result without the performance penalty. I did know that once the DDT no longer fit in RAM it would start to page (L2ARC if present, then down to disk).

This would be pretty small scale. I'm sure the dataset would never exceed 1TB (probably closer to the 500-700GB mark before dedupe). However, if it's as much hassle as people are making it out to be, I might not even bother for a few hundred GB saved... when compression will net me a few hundred on its own.

danswartz · Jul 9, 2014

Are you sure the ddt goes into the arc? That doesn't sound right to me...

omniscence · Jul 9, 2014

danswartz said:
Are you sure the ddt goes into the arc? That doesn't sound right to me...

I'm quite sure. The DDT is metadata, which should be limited to 1/4 of the ARC by default and can be influenced by some tunables. The DDT is not "built" in RAM, but has to actually be on disk somewhere, as it has to exist when the pool comes up. It will just be cached in RAM, as it is basically the highest-priority metadata that has to be looked up for every block that is supposed to be written to the pool. It can be propagated down to the L2ARC, but even this is orders of magnitude slower than RAM, and any data in the L2ARC consumes a certain amount of ARC memory.
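As a rough illustration of that 1/4 limit, the check below compares an estimated DDT size against the metadata share of the ARC. It is a simplified sketch: it assumes the DDT is the only metadata competing for that share, and the ARC sizes used in the example are hypothetical, not measured on any system in this thread.

```python
def ddt_fits_in_arc_metadata(ddt_gb, arc_gb, metadata_fraction=0.25):
    """True if the DDT would fit in the ARC's metadata share.

    Simplification: ignores all other metadata sharing the same limit.
    """
    return ddt_gb <= arc_gb * metadata_fraction


if __name__ == "__main__":
    arc_gb = 24.0  # hypothetical ARC size on a 32GB box
    for ddt_gb in (5.0, 20.0):
        ok = ddt_fits_in_arc_metadata(ddt_gb, arc_gb)
        print(f"DDT {ddt_gb:.0f} GB in a {arc_gb:.0f} GB ARC: "
              f"{'fits' if ok else 'spills to L2ARC/disk'}")
```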

patrickdk said:
omniscence, I have not seen that issue in real life, that larger block sizes are harder to dedup.

with a 64k blocksize, I get a 2.3x dedup rate.
with a 4k blocksize, I get a 2.5x dedup rate.
with a 512 blocksize, I get a 2.9x dedup rate.

At least over several hundred Windows Server 2008 R2 VMs.

Interesting numbers. Your data shows that they are harder to deduplicate, although not by as much as I thought. On the other hand, a dedup rate like this when there are hundreds of VMs is not very promising. I would think that any Windows installation has a lot of common static data. Do you adjust the NTFS cluster size to the ZVOL block size?

Tau said:
I was thinking it would be good for the VM backup dataset, as the core of pretty much all the machines is the same. However, if the gain is not that great, compression may give me a pretty similar result without the performance penalty. I did know that once the DDT no longer fit in RAM it would start to page (L2ARC if present, then down to disk).

This would be pretty small scale. I'm sure the dataset would never exceed 1TB (probably closer to the 500-700GB mark before dedupe). However, if it's as much hassle as people are making it out to be, I might not even bother for a few hundred GB saved... when compression will net me a few hundred on its own.

It is a nice feature, but I would not entrust my most important data to it. As this is [H] I would say try it on a non-production system, but you should keep the option to just delete the pool and recreate it without suffering fatal data loss if things go haywire. Basically every time someone here asks about it, they get talked out of it by several people.

danswartz · Jul 9, 2014

Ah, okay. Yeah, that makes sense...

Tau · Jul 9, 2014

omniscence said:
I'm quite sure. The DDT is metadata, which should be limited to 1/4 of the ARC by default and can be influenced by some tunables. The DDT is not "built" in RAM, but has to actually be on disk somewhere, as it has to exist when the pool comes up. It will just be cached in RAM, as it is basically the highest-priority metadata that has to be looked up for every block that is supposed to be written to the pool. It can be propagated down to the L2ARC, but even this is orders of magnitude slower than RAM, and any data in the L2ARC consumes a certain amount of ARC memory.

Interesting numbers. Your data shows that they are harder to deduplicate, although not by as much as I thought. On the other hand, a dedup rate like this when there are hundreds of VMs is not very promising. I would think that any Windows installation has a lot of common static data. Do you adjust the NTFS cluster size to the ZVOL block size?

It is a nice feature, but I would not entrust my most important data to it. As this is [H] I would say try it on a non-production system, but you should keep the option to just delete the pool and recreate it without suffering fatal data loss if things go haywire. Basically every time someone here asks about it, they get talked out of it by several people.

It does seem like it. Everything on the pool is currently replicated to a backup solution, so I could play around with it... However, after thinking about it, for a savings of under 500GB I don't really care to bother with it, or the potential hassle that comes with it.

danswartz said:
Are you sure the ddt goes into the arc? That doesn't sound right to me...

When the DDT becomes too big to fit in RAM, it spills down into the L2ARC (if present).

So now that I have been swayed not to bother with Dedupe, the question comes back to how much CPU to put in it... Duals or Quads....

omniscence · Jul 9, 2014

I would say dual dualcores should be enough. LZ4 is very cheap. I use a single (although Haswell) quadcore that also runs some VMs and has a 10 Gbe controller. I guess that most of the system load comes from the block level encryption I have between the disks (11 disk RAIDZ3) and ZFS. The system almost never goes to 100% load.
It also depends on price and power requirements. I'll just add that the LSI controllers you use cannot be used with drives larger than 2 TB, but you probably know that already.

patrickdk · Jul 9, 2014

While they are the same, and the block size of NTFS is the default of 4k on all of them, that isn't the real issue.

The real issue is that there are only a few gigs of common data in each VM, even if you dedup all of that perfectly. The problem is all the other data in that VM that can't be deduped because it's the unique working set of that VM. Specifically, logs, databases, ...
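That argument can be put into numbers with a small model. This is a sketch under idealized assumptions: the common data dedups perfectly across all VMs and the unique working set does not dedup at all. The per-VM sizes in the example are hypothetical, not taken from the systems discussed here.

```python
def expected_dedup_ratio(vm_count, common_gb, unique_gb):
    """Idealized dedup ratio for VMs sharing one common base image."""
    logical = vm_count * (common_gb + unique_gb)
    # The common data is stored once; unique data is stored per VM.
    physical = common_gb + vm_count * unique_gb
    return logical / physical


if __name__ == "__main__":
    # Hypothetical sizes: 5 GB shared OS data, 5 GB unique data per VM.
    for vms in (10, 100, 1000):
        ratio = expected_dedup_ratio(vms, common_gb=5, unique_gb=5)
        print(f"{vms:>4} VMs: ~{ratio:.2f}x")
```

In this model the ratio can never exceed (common + unique) / unique no matter how many VMs are added, so the unique working set sets the ceiling.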

Tau · Jul 9, 2014

omniscence said:
I would say dual dualcores should be enough. LZ4 is very cheap. I use a single (although Haswell) quadcore that also runs some VMs and has a 10 Gbe controller. I guess that most of the system load comes from the block level encryption I have between the disks (11 disk RAIDZ3) and ZFS. The system almost never goes to 100% load.
It also depends on price and power requirements. I'll just add that the LSI controllers you use cannot be used with drives larger than 2 TB, but you probably know that already.

Yeah, the 2TB limit on these controllers is actually what the drives are being moved up to: the pool currently has 1TB drives and is going to 2TB drives. One of the pitfalls of using older hardware.

There shouldn't be a lot of load on this box, about 7 workstations accessing it for data storage/retrieval, these workstations are also backed up to it, and I will probably add it as a datastore for the ESXI cluster as another backup target.

Looking on Ark, the quads are actually rated 15 watts lower than the duals... I will have to double-check the models, but that might be worth it to save a bit of power.

Dark Shade · Jul 16, 2014

Not to hijack the thread, but I have a similar question: Would I be better off with a hex-core L5639 @ 2.13GHz or quad-core W3540 @ 2.93GHz? Same generation of processors, both Xeons, one with more cores at a lower clock vs. fewer cores at a higher clock. Which would be more beneficial for ZFS? Basic config, no dedup, using compression and 40GB L2ARC with 2x2TB in RAID 1 (for starters)

danswartz · Jul 16, 2014

I would go with a smaller number of faster cores...

patrickdk · Jul 16, 2014

Normally I would also.

But if this is for home, I would definitely go with the L5639: 60w vs 130w.
It will also depend on what system you run this on.

At least on illumos, it balances the load over all cores perfectly, so it's not a real issue.
I have no idea about zfsonlinux though.
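The 60w vs 130w comparison above can be turned into a rough yearly figure. This is a sketch only: it treats the rated wattage difference as if it were the real average draw (actual consumption is usually lower and depends on load), and the electricity price is a hypothetical placeholder.

```python
def annual_kwh(watts_saved, hours_per_day=24.0):
    """Energy saved per year in kWh for a given wattage difference."""
    return watts_saved * hours_per_day * 365 / 1000.0


def annual_cost(watts_saved, price_per_kwh, hours_per_day=24.0):
    """Yearly cost difference at a given (assumed) price per kWh."""
    return annual_kwh(watts_saved, hours_per_day) * price_per_kwh


if __name__ == "__main__":
    watts = 130 - 60   # rated difference quoted above
    price = 0.12       # hypothetical price per kWh
    print(f"{annual_kwh(watts):.1f} kWh/year, "
          f"~{annual_cost(watts, price):.2f} per year at {price}/kWh")
```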

ST3F · Jul 16, 2014

Dark Shade said:
Not to hijack the thread, but I have a similar question: Would I be better off with a hex-core L5639 @ 2.13GHz or quad-core W3540 @ 2.93GHz?

A dual-core is enough for a SAN/NAS ZFS box.
Basically, the more GHz you have, the more efficiency ZFS will get you.
... but one other thing: the W3540 can't handle more than 24GB RAM!
-> http://ark.intel.com/products/39719...-8M-Cache-2_93-GHz-4_80-GTs-Intel-QPI?q=W3540
vs up to 144 GB for the 56xx architecture.

And as you may know : ZFS loves Ram

Cheers.

Dark Shade · Jul 16, 2014

Thanks for the input folks. This setup would start with 12GB ECC in a motherboard that only supports up to 24GB. OS is illumos + napp-it.

Dark Shade · Jul 18, 2014

Well, I am at the end of my rope here. The motherboard I ordered that should have gotten this project started ended up being completely wrong for the case/config/cooler that I wanted to use. I've had it.

I have three 1366 processors (920, W3540, L5639), a 1250w modular PSU with plenty of SATA power connectors, 8x2GB sticks of DDR3 ECC, cases and network cards and all kinds of shit, and I absolutely refuse to justify a $200 used x58 motherboard from fleabay. That's outrageous.

Does it make sense to sell all of the processors and upgrade to something more modern, like socket 1155?

ST3F · Jul 18, 2014

I would say: if you already owned a motherboard, stay with the W3540; it would be OK and cost you nothing extra for your needs.

If you are thinking of selling the entire system around this CPU and buying a new system to cut the electricity bill, the difference in purchase cost plus electricity between yours and a new one might only break even in 8 years...

But as you don't have a motherboard, try to sell all the CPUs, keep the RAM, and buy something like x9scm-iif + e3 1230v2 or x9srl + E5 1620.

Dark Shade · Jul 18, 2014

Yeah, no motherboard, just every other piece (my old x58 died on me, starts up but does not POST)

I was also thinking that a newer system would have SATA3 vs SATA2 on x58

drescherjm · Jul 18, 2014

Dark Shade said:
I was also thinking that a newer system would have SATA3 vs SATA2 on x58

That part should not really make a difference for a ZFS server. It will not help you any with hard disks, only with SSDs. And would you get > 300 MB/s from a cache SSD anyway?

ST3F · Jul 18, 2014

Moreover, the pool's disks are on controllers, not on the motherboard, unless it has the LSI chip onboard...

Dark Shade · Jul 18, 2014

ST3F said:
But as you don't have a motherboard, try to sell all cpu, keep ram, and buy something like x9srl + e3 1230...

Will an X9SRL work with an E3? If so, I may just do that, since I have some RDIMMs here I'd like to use over these UDIMMs...

ST3F · Jul 18, 2014

Supermicro X9SCM-IIF with Xeon E3-1220V2 yes
... but ECC, not ECC Reg.

Supermicro X9SRL with E5 1620 yes
...with variety of Ram.
http://www.supermicro.com/products/motherboard/Xeon/C600/X9SRL.cfm
http://ark.intel.com/products/64621...-E5-1620-10M-Cache-3_60-GHz-0_0-GTs-Intel-QPI

Mastaba · Jul 21, 2014

why not ECC Reg?

drescherjm · Jul 21, 2014

Mastaba said:
why not ECC Reg?

Because REG ECC will not work with any E3 processor or the lga115X platforms. You need unbuffered dimms for these platforms.
