Designing a ZFS-Based Company File Server

JohannesA

Hello,
I'm working for a visual effects company and I'm designing our new file server. We are planning to use ZFS on Linux. I would kindly ask for your advice or suggestions on our setup.

Task:
-----
Replacing our Fedora-based 24-bay SATA system with something that has more punch. Currently we use a 4 x 10GbE bond into an HP switch and hardware RAID + XFS.

Requirements:
-------------
- 50 TB usable space
- saturating 1 GbE for file transfers to at least 10-15 workstations simultaneously via SMB (~115 MB/s each) - image sequences with individual files normally around 5-20 MB
- keep it simple: no cluster file system, one machine and one storage namespace, no active failover - we can live with some downtime; backups are rsynced to a backup system


Problems in the past:
---------------------
- Fedora
- The machine's I/O would max out at 4 streams, although we never quite nailed down whose fault it was. I suspect the hardware RAID and an old Samba version that wouldn't scale properly.
- When 16 render nodes were constantly doing small r/w operations, we could easily push the machine to its I/O limit.

New Setup
---------
Hardware:
Dual Xeon XXX
Super Micro 24 bay enclosure
> 256 GB RAM
4 x 10GbE
MegaRAID SAS 9361-8i
24 x 4TB SATA 7200 rpm 24/7-rated disks
1 x System disk

RAID Setup:
6 vdevs of 4 disks each in RAID 1+0 = 48 TB

ZIL:
2x OCZ RevoDrive 350 PCIe SSD 240 GB mirrored

L2ARC:
2x OCZ RevoDrive 350 PCIe SSD 960 GB striped


Software:
Ubuntu 14.04
Zfsonlinux
Sernet Samba 4

Questions:
----------
I would like to have your opinion on the following questions:

1. Is this RAID setup "OK"?
2. Do I need hot spares if this is not critical? The server room is two doors away, so we can swap failing disks - we would of course need some spare drives lying around.
3. Is the RevoDrive a good idea as ZIL/L2ARC?
4. Do I really need to throw so much hardware at caching if I'm planning to have plenty of RAM?
5. Are there good, proven alternatives for the hardware setup, especially ZIL/L2ARC?
6. Do you think I can fulfill my requirements with this setup?
7. Which Xeons are suitable? Other than ZFS compression and file services, nothing else will be running on this machine.

I would love to hear your opinion.
Thank you
Joe
 
I don't think that the RevoDrive is a good SLOG drive. It does not look like it has power-loss protection (PLP) at all.
What type and amount of sync writes do you expect? If it is only SMB you won't need a dedicated ZIL device, as Samba normally only does async writes.

When you say 6 vdevs with 4 disks, do you mean 6 4-way mirrors? There is no such thing as a RAID 1+0 vdev.
I would say 4-way mirrors are a bit of a waste; use either 4 x 6-disk RAIDZ2 vdevs or 12 x 2-disk mirrors, depending on IOPS requirements.
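To make the trade-off concrete, here is a minimal sketch (Python; it assumes 24 x 4 TB disks and ignores ZFS metadata/slop overhead and the TB/TiB distinction) comparing the layouts mentioned above:

# Rough capacity comparison for a 24 x 4 TB pool.
# Assumption: raw marketing TB, no ZFS slop/metadata overhead.

DISKS = 24
DISK_TB = 4

layouts = {
    # name: (disks per vdev, data disks per vdev)
    "12 x 2-disk mirrors": (2, 1),
    "6 x 4-way mirrors":   (4, 1),
    "4 x 6-disk RAIDZ2":   (6, 4),
}

for name, (per_vdev, data_disks) in layouts.items():
    vdevs = DISKS // per_vdev
    usable_tb = vdevs * data_disks * DISK_TB
    efficiency = usable_tb / (DISKS * DISK_TB)
    print(f"{name:20s} {vdevs:2d} vdevs, ~{usable_tb:2d} TB usable, "
          f"{efficiency:.0%} space efficiency")

Random-write IOPS scale roughly with the number of vdevs, which is why the 12-mirror layout is the IOPS-friendly option and RAIDZ2 the capacity-friendly one; 4-way mirrors give you neither.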
Do you mean 1 GbE to each client or in total? Saturating 1 GbE to 10 clients is very hard to achieve without using SSDs.
If you really need that, you should consider going all-SSD.

The optimal size of the L2ARC depends on the footprint of your workload. If the workstations just read data across the entire disk range but do not reuse it, the L2ARC makes no difference.
 
Last edited:
Hi Omniscence,
Basically we are using just Samba. Good point about PLP - I forgot about that.
Joe
 
Use BSD, call iXsystems, get it cheap and already set up for you. Talk to engineers instead of salesmen and call it a day; it will be faster and more stable, and you know the hardware will work.
 
I meant building a RAID 1+0 from four disks and adding these to one storage pool. If I understood ZFS correctly, these "units" are called vdevs and can be combined to increase IOPS. My space penalty would be 50%.

I meant 1 GbE for each client. We do massive in-house video (image sequence) distribution.
 
Use BSD, call iXsystems, get it cheap and already set up for you. Talk to engineers instead of salesmen and call it a day; it will be faster and more stable, and you know the hardware will work.

Sounds like a salesperson.

I recommend sticking with Linux, and if the OP is capable of doing it, there's no need to call the company the guy I quoted works for.

Personally I wouldn't recommend OCZ anything.

What file format(s) will this be for? ZFS compression might be useless if these files are already compressed. Video and images don't typically compress well unless the file is poorly compressed or not compressed at all (i.e. raw).
 
Sounds like a salesperson.

I recommend sticking with Linux, and if the OP is capable of doing it, there's no need to call the company the guy I quoted works for.

Personally I wouldn't recommend OCZ anything.

I work for a company in Georgia. I've just bought a system from them and am a huge fan because of their competency and price. I was considering a NetApp and went with them instead.
 
How much are you looking to spend? Why not go for an actual SAN system from a company like NetApp/EMC?

I'm guessing this is being done to save money? I too would go for an OpenBSD platform, and I would also look at using RAID 6 (2 parity disks, no major read performance hit).

I would definitely recommend having hot spares; if two disks fail at the same time or in short succession it would be FUBAR and require all data to be restored.
 
@dandragonrage We use EXR, DPX, and JPEGs - mostly already compressed.

@Romale23 BSD is an interesting option, as is OmniOS, but we are favoring Linux because we simply have more experience with it. The most important part is WHO will do the maintenance and who will do disaster recovery.
We're not big enough to have support contracts - so basically we build everything in house. This has disadvantages of course, but also advantages. This is how we have done it for the last ten years, with great success.

We had the EMC people here - their systems are beautifully designed, but they're plainly too expensive for us. The first system is cheap, and if you need to upgrade, then you're screwed - we don't like dependencies on other companies.

@Flakes - basically we will pay what we need to pay to have a smooth system. First we wanted to see what our hardware requirements would be, but I suspect a system below 10,000 € is realistic - half of the money will be just the spindles.

Since Samba uses async file system calls, a dedicated ZIL device is maybe something that could be added later if our demands change.

Joe
 
@dandragonrage We use EXR, DPX, and JPEGs - mostly already compressed.

@Romale23 BSD is an interesting option, as is OmniOS, but we are favoring Linux because we simply have more experience with it. The most important part is WHO will do the maintenance and who will do disaster recovery.
We're not big enough to have support contracts - so basically we build everything in house. This has disadvantages of course, but also advantages. This is how we have done it for the last ten years, with great success.

We had the EMC people here - their systems are beautifully designed, but they're plainly too expensive for us. The first system is cheap, and if you need to upgrade, then you're screwed - we don't like dependencies on other companies.

@Flakes - basically we will pay what we need to pay to have a smooth system. First we wanted to see what our hardware requirements would be, but I suspect a system below 10,000 € is realistic - half of the money will be just the spindles.

Since Samba uses async file system calls, a dedicated ZIL device is maybe something that could be added later if our demands change.

Joe

I do know how that goes; good luck to you. The only piece of advice I have ends up costing more, and that would be going with Intel SSDs since they die gracefully, or Samsung 850s since they last so long. Not sure 850 PCIe drives are out yet. Outside of that I really don't know enough about RAID controllers or ZFS on Linux to say anything.
 
@Romale23 BSD is an interesting option, as is OmniOS, but we are favoring Linux because we simply have more experience with it. The most important part is WHO will do the maintenance and who will do disaster recovery.
We're not big enough to have support contracts - so basically we build everything in house. This has disadvantages of course, but also advantages. This is how we have done it for the last ten years, with great success.

I can't recommend OmniOS or anything OpenSolaris-based. Performance and hardware compatibility are crap compared to Linux, and I've had all sorts of issues with various OpenSolaris-based distros. I've had multiple of them - most recently OmniOS - develop major performance issues where my disk benchmarks are good, my network benchmarks are good, I tweak CIFS settings (also NFS, which I tried), yet transferring files sucks. I've had entire OS installs corrupted by updates (most recently NexentaStor) and all sorts of other issues. I use ZFSonLinux now and it is far better.

Stick with Linux. It is the easiest to use, most compatible (both hardware and software), highest performance (compared to any other *nix, at least), stable enough, secure enough (especially if you use something like grsecurity or PaX), and perhaps most importantly, you already have experience with it.

I would take FreeBSD over OpenSolaris-based distros, though. I'd also take Solaris over OpenSolaris-based stuff. I didn't have the problems with Solaris that I did with OpenSolaris.
 
Get a commercial solution. It is insane to build a company storage array on a hacked-together Linux ZoL setup with no support. For home it is fine, but I would never recommend building it yourself. I've been using ZoL for quite some time without problems at home, where I can tolerate losing the array and have good backups. You do have a backup solution, right?? Do you want your job on the line if this thing falls apart?

To give a good example, ZoL has a bad habit of imploding when you use up all the disk space or an edge case corrupts metadata, rendering the pool inaccessible.
 
For almost a year I have been using zfsonlinux for half of my ~70 TB at work, although everything important gets backed up to a tape archive at least twice on different media.
 
@TGK Thank you for your input. We're quite experienced here with different technologies. We have around 60 workstations/render nodes, half on Linux, half on Windows, plus 5 servers from Supermicro/Dell. We traditionally hand-pick and run our machines on premises, with great success. It's hacked together, but we believe in this strategy. The hardware is server grade (Supermicro - btw, EMC is just rebranded Supermicro as well). We won't outsource the key part of our company just because it's easier. As I mentioned before, EMC is too expensive and brings constraints that we don't want to face. The Nexenta guys were here as well - we don't like the licensing model.

But this is all more or less off topic. I would love to hear more about your setups.
 
We've now been using a ZoL VM server in production for 10 months, almost without any hiccups. ZFS rocks.
 
I would take FreeBSD over OpenSolaris-based distros, though. I'd also take Solaris over OpenSolaris-based stuff. I didn't have the problems with Solaris that I did with OpenSolaris.

OpenSolaris has been dead since Oracle bought Sun years ago. Oracle Solaris 11 is based on OpenSolaris, just like Illumos, with NexentaStor, OmniOS, and SmartOS as major distributions.

I see more development at Illumos than at Oracle, and I would not move from the ease of Solaris ZFS to anything else, even if Solaris & Co restrict or prefer hardware like Intel chipsets and NICs + LSI HBAs - I would prefer the same hardware with Linux, too.
 
OpenSolaris has been dead since Oracle bought Sun years ago. Oracle Solaris 11 is based on OpenSolaris, just like Illumos, with NexentaStor, OmniOS, and SmartOS as major distributions.

I see more development at Illumos than at Oracle, and I would not move from the ease of Solaris ZFS to anything else, even if Solaris & Co restrict or prefer hardware like Intel chipsets and NICs + LSI HBAs - I would prefer the same hardware with Linux, too.

Nothing you just said was anything I didn't know, nor did my post suggest that I didn't know it. All of that stuff is based on OpenSolaris even though that project itself is dead. I've run Solaris, OpenSolaris, OpenIndiana, Illumian (not Illumos), NexentaStor, and I briefly tried SmartOS. Some worked better than others, but the problem there was that some update at some point broke my performance from easily saturating gigabit to ~30 MB/s via both SMB and NFS, even though my disk I/O and network throughput benchmarked fine and I tweaked the CIFS parameters. I dunno. I just don't like it. As mentioned before, I also had lots of trouble with updates for other reasons. It was also harder to get the right LSI driver installed, and hardware support was much worse in general.


Get a commercial solution. It is insane to build a company storage array on a hacked-together Linux ZoL setup with no support. For home it is fine, but I would never recommend building it yourself. I've been using ZoL for quite some time without problems at home, where I can tolerate losing the array and have good backups. You do have a backup solution, right?? Do you want your job on the line if this thing falls apart?

To give a good example, ZoL has a bad habit of imploding when you use up all the disk space or an edge case corrupts metadata, rendering the pool inaccessible.


Bah, commercial solutions are pretty much all worse than a good ZFS-based system (unless maybe they use ZFS). Just because you pay someone else to do something doesn't make the job they do better than the job you can do. If they already have a custom server and are even considering it again, then they are fine with the idea. Your advice seems out of place to me. My job currently has some off-the-shelf solution and I would EASILY replace it with a homegrown ZFS solution. I am 100% positive that I could do a better job. The solution we use is not the best out there, though. EMC Symmetrix would probably be a ton better, but I haven't used that (oddly enough, though, I did work for EMC for a few years).
 
Get a commercial solution. It is insane to build a company storage array on a hacked-together Linux ZoL setup with no support. For home it is fine, but I would never recommend building it yourself. I've been using ZoL for quite some time without problems at home, where I can tolerate losing the array and have good backups. You do have a backup solution, right?? Do you want your job on the line if this thing falls apart?

To give a good example, ZoL has a bad habit of imploding when you use up all the disk space or an edge case corrupts metadata, rendering the pool inaccessible.

What version of ZoL? Is it 0.6.3.1?
 
What version of ZoL? Is it 0.6.3.1?

I check in on the ZFS bug tracker periodically to see when 0.6.4 is due out and browse the open bugs. Don't get me wrong, I really like ZFS and respect the work the developers put into it. I just see too many threads still where someone has a corrupted array they can't import and is asked by the devs to recompile the source to debug the issue, and the resolution often takes weeks or more. That's fine for a hobby, but it would challenge most IT professionals' intestinal fortitude with users unable to work and management yelling.

That being said, if the risk is clearly understood by management and is tolerable based on available funds, have at it - sounds like a fun project.

One advantage of a commercial solution is that they integration-test their solutions to save customers from finding bugs, and they often have developers who are paid to provide bug fixes and help get the customer out of a jam. Otherwise you're on a best-effort basis, building it yourself and seeking help via forums or bug submissions.

OP appears to understand this so I'll quit going off topic from his original question.
 
@ JohannesA
I would be very careful in choosing hardware, and as others have mentioned I would also recommend that you consider BSD (FreeBSD) rather than Linux.

As for hardware, you most likely don't need to go above a regular socket 1150 Xeon unless you want to use compression, but beware of the performance penalties. Samba certainly won't help you get great performance...

You also want an HBA, not a RAID card (yes, there's a big difference). You'd want the LSI 9300-8i instead, or even two depending on the load on the HDD arrays.

Since you're most likely not going to read back the same data that much, you don't need an overly large amount of RAM.

I wouldn't go for OCZ SSDs. Netflix uses Crucial M5s (M500 drives, I presume) and they handle a lot of data, so those should be fine. You might want to consider the M550 as it's about the same price but a bit faster. Keep in mind that not all SSDs play nicely with LSI cards, so you might want to hook these up to the Intel controller.

As for HDDs, go for Toshiba 4TB nearline HDDs (preferably) or HGST (second choice).
//Danne
 
I check in on the ZFS bug tracker periodically to see when 0.6.4 is due out and browse the open bugs. Don't get me wrong, I really like ZFS and respect the work the developers put into it. I just see too many threads still where someone has a corrupted array they can't import and is asked by the devs to recompile the source to debug the issue, and the resolution often takes weeks or more. That's fine for a hobby, but it would challenge most IT professionals' intestinal fortitude with users unable to work and management yelling.

That being said, if the risk is clearly understood by management and is tolerable based on available funds, have at it - sounds like a fun project.

One advantage of a commercial solution is that they integration-test their solutions to save customers from finding bugs, and they often have developers who are paid to provide bug fixes and help get the customer out of a jam. Otherwise you're on a best-effort basis, building it yourself and seeking help via forums or bug submissions.

OP appears to understand this so I'll quit going off topic from his original question.

I think this is OK when funds are limited :D - just know ZoL's limitations and be aware, since the OP is familiar with Linux.

I installed OpenIndiana and had a big issue that I couldn't resolve, so I moved to ZoL since I have Linux knowledge.
I have been on ZoL since 2008, starting with CentOS 6.x and moving to CentOS 7.x, and nothing big has happened so far.
I'm just waiting for btrfs RAID5/6 to become stable in Linux, since I'm already using btrfs for / and /home.

A commercial solution's big plus is tech support.

One bug that I faced on CentOS 7.0: #2575 "CentOS 7: pool not mounted/imported after reboot" - I needed to recreate the hostid manually.
 
I check in on the ZFS bug tracker periodically to see when 0.6.4 is due out and browse the open bugs. Don't get me wrong, I really like ZFS and respect the work the developers put into it. I just see too many threads still where someone has a corrupted array they can't import
AFAIK the cases of pool corruption that showed up on the tracker were all related to a bug in the SA-based xattrs, which are optional, normally irrelevant for Samba file serving, and will be fixed in ZoL 0.6.4.

@OP:
Your performance target is possibly a bit unrealistic: reading at 115 MB/s * 15 = 1725 MB/s, that's ~71 MB/s per disk, assuming mirror vdevs. Consider that at the innermost cylinders most hard disks already fall below this value even in sequential transfers. Now add the random seeks from concurrency and fragmentation (which will build up, especially with COW filesystems) and you'll have much lower throughput.
Yes, ARC and L2ARC can help a lot, but that depends on your workload and working-set size. Are all the workstations accessing the same data? Are image sequences read a single time or multiple times? Caches can only serve data that has been requested at least once before and hasn't been evicted...
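A minimal sketch of that back-of-envelope math (Python; the 15 clients and 115 MB/s figures come from the requirements above, the disk throughput numbers are rough assumptions for 7200 rpm SATA drives):

# Per-disk read throughput needed to feed all clients from spindles alone.
# Assumptions: 15 concurrent clients at ~115 MB/s (1 GbE payload each),
# 24 spindles all serving reads (mirror vdevs), purely sequential access.

clients = 15
per_client_mb_s = 115
spindles = 24

aggregate_mb_s = clients * per_client_mb_s     # 1725 MB/s
per_disk_mb_s = aggregate_mb_s / spindles      # ~72 MB/s per spindle

print(f"aggregate {aggregate_mb_s} MB/s -> {per_disk_mb_s:.1f} MB/s per disk")

# A 7200 rpm SATA disk manages very roughly 150-180 MB/s on outer tracks
# but can drop to ~75-90 MB/s on inner tracks, and far less once
# concurrent clients turn the access pattern into seeks.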

Nevertheless, it's a very interesting project. As a ZFS/ZoL user both privately and in a company context, I'm curious what the results will be if you go ahead with it.
 
@diizzy Thank you for your input. Do you have a link about Netflix's use of Crucial SSDs? That would be a good selling point to my boss. Excellent point about the HBA instead of the MegaRAID. Do you have experience with when to buy more than one controller? I understand there is a way to arrange cabling from more than one controller for maximum throughput and to get rid of the single point of failure (cross-cabling). Do you know a rule of thumb for when more than one controller makes sense - something like one per 12 disks - to maximize throughput?

I know Samba is not great performance-wise, but it has gotten a lot better. If necessary, we are considering contracting a SerNet (Samba maintainer) engineer to come over - I talked to them at the CeBIT computer fair - they have prefetch patches for Samba that prefetch sequential file reads on folder access - exactly what we need.
 
@zrav The performance target is a theoretical benchmark. In normal use the application on the workstation caches the image files locally - the files are read once in the morning when the staff comes in. But we do have some corner cases we want to address. Especially simulation render jobs with micro reads and writes from 16-32 render nodes are killing the performance. We're considering a different approach here - basically not using Samba - so we can use the ZIL for caching. But our render farm is half on Windows, so we would need to change our infrastructure, which will not happen immediately - it's on the roadmap.

We did test a setup of 4 ZoL machines with a storage subnet, the cluster file system FhGFS (now BeeGFS), and CTDB with Samba. We also considered storage virtualization with Ceph. Another interesting approach would be to use BitTorrent Sync as a "global cache". But at the end of the day we don't have the manpower to maintain a complex system - so it all boils down to a simple (boring) file server.
I know it's all about trade-offs, especially when there are budget and manpower constraints.
 
The render node case should be much improved, as they will access the same resource files, so those get cached nicely (still, do make an estimate of your average working-set size per client), and SMB writes are async, so they should be handled well with many vdevs (and you won't need/benefit from a SLOG device).
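To put a number on that working-set estimate, here is a minimal sketch (Python; the frame size comes from the 5-20 MB figure earlier in the thread, while the sequence and shot counts are purely illustrative assumptions to be replaced with real project data):

# Rough working-set vs. cache-size estimate for the render/playback case.
# Assumptions: average frame size from the 5-20 MB range quoted earlier;
# frames per sequence and number of active sequences are made up.

frame_mb = 15              # average EXR/DPX frame (assumption)
frames_per_sequence = 2000 # frames per active shot (assumption)
active_sequences = 20      # shots touched per day, all clients (assumption)

working_set_gb = frame_mb * frames_per_sequence * active_sequences / 1024

arc_gb = 0.8 * 256         # usable ARC out of 256 GB RAM (rough guess)
l2arc_gb = 2 * 960         # two striped 960 GB L2ARC devices

print(f"working set ~{working_set_gb:.0f} GB, "
      f"ARC ~{arc_gb:.0f} GB, L2ARC {l2arc_gb} GB")

# If the working set fits in ARC + L2ARC and is actually re-read,
# only the first (cold) read hits the spindles; otherwise the cache
# devices buy you little and the disks see most of the traffic.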
 
The ZIL is not a cache, it is an intent log. Async writes are already as good as it gets for write throughput. Everything that involves the ZIL will always be slower than async writes.

With 10-15 clients accessing the pool simultaneously you can no longer talk about sequential accesses. The disks will seek constantly as long as ZFS can't read the data from a cache. You really should test what workload a single client creates.
 
Don't actually use a RAID controller (as in building RAID arrays; passthrough mode is fine) if you're going to use ZFS.
 
@ JohannesA
https://openconnect.itp.netflix.com/hardware/index.html
Updated I/O Optimized Appliance and Updated Storage-Dense Appliance

I haven't looked at it closely, but I wouldn't go above 2 drives per channel - just a guess on my end, but somewhere above that I would expect to see latency/performance degradation.

//Danne

That's a pretty nice setup; it has to be fast as hell. You could use an X10SRH-CF motherboard and a Xeon E5-1620 v3 instead to squeeze in 18 1TB SSDs with SATA/SAS3 interfaces. You could potentially add a few more LSI HBAs for more density, but you'd have to evaluate PCIe, CPU, and memory constraints at some point.

Build multiple boxes to hit the storage capacity target, format each drive with btrfs, and then glue it all together for redundancy/IO with GlusterFS. You could scale pretty much indefinitely that way.

The advantage of using btrfs is that it supports TRIM and checksums. Configure GlusterFS to maintain two or three copies of all data, and keep a few SSDs as spares for when one craps out.

Note how this solution skips RAID entirely in favor of multiple copies of each file on multiple servers, giving potentially much higher throughput if I/Os can be split between more than one server during reads. You could even <gasp> purchase RHEL support later if you wanted :)
 
Do note that they use FreeBSD and UFS2, not Linux at all ;-)
I would, however, consider ZFS instead, and as I said earlier I doubt you'll need anything beefier than a socket 1150 CPU.
//Danne
 
@diizzy Thank you for looking that up for me.
@Stanley Pain Yes, that was mentioned before; we will switch to the LSI 9300 HBA series.
@omniscence The "ZIL is not a write cache" info is one of the most important points. I have reread all the ZFS documentation, and I'm amazed how many blogs and Serverfault answers get this wrong, as we did. A good catch - this probably saves us a hell of a lot of money that we can put into more RAM instead.
@TGK As I wrote, we tested a cluster setup. From a technical point of view it's great fun, but the administration overhead is too big for us to cope with. It's just too much work. I believe that if our storage needs change in 2-5 years we'll just buy a new machine with all SSDs - although I would love to be more innovative here - but we are just not a "net"-based company; our scaling is moderate, and if it isn't, that means our budget will be much higher and we could just talk to EMC and buy some blue boxes. With btrfs we simply don't have experience - I prefer ZFS, based not on facts but more on my gut.

Another thing that came up was SSHDs - unfortunately they're too small right now.

http://www.storagereview.com/seagate_enterprise_turbo_sshd_review
http://www.seagate.com/de/de/intern...rd-drives/hdd/enterprise-performance-10K-hdd/

Just to answer the question that came up several times: we don't have proper metrics on what workload the clients actually put on the server. We're still searching for a (client-side) tool to get proper long-term statistics. Right now we use iotop/munin/bonnie++ on the server, but we haven't had the time to really interpret the data properly. This has to be done today.
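Not the client-side tool being asked for, but as a stopgap for long-term per-disk statistics on the server (psutil also runs on the Windows/Linux clients), here is a minimal sketch - the interval, output file, and the psutil dependency (pip install psutil) are assumptions:

#!/usr/bin/env python3
# Minimal long-term per-disk I/O logger. Appends one CSV row per disk
# per sampling interval with the byte/op deltas since the last sample.

import csv
import datetime
import time

import psutil  # assumed installed: pip install psutil

INTERVAL = 60  # seconds between samples

with open("disk_io_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    prev = psutil.disk_io_counters(perdisk=True)
    while True:
        time.sleep(INTERVAL)
        cur = psutil.disk_io_counters(perdisk=True)
        now = datetime.datetime.now().isoformat(timespec="seconds")
        for disk, c in cur.items():
            p = prev.get(disk)
            if p is None:
                continue  # disk appeared after the previous sample
            writer.writerow([now, disk,
                             c.read_bytes - p.read_bytes,
                             c.write_bytes - p.write_bytes,
                             c.read_count - p.read_count,
                             c.write_count - p.write_count])
        f.flush()
        prev = cur

Run it from a systemd service or a screen session and graph the CSV later; it won't replace munin, but it gives per-disk numbers over weeks that are easy to correlate with render jobs.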
 
I like this thread :). I run a 32TB ZFS file server at home via ZFS on Linux. It's been rock solid for a year+ now. My only complaint about my build is that I don't have ECC RAM. Once you start getting into large multi-TB array sizes, ECC RAM becomes a must.
 
Build multiple boxes to hit the storage capacity target, format each drive with btrfs, and then glue it all together for redundancy/IO with GlusterFS. You could scale pretty much indefinitely that way.

The advantage of using btrfs is that it supports TRIM and checksums. Configure GlusterFS to maintain two or three copies of all data, and keep a few SSDs as spares for when one craps out.

BTRFS is garbage and far less stable than ZFS, including the relatively young ZFSonLinux. With 3.17.2 I had my entire root partition corrupted in a difficult-to-fix way with BTRFS after KVM virtualization hardlocked the system one time - even though the VM was on a different partition and nothing should have been writing to the root partition at that time. I reinstalled with XFS, and when KVM froze the system another 10 or so times, there were no problems at all. I did not have problems with ZFS, which the VM was stored on, but I have not tried ZFS as root on Linux. For a filesystem not built around data integrity to beat BTRFS in this way is absolutely ridiculous and downright unacceptable. I've also had some random corruption in BTRFS at work - not from system locks, either. Fortunately the problems I've had at work were much easier to repair. I've used EXT3 and 4 extensively (and some EXT2 in the past, though less than 3) and also did not have the amount of issues I've had with BTRFS. And perhaps the biggest issue is that BTRFS does not have anywhere near the equivalent of zpool scrub. Certain types of corruption in BTRFS are next to impossible to fix because the tools to do so are very primitive.

BTRFS seems like a good idea but it is NOT there yet. It is DEFINITELY not even CLOSE to ZFS at this time.
 
The thing with zpool scrub is that it can't really repair a damaged filesystem.
It can only repair damaged blocks with redundant data, but if the filesystem itself really gets corrupted, i.e. invalid metadata that has correct checksums, you are absolutely screwed since there is no repair tool.
And ZFS itself knows about its weaknesses: if it can't import a pool it often bravely suggests that you destroy the pool and restore from backups.
The quality of ZFS is very good, however; you rarely read about an unimportable pool.
 
I've never had any sort of issue like that with ZFS, and scrubbing has always accomplished everything I have needed it to. I can understand that ZFS is not perfect either, but it's a LOT better than BTRFS, at least as of 3.17.2 (which is still very recent). I would probably say that BTRFS is the worst common filesystem in the Linux kernel at the moment, and I would trust my data to EXT*, XFS, JFS, and ReiserFS more than to BTRFS. BTRFS is made of glass. All too easy to shatter. No thanks.
 
I agree, BTRFS is still way too early in its development to use for anything short of actual, real testing.
 
I agree, BTRFS is still way too early in its development to use for anything short of actual, real testing.

And I have ~35 TB of company data (medical imaging research) on btrfs (for several years) as well as a second ~35 TB of company data on zfsonlinux (for 1 year). There is one important difference: I use raidz2/3, but I do not use btrfs RAID5/6. Instead, btrfs sits on top of Linux software RAID5/6. I do not trust the btrfs RAID5/6 implementation yet.

With all this said, I have all important data backed up at least twice to a tape archive.
 