Good ZFS build?

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Long time lurker, first time posting. I'd like your thoughts on my ZFS NAS build (Solaris 11).

Uses:
Mostly SMB
Some NFS
~45TB usable to start. Expand to 100-150TB over the next year by adding to the back end.
Maybe 2 or 3 separate ZPOOLS.

Hardware:

Front End:
The only SuperMicro server that Oracle will sell Solaris support for is SuperServer 6016XT-TF which is a Westmere board. So I plan to pair that with an Intel X5690 (high clock speed required because of single threaded nature of SMB 2).
128GB ECC.
2 IBM M1015 in I.T. mode.
2 x Samsung 840 Pro SSDs for L2ARC.
4x 1Gb Ethernet in LACP/802.3AD

Back End
Supermicro SDS-SC847E26-RJBOD1 (45 tray JBOD)
45 Western Digital Red 2 TB Drives.
ZFS Mirrored/Striped VDEV (RAID 10).

Does this look like a good build?
 

danswartz

2[H]4U
Joined
Feb 25, 2011
Messages
3,703
Is there a reason you want the closed solaris 11? Anyway, what kind of I/O mix do you envision? I like raid10 for random I/O. If it's mostly media, I'd go instead with something like 7 raidz2 vdevs with 6 drives in each - losing 1/3 of the space instead of 1/2. Throw the other 3 drives in as hot spares maybe?
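To put rough numbers on the 1/3 versus 1/2 trade-off, here is a minimal Python sketch. It assumes the 45 x 2 TB drives from the original post and ignores ZFS metadata overhead and TB/TiB differences, so treat the results as ballpark figures only.

```python
# Rough usable-capacity comparison; function names are illustrative only.

def mirror_capacity_tb(n_disks: int, disk_tb: float) -> float:
    """Striped 2-way mirrors: each pair stores one disk's worth of data."""
    pairs = n_disks // 2          # an odd disk is left over as a spare
    return pairs * disk_tb

def raidz2_capacity_tb(n_vdevs: int, disks_per_vdev: int, disk_tb: float) -> float:
    """Each raidz2 vdev gives up two disks' worth of space to parity."""
    return n_vdevs * (disks_per_vdev - 2) * disk_tb

if __name__ == "__main__":
    print("22 mirrors of 2 TB:", mirror_capacity_tb(45, 2.0), "TB")
    print("7 x 6-disk raidz2: ", raidz2_capacity_tb(7, 6, 2.0), "TB")
```

With these assumptions the mirror layout lands near the 45 TB target, while the raidz2 layout trades random I/O performance for roughly a quarter more space.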
 

TeeJayHoward

Limpness Supreme
Joined
Feb 8, 2005
Messages
11,149
Front End:
The only SuperMicro server that Oracle will sell Solaris support for is SuperServer 6016XT-TF which is a Westmere board. So I plan to pair that with an Intel X5690 (high clock speed required because of single threaded nature of SMB 2).
128GB ECC.
2 IBM M1015 in I.T. mode.
2 x Samsung 840 Pro SSDs for L2ARC.
4x 1Gb Ethernet in LACP/802.3AD
I assume that Oracle support is important to you, so we'll stick with the supported system. I would recommend getting the lowest-power, quietest version of the CPU you can. For a storage server, even a ZFS one, CPU power isn't as important as you'd think. For example, at work, we've got a weak E5 powering our array. It never breaks 10% CPU utilization, and this thing gets hammered. An L5630 would provide more than enough "oomph" to remove the CPU as a bottleneck, reduce the power bill, and allow the fans to spin down a bit. 128GB is probably overkill as well. It's true that ZFS loves RAM, but there's a point at which diminishing returns kick in. The 840 Pro is an excellent consumer drive, but it really wasn't built to be an L2ARC device. There are better options out there, albeit a little bit more expensive. I may be wrong, but I don't think the M1015 has SFF-8088 ports on it, so you may want a different card as well. Otherwise, how would you hook it up to the back end you list below?

Note that LACP/802.3ad link aggregation does not allow you to send a file from one server to another at 4Gb/s. It allows 4 clients to access one IP at 1Gb/s each. You're looking at buying an extremely pricey Oracle support contract, so I'll assume you knew that, but it'll help others who might end up finding this post via web searches.
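For anyone wondering why a single transfer stays at one link's speed, here is a small Python sketch of the idea. The hash policy below (CRC32 over the source and destination address) is purely an assumption for illustration; real switches and NICs use their own configurable policies. The point is only that every packet of a given flow maps to the same member link.

```python
import zlib

LINKS = 4          # 4 x 1 Gb/s ports in the aggregate
LINK_GBPS = 1.0    # speed of each member link

def pick_link(src_ip: str, dst_ip: str, n_links: int = LINKS) -> int:
    """Hypothetical hash policy: the same address pair always picks the same link."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % n_links

def max_single_flow_gbps() -> float:
    """One flow never spans links, so it tops out at one link's speed."""
    return LINK_GBPS

if __name__ == "__main__":
    nas = "10.0.0.1"                      # made-up addresses
    for client in ("10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"):
        print(client, "-> link", pick_link(client, nas))
    print("single-flow ceiling:", max_single_flow_gbps(), "Gb/s")
```

Several clients can spread across the four links, but nothing guarantees an even spread, and two clients may well hash onto the same link.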

Back End
Supermicro SDS-SC847E26-RJBOD1 (45 tray JBOD)
45 Western Digital Red 2 TB Drives.
ZFS Mirrored/Striped VDEV (RAID 10).

Does this look like a good build? How should I size the VDEVs? Should I go with one large VDEV for the added striped speed? Any other considerations?
We use the SC846E26 at work, which is very similar to the 847 you listed. It's an excellent chassis, and works flawlessly for its intended purpose.

VDEV sizing is dependent on a number of things. What I like to do is label the following in order of importance:
  • Size
  • Uptime
  • IOPS
  • Sequential Read
  • Sequential Write

Once you've got that (or even better, a minimum number for each of those!), you can compare the speeds of different RAID types and vdev configurations. There are a couple of threads on the [H] where folks have built systems similar to the one you've listed, and many of them include benchmarks. If you haven't already, I'd recommend looking at what they built and the results they got, and forming your own opinion on VDEV sizing.

The other consideration I have for you is backup. Anyone who can afford an Oracle contract normally has a tape system already in place. Do you have enough ports on your switch to add another NAS, though? What about tapes? Physical storage space? Is your offsite location capable of handling the extra tapes?

And it's seldom an issue, but does your datacenter have enough power for the new NAS? If you're approaching the limits on your UPS, adding another couple hundred watts might require an additional upgrade.
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Is there a reason you want the closed solaris 11? Anyway, what kind of I/O mix do you envision? I like raid10 for random I/O. If it's mostly media, I'd go instead with something like 7 raidz2 vdevs with 6 drives in each - losing 1/3 of the space instead of 1/2. Throw the other 3 drives in as hot spares maybe?

Closed because of the support, as TeeJay said. I originally wanted to do RAIDZ2, but the write load is sometimes heavy and RAIDZ2 would be very slow. There is also sometimes a lot of random I/O.

I assume that Oracle support is important to you, so we'll stick with the supported system. I would recommend getting the lowest-power, quietest version of the CPU you can. For a storage server, even a ZFS one, CPU power isn't as important as you'd think. For example, at work, we've got a weak E5 powering our array. It never breaks 10% CPU utilization, and this thing gets hammered. An L5630 would provide more than enough "oomph" to remove the CPU as a bottleneck, reduce the power bill, and allow the fans to spin down a bit.

Really? I assumed that for SMB I needed a lot of single threaded speed since there could be up to 70 machines connecting to this NAS at once. I would love to save some money on the CPU. That CPU costs around $1,500.

128GB is probably overkill as well - It's true that ZFS loves RAM, but there's a point at which diminishing returns kicks in.

I see. Should I start with 64 and see how it does, leaving room to upgrade later?

The 840Pro is an excellent consumer drive, but it really wasn't built to be a L2ARC cache. There are better options out there, albeit a little bit more expensive.

With the money saved on the CPU I can definitely look into a better option for the L2ARC. Any recommendations?

I may be wrong, but I don't think the M1015 has SFF-8088 ports on it, so you may want a different card as well - Otherwise, how would you hook it up to the back end you list below?

Good call. Any LSI HBA's you would recommend?

Note that LACP/802.3ad link aggregation does not allow you to send a file from one server to another at 4Gb/s. It allows 4 clients to access one IP at 1Gb/s each.

Yeah, all the client machines are hooked up through 1Gb/s Ethernet anyway. So the most needed for "one transmission" is 1Gb/s.

You're looking at buying an extremely pricey Oracle support contract, so I'll assume you knew that, but it'll help others who might end up finding this post via web searches.

Yea $1,000 a year. If I become confident in ZFS after a year, maybe I'll move the pool over to something like OpenIndiana.

VDEV sizing is dependent on a number of things. What I like to do is label the following in order of importance:
  • Size
  • Uptime
  • IOPS
  • Sequential Read
  • Sequential Write

Once you've got that, (or even better, a minimum number for each of those!) you can compare the speeds of different raid types and vdev configurations. There's a couple of posts on the [H] where folks have built systems similarly to the one you've listed, and many of them include benchmarks. If you haven't already, I'd recommend looking at what they built, and the results they got, and form your own opinion on VDEV sizing.

It's hard for me to calculate the IOPS requirement. I can only see the IOPS of our current NAS, which is overtaxed. It's usually around 4,000-5,000 at load. The R/W ratio for the video department is 30/70, whereas it's 70/30 for the 3D department. I'll look around for the other people's builds. If you want to link me to any, I'll check them out as well.

The other consideration I have for you is backup. Anyone who can afford an Oracle contract normally has a tape system already in place. Do you have enough ports on your switch to add another NAS, though? What about tapes? Physical storage space? Is your offsite location capable of handling the extra tapes?

I'm going to turn our current Thecus NAS boxes into the backup targets. I'll probably use rsync.

And it's seldom an issue, but does your datacenter have enough power for the new NAS? If you're approaching the limits on your UPS, adding another couple hundred watts might require an additional upgrade.

Good point. I guess we need just enough to do a safe shutdown.
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,041
Some aspects you may want to consider:

With any compression other than LZ4 you get a significant CPU load and, even with a fast CPU, lower performance. Consider larger disks without compression, which would optionally let you use a smaller CPU.

High-capacity disks like the WD Reds are not very good for I/O performance. With a massive RAID-10 config, this can be good enough. I would consider a pool built from multiple RAID-Z2 vdevs (6 or 10 disks each) for "filer use", paired with a second "high speed pool" built only from enterprise SSDs (like the Intel S3700).

From my experience and according to http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/ I would prefer Hitachi over WD disks. I would also use 4TB disks instead of that many 2TB ones.

For that many users I would use 128GB RAM for a large ARC and skip the L2ARC unless arcstat reports a bad cache hit rate. You may start with 64GB.

I would use a 10GbE interface from Intel and avoid LACP.
A cheap solution is the Intel X540-T1/T2 for copper, paired with a 10GbE-capable switch (e.g. Netgear).

As an alternative to Oracle you can consider OmniTI (OmniOS), where you can buy support as well.
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
After looking online it's not clear to me if LZ4 works in Solaris 11. It seems to be a part of the Illumos Kernel. Can I use it with Solaris or would I have to go Omni/FreeBSD?

I agree it's not ideal to use the large drives, but I need to have some density as our server room is getting tight. I settled on 2TB disks because they're big enough but not so big as to decrease our IOPS/GB. Correct me if I'm wrong but wouldn't our IOPS/GB go down if we did 22 4TB drives vs 45 2 TB drives?

Quote from your link:
We wish we had more of the Western Digital Red 3TB drive

I agree I would rather do 10Gb Ethernet but our network infrastructure is already in place and only has 10Gb SFP. That means an expensive card and cable.

I'm going to look into Omni now. I'm very interested. I'll give them a call.
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,041
I/O scales with the number of vdevs, so 20 mirrors of 2TB disks give you better values than 10 mirrors of 4TB disks. But even the 20 mirrors are not as good as a single enterprise SSD, and the number of disk failures scales with the number of disks as well.
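As a back-of-the-envelope illustration of that scaling, here is a Python sketch. The 80 random IOPS per disk is an assumed figure for a 7,200 RPM SATA drive, and the model is a best case: writes hit every disk in a mirror, so they scale with the number of vdevs, while reads can be served by either side of a mirror.

```python
def mirror_pool_iops(n_mirrors: int, disk_iops: int = 80, disks_per_mirror: int = 2):
    """Return (best-case read IOPS, write IOPS) for a pool of striped mirrors."""
    write_iops = n_mirrors * disk_iops                     # one disk's worth per vdev
    read_iops = n_mirrors * disks_per_mirror * disk_iops   # every disk can serve reads
    return read_iops, write_iops

if __name__ == "__main__":
    for n in (20, 10):
        reads, writes = mirror_pool_iops(n)
        print(f"{n} mirrors: ~{reads} read IOPS, ~{writes} write IOPS")
```

Halving the vdev count halves both numbers, whatever the disk size.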
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
I/O scales with the number of vdevs, so 20 mirrors of 2TB disks give you better values than 10 mirrors of 4TB disks. But even the 20 mirrors are not as good as a single enterprise SSD, and the number of disk failures scales with the number of disks as well.

Unfortunately Enterprise SSD is well beyond the budget. Your point on increased disk failure with increased disk count is noted.

Speaking of SSD, should I run the OS on an MLC in its own pool? And should it be a mirrored VDEV?
 

danswartz

2[H]4U
Joined
Feb 25, 2011
Messages
3,703
If it's some solaris OS (including open source versions), the OS is on its own pool by definition, although not all the installers allow mirroring. Even if not, it's easy to add another SSD to the OS pool mirror afterward...
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
I didn't realize those Reds were "intellipower". I'm thinking of spending the extra 30% on WD SE drives.

This is a stupid question but I'd be remiss if I didn't ask: Does data going to the NAS ZPOOL ever traverse the OS HDD? Or is it straight from RAM to NAS ZPOOL?
 

MarkL

Limp Gawd
Joined
Aug 19, 2010
Messages
202
This is a stupid question but I'd be remiss if I didn't ask: Does data going to the NAS ZPOOL ever traverse the OS HDD? Or is it straight from RAM to NAS ZPOOL?

Straight from RAM > target pool. The OS HDDs are not used at all for data going to other pools.
 

TeeJayHoward

Limpness Supreme
Joined
Feb 8, 2005
Messages
11,149
Really? I assumed that for SMB I needed a lot of single threaded speed since there could be up to 70 machines connecting to this NAS at once. I would love to save some money on the CPU. That CPU costs around $1,500.
What I stated is true... To a point. The more correct statement would be, "It depends". I can envision a situation in which our E5 is inadequate. For 70 users, I personally believe the L5630 would be adequate. It would have higher utilization than I would like, but it would work. Is there any chance you could put together a test environment with your proposed workload via some sort of traffic generator and prove me wrong? I'd hate to recommend a cheaper CPU only to have you upgrade it later.

I see. Should I start with 64 and see how it does, leaving room to upgrade later?
That's how I would go about it. I vaguely recall someone stating that ZFS had issues over 128GB of RAM. I know from personal experience that 32GB is wasted on a mere 4 users on my home system. At work, we've got some 48GB of RAM, and most of that's un-utilized as well. I'd buy a 64GB kit, but get larger DIMMs as though I were planning on installing 128GB.

With the money saved on the CPU I can definitely look into a better option for the L2ARC. Any recommendations?
Actually, for a read cache (L2ARC), the 840 Pro is great. I'm sorry, I misread that originally. For a write cache (ZIL), something like a ZeusRAM, I believe, is the currently recommended "SOOOO FAAST!" device.

Good call. Any LSI HBA's you would recommend?
For Solaris 11.1, you need something that's compatible above anything else. I don't think the drivers for even LSI's 2-generation-old 9200-series are included by default, but I do know that you can download the drivers from LSI and install them. We're running a 9271-8i, oddly enough, (I guess we had it laying around?) and we have a SFF-8087-to-SFF-8088 PCI bracket thingie. If you could find a bracket like that, you could use the M1015 and save a few bucks. Otherwise, any LSI SAS 2200-based card is a good bet. Just find a nice cheap one with a SFF-8088 and call it good. I've got a post on the [H] somewhere detailing how to install the drivers if their installer doesn't want to work right.

If I become confident in ZFS after a year, maybe I'll move the pool over to something like OpenIndiana.
BEWARE! No zpool versions above 28, and no ZFS version above 5 will be portable to another OS. So when you create your pool, use "zpool create -o version=28 -O version=5". This bit me a year or so back. Don't let it bite you!
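If you script the pool creation, a small helper keeps the version pinning from being forgotten. This is only a sketch: the pool name and device names below are hypothetical, and the function just builds the argument list for the command quoted above without running anything.

```python
from typing import List, Tuple

def zpool_create_cmd(pool: str, mirrors: List[Tuple[str, str]]) -> List[str]:
    """Build the argv for a striped-mirror pool pinned to pool v28 / ZFS v5."""
    cmd = ["zpool", "create", "-o", "version=28", "-O", "version=5", pool]
    for disk_a, disk_b in mirrors:
        cmd += ["mirror", disk_a, disk_b]   # one 2-way mirror vdev per pair
    return cmd

if __name__ == "__main__":
    # Hypothetical names; substitute the devices your system reports.
    pairs = [("c1t0d0", "c2t0d0"), ("c1t1d0", "c2t1d0")]
    print(" ".join(zpool_create_cmd("tank", pairs)))
```

Printing the command first and pasting it by hand is a reasonable habit for something as destructive as pool creation.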


It's hard for me to calculate IOPS requirement. I can only see the IOPS of our current NAS which is overtaxed. It's usually around 4,000-5,000 at load. The R/W ratio for the video department 30/70, whereas it's 70/30 for the 3D department. I'll look around for the other people's builds. If you want to link me to any I'll check them out as well.
Okay, with 5K+ IOPS required, you're going to be looking at a number of vdevs. Oracle themselves have a formula you can use to guesstimate what kind of setup you need:
Code:
iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))
It's a little older, but it might help a bit.
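Here is that formula as a small Python sketch. The maximum rotational latency is one full revolution, so it follows from the spindle speed; the 8.5 ms average seek time in the example is an assumption, so check the datasheet for the drive you actually buy.

```python
def max_rotational_latency_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm

def disk_iops(avg_seek_ms: float, rpm: float) -> float:
    """iops = 1000 / (average seek + half of the maximum rotational latency)."""
    return 1000.0 / (avg_seek_ms + max_rotational_latency_ms(rpm) / 2.0)

if __name__ == "__main__":
    # Assumed 8.5 ms average seek on a 7,200 RPM disk.
    print(round(disk_iops(8.5, 7200)), "random IOPS per disk (estimate)")
```

With those inputs a single 7,200 RPM disk comes out somewhere just under 80 random IOPS.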

As for links, I was specifically thinking of packetboy's journey through the SuperMicro/LSI world.
ZFS Monster - Phase I
ZFS Monster - Phase II - Sorry, can't find a link?
ZFS Monster - Phase III
ZFS Monster - Phase IV

I apologize if any of this doesn't make sense. I think I'm getting sick - my thinking's gone all fuzzy!
 

danswartz

2[H]4U
Joined
Feb 25, 2011
Messages
3,703
"BEWARE! No zpool versions above 28, and no ZFS version above 5 will be portable to another OS. So when you create your pool, use "zpool create -o version=28 -O version=5". This bit me a year or so back. Don't let it bite you!"

This!
 

TeeJayHoward

Limpness Supreme
Joined
Feb 8, 2005
Messages
11,149
"BEWARE! No zpool versions above 28, and no ZFS version above 5 will be portable to another OS. So when you create your pool, use "zpool create -o version=28 -O version=5". This bit me a year or so back. Don't let it bite you!"

This!
If you REALLY want to be sure, you can create your zpool using something like OmniOS, and then import it into Solaris 11.1.

Also, I just did the numbers on the 5K IOPS... Holy moly, you're looking at 64 drives in RAID1+0, or 64 VDEVs for raidz/2/3. (hundreds of disks) That's a chunk o' disk! Either that, or you can't trust the blog I linked!
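For anyone who wants to redo those numbers, a rough Python sketch follows. It assumes about 79 random IOPS per 7,200 RPM disk, counts every disk in a RAID1+0 layout toward the target, and treats each raidz vdev as delivering about one disk's worth of random IOPS. The 6-disk vdev width is only an example.

```python
import math

def devices_needed(target_iops: float, per_device_iops: float) -> int:
    """Smallest number of IOPS-delivering devices that meets the target."""
    return math.ceil(target_iops / per_device_iops)

def raidz_disks_needed(target_iops: float, per_disk_iops: float, disks_per_vdev: int) -> int:
    """Each raidz vdev behaves roughly like one disk for random I/O."""
    return devices_needed(target_iops, per_disk_iops) * disks_per_vdev

if __name__ == "__main__":
    print("RAID1+0 disks:", devices_needed(5000, 79))
    print("raidz2 disks (6 per vdev):", raidz_disks_needed(5000, 79, 6))
```

This ignores ARC hits entirely, so the real disk-facing load may be a good deal lower.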
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Straight from RAM > target pool. The OS HDDs are not used at all for data going to other pools.

Thank you.

Thanks! This might end up being the Front-End for multiple JBODs so I'm heavily considering sticking with the faster CPU.

As for my 5,000 IOPS number, I'm not sure if it's accurate. What I did was run Dell DPACK, which connected to our Thecus NAS over SSH and ran for 4 hours while measuring IOPS. I'm not sure how accurate that was because the Thecus is only 12 7200RPM 3 TB disks in RAID 10. So could it really be ~5,000 IOPS?

I'll be sure to use that config to make sure the pool is portable.

Looking into those links now!

EDIT: Just realized, will I be limited by the SAS connection? Assuming I choose 44 of the WD SE drives that are rated at 165MB/s but let's use 100MB/s for argument's sake. The JBOD has 2 expanders. So if 22 drives are connected per SAS cable, won't I be limited to 6Gb/s (750MB/s) reads, instead of the theoretical 22 x 100MB/s I could be getting?
 

patrickdk

Gawd
Joined
Jan 3, 2012
Messages
744
You should only be concerned with one of two numbers.

Either your work load is streaming (backups, movies, ...), and you care about having enough raw bandwidth.

Or you care about having enough IOPS. 22 SATA disks will only give you about 22*80 random IOPS. That means each I/O could be 1.2MB in size, and you still would not max out a single SAS cable.
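A quick Python sketch of that arithmetic. The cable figure is an assumption here: a 4-lane SAS2 cable at roughly 600 MB/s per lane, or 2,400 MB/s in total.

```python
def random_throughput_mb_s(n_disks: int, iops_per_disk: float, io_size_mb: float) -> float:
    """Aggregate throughput if every disk runs flat out at the given I/O size."""
    return n_disks * iops_per_disk * io_size_mb

def max_io_size_mb(n_disks: int, iops_per_disk: float, cable_mb_s: float = 2400.0) -> float:
    """Largest average I/O size before the cable becomes the bottleneck."""
    return cable_mb_s / (n_disks * iops_per_disk)

if __name__ == "__main__":
    print(random_throughput_mb_s(22, 80, 1.2), "MB/s at 1.2 MB per I/O")
    print(round(max_io_size_mb(22, 80), 2), "MB per I/O saturates one cable")
```

Under these assumptions, 22 disks doing 1.2 MB random I/Os stay below what one cable can carry.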
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
You should only be concerned with one of two numbers.

Either your work load is streaming (backups, movies, ...), and you care about having enough raw bandwidth.

Or you care about having enough iops. 22 sata disks will only give you 22*80 random iops. So that means each iop could be 1.2MB in size, and you would not max out a single sas cable.

What exactly constitutes a streaming/sequential workload?
For instance, let's say I copy 100 JPGs in one operation from a host machine to the NAS. From that point on, if I try to copy those 100 JPGs from NAS to host, is that sequential?

Also I'm a little confused by your math. If it's 22*80*1.2, isn't that 2112 MB/s?
 

patrickdk

Gawd
Joined
Jan 3, 2012
Messages
744
Each SAS cable running SAS2 handles 2400MB/sec when using SATA3 disks.

I also noted you opted for the E26 case, which has dual expanders. You will not connect any cables to the second expander, only the first one, unless you start using SAS drives.

You will want to connect the 4 SAS ports, 2 to the front expander and 2 to the back expander, and ignore the 2nd front and back expanders. This will give you 4800MB/sec to the front, and 4800MB/sec to the back. Or you could use the second port to chain to another chassis.
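Here is a rough Python sketch of the cabling math, assuming 4 lanes per SFF-8088 cable at about 600 MB/s per lane. The per-disk sequential rates (a conservative 100 MB/s and the 165 MB/s rating mentioned earlier in the thread) and the 22 disks per expander are taken from the question, not measured.

```python
LANES_PER_CABLE = 4
LANE_MB_S = 600      # assumed usable rate of one SAS2 lane

def cable_bandwidth_mb_s(n_cables: int) -> int:
    """Aggregate bandwidth of n wide-port cables into one expander."""
    return n_cables * LANES_PER_CABLE * LANE_MB_S

def disks_sequential_mb_s(n_disks: int, per_disk_mb_s: int) -> int:
    """Best-case streaming rate of all disks behind one expander."""
    return n_disks * per_disk_mb_s

if __name__ == "__main__":
    for rate in (100, 165):
        need = disks_sequential_mb_s(22, rate)
        for cables in (1, 2):
            have = cable_bandwidth_mb_s(cables)
            verdict = "ok" if have >= need else "cable-limited"
            print(f"{rate} MB/s disks, {cables} cable(s): need {need}, have {have} -> {verdict}")
```

Only the best-case streaming scenario on a single cable comes out cable-limited, which is one argument for running two cables per expander.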
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Thanks Patrick.
So I guess I should switch to the E16 to save some money.

To remove single points of failure, does this setup sound ok to you?

2 HBA's each with 2 SFF8088 ports. Each HBA connected to both the front and the back expander (1 cable to each end). And each vdev mirror has one drive connected to the front expander, and one drive connected to the back expander.

Forgive my ignorance but I still don't understand how each SAS cable can handle 2400MB/s. I'm assuming it has to do with multiple lanes? Did you mean 6Gb * 4 Lanes = 24Gb/s?

[Attached image: bYDTXj7.png]
 

zrav

Limp Gawd
Joined
Sep 22, 2011
Messages
181
It's hard for me to calculate IOPS requirement. I can only see the IOPS of our current NAS which is overtaxed. It's usually around 4,000-5,000 at load. The R/W ratio for the video department 30/70, whereas it's 70/30 for the 3D department.
Note that due to ZFS caching the actual IOPS hitting the disks might be lower than that. For reads it all depends on the size of the working set vs. ARC + L2ARC size. If all the clients do is playback terabytes of video from start to finish (large working set), you simply won't have enough RAM to cache it all. But if the clients loop over short footage segments, as is often the case with video work, reads might end up being cached nicely. The same is true for multiple 3D render nodes accessing the same scene asset files. Try guesstimating the working set, maybe 64-128GB works nicely after all.
For writes you of course still need the proper vdev setup to handle the load. Also if you can live with losing up to a couple of seconds of data in case of a crash/powerloss, you can disable sync writes, which helps immensely.
Reads likely won't benefit much from L2ARC as it's limited to 8MB/s writes, and I assume your working set shifts quicker than that. If you want to try it tho, I'd recommend the intel S3700, excellent performance and wear properties and relatively affordable.
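To get a feel for how slowly L2ARC warms up at that rate, a small Python sketch follows. It takes the 8 MB/s feed rate at face value (the actual limit is a tunable and may differ on your platform) and assumes the proposed 2 x 512 GB cache devices.

```python
def l2arc_fill_hours(cache_gb: float, feed_mb_s: float = 8.0) -> float:
    """Hours needed to fill the cache once at a constant feed rate."""
    seconds = cache_gb * 1000.0 / feed_mb_s   # decimal GB -> MB
    return seconds / 3600.0

if __name__ == "__main__":
    print(round(l2arc_fill_hours(1024), 1), "hours to fill 2 x 512 GB")
```

If the working set changes faster than that, most of what sits in L2ARC will already be stale by the time it is written.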
 

patrickdk

Gawd
Joined
Jan 3, 2012
Messages
744
That setup should work ok. Last I saw, getting the E16 instead of the E26 only saved you like $200, so it might be worth it just to have the E26, and upgrade to sas disks sometime, but if not :)

Doing a setup like that, you would likely want to enable failover multipathing. I am not sure that SATA supports multiple initiators. At least when I was reading about SATA with interposers, it was normal for the disk to dump its cache when switching sources, so I would assume from that that SATA doesn't support it, but I haven't checked personally.

Yes, each sff-8088 cable is 4 sas lanes, so 600MB/sec * 4 lanes.
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Well now that I think about it, would multipathing be useless? Since I'm doing mirrors, and each expander will have one of the drives from each mirror in it, can't I just connect each expander to one HBA? Even if one HBA dies, the other HBA still has one drive from each mirror. Will the ZPOOL survive this?

Sorry I don't know how to word that any better.

Also, I can't find any Solaris documentation on what some people call "Wide Porting" (SAS link aggregation). Is there another term for this?
 

patrickdk

Gawd
Joined
Jan 3, 2012
Messages
744
The multipathing is really up to you and how you want to handle failures. If it doesn't matter if it's down for an hour, and it's onsite, you may not need it. It can be a little annoying to deal with multipathing.

There's nothing much to know about wide SAS. The term "wide" is normally used when more than one SAS lane is used: 4-wide or 8-wide. It just happens automatically, with nothing to configure or set up. But it does require that the ports from and to the expander be on the same device. So HBA to expander, or expander to expander.
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Ok so just to confirm: 2 cables going from one HBA to one expander will provide 8x SAS lanes bandwidth automatically?

Also the SuperMicro specifications and documentation aren't really clear. Does the 847-26 in fact have 2 8088 ports per expander?
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
Anybody know of any documentation on ARC to L2ARC ratio? I heard L2ARC carves out metadata and other space in ARC. Would 2 striped 512gb 840 Pro's as L2ARC affect the ARC too much?
 

spazoid

Limp Gawd
Joined
Jun 16, 2008
Messages
289
First of all, you wouldn't stripe your cache devices.
Second, the ARC usage from L2ARC depends on the data ZFS is caching. 320 bytes per entry IIRC.
No matter what, 1 TB of L2ARC is huge, and will require a lot of RAM. Unless you know your workload very well, try running without any L2ARC and check your hit ratio with arcstat.pl first.
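To estimate how much RAM that is, here is a minimal Python sketch. It uses the 320 bytes per entry quoted above (which may differ between ZFS versions) and assumes one entry per cached record, so the answer depends heavily on the average record size.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

def l2arc_header_ram_gib(l2arc_bytes: int, record_bytes: int, header_bytes: int = 320) -> float:
    """RAM consumed in ARC by headers that track records held in L2ARC."""
    entries = l2arc_bytes // record_bytes
    return entries * header_bytes / GIB

if __name__ == "__main__":
    for record_kib in (128, 8):
        ram = l2arc_header_ram_gib(1 * TIB, record_kib * KIB)
        print(f"1 TiB L2ARC, {record_kib} KiB records: {ram} GiB of ARC for headers")
```

Large records keep the overhead modest, while small records can eat most of a 64 GB machine's ARC.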
 

danswartz

2[H]4U
Joined
Feb 25, 2011
Messages
3,703
Truth. My workload of 1/2 dozen VMs hit 90+% in ARC. I wasted a couple hundred on two 128GB L2ARC devices to see a hit rate of less than 50% in them, because the 10% that miss in ARC are really random (apparently), so the L2ARC does little good.
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
I know you wouldn't stripe ZIL but why wouldn't/couldn't you stripe L2ARC? The reason I want a large L2ARC is because a 3D render farm of about 70 computers all load the same assets at the same time when a render starts.
 

danswartz

2[H]4U
Joined
Feb 25, 2011
Messages
3,703
L2ARC doesn't work that way (striped) and can't be made to. If you have multiple L2ARC devices, they are used round-robin (which is basically striping). The point (not to be pedantic) is that you can't tell ZFS "stripe these guys!" Striping ZIL (SLOG to be more accurate) would be pointless anyway, since they are driven by latency not throughput (not much really gets written to an SLOG device...)
 

CopyRunStart

Limp Gawd
Joined
Apr 3, 2014
Messages
155
L2ARC doesn't work that way (striped) and can't be made to. If you have multiple L2ARC devices, they are used round-robin (which is basically striping). The point (not to be pedantic) is that you can't tell ZFS "stripe these guys!" Striping ZIL (SLOG to be more accurate) would be pointless anyway, since they are driven by latency not throughput (not much really gets written to an SLOG device...)

Gotcha. Well I guess I'm pulling the trigger on hardware then.
 

spazoid

Limp Gawd
Joined
Jun 16, 2008
Messages
289
The reason I want a large L2ARC is because a 3D render farm of about 70 computers all load the same assets at the same time when a render starts.

This sounds like a perfect reason to NOT use L2ARC. You'd want as many of those assets in ARC when they're being loaded, and your 70 computers probably aren't requesting the data at the exact same time, so even if the data is on disk when the first computer requests it, the next 69 should get it from ARC.

As danswartz also mentioned, spending money on L2ARC can very possibly end up being a waste. I did the same thing, even though it was only a 120GB device. I saw 90+% hits in ARC, and the 10% that missed only had a 10% hit ratio in L2ARC, so only about 1% of my reads were too random for ARC but not random enough to get purged from L2ARC.
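The arithmetic behind that 1% can be sketched in a few lines of Python. The two hit ratios are the ones from this post; plug in whatever arcstat reports on your own system.

```python
def read_breakdown(arc_hit: float, l2arc_hit_of_misses: float):
    """Split all reads into the share served by ARC, by L2ARC, and by the disks."""
    arc_miss = 1.0 - arc_hit
    from_l2arc = arc_miss * l2arc_hit_of_misses   # L2ARC only sees ARC misses
    from_disk = arc_miss - from_l2arc
    return arc_hit, from_l2arc, from_disk

if __name__ == "__main__":
    arc, l2, disk = read_breakdown(0.90, 0.10)
    print(f"ARC {arc:.0%}, L2ARC {l2:.0%}, disk {disk:.0%}")
```

The L2ARC hit ratio looks less impressive once it is expressed as a share of all reads rather than of ARC misses.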
 

brutalizer

[H]ard|Gawd
Joined
Oct 23, 2010
Messages
1,599
You can use IBM M1015 with external disk cases if you buy a cheap sff8087-to-sff8088 bracket. Google such brackets.

Regarding the SC847E26-rjbod1, it has two SAS expanders. Each expander handles 20ish disks. If you connect expander1 to expander2 (chain them) you only need one sff8088 cable from head to backend, and it will give you 2400MB/sec.

If you separate the expanders from each other, and connect two sff8088 cables from M1015 to expander1 and expander2, you will get 2 x 2400 = 4800MB/sec.

This is what I have been told by patrickdk. Please correct me if I am wrong.
 