ZFS Thoughts: Best Inexpensive ZIL?

Gigas-VII

Introduction
I'm looking for some feedback on my upcoming ZFS build, but I'll get to that later. I have a very good understanding of the mechanics of ZFS, and the general requirements for each subsystem therein. My concern here is specifically the ZFS Intent Log, or ZIL.

Update One
Update Two
Update Three
Gallery

Background on the ZIL
The ZIL stores intended changes before they are committed to disk in the zpool. Writes to the ZIL act somewhat like a cache: random writes logged in the ZIL can later be committed to the zpool as contiguous writes. By default, the ZIL is interleaved with the rest of the storage space in the zpool, but you can assign a dedicated device to host it. The device used to host the ZIL is known as the SLOG. Ultimately, the IOPS of the SLOG can bottleneck the zpool: changes must first be made to the ZIL, then copied to the zpool, then committed. If the SLOG can't keep up with the zpool, aggregate performance will decrease. If performance decreases enough, ZFS will (if I understand correctly) abandon the dedicated SLOG.
NOTE: Should the ZIL be lost, ZFS may have difficulty recovering the zpool. For this reason, SLOGs should be mirrored if possible. EDIT: Apparently, this behavior has been fixed in v28, so mirroring may not be as critical.
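For illustration, my understanding is that attaching a dedicated (and optionally mirrored) SLOG looks roughly like this, assuming a pool named "tank" and two spare devices (names are placeholders):

zpool add tank log mirror da1 da2    # attach a mirrored log vdev
zpool status tank                    # the devices show up under a separate "logs" section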

So, which devices make for the best SLOG? By conventional thinking, three classes of devices, in order of maximum IOPS:

1) NVRAM: battery-backed DRAM devices. The IOPS here are off the charts, especially if the device uses the PCIe bus directly, rather than SATA.

2) SLC SSD: Enterprise-grade NAND flash devices. The IOPS are much lower than NVRAM, but still far better than traditional HDDs. One concern with these is the onboard DRAM cache: any ZIL writes pending in that cache will be lost immediately in the event of unexpected loss of power. Ideally, the cache should be backed by a supercapacitor or a dedicated battery. Third-generation Intel SLC SSDs should meet this requirement.

3) 10k+ RPM SAS: Enterprise-grade HDDs. These drives have the highest IOPS of conventional HDD designs because they prioritize IOPS and seek time over raw throughput (throughput can be aggregated across drives, but IOPS and seeking happen per drive). Faster than consumer-grade HDDs, but still about two orders of magnitude slower than SLC SSDs.

Comparisons
Deciding between these three technologies generally boils down to budgetary constraints: NVRAM devices are exceedingly expensive ($30+ per GB), and somewhat rare. SLC SSDs cost $10-20 per GB. SAS drives are usually around $3 per GB.

NVRAM devices are also the most volatile: any power outage that lasts more than a few hours will jeopardize the contents of the ZIL, the loss of which can corrupt the entire zpool. However, they are not worn by write cycles. All SSDs have a limit to how many writes they can sustain, and while they have different strategies to minimize writes (compression, caching, over-provisioning, TRIM), it is writes themselves that put these devices into the recycling pile, and since the ZIL is specifically for synchronous writes, SSDs can be rapidly consumed by the very role we assign them.

While SAS drives are prone to read errors, lost DRAM cache, and insufficient IOPS, they are able to sustain loss-of-power indefinitely, and a near-infinite number of write operations. They just aren't quite fast enough to make an ideal SLOG.

My Thoughts
Interleaving the ZIL with the zpool data does reduce zpool performance: the HDD heads must seek from data to ZIL, back to data, then back to ZIL, just to perform a simple change. The easiest solution is to move the ZIL to a dedicated HDD in the zpool, which eliminates that seek bottleneck but limits ZIL throughput to a single drive. Using a special device for the ZIL role makes a lot of sense, but enterprise-grade solutions like NVRAM are not cost-effective outside of Fortune 500 companies. SSDs will need to be replaced regularly (6-18 month lifespan), at a cost of at least $350 per drive, or a minimum of $700 to maintain a mirror.

Proposal
Two 2.5" 10-15kRPM SAS drives in a ZFS mirror will have greater IOPS than a normal 3.5" 5-7kRPM commodity storage drive. If WD Velociraptors were used, a pair of the 150GB model would be sufficient for capacity, and could bring the cost down below $200 total.

The Questions
Would the 2x SAS configuration proposed above provide sufficient IOPS for ZFS to make good use of that vdev as a SLOG? Would it offer a significant performance gain over the interleaved ZIL?

If I have any misunderstanding of ZFS mechanics, please correct me. I want to understand this technology inside and out.
 
You should not mirror SLOG devices; doing so only adds partial protection. Current generation SSDs show corruption on sudden power loss; that is your biggest problem.

To use SLOG safely:
1) use ZFS pool version 19 or above, so removal of the SLOG does not mean loss of the entire pool (a quick sketch of this follows below)
2) use SSDs with supercapacitor protection to protect your SLOG from corruption
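As a rough sketch of what point 1 buys you (pool and device names are placeholders): once the pool is at version 19 or later, a dedicated log device can simply be removed again:

zpool get version tank         # confirm the pool is at version 19 or newer
zpool remove tank gpt/slog0    # detach the dedicated log device
zpool status tank              # the ZIL falls back to the pool disks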

SLC SSDs were used because they did not implement the HPA mapping techniques of today's modern SSDs, which can lead to corruption on sudden power loss. This now also applies to modern SLC drives, so there is no reason to prefer SLC over MLC.

So what you need is an Intel G3, Marvell C400, or a SandForce SF-2000-family drive with a supercapacitor. Those SSDs should be available within the next two months. Do not buy an SSD without a supercapacitor for this task unless its design is resilient against sudden power loss, such as the Fusion-io products, which do not need a supercap to maintain data integrity.

As an alternative to SSDs, you can use DRAM-backed devices, often with a CompactFlash port for backup and a supercap or battery to hold up power. That can work too, but it seems more fragile and has limited sequential write speed (limited to one SATA cable; ~200MB/s), which is kind of disappointing. I've not seen a PCI Express RAM-disk product where you can use DDR3 DRAM and get high write throughput; that would be the holy grail of SLOG performance.

By the way, not all writes go through the SLOG device; only sync writes (often involving metadata) do. Some NFS workloads are particularly heavy on this, but generally the performance difference would be modest. Using a SLOG does mean the data disks themselves do not have to seek as often, and thus maintain their high sequential I/O streams. A SLOG only needs to be a few gigabytes, depending on the speed of the pool and the workload. A 2GB SLOG is perfectly usable; I would probably use at least 4GB. You only need write performance; SLOG devices are only read in case of emergency (when rebooting and the SLOG is needed to maintain consistency for the pool by rewinding to a previous transaction group).

I wouldn't use HDDs for SLOG; the benefit would be small. The real benefit comes from a device with good seek performance for sync writes, so modern MLC NAND SSDs would be the most ideal for this type of I/O: the most performance per buck.
 
So a single 3rd-gen Intel X25-M, 100GB model would be sufficient if not overkill?

EDIT: I appreciate the feedback, by the way. I had forgotten the term SLOG, and the detail about sync writes only.
 
Well, that detail about sync writes is important, since it means a 100MB/s SLOG device can push 200MB/s or more to the pool, considering only some requests are sync writes while most are async writes.

For NFS workloads and workloads from virtual machine images, you would get a lot more sync writes, and thus the SLOG would be far more heavily utilized; it could mean a 100MB/s SLOG = 100MB/s pool performance. In that regard, a slow SLOG could actually harm performance.

So what do you want:
1) the SLOG to smooth out wild dips in write performance and make writes much more fluent
2) to prevent huge write flushes that stall I/O for multiple seconds
3) enough write performance on the SLOG to achieve your overall performance target

I think the Intel G3 with 160MB/s write is pretty decent for SLOG. It has a supercap, has good random write IOps performance, and good sequential write. You can RAID0 the SLOG for more performance.

With multiple SSDs, say 4x Intel G3 80GB, you could RAID0 them all and use a small slice (only 4GiB or so) for SLOG and the rest for L2ARC. Note that L2ARC requires RAM too. Though I should warn that L2ARC is primarily good for smaller files and not for large pictures; all it would accelerate is the metadata of those files, which is a decent gain but perhaps not worth buying 4 SSDs for. For other workloads this might make a lot of sense, though.
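As a minimal sketch of that layout on FreeBSD, using a single SSD instead of the four-drive stripe (pool name "tank" and device ada1 are placeholders, and keep in mind this kind of slicing is discouraged on Solaris):

gpart create -s gpt ada1
gpart add -t freebsd-zfs -l slog0 -s 4G ada1    # small slice for the SLOG
gpart add -t freebsd-zfs -l cache0 ada1         # the remainder for L2ARC
zpool add tank log gpt/slog0
zpool add tank cache gpt/cache0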

For your situation, one modern SSD would be great to accelerate everything. You just need to make sure you use a ZFS system where losing the SLOG does not mean losing the entire pool. This capability arrived in ZFS pool version 19 (removal of SLOG devices). But I've read on the FreeBSD mailing list that this code is already in the ZFS v15 shipped with FreeBSD 8.2; I still have to verify this, but if true, that would mean you can already use a SLOG safely in FreeBSD.
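If you want to verify what your system and pool support before relying on this, something like the following should tell you (pool name is a placeholder):

zpool upgrade -v         # lists the highest pool version this ZFS build supports
zpool get version tank   # shows the version the pool itself is running at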
 
Not sure if the 3rd-gen Intel X25-M will have a supercap. The 3rd-gen X25-E probably will.
Vertex 3 won't, Vertex 3 Pro will.
C400? Not sure.
 
I'm not sure I agree with not mirroring SLOG/ZIL devices. If you have two and one happens to die, you ideally shouldn't lose whatever writes weren't committed at the time. That may depend on the type of failure you see, though.

Losing your solitary ZIL device could result in a relatively large chunk of data going MIA. Yes, it'll fall back to RAM, but at the expense of whatever never got written to the pool. The Intel G2 SSDs also have roughly 150MB of DRAM cache before anything is committed to the flash, which could be problematic upon failure as well. Yes, prior to ZFS v19 you couldn't even bring your pool back up if the ZIL went MIA; that's fixed. But you still have to account for the potentially missing data.

As for sharing L2ARC/ZIL on a vdev of SSDs that have been partitioned: it's not only unsupported, but highly discouraged by many ZFS gurus.
 
Not sure if the 3rd-gen Intel X25-M will have a supercap. The 3rd-gen X25-E probably will.
Vertex 3 won't, Vertex 3 Pro will.
C400? Not sure.

I've read that the X25-M does have a supercap. They're available now. The X25-E won't, but it will use SLC flash.


I'm not sure I agree with not mirroring SLOG/ZIL devices. If you have two and one happens to die, you ideally shouldn't lose whatever writes weren't committed at the time. That may depend on the type of failure you see, though.

Losing your solitary ZIL device could result in a relatively large chunk of data going MIA. Yes, it'll fall back to RAM, but at the expense of whatever never got written to the pool. The Intel G2 SSDs also have roughly 150MB of DRAM cache before anything is committed to the flash, which could be problematic upon failure as well. Yes, prior to ZFS v19 you couldn't even bring your pool back up if the ZIL went MIA; that's fixed. But you still have to account for the potentially missing data.

As for sharing L2ARC/ZIL on a vdev of SSDs that have been partitioned: it's not only unsupported, but highly discouraged by many ZFS gurus.

I agree that sharing the SLOG device makes no sense. I don't think I need L2ARC at all because my build will have 32GB of DDR3-1333 ECC, with only 10TB of primary pool space.
 
The X25-E uses eMLC NAND now; SLC is now rarely used, just like I anticipated.

The G3 and C400 should have a supercap, but we'll have to wait for their actual arrival, since Intel has changed its SSD lineup once already. We don't know how the delay on 25nm NAND has changed the picture. But the latest info is that both the G3 and C400 will have a supercap, and some SandForce 2000-family versions too. I wouldn't buy the OCZ Vertex 3 Pro, though; I would go for a more reputable brand for an important device like a SLOG.

Why would sharing the SLOG make no sense to you?
1) you cannot/should not use SLOG when it can cause you to lose the entire pool
2) if that condition has been met, usage of SLOG is not that unsafe at all
3) you can use it without mirror config
4) you can use it for multiple pools

Multiple SSDs in a mirror could both corrupt at the same time if both were abruptly disconnected (unsafe shutdown). So this construct would only provide a minimum protection.

You simply should not use the SLOG feature until the loss of a SLOG device is handled gracefully. When the SLOG is lost, recent patches allow ZFS to rewind to an earlier transaction group; without those patches your entire pool will be gone, and this can happen even with a mirrored SLOG. So simply do not use the SLOG feature until it is safe.
 
The X25-E uses eMLC NAND now; SLC is now rarely used, just like I anticipated.

The G3 and C400 should have a supercap, but we'll have to wait for their actual arrival, since Intel has changed its SSD lineup once already. We don't know how the delay on 25nm NAND has changed the picture. But the latest info is that both the G3 and C400 will have a supercap, and some SandForce 2000-family versions too. I wouldn't buy the OCZ Vertex 3 Pro, though; I would go for a more reputable brand for an important device like a SLOG.

Why would sharing the SLOG make no sense to you?
1) you cannot/should not use SLOG when it can cause you to lose the entire pool
2) if that condition has been met, usage of SLOG is not that unsafe at all
3) you can use it without mirror config
4) you can use it for multiple pools

Multiple SSDs in a mirror could both corrupt at the same time if both were abruptly disconnected (unsafe shutdown). So this construct would only provide a minimum protection.

You simply should not use the SLOG feature until the loss of a SLOG device is handled gracefully. When the SLOG is lost, recent patches allow ZFS to rewind to an earlier transaction group; without those patches your entire pool will be gone, and this can happen even with a mirrored SLOG. So simply do not use the SLOG feature until it is safe.

As I stated above, slicing up your SLOG device is not supported or suggested. That's precisely WHY you shouldn't share the SLOG: if you ask for help on a Solaris mailing list and tell them you're slicing up your SSDs to share across pools or between ZIL/L2ARC, the first thing any of the ZFS gurus will tell you is that it's not supported, so fix that first, and if the problem persists then we'll carry on trying to debug.

What version of ZFS rewinds transactions in case of lost SLOG? What good does that do you if you've lost important writes due to SLOG failure on something like a VM datastore? Sure it'll roll back, but I think that'd still be catastrophic in some cases.
 
Well, keep in mind that Solaris internals are different from FreeBSD's. Solaris cannot send FLUSH commands to partition device nodes, only to whole devices, if memory serves me correctly. This limitation does not apply to FreeBSD; FreeBSD also detects ZFS pools differently than Solaris does.

If you feed a bare disk to ZFS, it will automatically partition it and use the raw device node instead. I believe this has to do with above limitation.

Aside from this limitation, there is nothing to prevent you from sharing your SLOG on the FreeBSD platform. One day this limitation will likely be gone on the Solaris platform as well.

FreeBSD has the GEOM framework, an advanced I/O framework where several modules can attach to each other. BIO flushes on FreeBSD work perfectly across these layers, so there is nothing preventing you from RAID0ing several SLOGs together and passing that to ZFS. Due to the flexible nature of GEOM, implementing TRIM-over-RAID was trivial too, so RAID0ed SLOG disks can easily be wiped and reset to factory condition; it only takes 8 seconds or so.
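For example, a striped SLOG via GEOM might look roughly like this (device and label names are placeholders; striping the log adds throughput, but one failed SSD then takes the whole log device with it):

kldload geom_stripe                              # load the stripe module if it isn't compiled in
gstripe label -v slog /dev/ada1p1 /dev/ada2p1    # RAID0 the two SSD partitions
zpool add tank log /dev/stripe/slog              # hand the striped device to ZFS as the log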

So be careful when applying advice from the Solaris platform to other platforms. Solaris has its own share of unique bugs due to its design, which in some ways is rather outdated compared to other operating systems. FreeBSD has smoothed out these rough edges, and I would hope IllumOS can make ZFS less 'Solaris-specific'; the removal of the Python dependency was a good first step.
 
Very nice. I've used FreeBSD since 2.2.7 and in general like it considerably more than Linux variations, if for nothing else than the fact that the world and most of the userland tools are maintained by the same group of people. This brings consistency to the 'world', whereas over on Linux it's just a rabble of guys hacking away at new versions.

As soon as FreeBSD is at least up to the latest ZFS version before Oracle shut the door I'll consider that an option. I know -HEAD and -9 are up to 28(?) so my wish will soon come to fruition.

I just hope that both Illumos and the FreeBSD ZFS guys can add features that Oracle is adding (such as Crypto, yes I know FreeBSD has GELI) without having to wait for Oracle to release a codebase (which they said they'd do upon releasing production ready versions of Solaris)
 
ZFS v28 should be backported to 8-STABLE in one or two months; the ZFS v28 patch is maturing. Booting from 4K pools now works (ashift=12), so a lot of the missing bits are now present, and the patchset is maturing rapidly as it is being tested by quite a few people.

The ZFS crypto stuff and later versions would be released as CDDL, but always behind the official Solaris releases, so I guess about half a year of delay or so. I'm more interested in FreeBSD-specific features:
- booting from RAID-Z/2/3
- automatic swap ZVOLs
- flush commands on any device node
- import/export commands
- further integration with geom

IllumOS did remove the Python dependency, which already benefits FreeBSD. I do hope both projects can benefit from each other and share their work on ZFS. But even if ZFS stays as it is now in FreeBSD, I think it's one hell of a filesystem, unmatched by any alternative. Btrfs could rival it one day, though, even though Btrfs is only a filesystem and not a volume manager.
 
Watching an Illumos webcast the other day, it looks like they will be adding TRIM support for ZFS (late 2011/early 2012). This will be big when it comes to maintaining performance on SSDs.

Can anyone speak on actual devices that they have tested or used as ZIL (preferably in Nexenta) and found to be good performers?
 
You mainly need sequential write performance. I used 4 Intel X25-V 40GB drives that can write 45MB/s each; that makes 180MB/s aggregate, which I can confirm with gstat. I don't use them for SLOG though, only for testing.

A SLOG is very good if you want ZFS to have smooth write performance. With one, random writes to your pool end up hitting the HDDs as sequential I/O. That is the power of ZFS, and if you can utilize it well, it can have a profound impact on your performance.

It's a shame we're still at 1Gbps networking, though; I'm itching to upgrade to 10GBase-T once the products are cheaper and sold through retail channels. But 1Gbps leaves plenty of opportunity to optimize for random I/O, both writes (SLOG) and reads (L2ARC).
 
The v28 patchset is near production quality. The first two patchsets were not merged to -CURRENT; only now, with the third patchset, does -CURRENT run ZFS v28 natively without patches. Pawel (the main ZFS developer on FreeBSD) said that it could be MFC'd (Merged From Current) to 8-STABLE in one or two months, indicating that unless anything weird pops up, this is going to be the real thing.

For real data, you should wait at least until it merges to 8-STABLE, which in time will become 8.3-RELEASE. At that point it would be considered mature enough to replace the current stable v15 version. So generally this is very good news, and by summer you can run ZFS v28 stable on FreeBSD, which is great! 4K boot support now also works, which was still lacking in previous patchsets.
 
Is it "only" booting from 4k-pools or does it also make the gnop-4k pool-initialization obsolete?
 
So be careful when applying advice from the Solaris platform to other platforms. Solaris has its own share of unique bugs due to its design, which in some ways is rather outdated compared to other operating systems. FreeBSD has smoothed out these rough edges, and I would hope IllumOS can make ZFS less 'Solaris-specific'; the removal of the Python dependency was a good first step.

My biggest concern with ZFS on Solaris at the moment is that it seems clueless about 4K sector drives. Or, at least, the 4K support it has is useless for the majority of drives on the market, which intentionally lie about their sector size. You can Google around and find various discussions, patches, and speculation on this topic, but there doesn't seem to be any authoritative statement from Oracle or IllumOS devs about the status of 4K support in current and future releases, and/or recommended workarounds and best practices. FreeBSD ZFS doesn't do much better by default, but you can use the GEOM/gnop feature to make a drive report its sector size correctly. This, combined with your findings on ideal vdev sizing, seems to fix up most of the issues. I just wish there was one OS that combined FreeBSD's I/O framework with Solaris' CIFS and ZFS implementations.
 
I just used ZFSGuru to initialize the pool with gnop (4K/ashift=12), then exported it. Booted into SE11 and upgraded, then imported. Works.
Just FWIW, when somebody wonders how to combine FreeBSD Geom überness with Solaris...
 
Interesting. I never had much luck getting SE11 to recognize ZFSGuru-created pools, but I'll play around with it some more. Maybe I'll try creating the pool on raw disks instead of letting ZFSGuru partition them, as I've read that Solaris doesn't like partitions.
 
In the OP, I mentioned that I had been preparing a ZFS build list. I have now made my final decisions and ordered the components. Here's the list:

Chassis: NORCO RPC-4224 with SFF-8087 cables (one reverse-breakout)
Motherboard: Supermicro H8DG6-F with integrated IPMI and LSI-based SAS controller, terminating in SFF-8087 ports.
CPU: 2x AMD Opteron 6128 (Magny-Cours, octacore, 2.0 GHz)
RAM: 8x 4GB ECC DDR3-1333
Disks: 10x Samsung Spinpoint F4 HD204UI 2TB
PSU: Corsair AX-750 with the necessary Molex splitter and EPS12V splitter

My plan is to add an SAS expander in the future, as well as an external SAS controller, and an SSD (or two) to be the SLOG. I intend to use double-sided tape to attach the SSDs to the inside of the chassis walls, rather than using any of my precious 3.5" hotswap bays.

The last thing I think I need is to decide on a 2-port Ethernet adapter to aggregate with the 2 onboard ports (802.3ad link aggregation, I think; my other server uses it, and my switches support it). Additionally, I'm debating making an iSCSI fabric of some kind, using a dedicated VLAN or maybe a whole switch.
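If I end up on FreeBSD, my understanding is that the LACP aggregation would be configured roughly like this in /etc/rc.conf (interface names and the address are placeholders; a Solaris-based OS would use dladm instead):

cloned_interfaces="lagg0"
ifconfig_igb0="up"
ifconfig_igb1="up"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 192.168.1.10/24"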

Regarding the drives, I intend to create five 2-drive mirrored vdevs in a single zpool (using all ten drives), with the intention of adding additional vdevs as necessary. I did buy an extra drive to use as a hot spare, but I may instead put it in an external chassis for the time being and buy another one later.
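For reference, my understanding is the pool creation would look something like this (device names are placeholders):

zpool create tank \
    mirror da0 da1 \
    mirror da2 da3 \
    mirror da4 da5 \
    mirror da6 da7 \
    mirror da8 da9
zpool add tank spare da10    # the extra drive as a hot spare, if it stays in this chassis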
 
I wonder why no one is even considering the new Seagates... The new Seagate Barracuda Green 5900.3 seems to be pretty sweeeet. Nice setup btw, have fun :D

I never had much luck getting SE11 to recognize ZFSGuru-created pools, but I'll play around with it some more. Maybe I'll try creating the pool on raw disks instead of letting ZFSGuru partition them, as I've read that Solaris doesn't like partitions.
If I remember correctly, I created the pool on raw disks and had to use the -D switch on zpool import (or upgrade) to search for deleted/hidden pools.

BTW, is there some common knowledge whether ZFS performance is lower on whole disk vs. 4k-aligned partitions/slices?
 
I wonder why no one is even considering the new Seagates... The new Seagate Barracuda Green 5900.3 seems to be pretty sweeeet. Nice setup btw, have fun :D

I went with the Samsung drives based on a lot of good reviews. They've been out for a while and appear to be quite reliable. Also, thanks :D
 
Interesting. I never had much luck getting SE11 to recognize ZFSGuru-created pools, but I'll play around with it some more.
Solaris doesn't properly support the GPT partitions ZFSguru creates, but if you use GEOM formatting then it would be like whole disks to Solaris. The start LBA would be zero.

So this should work:
1) format disks with GEOM
2) create a pool with 4K sectorsize override
3) export the pool
4) import the pool on Solaris platform again
5) check ashift with:
zdb -e <poolname> | grep ashift

ashift=9 means optimized for 512-byte sectors
ashift=12 means optimized for 4K sectors
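On plain FreeBSD, outside the ZFSguru web interface, the 4K override in step 2 is normally done with gnop; a rough sketch, with pool and device names as placeholders:

gnop create -S 4096 /dev/ada0                # expose a 4K-sector view of each disk
gnop create -S 4096 /dev/ada1
zpool create tank mirror ada0.nop ada1.nop   # the pool gets created with ashift=12
zpool export tank
gnop destroy ada0.nop ada1.nop               # the .nop nodes are only needed at creation time
zpool import tank                            # re-imports on the raw disks; ashift stays 12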

Is it "only" booting from 4k-pools or does it also make the gnop-4k pool-initialization obsolete?
The change in FreeBSD bootcode simply allows you to boot ZFSguru/FreeBSD from pools created with 4K sectorsize, which until now was not possible.

The sector-size override using geom_nop (GNOP) still works; it forces the ashift to be 12 instead of 9, and that stays as a permanent setting, even after a reboot when the disks themselves are detected as 512-byte-sector devices again.

One important side effect from this method is: you can add native 4K sector disks in the future! This is NOT POSSIBLE with normal ashift=9 pools, as far as my tests indicate. So using the sectorsize override is good if you ever plan to add real/native 4K sector disks in the future.

BTW, is there some common knowledge whether ZFS performance is lower on whole disk vs. 4k-aligned partitions/slices?
There is no reason to expect performance differences. But you could have problems on the Solaris platform, where data integrity can be lost when using partitions. Whole-disk partitions seem to work, though; I don't know the details. But it's an important limitation that you should be aware of.

FreeBSD has its excellent GEOM I/O framework to handle stuff like this; Solaris doesn't, and that results in more limitations and pitfalls, such as the loss of flush capability on partitions, which is crucial to maintaining integrity. GEOM acts like a chain, and, for example, flush commands (or TRIM commands*) can be sent down the chain. So you could have an encryption layer connected to a RAID0 layer connected to a partition connected to a disk, and commands sent down the chain will all be handled properly. Such a framework does not exist on the Solaris platform, to my knowledge.

* Actually they are BIO_DELETE commands; only when they reach the ata/ahci driver do they get transformed into TRIM commands for SSDs or a comparable command for CompactFlash devices.
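To make the chain concrete: a hypothetical stack like the one described above (RAID0 over two partitions, with encryption on top; all names are placeholders) would be assembled roughly like this on FreeBSD:

gstripe label -v st0 /dev/ada1p2 /dev/ada2p2   # RAID0 layer across two partitions
geli init -s 4096 /dev/stripe/st0              # encryption layer on top of the stripe
geli attach /dev/stripe/st0                    # asks for the passphrase
zpool create secure /dev/stripe/st0.eli        # ZFS consumes the top of the chain

Flushes (and BIO_DELETE requests) issued by ZFS then travel down through geli and gstripe to the disks.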

My biggest concern with ZFS on Solaris at the moment is that it seems clueless about 4K sector drives.
They probably will solve it at one point, but FreeBSD has an advantage here with the GEOM I/O framework.

I just wish there was one OS that combined FreeBSD's I/O framework with Solaris' CIFS and ZFS implementations.
So what are you really missing on the FreeBSD platform? Samba often has performance issues, but it's possible to get very smooth performance. It isn't a kernel-level CIFS driver like Solaris has, but recent versions of Samba do appear to perform better on FreeBSD than in the past.

With ZFS v28 around the corner, almost everything you would want is in, including extensions FreeBSD made like boot support and automatic swap devices and graceful imports.

If there were one thing I was jealous of in Solaris, it would be ZFS memory allocation. This tends to be a problem on FreeBSD, where manual tuning is required; on Solaris it works smoothly out of the box. Prefetching comes to mind here: it causes problems on FreeBSD systems with little RAM, so it gets disabled by default. On Solaris there appear to be no problems with prefetching even on lower-RAM systems. On FreeBSD you need 8GiB of RAM for prefetching to be enabled by default. After tuning you can enable it with less RAM, though, and it tends to increase sequential read scores considerably.
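The tuning I'm referring to usually comes down to a couple of loader tunables in /boot/loader.conf; the values below are just examples for a machine with plenty of RAM (some releases want arc_max as a plain byte count):

# /boot/loader.conf
vfs.zfs.arc_max="24G"             # cap the ARC so the rest of the system keeps some memory
vfs.zfs.prefetch_disable="0"      # force prefetching on, even below the 8GiB auto-enable threshold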

As with most choices in life, each has its pros and cons. You need to figure out which pros are important to you and which cons you can live with. For me, that choice is FreeBSD, since I'm already reasonably familiar with the OS and very much like its design and advanced features. I also think it's a very safe home for ZFS; FreeBSD can import any CDDL code without license problems and continues to work on ZFS itself.

A bit off-topic but still interesting: FreeBSD could play a major role in other operating systems in the future. Ubuntu running on a FreeBSD kernel is already working, including ZFS support! The Debian/kFreeBSD project is a drop-in replacement for the Linux kernel, but it uses the GNU userland, so it feels very much like a Linux system. Such things are possible and, if matured, could provide a very interesting platform. Perhaps in the future you can choose the kernel of your OS. ;-)
 
Thanks for all the discussion, guys :) I am not yet decided on which OS to run, so these explanations are helping me quite a bit. I had been thinking OpenIndiana, but now I'm thinking ZFSGuru, because I have a little experience with FreeBSD: I'm running a pfsense box as my border gateway and firewall. I think I would get to know FreeBSD better if I had 2 systems running it.

Then again Illumos looks pretty cool, and OpenIndiana should be exciting when it finally launches...
 
I thought I might update this thread with some progress. If I could rename it, I would. To any Mod listening, could you rename it for me?

All my parts have arrived, other than my HP SAS Expander. I'm noticing a few small issues with the placement of the SFF-8087 connectors on my board, when combined with the NORCO RPC-4224. Basically, there isn't much room for the SFF-8087 connector and cable. I'm looking for suggestions on how to get it to fit, outside of cutting away at the 120mm fan partition to include a nice gap for the SFF-8087 connectors.

Still haven't decided on a particular OS. I intend to play around with a few different ones. Part of me says "latest and greatest" meaning Solaris 11 Express... but I must admit that I don't like Oracle at all, and don't especially want to give them all my personal information just to get a download link. Illumos+OpenIndiana really excites me, but I can't find a timetable for release.
 
Does that motherboard fit in the Norco chassis? The motherboard is SWTX.

Edit: Nevermind, read your latest post. Care to post some pictures of the motherboard placement?
 
Still haven't decided on a particular OS. I intend to play around with a few different ones. Part of me says "latest and greatest" meaning Solaris 11 Express... but I must admit that I don't like Oracle at all, and don't especially want to give them all my personal information just to get a download link. Illumos+OpenIndiana really excites me, but I can't find a timetable for release.

see http://wiki.openindiana.org/oi/OpenIndiana+Releases
other stable option is NexentaCore

Current OpenIndiana is more or less the last open OpenSolaris, build 148.
In its basic features it's quite stable. I have been using it on 4 of my machines since January without problems yet (virtualized, CIFS and NFS use in my ESXi All-In-Ones).

Gea
 
Does that motherboard fit in the Norco chassis? The motherboard is SWTX.

Edit: Nevermind, read your latest post. Care to post some pictures of the motherboard placement?

I will get right on it. Just have to bust out the camera gear. The system is so big that photographing it will be tricky but getting shots of the one problem area should be simple enough.

EDIT: [photos of the motherboard placement attached]
And yes, it fits superbly otherwise. The RPC-4224 has really nice standoffs and plenty of tapped holes for different mount arrangements. I've been very impressed with it.
 
I just used ZFSGuru to initialize the pool with gnop (4K/ashift=12), then exported it. Booted into SE11 and upgraded, then imported. Works.
Just FWIW, when somebody wonders how to combine FreeBSD Geom überness with Solaris...

I first bought into the FUD and tried this myself when I got my first 4k disks, but it didn't work. FreeBSD is good for some things, but ZFS is not one of them.

Just get a patched zpool binary and everything is solved in a much cleaner and faster way.

When using Solaris, don't listen too much to the FreeBSD guys - some of them don't really want to admit to themselves that ZFS on FreeBSD isn't really a good solution (don't take my word for it - just try them both). I've been running two ashift=12 pools for several months now and haven't had any problem (even the 7-disk zpool2 isn't sluggish at all - 2 TB of extra storage was more important to me than sticking with 6 or 10 disks).
 
So this should work:
1) format disks with GEOM
2) create a pool with 4K sectorsize override
3) export the pool
4) import the pool on Solaris platform again
5) check ashift with:
zdb -e <poolname> | grep ashift

ashift=9 means optimized for 512-byte sectors
ashift=12 means optimized for 4K sectors

sub.mesa:
We all know you love FreeBSD, but please don't make people do this. Just get the ashift=12 patched zpool binary.
 
I wonder why it is any better to use hacked source code to force ashift=12 than to use GNOP to convince ZFS to set that parameter through its own educated guess?
 
sub.mesa:
We all know you love FreeBSD, but please don't make people do this. Just get the ashift=12 patched zpool binary.
Don't use the ashift-hacked Solaris binary; that hack has cost quite a few people the data of their entire ZFS pool. That workaround is DANGEROUS.

This workaround is completely safe; no hacked or patched ZFS is needed. So I don't exactly understand your objections. If there's anything you should NOT do, it is using the dangerous patched-ashift binaries on the Solaris platform; don't gamble with your data!
 
I wonder why it is any better to use hacked source code to force ashift=12 than to use GNOP to convince ZFS to set that parameter through its own educated guess?

As far as I am concerned - the whole ZFS implementation in FreeBSD is "hacked source-code" so this is a pretty strange position to take.

I've read/written a guesstimated ~60TB to my two arrays, which were created with the modified binary, and I've never seen any kind of trouble. Given that Solaris had so much trouble recognizing FreeBSD-created zpools, I'm not exactly unsure about my position on this.
 
Don't use the ashift-hacked Solaris binary; that hack has cost quite a few people the data of their entire ZFS pool. That workaround is DANGEROUS.

Please back this up with references. I think you might be confusing it with the hack someone made that assumed every pool it touched was ashift=12. That claimed some pools and was stupid. I seem to recall that patch was made in the kernel.
 
1) Not sure why you would say FreeBSD has hacked source code and Solaris does not; not really interested in such FUD either.
2) The fact that YOU did not have problems and chose to gamble with your data doesn't mean others won't have problems or want to gamble with theirs.
3) Solaris doesn't have 'so much trouble' importing FreeBSD-created pools; it only takes one command. It just doesn't work with GPT-formatted partitions, since Solaris does not properly support GPT, and as such ZFS cannot find the 'entrance' to its metadata on the disks. GPT partitions do not start at LBA 0 and thus need to be supported. If you format the disks with GEOM, or do not format them at all, you should not have this problem and you can import without problems on the Solaris platform.

Do not use the hacked ashift binaries; the GNOP workaround is much safer.
 
1) Not sure why you would say FreeBSD has hacked source code and Solaris does not; not really interested in such FUD either.
2) The fact that YOU did not have problems and chose to gamble with your data doesn't mean others won't have problems or want to gamble with theirs.
3) Solaris doesn't have 'so much trouble' importing FreeBSD-created pools; it only takes one command. It just doesn't work with GPT-formatted partitions, since Solaris does not properly support GPT, and as such ZFS cannot find the 'entrance' to its metadata on the disks. GPT partitions do not start at LBA 0 and thus need to be supported. If you format the disks with GEOM, or do not format them at all, you should not have this problem and you can import without problems on the Solaris platform.

Do not use the hacked ashift binaries; the GNOP workaround is much safer.

With all respect - this is again a lot of FUD.

1) Solaris is the source; it's where ZFS was created. The architectures of Solaris and FreeBSD are different, and hacks have to be made to make stuff work. Everything ZFS-related works on Solaris; that isn't the case on FreeBSD. My only gripe is that performance is horribly bad on FreeBSD on my hardware.
2) Again, this is just an undocumented claim restated. Please point to sources that back it up. I know you're more active than me on a lot of forums and should be able to bring some evidence to support this. What you're suggesting is still just as much an unsupported workaround as using the patched zpool binary.
3) I don't doubt this. I didn't have the expertise to know about all the ways to handle drives in FreeBSD.
 
1) It is kind of funny to call FreeBSD a bunch of hacked source code, especially regarding sector sizes, since FreeBSD is the only OS in the world that has proper sector-size code; it would natively support 4K disks, something that won't work on Solaris yet. Solaris expects/hardcodes a 512-byte sector size in a lot of places; a different sector size will not work well there without drastic changes. Due to FreeBSD's proper sector-size handling, Solaris pools can benefit from being created on FreeBSD. As you may not know, FreeBSD has fixed many of ZFS's shortcomings that stem from its tight integration with Solaris; proper code to determine ashift at creation time is one of those things. Calling that a hack is just ironic and funny, while at the same time rather stupid.

2) ZFS used to be hardcoded to ashift=9 (512-byte sectors); the hacked binary changes that hardcoded value to 12 (4K sectors). This works on 512-byte disks, but it has resulted in a number of corrupted pools where going back to the non-patched binary did not help; the whole pool was permanently corrupted. References are on the OpenSolaris mailing list and the FreeBSD forums, though don't ask me to search for them -- I would rather spend my time on something else.

3) Then why did you object to the FreeBSD method? It's safe, it works, and it's rather easy. It allows you to optimize your Solaris ZFS pool for 4K disks without danger; isn't that the best recommended method? I'm not sure why you asked me to stop recommending this to people and instead point them to a method which could potentially kill all their data. No thanks!

Don't take this personally, though; if you love the Solaris platform, that is absolutely fine! I just don't like the Solaris vs. FreeBSD hostility very much. If you feel I'm your competitor of some kind, then why don't you simply ignore me instead? That saves a lot of energy for use elsewhere.

This thread used to be about the ZFS Intent Log; now it has turned into a low-quality Solaris vs. FreeBSD thread, and that kind of thing makes me feel sick.
 
That is some heavy noctua-cooling, is it really required? (despite looking awesome ;)
 