ZFS ZIL Question

Hi All,

I'm looking to implement a home ZFS NAS/SAN in the near future that will serve a Hyper-V failover cluster via iSCSI.

I was originally planning to use 4 x 1TB WD Black SATA drives. The pool would consist of two mirrors of two drives.

I would use a low cost 32 or 64GB SSD for the L2ARC and a 4 or 8GB SSD for the ZIL.
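
For reference, that layout would be created with something like the following (the device names are just placeholders for whatever the OS assigns):

Code:
# pool as two 2-way mirrors (striped mirrors)
zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0

# add the SSDs: one as L2ARC (cache), one as a dedicated ZIL (log)
zpool add tank cache c0t4d0
zpool add tank log c0t5d0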

However, I am unsure about the SSD for the ZIL. It's my understanding that any writes would first have to pass through the ZIL before being committed to disk. Does this mean that write performance for the whole pool is limited to the write speed of your ZIL device?

Thanks!!
Riley
 
While my experience with ZFS is limited, I read through a lot of the documents describing how it works.

In optimal cases the ZIL device is only written to, never read from. It is just a non-volatile backup for the RAM cache. This also means a ZIL device larger than the available RAM cache is wasted space. Just caching the writes in RAM would be faster (and can be achieved by disabling the ZIL), but in case of a power loss or server crash data would be lost. Further, the ZIL is only used for synchronous writes, which means that the write operation will block until all data is committed to non-volatile storage. Asynchronous writes can obviously be cached purely in RAM.

For optimal performance this means that the write speed of the ZIL should be higher than that of your network or your pool if possible, but it does not limit the write speed directly. If the SSD gets saturated, writes will simply block until they are committed to the HDDs, which will increase latency and reduce IOPS. For long sequential writes you should still see pool/network speed. The problem is that some SSDs get slow as hell if their internal cache is disabled. The optimal ZIL device is a battery-backed RAM disk, maybe even with a flash backup.
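
On pools recent enough to have the per-dataset sync property you can also see and control this behaviour directly; a quick sketch (the dataset name is just an example):

Code:
# standard = honour sync requests via the ZIL (the default)
zfs get sync tank/vm

# force everything to be synchronous, or bypass the ZIL entirely
zfs set sync=always tank/vm
zfs set sync=disabled tank/vm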
 
omniscence, thanks for the reply..

I've done a lot more research and here is what I've found (some of which omniscence has already said):

1) The ZIL is only used in the case of synchronous writes.

2) By default, iSCSI and NFS use synchronous writes.

3) I've seen mention that only synchronous writes up to a certain size will use the ZIL. So, I guess when the write is over xxx size it will bypass the ZIL and go straight to disk?

4) Synchronous writes on ZFS are considered "proper" and will be slower than on other file systems. Other file systems will only make sure the data has been sent to the drive, but not necessarily flushed from the drive's cache to the platters. This behavior can be disabled in ZFS, but it's not really recommended.

5) If you have a large number of synchronous writes that go to the ZIL device then your write speed for the pool is limited by that device. So, it's good to use a write-optimized SSD for the ZIL.

6) Not all SSDs are safe to use as a ZIL device. A number of current-gen SSDs (including Intel) don't obey the flush command from the host OS and keep data in their cache. This means that you would lose log data if there was a power failure. Older versions of ZFS would not allow you to mount a zpool with a corrupt log device (and you couldn't import a zpool without its log!!). This has been fixed for some time now (though I'm not sure about FreeBSD, which still uses an older version).

Some SSDs, like the OCZ Vertex 2 Pro, have a "super-cap" (capacitor) that will flush the data from the cache in the event of a power failure. The only other option is to find a "non-caching" SSD to use as a ZIL device, but I'm not sure which ones are caching vs. not.


So, all-in-all using an SSD for the ZIL is a picky affair. You need an SSD that:

1) Has fast enough writes
2) Is non-caching and obeys the OS flush commands or has a super-cap.

Think that about covers it..

Riley
 
It comes down to that. When I read into ZFS, SLC-based disks were strongly recommended. As SSDs have developed quite a bit since then you may get away with an MLC-based one, but I could not find any recent suggestions about that. I have read that X25-E (SLC) SSDs obey disk flushes if the drive cache is disabled, but they will be much slower in this mode.
 
If you want to use a ZIL, you have to use an SSD that has a supercapacitor, to avoid corruption on power loss. When SSDs lose power, they may overwrite data that wasn't intended to be written; an extremely serious corruption issue. When HDDs lose power, they keep everything that was already written to the media; they just lose what's in the DRAM buffer chip. This is a fundamental difference.

I can't find a reason to prefer SLC. It doesn't give you more write endurance per dollar/euro invested, so what exactly is the point here? The new Intel X25-E uses MLC memory instead of SLC, but still has improved write endurance over the older X25-E with SLC NAND.

The third-generation SSDs with a supercapacitor should be your focus when using a SLOG. Also, you need ZFS pool version 19 to use this feature without sweating it, since on older versions losing the SLOG device or having a corrupted SLOG would mean you lose the entire pool!

For this reason, use current first- and second-generation SSDs (everything that is on the market today) only for L2ARC, and only use the upcoming third-generation SSDs for the SLOG. Unlike the SLOG, the L2ARC feature is completely safe even if your SSD corrupts itself.

A SLOG is essentially 100% sequential writes; no random I/O is done and nothing is read back except during recovery after a crash or interrupted power cycle, so only sequential write performance is of interest here. Please note that SandForce compression may not give as much benefit here; all ZFS metadata is already compressed, and quite possibly your actual data as well.
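
If you want to check where your pool stands before trusting it with a SLOG, something like this works (pool name is just an example):

Code:
# on-disk version of the pool; 19+ is what allows log device removal
zpool get version tank

# list what each version adds, and upgrade the pool if needed
zpool upgrade -v
zpool upgrade tank

# on a v19+ pool a dedicated log device can be removed again
# (device name is a placeholder)
zpool remove tank <log-device>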
 
Bringing this one back up...

I've got a system in testing consisting of the following:

ASUS M4A89GTD AM3 890GX
Athlon II X2 250
8GB DDR3 ECC
8 x Hitachi 5K3000 2TB
Corsair F40 SSD
Intel 1000/PT dual port NIC. Aggregated.
Norco 4216
LSI 9211-8i flashed with IT firmware
Running OI_148 and napp-it

No L2ARC or ZIL.

Code:
zpool status:

  pool: pool01
 state: ONLINE
 scan: none requested
config:

	NAME                       STATE     READ WRITE CKSUM
	pool01                     ONLINE       0     0     0
	  raidz2-0                 ONLINE       0     0     0
	    c1t5000CCA369C599ECd0  ONLINE       0     0     0
	    c1t5000CCA369C5A320d0  ONLINE       0     0     0
	    c1t5000CCA369C5A4A1d0  ONLINE       0     0     0
	    c1t5000CCA369C6C625d0  ONLINE       0     0     0
	    c1t5000CCA369C724D5d0  ONLINE       0     0     0
	    c1t5000CCA369C78389d0  ONLINE       0     0     0
	    c1t5000CCA369C785F3d0  ONLINE       0     0     0
	    c1t5000CCA369C78B0Ad0  ONLINE       0     0     0

errors: No known data errors

Performance seems to be great so far. This is going to be a storage server for media, backups, and eventually will host some iSCSI targets for a hyper-v cluster.

Already I'm running into issues with strange performance in SMB and iSCSI (sequential writes). After a large amount of research it looks as though it's actually normal, and is what's referred to as ZFS "breathing". Since ZFS batches I/O into transaction groups, the constant grouping and flushing causes the pattern you see below. From what I understand, a separate log device will smooth these out and help performance.
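
You can actually watch the breathing from the server side while a copy is running; for example:

Code:
# per-vdev bandwidth refreshed every second - the writes arrive in
# bursts as each transaction group is flushed out to the disks
zpool iostat -v pool01 1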

This is a transfer of a ~35GB disk image. Speed is limited by the source machine (my WHS box). When I do the same transfer from my workstation to the ZFS server the transfer averages between 90 and 115MB/s. Same choppy graph though.

zfsSMB.png


So, now that the new SSDs have been released is there a decent (and low cost!) one that is good for ZIL use? I really would love someone to release a good write-optimized 8 or 16GB SSD with a supercap.

In my travels I see a number of people referencing the Acard DRAM drives as a good solution as well as the "DDRDrive". The reasons seem to be that ZFS is hard on ZIL devices and an SSD's performance will degrade horribly over time. Of course the ones making this claim are the ones selling the DDRDrive. So, I'm not sure what/who to believe.

Thanks!!
Riley
 
The Intel 320 series SSDs have a bank of capacitors and a design with "enhanced power-loss protection".

http://newsroom.intel.com/servlet/J...eries_Enhance_Power_Loss_Technology_Brief.pdf

Edit:
The Intel 320 SSDs have a higher IOPS rating than the 510 SSDs. This might be relevant.

Also, I'm wondering if the garbage collection implementation is suitable for use as a ZIL...

Edit2:
The 160GB Intel 320: 270 MB/s read, 165 MB/s write; 39,000 read IOPS, 21,000 write IOPS
The 120GB Intel 510: 450 MB/s read, 210 MB/s write; 20,000 read IOPS, 8,000 write IOPS
(IOPS figures are for 4K random reads/writes.)

Of course these theoretical numbers become less relevant after the SSD is well-used...hence the question about garbage collection capability...
 
I was looking at the Intel 320 series, specifically the 40GB one.

It doesn't have to be huge. From: http://www.nexenta.com/corp/content/view/274/119/

#1
The minimum size of a log device is the same as the minimum size of device in pool, which is 64 Mbytes. The amount of in-play data that might be stored on a log device is relatively small. Log blocks are freed when the log transaction (system call) is committed.

#2
The maximum size of a log device should be approximately 1/2 the size of physical memory because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 Gbytes of physical memory, consider a maximum log device size of 8 Gbytes.

#3
For a target throughput of X MB/sec and given that ZFS pushes transaction groups every 5 seconds (and have 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GBytes of log device should be sufficient.

#2 says that a decent ZIL size is RAM/2. I have 8GB now and may add more in the future, so let's leave some room and plan around RAM/2 = 8GB for if/when I upgrade to 16GB of RAM.

So, I know my limiting speed will be the Ethernet link. Right now, I have a Pro 1000/PT with two interfaces aggregated. I suppose, theoretically, I could get ~250MB/s if I have enough hosts hitting it at once. Let's go with that as the maximum I could see.

#3 says that to sustain my maximum throughput of 250MB/s I need 250MB/s * 10 seconds. So, 2.5GB.

So, it looks like I only need 8GB of ZIL to cover me for now and in the future.

So, as I surmised in my posts at the start of the thread, if I use an SSD with slow writes (the Intel 320 40GB) then I'll only hurt myself because of its 45MB/s write performance.

Although it's considered bad practice, I may be better served purchasing a larger (say 120GB) SSD that is faster and splitting it up, with 8GB for the ZIL and the rest for L2ARC. The only thing is, I assume I'd need an SSD with a capacitor and at least 250MB/s write performance.

Looking here (again, these guys are pushing the DDRDrive, so they may be biased): http://www.ddrdrive.com/zil_accelerator.pdf — it shows that the ZIL write pattern with one filesystem is predominantly sequential, but trends towards random as more filesystems are introduced.

This means that a ZIL needs to be:

1) Capacitor backed
2) Very good on random writes
3) High IOPS
4) Ideally, no more than 8GB (for my case, due to cost)

I don't think an SSD like this exists. Perhaps something like the ACard is a better option? DDRDrive is too expensive..

Riley
 
But do you need your ZIL to be "that" fast? For a home server?

I think the 320 40GB will be just fine.

Edit:
And add to that, the Intel 320 is based on the proven Intel X25-M controller. I think that is more reliable than an ACard.
 
But do you need your ZIL to be "that" fast? For a home server?

I think the 320 40GB will be just fine.

Edit:
And add to that, the Intel 320 is based on the proven Intel X25-M controller. I think that is more reliable than an ACard.

It does... It's more than a home server. It also does some business related tasks and will be hosting a number of VMs in a hyper-v cluster. So, I definitely don't want to do anything to jeopardize performance.

As far as I can tell, using a substandard SSD (substandard as a ZIL, that is) can actually leave you with worse performance than before.

Riley
 
I'm really interested in this question. I'm also building a poor man's SAN for my business. The perfect SSD doesn't seem to exist (yet).
 
So I see the formula of RAM/2 = needed ZIL storage, but how do we know how much storage we need for L2ARC? Is more just better? What's the minimum? I also know these devices have a limited life span; can the number of hours/writes left be monitored through SMART or some other monitoring tool?

Also, let's say I have 32GB of RAM in my ZFS server, so a 16GB SSD is needed for the ZIL. If that's the case, what drive would you recommend? Should two be used in a RAID 1 config for the ZIL? I heard that version 28 of ZFS made it so the ZIL isn't as volatile as it used to be. Is this the case? Do we still need to take the same type of precautions as we did with older versions?

Sorry for all the questions. I am completely new to ZFS and trying to gather as much info as possible before spending a couple grand on a system for building a ZFS-based iSCSI SAN.
 
I want to bring this back up because I just found some interesting information regarding the Acard 9010 units.

One of the hangups I had with the unit in the past is that it took so long to backup/restore from the CF card in the event of a power outage. They stated around 7 minutes to restore 16GB of data from a 233x CF card.

I saw this as an issue because I know my ZFS server takes far less than 7 minutes to boot up, and it will expect its ZIL to be available. What would happen when it tried to read the ZIL and the drive was in the middle of a restore?

I looked at the Q&A for the 9010 and came across an interesting tidbit of information regarding restoring data during bootup.

Specifically:

From the Q&A: http://dl.acard.com/manual/english/ANS-9010_9010B Q&A_E.pdf

Q6-4. My ANS-9010/9010B boots up as slow as a regular HD does, but it has normal I/O speed 10 minutes later. Why is that?

A6-4: This is because the data in the RAM disk evaporates during power off and is restored from the CF while booting up. Because the computer accesses the CF directly while the restore is not yet complete, the I/O speed will be lower than it is supposed to be. It is recommended to apply the External Power Kit for users who are bothered by this symptom.

and...

Q3-9. How long does it take to restore data to an ANS-9010/9010B from a CF? Should I wait until the restore finishes before booting it up?

A3-9: The time required for restoring varies. It depends on the capacity of the RAM disk and the speed of the CF. For example, it takes about 7 minutes to restore 16GB of data from a CF with 233x speed. When booting up from an ANS-9010/9010B while restoring is in progress or has not even started yet, the system directly accesses the CF during the restore. Thus, the ANS-9010/9010B can be booted during or without data restoring. However, in such a case, the boot-up time is about the same as booting from a regular HD, because the system accesses the CF instead of the RAM modules. Users enjoy the extremely fast I/O speed after the data restore finishes.

So, to me, this means that the device and its data are fully accessible during the boot-up and restore process. While the data is being copied from CF to RAM, reads go to the CF card and you get CF-card speed. When the copy is complete, IO is directed to the RAM modules and you get the full performance.

So, for a period of up to 7 minutes after bootup (and only if the system was off longer than 4 - 5 hours and the battery was run dry, necessitating a restore from CF) you would have slower-than-normal performance.

Looking at some of the reviews online, it also appears they have resolved a number of the performance concerns I had. They mention performance issues being fixed by firmware updates.


I think I've finally found my ZIL... With 8GB of DDR2 and a fast CF card..

Riley
 
If you want to use a ZIL, you have to use an SSD that has a supercapacitor, to avoid corruption on power loss. When SSDs lose power, they may overwrite data that wasn't intended to be written; an extremely serious corruption issue. When HDDs lose power, they keep everything that was already written to the media; they just lose what's in the DRAM buffer chip. This is a fundamental difference.
This is news to me. Do you have more information on this?

Because ZFS is COW, disks are protected against these kinds of problems. But maybe SSDs behave differently from disks?
 
I've ordered one of the Acard 9010 units and I plan to put 8GB of DDR2 in it in ECC emulated mode. This will give me slightly less than 8GB of ZIL. I also ordered a Kingston 8GB 400x compact flash card. The unit I ordered says it also comes with the external power adapter.

Hopefully it will be here next week. In the meantime, I will be doing some benchmarking so I have results to compare against. I don't use NFS, but I do use iSCSI and CIFS so my results will be based on that.

Riley
 
This is news to me. Do you have more information on this?

Because ZFS is COW, disks are protected against these kinds of problems. But maybe SSDs behave differently from disks?

ZFS as a whole is copy on write, but that doesn't mean much to how the intent log functions.

The ZIL lets ZFS provide POSIX compliant synchronous writes, meaning that when an application issues a sync the filesystem needs to actually store the data safely when it tells the application it has processed that IO. Asynchronous writes don't have that guarantee, an application can write data and the filesystem can tell the application it has been written, but that data can then sit in cache somewhere before actually being transacted.

The reason the ZIL exists is because ZFS is transactional and can't really write data to the disk immediately every time an application wants to. So synchronous writes get put into the intent log and then transacted in with the rest of the IO. In order to keep that synchronous guarantee, the ZIL needs to be stable and hold its contents between reboots or crashes or whatever. The data in the ZIL is data that ZFS has said it has stored properly.

Basically, once ZFS has written something to the ZIL the data really needs to stay there until it can write it out to the pool. If power is lost or the system crashes or something the ZIL needs to be consistent from the time the last write was made to it. An SSD that can't do this will defeat the entire purpose of the ZIL and possibly cause corruption.
 
So, to me, this means that the device and its data are fully accessible during the boot-up and restore process. While the data is being copied from CF to RAM, reads go to the CF card and you get CF-card speed. When the copy is complete, IO is directed to the RAM modules and you get the full performance.

It comes with an external power supply you're supposed to use so the power doesn't cut off every time you turn off the computer. I agree that this is the best solution for ZIL at the moment.
 
So, I received my unit and it looks quite nice. I put in 4 x 2GB OCZ DDR2 and then put the unit into "dual" mode. This splits the RAM into two banks of 4GB so I get the benefit of two 300MB/s SATA II ports instead of one.

I added the drive as a ZIL stripe. I went this route vs. a mirror because I figured that the two drives (more like "channels", I guess) are on the same PCB anyway, so if one side fails the whole unit is probably going to fail and a mirror won't help.

I then started benchmarking and noticed that I was getting fairly poor write performance once the block size was over 32kb. I couldn't get anything over 50MB/s, whereas previously by 32kb I would be seeing ~70MB/s and scaling up to the limits of my gigabit Ethernet connection. This was a COMSTAR iSCSI target with a Windows Server 2008 R2 client running against it.

So, I removed the log devices and then created a new thin-provisioned iSCSI target on a pool with the two Acard vdevs in a stripe. I ran ATTO against it and performance is great whether I have sync=always or sync=disabled.

So, I know the Acard gives good performance by itself, so why then does write performance tank when I use it as a log device? I'm thinking it's something with OpenIndiana or ZFS because I know the Acard is capable of much higher.
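
For reference, this is roughly how I've been adding/removing the log and flipping sync between runs (the device names are the two Acard channels as they show up on my system; the zvol name is just an example):

Code:
# add the two Acard channels as a striped log, or pull them back out
zpool add pool01 log c2t4d0 c2t5d0
zpool remove pool01 c2t4d0 c2t5d0

# switch sync behaviour on the volume backing the iSCSI target
zfs set sync=always pool01/iscsivol
zfs set sync=standard pool01/iscsivol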

Here are the benchmarks:

sync=standard. By default iSCSI doesn't use sync writes. So, this is pretty much the same as sync=disabled.
sync-standard.jpg


sync=always, no ZIL. Pretty poor write performance
sync_always_no_ZIL.jpg


sync=always, with the two Acard channels in a striped log configuration. Performance starts off okay, but then tanks.
sync-always.jpg


sync=always, with one Acard channel as a log. About half the performance of the two-channel configuration above.
sync_always_oneAcard.jpg


sync=always, using the Acard as a striped pool with a thin-provisioned iSCSI target on it. There is no log device in this pool, just the two Acard channels. Performance is great, exactly what I expect.
sync-always-iSCSI_ACARDStripe.jpg


Thoughts? It doesn't make sense that I get excellent performance using it as a regular disk drive, but terrible performance as a log device.

Thanks!!
Riley
 
This is indeed a mystery. That you see this with larger block sizes seems like a clue, but I have no idea offhand what could cause this.
 
System config is as follows:

AMD Athlon X2 250
16GB DDR3 ECC
LSI MegaRAID 9211-8I
8 x Hitachi 5K3000 2TB
OI_151a running napp-it
Intel PRO/1000 PT dual port PCI-e NIC

Here is a zpool status..

Code:
pool: pool01
 state: ONLINE
  scan: scrub repaired 0 in 8h20m with 0 errors on Sun Nov 27 11:50:10 2011
config:

	NAME                       STATE     READ WRITE CKSUM
	pool01                     ONLINE       0     0     0
	  raidz2-0                 ONLINE       0     0     0
	    c1t5000CCA369C599ECd0  ONLINE       0     0     0
	    c1t5000CCA369C5A320d0  ONLINE       0     0     0
	    c1t5000CCA369C5A4A1d0  ONLINE       0     0     0
	    c1t5000CCA369C6C625d0  ONLINE       0     0     0
	    c1t5000CCA369C724D5d0  ONLINE       0     0     0
	    c1t5000CCA369C78389d0  ONLINE       0     0     0
	    c1t5000CCA369C785F3d0  ONLINE       0     0     0
	    c1t5000CCA369C78B0Ad0  ONLINE       0     0     0
	logs
	  c2t4d0                   ONLINE       0     0     0
	  c2t5d0                   ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: resilvered 8.54G in 0h1m with 0 errors on Tue Jan 10 15:10:38 2012
config:

	NAME          STATE     READ WRITE CKSUM
	rpool         ONLINE       0     0     0
	  mirror-0    ONLINE       0     0     0
	    c2t2d0s0  ONLINE       0     0     0
	    c2t0d0s0  ONLINE       0     0     0

errors: No known data errors

Well, isn't this odd. I know I set nocacheflush=1 prior to this and actually saw a DROP in performance (it landed somewhere between having the Acard as the ZIL and having no ZIL at all). However, I just set it again and performance is now what it should be. The only things that have changed are:

1) I updated the firmware on the Acard from 2.04 to 2.05. According to their website the only change was the added support for the 9010-B unit.

2) This time I set nocacheflush in a different way. This time I used "echo zfs_nocacheflush/W1 | mdb -kw". Last time I used "set zfs:zfs_nocacheflush = 1". Perhaps this is the cause?

So, I guess that raises the question: what implications will this have? I bought the Acard to use as a log device because I needed my iSCSI targets to be on stable storage. So, will setting nocacheflush=1 negate the benefits of using synchronous writes? Also, why would performance be okay when using the Acard as a separate drive, but tank when it's a log device?
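
For anyone following along, the two methods aren't quite equivalent; the mdb form patches the running kernel immediately, while the /etc/system form only takes effect at the next boot:

Code:
# change the live kernel right now (does not survive a reboot)
echo zfs_nocacheflush/W1 | mdb -kw

# check the current value
echo zfs_nocacheflush/D | mdb -k

# make it persistent: add this line to /etc/system and reboot
set zfs:zfs_nocacheflush = 1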

Here is the updated benchmark.

sync=always, nocacheflush=1, two Acard channels as the log device.

sync_always_dualAcard_nocacheflush_disabled.jpg


In a couple of weeks I have some 10Gb dual port CNAs coming. That should be interesting!!

Riley
 
I've read a bit more into cache flushing, and it turns out ZFS periodically tells the array, disk, or controller (the disks in my case, since the 9211-8i doesn't have any cache) to flush the contents of its cache to stable storage. Now, if you have a fancy array or controller with 2GB of battery-backed cache, it takes some time to flush that data every time ZFS asks. Hence, lower performance. This issue was recognized, and changes were made in later versions of ZFS that turned the command from "flush your cache" into "flush your cache, unless it's protected" (by a BBU, NVRAM, etc.). The thinking is that if your controller has any sort of battery-backed cache then it should be taken advantage of.

The reason the nocacheflush property exists is that sometimes a controller with a protected cache still treats the "flush unless you have protected cache" command as a real flush command and actually flushes the cache. In those instances you SHOULD set nocacheflush=1 to work around the faulty implementation. However, if you don't have a BBU or another form of protected cache, then setting nocacheflush=1 will leave any data sitting in the cache of an unprotected device vulnerable to loss.

So, in my case, I need to make sure that the write cache on my disks is disabled. Since the LSI 9211 doesn't have any cache, the disk cache is the only cache in the chain.
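
On Solaris/OpenIndiana the per-disk write cache can be toggled from format's expert mode; roughly like this (it's an interactive menu, so the steps below are just the path I'd expect to follow):

Code:
format -e
# select the disk from the list, then:
#   cache -> write_cache -> display    (show the current setting)
#   cache -> write_cache -> disable    (turn the on-disk write cache off)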

Riley
 
So, as I expected, when I disabled the cache on the drives I got similar performance to before. I start to see performance issues at transfer sizes over 16kb. I'm still not sure why, but it seems to be pointing to my disks.

So, here's the recap with iSCSI performance:


Code:
SYNC      ZIL  nocacheflush  Drive Cache  Performance
standard  N    0             Y            Good
always    N    0             Y            VERY Poor
standard  Y    0             Y            Good
always    Y    0             Y            Poor
always    Y    1             Y            Good
always    Y    1             N            Poor

I'm thinking the only way to salvage this may be to get a good RAID controller with a BBU. Then again, I guess leaving the write cache on the disks enabled won't be too much of an issue since I have a good UPS and shutdown configured. Do the contents of a drive's cache survive and get written out after an OS panic?


Riley
 
It is not flushing on your disks that you need to disable, but rather flushing on your acard. ZFS flushes every write to the ZIL, which is intended to force an HDD to move data from its RAM cache to non-volatile storage. Your Acard, being battery protected, has its own internal mechanism for flushing data to non-volatile storage in the event of power failure. If it does flush data to non-volatile storage after every write, it's going to write very slowly.
 
It is not flushing on your disks that you need to disable, but rather flushing on your acard. ZFS flushes every write to the ZIL, which is intended to force an HDD to move data from its RAM cache to non-volatile storage. Your Acard, being battery protected, has its own internal mechanism for flushing data to non-volatile storage in the event of power failure. If it does flush data to non-volatile storage after every write, it's going to write very slowly.

Correct me if I'm wrong, but there is no way to disable flushing for specific drives other than to disable the write cache disk by disk.

I do see poor performance when I disable the write cache on the individual disks. This tells me that they are caching data themselves, and since I know my controller doesn't have a BBU or cache, that data is at risk. Most RAID controllers or HBAs with a battery-backed cache will (or should) automatically disable the write cache on the drives and use their own cache instead, since it's battery backed.

With the data we're talking about, the ZIL is already out of the picture. The ZIL is on the receiving end of the data, not the back-end storage side. ZFS has issued the command to the disks to commit the data and they're only committing it to their own cache, not the actual platters. The problem is that ZFS thinks the data is on stable storage when it's not; it's still in the disk's cache.

As far as I can tell, the only way to resolve this issue is to have a disk controller with its own cache and BBU. That's the only way data will be protected on the physical storage side of things besides disabling the cache on each disk.

So, that being said, what's a good SATA/SAS controller that works in IT mode with cache and BBU?

Riley
 
Well, after even MORE reading, it looks like what I said about cache flushing only being controllable per drive isn't exactly true. Setting nocacheflush=1 disables cache flushing for the whole system.

However, after reading the "ZFS Evil Tuning Guide" (http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_the_Write_Cache) it seems you can tell the Solaris sd driver that a specific drive (by vendor and product ID) has onboard NVRAM protecting its cache, and that flushes should not be sent to that device.

Once I can schedule some downtime on my ZFS box I'm going to set the above on the Acard and re-enable cache flushing. We'll see if that works.
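
The sd.conf entry I'm planning to try looks roughly like this; the vendor/product string below is a guess and has to exactly match what the device reports (iostat -En shows the inquiry strings), and cache-nonvolatile is the property the tuning guide describes:

Code:
# /kernel/drv/sd.conf - tell the sd driver this device's cache is
# non-volatile so cache-flush requests are not sent to it
sd-config-list = "ACARD   ANS-9010", "cache-nonvolatile:true";

# reload sd.conf (or just reboot)
update_drv -f sd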

However, I'm still not certain about the cache on the disk drives themselves. I don't think that data path is entirely protected.

Thanks!!
Riley
 
So, after seeing that setting nocacheflush=1 gets me back to the performance I expect, I went on to find out exactly where the performance bottleneck lies. Disabling the write cache on the Acard made no difference, but disabling the write cache on the individual disk drives did: it took me right back to the performance I had before setting nocacheflush. That told me where the issue was.

Thefreeaccount gave me another idea. Instead of setting nocacheflush, which disables cache flushing globally, I found the information needed to disable cache flushing on a per-device basis. Actually, the setting tells the OS that the device has NVRAM-backed cache so that it doesn't bother telling it to flush.

Starting with the Acard I added the necessary lines to my /kernel/drv/sd.conf and rebooted - no change in performance. Then, I switched around and changed the lines for the disks. I rebooted and performance is back.

So, what does this tell me for sure?

1) The write cache is enabled on the disks.
2) There is something going on with transfers with over a 32kb block size.
3) There is something else going on when you add a separate ZIL besides what is normally known. When I added the separate ZIL device, performance tanked at transfer sizes over 32kb.
4) The Acard isn't the issue. Benchmarking it separately shows great performance.
5) It could be the Hitachi 2TB disks themselves. Unfortunately, I don't have an extra one to test by itself. Maybe someone else does?

So, I'm at a loss as to what's going on here. However, one thing I DO know is that there are a lot of ZFS systems out there at risk due to the individual physical disk's write cache being enabled. Basically, anyone that has the individual disk caches enabled and is using a disk controller without a BBU is at risk. How much of a risk is up to each person, but in my opinion it's actually a much greater risk than disabling the ZIL altogether, because it can cause pool corruption.

It appears that the ONLY safe way to use ZFS on a storage device (disks, arrays, etc.) is to make sure your back-end storage has NVRAM-protected cache OR to disable the write cache on the device. This goes for any filesystem, really, but I think many people (including myself) have been told that "ZFS is safe" because of all the checksumming and other safety features it performs. But if the drives don't write the data to disk properly then that protection is nullified. Though, at least ZFS will tell you the data is corrupt.

Riley
 
However, one thing I DO know is that there are a lot of ZFS systems out there at risk due to the individual physical disk's write cache being enabled.
Not sure what you are trying to say here (write cache is not a risk; we have cache flushing), but have you had a chance to further troubleshoot this issue? How often are the tx groups being committed? Is the zil being used? Do the disks show as busy? Is write latency high? Have you already tried disabling disksort?

I'm curious because I'm in the process of setting up a very similar system myself.
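
For what it's worth, this is the sort of thing I'd be watching while the benchmark runs:

Code:
# per-disk busy (%b) and service times (asvc_t/wsvc_t) at one-second
# intervals; -z hides idle devices
iostat -xnz 1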
 
Not sure what you are trying to say here (write cache is not a risk; we have cache flushing), but have you had a chance to further troubleshoot this issue? How often are the tx groups being committed? Is the zil being used? Do the disks show as busy? Is write latency high? Have you already tried disabling disksort?

I'm curious because I'm in the process of setting up a very similar system myself.

You're right. If cache flushing is enabled then the data is not at risk. However, from what I've seen, a lot of people seem to think turning off cache flushing is a safe thing to do that gives you a free performance boost. Unfortunately, unless your disk system has a protected (NVRAM or battery-backed) cache, your data is at risk.

Turning off cache flushing should only be used in a few select scenarios:

1) Troubleshooting to see if you have ZFS unfriendly storage
2) If ALL the storage on your system has NVRAM-backed cache
3) If you don't care about your data

As for the troubleshooting, I think I've stumbled upon the (or at least the main) issue. Tuning the zfs_write_limit_override value seems to have eliminated the poor performance I was seeing at transfer sizes over 64kb. I saw a user on another forum getting better performance when setting this to 384MB; however, I saw no change with that value. Increasing it to 512MB netted a decent improvement, and setting it to 1GB gave me pretty much the same performance as when I disabled cache flushing. Unfortunately, I haven't found a whole lot of specific information about this setting, other than that it's normally set to 0 (in which case the limit is derived from the amount of system memory) and that it's usually sized so that amount of data takes ~5 seconds to flush to disk.

I also set zfs_vdev_max_pending to 5, down from 10. However, I saw little performance improvement from this.
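
For anyone who wants to try the same tuning, the persistent form goes in /etc/system and takes effect at the next boot (I poked the values into the running kernel with mdb while testing; the numbers below are just the ones that worked for me):

Code:
# /etc/system - cap the dirty data per transaction group (512MB here;
# 1GB behaved about the same as disabling cache flushes for me)
set zfs:zfs_write_limit_override = 536870912

# queue fewer concurrent I/Os per vdev (default is 10)
set zfs:zfs_vdev_max_pending = 5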

Riley
 
Thanks a lot for your information. I don't understand why so many apparently "normal" setups seem to need weird tuning. People say this is for "enterprise" hardware, but the fact is that Oracle is mostly hawking cheap SATA disks just like ours.
 
I got a whole batch of MTRON PRO 7500 16GB SLC Drives.
Here are the CrystalDiskMark results.

The random write results are low, but that seems to be consistent with this generation of disks; they retailed for almost $800, four years ago... Will they be good enough for a ZIL?

Mtron16GBCrystalMark.png


Notice that there is a fairly large variation (6x) in the random write performance based on the size of the test file being used (50MB to 4000MB), with it asymptotically approaching 0.6 MB/s => 600 KB/s => 150 IOPS on the 4K writes. That suggests data locality on the write side of the SSD is somehow becoming a factor.

The random read performance is quite good; in fact, it is better than a Crucial M4 64GB.

M464GBCrystalDisk.jpg


I think the firmware used in these drives was not optimized for write caching. There is some software you can install that can give you results an order of magnitude faster.
=================================
Results from ATTO are below. I ran it under two different settings and with slightly different total file sizes. The results for the small writes are quite sensitive to the total length of the file.

MTronATTO.png
 
Sorry, I have not read this whole thread yet.

I came across an i-RAM drive pretty cheaply (both a PCI and a SATA version).

However, with the 4 gig RAM maximum it's a bit of a turn-off
(I have a fair amount of old 1 gig sticks floating around).

Will the ZIL work within a 4 gig space?

I am planning on using root-on-ZFS with SSD cards.

Is the RAM drive worthwhile? (saving the SSDs from unneeded writes)

I wish I could find a HyperDrive 5 (ACARD ANS-9010) cheaply, but they're all just under $500.


Of course I'm using FreeBSD and hoping for faster compiling...
 
ZFS as a whole is copy on write, but that doesn't mean much to how the intent log functions.

The ZIL lets ZFS provide POSIX compliant synchronous writes, meaning that when an application issues a sync the filesystem needs to actually store the data safely when it tells the application it has processed that IO. Asynchronous writes don't have that guarantee, an application can write data and the filesystem can tell the application it has been written, but that data can then sit in cache somewhere before actually being transacted.

The reason the ZIL exists is because ZFS is transactional and can't really write data to the disk immediately every time an application wants to. So synchronous writes get put into the intent log and then transacted in with the rest of the IO. In order to keep that synchronous guarantee, the ZIL needs to be stable and hold its contents between reboots or crashes or whatever. The data in the ZIL is data that ZFS has said it has stored properly.

Basically, once ZFS has written something to the ZIL the data really needs to stay there until it can write it out to the pool. If power is lost or the system crashes or something the ZIL needs to be consistent from the time the last write was made to it. An SSD that can't do this will defeat the entire purpose of the ZIL and possibly cause corruption.
Yes, I agree on all this. Thanx for the recap.

But back to my question. Or do you claim you answered my question? I don't see the connection between your answer and my question.
 
Sorry, I have not read this whole thread yet.

I came across an i-RAM drive pretty cheaply (both a PCI and a SATA version).

However, with the 4 gig RAM maximum it's a bit of a turn-off
(I have a fair amount of old 1 gig sticks floating around).

Will the ZIL work within a 4 gig space?

I am planning on using root-on-ZFS with SSD cards.

Is the RAM drive worthwhile? (saving the SSDs from unneeded writes)

I wish I could find a HyperDrive 5 (ACARD ANS-9010) cheaply, but they're all just under $500.


Of course I'm using FreeBSD and hoping for faster compiling...

Can't you add some RAM to your system and compile in tmpfs? That would be much faster than one of these RAM drives.

John
 
Sorry, I have not read this whole thread yet.

I came across an i-RAM drive pretty cheaply (both a PCI and a SATA version).

However, with the 4 gig RAM maximum it's a bit of a turn-off
(I have a fair amount of old 1 gig sticks floating around).

Will the ZIL work within a 4 gig space?

I am planning on using root-on-ZFS with SSD cards.

Is the RAM drive worthwhile? (saving the SSDs from unneeded writes)

I wish I could find a HyperDrive 5 (ACARD ANS-9010) cheaply, but they're all just under $500.


Of course I'm using FreeBSD and hoping for faster compiling...


Here is a good guide on ZIL sizing..

#1
The minimum size of a log device is the same as the minimum size of device in pool, which is 64 Mbytes. The amount of in-play data that might be stored on a log device is relatively small. Log blocks are freed when the log transaction (system call) is committed.

#2
The maximum size of a log device should be approximately 1/2 the size of physical memory because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 Gbytes of physical memory, consider a maximum log device size of 8 Gbytes.

#3
For a target throughput of X MB/sec and given that ZFS pushes transaction groups every 5 seconds (and have 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GBytes of log device should be sufficient.
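
So, applying rules #2 and #3 to a 4 gig i-RAM as a rough back-of-the-envelope check:

Code:
# Rule #3: log size ~ max sync write rate * 10 seconds
#   4096 MB / 10 s  =>  covers roughly 400 MB/s of sustained sync writes
# Rule #2: useful log size tops out at RAM/2
#   so 4GB is only "too small" if you have more than 8GB of RAM *and*
#   push more than ~400 MB/s of synchronous traffic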

Riley
 
More RAM would mean 16 gig sticks... I have not seen such a beast without it being ECC :p

I run a lot of virtual PCs; RAM disappears quickly.

That and web browsing chews up heaps of memory. I can easily get 4 gigs' worth of web pages open per browser (Opera / FF / Chrome).

But using RAM instead of NVRAM kind of defeats the whole point of a ZIL
(may as well disable it).
*edit* Actually reading the message: "Can't you add some RAM to your system and compile in tmpfs? That would be much faster than one of these RAM drives."

That's a pretty damn good idea. I think I thought of that a while back, putting a jail in RAM, but lack of free RAM is the problem.
Anyone know if AMD's Trinity APU will have a mainboard out there with 8x RAM slots? That's the downside of leaving an Intel tri-channel mainboard.

Back on topic:

Is the ZIL per pool? Or can multiple pools use the same ZIL device?

I am not too worried about the speed (anything will be better than an HDD).
I can live with 100 meg/sec; I just want faster IO,
as that seems to be what slows down my compiling.

I am planning on using 3x SSD cards in ZFS raidz mode for the root install, plus the 4 gig i-RAM for the ZIL, leaving some SATA ports for burners and eSATA.

Then my HBAs manage the rest of the ZFS storage pools.

Though my old Adaptec RAID system did 600 meg/sec writes and 900 meg/sec reads,
I mostly use it for storage.
 
What about using Intel 311 or 313 drives for the ZIL? I have been thinking about one of those... but haven't bitten on anything yet. Comments?
 