Preventing ZFS Rot - Long-term Management Best Practices

packetboy

Somehow I'm experiencing what I'm going to call ZFS Rot...by this I do NOT mean bit rot, but rather a gradual deterioration of read and write performance across a pool that has been around for many years and has lots of data and lots of snapshots.

When I first built this array, I was getting 800MB/s write and nearly 1000MB/s sequential read from it. Now, three years later, I'm lucky if I can get 100MB/s read or write.

System is:

* Supermicro X8DTL
* Norco 4020 enclosure
* 9 x Hitachi 2TB 3Gbps
* HP SAS Expander
* LSI 9200-8e
* OI 151a4

For the longest time I was sure that I was having some kind of hardware problem (CPU, power management, SAS, disks, etc.). However, when I do dd tests directly against /dev/rdsk for each physical disk, I get 130MB/s per drive and 800MB/s when doing multiple dd's in parallel. A rough sketch of those tests is below.
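For reference, the raw-device tests were roughly along these lines (the device names here are placeholders, not the actual ones from this box):

Code:
# single-disk sequential read straight off the raw device
dd if=/dev/rdsk/c7t0d0p0 of=/dev/null bs=1024k count=8192

# rough aggregate test: several dd's in parallel, then wait for all of them
for d in c7t0d0 c7t1d0 c7t2d0 c7t3d0; do
    dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1024k count=8192 &
done
wait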

The real eye opener was when I added 6 Hitachi 4TB drives to the enclosure, connected them to the exact same SAS expander, and created a brand new pool. I ran the same exact Filebench sequential read test (single threaded, even)...now I get 800MB/s while the test file is being created and 900MB/s on the read!

This validates to me that the hardware is fine and my ZFS pool has somehow degraded.

I even did a zfs send of one of the smaller volumes from the existing pool to the new test pool and then ran the filebench test on the transferred volume..it runs perfectly.
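That test transfer was just a snapshot-and-send, something like this (the dataset name is made up; the pool names show up later in the thread):

Code:
# snapshot the source filesystem, then replicate that snapshot to the new pool
zfs snapshot rz2pool/somefs@migtest
zfs send rz2pool/somefs@migtest | zfs receive zulu01/somefs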

Now, the slow pool does have 12TB of 14TB used, and it has 96 ZFS volumes and 705 snapshots.

I need to get this pool back up to a more acceptable level of performance, but at this point I'm not exactly sure what attributes of the pool are causing such an extreme degradation in performance.

For example, I have a 5TB chunk of data I could move to the new pool..should I expect that to magically make the original pool perform fine again? (Somehow I think not.)

It was my understanding that having dozens of volumes within a pool and even thousands of snapshots was not a big deal..am I wrong? Perhaps I need to pare these down?

Or is the pool somehow 'fragmented' and needs to be recreated from scratch?

I've been battling the performance on this server for about 8 months now, and I see now why I was so frustrated..I had been assuming that ZFS could deal with whatever I threw at it 'automagically'...I still think ZFS is fantastic, but I'm pretty sure there are some long-term best practices you need to follow in order to keep your pools 'healthy'...unfortunately the details of what those are seem to be undocumented.

Thoughts?
 
I haven't experienced or noticed decreased performance yet, but I have read concerns about copy-on-write filesystems like ZFS fragmenting over time and performance decreasing because there is no way to defrag. The theory goes that you need to zfs send all your filesystems to a new pool, and then, after everything has been cleaned off the original pool, send them all back. Also, your 2TB free out of 14TB looks low. If and when you move all the ZFS filesystems off the original pool, try to expand the pool with more disks so that you have more free space, which should hold off the regression longer. Let us know what happens. A rough sketch of the send/receive round trip is below.
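Roughly, the evacuate-and-return idea looks like this, assuming a spare pool big enough to hold everything (pool names borrowed from later in the thread; flags worth double-checking against your man pages):

Code:
# recursive snapshot of every dataset in the pool
zfs snapshot -r rz2pool@evac

# replicate the whole tree to the spare pool without mounting it there
zfs send -R rz2pool@evac | zfs receive -u zulu01/rz2pool-copy

# after verifying the copy: destroy and recreate rz2pool, then send it all back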
 
It is very important to keep enough free space for COW. I don't know the magic number on ZFS, but on NetApp, when you hit 85% used in aggregate, performance degrades dramatically.

You can add more vdevs to expand capacity; however, how will the IO be distributed among the old vdevs and the new vdevs? Reads and writes will be uneven, which creates performance issues.
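Watching per-vdev activity while the pool is under load shows whether the IO really is lopsided; something like this is probably the easiest check (substitute your pool name):

Code:
# per-vdev read/write ops and bandwidth, sampled every 10 seconds
zpool iostat -v rz2pool 10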

I said it before and many disagreed with me, but I'll say it again: we need reallocate/bp_rewrite if we want to maintain performance. For home use it's not a big deal, but in production not everyone has the luxury of another zpool to move data around to, and not everyone is allowed to do it, and why should I have to?

ZFS has many great features that even tier 0 storage vendors don't have, but there are some basic features ZFS needs to implement.
 
This is caused by COW. The raw speed you get when the pool is empty is because everything is written, and then read, sequentially from the drives.

Over normal usage, you write to the whole drive many times and delete stuff, and you end up creating random free spots of variable size.

New writes go into those free spots sequentially, so writes stay pretty consistently good, but now that large movie you just stored is written all over the place in the available free spots, and reading it becomes random.

This gets worse and worse the fuller your drive is. It happens on ext2/3/4 too, but those need to be much fuller before you notice the effect.

My performance systems at work I keep under 50% usage. Backup and large-file storage I'll fill up, as that won't fragment.
 
This gets worse and worse the fuller your drive is. It happens on ext2/3/4 too, but those need to be much fuller before you notice the effect.

My performance systems at work I keep under 50% usage. Backup and large-file storage I'll fill up, as that won't fragment.

As far as I know,
ext4 uses extents instead of the block mapping of ext2/3 to reduce fragmentation.
Even at 50% usage, with a lot of fragmentation you would still see a difference on non-SSD storage.
And even if you only store large backup files, as long as files are being deleted and created, fragmentation can happen.

Honestly, we cannot avoid fragmentation in current filesystems; we can only minimize it.
 
Seems like we could periodically reformat/resilver slices and drives too.
1. Introduce a hot spare,
2. Pull a working member and reformat
3. Cycle the extra drives first-in/last-out through the pool. Repeat...
When a drive gets wiped and then reintroduced/resilvered into the pool, it should get written sequentially and be essentially defragged, no?
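A rough sketch of that rotation, with made-up disk and pool names, and no claim about how much it actually helps:

Code:
# swap a blank disk in for one member and let ZFS resilver onto it
zpool replace tank c7t3d0 c7t9d0

# watch the resilver; once it finishes, wipe the pulled c7t3d0 and use it
# as the replacement for the next member, working around the pool
zpool status tank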
 
Are you scrubbing? I would scrub once a week. Do it while you sleep. If you do not scrub, you might get all sorts of dirt, so to speak, that can slow the pool down.
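A weekly scrub is easy to automate from root's crontab; a minimal sketch, assuming the OP's pool name:

Code:
# one-off scrub
zpool scrub rz2pool

# crontab entry: every Sunday at 03:00
0 3 * * 0 /usr/sbin/zpool scrub rz2pool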
 
Scrubbing won't do anything; it just reads data.. well, and it will rewrite anything it finds corrupted. If there were so many corrupted blocks that it affected performance, you'd have a bigger problem.

Seems like we could periodically reformat/resilver slices and drives too.
1. Introduce a hot spare,
2. Pull a working member and reformat
3. Cycle the extra drives first-in/last-out through the pool. Repeat...
When a drive gets wiped and then reintroduced/resilvered into the pool, it should get written sequentially and be essentially defragged, no?

Apparently this does do something, but how effective it is isn't really known. You can try searching the mailing lists, but the threads turn into nitpicking about what fragmentation is, and no one ever really wants to answer the question. Part of the problem is that there isn't any way to measure fragmentation on ZFS, so it's hard to say how well anything works. Likely, if you fill an array up >90%, know it's heavily fragmented, and then delete enough data to have a decent amount of free space again... the resilver trick should decrease fragmentation. But for most arrays you'll likely end up with similar amounts.
 
Oh, and I think 80% full is when ZFS switches from 'first fit' to 'best fit'... you can change when this happens. As soon as it switches to 'best fit', I would think new data starts getting much more fragmented.

Anyone who thinks their ZFS filesystem is slow due to fragmentation: are you past 80%?

FWIW you can mess with this to have it keep using 'first fit':

Code:
echo "metaslab_df_free_pct/W 4" | mdb -kw
http://blogs.everycity.co.uk/alasda...ly-slowly-when-free-disk-usage-goes-above-80/
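Note the mdb write only lasts until the next reboot. If I remember right, the same tunable can be made persistent from /etc/system, something like the following (worth verifying against that blog post before relying on it):

Code:
* /etc/system: stay with first-fit allocation until a metaslab's free space
* drops below 4%
set zfs:metaslab_df_free_pct = 4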

Also, here is some sorta sad information; I always just expected that ZFS would eventually get defrag, but:
http://www.mail-archive.com/[email protected]/msg00002.html

Looks kinda bleak. BTW, Block Pointer Rewrite is what they refer to as the feature needed to implement defrag, in addition to being able to add a disk to an existing raidz{,2,3} vdev. Basically, the only two real downsides of ZFS both need it.
 
Oh, and I think 80% full is when ZFS switches from 'first fit' to 'best fit'.

Agreed...this is absolutely critical. I already changed the server from the default value of 30 to 4 as discussed in those forum posts...otherwise, instead of the server just being dog slow, it's completely unstable.

The problem is I was probably running this server for many months with the default metaslab setting while free space was well below 20%...thus ZFS spent months writing to the array while in 'best fit' mode. One working theory is that writing to the array in 'best fit' mode for a long time creates an overall degradation problem. I don't completely buy that, as I get slow read rates even on large ISO files that were written to the array years ago.

BTW, another major symptom when the array gets into this mode is that resilver times are absolutely ridiculous. On Sunday, while I was doing Filebench testing, the LSI controller somehow freaked out and made ZFS think one of the drives in the array was bad...hence it started resilvering to one of the 2TB hot spares. Three DAYS later it was still only about two-thirds done...then we took a power hit and, due to a UPS problem, the server rebooted before the resilver completed, forcing it to start all over again. My intention was to temporarily stop the resilver by forcing a detach of the hot spare...interestingly, as there were no additional spares, ZFS decided that the original drive was in fact NOT failed and started resilvering back to it. Since that drive was very close to being in sync, that process completed in about 2-3 hours. That last part isn't really tied to my symptoms, but I thought I'd mention the detach trick for when you strongly suspect ZFS has improperly called 'fail' on a drive that is likely perfectly fine.
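For the record, the detach trick was basically just this (the device name is a placeholder):

Code:
# drop the partially-resilvered hot spare; the pool falls back to the original
# disk, which then only has to resilver the small amount it is behind
zpool detach rz2pool c8t5d0
zpool status rz2pool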



Regardless, I'm currently zfs sending 5TB of data from the old pool to the new pool as we speak. As usual, it's crawling along at about 100MB/s, as the whole process is bottlenecked by whatever is going on with the original pool.

I'll delete the source volume for this and free up 5TB in the original pool, but I'm going to guess that's NOT going to fix the performance problem...we'll see.

I'm guessing that I'll ultimately have to copy everything from the old pool to the new one and recreate the original pool from scratch to 'fix' it....you'd think there would be some other way, but I'm at a loss.
 
Looks kinda bleak. BTW, Block Pointer Rewrite is what they refer to as the feature needed to implement defrag, in addition to being able to add a disk to an existing raidz{,2,3} vdev. Basically, the only two real downsides of ZFS both need it.

I've been told the downsides of BP rewrite are such that implementing it really gains you nothing.

There is some work being done right now on different features that may end up also allowing for a bp_rewrite-like capability.
 
"i've been told the down sides of BP rewrite are such that implementing it really gains you nothing"

any details?
 
One of the above links touches on it.

I implemented most of BP rewrite several years back, at Sun/Oracle. I don't know what plans Oracle has for this work, but given its absence in S11, I wouldn't bank on it being released. There are several obstacles that they would have to overcome. Performance was a big problem -- like with dedup, we must store a giant table of translations. Also, the code didn't layer well; many other features needed to "know about" bp rewrite. Maintaining it would add significant cost to future projects.

The need is there and there are benefits; however, the way it was originally envisioned has serious downsides.

There is work being done for 4.x that may dovetail with the need for bp rewrite. I can't really go into any detail, NDAs and all. bp rewrite is not a high priority though.
 
BTW, this is a bit of a stretch... but if you run 'iostat -x 10' on Linux or 'iostat -x -n 10' on Solaris, and then put some load on the array... are the average wait times similar for all the drives? Specifically the average write service time?
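For anyone following along, these are the invocations I mean, plus the columns I'd compare across drives (column names from memory, so double-check on your platform):

Code:
# Linux: extended device stats every 10 seconds; compare await per drive
iostat -x 10

# Solaris/OpenIndiana: compare asvc_t (average service time) per drive
iostat -x -n 10

# one drive with a much higher service time than its siblings points at
# hardware (disk/cable/expander) rather than the pool layout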
 
OK..I moved 5.5TB of data from the old pool ('rz2pool') to the new pool ('zulu01'); now each pool is less than 50% full:

Code:
root@zulu01:~# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
rz2pool   16.2T  7.15T  9.10T         -    44%  1.00x  ONLINE  -
zulu01    18.1T  7.09T  11.0T         -    39%  1.00x  ONLINE  -

Here are the interesting results:

1) Filebench to the new pool seems to have slowed down by almost 50%...from 1000MB/s to about 500MB/s

2) Filebench to the 40GB test file that was created while the pool was full is exactly as it was before (i.e. 50-100MB/s)

3) Filebench to a 40GB test file that was created AFTER the move is greatly improved...up to about 500MB/s

I'm most concerned about "1)" ... a 50% degradation with the pool only 44% full.
Is this what one should expect with ZFS?
 
No, you should not expect it.

My guess is that the filesystem was fuller than 80%, and that slows everything down. You have now moved 5.5TB to another pool to free up space, which is good. Now you should reach your old speed again. Check that you don't have any snapshots of the deleted data, though.

If you have data and snapshot it, and then you delete the data, all the data is still there. You need to delete the snapshot too; otherwise no space will be released. Delete the data AND the snapshot to free up space.
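A quick way to see which snapshots are still pinning space, and to drop one, looks roughly like this (the snapshot name is just an example):

Code:
# list snapshots sorted by how much space each one holds exclusively
zfs list -t snapshot -o name,used -s used

# destroy a snapshot that only references deleted data
zfs destroy rz2pool/somefs@2011-weekly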

I have a single 1.5TB disk, and when only ~80GB is free, everything slows to a crawl. When I free up space, the disk becomes fast again. For raid, they say 85% full is the max limit. Try to keep under 85% full. If you exceed the limit, everything will be slow, so you will know when you have filled up your raid too much. As you have now.

ZFS does not rot with time. ALL filesystems become slow at around 90% full. Just free some space.
 
When you run 'zpool list' the amount allocated/free includes snapshots and everything, so that doesn't explain it.

This also does not fully explain your results, but I expect it explains some of it: the outside of a disk is about twice as fast as the inside:
http://www.storagereview.com/seagate_barracuda_3tb_review_1tb_platters_st3000dm001
190MB/sec on the outside, 95MB/sec on the inside.

Here is an explanation of how ZFS decides where to write blocks:
https://blogs.oracle.com/bonwick/en_US/entry/zfs_block_allocation

With an empty pool you'll get pretty ideal speeds, as ZFS will put everything on the outside edge of the disks. With a pool that is only half full, writes shouldn't yet be going to the inner metaslabs, but the outermost ones will probably have been filled.

A 50% drop still seems like a lot, though.
 
No, you should not expect it.

My guess is that the filesystem was fuller than 80%, and that slows everything down. You have now moved 5.5TB to another pool to free up space, which is good. Now you should reach your old speed again. Check that you don't have any snapshots of the deleted data, though.

I did a 'zfs destroy' on the volume...it's totally gone...I deleted all the snapshots on that zvol first, since you can't do a destroy without deleting the snapshots first.
 
I did a 'zfs destroy' on the volume...it's totally gone...I deleted all the snapshots on that zvol first, since you can't do a destroy without deleting the snapshots first.
OK. So now you have a much emptier zpool. Is everything fast again? Was it the free-space issue of an 85%-full zpool? Or is everything still slow, even with the emptier pool?
 
Necro'ing this 3-year-old thread :) I have been using ZoL on and off. I have noticed that even at well under 50% pool utilization, after a few months of steady write traffic (this serves as a datastore for ESXi, so a lot of the writes are random writes into VMs' virtual disks), write (and read) performance deteriorates by about 33%. Copying everything off, blowing away/recreating the pool, and copying everything back 'fixes' this. I've read the comments about all filesystems being subject to fragmentation, but ZFS makes it worse by causing fragmentation even when rewriting data in place (in place from the application's point of view - I do understand that ZFS writes the new data elsewhere...)
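One small improvement since this thread started: newer OpenZFS/ZoL pools (with the spacemap_histogram feature enabled) report an estimated free-space fragmentation figure, which at least gives you a number to watch. If I have the property name right, it shows up like this:

Code:
# FRAGMENTATION estimates free-space fragmentation, not file fragmentation
zpool list -o name,size,allocated,free,fragmentation,capacity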
 