Confused by zfs pool size after creation

bleomycin

Limp Gawd
Joined
Aug 14, 2010
Messages
242
Hi All,

Running ZoL v0.6.5.3 on debian jessie. I just created a new pool consisting of 12x 4TB drives in raidz2 (everything on this pool is backed up, and maximum reasonably safe capacity is more important). I just nuked my previous pool of 2x6 drive raidz2's for this configuration to gain some additional usable space.

Anyways, zfs list is reporting 32.1TB usable right out of the gate, which seems very low. df -h reports 33TB, also quite low. I've tried creating the pool with the standard 128k record size as well as 1M record size (all of my files are large) without any difference in reported size. I'm not really sure what's going on here?
 
Seems high to me; I would expect zfs list for your config to show 29.1TB available.

If you used 512-byte-sector disks, and not 4K, then 36.3TB.
 
I guess I just don't understand. If I have 10x 4TB drives left over for storage after 2 are lost to parity, shouldn't I be seeing ~37TB usable, or a bit less? These are 4K-sector disks. I just went from 4 parity drives to 2 parity drives when I switched from 2x 6-drive raidz2 to 1x 12-drive raidz2 and only gained 3.55TB of space; 1 entire drive seems to have just gone missing?
 
First, all the zfs commands give different information. Does "zpool list" give you what you want to see? There's a long-winded reason for it: "zfs list" takes into account all the internal metadata, private reservations, and things of that nature, while "zpool list" gives you raw stats.
 
Yeah, zpool list gives me the correct total size before parity:

Code:
zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  43.5T  14.2T  29.3T         -    10%    32%  1.00x  ONLINE  -

Code:
zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank        10.8T  21.3T   219K  /tank
tank/Media  10.8T  21.3T  10.8T  /tank/Media

Code:
df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        31G   23G  6.4G  79% /
udev             10M     0   10M   0% /dev
tmpfs           5.8G  8.7M  5.8G   1% /run
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
tank             22T     0   22T   0% /tank
tank/Media       33T   11T   22T  34% /tank/Media

Code:
zfs get all tank
https://gist.github.com/anonymous/d6721e4b2c730dde31e7

Code:
zpool get all tank
https://gist.github.com/anonymous/7c1dd64f308fccc81f7a
 
I found the explanation here for anyone who may have been curious, assuming it is correct: https://bedecarroll.com/2015/01/26/freenas-zfs-performance-testing/

He says:

Code:
4 TB drives = about 3.6 TiB (AKA formatted capacity)
12 x 3.6 TiB = 43.2 TiB
12 drives with 2 drives worth of parity = 10 data disks
43.2 TiB - 7.2 TiB = 36 TiB
ZFS has a metadata overhead of 1/16th per drive so:
1/16th of 3.6 TiB = 0.225 TiB
12 x 0.225 TiB = 2.7 TiB
36 TiB - 2.7 TiB = 33.3 TiB of free space, roughly

That seems to make sense...
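For anyone who wants to check that arithmetic, here's a quick sketch of the same estimate in Python, using the exact TiB size of a "4 TB" (4e12-byte) drive instead of the blog's rounded 3.6 TiB, so the result lands slightly above his 33.3 TiB:

```python
# Rough version of the linked post's estimate, with exact unit conversion.
tib_per_drive = 4e12 / 2**40      # ~3.64 TiB formatted capacity per "4 TB" drive
raw = 12 * tib_per_drive          # ~43.7 TiB raw
data = (12 - 2) * tib_per_drive   # ~36.4 TiB after 2 parity drives
metadata = raw / 16               # the post's 1/16th-per-drive rule of thumb
usable = data - metadata          # ~33.7 TiB, close to the post's rough 33.3
print(round(raw, 1), round(data, 1), round(usable, 1))
```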
 
There's also the default reserved space to consider, which was increased from 1.6% to 3.2% in ZoL 0.6.5.
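For reference, that reserved "slop" space works out to pool size divided by 2^spa_slop_shift, so a shift of 6 corresponds to the old ~1.6% and a shift of 5 to the new ~3.2% (exactly 3.125%). A quick sketch:

```python
# Reserved slop space is roughly pool_size / 2**spa_slop_shift.
def slop_fraction(spa_slop_shift):
    return 1 / 2 ** spa_slop_shift

print(f"shift 6: {slop_fraction(6):.2%}")  # 1.56%
print(f"shift 5: {slop_fraction(5):.2%}")  # 3.12%
```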
 
There's also the default reserved space to consider, which was increased from 1.6% to 3.2% in ZoL 0.6.5.

Crap, I didn't know about that one. Looks like spa_slop_shift=6 should set it back to how it was; hopefully that's a live option and not one I needed to set at creation. I'll have to look more into that, thanks for the tip!
 
Yes, that's the main reason.

With 512b sectors it's not so big of a problem but with 4K, the problem is compounded.

Yeah, that definitely seems to be a big factor as well. I wonder how btrfs handles this scenario in comparison?
 
This is a little off topic, but is ZFS giving you any errors during boot by chance?
 
Just as a friendly reminder, if you didn't know, you do not have to call your pool "tank". That's just the example name that everyone uses.
 
Previously you were using 6-disk RAIDZ2 vdevs which when using 4K sectors (ashift=12) have no extra overhead beyond the metadata space reserved by ZFS.

A 12-disk RAIDZ2 vdev on ashift=12 (4K sectors) has a rather large overhead (8.9%) with the default 128KiB recordsize.

When you said you created your pool with a 128KiB vs. 1MiB recordsize, well, that's not a thing. You don't create a pool with a record size; a pool doesn't have one. Recordsize is a dataset property, and a pool contains many datasets, so a pool can contain many recordsizes.

Since a pool can contain many recordsizes, ZFS always uses the default (128KiB) recordsize as an assumption when calculating the size and capacity of the zpool, so whether you set a 128KiB or 1MiB recordsize for the datasets in your pool, it won't change the free space capacity that ZFS is showing. (They had to pick something.)

However, writing your data to datasets that have 1MiB recordsize will take up less space than writing data to datasets that are using 128KiB recordsize (if your pool has padding overhead with 128KiB recordsize).

Let me show you some math about it.

Let's look at the 6x4TB vdevs. Each disk is 3.638TiB and 2 are parity, so that's 14.552TiB, times 2 vdevs = 29.104TiB. But then subtract the 1.6% reserved for metadata space and you get 28.64TiB capacity for your old pool. This should be pretty close to what it was.
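That old-pool arithmetic, sketched out (treating the ~1.6% reservation as exactly 1/64 taken off the top, which is a simplification):

```python
# Old pool: 3.638 TiB per drive, 4 data disks per 6-wide RAIDZ2, two vdevs.
tib_per_drive = 3.638
per_vdev = (6 - 2) * tib_per_drive   # 14.552 TiB of data space per vdev
pool = 2 * per_vdev                  # 29.104 TiB across both vdevs
after_reserve = pool * (1 - 1 / 64)  # ~28.65 TiB left after the reservation
print(round(pool, 3), round(after_reserve, 2))
```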

If you write a 10GB file to this pool, it will increase the USED property (zfs get used tank) by 10GB. This will be the same whether you use 128KiB or 1MiB recordsize on the dataset you are writing to.

Now let's look at your new 12x4TB RAIDZ2 vdev pool.

Since 2 are parity, 3.638 * 10 = 36.38TiB, then subtract the 1.6% and you have so far 35.8TiB. But this is not the space you get, because there is sector padding overhead on a 12-disk RAIDZ2 when you are using 4K sectors (ashift=12) and using 128KiB recordsize (which ZFS always does when calculating free space). The space you actually get is 32.62TiB.

There is about an 8.9% padding overhead associated with a 12-disk RAIDZ2.
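Putting the new pool's numbers together (the 8.9% padding figure is taken as given from above, not derived here):

```python
# New 12-wide RAIDZ2: 10 data disks' worth of space, minus the ~1.6%
# reservation, minus the ~8.9% padding overhead for 4K sectors at 128KiB.
tib_per_drive = 3.638
data = (12 - 2) * tib_per_drive              # 36.38 TiB after parity
after_reserve = data * (1 - 1 / 64)          # ~35.8 TiB after the reservation
after_padding = after_reserve * (1 - 0.089)  # ~32.6 TiB actually shown as usable
print(round(after_reserve, 1), round(after_padding, 1))
```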

Now, for how this interacts with large_blocks.

If you create a 128KiB dataset on this pool and write a 10GB file, it will as before, increase the USED property by 10GB.

But if you create a 1MiB recordsize dataset on the pool and write a 10GB file, it will only increase the USED property by 9.11GB.

As you can see, writing data to the 1MiB recordsize dataset results in the files appearing to take up 8.9% less space than they really are. (This has nothing to do with compression; it happens on incompressible/already-compressed data too.) So the end result is that the pool will let you write 35.8TiB of data to it (if you use 1MiB recordsize), and it will show up as only 32.62TiB of data.

So you are effectively gaining back all the space that would be lost to padding with 128KiB records if you only use 1MiB records on all your datasets.

Here is some info on the padding space of 4K sectors on ashift=12:
https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html

You can change the reservation from 3.2% back to 1.6% by setting the kernel parameter spa_slop_shift back to 6.

You can do this live by executing: echo 6 > /sys/module/zfs/parameters/spa_slop_shift.

Kernel parameters don't persist between boots, so the recommended way to set this at bootup is to create the file /etc/modprobe.d/zfs.conf containing the line: options zfs spa_slop_shift=6

Hope this helps.
 
SirMaster,

Wow! Really excellent reply! It really helps me wrap my head around how this is actually working, which I couldn't manage to do by myself. Thank you.

However, writing your data to datasets that have 1MiB recordsize will take up less space than writing data to datasets that are using 128KiB recordsize (if your pool has padding overhead with 128KiB recordsize).

Does this mean switching to a more efficient drive configuration results in less free space gained when using 1MB recordsize?


Also, what are the potential downsides to decreasing the filesystem reservation back to 1.6%? The changelog on the ZoL site didn't really mention why they increased the reservation.
 
Yes, I should have mentioned that in the first case, the 6x4TB RAIDZ2 has no padding overhead, so writing a 10GB file to a 128KiB dataset vs. a 1MiB dataset will make the USED property go up by 10GB either way, and there is no space gained in that case. You only gain back the space that would be lost to padding at 128KiB.
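If anyone's curious where "no padding on the 6-wide" comes from, here's a simplified model of the allocation rule from the 4K-overhead article linked earlier (alloc_sectors is a made-up name, and real ZFS allocation has more cases than this): a record is cut into 4K sectors, every stripe row of data sectors gets 2 parity sectors, and the total allocation is rounded up to a multiple of parity + 1.

```python
import math

# Simplified RAIDZ allocation model: data sectors + per-row parity,
# rounded up to a multiple of (parity + 1) allocation units.
def alloc_sectors(record_kib, disks, parity, sector_kib=4):
    data = record_kib // sector_kib            # 128KiB -> 32 sectors at 4K
    rows = math.ceil(data / (disks - parity))  # stripe rows needed
    total = data + rows * parity               # data plus per-row parity
    pad = parity + 1
    return math.ceil(total / pad) * pad

print(alloc_sectors(128, 6, 2))   # 48: 8 full rows, nothing lost to padding
print(alloc_sectors(128, 12, 2))  # 42: 4 rows, 40 sectors padded up to 42
```

In the 12-wide case, 42 sectors get allocated where the pure 10-of-12 parity cost would be 32 x 12/10 = 38.4, i.e. roughly 9% extra, which is in the same ballpark as the 8.9% figure above.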

1MiB records could have better compression results, but that's a separate thing and works the same on any vdev layout.

They increased the reserved space to better prevent the problems of running into a full pool, where ZFS gets stuck because it has no free space (sometimes it can't even delete). It's safe to change on a pool as large as yours.

Although honestly it's probably not the best idea to fill your pool to less than 3.2% free space anyway, so does it really matter whether you can't use the last 3.2% or the last 1.6%? Most people recommend adding more space at around 80% full. With large blocks you are probably fine up to 90%, or maybe 95% at the absolute max.
 