Confused by zfs pool size after creation

Discussion in 'SSDs & Data Storage' started by bleomycin, Oct 16, 2015.

  1. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    Hi All,

    Running ZoL v0.6.5.3 on Debian Jessie. I just created a new pool consisting of 12x 4TB drives in a single raidz2 vdev (everything on this pool is backed up, so maximum reasonably-safe capacity matters more than redundancy). I just nuked my previous pool of two 6-drive raidz2 vdevs for this configuration to gain some additional usable space.

    Anyway, zfs list is reporting 32.1TB usable right out of the gate, which seems very low. df -h reports 33TB, also quite low. I've tried creating the pool with the standard 128K record size as well as 1M record size (all of my files are large) without any difference in reported size. I'm not really sure what's going on here.
     
  2. patrickdk

    patrickdk Gawd

    Messages:
    744
    Joined:
    Jan 3, 2012
    Seems high to me; I would expect zfs list for your config to claim 29.1TB available.

    If you had used 512-byte sector disks instead of 4K, then 36.3TB.
     
  3. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    I guess I just don't understand. If I have 10x 4TB drives left over for storage after 2 are lost to parity, shouldn't I be seeing ~37TB usable, or a bit less? These are 4K-sector disks. I just went from 4 parity drives to 2 parity drives when I switched from two 6-drive raidz2 vdevs to one 12-drive raidz2 and only gained 3.55TB of space; an entire drive seems to have just gone missing.
     
  4. halfelite

    halfelite n00b

    Messages:
    31
    Joined:
    Feb 17, 2009
    First of all, the various ZFS commands give different information. Does "zpool list" give you what you want to see? The long-winded reason is that "zfs list" takes into account internal metadata, private reservations, and things of that nature, while "zpool list" gives you the raw stats.
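
    If it helps, you can see the two views side by side on your pool; zfs list -o space also breaks the USED number down into snapshots, children, refreservation and the dataset itself (I'm going from memory on the option, but it should be there in 0.6.5):

    Code:
    # raw vdev space, before parity and internal reservations
    zpool list tank

    # per-dataset accounting, with the USED breakdown
    zfs list -r -o space tank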
     
    Last edited: Oct 16, 2015
  5. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    Yeah, zpool list gives me the correct total size before parity:

    Code:
    zpool list
    NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
    tank  43.5T  14.2T  29.3T         -    10%    32%  1.00x  ONLINE  -
    Code:
    zfs list
    NAME         USED  AVAIL  REFER  MOUNTPOINT
    tank        10.8T  21.3T   219K  /tank
    tank/Media  10.8T  21.3T  10.8T  /tank/Media
    Code:
    df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1        31G   23G  6.4G  79% /
    udev             10M     0   10M   0% /dev
    tmpfs           5.8G  8.7M  5.8G   1% /run
    tmpfs            15G     0   15G   0% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    tmpfs            15G     0   15G   0% /sys/fs/cgroup
    tank             22T     0   22T   0% /tank
    tank/Media       33T   11T   22T  34% /tank/Media
    Code:
    zfs get all tank
    [URL="https://gist.github.com/anonymous/d6721e4b2c730dde31e7"]https://gist.github.com/anonymous/d6721e4b2c730dde31e7[/URL]
    Code:
    zpool get all tank
    [URL="https://gist.github.com/anonymous/7c1dd64f308fccc81f7a"]https://gist.github.com/anonymous/7c1dd64f308fccc81f7a[/URL]
     
  6. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    I found the explanation here for anyone who may be curious, assuming it's correct: https://bedecarroll.com/2015/01/26/freenas-zfs-performance-testing/

    He says:

    Code:
    4 TB drives = about 3.6 TiB (AKA formatted capacity)
    12 x 3.6 TiB = 43.2 TiB
    12 drives with 2 drives worth of parity = 10 data disks
    43.2 TiB - 7.2 TiB = 36 TiB
    ZFS has a metadata overhead of 1/16th per drive so:
    1/16th of 3.6 TiB = 0.225 TiB
    12 x 0.225 TiB = 2.7 TiB
    36 TiB - 2.7 TiB = 33.3 TiB of free space, roughly
    That seems to make sense...
     
  7. zrav

    zrav Limp Gawd

    Messages:
    163
    Joined:
    Sep 22, 2011
    There's also the default reserved space to consider, which was increased from 1.6% to 3.2% in ZoL 0.6.5.
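
    If you want to see which reservation your install is actually using, the slop space is controlled by the spa_slop_shift module parameter (the pool reserves 1/2^spa_slop_shift of its size, so 5 works out to roughly the 3.2% and 6 to the old ~1.6%). Assuming ZoL exposes it under /sys the usual way:

    Code:
    cat /sys/module/zfs/parameters/spa_slop_shift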
     
  8. HammerSandwich

    HammerSandwich [H]ard|Gawd

    Messages:
    1,117
    Joined:
    Nov 18, 2004
    Check out Figure 4 here. 10 data disks seems to be the worst case!
     
  9. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    Crap, I didn't know about that one. Looks like spa_slop_shift=6 should set it back to how it was; hopefully that's a live option and not one I needed to set at creation. I'll have to look more into that. Thanks for the tip!
     
  10. Aesma

    Aesma [H]ard|Gawd

    Messages:
    1,844
    Joined:
    Mar 24, 2010
    Yes, that's the main reason.

    With 512-byte sectors it's not so big of a problem, but with 4K sectors the problem is compounded.
     
  11. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    Yeah, that definitely seems to be a big factor as well. I wonder how btrfs handles this scenario in comparison?
     
  12. Percy

    Percy Gawd

    Messages:
    750
    Joined:
    Sep 27, 2002
    This is a little off topic, but is ZFS giving you any errors during boot by chance?
     
  13. westrock2000

    westrock2000 [H]ardForum Junkie

    Messages:
    9,152
    Joined:
    Jun 3, 2005
    Just as a friendly reminder, if you didn't know, you do not have to call your pool "tank". That's just the example name that everyone uses.
     
  14. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    Previously you were using 6-disk RAIDZ2 vdevs, which, when using 4K sectors (ashift=12), have no extra overhead beyond the metadata space reserved by ZFS.

    A 12-disk RAIDZ2 vdev on ashift=12 (4K sectors) has a rather large overhead (about 8.9%) with the default 128KiB recordsize.

    When you said you created your pool with a 128KiB vs 1MiB recordsize, well, that's not a thing. You don't create a pool with a record size; a pool doesn't have a record size. Recordsize is a dataset property, and a pool contains many datasets, so a pool can contain many recordsizes.

    Since a pool can contain many recordsizes, ZFS always uses the default (128KiB) recordsize as an assumption when calculating the size and capacity of the zpool. So whether you set a 128KiB or 1MiB recordsize for the datasets in your pool, it won't change the free space capacity that ZFS is showing. (They had to pick something.)

    However, writing your data to datasets that have a 1MiB recordsize will take up less space than writing data to datasets that use a 128KiB recordsize (if your pool has padding overhead with 128KiB records).

    Let me show you some math about it.

    Let's look at your old pool of two 6x4TB RAIDZ2 vdevs. Each disk is 3.638TiB and 2 per vdev are parity, so that's 4 x 3.638TiB = 14.552TiB per vdev, times 2 vdevs = 29.104TiB. Then subtract the 1.6% reserved for metadata and you get 28.64TiB capacity for your old pool. This should be pretty close to what it was.

    If you write a 10GB file to this pool, it will increase the USED property (zfs get used tank) by 10GB. This will be the same whether you use 128KiB or 1MiB recordsize on the dataset you are writing to.

    Now let's look at your new 12x4TB RAIDZ2 pool.

    Since 2 are parity, 3.638TiB x 10 data disks = 36.38TiB; subtract the 1.6% and you have 35.8TiB so far. But this is not the space you get, because there is sector padding overhead on a 12-disk RAIDZ2 when you are using 4K sectors (ashift=12) and a 128KiB recordsize (which ZFS always assumes when calculating free space). The space you actually get is 32.62TiB.

    There is about an 8.9% padding overhead associated with a 12-disk RAIDZ2.
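
    Here's a rough sketch of where that number comes from, based on my understanding of how RAIDZ allocates a block: parity is added per row of data sectors, and the whole allocation is then padded up to a multiple of nparity+1 sectors. The exact percentage depends on rounding details I'm glossing over, so treat it as approximate:

    Code:
    # rough per-record allocation on the new 12-wide RAIDZ2 at ashift=12 (4K sectors)
    for kib in 128 1024; do
        data=$(( kib * 1024 / 4096 ))              # data sectors in one record
        rows=$(( (data + 9) / 10 ))                # rows across the 10 data disks
        total=$(( data + 2 * rows ))               # plus 2 parity sectors per row
        alloc=$(( total + (3 - total % 3) % 3 ))   # pad up to a multiple of nparity+1 = 3
        echo "$kib KiB record -> $alloc sectors on disk (ideal would be $(( data * 12 / 10 )))"
    done
    # prints: 128 KiB -> 42 sectors (38 ideal), 1024 KiB -> 309 sectors (307 ideal)
    So a 128KiB record ends up ~76% space-efficient instead of the nominal 83%, which is in the same ballpark as that 8.9%, and you can see why 1MiB records make most of the padding go away. It's just arithmetic, it doesn't touch the pool.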

    Now, for how this interacts with large_blocks.

    If you create a 128KiB dataset on this pool and write a 10GB file, it will as before, increase the USED property by 10GB.

    But if you create a 1MiB recordsize dataset on the pool and write a 10GB file, it will only increase the USED property by 9.11GB.

    As you can see, writing data to the 1MiB recordsize dataset results in the files appearing to take up 8.9% less space than they really are. (This has nothing to do with compression; it happens with incompressible/already-compressed data too.) So the end result is that the pool will let you write 35.8TiB of data to it (if you use a 1MiB recordsize), and it will show up as only 32.62TiB of data.

    So if you use only 1MiB records on all your datasets, you effectively gain back all the space that would be lost to padding with 128KiB records.

    Here is some info on the padding space of 4K sectors on ashift=12:
    https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html

    You can change the reservation from 3.2% back to 1.6% by setting the zfs module parameter spa_slop_shift back to 6.

    You can do this live by executing: echo 6 > /sys/module/zfs/parameters/spa_slop_shift

    Module parameters set this way don't persist across reboots, so the recommended way to make it stick is to create the file /etc/modprobe.d/zfs.conf with this line in it: options zfs spa_slop_shift=6
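
    So, all together:

    Code:
    # as root; takes effect immediately, but is lost on reboot
    echo 6 > /sys/module/zfs/parameters/spa_slop_shift
    Code:
    # /etc/modprobe.d/zfs.conf (picked up when the zfs module loads)
    options zfs spa_slop_shift=6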

    Hope this helps.
     
    Last edited: Oct 19, 2015
  15. HammerSandwich

    HammerSandwich [H]ard|Gawd

    Messages:
    1,117
    Joined:
    Nov 18, 2004
  16. bleomycin

    bleomycin Limp Gawd

    Messages:
    238
    Joined:
    Aug 14, 2010
    SirMaster,

    Wow! Really excellent reply! It really helps me wrap my head around how this is actually working, which I couldn't manage to do by myself. Thank you.

    Does this mean that with a more efficient drive configuration (one with less padding overhead), there's less free space to be gained from using a 1MiB recordsize?

    Also, what are the potential downsides to decreasing the filesystem reservation back to 1.6%? The changelog on the ZoL site didn't really say why they increased it.
     
  17. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    Yes. I should have mentioned that in the first case, the 6-disk RAIDZ2 vdevs with no padding overhead, writing a 10GB file to a 128KiB dataset vs a 1MiB dataset makes the USED property go up by 10GB either way, so there is no space gained there. You only gain back the space that would otherwise be lost to padding with 128KiB records.

    1MiB records can also give better compression results, but that's a separate thing and works the same on any vdev layout.

    They increased the reserved space to better prevent the problem of a pool filling up and ZFS getting stuck because it has no free space (sometimes it can't even delete). It's safe to change on a pool as large as yours.

    Although honestly, it's probably not the best idea to fill your pool to less than 3.2% free space anyway, so does it really matter whether you lose the last 3.2% or the last 1.6%? Most people recommend adding more space once you're around 80% full. With large blocks you are probably fine up to 90%, or maybe 95% at the absolute max.
     
    Last edited: Oct 19, 2015