ZFS autoexpand problem

DataMine

Hello, I have a 16-disk RAIDZ3 array that had a mix of 3TB, 2TB, and 1.5TB disks in it, and I was getting 15.5TB of usable space. Then I replaced the last 1.5TB disk with a 3TB disk and set the autoexpand flag to on; after the resilver I shut the system down to swap the disks and rebooted, but I am only getting 20.7TB of space. My other array, a 15x2TB RAIDZ2, gets 22TB of usable space, so a 16x2TB RAIDZ3 should be getting about the same, right? Am I missing something?

The idea was to get more space now; as I find deals on new 3TB disks I would replace the smaller disks until all the disks in the array were 3TB. Afterwards I would make a new array with the leftover 2TB disks and use the 1.5TB disks for offsite backup of the important data.
Code:
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
media01  27.2T  20.0T  7.21T    73%  1.00x  ONLINE  -
media03    29T  10.9T  18.1T    37%  1.00x  ONLINE  -
Code:
 pool: media03
 state: ONLINE
  scan: resilvered 56K in 0h0m with 0 errors on Thu Oct  3 15:34:56 2013
config:

	NAME                                            STATE     READ WRITE CKSUM
	media03                                         ONLINE       0     0     0
	  raidz3-0                                      ONLINE       0     0     0
	    ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2185388    ONLINE       0     0     0
	    ata-Hitachi_HDS722020ALA330_JK1174YAJ88NWW  ONLINE       0     0     0
	    ata-ST3000DM001-9YN166_W1F11XWL             ONLINE       0     0     0
	    ata-TOSHIBA_DT01ABA300_83C121NKS            ONLINE       0     0     0
	    ata-WDC_WD20EARX-00PASB0_WD-WCAZAA274537    ONLINE       0     0     0
	    ata-Hitachi_HDS723020BLA642_MN1220F30KTX0D  ONLINE       0     0     0
	    ata-WDC_WD20EURS-63S48Y0_WD-WMAZA7268996    ONLINE       0     0     0
	    ata-Hitachi_HDS722020ALA330_JK11A8BFGATBMF  ONLINE       0     0     0
	    ata-TOSHIBA_DT01ABA300_23KDLMPGS            ONLINE       0     0     0
	    ata-TOSHIBA_DT01ABA300_536RWX8AS            ONLINE       0     0     0
	    ata-TOSHIBA_DT01ABA300_238DEA2GS            ONLINE       0     0     0
	    ata-Hitachi_HDS723020BLA642_MN1220F31XRE6D  ONLINE       0     0     0
	    ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5246623    ONLINE       0     0     0
	    ata-ST2000DM001-9YN164_Z1F02A3G             ONLINE       0     0     0
	    ata-ST33000651AS_9XK0FBDG                   ONLINE       0     0     0
	    ata-WDC_WD30EZRX-00DC0B0_WD-WMC1T1704191    ONLINE       0     0     0
 
Edit: Misread problem, sorry. That probably won't help!

Are you mixing 512 and 4k disks? What's the pool ashift?
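You can check what the drives report with something like this quick sketch (just an illustration, assuming a Linux box where sysfs exposes the usual queue attributes; 512/512 is a native 512-byte drive, 512/4096 is a 4k drive with 512-byte emulation):

Code:
#!/usr/bin/env python3
# Rough sketch: print logical vs. physical sector size for each sd* drive,
# read from sysfs (Linux only). 512/4096 indicates a 4k-sector (512e) drive.
from pathlib import Path

for dev in sorted(Path('/sys/block').glob('sd*')):
    try:
        logical = (dev / 'queue/logical_block_size').read_text().strip()
        physical = (dev / 'queue/physical_block_size').read_text().strip()
    except OSError:
        continue
    print(f'{dev.name}: logical={logical} physical={physical}')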
 
Not really sure how to check 512 vs 4k (I thought that mixing them only resulted in lower read/write performance and should not affect capacity at all), but here are all the properties set on the pool:

Code:
NAME     PROPERTY               VALUE                  SOURCE
media03  size                   29T                    -
media03  capacity               38%                    -
media03  altroot                -                      default
media03  health                 ONLINE                 -
media03  guid                   15985232323593502269   default
media03  version                -                      default
media03  bootfs                 -                      default
media03  delegation             on                     default
media03  autoreplace            off                    default
media03  cachefile              -                      default
media03  failmode               wait                   default
media03  listsnapshots          off                    default
media03  autoexpand             on                     local
media03  dedupditto             0                      default
media03  dedupratio             1.00x                  -
media03  free                   17.8T                  -
media03  allocated              11.2T                  -
media03  readonly               off                    -
media03  ashift                 0                      default
media03  comment                -                      default
media03  expandsize             0                      -
media03  freeing                0                      default
media03  feature@async_destroy  enabled                local
media03  feature@empty_bpobj    enabled                local
media03  feature@lz4_compress   enabled                local
 
Please execute 'zdb -C media03' to get your ashift. From what I have seen here I suppose you have an ashift of 9. With 13 data disks and an ashift of 12 you would waste a huge amount of space (roughly 18%).

You cannot just add AVAIL and USED to get the total amount of space, especially when the pools carry different data and have unequal fill levels. To calculate the available space ZFS estimates how much metadata it will require, but this may or may not match the space the metadata actually uses as the pool fills up.
 
Yes, it says ashift: 12. Should I change this, and how would I?

Code:
MOS Configuration:
        version: 5000
        name: 'media03'
        state: 0
        txg: 76309
        pool_guid: 15985232323593502269
        hostname: 'nas1'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 15985232323593502269
            children[0]:
                type: 'raidz'
                id: 0
                guid: 10897904191755033925
                nparity: 3
                metaslab_array: 33
                metaslab_shift: 37
                ashift: 12
                asize: 32006155010048
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 10086677702268349541
                    path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2185388-part1'
                    whole_disk: 1
                    DTL: 120
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 10746556501007002594
                    path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK1174YAJ88NWW-part1'
                    whole_disk: 1
                    DTL: 52
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 13587071107586707009
                    path: '/dev/disk/by-id/ata-ST3000DM001-9YN166_W1F11XWL-part1'
                    whole_disk: 1
                    DTL: 51
                    create_txg: 4
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 16818880299089374457
                    path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_83C121NKS-part1'
                    whole_disk: 1
                    DTL: 50
                    create_txg: 4
                children[4]:
                    type: 'disk'
                    id: 4
                    guid: 10885010448189793385
                    path: '/dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZAA274537-part1'
                    whole_disk: 1
                    DTL: 49
                    create_txg: 4
                children[5]:
                    type: 'disk'
                    id: 5
                    guid: 12654990196337501768
                    path: '/dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1220F30KTX0D-part1'
                    whole_disk: 1
                    DTL: 48
                    create_txg: 4
                children[6]:
                    type: 'disk'
                    id: 6
                    guid: 16265472658171074893
                    path: '/dev/disk/by-id/ata-WDC_WD20EURS-63S48Y0_WD-WMAZA7268996-part1'
                    whole_disk: 1
                    DTL: 47
                    create_txg: 4
                children[7]:
                    type: 'disk'
                    id: 7
                    guid: 8542810596399517192
                    path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK11A8BFGATBMF-part1'
                    whole_disk: 1
                    DTL: 46
                    create_txg: 4
                children[8]:
                    type: 'disk'
                    id: 8
                    guid: 2094143228829878287
                    path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_23KDLMPGS-part1'
                    whole_disk: 1
                    DTL: 45
                    create_txg: 4
                children[9]:
                    type: 'disk'
                    id: 9
                    guid: 4538046503230694112
                    path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_536RWX8AS-part1'
                    whole_disk: 1
                    DTL: 44
                    create_txg: 4
                children[10]:
                    type: 'disk'
                    id: 10
                    guid: 16405706797812393936
                    path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_238DEA2GS-part1'
                    whole_disk: 1
                    DTL: 43
                    create_txg: 4
                children[11]:
                    type: 'disk'
                    id: 11
                    guid: 8930509428790570889
                    path: '/dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1220F31XRE6D-part1'
                    whole_disk: 1
                    DTL: 42
                    create_txg: 4
                children[12]:
                    type: 'disk'
                    id: 12
                    guid: 6927026716990328985
                    path: '/dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5246623-part1'
                    whole_disk: 1
                    DTL: 41
                    create_txg: 4
                children[13]:
                    type: 'disk'
                    id: 13
                    guid: 4250982475318050661
                    path: '/dev/disk/by-id/ata-ST2000DM001-9YN164_Z1F02A3G-part1'
                    whole_disk: 1
                    DTL: 40
                    create_txg: 4
                children[14]:
                    type: 'disk'
                    id: 14
                    guid: 9830375710225770979
                    path: '/dev/disk/by-id/ata-ST33000651AS_9XK0FBDG-part1'
                    whole_disk: 1
                    DTL: 39
                    create_txg: 4
                children[15]:
                    type: 'disk'
                    id: 15
                    guid: 15516814786807439750
                    path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00DC0B0_WD-WMC1T1704191-part1'
                    whole_disk: 1
                    DTL: 38
                    create_txg: 4
        features_for_read:
 
Well, ashift cannot be changed; it can only be set during pool creation. With 13 data drives (plus 3 parity) you will waste around 18% of the total space due to padding. However, ashift=12 is the best setting for modern drives, which have 4 KiB sectors; with ashift=9 you would lose a lot of performance (basically half the IOPS for writes) and create a small integrity gap. Only some enterprise-class drives still use 512-byte sectors.

Another solution to this problem is to wait until the open source versions of ZFS get support for 1 MiB records.
 
Would I be better off backing my data up, destroying the pool, and making a new one with ashift=9? This is mostly a read-only pool, so write speed does not matter as much as space does; I just want to get the most space out of it as possible. According to my quick math I should be getting 16 drives - 3 parity drives = 13 x 1.86TB = 24.18TB of usable space, but I am getting 20.7TB. Losing almost 4TB is not acceptable.
 
With 13 data disks and an ashift of 12 you would waste a huge amount of space (roughly 18%).

WTF are you talking about? Larger ashifts mean larger sector sizes which can waste more space for small files (or files that are, say, 4097 bytes), but why are you claiming that it just wastes a certain percentage of space? It's not ashift itself that wastes the space. Let's be honest here - it's poorly planned arrays that can waste space (well, that and ZFS itself being programmed in such a way, but that's not something that's going to change). The drives NEED ashift 12 to get proper performance.

OP, you actually NEED ashift set to 12 to get proper performance out of your drives. Do NOT set it to 9. Instead, you should look into not using a 13-data-disk layout. Unfortunately you have found out that ZFS is FAR from perfect; it's very picky about how you design parity-based arrays.
 
so how do I fix it?

Are you open to the idea of making multiple smaller Vdevs instead? Can you list how many drives of each capacity you have? I could probably figure it out by Googling the drive model numbers that I'm not familiar with, but I'm lazy :)

I can't guarantee you'll be happy with the results, though. We may still end up having to eat more than 3 disks of space.
 
Well, right now I am wasting about 7TB of space, so I'm up for it.

I have:
9 x 3TB drives
9 x 2TB drives
one 16-bay case, plus one 2-bay hot-swap case that I have been using for backups and for replacing disks in the array.
 
WTF are you talking about? Larger ashifts mean larger sector sizes which can waste more space for small files (or files that are, say, 4097 bytes), but why are you claiming that it just wastes a certain percentage of space? It's not ashift itself that wastes the space. Let's be honest here - it's poorly planned arrays that can waste space (well, that and ZFS itself being programmed in such a way, but that's not something that's going to change). The drives NEED ashift 12 to get proper performance.

A full record of 128 KiB is striped over 13 data disks. That means each disk receives 9.85 KiB of data. Since the smallest unit of allocation on a disk is dictated by the block size (2^ashift), an ashift of 12 means that 3 blocks of 4 KiB have to be allocated per disk, as the data cannot fit into 2 blocks. The remaining space is just padding. So the 128 KiB record will actually take 12 KiB * 13 disks = 156 KiB, resulting in a space efficiency of 82%. This gets worse for smaller records. Of course in reality there is no separation between data and parity disks; it is all interleaved over all the disks.

When using an ashift of 9, the 9.85 KiB of data will use 20 blocks of 512 bytes each, requiring 10 KiB of actual space on each disk and 130 KiB for the whole record. In this case the space efficiency is greater than 98%.

If you set up an array with a power-of-two number of data disks (plus 1-3 parity), you always get 100% efficiency regardless of the ashift setting. That is because dividing the record size by the number of data disks and then by the block size always results in an integer.

You may ask why the data does not just sequentially fill complete stripes: because then it would involve ridiculous math just to calculate where a specific address inside a file is stored on disk.

You can easily find out how the records are actually stored on disk by playing around with zdb for a while.
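If you want to play with the numbers yourself, here is a minimal sketch of the same arithmetic (an illustration only, assuming full 128 KiB records and ignoring parity, metadata and compression):

Code:
#!/usr/bin/env python3
# Minimal sketch of the padding arithmetic described above: a 128 KiB record
# is split across the data disks and each disk's share is rounded up to a
# whole number of 2^ashift blocks; the rest of the last block is padding.
import math

def raidz_data_efficiency(data_disks, ashift, recordsize=128 * 1024):
    block = 2 ** ashift
    per_disk = recordsize / data_disks                 # e.g. ~9.85 KiB for 13 disks
    blocks_per_disk = math.ceil(per_disk / block)      # rounded up to whole blocks
    allocated = blocks_per_disk * block * data_disks   # space actually consumed
    return recordsize / allocated

for ashift in (9, 12):
    for data_disks in (8, 13):
        eff = raidz_data_efficiency(data_disks, ashift)
        print(f'{data_disks} data disks, ashift={ashift}: {eff:.0%} efficient')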
 
Is there some easy formula I can use to determine how I should set up my disks? I don't really care about the speed of the array since I will be limited to bonded gigabit bandwidth (2x1Gig) anyway, but I want to set it up for maximum space with the right amount of drive protection. I have been backing up all the data from the array, so destroying the pool is a valid option.
 
A full record of 128 KiB is striped over 13 data disks. That means each disk receives 9.85 KiB of data. Since the smallest unit of allocation on a disk is dictated by the block size (2^ashift), an ashift of 12 means that 3 blocks of 4 KiB have to be allocated per disk, as the data cannot fit into 2 blocks. The remaining space is just padding. So the 128 KiB record will actually take 12 KiB * 13 disks = 156 KiB, resulting in a space efficiency of 82%. This gets worse for smaller records. Of course in reality there is no separation between data and parity disks; it is all interleaved over all the disks.

When using an ashift of 9, the 9.85 KiB of data will use 20 blocks of 512 bytes each, requiring 10 KiB of actual space on each disk and 130 KiB for the whole record. In this case the space efficiency is greater than 98%.

If you set up an array with a power-of-two number of data disks (plus 1-3 parity), you always get 100% efficiency regardless of the ashift setting. That is because dividing the record size by the number of data disks and then by the block size always results in an integer.

You may ask why the data does not just sequentially fill complete stripes: because then it would involve ridiculous math just to calculate where a specific address inside a file is stored on disk.

You can easily find out how the records are actually stored on disk by playing around with zdb for a while.

This is correct, and the best explanation of this phenomenon I've read. Took me a while to figure this out with trial and error :)
 
Is there some easy formula I can use to determine how I should set up my disks? I don't really care about the speed of the array since I will be limited to bonded gigabit bandwidth (2x1Gig) anyway, but I want to set it up for maximum space with the right amount of drive protection. I have been backing up all the data from the array, so destroying the pool is a valid option.

Your best bet is to use an 11-drive RAIDZ3 or 10-drive RAIDZ2 with disks of the same size.
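Both of those layouts put 8 data disks behind the parity, so with the arithmetic from the earlier post a full record divides evenly into 4 KiB blocks. A quick check along the same lines (same simplifying assumptions: full 128 KiB records, parity and metadata ignored):

Code:
#!/usr/bin/env python3
# Quick comparison of the suggested layouts against the current 16-wide
# RAIDZ3, using the same simplified padding model as above (ashift=12).
import math

recordsize, block = 128 * 1024, 4096
layouts = [('11-wide RAIDZ3', 8), ('10-wide RAIDZ2', 8), ('16-wide RAIDZ3', 13)]
for label, data_disks in layouts:
    blocks = math.ceil(recordsize / data_disks / block)
    eff = recordsize / (blocks * block * data_disks)
    print(f'{label}: {eff:.0%} of the data portion usable')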
 
Well, I was going to go with a 19-disk RAIDZ3, but since I had exactly 11 3TB disks I went with an 11-disk RAIDZ3. Thanks all, this thread can be locked now.
 
Just to add... a 19-disk RAIDZ3 is also fine (assuming all disks are the same size); I was not suggesting it because you stated that you use a 16-bay case.
 