ZFS autoexpand problem

Discussion in 'SSDs & Data Storage' started by DataMine, Oct 3, 2013.

  1. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Hello, I have a 16-disk RAIDZ3 array that had a mix of 3TB, 2TB, and 1.5TB disks in it, and I was getting 15.5TB of usable space. Then I replaced the last 1.5TB disk with a 3TB disk and set the autoexpand flag to on; after the resilver I shut the system down to physically swap the disk and rebooted. But I am only getting 20.7TB of space. My other array, a 15x2TB RAIDZ2, gets 22TB of usable space, so a 16x2TB RAIDZ3 should be getting about the same, right? Am I missing something?

    The idea was to get more space now; as I find deals on new 3TB disks I would replace the smaller disks until all the disks in the array were 3TB. Afterwards I would make a new array with the leftover 2TB disks and use the 1.5TB disks for offsite backup of the important data.
    Code:
    NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
    media01  27.2T  20.0T  7.21T    73%  1.00x  ONLINE  -
    media03    29T  10.9T  18.1T    37%  1.00x  ONLINE  -
    
    Code:
     pool: media03
     state: ONLINE
      scan: resilvered 56K in 0h0m with 0 errors on Thu Oct  3 15:34:56 2013
    config:
    
    	NAME                                            STATE     READ WRITE CKSUM
    	media03                                         ONLINE       0     0     0
    	  raidz3-0                                      ONLINE       0     0     0
    	    ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2185388    ONLINE       0     0     0
    	    ata-Hitachi_HDS722020ALA330_JK1174YAJ88NWW  ONLINE       0     0     0
    	    ata-ST3000DM001-9YN166_W1F11XWL             ONLINE       0     0     0
    	    ata-TOSHIBA_DT01ABA300_83C121NKS            ONLINE       0     0     0
    	    ata-WDC_WD20EARX-00PASB0_WD-WCAZAA274537    ONLINE       0     0     0
    	    ata-Hitachi_HDS723020BLA642_MN1220F30KTX0D  ONLINE       0     0     0
    	    ata-WDC_WD20EURS-63S48Y0_WD-WMAZA7268996    ONLINE       0     0     0
    	    ata-Hitachi_HDS722020ALA330_JK11A8BFGATBMF  ONLINE       0     0     0
    	    ata-TOSHIBA_DT01ABA300_23KDLMPGS            ONLINE       0     0     0
    	    ata-TOSHIBA_DT01ABA300_536RWX8AS            ONLINE       0     0     0
    	    ata-TOSHIBA_DT01ABA300_238DEA2GS            ONLINE       0     0     0
    	    ata-Hitachi_HDS723020BLA642_MN1220F31XRE6D  ONLINE       0     0     0
    	    ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5246623    ONLINE       0     0     0
    	    ata-ST2000DM001-9YN164_Z1F02A3G             ONLINE       0     0     0
    	    ata-ST33000651AS_9XK0FBDG                   ONLINE       0     0     0
    	    ata-WDC_WD30EZRX-00DC0B0_WD-WMC1T1704191    ONLINE       0     0     0
    
    
     
  2. Jim G

    Jim G Limp Gawd

    Messages:
    221
    Joined:
    Jun 2, 2011
    Edit: Misread problem, sorry. That probably won't help!

    Are you mixing 512 and 4k disks? What's the pool ashift?
     
  3. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Not really sure how to check 512 vs 4k (I thought that mixing them only resulted in lower read/write performance and should not affect capacity at all), but here are all the flags set on the pool:

    Code:
    NAME     PROPERTY               VALUE                  SOURCE
    media03  size                   29T                    -
    media03  capacity               38%                    -
    media03  altroot                -                      default
    media03  health                 ONLINE                 -
    media03  guid                   15985232323593502269   default
    media03  version                -                      default
    media03  bootfs                 -                      default
    media03  delegation             on                     default
    media03  autoreplace            off                    default
    media03  cachefile              -                      default
    media03  failmode               wait                   default
    media03  listsnapshots          off                    default
    media03  autoexpand             on                     local
    media03  dedupditto             0                      default
    media03  dedupratio             1.00x                  -
    media03  free                   17.8T                  -
    media03  allocated              11.2T                  -
    media03  readonly               off                    -
    media03  ashift                 0                      default
    media03  comment                -                      default
    media03  expandsize             0                      -
    media03  freeing                0                      default
    media03  feature@async_destroy  enabled                local
    media03  feature@empty_bpobj    enabled                local
    media03  feature@lz4_compress   enabled                local
    
    
     
  4. omniscence

    omniscence [H]ard|Gawd

    Messages:
    1,311
    Joined:
    Jun 27, 2010
    Please execute 'zdb -C media03' to get your ashift. From what I have seen here I suppose you have an ashift of 12. With 13 data disks and an ashift of 12 you would waste a huge amount of space (roughly 18%).

    You cannot just add AVAIL and USED to get the total amount of space, especially when the pools carry different data and have unequal fill levels. To calculate the available space, ZFS estimates how much metadata it will require, but this may or may not match the space the metadata actually uses when the pool fills up.
     
  5. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Yes, it says ashift: 12. Should I change this, and how would I?

    Code:
    MOS Configuration:
            version: 5000
            name: 'media03'
            state: 0
            txg: 76309
            pool_guid: 15985232323593502269
            hostname: 'nas1'
            vdev_children: 1
            vdev_tree:
                type: 'root'
                id: 0
                guid: 15985232323593502269
                children[0]:
                    type: 'raidz'
                    id: 0
                    guid: 10897904191755033925
                    nparity: 3
                    metaslab_array: 33
                    metaslab_shift: 37
                    ashift: 12
                    asize: 32006155010048
                    is_log: 0
                    create_txg: 4
                    children[0]:
                        type: 'disk'
                        id: 0
                        guid: 10086677702268349541
                        path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WCAWZ2185388-part1'
                        whole_disk: 1
                        DTL: 120
                        create_txg: 4
                    children[1]:
                        type: 'disk'
                        id: 1
                        guid: 10746556501007002594
                        path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK1174YAJ88NWW-part1'
                        whole_disk: 1
                        DTL: 52
                        create_txg: 4
                    children[2]:
                        type: 'disk'
                        id: 2
                        guid: 13587071107586707009
                        path: '/dev/disk/by-id/ata-ST3000DM001-9YN166_W1F11XWL-part1'
                        whole_disk: 1
                        DTL: 51
                        create_txg: 4
                    children[3]:
                        type: 'disk'
                        id: 3
                        guid: 16818880299089374457
                        path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_83C121NKS-part1'
                        whole_disk: 1
                        DTL: 50
                        create_txg: 4
                    children[4]:
                        type: 'disk'
                        id: 4
                        guid: 10885010448189793385
                        path: '/dev/disk/by-id/ata-WDC_WD20EARX-00PASB0_WD-WCAZAA274537-part1'
                        whole_disk: 1
                        DTL: 49
                        create_txg: 4
                    children[5]:
                        type: 'disk'
                        id: 5
                        guid: 12654990196337501768
                        path: '/dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1220F30KTX0D-part1'
                        whole_disk: 1
                        DTL: 48
                        create_txg: 4
                    children[6]:
                        type: 'disk'
                        id: 6
                        guid: 16265472658171074893
                        path: '/dev/disk/by-id/ata-WDC_WD20EURS-63S48Y0_WD-WMAZA7268996-part1'
                        whole_disk: 1
                        DTL: 47
                        create_txg: 4
                    children[7]:
                        type: 'disk'
                        id: 7
                        guid: 8542810596399517192
                        path: '/dev/disk/by-id/ata-Hitachi_HDS722020ALA330_JK11A8BFGATBMF-part1'
                        whole_disk: 1
                        DTL: 46
                        create_txg: 4
                    children[8]:
                        type: 'disk'
                        id: 8
                        guid: 2094143228829878287
                        path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_23KDLMPGS-part1'
                        whole_disk: 1
                        DTL: 45
                        create_txg: 4
                    children[9]:
                        type: 'disk'
                        id: 9
                        guid: 4538046503230694112
                        path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_536RWX8AS-part1'
                        whole_disk: 1
                        DTL: 44
                        create_txg: 4
                    children[10]:
                        type: 'disk'
                        id: 10
                        guid: 16405706797812393936
                        path: '/dev/disk/by-id/ata-TOSHIBA_DT01ABA300_238DEA2GS-part1'
                        whole_disk: 1
                        DTL: 43
                        create_txg: 4
                    children[11]:
                        type: 'disk'
                        id: 11
                        guid: 8930509428790570889
                        path: '/dev/disk/by-id/ata-Hitachi_HDS723020BLA642_MN1220F31XRE6D-part1'
                        whole_disk: 1
                        DTL: 42
                        create_txg: 4
                    children[12]:
                        type: 'disk'
                        id: 12
                        guid: 6927026716990328985
                        path: '/dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5246623-part1'
                        whole_disk: 1
                        DTL: 41
                        create_txg: 4
                    children[13]:
                        type: 'disk'
                        id: 13
                        guid: 4250982475318050661
                        path: '/dev/disk/by-id/ata-ST2000DM001-9YN164_Z1F02A3G-part1'
                        whole_disk: 1
                        DTL: 40
                        create_txg: 4
                    children[14]:
                        type: 'disk'
                        id: 14
                        guid: 9830375710225770979
                        path: '/dev/disk/by-id/ata-ST33000651AS_9XK0FBDG-part1'
                        whole_disk: 1
                        DTL: 39
                        create_txg: 4
                    children[15]:
                        type: 'disk'
                        id: 15
                        guid: 15516814786807439750
                        path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00DC0B0_WD-WMC1T1704191-part1'
                        whole_disk: 1
                        DTL: 38
                        create_txg: 4
            features_for_read:
    
    
     
  6. omniscence

    omniscence [H]ard|Gawd

    Messages:
    1,311
    Joined:
    Jun 27, 2010
    Well, ashift cannot be changed; it can only be set during pool creation. With 13 data drives (3 parity) you will waste around 18% of the total space due to padding. However, ashift=12 is the best setting for modern drives, which have 4 KiB sectors; with ashift=9 you will lose a lot of performance (basically half the IOPS for writes) and create a small integrity gap. Only some enterprise-class drives still use 512-byte sectors.

    Another solution to this problem is to wait until the open-source versions of ZFS get 1 MiB records.
     
  7. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Would I be better off backing my data up, destroying the pool, and making a new one with ashift=9? This is mostly a read-only pool, so write speed does not matter as much as space does; I just want to get as much space out of my unit as possible. According to my quick math I should be getting 16 drives - 3 parity drives = 13 x 1.86 = 24.18 TB of usable space, but I am getting 20.7TB. Losing almost 4TB is not acceptable.
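
    For reference, here is a rough back-of-the-envelope version of that math in Python (it assumes each data disk only counts for the capacity of the smallest 2TB drive and applies the roughly 18% padding loss mentioned above, and it ignores metadata reservations, so the numbers are only estimates):
    Code:
    # Capacity check: 13 data disks, each capped at the smallest (2 TB) drive.
    smallest_bytes = 2 * 10**12            # "2 TB" in marketing (decimal) units
    per_disk_tib = smallest_bytes / 2**40  # ~1.82 TiB per data disk
    ideal_tib = 13 * per_disk_tib          # ~23.6 TiB before any overhead

    # Rough lower bound after the ~18% padding loss discussed above (ashift=12).
    padded_tib = ideal_tib * 0.82          # ~19.4 TiB

    print(round(ideal_tib, 1), round(padded_tib, 1))
    
    The 20.7TB I am actually getting falls between those two figures, so the padding overhead seems to explain most of the gap.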
     
  8. dandragonrage

    dandragonrage [H]ardForum Junkie

    Messages:
    8,299
    Joined:
    Jun 5, 2004
    WTF are you talking about? Larger ashifts mean larger sector sizes, which can waste more space for small files (or files that are, say, 4097 bytes), but why are you claiming that it just wastes a certain percentage of space? It's not ashift itself that wastes the space. Let's be honest here - it's poorly planned arrays that can waste space (well, that and the way ZFS itself is programmed, but that's not something that's going to change). The drives NEED ashift 12 to get proper performance.

    OP, you actually NEED ashift set to 12 to get proper performance out of your drives. Do NOT set it to 9. Instead, you should look into not making a 13-data-disk array. Unfortunately you have found out that ZFS is FAR from perfect; it's very picky about how you design parity-based arrays.
     
    Last edited: Oct 4, 2013
  9. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    so how do I fix it?
     
  10. dandragonrage

    dandragonrage [H]ardForum Junkie

    Messages:
    8,299
    Joined:
    Jun 5, 2004
    Are you open to the idea of making multiple smaller Vdevs instead? Can you list how many drives of each capacity you have? I could probably figure it out by Googling the drive model numbers that I'm not familiar with, but I'm lazy :)

    I can't guarantee you'll be happy with the results, though. We may still end up having to eat more than 3 disks of space.
     
  11. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Well, right now I am wasting about 7TB of space, so I'm up for it.

    I have:
    9 x 3TB drives
    9 x 2TB drives
    one 16-bay case, and one 2-bay hotswap case that I have been using for backups and for replacing disks in my array.
     
  12. omniscence

    omniscence [H]ard|Gawd

    Messages:
    1,311
    Joined:
    Jun 27, 2010
    A full record of 128 KiB is striped over 13 data disks, which means each disk receives 9.85 KiB of data. Since the smallest unit of data on a disk is dictated by the block size (2^ashift), an ashift of 12 means that 3 blocks of 4 KiB have to be allocated per disk, as the data cannot fit into 2 blocks. The remaining space is just padded. So the 128 KiB record will actually take 12 KiB * 13 disks = 156 KiB, resulting in a space efficiency of 82%. This gets worse for smaller records. Of course, in reality there is no separation between data and parity disks; everything is interleaved over all disks.

    When using an ashift of 9, the 9.85 KiB of data will use 20 blocks, each 512 bytes in size, requiring 10 KiB of actual space on each disk and 130 KiB for the whole record. In this case the space efficiency is above 98%.

    If you set up an array with a power-of-two number of data disks (plus 1-3 parity), you always get 100% regardless of the ashift setting, because dividing the record size by the number of data disks and then by the block size always results in an integer.
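
    If you want to check the numbers yourself, here is a minimal Python sketch of that rounding (it only models the record-over-data-disks rounding described above and ignores the parity blocks and any extra padding ZFS applies internally, so treat the output as an estimate):
    Code:
    import math

    def raidz_record_efficiency(record_kib, data_disks, ashift):
        # Smallest unit a disk can allocate, implied by ashift (2^ashift bytes).
        block_kib = 2 ** ashift / 1024
        # Share of one full record that lands on each data disk.
        per_disk_kib = record_kib / data_disks
        # Each disk has to allocate whole blocks, so round up and pad the rest.
        blocks_per_disk = math.ceil(per_disk_kib / block_kib)
        allocated_kib = blocks_per_disk * block_kib * data_disks
        return record_kib / allocated_kib

    print(raidz_record_efficiency(128, 13, 12))  # ~0.82  (the 82% above)
    print(raidz_record_efficiency(128, 13, 9))   # ~0.985 (above 98%)
    print(raidz_record_efficiency(128, 8, 12))   # 1.0    (power-of-two data disks)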

    You may ask why the data does not just fill complete stripes sequentially: because then it would take ridiculous math just to calculate where a specific address inside a file is stored on disk.

    You can easily find out how the records are actually stored on disk by playing around with zdb for a while.
     
    Last edited: Oct 4, 2013
  13. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Is there some easy formula I can use to determine how I should set up my disks? I don't really care about the speed of the array, since I will be limited to bonded gigabit bandwidth (2x1Gig) anyway, but I want to set up for maximum space with the right amount of drive protection as well. I have been backing up all the data from the array, so destroying the pool is a valid option.
     
  14. Firebug24k

    Firebug24k [H]Lite

    Messages:
    106
    Joined:
    Aug 31, 2006
    This is correct, and the best explanation of this phenomenon I've read. Took me a while to figure this out with trial and error :)
     
  15. omniscence

    omniscence [H]ard|Gawd

    Messages:
    1,311
    Joined:
    Jun 27, 2010
    Your best bet is to use an 11-drive RAIDZ3 or 10-drive RAIDZ2 with disks of the same size.
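
    For what it's worth, both of those layouts leave 8 data disks, so the rounding from my earlier post works out exactly. A quick check with the same simplified model (ashift=12, full 128 KiB records, padding only):
    Code:
    import math

    # Both suggestions leave 8 data disks; with ashift=12 (4 KiB blocks) a full
    # 128 KiB record splits into exactly 16 KiB = 4 blocks per disk, no padding.
    per_disk_kib = 128 / 8
    blocks_per_disk = math.ceil(per_disk_kib / 4)
    print(128 / (blocks_per_disk * 4 * 8))  # 1.0 -> nothing lost to padding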
     
  16. DataMine

    DataMine n00b

    Messages:
    41
    Joined:
    Feb 8, 2012
    Well, I was going to go with a 19-disk RAIDZ3, but since I had exactly eleven 3TB disks I went with an 11-disk RAIDZ3. Thanks all, this thread can be locked now.
     
  17. omniscence

    omniscence [H]ard|Gawd

    Messages:
    1,311
    Joined:
    Jun 27, 2010
    Just to add... a 19-disk RAIDZ3 is also fine (assuming all disks are the same size); I was not suggesting it because you stated that you use a 16-bay case.