RAIDZ2 space efficiency problem

bao__zhe · Oct 4, 2013

I've been using ZFS RAIDZ2 for some time, and now that my pool is filling up it's time to look into its space efficiency.

So here is the configuration: ST3000DM001-1CH1 Seagate 3TB x 10 in a RAIDZ2 pool with ashift=12 on Solaris 11.1

zpool status -v

Code:

  pool: data
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 41h13m with 0 errors on Wed Oct  2 20:13:26 2013
config:

        NAME                       STATE     READ WRITE CKSUM
        data                       ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c0t5000C5004E21FDD3d0  ONLINE       0     0     0
            c0t5000C5004E246BFCd0  ONLINE       0     0     0
            c0t5000C5004E3F5596d0  ONLINE       0     0     0
            c0t5000C5003CD954C0d0  ONLINE       0     0     0
            c0t5000C5003CDCA539d0  ONLINE       0     0     0
            c0t5000C5004E4AB489d0  ONLINE       0     0     0
            c0t5000C5005203762Ed0  ONLINE       0     0     0
            c0t5000C5005203A060d0  ONLINE       0     0     0
            c0t5000C5005203A780d0  ONLINE       0     0     0
            c0t5000C5005209B060d0  ONLINE       0     0     0
        logs
          c0t5000000000000003d0    ONLINE       0     0     0
        cache
          c0t5000000000000000d0    ONLINE       0     0     0
          c8t14d0                  ONLINE       0     0     0

errors: No known data errors

zfs list

Code:

data                              16.6T  3.91T   558K  /data
data/Audio                         667G  3.82T   665G  /data/Audio
data/Download                      613G  3.82T   358G  /data/Download
data/ESXI                          918G  3.82T   466K  /data/ESXI
data/ESXI/FW                       202M  3.82T   195M  /data/ESXI/FW
data/ESXI/VCS                     3.12G  3.82T  1.47G  /data/ESXI/VCS
data/ESXI/VDR                      915G  3.82T   881G  /data/ESXI/VDR
data/ESXIB                         730G  3.82T   223G  /data/ESXIB
data/File                          590G  3.82T   587G  /data/File
data/Game                         2.11T  3.82T  2.11T  /data/Game
data/Production                    245G  3.82T   202G  /data/Production
data/System                        135G  3.82T   135G  /data/System
data/Temp                         2.91G  3.82T   393K  /data/Temp
data/Tool                          815G  3.82T   812G  /data/Tool
data/VM                            180G  3.82T   177G  /data/VM
data/Video                        9.62T  3.82T  9.57T  /data/Video

zpool iostat

Code:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        21.7T  5.56T    179     91  19.3M  2.73M
----------  -----  -----  -----  -----  -----  -----

zpool list

Code:

NAME     SIZE  ALLOC   FREE  CAP  DEDUP    HEALTH  ALTROOT
data    27.2T  21.7T  5.56T  79%  1.00x    ONLINE  -

My question is: given 3TB HDDs (2.73TB each, to be exact) and 8 data disks in the pool, the total available space should be 21.83TB. However, the usable space appears to be 20.51TB (16.6 + 3.91). So where is the missing 1.3TB?
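The arithmetic behind the question can be reproduced as a quick sanity check. This is only a sketch: it assumes each drive holds exactly the nominal 3 × 10^12 bytes and that the tools report binary units, and the variable names are made up for illustration.

```python
# Assumption: nominal 3 TB drives (3 * 10**12 bytes), sizes reported in binary units.
TIB = 2**40

disk_tib = 3 * 10**12 / TIB            # capacity of one drive as the tools show it
data_disks = 10 - 2                    # a 10-disk RAIDZ2 leaves 8 disks' worth of data
expected_tib = disk_tib * data_disks   # what the poster expected to see
reported_tib = 16.6 + 3.91             # USED + AVAIL of the top-level dataset in `zfs list`
missing_tib = expected_tib - reported_tib

print(f"per disk: {disk_tib:.2f}T")
print(f"expected: {expected_tib:.2f}T")
print(f"reported: {reported_tib:.2f}T")
print(f"missing:  {missing_tib:.2f}T")
```

Real drives are not exactly 3 × 10^12 bytes, so the last digit may differ slightly from what the pool reports.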

patrickdk · Oct 4, 2013

It is lost because of your 4k sectors.

With 8 data disks at 4k each, you allocate 32k at a time. If you store something small, it will still use 32k of space (except when it switches to mirroring mode).

So that 1.3TB is gone because you didn't use single disks or mirrors but opted for a raidz, and allocations then expanded due to the 4k*8 allocation size requirement.

bao__zhe · Oct 11, 2013

sorry to get back late...that's something i never think of...good point

Thanks~

SirMaster · Oct 11, 2013

Shouldn't that only make small files larger, not make the total volume smaller?

bao__zhe · Oct 11, 2013

emmm...i guess it depends on how ZFS comes up with the number "16.6T" and "3.91T"? not sure in this regard

bexamous · Oct 11, 2013

See here:
http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

Think that'll answer your question.

SirMaster · Oct 11, 2013

bexamous said:
See here:
http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

Think that'll answer your question.

Wow. Clear and concise thank you for that.

I'm super happy now that the array I am creating this weekend is going to be made of two 6-disk RAIDZ2 vdevs.

Although, does anyone know what happens when I put the 2 6-disk vdevs into the pool together? Does that add any overhead? What about if I were to add a third 6-disk vdev?

bexamous · Oct 11, 2013

No, overhead is per vdev; multiple vdevs do not change anything.

patrickdk · Oct 11, 2013

Also remember that this assumes no compression. With compression you will still waste space, but no more than if you weren't using it, and you could possibly gain some space back.

Silhouette · Oct 12, 2013

I'd like to know how BTRFS compares to ZFS in this aspect.

westrock2000 · Oct 12, 2013

bexamous said:
See here:
http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html

Think that'll answer your question.

The magic numbers as described in that link:

"Zero overhead from both happens at raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks."
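Those numbers can be reproduced with a small sketch. It assumes a 128KB block on 4k sectors (32 data sectors), one set of parity sectors per row of data, and padding of the total up to a multiple of raidz level plus 1, as the linked article describes. The function names are made up for illustration and this is not the ZFS code itself.

```python
def raidz_alloc_sectors(data_sectors, ndisks, nparity):
    """Sectors allocated for one block: data + parity, padded to a multiple of nparity + 1."""
    data_cols = ndisks - nparity
    rows = -(-data_sectors // data_cols)       # ceiling division: rows of data sectors
    total = data_sectors + nparity * rows      # each row carries nparity parity sectors
    unit = nparity + 1
    return -(-total // unit) * unit            # round up to a multiple of nparity + 1


def zero_overhead_widths(nparity, max_disks=20, data_sectors=32):
    """Disk counts where the allocation equals the ideal data * ndisks / data_cols."""
    widths = []
    for ndisks in range(nparity + 1, max_disks + 1):
        alloc = raidz_alloc_sectors(data_sectors, ndisks, nparity)
        # Compare in integers to avoid floating-point rounding.
        if alloc * (ndisks - nparity) == data_sectors * ndisks:
            widths.append(ndisks)
    return widths


print("raidz1:", zero_overhead_widths(1))
print("raidz2:", zero_overhead_widths(2))
```

With these assumptions the widths found up to 20 disks agree with the quoted list; a different block size or sector size would give a different set.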

bao__zhe · Oct 15, 2013

i understand this:

This creates overhead unless the number of disks minus raidz level is a power of two

but don't quite understand this one:

Above that is allocation overhead where each block (together with parity) is padded to occupy the multiple of raidz level plus 1 (sectors)

why is there a "raidz level plus 1" involved here?

JoeComp · Oct 15, 2013

bao__zhe said:
why is there a "raidz level plus 1" involved here?

I was wondering the same thing. Perhaps room for storing multiple copies of the metadata?

bexamous · Oct 15, 2013

This is a little more information although it doesn't fully answer the question:

The reason for this is a bit complicated, but without this roundup, you can end up with stranded sectors that are unallocated and unusable, leading to the question, "I still have free space, why can't I write a file?" We simply account for these roundup sectors as part of the allocation that caused them.

http://thr3ads.net/zfs-discuss/2006/09/364051-Metaslab-alignment-on-RAID-Z#m364055

Also see here:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42039.html

Annoyingly the referenced links in this email are all broken.

Also semi-related to this general topic, well not raidz but just zfs in general, but this may explain for missing space, or at least discrepancies seen:
http://www.cuddletech.com/blog/pivot/entry.php?id=1013

bao__zhe · Oct 15, 2013

bexamous said:
Also see here:
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42039.html

This below excerpt mentions the "variable stripe writes", which is what? Also it seems to be performance related as well...

the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped.

bexamous · Oct 15, 2013

'variable stripe widths' might make more sense, but see here:
https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

Also, looking at this link now, it may get the general idea across, but don't pay too close attention to it. The image of raidz there is flawed.

Maybe this is better explanation from someone who designed ZFS:
http://pl.atyp.us/wordpress/?p=1006#comment-6173

The image in the previous link is wrong in that you cannot have a stripe that wraps around; at most, a stripe covers all disks.

But I'm still not understanding how this ends up creating unusable sectors.

bao__zhe · Oct 15, 2013

i was just about to ask the same thing on the picture...let me read through the new link...

bao__zhe · Oct 15, 2013

It seems, although I still don't understand exactly, that "variable stripe writes" makes more sense, as you can write 3 sectors transactionally when updating 1 data sector and 4 sectors transactionally when updating 2 data sectors (assuming RAIDZ2), utilizing the copy-on-write feature. Per my understanding the parity is calculated sector-wise and each stripe still covers exactly all disks, so it is not "variable stripe widths".

But I guess this is where the confusion is, because we are still doing partial-stripe writes and are still using fixed stripe widths.

Although the picture is wrong, the post does mention that

Another aspect of the RAIDZ levels is the fact that if the stripe is longer than the disks in the array, if there is a disk failure, not enough data with the parity can reconstruct the data. Thus, ZFS will mirror some of the data in the stripe to prevent this from happening.

Not sure if that is relevant here.

EDIT:
New discovery:

The actual code that corresponds to this issue is here:

https://java.net/projects/solaris/sources/on-src/content/usr/src/uts/common/fs/zfs/vdev_raidz.c

line 1504:

asize = roundup(asize, nparity + 1) << ashift;

But there is no comment on this line

bexamous · Oct 16, 2013

bao__zhe said:
Per my understanding the parity is calculated sector-wise and each stripe still covers exactly all disks, so it is not "variable stripe widths".

No, that is incorrect. ZFS does not have a fixed block size; it is variable up to the record size, which defaults to 128KB. So a block of data can be anywhere from 512 bytes up to 128KB. If you write a 32KB file, that will become a single 32KB block; add parity to that, and together they are your stripe. It does not always cover all disks.

See like page 13 for diagram, it is similar to the last image that was incorrect but properly shows parity blocks:
wiki.illumos.org/download/attachments/1146951/zfs_last.pdf

When you look at free disk space or anything, it assumes 128KB blocks will be used, which is best case. Worst case, if every block ends up being a single sector, 512 bytes, each block then will have 1 parity sector and half your space ends up as parity. In practice though.

bexamous · Oct 16, 2013

Wait is this really simple:

If you can only write full stripes, the smallest possible write is the smallest possible stripe, which is: 1 sector + raid level. So 2 sectors for raidz or 3 sectors for raid2z. So the problem becomes, if you have three blocks, A B C, on a raidz vdev:
A1 A2 A3 A4 Ap B1 B2 B3 B4 Bp C1 C2 C3 C4 Cp

You delete block B and replace it with a smaller block D:
A1 A2 A3 A4 Ap D1 D2 D3 Dp __ C1 C2 C3 C4 Cp

You now have a hole of 1 sector. But because the smallest possible write is 2 sectors for raidz or 3 sectors for raid2z you cannot actually use this single sector hole.

If you ask the FS how much free space there is, it'll count these holes, but you cannot actually get data into them. If you require all blocks be a multiple of 1+ raid level, you cannot end up with a hole that cannot be filled.

Block allocation is done at metaslab layer. Metaslab has a space map, it gets a block (data + parity) and finds a hole to fit it in. It does not know about raid level or anything else. It doesn't know it will never be given a 1 sector block.
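The A/B/C example above can be turned into a toy model. This is only a sketch of the argument, not of the metaslab allocator: it assumes the new block is placed at the start of the space the old block freed, and all function names are hypothetical.

```python
def min_block_sectors(nparity):
    # Smallest possible block: 1 data sector plus its parity sectors.
    return 1 + nparity


def round_up(sectors, nparity):
    # Pad an allocation up to a multiple of nparity + 1 sectors.
    unit = nparity + 1
    return -(-sectors // unit) * unit


def hole_left(old_block, new_block, nparity, pad):
    """Free sectors left when new_block is written into the space freed by old_block."""
    if pad:
        old_block = round_up(old_block, nparity)
        new_block = round_up(new_block, nparity)
    return old_block - new_block


# raidz1: block B is 5 sectors (4 data + 1 parity), block D is 4 sectors (3 data + 1 parity).
for pad in (False, True):
    hole = hole_left(5, 4, nparity=1, pad=pad)
    usable = hole == 0 or hole >= min_block_sectors(1)
    print(f"pad={pad}: hole of {hole} sector(s), usable={usable}")
```

Without padding the 1-sector hole is smaller than the smallest possible block; with padding every allocation, and therefore every hole in this model, is a multiple of the smallest block size.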

bao__zhe · Oct 17, 2013

the new diagram and your example makes more sense now.

so how about if all my files are exactly 128KB and are written continuously without any deletion? i believe it will still have 4.8% inefficiency (40 sectors used out of 42 sectors allocated). each file will probably look like this:

(EDIT: maybe more like this)

Code:

A01 A05 A09 A13 A17 A21 A25 A29 P01 Q01
A02 A06 A10 A14 A18 A22 P02 Q02 A26 A30
A03 A07 A11 A15 P03 Q03 A19 A23 A27 A31
A04 A08 P04 Q04 A12 A16 A20 A24 A28 A32
X01 X02

in this 128KB block case each sector is 4KB and we have a total of 32 sectors to hold data and 8 sectors to hold parity. but according to the code and the blog posts there will be 2 additional sectors X01 and X02 to make the sector count a multiple of 3 (so it's 42 sectors). and hence the question.

an even worse case would be a 8KB file:

Code:

A1 A2 Ap Aq X1 X2

where you will have 33% inefficiency.

a 4KB file seems to have no inefficiency tho.
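The three cases above can be checked with a short sketch for this pool's layout (10 disks, RAIDZ2, ashift=12). Only the final roundup to a multiple of nparity + 1 comes from the line of vdev_raidz.c quoted earlier in the thread; the data and parity counts follow the layout described in this post and are an assumption, as is the function name.

```python
def raidz_block_layout(psize, ndisks=10, nparity=2, ashift=12):
    """Return (data, parity, padding) sector counts for one block of psize bytes."""
    data = ((psize - 1) >> ashift) + 1           # data sectors, rounded up
    rows = -(-data // (ndisks - nparity))        # rows of data across the data disks
    parity = nparity * rows                      # nparity parity sectors per row
    unit = nparity + 1
    padded = -(-(data + parity) // unit) * unit  # roundup(asize, nparity + 1)
    return data, parity, padded - data - parity


for kib in (128, 8, 4):
    data, parity, padding = raidz_block_layout(kib * 1024)
    total = data + parity + padding
    print(f"{kib:>3} KiB: data={data} parity={parity} padding={padding} "
          f"({100 * padding / total:.1f}% of {total} sectors)")
```

The padding share here is measured against the sectors allocated for the block, which is how the 4.8% and 33% figures above are computed.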

Nex7 · Oct 17, 2013

Y'all are hurting my head. Just to add in here, though: remember that ZFS will coalesce writes as part of the txg workflow. So if you write exactly 8 KB in an entire txg, sure, you'd be trying to write out just 8 KB (+ metadata and parity). However, that almost never happens. If you try to write 40 MB of 8 KB files, you're not going to write 40 MB of 8 KB data (+ metadata & parity and all that), you're going to write 40 MB of max-record-size (usually 128 KB) data and one potentially smaller block of the remainder (+ meta & parity, etc).

I also saw a comment about small holes. Yes, you can end up with a system that has tons of small holes on the raw drive layout, given sufficient time and the right workload. This sucks, because ZFS won't just avoid them forever, eventually it will be forced to use them and does so through something called a 'gang' block where it crams as much data as it can into the little holes and has another level of metadata above it to point at them all, and if you're using them, performance is tanking really fast.

bao__zhe · Oct 18, 2013

so...what is a block exactly? one txg? like cluster in NTFS?
