55 PetaByte ZFS installation

The link you provided is from 2009. In the three years since then, any bugs directly related to ZFS have likely been squashed. HBA and/or hard drive firmware bugs that may or may not have been referenced have also likely been squashed.

However, if you try to build a 100TB+ system using consumer desktop SATA drives, you're going to have serious problems. No manufacturer will recommend using their consumer drives in RAID arrays, for many reasons. Even enterprise-class SATA is discouraged for large deployments, because SATA is only single-channel and interposers aren't made from magic unicorn horn.

Bottom line: no filesystem is truly 'safe', which is why you have backups. If, for instance, one of my large NAS boxes gets corrupted, I can't run fsck on it; I don't care if it might fix the problem, I CAN'T be down that long. Joe, have you ever tried running fsck on even 1TB of data? It takes FOREVER. fsck on, say, 55PB worth of data would literally take MONTHS to finish. Do you really think Lawrence Livermore would ever just say "well shit guys, we're going to be down for the next X months while we run fsck"?

No, they wouldn't, they can't.

Oh, and by the way, fsck isn't a magical fairy duster either. I've had plenty of unrecoverable errors using fsck/chkdsk in the past.

If you don't have backups, you're fucked. End of story. If you do have backups, there isn't a single instance where rebuilding/restoring from backup isn't faster than running fsck once your data is measured in TB.

*edit* Oh, I also forgot to mention something. If you took the time to read the comments in your own link, you would realize that ZFS added something called PSARC 2009/479. This allows you to roll back to (in theory) the last 127 transaction states, so if you have a pool that can't be imported, presumably one of the last 127 states will be importable, at which point the combination of replication, self-healing, and scrubbing should fix everything. If it doesn't, stop trying to use horrible desktop-class hardware.
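For reference, the recovery added by PSARC 2009/479 is reached through the import command. A rough sketch, nothing more (the pool name "tank" is just an example here):

  # normal import fails because the newest transactions are damaged
  zpool import tank
  # recovery mode: discard the last few transactions and fall back to an
  # older, consistent transaction group
  zpool import -F tank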
 
Keep dreaming. While, in an ideal world, ZFS may not need fsck, the problem is that the world is not ideal. ZFS does need fsck. Fortunately, btrfs developers learned from the mistakes of ZFS developers and btrfs will have a useful fsck.

http://www.osnews.com/story/22423/Should_ZFS_Have_a_fsck_Tool_

I think nobody needs a filecheck tool on ZFS like the ones needed on other filesystems.
The online scrub and the auto-repair mechanism on access do the same job (without taking the whole disks offline and waiting until it's finished).
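Roughly, that online check looks like this on a live pool (the pool name "tank" is only a placeholder):

  # scrub the mounted, in-use pool: every block is read and anything whose
  # checksum does not match is rewritten from redundancy
  zpool scrub tank
  # watch progress and see whether anything had to be repaired
  zpool status tank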

What's missing is a rescue tool, like the unformat or recovery tools known from Windows, to analyse and search the whole disk surface(s) for salvageable files in case of a disaster.

But that is not checking the filesystem for failures; that is disaster recovery.
Example: add a new vdev to a pool. If that vdev fails, the whole pool is lost, although most or all files are completely on other vdevs.
 
Your math is WAY off. I think that may be the largest mathematical mistake I have ever seen or heard of. Have you considered working for an investment bank?
You are off by a factor of more than 10^39.
2^128 = 3.4 * 10^38
Actually, I work at a large, world-famous finance company as an analyst. I have studied a lot of math, more than most people. :)

I got an important phone call while finishing the post; that is the reason I did not finish it, but just clicked "submit reply" and took the call.

My point is that mankind doesn't need more than 128 bits. Ever. This is very different from Bill Gates' "640K should be enough for everyone", because to exceed 128 bits you would need more matter than 100 moons. Thus, only if mankind stores more data than can be produced from 100 moons will we need more than 128 bits. Quite unlikely, though. :)
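Just to have the number in front of us, a quick sanity check with bc (nothing more than arithmetic):

  echo '2^128' | bc
  # 340282366920938463463374607431768211456, i.e. about 3.4 * 10^38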


Keep dreaming. While, in an ideal world, ZFS may not need fsck, the problem is that the world is not ideal. ZFS does need fsck. Fortunately, btrfs developers learned from the mistakes of ZFS developers and btrfs will have a useful fsck.
http://www.osnews.com/story/22423/Should_ZFS_Have_a_fsck_Tool_
BTRFS developers don't know shit. They don't understand the principles of ZFS, and that is the reason BTRFS has a fsck tool. ZFS is more than 10 years old and has never had a fsck tool in all these years, and ZFS did fine. Why? Is fsck desperately needed? No, ZFS does not have, nor need, a fsck tool, because of its design. BTRFS is just a ZFS wannabe, without any BTRFS developer understanding why ZFS is designed the way it is. BTRFS developers "learned from the mistakes of ZFS developers"? Are you mad? They have not grokked ZFS yet and never will.

In response to your ignorant link, here is a follow-up answer to it: "ZFS does not need no stinking fsck". Read this to see why ZFS does not need fsck.
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html


And yes, a fsck-like tool for ZFS would be great - it would take care of those (super rare?) cases where it would be needed.

In the meantime, I suppose we could all switch to btrfs - I don't know its status - is it feature-complete now?
No, fsck is not needed on ZFS and never will be. The reason is that ZFS is designed differently from common filesystems. If ZFS corrupts the zpool, what will the ZFS user do then? Well, the answer is not to take the zpool offline and run fsck for hours like BTRFS users would do. That is bad and antique. The BTRFS developers are stuck in the old mindset and have not really understood ZFS yet.

If the zpool is corrupt, if I cannot import the zpool because of corruption, then the ZFS user will simply do the following:
When I alter data on ZFS, only the changes are written, to an empty place on the disk. The old data is still intact and never touched. Thus, the changes are written, but the old data is still there. This is because of CoW (copy on write).

Now, if the zpool becomes corrupt, the ZFS user will roll back in time. Every change is stored on the zpool. The latest change corrupted the zpool. So I roll back to an earlier state, 30 seconds earlier, and try to import with "zpool import -F". If that does not work, I roll back 30 seconds more. Eventually, one of the states will be correct, so I can import that state. I will lose the latest 30 seconds of changes, but that is OK.
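For anyone who wants to try this carefully, there is also a dry-run variant; "tank" is just an example pool name and the exact flags can differ between ZFS versions:

  # dry run: report whether discarding the newest transactions would make
  # the pool importable, without actually changing anything
  zpool import -Fn tank
  # if that looks sane, do the real rewind
  zpool import -F tank
  # extra-cautious alternative: import read-only first and copy the data off
  zpool import -o readonly=on tank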

Everything is left intact on ZFS, so I can go back in time to whenever I want. I just go back in time, to before the corruption took place. It is like Apple OS X "Time Machine". With Time Machine I can roll back to any time. The same with ZFS: I can roll the system back to any time.

So there is no need for antique fsck. fsck requires you to take the pool offline and try to repair all the data, which might take a long time if there is a lot of data. But ZFS is self-healing, so it will repair minor data problems. If there are serious data problems, for instance the pool is corrupt, just go back in time. If your computer is struck by lightning and everything melts, then your data is toast, but no solution would help in that case either.

Now, BTRFS is actually CoW (they say). So every change should still be left intact on the pool. So if there is corruption, you just need to roll back in time. Why on earth does BTRFS have a fsck tool??? BTRFS doesn't need it! The BTRFS devs don't understand the power of CoW, or how it can be used. That is why. CoW filesystems don't need a stinking fsck tool, if the developers understand how to use the power of CoW. Apparently, the BTRFS devs have understood neither ZFS nor its power.

It seems that the BTRFS devs just copy ZFS, without understanding it. The BTRFS creator Chris Mason has officially admitted that he looks at ZFS to copy good ideas and functions; apparently he has not really understood the finer details. For instance, he said that only after the ZFS devs had raved about checksums and data corruption did he add checksums to BTRFS. He became convinced checksums were a good thing to add only after listening to the ZFS devs, he said. He did not understand how common data corruption is. He is just a Linux hacker, with no experience of enterprise storage. But the Solaris devs have long experience of enterprise storage server halls and know what kinds of problems arise when storing vast amounts of data. Chris Mason has probably never even visited a server hall.

I am not even convinced that BTRFS checksums are good enough. CERN did a study and concluded that "adding checksums to combat data corruption is not enough". You cannot just add checksums and expect the data to be safe against corruption.

Here is the new DIF standard, which adds checksums to combat data corruption. Supposedly, it is very safe:
http://en.wikipedia.org/wiki/Data_Integrity_Field
"DIF stands for Data Integrity Field. The purpose of this field is to provide End-to-End data protection in Computer/Enterprise data storage methodology.". The aim is to give End-to-End data protection, just as ZFS does.

Well, guess what? Here is an enterprise Fibre Channel disk which has the DIF data integrity field. And lo and behold, it reports that for every 10^16 bits read, there will be one irrecoverable error. Now, how is it possible that it gets data corruption, even though it has DIF checksums to guarantee data safety? Adding checksums is not enough:
http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/cheetah/10K.7/FC/100260916d.pdf
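To put that 1-in-10^16 figure into the context of this thread's 55 PB, a back-of-the-envelope calculation only, assuming the quoted unrecoverable-error rate and nothing else:

  # 55 PB read once = 55 * 10^15 bytes * 8 bits/byte = 4.4 * 10^17 bits
  # at one unrecoverable error per 10^16 bits read:
  echo '55 * 10^15 * 8 / 10^16' | bc
  # 44 -- dozens of unrecoverable errors expected per full read of the data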

On the other hand, researchers have examined ZFS by injecting artificial errors, and ZFS recovered from all of them. It seems that ZFS has done error detection correctly:
http://en.wikipedia.org/wiki/ZFS#Data_Integrity

How about BTRFS? I don't know, but I doubt they know what they are doing. "BTRFS devs have learned from ZFS mistakes"? Pffft.
 
You certainly are lazy. You couldn't even be bothered to read the link I ALREADY provided?

I did read it, and what I asked for wasn't in it.

Also, I already stated my laziness. I am too lazy to bother looking up stuff for something that doesn't matter to me anyway. But you are welcome to do it for me, to prove your earth-shattering point.
 
It is not necessary to read it carefully. Your link is wrong, and has been refuted. Here is an explanation of why your link is wrong:

You don't have to keep posting that link. I've read it before. It is typical ZFS zealot denial of reality. I did not respond because it is impossible to have a reasonable discussion with people like you.
 
I've not yet had problems with ZFS on Linux. If you look through the mailing lists, the bugs people have found, going back pretty far, never resulted in lost data. Fairly recently, maybe two months back now, I managed to lose a btrfs filesystem after a test system lost power. I didn't try that hard to recover it, as nothing of any use was on it.

I would definitely use ZFS on Linux over BTRFS for a home fileserver.
 
You don't have to keep posting that link. I've read it before. It is typical ZFS zealot denial of reality.
Fine.

If you can post examples that show a ZFS fsck would help, then I will admit the BTRFS devs are correct and that a fsck tool is needed. Until now, 10 years later, ZFS has never had any fsck tool and has done well all that time. It is not as if we see reports everywhere of people complaining they need a ZFS fsck. I admit I have seen a few cases where people said they wanted a ZFS fsck, but that was many years ago. To address those requests, the rollback-in-time functionality was added just a few years ago.

You say a ZFS fsck is needed? Then prove it. Show us links. If there are no such links, then a ZFS fsck is not needed - exactly what the ZFS devs have said all along. The fact is, there are not many such links requesting fsck. What does that tell you?




Regarding ZFS on Linux being more stable than BTRFS: I don't know, but if the largest supercomputer in the world is ready to trust it, then maybe it will soon be ready for prime time.
 
I wish them luck with Lustre. I tried to deploy it and ran into several bugs, including a kernel panic that a SINGLE "user-mode" program doing heavy I/O could re-create on demand. I opened a bug approximately 2 years ago which is open TO THIS DAY. I provided kernel dumps and every log they wanted, and after going back and forth with them for a few weeks, the Lustre team stopped responding. There was a follow-up some time later where another client of theirs could also reproduce the same bug.

My work offered to BUY PAID SUPPORT, and after several weeks we FINALLY got a dollar figure with no details on what that covered (# of tickets? time period? 24x7? 13x5?). After several more weeks of trying to get any kind of answer, we gave up and migrated back to NFS (NFS was actually more stable).

A buddy who is at U of M doing a post-doc has basically been unable to get his files off of their Lustre installation at any reasonable speed, because he has many, many small files (which Lustre is not designed for) and it is so slow it's useless. He actually cancelled a presentation on his research this week due to the issue. The supercomputing team has only responded to him saying they've opened a ticket with the vendor.

Lustre has a lot of CURRENT issues they should fix before they try to fix their backing store....
 
Brutalizer> I've lost data twice on ZFS due to power outages and ZFS metadata corruption where I couldn't online the pool no matter what I did. There really needs to be some fix for that, no matter if it's fsck or something else.
 
What is the sense in talking about tools (missing or not) that you need on older filesystems, instead of talking about the problem and how to solve it?

Example:
If you have a damaged, unmountable filesystem, you traditionally
- check it offline, wait, and hope for the best. On a large filesystem, this can take a long time.

If you have problems accessing files or folders on a mounted filesystem, you traditionally
- check it offline, wait, and hope for the best. On a large filesystem, this can take a long time.

If, after waiting a long time, you are then informed of a disaster, you can start to rebuild from backup (which also takes a long time).


Obviously not a good idea on large modern filesystems when you need availability.
ZFS tries to split this into its parts. If the filesystem is not mountable,
the ZFS import must try to repair it or go back to a former importable state. If this cannot be done due to a bug, as with older ZFS,
that bug has to be fixed instead of developing a workaround tool like an offline filecheck that does the same.

If the filesystem is mounted but there are remaining errors on files or folders, they are repaired either on access (self-healing)
or during an online scrub check of the whole mounted filesystem. No offline check, no offline service.

If there are errors that cannot be repaired due to missing data redundancy, you at least get informed about these files
and can restore them quickly from backup (see the sketch below). The rest of the files are correct; that is much more than you can expect from a traditional filecheck,
where you need to restore all the data if you can no longer trust it.
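In practice that list of unrepairable files comes straight from the status output, something like the following (the pool name and file path are made up for illustration):

  # after a scrub, list the files ZFS could not repair
  zpool status -v tank
  #   errors: Permanent errors have been detected in the following files:
  #     /tank/projects/results.dat        <- hypothetical example path
  # restore only those named files from backup, then reset the error counters
  zpool clear tank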

This is the way ZFS works, and this is the way all modern filesystems must work in the future.
The filesystem basics needed for this, like copy on write and checksums, are on the way in Windows ReFS and btrfs (at least partly)
- maybe (then partly) as stable as in ZFS, but only in x-y years, once they have fixed their bugs.

-> without checksums, you cannot trust metadata or data
-> without disk-error handling, RAID, and volume management integrated into the filesystem, you cannot repair on the fly (or at least not as smartly as ZFS)
-> without copy on write, where modified data blocks are always written to new locations, old stable data may be overwritten, and there is no chance to go back to a former stable state

If you encounter a real disaster, with or without ZFS, you need a backup - always.
 
Brutalizer> I've lost data twice on ZFS due to power outages and ZFS metadata corruption where I couldn't online the pool no matter what I did. There really needs to be some fix for that, no matter if it's fsck or something else.
I am not saying it is impossible to lose data with ZFS; you can certainly lose data with ZFS. I have seen some reports of this.

But I am saying that I am not convinced a ZFS fsck is needed. I want to see links to scenarios/reports where a fsck would help. Maybe the user would still lose data, fsck or not? I am not convinced fsck is needed. ZFS has never needed fsck before.



Regarding fsck: a normal fsck on Unix/Linux takes a long time. I read about one web site where they said it would be quicker to restore from backup than to run fsck.

Another problem with fsck is that fsck never checks the data. It only checks the metadata (journal log, etc). After a successful fsck, the data might still be corrupt. It is just like MS ReFS, which only has checksums for metadata, but no checksums for the data. For MS ReFS to use checksums for data too, you need to switch that on. Why are checksums for data not on by default in ReFS? I suspect because it is not reliable yet. If it worked fine, then checksums for data would be on by default.
 
Brutalizer> I've lost data twice on ZFS due to power outages and ZFS metadata corruption where I couldn't online the pool no matter what I did. There really needs to be some fix for that, no matter if it's fsck or something else.
A question: did you try to roll back your zpool in time with the "-F" flag? You can roll back to any of the last 127 states. One of those 127 states should give a functioning zpool. Did you try this "-F" flag?
 
A question: did you try to roll back your zpool in time with the "-F" flag? You can roll back to any of the last 127 states. One of those 127 states should give a functioning zpool. Did you try this "-F" flag?
When the first car rolled out, I bet there were people wondering why Ford couldn't get it together and figure out a way to attach it to a horse.
 
When the first car rolled out, I bet there were people wondering why Ford couldn't get it together and figure out a way to attach it to a horse.
What are you trying to say? That trying "-F" is obvious and everybody does it?
 
.........
But I am saying that I am not convinced a ZFS fsck is needed. I want to see links to scenarios/reports where a fsck would help. Maybe the user would still lose data, fsck or not? I am not convinced fsck is needed. ZFS has never needed fsck before.



Regarding fsck: a normal fsck on Unix/Linux takes a long time. I read about one web site where they said it would be quicker to restore from backup than to run fsck.

Another problem with fsck is that fsck never checks the data. It only checks the metadata (journal log, etc). After a successful fsck, the data might still be corrupt. ......

fsck is the last option for the filesystem.
I rarely run fsck even on ext3 :D, except on a corrupted filesystem where I need to recover some data after a failure.
fsck tries to fix filesystem integrity and does check data: it checks for orphaned links/structures, reclaims broken links so their space becomes available again, and so on.
That is why fsck will ask you, for example, "do you want to fix XXXX?", "do you want to clear links XXX?", and other prompts that depend on the particular fsck tool.

Journaling mostly speeds up fsck in putting the filesystem back into as consistent and healthy a state as possible.
There is other reserved data that fsck can use; check the details of the particular fsck you are referring to.

Not all data can be fixed by fsck, since fsck does not deal with data integrity; fsck only deals with filesystem integrity :D

Does any filesystem need fsck? I believe yes, but it is rarely used; it is the last-resort tool to recover filesystem integrity.

If you are talking about big data storage, it is true: restoring from backup is the quick way.
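For what it's worth, a harmless way to see those prompts without changing anything is a read-only check on an unmounted ext filesystem (the device name below is just an example):

  # -f forces a check even if the filesystem is marked clean,
  # -n answers "no" to every repair question, so it only reports
  e2fsck -f -n /dev/sdb1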
 
fsck tries to fix filesystem integrity and does check data
I've read in several places that fsck never checks the data; it only checks the metadata. For instance, here:
http://en.wikipedia.org/wiki/ZFS#Data_Integrity

"fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still be corrupt.

fsck must be run on an offline filesystem, which means the filesystem must be unmounted and not usable while being repaired."


Do you have information that says the opposite?
 
I've read in several places that fsck never checks the data; it only checks the metadata. For instance, here:
http://en.wikipedia.org/wiki/ZFS#Data_Integrity

"fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still be corrupt.

fsck must be run on an offline filesystem, which means the filesystem must be unmounted and not usable while being repaired."


Do you have information that says the opposite?
Please read my post carefully.

fsck does not check data integrity :p. It checks data; in plain English: it checks file consistency and does not check file integrity :). Once fsck fixes a file, the file/data may end up recovered or it may end up corrupted.

fsck relies on blocks and links, and where possible on journaling data (and other things, which can differ across OSes). A lot of this information could be called "metadata", but I prefer to point to it directly - for example, the journal. That is just my style, so as not to confuse people with the definition of metadata, since metadata also has a slightly different meaning for files, which I am not discussing here.

Here is a good, simple technical explanation, since I try not to rely on Wikipedia :p, just me:
http://lwn.net/Articles/248180/
The many faces of fsck

ZFS is a little bit different from an ordinary fsck, since ZFS has a self-repairing feature (a modern filesystem should have at least this one :)).
http://docs.oracle.com/cd/E19963-01/html/821-1462/zdb-1m.html
"...zdb performs basic consistency checks on the pool and associated datasets, and reports any problems detected."
See: the word "consistency".
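Concretely, zdb can be pointed at a pool like this; it is read-only, and the exact flags can differ between ZFS versions, so treat this as a sketch ("tank" is a placeholder):

  # check the consistency/checksums of the pool metadata (zdb never writes)
  zdb -c tank
  # give the flag twice to verify the checksums of the data blocks as well
  # (slow on a big pool, roughly comparable to a scrub)
  zdb -cc tank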

As far as I know:
you can run fsck on a mounted device/partition :). You can use the "force" option, but YMMV; it can cause more issues, since an online/mounted device/partition will have write activity going on outside the fsck process, and this would mess up the fsck run :p.
Tested on a Linux system. Please remember ZFS has a different approach; read the last link from Oracle.
 
fsck does not check data integrity :p. It checks data; in plain English: it checks file consistency and does not check file integrity
Ok, I got confused, because you explained earlier:
"fsck try to fix filesystem integrity and does check data"
But now you say fsck does not check data integrity? It checks metadata, and does some basic data checks, but does not check data integrity. Is my understanding correct?


As far as I know:
you can run fsck on a mounted device/partition :). You can use the "force" option, but YMMV; it can cause more issues, since an online/mounted device/partition will have write activity going on outside the fsck process, and this would mess up the fsck run :p.
Ok, so for important data you should avoid running fsck on a mounted filesystem, it seems. fsck can be forced on a mounted filesystem, but that can mess things up and is not recommended. Correct?



EDIT: I heard about an XFS user who ran fsck on 6TB of data in 2 minutes or so. That means XFS did not check all the data, because scanning 6TB takes longer than 2 minutes. XFS skipped a lot of data; it only checked the metadata. If you scan data for 2 minutes, you can scan, say, 10GB. This means the remaining ~5,990GB of data was not scanned.
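Rough numbers, assuming a single disk streaming at about 150 MB/s (the real rate obviously varies):

  # data you can read in 2 minutes at ~150 MB/s
  echo '150 * 120' | bc                 # 18000 MB, i.e. roughly 18 GB
  # time to read all 6 TB at the same rate, in hours
  echo '6 * 10^6 / 150 / 3600' | bc -l  # about 11 hours

So a 2-minute check clearly cannot have touched more than a tiny fraction of the data.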
 