ZFS on Linux vs. MDADM ext4

danswartz · Oct 29, 2012

If you are using some flavor of solaris with napp-it, it's under the jobs menu.

Rudde93 · Oct 30, 2012

I'm in debian with MDADM...

cantalup · Oct 30, 2012

Rudde93 said:
So how do I scrub pools and take snapshots?

You should try an OpenSolaris flavor or Solaris with napp-it; everything is in the GUI.

On Linux, you have to use the command line. There are simple ZoL tutorials, as I remember; take a look.

cantalup · Oct 30, 2012

Rudde93 said:
I'm in debian with MDADM...

I am on CentOS.

What is your problem with mdadm?

Just my suggestions:
1) Try mdadm with another Linux distro, for example CentOS or Ubuntu.
2) Try ZoL (with caution: not all features are implemented and there are some pitfalls to avoid; check the ZoL mailing list).
3) Pick whatever you like.
Notes:
* For a "real" production server, I would suggest mdadm on Linux,
or ZoL for a non-critical Linux server, for example a minor backup server,
or jump to OpenSolaris/Solaris ZFS with napp-it.

Rudde93 · Oct 30, 2012

The whole point of this post is to get away from Solaris, so the Solaris suggestions aren't helping at all.

I run Debian, XFS + MDADM, RAID-6.

I want to know how I can scrub pools, how I can snapshot pools, and how to send pool status by email.

Aesma · Oct 30, 2012

cantalup said:
What filesystem are you using?
Some filesystems add padding bytes, which can look like a bad/corrupted file during checksumming.
Can you load those files in the program or app that needs them, to check whether they really are bad/corrupted?

Get better hardware to minimize your headache...

Windows/NTFS and most of these files are video so there is no way that I know of to check them. Usually the changed bit(s) will cause a glitch (I use Total Commander to find the differences between two files, then compute at which time in the video it would happen).
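
For anyone who wants to script the comparison described above, here is a minimal sketch that uses cmp instead of Total Commander. The function name first_diff_seconds is made up for illustration, and the time estimate assumes a constant bitrate, which real video files rarely have, so treat the result as a rough pointer only.

```bash
# first_diff_seconds <file_a> <file_b> <duration_seconds>
# Prints the approximate playback time (in seconds) of the first differing
# byte, assuming a constant bitrate. Returns 1 if no differing byte is found.
first_diff_seconds() {
    local offset size
    # cmp -l lists every differing byte as: <1-based offset> <octal> <octal>
    offset=$(cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{print $1}')
    [ -n "$offset" ] || return 1
    size=$(wc -c < "$1")
    echo $(( offset * $3 / size ))
}

# Usage (file names are placeholders):
#   first_diff_seconds original.mkv backup.mkv 5400
```

Dropping the `head -n 1` and looping over all lines of `cmp -l` would list every glitch instead of only the first one.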

JoeComp · Oct 30, 2012

Rudde93 said:
I run Debian, XFS + MDADM, RAID-6.

I want to know how I can scrub pools, how I can snapshot pools, and how to send pool status by email.

You can start an md check (scrub), as root, with:

Code:

echo repair > /sys/block/md_/md/sync_action

Of course, replace md_ with whatever volume you want to repair.

If you just want a check without repair, replace "repair" with "check".
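
A minimal sketch of wrapping this in two small functions, in case it helps. The function names and the SYSFS_ROOT override are my own additions (the override only exists so the functions can be tried against a scratch directory); sync_action and mismatch_cnt are the sysfs files the md driver exposes.

```bash
# SYSFS_ROOT is a hypothetical override for dry runs; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# md_scrub <array> [check|repair]   e.g. md_scrub md0 check   (needs root)
md_scrub() {
    local array="$1" action="${2:-check}"
    case "$action" in
        check|repair) ;;
        *) echo "unsupported action: $action" >&2; return 2 ;;
    esac
    printf '%s\n' "$action" > "$SYSFS_ROOT/block/$array/md/sync_action"
}

# md_mismatches <array>   prints the mismatch count from the last check
md_mismatches() {
    cat "$SYSFS_ROOT/block/$1/md/mismatch_cnt"
}
```

Progress of a running check shows up in /proc/mdstat.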

https://raid.wiki.kernel.org/index.php?title=Scrubbing&oldid=2362
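
For the email part of the question, mdadm's monitor mode can send mail to the address on the MAILADDR line of mdadm.conf. A minimal sketch, assuming the Debian config path, GNU sed, and a working local mail setup; the helper name set_md_mailaddr and the address are placeholders of mine.

```bash
# set_md_mailaddr <address> [conf]: make sure mdadm.conf has exactly one MAILADDR line.
set_md_mailaddr() {
    local addr="$1" conf="${2:-/etc/mdadm/mdadm.conf}"
    if grep -q '^MAILADDR' "$conf" 2>/dev/null; then
        # replace the existing address in place (GNU sed)
        sed -i "s|^MAILADDR.*|MAILADDR $addr|" "$conf"
    else
        printf 'MAILADDR %s\n' "$addr" >> "$conf"
    fi
}

# Usage (as root):
#   set_md_mailaddr admin@example.com
#   mdadm --monitor --scan --oneshot --test   # sends a test alert for each array
```

If the test alert arrives, real events such as a failed disk should reach the same address as long as the monitor is running.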

danswartz · Oct 30, 2012

As far as the snapshot part, I think you need LVM on top of md for that, no?

danswartz · Oct 30, 2012

If you really want to keep using debian, can you switch to zfs on linux? Does everything you want. Forgive me if this was already asked - I didn't see it if so...

kac77 · Oct 30, 2012

danswartz said:
As far as the snapshot part, I think you need LVM on top of md for that, no?

Correct, I was just about to type that.

Code:

sudo apt-get install lvm2

That installs the LVM tools for pooling, snapshots, etc. Rather than typing out all of the commands you'll need, the easiest thing would be to read this:

http://www.thegeekstuff.com/2010/08/how-to-create-lvm/
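
As a rough idea of what the snapshot step looks like once a volume group exists, here is a sketch that only prints the commands instead of running them. The names vg0 and data and the 10G size are placeholders, not anything from this thread.

```bash
# lvm_snapshot_plan <vg> <lv> <size>: print the commands for a classic LVM snapshot.
lvm_snapshot_plan() {
    local vg="$1" lv="$2" size="$3"
    local snap="${lv}_snap"
    # reserve <size> of free space in the volume group for changed blocks
    echo "lvcreate --snapshot --size $size --name $snap /dev/$vg/$lv"
    # list logical volumes, including how full the snapshot is
    echo "lvs $vg"
    # drop the snapshot when it is no longer needed
    echo "lvremove /dev/$vg/$snap"
}

lvm_snapshot_plan vg0 data 10G
```

When mounting a snapshot of an XFS volume next to its origin, the nouuid mount option is needed because both carry the same filesystem UUID.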

danswartz · Oct 30, 2012

My main objection to md+lvm+FS is that everything is separated. For example, if you want to do snapshots with LVM (at least this is how it used to work), you have to guess how much space you'll want for them and set that aside for lvm to use. If you guess too small, you don't get the snapshots you want; if you guess too big, you waste space.

Rudde93 · Oct 30, 2012

JoeComp said:
mdadm can start a check (scrub) by:

Code:

echo repair > /sys/block/md_/md/sync_action

Of course, replace md_ with whatever volume you want to repair.

If you just want a check without repair, replace "repair" with "check".

https://raid.wiki.kernel.org/index.php?title=Scrubbing&oldid=2362

-bash: /sys/block/md1/md/sync_action: Permission denied

I did sudo it.

kac77 · Oct 30, 2012

danswartz said:
My main objection to md+lvm+FS is that everything is separated. For example, if you want to do snapshots with LVM (at least this is how it used to work), you have to guess how much space you'll want for them and set that aside for lvm to use. If you guess too small, you don't get the snapshots you want; if you guess too big, you waste space.

There isn't any real guesswork here though. No more guesswork than creating a vdev out of how many disks vs what the end user requires.

In terms of lvm being a separate service it's no more an issue than any other OS. SMB and NFS are separate services and they drive everything from Napp-It to Windows.

kac77 · Oct 30, 2012

Rudde93 said:
I did sudo it.

Here you go.

http://ubuntuforums.org/showthread.php?t=1647483
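
The short version of why "sudo echo ... > file" fails: the redirection is opened by your own unprivileged shell before sudo ever runs, so only echo gets root rights. A minimal sketch of the usual workaround with tee; the function name and the SUDO variable are my own (set SUDO="" to try it on a scratch file as a normal user).

```bash
# SUDO is a hypothetical knob: defaults to sudo, may be set to an empty string.
SUDO="${SUDO-sudo}"

# write_as_root <value> <file>: the file is opened by tee, which runs under sudo.
write_as_root() {
    printf '%s\n' "$1" | $SUDO tee "$2" > /dev/null
}

# Usage (md1 as in the error message above):
#   write_as_root check /sys/block/md1/md/sync_action
# Alternative that runs the whole redirection inside a root shell:
#   sudo sh -c 'echo check > /sys/block/md1/md/sync_action'
```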

danswartz · Oct 30, 2012

kac77 said:
There isn't any real guesswork here though. No more guesswork than creating a vdev out of how many disks vs what the end user requires.

In terms of lvm being a separate service it's no more an issue than any other OS. SMB and NFS are separate services and they drive everything from Napp-It to Windows.

I don't think we're communicating here. Of course there is guesswork. You put N drives together into a raid volume. You then have to decide how to slice and dice it to assign space to various partitions on the volume. You also have to take a guess as to how much space you need for snapshots. None of this is true for ZFS, since the snapshots and filesystems all draw free blocks from a common pool. I have no idea where you got the 'lvm being a separate service' thing - I never referred to 'services', which is an OS abstraction. I was referring to conceptual layers - if they are too opaque, one layer cannot make intelligent decisions based on what a different layer might want/do.

kac77 · Oct 30, 2012

danswartz said:
I don't think we're communicating here. Of course there is guesswork. You put N drives together into a raid volume. You then have to decide how to slice and dice it to assign space to various partitions on the volume. You also have to take a guess as to how much space you need for snapshots. None of this is true for ZFS, since the snapshots and filesystems all draw free blocks from a common pool.

I assume you are talking about thin provisioning and even here you need to always calculate how many drives are required, how much space is expected, and your burn through rate. All thin provisioning does is change or delay when you need to figure out your usage patterns. It doesn't remove them. Otherwise the file system in question would be called "magic" and we all wouldn't have to address storage ever again.

danswartz said:
I have no idea where you got the 'lvm being a separate service' thing - I never referred to 'services', which is an OS abstraction. I was referring to conceptual layers - if they are too opaque, one layer cannot make intelligent decisions based on what a different layer might want/do.

I'm sorry but you need to be more detailed here. Rather than jump to conclusions please give an example of how this relates to file systems and in what example would this be beneficial to the end user.

danswartz · Oct 30, 2012

I guess you could say it's thin provisioning, if you want. Of course you have to monitor your usage; I never claimed otherwise. What I am saying is that with something like md+lvm, you need to decide ahead of time how and where to allocate your space, and you cannot change that without major hassle (many filesystems allow you to grow but not shrink, so reallocating storage among multiple filesystems on top of md+lvm is a chore which doesn't exist with ZFS). The magic part? Nice strawman; you are rebutting assertions I never made. I am just claiming that the integrated architecture of ZFS removes a lot of the hassles a static setup (md or hw raid + lvm + fs) forces you to go through. As far as 'more details': if you create a zfs pool with, say, 5 filesystems, you needn't worry about how much goes to which one (if you want to keep one or more from being storage hogs, you can throw a quota on them). If you want to take a number of snapshots which consume large percentages of your total space, you can do so without having to budget for that when you set up the pool. Again, I'm not saying this is bulletproof, but it is certainly much less restrictive than the other approach, which is a consequence of where free space is tracked. If lvm could say 'I need X blocks for a snapshot', that would be peachy, but it can't, since md+lvm requires you to pre-assign all space when you create your volumes.
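
To make the pool example concrete, here is a sketch that prints the ZFS commands rather than running them. The pool name tank, the filesystem names, the quota, and the snapshot name are all placeholders; zfs_plan is a made-up helper.

```bash
# zfs_plan <pool> <quota> <fs>...: print commands for filesystems sharing one pool.
zfs_plan() {
    local pool="$1" quota="$2" fs
    shift 2
    for fs in "$@"; do
        # no size given: every filesystem draws free blocks from the pool
        echo "zfs create $pool/$fs"
    done
    # optional cap on the first filesystem so it cannot hog the pool
    echo "zfs set quota=$quota $pool/$1"
    # recursive snapshot of everything; no space is reserved up front
    echo "zfs snapshot -r $pool@before-upgrade"
}

zfs_plan tank 500G media home backup
```

Note that nothing in the plan assigns a size to a filesystem or to the snapshot, which is the difference being argued about here.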

kac77 · Oct 30, 2012

danswartz said:
I guess you could say it's thin provisioning, if you want.

I could go out of my way to help you here, but I'm not going to. What I will say is that your knowledge of what lvm can and can't do is limited, and leave it at that.

bexamous · Oct 31, 2012

kac77 said:
I could go out of my way to help you here, but I'm not going to. What I will say is that your knowledge of what lvm can and can't do is limited, and leave it at that.

How about just not posting at all?

danswartz · Oct 31, 2012

Nice. I admitted my understanding of lvm was limited. Certainly several years ago (my last exposure to it), it had the limitations I mentioned. At no point have you explained otherwise; you just made a snarky comment and ran. Oh, by the way, I wasn't the one asking for help, I was trying to provide it. LOL...

brutalizer · Oct 31, 2012

cantalup said:
Get better hardware to minimize your headache...

Here we go again... As I said, better hardware will not help against data corruption. I have told you this several times before:

For instance, heavy cosmic radiation will corrupt data and can even destroy electronics. Heavy solar bursts destroy satellites and corrupt data. Better hardware will not help against cosmic radiation.
http://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electronics

If you make a loud sound close to the disks, they will be affected. If you scream loud at the disks, they will wobble. Some of these wobbles will corrupt the data. Maybe there will be ghost writes, where the disk writes go into the wrong sector on the disk.
http://www.youtube.com/watch?v=tDacjrSCeq4

If your SATA disk cable is loose, you will get data corruption. Maybe your SATA disk cable has not been correctly inserted, or it has been wriggled free when you tried to close the chassis? Do you think better server hardware will help against this?

If your switch or router is faulty, you will get errors.
http://jforonda.blogspot.se/2007/01/faulty-fc-port-meets-zfs.html

If your power supply is faulty, you will get errors.
https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta

5-10% of all data corruption is because of bugs in the disk firmware:
http://static.usenix.org/event/fast08/tech/full_papers/jiang/jiang.pdf

When you send something via TCP/IP, there might be errors. That is the reason Linux vendors distribute MD5 checksums, so you can compare ISO checksums to see that everything is correct. I have been forced to download ISO files again because the first copy was corrupt. Better hardware will not protect against TCP/IP errors.

When you buy hardware, even enterprise-class hardware, it might be faulty. Or have bugs in the disk firmware. Or anything. Maybe it took a hard knock in the ship or truck while it was being shipped?

Better hardware will not eliminate data corruption, because there are many error sources outside the PC: cosmic radiation, vibrations, room heat, shaky power that affects reads and writes, bugs in disk firmware, etc.

There are bugs in ALL software. No software is 100% bug free. Even enterprise software will have bugs. Even disk firmware will have bugs, no matter how expensive the disks.

ZFS will catch all of the errors above: disk firmware bugs, faulty power supplies, vibrations, cosmic radiation, etc. Everything. Even with bad hardware, ZFS will detect the corruption above. ZFS on bad hardware gives better protection against data corruption than non-ZFS on enterprise hardware.

Aesma said:
I don't know if it's silent data corruption as in "the file was fine on the drive, then it was corrupted", or rather, "the file was never fine because it was corrupted during transfer". I just know that until recently I didn't do hashes of my files but after I discovered that a simple move had corrupted a file I started doing this and find differences between original and backup. What is worse is that I often switch the backup and original drives to even out wear so I don't even know which file is good when I find inconsistencies.

Yes, the last sentence is a big, big problem. You need to compute checksums of all files, and update the checksum file when you add more files. Then you need to recompute the checksums every month to see if any files have been altered. And then you need to fetch the good copy from the backup. So this is a huge pain if you do it manually.

Or you can just use ZFS, which does all of the above automatically for you. Just start a "scrub" every month, and you are done.
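
For comparison, a minimal sketch of the manual routine described above, using GNU sha256sum. The function names are made up; the manifest should be given as an absolute path and kept outside the directory being checked.

```bash
# make_manifest <dir> <manifest>: record a SHA-256 for every file under <dir>.
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + ) > "$2"
}

# verify_manifest <dir> <manifest>: non-zero exit if any file changed or is missing.
verify_manifest() {
    ( cd "$1" && sha256sum --check --quiet "$2" )
}

# Usage (paths are placeholders):
#   make_manifest /srv/video /root/video.sha256     # after adding files
#   verify_manifest /srv/video /root/video.sha256   # monthly, e.g. from cron
#
# The ZFS counterpart, assuming a pool named tank:
#   zpool scrub tank
#   zpool status -v tank
```

The manual version only tells you that a file changed; fetching the good copy from backup is still up to you.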

cantalup · Oct 31, 2012

Aesma said:
Windows/NTFS and most of these files are video so there is no way that I know of to check them. Usually the changed bit(s) will cause a glitch (I use Total Commander to find the differences between two files, then compute at which time in the video it would happen).

Just trying to understand: are you using a remote SMB share or local NTFS?
Do the "changed bits" occur at the beginning, in the middle, at the end, or randomly in the file?

cantalup · Oct 31, 2012

kac77 said:
I could go out of my way to help you here, but I'm not going to. What I will say is that your knowledge of what lvm can and can't do is limited, and leave it at that.

Well, about a brief explanation: LVM is spread out, especially going from LVM to LVM2.

cantalup · Oct 31, 2012

danswartz said:
Nice. I admitted my understanding of lvm was limited. Certainly several years ago (my last exposure to it), it had the limitations I mentioned. At no point have you explained otherwise; you just made a snarky comment and ran. Oh, by the way, I wasn't the one asking for help, I was trying to provide it. LOL...

LVM2 is commonly used (ahem) in RHEL/CentOS and other Linux distros (commercial and non-commercial).
It is still powerful, but it needs some imagination from users.
Using snapshots on LVM2 needs a bit more technical understanding the first time than running a ZFS snapshot.

Just a bit off topic:
even with BTRFS, LVM2 would still be useful as I see it now.
I read that the Linux 3.x kernels have many BTRFS updates, but it is still buggy.
BTRFS development also gets patches from contributors outside Oracle, for example Red Hat and other companies.
It looks like some companies need a native, GPL-licensed, Linux-style "ZFS" for future deployment.

danswartz · Oct 31, 2012

cantalup said:
Well, about a brief explanation: LVM is spread out, especially going from LVM to LVM2.

My original experience with lvm was as described in my earlier posts. e.g. when you created volumes you needed to decide how much to allot for them and how much for clients (filesystems/partitions). If newer lvm implementations do not share that shortcoming, it should have been simple for kac77 to explain why. Instead he took a cheap shot and left. I'm not sure from your reply if you are agreeing or disagreeing with him. If the former, you are not making things any clearer than he did.

cantalup · Oct 31, 2012

danswartz said:
My original experience with lvm was as described in my earlier posts. e.g. when you created volumes you needed to decide how much to allot for them and how much for clients (filesystems/partitions). If newer lvm implementations do not share that shortcoming, it should have been simple for kac77 to explain why. Instead he took a cheap shot and left. I'm not sure from your reply if you are agreeing or disagreeing with him. If the former, you are not making things any clearer than he did.

There are a lot of resources on LVM2.
I neither agree nor disagree, since people have their own ways of using LVM2.

Explaining LVM2 takes time; just look at the online resources, some of which are linked in this thread.
Since I am using CentOS (a RHEL flavor), I always look at the official RHEL online tutorials/help.

LVM2 is not new; this is the reason I mentioned BTRFS in my previous post.

As far as I know, you have to pick an initial size for a snapshot; when it needs to be larger, you can grow it. A snapshot can be deleted at any time when it is no longer needed.
Shrinking an LVM2 snapshot? I have never tried that since I have had no need to, but some people have posted that they can do it.
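
A small sketch of the "grow it when needed" routine, since a classic LVM snapshot becomes invalid once it fills up completely. The helper needs_growth, the 80 percent threshold, and the names vg0/data_snap are placeholders of mine; data_percent is the lvs report field on current LVM2 releases, and older releases may name it differently.

```bash
# needs_growth <percent_used> <threshold>: succeeds when the snapshot is at
# or above the threshold. Decimal places (lvs prints e.g. 42.17) are dropped.
needs_growth() {
    local used="${1%%.*}"
    [ "${used:-0}" -ge "$2" ]
}

# Usage on a real system (as root):
#   used=$(lvs --noheadings -o data_percent vg0/data_snap | tr -d ' ')
#   if needs_growth "$used" 80; then
#       lvextend --size +5G /dev/vg0/data_snap
#   fi
#   lvremove /dev/vg0/data_snap    # when the snapshot is no longer needed
```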

As I posted, ZFS snapshots are more human-friendly than LVM2 snapshots, which need technical understanding the first time.

danswartz · Oct 31, 2012

I'm still not clear I guess. My understanding of LVM: when you set up a pool (or whatever LVM calls it), do you not need to specify how much space to reserve for snapshots? If so, that was my point. If not, a simple explanation should suffice without requiring the other person to go searching for LVM resources online (I don't think I'm being unreasonable here - I provided a brief explanation of how ZFS works, and didn't say 'there is plenty of ZFS info out there, go read it...') To elaborate on my original point: with zfs, free space comes from pool itself, so you don't need to create fixed size partitions, whether they can be grown or shrunk is then up to the filesystems in question. Again, not saying this makes md+lvm+fs bad, just less flexible and requiring manual intervention for cases where zfs 'just works'. That's all I was ever saying to begin with...

kac77 · Oct 31, 2012

danswartz said:
I'm still not clear I guess.

LVM2 is very flexible and more than most people are aware of. We would be here all day talking about ways to manage snapshots, live migrations, benefits to using whole drives vs partitions, the role of pvdisplay and lvdisplay, etc. So telling you to read up on it would do you far more good than just arguing with you. How would that help?

That being said, if you need to understand something, you can create a thread and I'll be more than happy to assist. It's not fair to the OP for you to argue about something that you have questions about.

cantalup · Oct 31, 2012

danswartz said:
I'm still not clear I guess. My understanding of LVM: when you set up a pool (or whatever LVM calls it), do you not need to specify how much space to reserve for snapshots? If so, that was my point. If not, a simple explanation should suffice without requiring the other person to go searching for LVM resources online (I don't think I'm being unreasonable here - I provided a brief explanation of how ZFS works, and didn't say 'there is plenty of ZFS info out there, go read it...') To elaborate on my original point: with zfs, free space comes from pool itself, so you don't need to create fixed size partitions, whether they can be grown or shrunk is then up to the filesystems in question. Again, not saying this makes md+lvm+fs bad, just less flexible and requiring manual intervention for cases where zfs 'just works'. That's all I was ever saying to begin with...

When I started looking into ZFS, I read up on it before jumping in; it is better to have some background than to start from a blank slate.
I always ask whoever asks me, "Do you already have background knowledge?" This helps speed things up, especially for LVM2, which is flexible and has massive real-life deployments.
On LVM2, in practice you mostly use logical volumes, not partitions, for snapshots (in reply to your bolded text).

In my opinion (NOT regarding the RAID mechanism, since LVM2 only has simple mirroring as I understand it), LVM2 is more flexible than ZFS, since LVM2 can work with many filesystems, whereas ZFS has its filesystem built in, which limits flexibility. This is only my opinion, and you are free to disagree.

Honestly, learning LVM2 takes more effort than learning ZFS, since LVM2 is very dynamic in daily use. This is just my understanding too.

I would like to stop here and will come back when I see an LVM2 thread.
See you again in an LVM2 thread.

danswartz · Oct 31, 2012

I think this horse is well and truly dead.

Rudde93 · Oct 31, 2012

Seriously guys, this is the worst case of thread hijack I have ever seen!

cantalup · Oct 31, 2012

Rudde93 said:
Seriously guys, this is the worst case of thread hijack I have ever seen!

Not really the worst, if you look at some ZFS threads...

This is the reason I hit the brakes.

Have you decided to use mdadm+ext4 or ZoL?

***************
I even got a "harsh" warning from someone who does not like ZoL, someone who thinks Solaris 11 ZFS is superior to other ZFS implementations, hehehe.

Rudde93 · Nov 1, 2012

I had to go with MDADM + XFS since ext4 doesn't support partitions over 16 TB, but I will keep myself posted on ZoL development and really hope it gets good, because ZFS is what I really want.
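
For reference, the 16 TB figure falls out of simple arithmetic, assuming ext4 is created without its 64bit feature, so that block numbers are 32 bits wide. The function name below is made up for illustration.

```bash
# ext4_32bit_limit_tib <block_size_bytes>: largest filesystem, in TiB, that
# 32-bit block numbers can address at the given block size.
ext4_32bit_limit_tib() {
    echo $(( (1 << 32) * $1 / (1 << 40) ))
}

ext4_32bit_limit_tib 4096   # 2^32 blocks * 4 KiB = 2^44 bytes = 16 TiB
```

As far as I know, going beyond that needs the 64bit feature and an e2fsprogs version that supports it.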

Also, I got my 10 GbE switch and 10 GbE card yesterday, just need a cable now and I will test speeds over network :-D

cantalup · Nov 1, 2012

Rudde93 said:
I had to go with MDADM + XFS since ext4 doesn't support partitions over 16 TB, but I will keep myself posted on ZoL development and really hope it gets good, because ZFS is what I really want.

Also, I got my 10 GbE switch and 10 GbE card yesterday, just need a cable now and I will test speeds over network :-D

Great, since you have already decided.

Speaking of ZoL, you can try it on a demo machine and test all your scenarios.

I was actually waiting patiently for BTRFS, and then ZoL came into view.
I would imagine BTRFS will be "identical" to ZoL when both are mature. The difference is that BTRFS is GPL-licensed and already merged into the kernel.

Until now, ZoL and mdadm+filesystem have been good enough for me.

That's a pretty high price for 10 GbE. I wish I had it.

Rudde93 · Nov 1, 2012

Why is Oracle developing BTRFS for Linux? I had actually never heard of it; it seems very interesting. Will it be a RAID filesystem, or do I have to use MDADM + BTRFS?

cantalup · Nov 1, 2012

Rudde93 said:
Why is Oracle developing BTRFS for Linux? I had actually never heard of it; it seems very interesting. Will it be a RAID filesystem, or do I have to use MDADM + BTRFS?

btrfs has been around for years, https://btrfs.wiki.kernel.org/index.php/Main_Page
"Btrfs is under heavy development, but every effort is being made to keep the filesystem stable and fast. Because of the speed of development, you should run the latest kernel...."

I tried it on Fedora for testing around 2 years ago; it was buggy.
I see development has been speeding up recently with the 3.x kernels.

When you use btrfs, you do NOT need mdadm.
mdadm would become history.
btrfs is ZFS-like, with some differences.

why did oracle start developing btrfs? you can read on http://www.linux-magazine.com/Online/News/Future-of-Btrfs-Secured and http://www.mail-archive.com/[email protected]/msg02261.html

I will absolutely jump to btrfs when it is mature enough.

omniscence · Nov 2, 2012

I _do_ use btrfs on mdadm. btrfs does not support a parity based redundancy scheme yet and it will stay that way for a long time probably. While the "raid1" scheme of btrfs may be okay for 2 or 3 disks, it is quite wasteful for 8 or more drives. The main reasons I use it are the snapshotting and checksumming capabilities. If a corrupt file is detected I can always restore it from (daily) backups.
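
A sketch of what such a daily routine might look like on top of an existing md array, printing the commands rather than running them. The mount point /data and the .snapshots directory are placeholders, and btrfs_daily_plan is a made-up helper, not the poster's actual setup.

```bash
# btrfs_daily_plan <mountpoint> <day>: print a read-only snapshot plus a scrub.
btrfs_daily_plan() {
    local mnt="$1" day="$2"
    # read-only snapshot of the subvolume mounted at <mountpoint>
    echo "btrfs subvolume snapshot -r $mnt $mnt/.snapshots/$day"
    # verify checksums of all data; -B keeps the scrub in the foreground
    echo "btrfs scrub start -B $mnt"
    echo "btrfs scrub status $mnt"
}

btrfs_daily_plan /data "$(date +%F)"
```

With btrfs on a single md device the scrub can detect a corrupt file but has no second copy of the data to repair it from, which is where the daily backups come in.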

cantalup · Nov 2, 2012

omniscence said:
I _do_ use btrfs on mdadm. btrfs does not support a parity based redundancy scheme yet and it will stay that way for a long time probably. While the "raid1" scheme of btrfs may be okay for 2 or 3 disks, it is quite wasteful for 8 or more drives. The main reasons I use it are the snapshotting and checksumming capabilities. If a corrupt file is detected I can always restore it from (daily) backups.

We will see where btrfs goes in the near future; many companies are pushing for it to replace ext4, now or soon.
We have many tools on Linux, for example mdadm, btrfs (waiting for it to mature), lvm2, ext4 (ext3 on steroids to me; it should be a gap filler until btrfs matures), xfs, and others.

My general assumption (just mine): mdadm will be deployed less once btrfs is stable and mature with features.

drescherjm · Nov 2, 2012

My general assumption (just mine): mdadm will be deployed less once btrfs is stable and mature with features.

At work (where I now have around 50TB of storage with the 8TB added this week) I run all servers with mdadm raid 6 as the raid layer and, for the most part, ext4 or now btrfs as the filesystem. I have used and tested btrfs for over a year (mostly for data that could be lost) and have begun to migrate away from lvm2, since btrfs subvolumes and snapshots replace my usage of lvm2. Any migration away from mdadm, though, I would expect to be years away, even after btrfs gets the raid5/6 patches. I will have to thoroughly test btrfs raid5/6 in disaster situations (as I have done many times, and continue to do, with mdadm), like hot-pulling 3 drives from a raid 6 and then seeing whether I can force the raid to recover with minimal loss. With mdadm I am very confident that I can recover in that situation. BTW, during my mdadm testing a few months back I triggered a kernel bug (scsi_remove_target) that was fixed, and I ended up getting my name in the kernel logs / patch.

With that said about moving away from lvm2, I see some interesting recent developments in lvm2: they are adding raid 5/6 to lvm, although I highly doubt I would ever use that.

cantalup · Nov 2, 2012

drescherjm said:
... BTW, during my mdadm testing a few months back I triggered a kernel bug (scsi_remove_target) that was fixed, and I ended up getting my name in the kernel logs / patch.

With that said about moving away from lvm2, I see some interesting recent developments in lvm2: they are adding raid 5/6 to lvm, although I highly doubt I would ever use that.

You are a kernel patch contributor. Nice!

RAID 5/6 on lvm? Hmm... that makes lvm2 (or newer) fat, aka complicated.
