Best mdadm array

MrSmoke · Nov 17, 2010

So I've got 3 x F4EG 2TB drives. I want to create a RAID5 array for 4-15GB files etc. I've used mdadm+lvm+ext3 before and now I'm looking at ext4 or xfs.

What I want to be able to do is add more 2TB drives later (I don't think I need to use LVM any more?). Performance in DC++ would be nice as well (I think it does lots of small reads/writes? I don't really remember off the top of my head).

Should I create the array on the raw disks (/dev/sda) or should I partition them first (/dev/sda1)? Will I get alignment issues with these drives?
I was getting pretty crap speeds with RAID5 on 4x15EADS + lvm + ext3, hence the reason to change.

What would be the best way to create this array and allow me to do online expansion later? Sorry for the sucky layout/grammar in this post lol.

Thanks

tgrimley · Nov 17, 2010

Make a Linux RAID partition on each drive. It's much nicer for when things go wrong. Easier to identify drives.

They should look like this:

Code:

tgrimley@XXXXXXXX:~$ sudo fdisk -l /dev/sda


Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xde78d2f3

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1      243201  1953512001   fd  Linux raid autodetect

Create the array with /dev/sd[abc]1 or whichever. It's much easier later when you can see a partition on the drives. Expansion doesn't need anything special. Just add (sudo mdadm --add /dev/md0 /dev/sdd1) and grow (sudo mdadm --grow /dev/md0 --raid-devices=4). Once the reshape finishes, unmount the filesystem, fsck it, expand it, and remount.
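The add-and-grow sequence above could be sketched as a dry run that only prints the commands in order. The helper name grow_plan, the mount point /mnt/storage, and the choice of e2fsck/resize2fs (i.e. an ext3/ext4 filesystem directly on the array) are assumptions for illustration; nothing here touches a disk.

```shell
#!/bin/sh
# Dry run: print the commands for adding one disk to an md array and
# then growing an ext3/ext4 filesystem on it. Nothing is executed.
# grow_plan ARRAY NEW_PARTITION NEW_DEVICE_COUNT MOUNT_POINT  (hypothetical helper)
grow_plan() {
    array=$1; new_part=$2; count=$3; mnt=$4
    echo "mdadm --add $array $new_part"
    echo "mdadm --grow $array --raid-devices=$count"
    echo "# wait for the reshape in /proc/mdstat to finish before continuing"
    echo "umount $mnt"
    echo "e2fsck -f $array"
    echo "resize2fs $array"
    echo "mount $array $mnt"
}

# Example values only: /mnt/storage is an assumed mount point.
grow_plan /dev/md0 /dev/sdd1 4 /mnt/storage
```

Printing the plan first makes it easy to double-check device names before running anything as root.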

You can tweak the chunk-size to find something that works well for large seq read/write and for small read/write. Depends on your setup. You can run bonnie++ or similar to find what works. Take the time to do this; it is worth it.

Relativistic · Nov 17, 2010

Use this to find the optimal stride and stripe size: http://busybox.net/~aldot/mkfs_stride.html
I'd use a chunk size of 128K+ given that you are going to store large files.

Don't use lvm unless you really need it (you don't).

The only problem is that the stride/stripe settings won't be optimal after you grow the array.

Child of Wonder · Nov 17, 2010

XFS would work great for large files. In performance tests it still squeaks out a lead over EXT4.

And I agree with the previous poster, partition each drive. Makes managing things easier.

Gambit · Nov 17, 2010

Out of curiosity, how does it make it easier than just putting the filesystem on the drive itself? Wouldn't you have to deal with aligning the partitions on the drive if you actually partition the drive?

I ask this because I'm working on setting up a 3 drive raid 5 of the Samsung 2tb HD204UI drives myself and haven't partitioned the drives... I don't see how it'll help things down the road.

john4200 · Nov 17, 2010

I also fail to see the advantage of creating the RAID on partitions instead of raw block devices.

Unless I have a specific reason (e.g., different-sized HDDs), I install mdadm RAIDs onto the raw block devices. And then I install XFS onto the raw RAID block device /dev/md_

I usually specify the devices to mdadm using /dev/disk/by-id/

The recent versions of mdadm use a default chunk size of 512KB. I think that is a good choice for the usage you have described.

Regardless, after you do an expansion with mdadm, you would also have to expand your filesystem. With XFS, you use xfs_growfs on the mounted filesystem.

Gambit · Nov 17, 2010

john4200 said:
The recent versions of mdadm use a default chunk size of 512KB. I think that is a good choice for the usage you have described.

You sure about that? I *just* installed Ubuntu 10.10 server last night, and its default (as listed under: $man mdadm) is 64KB

john4200 · Nov 17, 2010

Gambit said:
You sure about that? I *just* installed Ubuntu 10.10 server last night, and its default (as listed under: $man mdadm) is 64KB

Yes, I am sure. Neil Brown, the mdadm developer, has stated it on the linux raid email list. I'm not certain at what version the change occurred, but 512KB has been the default for many months now, if not more than a year. I know that version 3.1.2 (2010 Mar 10) defaults to 512KB chunks.

For some reason, Ubuntu is way behind on mdadm. I think they are only up to 2.6.7 or so.

MrSmoke · Nov 17, 2010

OK, so xfs is the winner. I noticed it has a nice and easy grow method (so long, lvm, lol).

So creating the partition on the drives won't give me alignment issues with this drive? I did think that having partitions on the drive would make it easier to identify whether the drive is being used or not, so I don't overwrite the data on a raw drive.
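A rough way to check the alignment question is to test whether a partition's starting byte offset is a multiple of the physical sector size. The sketch below assumes a drive with 4 KiB physical sectors that reports 512-byte logical sectors; the thread does not confirm that for any specific drive, and is_aligned is a made-up helper name. The start sector would come from a partition listing shown in sector units.

```shell
#!/bin/sh
# is_aligned START_SECTOR [LOGICAL_BYTES] [PHYSICAL_BYTES]  (hypothetical helper)
# Prints "aligned" if the partition's byte offset is a multiple of the
# physical sector size, otherwise "misaligned".
is_aligned() {
    start=$1
    logical=${2:-512}     # bytes per logical sector, as reported by fdisk
    physical=${3:-4096}   # ASSUMED physical sector size (4 KiB)
    if [ $(( start * logical % physical )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

is_aligned 63     # traditional DOS-style start sector
is_aligned 2048   # start on a 1 MiB boundary
```

Under that assumption, a partition starting at sector 63 would not sit on a 4 KiB boundary, while one starting at sector 2048 would. Building the array on the raw disks sidesteps the partition offset entirely.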

Gambit · Nov 17, 2010

john4200 said:
Yes, I am sure. Neil Brown, the mdadm developer, has stated it on the linux raid email list. I'm not certain at what version the change occurred, but 512KB has been the default for many months now, if not more than a year. I know that version 3.1.2 (2010 Mar 10) defaults to 512KB chunks.

For some reason, Ubuntu is way behind on mdadm. I think they are only up to 2.6.7 or so.

Yep, sure enough there's version 3.1.4 (Aug-31-2010) on www.kernel.org and "$man mdadm" properly shows 512 as the default.

MrSmoke said:
OK, so xfs is the winner. I noticed it has a nice and easy grow method (so long, lvm, lol).

So creating the partition on the drives won't give me alignment issues with this drive? I did think that having partitions on the drive would make it easier to identify whether the drive is being used or not, so I don't overwrite the data on a raw drive.

Keep in mind, you cannot shrink an xfs filesystem. You have to back up the data, recreate the filesystem and restore the data. As for keeping track of the drives in use, use a small sticker or a piece of paper taped to the inside of the server case with drives, serial numbers, device name (e.g. /dev/sda), etc. Probably not a bad practice anyway.

MrSmoke · Nov 18, 2010

I don't think I'll ever need to shrink it lol.

mdadm -v --create /dev/md2 --force --chunk=128 --level=raid5 --spare-devices=0 --raid-devices=3 /dev/sdg1 /dev/sdh1 /dev/sdj1

Building at ~97000K/sec, which isn't too bad.

Gambit · Nov 18, 2010

Hmm... mine is rebuilding now at ~58,000. I created it with: mdadm --create /dev/md0 --level=5 --chunk=64 --raid-devices=3 /dev/sd{b,c,d}

Red Falcon · Nov 18, 2010

This is great info, thanks for posting it.

/dev/null · Dec 9, 2010

Also for watching it build...

watch -n 3 cat /proc/mdstat

mitgib · Dec 9, 2010

I actually did a lot of research on this and was surprised to learn a smaller chunk size is usually going to be faster access for larger files. XFS will auto detect the details from mdadm for sunit and swidth which is great.

And a 12 drive raid10 I have at work building right now:

Code:

Every 2.0s: cat /proc/mdstat                            Thu Dec  9 20:39:22 2010

Personalities : [raid10]
md0 : active raid10 sdm1[11] sdl1[10] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      5860559616 blocks 64K chunks 2 near-copies [12/12] [UUUUUUUUUUUU]
      [=====>...............]  resync = 26.9% (1577992192/5860559616) finish=584.6min speed=122082K/sec

unused devices: <none>

I was shocked by how well this is performing on a 4 port sil3124 pci-e x4 card with a few sil3176 port multipliers
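For logging progress rather than watching it, the percentage and speed can be pulled out of the status line with sed. This is only a sketch: mdstat_progress is a hypothetical helper, and it assumes the "= NN.N%" and "speed=...K/sec" fields appear on one line, as in the output above.

```shell
#!/bin/sh
# mdstat_progress: read /proc/mdstat-style text on stdin and print
# "percent speed" for each resync/recovery progress line found.
# (hypothetical helper; relies on the "= NN.N%" and "speed=" fields)
mdstat_progress() {
    sed -n 's/.*= *\([0-9.]*\)%.*speed=\([0-9]*K\/sec\).*/\1 \2/p'
}

# Only read the real file when it exists (i.e. on a Linux box with md loaded).
if [ -r /proc/mdstat ]; then
    mdstat_progress < /proc/mdstat
fi
```

An idle array has no progress line, so the function prints nothing in that case.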

john4200 · Dec 9, 2010

mitgib said:
I actually did a lot of research on this and was surprised to learn a smaller chunk size is usually going to be faster access for larger files.

Maybe you could use a little more research!

It is actually more complicated than your statement might lead one to believe.

If you do a lot of sequential access, then a good chunk size is around F / D, where F is the typical file size you are accessing (or perhaps the 10th percentile size), and D is the number of data disks in the RAID (total number of disks minus number of parity/redundancy disks). That way you can get most of the disks reading in parallel for most of your files.

But if you do a lot of asynchronous random reads, then the optimal chunk size is going to be several times larger than the typical I/O size you are accessing. That way it is likely that most individual reads will be entirely from one disk, and with a lot of asynchronous random reads, you can read from most of the disks in parallel. You can see that a very small chunk size would be terrible for random reads, since all of the disks would have to seek to a random location before a single I/O access can complete (since the I/O block will span chunks on all the disks). By the way, I am assuming mechanical HDDs. This would not be a problem on SSDs, since the random access latency is much lower for SSDs as compared to HDDs.
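The F / D rule of thumb for sequential access is simple arithmetic, sketched below. The helper name seq_chunk_kib and the example file sizes are made up for illustration; treat the result as a starting point to benchmark around, not a recommendation.

```shell
#!/bin/sh
# seq_chunk_kib FILE_KIB TOTAL_DISKS PARITY_DISKS  (hypothetical helper)
# Applies the F / D rule of thumb for sequential access:
#   F = typical file size, D = data disks = total disks - parity disks.
seq_chunk_kib() {
    file_kib=$1; total=$2; parity=$3
    data=$(( total - parity ))
    echo $(( file_kib / data ))
}

seq_chunk_kib 1024 3 1   # 1 MiB files on a 3-disk RAID 5 (2 data disks)
seq_chunk_kib 2048 5 1   # 2 MiB files on a 5-disk RAID 5 (4 data disks)
```

Both example calls come out at 512 KiB. For multi-gigabyte files the formula gives values far larger than any sensible chunk size, so in that case it mostly says that a large chunk does no harm.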

Obiron · Dec 10, 2010

I hope this isn't too much of a hijack, but since this thread talked about the subject, I'm hoping for a little advice myself.

I recently built a new fileserver, and hadn't even considered chunk size, block, etc... until recently. I was just going with what mdadm set up by default (using Ubuntu Server 10.04.1). I can see that it might be a good idea to change some of those values, and I'm hoping someone can help me figure out the best numbers for my situation.

My array will be RAID 5 with 3 drives (2 TB each) and will grow to 5 drives once I copy stuff off my main PC and move drives out of that machine. The majority of files are media, ranging from mp3 to flac, and compressed movies (700MB) to uncompressed bluray (20-40GB). The movies will be my main concern, which will be watched via a separate PC running XBMC. It looks like mdadm gave me a block size of 4k, and a chunk of 64k.

Concerning block: I haven't seen much info on this. Is it advisable to change block, or leave it at the default 4096?

Concerning chunk: Since I have large files, 64k seems a bit small, so I should change this to 128+ (maybe 512?). Using the link that Relativistic posted, and assuming a 512k chunk and 4k block, it tells me to set the stripe width at 256 for 3 drives, or 512 for 5 drives (and stride at 128 for both scenarios). Since I will be starting with 3, then growing to 5 (one drive at a time) should I start with the full 512, or change that number each time I grow the array?

john4200 · Dec 10, 2010

You are talking about two different things -- the RAID configuration and the filesystem parameters. For RAID, the only things you need to worry about are the level (5, 6, etc.), the number of disks in the RAID, and the chunk size. Once those are set, the stripe width (stride, etc.) for the filesystem is already determined. Don't be confused by the fact that the stride is shown in filesystem blocks instead of KBs, it is still determined by the chunk size and the number of data disks.
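To make the relationship concrete, here is a small sketch of the arithmetic: stride is the chunk size in filesystem blocks, and stripe width is stride times the number of data disks. The helper name fs_geometry is made up; the inputs are the 512K chunk and 4K block discussed above.

```shell
#!/bin/sh
# fs_geometry CHUNK_KIB BLOCK_KIB TOTAL_DISKS PARITY_DISKS  (hypothetical helper)
# Prints "stride stripe_width", both in filesystem blocks.
fs_geometry() {
    chunk=$1; block=$2; total=$3; parity=$4
    stride=$(( chunk / block ))               # chunk size in fs blocks
    width=$(( stride * (total - parity) ))    # stride * data disks
    echo "$stride $width"
}

fs_geometry 512 4 3 1   # 3-disk RAID 5 -> 128 256
fs_geometry 512 4 5 1   # 5-disk RAID 5 -> 128 512
```

These match the numbers from the calculator Relativistic linked: the stride stays at 128 as disks are added, and only the stripe width changes.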

The block size is a filesystem parameter. There is no reason to change it from the default. Actually, I don't think you need to specify anything when you create an ext3 or ext4 filesystem -- I think mkfs will be able to determine the optimal parameters itself by querying mdadm.

As for the chunk size for your application, I'd just set it to 512KB and keep it there. For the usage you described, it is not very important. Your performance should be fine with a wide range of chunk values. But 512KB has been the default for mdadm for a while (at least since 3.1.2), and it will work well for the kind of large files you mentioned.

tgrimley · Dec 10, 2010

Obiron said:
Concerning chunk: Since I have large files, 64k seems a bit small, so I should change this to 128+ (maybe 512?). Using the link that Relativistic posted, and assuming a 512k chunk and 4k block, it tells me to set the stripe width at 256 for 3 drives, or 512 for 5 drives (and stride at 128 for both scenarios). Since I will be starting with 3, then growing to 5 (one drive at a time) should I start with the full 512, or change that number each time I grow the array?

Changing chunk size is very slow. I'd make it 512 (if that's the chunk size you choose for 5 drives) and leave it there.

Obiron · Dec 10, 2010

Thanks for the replies. I guess I was making it more complicated than it needed to be. It's good to know that I only need to choose a chunk size now.
