Linux & Ext4 - Blocksize alignment??

Krobar

Can any Linux gurus offer any advice on ext4 and partition/blocksize alignment? I'm on a 10-drive RAID 6 array with a 512 KB stripe size.


I used the commands below, and whilst I think it will align to a 512 KB stripe, I don't really know. Does it look right? (Hopefully I haven't overwritten the controller's metadata...)



dd if=/dev/urandom of=/dev/sdb bs=512 count=64    # overwrites the first 32 KiB of the disk (MBR and any old signatures)
pvcreate --metadatasize 500k /dev/sdb             # oversized metadata area so LVM rounds pe_start up to 512 KiB
pvs -o pe_start                                   # check where the data area actually starts


vgcreate RaidVolGroup00 /dev/sdb
lvcreate --extents 100%VG --name RaidLogVol00 RaidVolGroup00


yum -y install e4fsprogs    # ext4 userspace tools on CentOS 5 (mke4fs, tune4fs, etc.)



[root@localhost ~]#mkfs -t ext4 -E stride=128,stripe-width=1024 -i 65536 -m 0 -O extents,uninit_bg,dir_index,filetype,has_journal,sparse_super /dev/RaidVolGroup00/RaidLogVol00
mke4fs 1.41.5 (23-Apr-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
243793920 inodes, 3900695552 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
119040 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first. Use tune4fs -c or -i to override.
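
For reference, those stride/stripe-width numbers line up with the usual arithmetic; here is a quick sanity-check sketch (it assumes the 4 KiB block size mkfs picked and 8 data disks in the 10-drive RAID 6):

CHUNK_KB=512     # controller stripe (chunk) size per disk
BLOCK_KB=4       # ext4 block size chosen by mkfs
DATA_DISKS=8     # 10 drives minus 2 parity disks in RAID 6
echo "stride=$(( CHUNK_KB / BLOCK_KB ))"                       # 128
echo "stripe-width=$(( CHUNK_KB / BLOCK_KB * DATA_DISKS ))"    # 1024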

Edit: My initial format attempt was terrible and looked like it would take over 300 days to write the inode tables. I read a much better description of the settings for RAID at the link below, and the corrected command above finished the inode tables in under 10 minutes:
http://www.ep.ph.bham.ac.uk/general/support/raid/raidperf11.html

Add and mount:
echo "/dev/RaidVolGroup00/RaidLogVol00 /data0 ext4 defaults 0 0" >>/etc/fstab
mkdir /data0
mount /data0

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup01-LogVol00
285G 6.1G 265G 3% /
/dev/sda1 99M 12M 82M 13% /boot
tmpfs 1014M 0 1014M 0% /dev/shm
/dev/mapper/RaidVolGroup00-RaidLogVol00
15T 138M 15T 1% /data0
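
A quick way to confirm the stride/stripe-width values actually landed in the superblock is something like the line below (dumpe4fs should be the e4fsprogs spelling of dumpe2fs here; on other distros it's plain dumpe2fs, and the exact field names can vary by version):

dumpe4fs -h /dev/RaidVolGroup00/RaidLogVol00 | grep -i raid
# expect something along the lines of:
#   RAID stride:              128
#   RAID stripe width:        1024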
 
Eeks!

First, please realise that a 512 KiB stripe size in RAID 5 or RAID 6 is very high. That means your full stripe block would be (10 - 2) * 512 KiB = 4096 KiB = 4 MiB. Any small write to the array can therefore trigger reading up to 4 MiB, XORing it all and writing 4 MiB back out. So when writing 2 bytes, you're actually 'amplifying' that to roughly 8 megabytes of physical I/O.

Normally I recommend a 128 KiB stripe size; in that case your array's full stripe block would be 8 * 128 KiB = 1 MiB. As I/O requests normally do not go beyond 128 KiB, this stripe size works best provided you align your filesystem properly.

The 300 days to create a filesystem sounds like an extreme case of write amplification; I would sort that out first.

In FreeBSD, aligning is very simple: use the bare disk device node:

newfs /dev/sdb

Or use a label:

glabel label disk2 /dev/sdb
newfs /dev/label/disk2

If you use partitions, make sure the offset of the partition is a multiple of the full stripe block size. So, in our example of 10-disk RAID 6 with a 128 KiB stripe size and a 1 MiB full stripe block, the partition offset would have to be 1024 KiB, 2048 KiB, 4096 KiB, 8192 KiB, etc.; in other words, a multiple of 1024 KiB. After that you can create any filesystem normally, I would assume. I don't think you need to make the filesystem change its internal offset, though that's another way to fix the problem, I guess.

So either don't use partitions at all, or use partitions with the correct offset, and certainly not the default 63-sector (31.5 KiB) offset.
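
On the Linux side, one way to get such an aligned partition is sketched below (illustrative only; it assumes GNU parted and 512-byte sectors, so sector 2048 = 1 MiB, a multiple of the 1 MiB full stripe in the example above):

parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 2048s 100%    # start at sector 2048 = 1 MiB
parted -s /dev/sdb unit s print                 # verify the start sector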
 
I found that with mdadm, leaving the default chunk (stripe) size of 64 KiB usually leads to good read and write performance. I have in the past made the mistake of creating arrays with 256 MiB or 512 MiB chunks, only to see poor write performance.
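
For anyone doing this with mdadm, the chunk size is just the --chunk option at creation time; a sketch (device names are illustrative, --chunk is in KiB):

mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=128 /dev/sd[b-k]
cat /proc/mdstat    # the running array reports its chunk size here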
 
Edit: My initial format attempt was terrible and looked like it would take over 300 days to write the inode tables.

And I complain when it takes 5 minutes to format ext4 on my Linux software RAID 5 arrays...
 
LVM will eat some performance and make aligning the data more difficult. Unless you're using it for volume management, and it doesn't look like you are, don't use it. Create the RAID directly on the raw block devices (don't partition), then create the filesystem directly on the RAID md device. That's the easiest path. If you do need LVM, do the same, but specify the PE size to be a multiple of your RAID stripe size.
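
If LVM does have to stay in the picture, the alignment knobs look roughly like this (a sketch; --dataalignment needs a reasonably recent LVM2, the device name is the one from the OP's commands, and 4 MiB is the 8 x 512 KiB full stripe):

pvcreate --dataalignment 4M /dev/sdb      # start the data area on a full-stripe boundary
vgcreate -s 4M RaidVolGroup00 /dev/sdb    # PE size: a multiple of the 512 KiB chunk
pvs -o +pe_start --units k                # sanity-check where PE 0 actually starts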
 
Thanks for the help everyone.

dd if=/dev/urandom of=/dev/sdb bs=512 count=64

Do you think this killed the metadata?

Had a complete meltdown on reboot. It looks like the metadata was gone: all drives appeared healthy, but the controller was insistent that they weren't part of the array. To be honest, it may have been the power-management spindown that started the problem; all drives spun down for the first time, but only 4 were seen by ASM on waking up, and on reboot it was in a bad state.

Before I rebooted, the array was giving me 400 MB/s read or write with dd, dropping to 320 MB/s with 4 KB blocks (it appears to be CPU-limited by the 1.46 GHz Mobile Celeron).
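
For repeatable numbers, something along these lines works (paths are illustrative; the write test goes through the mounted filesystem so nothing raw gets touched, and conv=fdatasync stops the page cache from flattering the result):

dd if=/dev/RaidVolGroup00/RaidLogVol00 of=/dev/null bs=1M count=10000    # sequential read off the LV
dd if=/dev/zero of=/data0/ddtest.bin bs=1M count=10000 conv=fdatasync    # sequential write through ext4
rm -f /data0/ddtest.bin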

Does anyone else have any more thoughts on a suitable block size? This array is not likely to store any files below 4 GB.
 
LVM will eat some performance and make aligning the data more difficult. Unless you're using it for volume management, and it doesn't look like you are, don't use it. Create the RAID directly on the raw block devices (don't partition), then create the filesystem directly on the RAID md device. That's the easiest path. If you do need LVM, do the same, but specify the PE size to be a multiple of your RAID stripe size.

Your guess is right, I don't need LVM. I was scared off by mkfs warning me it was a raw device the first time round. I've now formatted straight to the device while the array is rebuilding and rebooted, and the array is OK.
 
Does anyone else have any more thoughts on a suitable block size? This array is not likely to store any files below 4 GB.

Do you mean the stripe/chunk size for RAID striping? Given your large files, 512KB should be fine.

With today's high-density, high-latency, low-power HDDs, there are few applications that will benefit from a chunk size of less than 256 KB. Most of the low-power >1 TB HDDs have an access latency (seek time + rotational latency) of around 12-15 ms, but they can read sequentially at around 100 MB/s. So it takes about 2.5 ms to read 256 KB, but 5 or 6 times that long just to get the head into position over the starting sector.
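
Rough numbers behind that, using the same assumptions (about 100 MB/s sequential, 12-15 ms to position the head):

echo "scale=2; 256 * 1000 / (100 * 1024)" | bc    # ~2.50 ms to transfer one 256 KB chunk
# versus 12-15 ms of positioning time, i.e. positioning dominates by roughly 5-6x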

The only application I can think of that would benefit significantly from a smaller chunk size would be something that does a high percentage of writes compared to reads, with most of the writes going to files smaller than 256 KB. Even then I do not think you would double the write performance by decreasing the chunk size.
 
Yup, that kills everything possible, including the MBR.

Thanks Longblock, at least that explains the meltdown. I suppose the card must re-read its metadata on resuming from spindown. I wasn't sure if the RAID card's metadata was hidden from the OS or not; looks like it isn't.
 
Controller metadata shouldn't be visible to the OS. Reboot it and you'll find out soon enough... but it would have wiped any LVM/mdadm/filesystem metadata.
 
I formatted the raw device and rebooted during the rebuild. All still intact this time.

My best guess at the moment is that the aacraid driver built into CentOS 5.4 has some severe issues with the recent Adaptec power-management additions. I've disabled spindown for now and will give Adaptec a call.
 
It gets worse:

Event Description: Bad Block discovered: controller 1 (9835b000).
Event Type: Warning
Event Source: [email protected]omain
Date: 03/31/2010
Time: 02:44:44 PM BST
 
Replaced the PSU with a Corsair HX650, but things still get worse. The card now randomly fails to detect 2-4 of the drives during the BIOS boot. The drives not detected vary, and swapping the drives to different bays (different backplane, cables and ports on the Adaptec) doesn't help. I think I have a bad card; any other ideas?
 
Perhaps the drives need more time; are you using staggered spinup?

No, it never worked. I upgraded to the HX650 PSU so it could deal with the inrush current of 20 drives. I disabled head parking just in case.

Just installed a fan straight over the Adaptec, so the temperature is now down to 51°C. Last attempt before I RMA it.
 