Solaris iostat errors

tormentum · Apr 28, 2011

G'day Guys,

I'm having some strange issues on my Solaris Express 11 and ZFS. I'm getting a lot of iostat errors on ONE of my zpools when I it under load:

Code:

root@arkf-san1:/dev/rdsk# iostat -exmn
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.1    0.4    2.9   31.6  0.0  0.0   17.4    2.1   0   0   0   0   0   0 c7t0d0
    0.1    0.4    3.0   31.6  0.0  0.0   17.4    2.2   0   0   0   0   0   0 c7t1d0
   23.6   20.9  454.9  213.1  0.0  0.1    0.0    2.8   0   3   0   0   0   0 c0t50024E900430BBEBd0    \
   23.5   20.9  455.0  213.1  0.0  0.1    0.0    2.9   0   3   0   0   0   0 c0t50024E900430BBF2d0     |
   23.6   21.0  455.0  213.1  0.0  0.1    0.0    2.7   0   3   0   0   0   0 c0t50024E900430BBEEd0     | Zpool1
   23.5   21.0  455.1  213.1  0.0  0.1    0.0    2.8   0   3   0   0   0   0 c0t50024E900430BBDCd0     | Samsung F2
   23.5   20.9  455.1  213.1  0.0  0.1    0.0    2.9   0   3   0   0   0   0 c0t50024E900430BC2Ad0     | 1.5TB disks
   23.5   20.9  454.8  213.1  0.0  0.1    0.0    2.8   0   3   0   0   0   0 c0t50024E900431D9B0d0     |
   23.4   20.9  454.8  213.1  0.0  0.1    0.0    2.8   0   3   0   0   0   0 c0t50024E900431D9ACd0     |
   23.4   20.9  454.8  213.1  0.0  0.1    0.0    2.8   0   3   0   0   0   0 c0t50024E900431D9E9d0    /
    0.0   15.9    0.1 11470.3 0.0  0.1    0.0    4.8   0   7   0   8  17  25 c0t50024E90047BBF9Ad0    \
    0.0   15.8    0.1 11470.2 0.0  0.1    0.0    4.8   0   7   0   8  17  25 c0t50024E90047BBF98d0     |
    0.0   15.9    0.1 11470.4 0.0  0.1    0.0    4.8   0   7   0   4   9  13 c0t50024E90047C51A2d0     | Zpool2
    0.0   15.8    0.1 11470.2 0.0  0.1    0.0    4.8   0   7   0   8  17  25 c0t50024E90047BBC90d0     | Samsung F4
    0.0    2.9    0.2 1847.4  0.0  0.0    0.0    4.9   0   1   0  14  35  49 c0t50024E90047C5450d0     | 2.0TB disks
    0.0    2.9    0.2 1847.3  0.0  0.0    0.0    4.6   0   1   0   2   6   8 c0t50024E90047C5512d0     |
    0.0    2.9    0.2 1847.4  0.0  0.0    0.0    4.6   0   1   0   6  10  16 c0t50024E90047C51B0d0     |
    0.0    2.9    0.2 1847.3  0.0  0.0    0.0    4.6   0   1   0   2   4   6 c0t50024E90047C55C4d0    /

The above iostat printout was taken half way through a "zfs send | zfs recv" between two zpools. The source zpool contains the top 8 disks, and the destination zpool contains the bottom 8 disks. The top 8 disks (zpool1) are Samsung 1.5 F2 disks, and the bottom 8 disks (zpool2) are Samsung 2TB F4 disks.

Before anyone asks, YES I have updated the 2TB disks to the latest firmware!

My rig comprises of the following significant items:

8 x SAMSUNG EcoGreen F2 HD154UI hard disks
8 x 2TB SAMSUNG Spinpoint F4 HD204UI hard disks
LSI Internal SATA/SAS 9211-8i 6Gb/s PCI-Express 2.0 card
HP SAS Expander Card
NORCO RPC-4224 4U Rackmount Server Case

I have tried different disks (Seagate, WD, Hitachi, Samsung), and have not had any issues. It's only the 2TB Samsung F4 disks that have an I/O issues. I have also tried connecting the 2TB disks via a straight SATA card (non-SAS) and I still have the same issue.

Anyone else had issues as above??

tormentum · May 1, 2011

I just disovered that LSI has released a new firmware version (v9.0.0.0). I'll upgrade and report back.

ChrisBenn · May 1, 2011

When you tried connecting directly was that still through the expander? Do you have the errors without the expander but with the same drives?

tormentum · May 2, 2011

tormentum said:
I just disovered that LSI has released a new firmware version (v9.0.0.0). I'll upgrade and report back.

So I screwed up the flash and am contacting LSI support. My system now fails to POST when the card is inserted in the machine.

*sigh* haha

ZX2Slow · May 2, 2011

tormentum · May 2, 2011

Yes, I've tried isolating to particular drives, to particular SAS cables etc. Issue occurs even when I have a single device connected to an Intel SATA cable.

I'm putting it down to an incompatibility or poor design of the Samsung 2TB disk. I'll probably sell them and purchase something else; Hitachi maybe.

ChrisBenn · May 3, 2011

So with a drive plugged directly into a motherboard header you still get the errors? With different SATA cables? Either drive or controller is bad, and if other drives work fine in the controller then I'd say those drives are bum.

tormentum · Oct 6, 2011

Update:

So on my main rig (Norco 4224) I have 6 rows of 1.5TB Samsung now, configured as a single zpool, with 4 raidz2 vdev's containing 6 disks each. Works a charm and never have IOSTAT errors.

I recently built another box and placed the same Samsung 2TB disks into this server. There is no backplane, just straight SAS -> SATA cables connecting to another LSI 9211-8i card. Again, IOSTAT errors are appearing. I've only just started testing with this new box. There's about 11TB usable capacity (7 x 2TB).

I'm doing a zfs send/recv from my norco to this box and once I get a sizable chunk of data on there I'll start stressing the disks a bit. Already one of them is showing regular IOSTAT errors.

I was wondering if anyone else has had issues with these 2TB Samsung HD204UI disks after upgrading their firmware?

brutalizer · Oct 6, 2011

I have one Samsung disk HD204UI and never encountered any problems at all. But I have never upgraded firmware.

tormentum · Oct 6, 2011

Hmm, i've been google trawling and found this:

http://opensolaris.org/jive/thread.jspa?threadID=136401

It has me wondering if this could be a 4K sector size issue. I thought ZFS worked fine with 4K drives. Am i wrong?

I'm using OpenIndiana oi_151a.

ChrisBenn · Oct 6, 2011

The 4k "issue" was purely a performance based issue.

The root of the problem is all the 4k drives available today actually report themselves as 512 byte drives & handle the translation in firmware. Depending on the filesystems usage patterns this can lead to non-optimal behaviour. What is pretty common is to force the zfs pool/vdev to treat the drives as 4k drives via setting the ashift to 12 for the pool/vdev. What this does is cause zfs to essentially batch up the operations into 4k chunks anyway, so overall performance takes a minimal hit from the drive firmware emulation.

If you don't manually set an ashift of 12 things will still work, it will just be a question of how efficient the firmware in the drive. But there shouldn't ever be a situation where the drive doesn't work.

ZFS is/would be fine with 4k drives if any of them actually reported themselves as such.

tormentum · Oct 6, 2011

Thanks ChrisBenn. I think what I'll do is get some data sets onto the new samsung volume and stress it a bit, see if I can get some regular errors on more than just one drive.

Based on the results from this i will do one of the following:
1) replace the problematic disks/cables
2) reformat with ashift=12 and retest.

Part of the problem is this appears intermittent. I could set up a looped zfs send/recv onto the same zpool and let it run for days and nothing will show up, and then while copying some small files I get the iostat error show up. Or viceversa!

Will report back.

brutalizer · Oct 6, 2011

If you use 4KB or 512byte should not give any errors. It can give performance decrease in worst case, but it should not make you loose data or disk crash. The problem is elsewhere, I strongly suspect

tormentum · Oct 6, 2011

brutalizer said:
If you use 4KB or 512byte should not give any errors. It can give performance decrease in worst case, but it should not make you loose data or disk crash. The problem is elsewhere, I strongly suspect

I've come to the same conclusion myself Brutalizer. I've created a 7 disk RAIDZ with my HD154UI disks using 4k sector size to test however.

An interesting issue has come up when i do a ZFS send/recv from my 512 zpool to my 4K zpool. My vmware zvol's expand in size for some reason. It does not appear to occur with straight file systems (see the "templates" and "software" file systems), but does occur with zvols:

Original "ashift=9" file system:

Code:

tank/vmware                        221G  11.5T  57.9K  /tank/vmware
tank/vmware/software              29.6G  11.5T  29.6G  /export/vmware-software
tank/vmware/templates             27.4G  11.5T  27.4G  /export/vmware-templates
tank/vmware/zvol-arkf-production   108G  11.5T   101G  -
tank/vmware/zvol-lab-01           38.1G  11.5T  38.1G  -
tank/vmware/zvol-lab-02           18.0G  11.5T  17.9G  -
tank/vmware/zvol-lab-03           1.47M  11.5T  1.40M  -
tank/vmware/zvol-scratch          6.41M  11.5T  6.33M  -

Destination "ashift=12" file system:

Code:

tank/arkf-san1/vmware                        325G  9.92T   242K  /tank/arkf-san1/vmware
tank/arkf-san1/vmware/software              29.7G  9.92T  29.7G  /export/vmware-software
tank/arkf-san1/vmware/templates             27.6G  9.92T  27.6G  /export/vmware-templates
tank/arkf-san1/vmware/zvol-arkf-production   172G  9.92T   172G  -
tank/arkf-san1/vmware/zvol-lab-01           65.4G  9.92T  65.4G  -
tank/arkf-san1/vmware/zvol-lab-02           30.5G  9.92T  30.5G  -
tank/arkf-san1/vmware/zvol-lab-03           6.93M  9.92T  6.93M  -
tank/arkf-san1/vmware/zvol-scratch          16.2M  9.92T  16.2M  -

Notice the volumes marked "zvol" toward the bottom? They have expanded when shifting between the hosts. I wonder if this has to do with compression? These volumes are configured with zle based compression.

When I dump the volumes to file (eg: zfs send tank/arkf-san1/vmware/zvol-scratch@2011-10-05 > /tank/zvol_dump), the files on both hosts are the same size, so it appears the data is correct, it's just an issue with the size of the actual volumes within ZFS. Very curious.

Thoughts?

ChrisBenn · Oct 6, 2011

Extra space is due to the larger sector size. If you make a filesystem and just put one big file on it then the filesize of the two systems should be within a few k of eachother.

As you add more smaller files then ZFS metadata ends up consuming more resources. The metadata would normally take up some portion of a 512b block - but now it's taking up some portion of a 4Kb block. So more slack space and it ends up taking more real space to store the same amount of data.

I'm not totally sure how ZFS deals with slack space for files, but if you have a directory with lots of files you could probably pretty easily loose the different (3.5Kb) per file. If it was a directory of millions of 1Kb text files that could obviously be significant. If it had 10 tv episodes it's probably irrelevant.

The flip side being if you are writing or reading 4k worth of data it should be faster to do it in on 4Kb chunk vs. 4 512b chunks. (Though again, probably not a significant difference, and we won't really realize any benefit until there are native 4k drives)

tormentum · Oct 8, 2011

Well, after hammering the 4K (ashift=12) pool heavily for a few days (I've transferred about 50TB worth of I/O through the pool), I have not generated any iostat errors. I'm about to go back to the 512 (ashift=9) pool configuration and see if i get errors again.

I must say, I'm quite surprised so far with this. I would not have thought that the sector size would have anything to do with hardware and transport errors on the disks, but it looks like it may. Interestingly ZFS has never detected any errors or bad reads which means the errors are corrected before the data ever reaches ZFS. Perhaps another buggy firmware issue with these drives?

Will let you know whether I can generate errors again after going back to the 512 pool config.

tormentum · Oct 9, 2011

Okay, back with the 512 (ashift=9) pool. Iostat errors began appearing after 30 minutes or so. Looks like the 512 vs 4K sector issue appears to be confirmed, with these drives at least.

I wish I tested this before. I would have bought more 2TB's instead of opting to return to the 1.5TB Samsung's. Oh well.

Thanks for everyone's input and suggestions.

ChrisBenn · Oct 9, 2011

Bizzare, it really sounds like firmware bugs in the drive itself (I'm not sure what else would cause that) - good catch though.

brutalizer · Oct 9, 2011

Just to confirm, with 512 byte sectors you got errors. And with 4kb sectors there were no errors?

Why are you returning disks, why not only use 4kb on those as well?

tormentum · Oct 9, 2011

brutalizer said:
Just to confirm, with 512 byte sectors you got errors. And with 4kb sectors there were no errors?

Why are you returning disks, why not only use 4kb on those as well?

Not returning the disks at all. I had shelved them for about 6 or 9 months (bought 12 of them between Jan and April this year). I've already re-formatted back to 4K and started copying zfs volumes in question.

I had returned to 512b to verify my hypothesis that the 512b format was causing the iostat hardware and transport errors. I confirmed this.

I have 2 storage servers, one 20TB server comprizing of 1.5TB disks (4 x 6 disk RAIDz2) and the new server which comprises of 8 2TB disks in a raidz. The 8 disk one is new and i decided to figure out why my 2TB samsungs were causing me issues earlier in the year.

brutalizer · Oct 10, 2011

So... with Samsung 2TB F4 HD204U disks it could be good to use 4KB sectors? Is this the conclusion?

I am using one of those disks as single disk storage, but have not seen any errors. But if I do a raid, I should use 4KB sectors, according to your experiment, it seems...

forumator · Oct 10, 2011

This is pretty interesting info if it's true that ashift=12 is required w/ 4k drives...I had thought it was a minor performance loss at worst...does formatting with ashift=12 have any downsides (again when using 4k drives)?

ChrisBenn · Oct 10, 2011

Ashift 9 with a 4k (512e) drive should be totally fine. What it is going to do is exercise the translation firmware significantly more than an ashift of 12 on the same drive. The main conclusion you can draw from this is that on the samsung F4 drives the firmware appears to be buggy with respect to ZFS's access patterns for a Raid-Z setup.

Not really any large downside to using an ashift of 12 though (even if some of the drives are true 512 - though no point if you don't have any 4k drives). Since it's fixed @ the pool at time of creation you really want to set it to 12 if you *ever* plan to add a 4k drive and not destroy/recreate the pool. Your ZFS metadata will now occupy a 4k chunk instead of a 512 byte chunk so you will have slightly more overhead - so slightly less available file space - though I don't think it will really be significant.

tormentum · Oct 10, 2011

brutalizer said:
So... with Samsung 2TB F4 HD204U disks it could be good to use 4KB sectors? Is this the conclusion?

I am using one of those disks as single disk storage, but have not seen any errors. But if I do a raid, I should use 4KB sectors, according to your experiment, it seems...

In my particular situation it would seem so. Bare in mind that I'm using an LSI SAS 9211-8i and not straight SATA. I'm not sure if this has an impact; could be an area for some extra study.

Also, I only notice iostat errors on fairly busy zvols/disks. If the zvol has not been doing much, there are few/no errors listed with iostat; although this stands to reason: higher IO count, higher iostat error count.

I'm not saying everyone should go and re-format. If you're not experiencing any issues (run "iostat -exmn" and see if you have any errors listed), why reformat? In my case, when an error was experienced I would get a temporary performance hit and would have to wait for the disk in question to recover. This recovery would take less than a second, however in my case I was using the disks for vmware zvol's. A classic example of an issue I had was installing Windows 2008r2 into a VM sitting on a HD204U based zvol taking over an hour (should be 10-12 minutes max!).

Also note: no ZFS errors were experienced so data it seems was not affected; issue was purely iostat transport and hardware errors (see first post).

In my opinion, this is a samsung specific issue. I have not been able to replicate the issue with my WD EARS disks. It feels like this could be a firmware issue on the Samsungs in how they handle 512b emulation.

Has anyone else experienced IOSTAT errors with their Samsung HD204U's? Bonus points if you have an LSI 9211-8i!

tormentum · Oct 10, 2011

ChrisBenn said:
Ashift 9 with a 4k (512e) drive should be totally fine. What it is going to do is exercise the translation firmware significantly more than an ashift of 12 on the same drive. The main conclusion you can draw from this is that on the samsung F4 drives the firmware appears to be buggy with respect to ZFS's access patterns for a Raid-Z setup.

Not really any large downside to using an ashift of 12 though (even if some of the drives are true 512 - though no point if you don't have any 4k drives). Since it's fixed @ the pool at time of creation you really want to set it to 12 if you *ever* plan to add a 4k drive and not destroy/recreate the pool. Your ZFS metadata will now occupy a 4k chunk instead of a 512 byte chunk so you will have slightly more overhead - so slightly less available file space - though I don't think it will really be significant.

Correct.

The only issue I have experienced so far is when I ship zvol snapshots from my 512b pool to my 4K pool the amount of space the zvol's take increase by around 20-40% (havn't been able to determine a correlation as yet).

ZFS file systems appear to be (relatively) unaffected. At most, one of my 400GB volumes containing roughly 800,000 files grows by about 20MB.

tormentum · Oct 10, 2011

ChrisBenn said:
Not really any large downside to using an ashift of 12 though (even if some of the drives are true 512 - though no point if you don't have any 4k drives). Since it's fixed @ the pool at time of creation you really want to set it to 12 if you *ever* plan to add a 4k drive and not destroy/recreate the pool. Your ZFS metadata will now occupy a 4k chunk instead of a 512 byte chunk so you will have slightly more overhead - so slightly less available file space - though I don't think it will really be significant.

Good point. I've finished backing up my vmware pool to reconfigure my vdev's. I might reformat as well using 4K and see how things go. Kind of future proofing the pool.

forumator · Oct 10, 2011

In order to create the pool w/ ashift=12 are you doing it manually, or creating first with ZFSguru, or downloading a binary (that is current for v28)? As far as I can tell napp-it doesn't support creation of pools with modified ashift so wondering what the cleanest way to format in OpenIndiana is.

ChrisBenn · Oct 10, 2011

I would download the modified zpool binary

tormentum · Oct 10, 2011

i compiled a copy of the zpool command with ashift set to 12:

http://www.arkf.net/files/zpool-12.zip

Simply create the pool using the zpool-12 command. Use the regular zpool executable for future queries of the pool.

Solaris iostat errors

Limp Gawd

Limp Gawd

Limp Gawd

Limp Gawd

n00b

Limp Gawd

Limp Gawd

Limp Gawd

[H]ard|Gawd

Limp Gawd

Limp Gawd

Limp Gawd

[H]ard|Gawd

Limp Gawd

Limp Gawd

Limp Gawd

Limp Gawd

Limp Gawd

[H]ard|Gawd

Limp Gawd

[H]ard|Gawd

Weaksauce

Limp Gawd

Limp Gawd

Limp Gawd

Limp Gawd

Weaksauce

Limp Gawd

Limp Gawd