Building your own ZFS fileserver

The iSCSI mounts would be for system disks only; you don't need to share those among multiple machines, since they are tied to one machine. As it runs RAID-Z, sequential speeds are very high, and all writes are buffered anyway, so writes are fast too. Reads are normally slower, but in my case I have a RAID0 of two SSDs as a cache device, so the most-accessed data gets the read latencies of the SSDs instead.

For my shared access, I use NFS, which is faster than Samba (SMB/CIFS); much faster, really. In this case ZFS stores the files, just like when using Samba (SMB/CIFS).

For the zvol-exported iSCSI filesystems, you can apply snapshots just like on normal ZFS filesystems. I think it's great and very flexible. Though I would like an upgrade to 10GBaseT once it becomes available, giving me 10 Gigabit throughput, or about 600MB/s in practice. Then I would really reap the rewards, as both my system drive and central storage would have SSD-like performance, even though most data is stored on the HDDs.

It would have the read access times of the SSD when cached; large files are not cached but are accessed sequentially. Sequential reads would be over 600MB/s and writes a little over 400MB/s; random writes are buffered and generally perform well, even though they land on HDDs. Thanks to ZFS's advanced caching and buffering, I might add. That RAM usage is good for something. :)
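
On the zvol snapshots mentioned above, a minimal sketch of the workflow (pool and volume names are made up):

# create a zvol to export over iSCSI later
zfs create -V 100G tank/iscsi/systemdisk
# snapshot it like any other ZFS dataset
zfs snapshot tank/iscsi/systemdisk@before-upgrade
# roll the block device back if the client machine breaks something
zfs rollback tank/iscsi/systemdisk@before-upgrade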
 
Now go find me a switch for less than 5 grand

Probably cheaper to buy multiple cards for your server and direct-connect all the clients, for now. I have not seen any reasonably priced 10GBaseT switches or hubs.
 
How about cheap used InfiniBand equipment for the interconnects? Speed should be just as fast as 10GigE but way cheaper. I actually have a few cards and cables from when I was planning to do a ZFS box myself, but then I just got a SAS expander instead and hooked it up to the RAID card in my main Windows box. So my InfiniBand equipment is just lying around in a box unused...
 
LVM is a POS. It can reduce performance by almost 2/3 and greatly complicate filesystem recovery when you have an underlying media fault.
I'd be curious to hear some anecdotes. I've been using it quite a bit for at least 5 years now and I don't have any major complaints and haven't had any issues to speak of. The performance hit is there, but I still get about 85-90% of raw device performance. Recovery is going to be complicated with any volume management system, and I really don't have any complaints about the userland tools either. The snapshotting is a bit weak, but it's there, I use it, and it works well enough if its shortcomings aren't an issue in your use case.
 
I guess I'm just waiting for 10GBaseT to be affordable. I know the first-generation products use chips that get very hot and use a lot of power, and the implementations are special and may have rough edges.

A second generation of 10GBaseT will not only focus on the server market; eventually all motherboards will have 10GBaseT onboard too. Once that happens, the implementation will have matured, and what is now handled by a 10W TDP chip will be integrated into the chipset, or a 2W Realtek chip will do the same job. With that kind of mass production, the price of NICs drops to under 50 euro and switches to under 100 euro. That would be the moment to buy; not the overpriced first generation of products with huge profit margins.
 
sub.mesa, based on your experience, would you use the WD 2tb Green drives or the Seagate 2tb LP drives with FreeBSD/ZFS in RAIDZ2?

Controllers would be Intel onboard and Sii3132 PCIe cards...
 
I have been using WD Green drives in RAID-Z for over a year, first the 1.0TB ones, later the 1.5TB. I feel the 2.0TB drives are relatively expensive compared to the 1.5TB drives. However, I have no reason to think there would be any problem using these drives. Just two issues:

If the drives support TLER, disable it. Also adjust the SATA timeout values to about 60 seconds; that lets the HDDs fix themselves instead of getting kicked out of the RAID.

If the drives support "Advanced Format" (4KiB sectors), you need to make sure FreeBSD understands that. diskinfo -v /dev/adXX should tell you whether it is using 4KiB sectors, and you should not mix disks with different sector sizes. So all disks in a RAID-Z should be of the same type.
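
For the 4KiB-sector case, a commonly used trick at pool-creation time is gnop, which presents the disk with a 4KiB sector size so ZFS picks the right alignment. A rough sketch (device names are examples; check the reported sector/stripe size with diskinfo first):

# see what sector size FreeBSD reports for the disk
diskinfo -v /dev/ada0
# create a temporary 4KiB-sector provider and build the pool on it
gnop create -S 4096 ada0
zpool create tank raidz ada0.nop ada1 ada2 ada3
# the .nop layer can go away after an export/import; the alignment stays with the pool
zpool export tank
gnop destroy ada0.nop
zpool import tank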

The Silicon Image SiI-3132 is PCI-Express x1 and works fine for me. You can have one array span both controllers if you like. How many disks are you thinking of?
 
So TLER being off is better for RAIDZ2?

If a drive is in error correction mode, will it hang the entire RAIDZ2 until it recovers?

8 drives in total... I'd buy them all at once. 4 would be off the onboard controller plus 2 each on a pair of Sii3132 cards. The other two motherboard ports would be for the SSD boot and a DVD drive.
 
TLER being off is better in almost all instances, except:
- mission-critical servers that can't afford hiccups of 30-60 seconds
- when using badly designed RAID engines common to Windows

So when you use ZFS, you have the fortune of not needing TLER and can let each drive finish its recovery process. TLER would cut that procedure short and leave drives in a damaged state; that's not what you want on your home NAS. You do need to configure your operating system (FreeBSD?) to be more patient with SATA timeouts, setting the timeout to 60 seconds.
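
For reference, the timeout is a kernel tunable; the exact sysctl name depends on your FreeBSD version and disk driver, so treat the name below as an assumption to verify on your own box (sysctl -a | grep -i timeout) rather than gospel:

# with the CAM-based ahci(4)/ada(4) stack, something like:
sysctl kern.cam.ada.default_timeout=60
# and persist it in /etc/sysctl.conf if it works for you
echo 'kern.cam.ada.default_timeout=60' >> /etc/sysctl.conf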

And yes, while a disk is correcting itself (TLER disabled) it will hang the array. Once it has healed the damage it will continue like nothing was wrong. For home users, these hiccups are preferable to dropping a drive from the RAID and running in a DEGRADED state; that would be more troublesome to recover from, and you wouldn't know whether the HDD really failed or just had a sector it wanted to repair.
 
I have a FreeBSD 8.0 VM up and running on vSphere 4. It's fully up to date and has the latest ports tree. I have ZFS configured and a LUN exported through istgt. I'm getting an average speed of 37MB/sec. The underlying storage on this box is more than capable of maxing out gigabit Ethernet, both with Windows file sharing and with iSCSI on Windows, Linux, and Solaris.

I get identical slow performance on a physical machine (Core 2 Duo, 2.4GHz) at the office.

Any thoughts on why it's so slow?
 
I'd be curious to hear some anecdotes. I've been using it quite a bit for at least 5 years now and I don't have any major complaints and haven't had any issues to speak of. The performance hit is there, but I still get about 85-90% of raw device performance. Recovery is going to be complicated with any volume management system, and I really don't have any complaints about the userland tools either. The snapshotting is a bit weak, but it's there, I use it, and it works well enough if its shortcomings aren't an issue in your use case.

If you are trying to reassemble a RAID array and are trying various combinations of device order to get a valid configuration, and that array is a member of an LVM volume, then if you get the first attempt wrong and try to bring the volume back online, you will corrupt the volume labels, making it impossible to repair the volume. You can't tell whether the RAID array is in a correct state when you are doing this last-resort style of assembly; you can only try it and see whether a valid filesystem exists after you've assembled it. That is impossible because of LVM, so you basically eliminate a set of last-resort data recovery methods.

Lots of things work fine when no failures occur. The test of many of these tools is what happens when bad things start to occur and you are trying to repair the problem.

And my experience was that LVM took an array that gave me 600 MB/s software raid6 performance at the device level and cut it to less than 200 MB/s when LVM was introduced. Granted, you can still fill a gigabit ethernet with it, but for local operations it was a real drag.

Samba, BTW, isn't all that good performance-wise either. The native CIFS sharing code in Solaris is much better, but it's not available in FreeBSD or Linux.
 
I've been using an LSI SAS HBA and a Chenbro SAS expander with FreeBSD/ZFS for a while. I've never been happier with a RAID implementation. ZFS has a ton of nice features and it's so simple to use; mdadm, LVM, and a filesystem on Linux is just clumsy and annoying by comparison. With snapshots and zfs send/receive you can even implement a backup solution using only ZFS and some scripting.
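
As a rough sketch of the send/receive idea (pool, dataset, and host names are made up):

# take a snapshot and send the full stream to a backup pool
zfs snapshot tank/data@2010-03-10
zfs send tank/data@2010-03-10 | zfs receive backup/data
# next time, only send what changed since the previous snapshot
zfs snapshot tank/data@2010-03-11
zfs send -i tank/data@2010-03-10 tank/data@2010-03-11 | ssh backuphost zfs receive backup/data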

Filesystems are almost like directories in ZFS. You can cheaply create as many as you want and set properties like quotas on them.
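
For example (dataset names are hypothetical):

# datasets are about as cheap to create as directories
zfs create tank/home
zfs create tank/home/alice
# and each one gets its own properties, like a quota or compression
zfs set quota=50G tank/home/alice
zfs set compression=on tank/home/alice
zfs list -r tank/home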

Hopefully, if Oracle does stop working on ZFS, they could put it under multiple licenses so a real Linux port would become possible.

Right now I've got 10 disks in a RAID-Z2 in my Norco. If I need to expand later I can add another 10 disks and add them to the pool. All the extra space will just show up automagically and be ready for use.

sub.mesa, I think you should include a link to the ZFS administration guide in your main post:
http://docs.sun.com/app/docs/doc/819-5461
 
@lizardking009
If you have performance issues, try to narrow them down to either local disk performance or network performance.

Start with a standard benchmark:
# write test
dd if=/dev/zero of=/path/to/zfs/zerofile.000 bs=1m count=8000
# read test
dd if=/path/to/zfs/zerofile.000 of=/dev/null bs=1m

If that gives you high performance, then the issue could be network-related. Try doing network bandwidth tests (I forgot the name of the app, but at least some work on both FreeBSD and Windows). If network bandwidth isn't the problem, the protocol could be the issue. Try comparing FTP/CIFS/NFS/iSCSI performance.

Right now I've got 10 disks in a RAID-Z2 in my Norco. If I need to expand later I can add another 10 disks and add them to the pool.
You don't have to add 10 disks; you can add just 4 disks in a RAID-Z, or even a 2-disk mirror. You can also use higher-capacity disks in the new array you add. That's very flexible, I think.

With normal capacity expansion, you're tied to buying the capacity you originally chose when you set up the RAID. But in the future, higher-capacity disks will be relatively cheaper, and with ZFS, expanding this way isn't a problem. The only issue is that you have to add a bunch of disks at a time, like 4 in a RAID-Z or 2 in a mirror.
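
A rough sketch of that kind of expansion (device names are made up); the new vdev simply gets striped into the existing pool:

# add a second, smaller raidz vdev of higher-capacity disks
zpool add tank raidz da10 da11 da12 da13
# or grow with just a two-disk mirror instead
zpool add tank mirror da14 da15
zpool status tank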
 
@lizardking009
If you have performance issues, try to narrow them down to either local disk performance or network performance.

Start with a standard benchmark:
# write test
dd if=/dev/zero of=/path/to/zfs/zerofile.000 bs=1m count=8000
# read test
dd if=/path/to/zfs/zerofile.000 of=/dev/null bs=1m

About 168MB/sec on both, within the ESX VM. Seems about right.

If that gives you high performance, then the issue could be network-related. Try doing network bandwidth tests (I forgot the name of the app, but at least some work on both FreeBSD and Windows). If network bandwidth isn't the problem, the protocol could be the issue. Try comparing FTP/CIFS/NFS/iSCSI performance.

Are there any issues with the Intel PRO/1000 MT Ethernet cards? ESX emulates one and my hardware box has one.
 
Could you try VirtualBox instead? I've had good performance from this free VM solution; it may be worth a shot. You said you also got slow speeds when you natively installed FreeBSD using ZFS? Is that only with Samba (SMB/CIFS), or also with NFS/iSCSI/FTP?

Samba especially can be tricky to get proper performance from. Some tuning, such as using jumbo frames on all PCs in your LAN (and making sure your switch also supports jumbo frames), might help.
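
Enabling jumbo frames on FreeBSD is just an MTU change on the NIC (em0 and the address below are examples; every host on the path and the switch must support the larger frames):

# set a 9000-byte MTU on the interface
ifconfig em0 mtu 9000
# and make it persistent in /etc/rc.conf
ifconfig_em0="inet 192.168.1.10 netmask 255.255.255.0 mtu 9000"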
 
So when you use ZFS, you have the fortune of not needing TLER and can let each drive finish its recovery process. TLER would cut that procedure short and leave drives in a damaged state; that's not what you want on your home NAS. You do need to configure your operating system (FreeBSD?) to be more patient with SATA timeouts, setting the timeout to 60 seconds.

And yes, while a disk is correcting itself (TLER disabled) it will hang the array. Once it has healed the damage it will continue like nothing was wrong. For home users, these hiccups are preferable to dropping a drive from the RAID and running in a DEGRADED state; that would be more troublesome to recover from, and you wouldn't know whether the HDD really failed or just had a sector it wanted to repair.

Even spending 60 seconds rather than 7 seconds trying to read a bad sector does not guarantee that it will succeed. The result may be that you just wasted more time and still did not manage to read the sector.

I'd rather have a short hang of only a few seconds, and if it cannot read the sector, then just regenerate the data from parity. This seems like just the sort of feature ZFS should have (i.e., just regenerating one stripe, rather than an entire disk). Can ZFS be configured to do so?
 
Could you try VirtualBox instead? I've had good performance from this free VM solution; it may be worth a shot. You said you also got slow speeds when you natively installed FreeBSD using ZFS? Is that only with Samba (SMB/CIFS), or also with NFS/iSCSI/FTP?

I'm having the same problem on ESXi and a Core 2 Duo system. The VM has ZFS configured and the hardware box does not. The iSCSI performance is more or less identical between both systems, leading me to believe in either a network interface or istgt problem. The network itself is fine, as I can max gigabit on other VMs and to physical systems.

The point of this exercise is to build an iSCSI target box for a Windows 2008 R2 cluster, so Samba and FTP performance don't concern me.

Thanks for the pointers on benchmarking... at least I'm looking in the right direction.
 
@john4200
Since your disks are connected through the operating system's SATA/SCSI CAM drivers, this is an OS feature and not ZFS specifically. You can set the timeout and keep it very short, like 5 seconds. Note that this would mean disks get kicked out of the array, because they remain damaged and unfixed. Then you either do a zero-write on them to fix the damage, or you swap the drive for a new one. Both methods cost a lot more time than 60 seconds.

If you do keep the timeout short and ZFS encounters a read error, it will indeed use parity/redundant information to finish the request anyway. That would prevent hiccups, but you would want a lot of hot spares this way, and processing the failed disks (with just one bad sector) manually costs time too. If you set the timeout to 60 seconds, you allow the HDD to fix all but the most damaged sectors, and you generally get hassle-free operation.
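
For completeness, a sketch of how you would watch and handle that in ZFS (pool name is an example):

# per-device read/write/checksum error counters and overall pool health
zpool status -v tank
# read every block so latent bad sectors are found and repaired from redundancy
zpool scrub tank
# once the underlying problem is dealt with, reset the error counters
zpool clear tank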

@lizardking009
Did you enable Command Queueing (up to queue depth 255) on the istgt daemon? Disabling queueing might lead to lower performance. Could you post some benchmarks?
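
In istgt the queue depth is set per logical unit in istgt.conf; roughly like the fragment below, though the exact section layout here is from memory, so compare it against the sample istgt.conf that ships with the port:

Code:
[LogicalUnit1]
  TargetName disk1
  Mapping PortalGroup1 InitiatorGroup1
  UnitType Disk
  QueueDepth 64
  LUN0 Storage /dev/zvol/tank/vol0 Auto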
 
Since your disks are connected through the operating system's SATA/SCSI CAM drivers, this is an OS feature and not ZFS specifically. You can set the timeout and keep it very short, like 5 seconds. Note that this would mean disks get kicked out of the array, because they remain damaged and unfixed. Then you either do a zero-write on them to fix the damage, or you swap the drive for a new one. Both methods cost a lot more time than 60 seconds.

No! The disk should NOT be considered degraded in the RAID set for just one bad sector. That is what I mean when I say this should be just the thing for ZFS to handle (as compared to a hardware RAID card). What do you mean by a "zero-write" that takes more than 60 seconds? Do you mean write zeros to the entire disk? That makes no sense.

Ideally, after a read error is returned, ZFS should be able to get the requested data by regenerating it from parity, and this should only take a few seconds with TLER. Also, if it is not a disk failure but just a bad sector, then it would be great if ZFS could remap the bad sector somewhere else for future use, but that may be too much to ask.
 
I think you forget the point that if you prevent your HDD from fixing damage, the damage stays. By writing to that sector, you allow the HDD to swap the sector for a reserve one; in that case the damage is fixed. But unless you want to start zero-writing individual sectors and then letting ZFS scrub, a simple zero-write of the entire disk will also fix any sectors that are still readable but weak, and which might degrade further if not written to for some time.

The point is, you want your disks to be healthy. Whenever you encounter a bad sector, you want it to be healed, either by the drive itself or manually by overwriting the sector. There are multiple ways to write to the device; a simple rebuild of the affected disk should work.
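
A hedged sketch of the manual route (device names are made up, and the dd step destroys everything on that one disk, so only run it on a disk you have taken out of the pool):

# take the suspect disk offline
zpool offline tank da3
# overwrite every sector so the drive can remap anything weak or bad
dd if=/dev/zero of=/dev/da3 bs=1m
# put it back in and let ZFS resilver it from redundancy
zpool replace tank da3
zpool status tank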
 
I think you forget the point that if you prevent your HDD from fixing damage, the damage stays. By writing to that sector, you allow the HDD to swap the sector for a reserve one; in that case the damage is fixed. But unless you want to start zero-writing individual sectors and then letting ZFS scrub, a simple zero-write of the entire disk will also fix any sectors that are still readable but weak, and which might degrade further if not written to for some time.

Preventing the HDD from fixing damage? How so? Are you saying that reducing the timeout from 60 seconds to 7 seconds is "preventing"? In that case, why not say that reducing the timeout from 120 seconds to 60 seconds is preventing fixing damage? Or from 1 hour to 1 minute? Surely no timeout value is absolute. And whatever the timeout is, there will always be damage that cannot be fixed. If the HDD can be induced to remap a bad sector by writing to the sector, then great.

Or are you saying that you think it is wise to assume the entire disk needs to be zero-written just because a single read failed? I would think that ZFS could be smarter than that.
 
@lizardking009
Did you enable Command Queueing (up to queue depth 255) on the istgt daemon? Disabling queueing might lead to lower performance. Could you post some benchmarks?

I have tried different values with no change. I've run minimal benchmarks, mostly just ATTO and Windows file copy. On the ESXi VM, I should be able to get 80-90% of gigabit on a Windows file copy of a large ISO, because I've gotten that in Solaris, Windows, and Linux with two different targets.

I'm leaning towards istgt as the culprit, just because local performance is fine.

My project is still a few weeks off, so maybe something will change in istgt, or maybe FreeBSD 8.1 will get the ZFS iSCSI target from Solaris. I can only hope! Otherwise, it's Linux for me, as I don't have enough time to learn Solaris well enough to want to have it around full time.
 
I would love to move my FS to ZFS, however I have a couple of questions:

1. If I wanted to keep "folding" on my fileserver, I should go with BSD instead of Solaris, shouldn't I?
2. Which ZFS features, if any, do I lose by using BSD vs Solaris?
3. What level of sophistication is required to set up SSD caching?
 
Fantastic article, sub.mesa!
My test machine is gonna be running FreeBSD for a while :p
 
@drzzt81:
1) folding@home, you mean? Are you sure you want that? The OS tries very hard to schedule server processes; a process trying to use up all resources will increase latencies even when run at the lowest priority. I just googled a bit: the Linux client works on FreeBSD, since FreeBSD is binary-compatible with Linux binaries.

2) OpenSolaris offers you a kernel-based CIFS driver (replacing Samba) and a kernel-based iSCSI driver. Since you've got Samba and istgt on FreeBSD, there is no real missing functionality, but the kernel-based CIFS driver in particular should be much better than Samba, as Samba sucks big time. Worst open-source project I've ever used, I would say.

3) You can add SSDs as 'cache' devices to your array, or even a RAID0 of two SSDs like I use. I'm not sure what you meant by "level of sophistication"; you can do this on any ZFS pool as far as I know.
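
Adding them is a one-liner; a minimal sketch with made-up device names (multiple cache devices are striped automatically):

# two SSDs as L2ARC cache devices
zpool add tank cache ada4 ada5
# an SSD (or a partition on one) can also serve as a dedicated log device
zpool add tank log ada6
zpool status tank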
 
Preventing the HDD from fixing damage? How so? Are you saying that reducing the timeout from 60 seconds to 7 seconds is "preventing"?
Yes. In fact, most weak sectors are recovered within 10 seconds, so you already collect bad sectors on your drive by capping TLER at 5 or 7 seconds. The remaining 25% or so should recover within a minute. That leaves you with only a small percentage of sectors that would still be bad after a minute of recovery time. Still, that's a lot better than having potentially good sectors turn into bad sectors because you use TLER.

TLER just damages the drive by preventing it from doing maintenance; maintenance that ZFS is happy to wait for, but that Windows FakeRAID arrays trip over and handle erratically. But hey, you're not on Windows now, so you don't need TLER. The only reason you would need TLER on an advanced operating system is when you cannot afford your system to be unresponsive for even a few seconds, such as online transaction servers that lose a million in damages if they're offline for just 30 seconds. There, TLER is pretty much mandatory, and many drives will need to be replaced, since TLER decreases the usable lifespan of the HDDs.

In that case, why not say that reducing the timeout from 120 seconds to 60 seconds is preventing fixing damage?
One minute is pretty long; if it can't be recovered within 60 seconds, it's a permanent bad sector. That means any read request on it will fail, and the drive is at risk of being kicked out of the array. So you either replace the drive with a new one, or fix the damage yourself. Or let ZFS deal with it, but I prefer not to.

Or are you saying that you think it is wise to assume the entire disk needs to be zero-written just because a single read failed? I would think that ZFS could be smarter than that.
If you used TLER, you may have bad or weak sectors all over your disk that were not fixed because YOU TOLD THE DRIVE NOT TO; that's all TLER is about. With all those weak sectors, the number of true bad sectors might increase rapidly if you don't do something about it.

Thus, the smart thing to do would be a complete rewrite (it doesn't have to be a zero-write; a rewrite reads as well, so it's slower, but it doesn't destroy any data). This way, any weak sectors on the TLER-enabled drive will be remapped and thus fixed. This is something you would need to do regularly if you enable TLER. If you don't, you risk having a disk with all kinds of read errors on it, and if you hit the jackpot, multiple disks will have bad sectors at some key locations; then you have an array damaged beyond what RAID-Z can repair.

I see no reason for home users to use TLER at all. It's a dangerous option to enable; it generally only makes sense when the contents of the drive are not important, or when enough redundancy exists to cope with a high count of "failed" HDDs. Failed in this sense can mean just a weak sector: anything TLER trips on shows up as an I/O error to the RAID engine, which then kicks the drive out of the array. With lots of hot spares, this is the preferred choice for high-profile servers that have 200 replacement drives lying on the shelf and just need 100% uptime and 100% responsiveness.
 
I have tried different values with no change. I've run minimal benchmarks, mostly just ATTO and Windows file copy. On the ESXi VM, I should be able to get 80-90% of gigabit on a Windows file copy of a large ISO, because I've gotten that in Solaris, Windows, and Linux with two different targets.

I'm leaning towards istgt as the culprit, just because local performance is fine.

My project is still a few weeks off, so maybe something will change in istgt, or maybe FreeBSD 8.1 will get the ZFS iSCSI target from Solaris. I can only hope! Otherwise, it's Linux for me, as I don't have enough time to learn Solaris well enough to want to have it around full time.

So I created a blank 8GB file and exported it as an iSCSI target. Performance was right where I expected it: 80-90MB/sec. It seems like there is no caching going on when I export from a /dev name, whether it is ZFS or just a partition.

Any suggestions on how to get FreeBSD or istgt to enable caching?
 
Were you using zvols before? Like /dev/zvol/..... ?

Is ZFS prefetching enabled? It's disabled by default if you have less than 4GB RAM. Could you do a dd on the zvol device? Like:

# read test (safe)
dd if=/dev/zvol/name of=/dev/null bs=1m count=6000
# write test (destroys data; only use on test zvol with no filesystem on it)
dd if=/dev/zero of=/dev/zvol/name bs=1m count=6000
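
To check whether prefetching got disabled on your box, a quick look at the sysctl should tell you (1 means disabled):

sysctl vfs.zfs.prefetch_disable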

Also, what version of istgt are you running? You can check with:
pkg_version -v | grep istgt

And can i have your uname too?
uname -a
 
I have a FreeBSD 8.0 VM up and running on vSphere 4. It's fully up to date and has the latest ports tree. I have ZFS configured and a LUN exported through istgt. I'm getting an average speed of 37MB/sec. The underlying storage on this box is more than capable of maxing out gigabit Ethernet, both with Windows file sharing and with iSCSI on Windows, Linux, and Solaris.

I get identical slow performance on a physical machine (Core 2 Duo, 2.4GHz) at the office.

Any thoughts on why it's so slow?

I'd make sure Nagle's algorithm is off.
 
I would love to move my FS to ZFS, however I have a couple of questions:

1. If I wanted to keep "folding" on my fileserver, I should go with BSD instead of Solaris, shouldn't I?
Hypothetically, BSD will be much easier to set up FAH on. I haven't done it, though, so I don't know whether it's actually reasonable. There's apparently no SMP client support, though.
2. Which ZFS features, if any, do I lose by using BSD vs Solaris?
Some features are slightly slower to appear in BSD than OpenSolaris. The in-kernel CIFS server and iSCSI server, as mentioned earlier, are not in BSD.
3. What level of sophistication is required to set up SSD caching?
"zpool add mypool cache c2t3d0 log c3t4d0" just about does it. There are two kinds of disks it makes sense to use SSD for: cache and log. Cache is large, read-mostly, and can be pretty bad at writing and still have a benefit. Log is small, write-mostly, and needs to respect write barriers. Unfortunately, most non-enterprise SSDs don't respect write barriers. This means that if your computer unexpectedly loses power (the power company fails, or you trip over the UPS cord, say) then you can lose some of the most recent transactions to disk. This used to be a "goodbye pool" kind of thing, but now there's an option that usually takes care of this:
zpool import -F

Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.
People have also used a program called 'logfix' to recover.
 
Were you using zvols before? Like /dev/zvol/..... ?

Yes

Is ZFS prefetching enabled? It's disabled by default if you have less than 4GB RAM.

It's enabled. I had also bumped the RAM on the VM to 4GB.

Could you do a dd on the zvol device? Like:

# read test (safe)
dd if=/dev/zvol/name of=/dev/null bs=1m count=6000
6291456000 bytes transferred in 226.020362 secs (27835793 bytes/sec)

# write test (destroys data; only use on test zvol with no filesystem on it)
dd if=/dev/zero of=/dev/zvol/name bs=1m count=6000
6291456000 bytes transferred in 75.699318 secs (83111132 bytes/sec)

After the write test, I ran the read test again.
6291456000 bytes transferred in 60.904297 secs (103300691 bytes/sec)

Also, what version of istgt are you running? You can check with:
pkg_version -v | grep istgt

istgt-20100125

And can i have your uname too?
uname -a

FreeBSD iscsi2 8.0-RELEASE-p2 FreeBSD 8.0-RELEASE-p2 #0: Tue Jan 5 21:11:58 UTC 2010 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64
 
Your read performance sucks. :)
Not sure why, though. After your write test, some parts are still cached, so the higher read score means nothing. Your raw read capped at 27MB/s and writes at 83MB/s; quite bad scores, IMO.

These are raw speeds, without any optimization. If you repeated this on the ZFS filesystem itself you would get higher scores (for reads, anyway) due to read-ahead/prefetching and such.

Can I see your "zpool status" output? Also, did you avoid using any PCI devices in your system, including the gigabit NIC? Are the drives connected to the VM as raw disks, i.e. not as image files on the drives emulating a drive? I'm using the same istgt version as you, though.
 
Your read performance sucks. :)
Not sure why, though. After your write test, some parts are still cached, so the higher read score means nothing. Your raw read capped at 27MB/s and writes at 83MB/s; quite bad scores, IMO.

These are raw speeds, without any optimization. If you repeated this on the ZFS filesystem itself you would get higher scores (for reads, anyway) due to read-ahead/prefetching and such.

Can I see your "zpool status" output? Also, did you avoid using any PCI devices in your system, including the gigabit NIC? Are the drives connected to the VM as raw disks, i.e. not as image files on the drives emulating a drive? I'm using the same istgt version as you, though.

The base platform is a Dell PowerEdge T300 (2.83GHz quad core) running ESXi, with a PERC 6/i plus 4x500GB drives in RAID 6. Read performance in other VMs is well over 100MB/sec, and I have successfully exported LVM logical volumes and /dev-type devices with tgtd with no performance issues. Same deal with the COMSTAR target on Solaris.

Raw disks with local storage isn't really an option on ESXi and any hacks to make it work would just muddy the waters. My two available testing platforms are ESXi and my spare Core 2 Duo box that I use to rule out vSphere issues.

I'm also having the same problem on my hardware Core 2 Duo with an Intel PCIe Gigabit NIC. This box also serves iSCSI targets with Linux tgtd at full disk speeds.

Code:
iscsi2# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          label/disk1  ONLINE       0     0     0

errors: No known data errors
iscsi2#

Edit: I switched the virtual disk controller from SCSI to SAS. 97MB/sec read and 90MB/sec write from the raw volume using dd. Not fantastic for local, but acceptable. iSCSI performance is jumpy: 110MB/sec down to 50MB/sec, back up to 70MB/sec. ATTO is the same way.

Edit 2: I switched the driver on my hardware box from ata to ahci. Speeds are a lot better, with iSCSI topping out at 90MB/sec read and 58MB/sec write. That is good for the single 250GB SATA drive in that box. I'm guessing my sporadic speed issues on vSphere are related to FreeBSD and ESXi not playing nice at some level. My iSCSI target is going to use an Areca RAID controller... I hope FreeBSD's driver is better.
 
It's worse than this. When you add new disks on new controllers, the disk device names can change on you. This isn't a problem for ZFS; it reads the label off the disk and isn't bothered by the device names changing. It is a major pain for a user trying to figure out which disk is which.

Basically, you have to keep track of the disk serial numbers and which slot they are in, then use SMART tools on the system to find the serial number of the disk that had the problem, and then manually look that up in the cabinet to find the right disk. You can't see the serial number without pulling the disk out of the enclosure, which is a problem.

Also, in the case of expanders, changing cables around and inserting disks into empty slots in the expander can also change device names. Not being able to signal via the enclosure LED indicators can lead to the mistake of pulling a good drive instead of a bad one during a critical outage, taking the volume down when it could have been avoided.

BTW, this is a big problem on Unix systems in general: hardware RAID vendors have invested in GUI tools, but the core OS has none of this. Solaris included.

I am not familiar with using "smart tools" to find what the serial number of a disk is. Anyone care to share details?

ZFS is kind of weird. It's both a filesystem and a volume manager. It's a little confusing and really I'd prefer the two to be separated, but it's borne out of necessity by some of ZFS's features (e.g. only mirroring used blocks).

Why would you prefer the two be separated?

Is ZFS prefetching enabled? It's disabled by default if you have less than 4GB RAM.

Do you have a source for this claim?
 
If you don't have it, you'll need smartmontools installed. Then:

Code:
infinity:~# smartctl -i /dev/sda
smartctl 5.39 2009-12-09 r2995 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00TMA0
Serial Number:    WD-WCAPW3454489
Firmware Version: 12.01C01
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Mar 10 02:41:07 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Why would you prefer the two be separated?
Because it's the unix way :p
 
Nice thread; I'm joining the ZFS bandwagon:

Just ordered an AMD X4 600e, motherboard, and RAM, plus 4x2TB drives, to be my new NAS and webserver etc. (replacing a ReadyNAS). It's going in a normal ATX case until I expand beyond 4 disks and need HBA cards. Will post how I get on; I'll probably also set up some form of monitoring for the drive temps etc. Might post a guide if anyone is interested.
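
For the drive temperature monitoring, smartmontools usually gets you most of the way; a small sketch (device names and the attribute name vary per system and vendor, so treat them as assumptions):

# print the SMART temperature attribute for each disk
for disk in /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3; do
  printf '%s: ' "$disk"
  smartctl -A "$disk" | grep -i temperature
done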
 