OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

I have an M1015 in passthrough (ESXi 5.1). Now I want to create a RAID-Z2 pool with 10 disks (8 data + 2 parity).
Can I use RDM ( http://blog.davidwarburton.net/2010/10/25/rdm-mapping-of-local-sata-storage-for-esxi/ ) for the remaining two drives? (These two drives are attached to the standard AHCI ports of the Intel Z77 / i7-3770 motherboard. Unfortunately, passing through this AHCI (Panther Point) controller breaks the ESXi system - even with the datastore (which contains OpenIndiana) on a simple, separate PCIe x1 controller - bummer).

Will RDM disks combined with M1015 passed-through disks impair the safety of the RAID-Z2 data in any way?

I haven't done any testing with RDM myself, but in theory it shouldn't be any less safe; it is just a lot harder to maintain. If a disk fails and needs replacing or moving, it will take a bit of manual fiddling and maybe some downtime.

If you want to test whether it will be safe and not cause problems, I would recommend creating a pool (with no important data on it!) and then, while the SAN VM is offline, switching a drive from the M1015 to the onboard controller and one the other way around. Re-set up the RDM mapping for the drive you moved to onboard and boot the SAN VM back up. If your pool detects the drives fine and works, then you should be all good and safe. It also gives you an idea of the steps you may need to take later when fixing failed disks or reconfiguring your setup.

Also note that you can try upgrading to the latest patched version of ESXi 5.1, as I found this fixed a problem I was having getting the motherboard's onboard controller to pass through in 5.1. This may let you do it properly without using RDM. If you are using standalone ESXi, to do the upgrade you need to download a zip file with the latest patches, copy it to a local datastore, then SSH in and run a special command to install the zip (esxcli software vib install -d *full path to uploaded zip*).
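For reference, a rough sketch of those steps - the datastore path and zip file name below are just placeholders for wherever you uploaded the patch bundle:

Code:
vim-cmd hostsvc/maintenance_mode_enter                                      # put the host in maintenance mode first
esxcli software vib install -d /vmfs/volumes/datastore1/ESXi510-patch.zip   # full path to the uploaded zip
reboot                                                                      # then exit maintenance mode once it is back up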
Edit: but this does not seem to be your problem, as in an earlier post you said you were already on the latest patch - sorry. The best plan would be to get a second M1015 if possible (this also covers you if one of them fails, as you can move the extra 2 drives to RDM in an emergency if that happens). Another option is to get a better HBA to boot ESXi and host the local datastore. It is normally possible to get it to work, but it can depend on the motherboard and the boot config options in the BIOS.

Michael
 
Last edited:
Yeah, it's booting into the napp-it 0.9 BE...

I really, really wish I could switch to OpenIndiana but I simply can't re-create my pools by copying data to offsite storage because it's 20 TB... :( (Can anybody think of any "reasonable" way I can do this? my ZFS pools are v31)

Why not downgrade [needs a fresh install] to Solaris 11 11/11? I'm running it + napp-it 0.8, everything works fine.
 
Why not downgrade [needs a fresh install] to Solaris 11 11/11? I'm running it + napp-it 0.8, everything works fine.

I would stay with 11.1 if I were to use Solaris.

Solaris 11.1 settings are quite different from Illumos-based systems and sometimes different from 11.0. I will fix all the specials I am aware of, but do not use S11.1 for anything but testing.

What you can do:
Check if the settings work when you do them manually at the CLI, and compare with the commands that are run in /var/web-gui/data/napp-it/zfsos/_lib/zfslib.pl

See the function sub zfslib_change_property, where you will find the different commands for S11.1 and Illumos-based systems
(napp-it is open and editable).
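A quick way to look at it (nothing napp-it-specific, just grep; line numbers will differ between napp-it versions):

Code:
grep -n "zfslib_change_property" /var/web-gui/data/napp-it/zfsos/_lib/zfslib.pl   # find the function
less /var/web-gui/data/napp-it/zfsos/_lib/zfslib.pl                               # open it and compare the S11.1 branch with what works at your CLI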
 
Hi,

just wanted to mass-delete some snapshots I was creating with time-sliderd on several rpool datasets, but I had to find out that napp-it only shows snapshots on non-rpool pools. Is this correct? Is there a way around it?

Snapshots of the boot disk are handled differently from data snaps on Solaris-based systems.

Best use is:
- do not create normal data snaps on rpool
- create snaps with beadm create snapname (BE, boot environment)

This creates a bootable snap that can be selected at bootup.
Such a snap is a snap + a writable clone.

If you activate a snap (beadm activate name), it becomes the default at bootup.
If you modify anything, this is stored in the current BE you booted into
(this is often the cause of missing settings after a new BE is created and activated without an immediate reboot).

If you want to delete a BE: beadm destroy -f name
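A minimal example of the whole cycle (the BE name is just an example):

Code:
beadm create pre-update      # bootable snapshot + writable clone of the current BE
beadm list                   # shows all BEs; flags mark the active and next-boot BE
beadm activate pre-update    # make it the default at the next bootup
beadm destroy -f pre-update  # remove it again once it is no longer needed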
 
I'm having a problem with Solaris 11.1 and napp-it 0.9.

For some reason, every time I reboot the server, the SMB server service stays offline and I need to manually disable/enable it to restart it.

http://docs.oracle.com/cd/E26502_01...bbrowsingfailswhensharesmb.3donissetonzfspool

SMB Shares on a ZFS File System are Inaccessible After a Reboot

SMB shares on a ZFS file system might be inaccessible to SMB clients if you reboot the Oracle Solaris SMB server.

Run the following command to reshare the ZFS shares:

# sharemgr start -P smb zfs
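If you do not want to run that by hand after every reboot, one possible workaround (my own idea, not from the Oracle doc, and assuming legacy rc scripts are still honoured on your install) is a tiny boot script:

Code:
#!/bin/sh
# save as /etc/rc3.d/S99resharesmb and make it executable (chmod +x)
# re-shares all ZFS SMB shares after boot, same command as above
/usr/sbin/sharemgr start -P smb zfs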


... congratulations, Oracle.
You are well on your way to becoming a "certified Oracle SMB sharing engineer".
Where is the easy click-and-run Sun CIFS server?

I hope Illumos will not follow that nonsense of an overcomplicated setup.
 
Michael,

Thank you for your suggestions

Another option is to get a better HBA to boot ESXi and host the local datastore. It is normally possible to get it to work, but it can depend on the motherboard and the boot config options in the BIOS.

Michael

Can you suggest an HBA? I have tried an older PCI board and a brand new Conceptronic PCIe x1 board. After putting the chipset AHCI controller in passthrough mode, ESXi does start up, but neither board seems to show my datastore in the vSphere client, and therefore I cannot start up any virtual machine.
I cannot add another M1015 to my motherboard - it has only two PCIe 16x/8x slots. One has the M1015 and the other an Intel 10GbE card.
 
Michael,

Thank you for your suggestions



Can you suggest an HBA? I have tried an older PCI board and a brand new Conceptronic PCIe x1 board. After putting the chipset AHCI controller in passthrough mode, ESXi does start up, but neither board seems to show my datastore in the vSphere client, and therefore I cannot start up any virtual machine.
I cannot add another M1015 to my motherboard - it has only two PCIe 16x/8x slots. One has the M1015 and the other an Intel 10GbE card.

I'm not sure why you are having problems seeing datastores. I would try plugging just one drive (disconnect all other drives) into the PCIe x1 board or the PCI board you had, and doing a clean install of ESXi. If this boots into ESXi fine, then you should see your default datastore1 on the remaining space of this boot drive. Then I would upload the patch zip and get to the latest version, then enable passthrough for the motherboard controller and the M1015. Reboot, and if it still shows your datastore fine, then you may be all good to add drives to the onboard and M1015 controllers and get going. If you have problems, screenshots of your vSphere Configuration -> Storage Adapters screen, from before and after passthrough, could be helpful to post here.

Note that once you start connecting drives to the M1015 and onboard controller, the boot sectors and data on these drives can affect your system bootup. Your machine's BIOS will try to boot off any drive it can see, in an order defined by your BIOS options, which are often hard to configure perfectly. Enabling passthrough only disables the onboard controller after ESXi has booted, so before that these drives may still be queried. Non-bootable ZFS pool disks will not cause a problem, as the BIOS will skip over them, but unless your boot order can be set right in the BIOS you can have a problem which crops up later when you connect a new drive that has a valid boot sector. If this happens, on the next reboot your machine may not boot... It can be good to test your setup for this by connecting a non-ESXi bootable drive (Windows/Linux etc.) and seeing if your boot order ignores it and boots fine into ESXi, so you know you are safe.

Also note another option: if your onboard controller works fine for ESXi booting but not for passthrough, try passing through the x1 or PCI devices (or both) to the SAN VM. Depending on their chipset they may be supported by OI etc., and while they don't have the bandwidth of the M1015, for one or two drives they will be fine.

As for what other HBAs are good for booting ESXi, I'm not the best one to ask, sorry. Google searching should find some options. There are many suggestions earlier in this giant thread, so searching in here may be helpful, and some of Gea's guides have suggestions of supported SATA chipsets, I think. Remember that passthrough lets you use these extra cheap SATA controllers directly with the SAN as well, so if they don't work on one they may work on the other. Cheap cards like ones that use the SiI3132 may work.

http://blog.zorinaq.com/?e=10 <-- here is a list that may be useful, but it is for Solaris/ZFS, and not all these cards will work to boot ESXi - though most of them probably do now.

Michael
 
I am currently running a home built ZFS server with 4 x 1TB SATA drives and 4GB of RAM using Napp-it 0.9. I was looking to rebuild the box with new hardware to increase performance, and have come across the opportunity to inherit the following hardware:

I believe the chassis is a SuperMicro SuperChassis 836BE16-R1K28B
Supermicro X7DB3 motherboard with 2 Xeon Processors
8GB RAM (I'd probably shell out some $$ to max it out at 32GB)
12 x Seagate Cheetah 15K 146GB SAS drives (chassis would hold up to 16 drives)

I'm trying to figure out if it's worth starting here and spending the extra money to upgrade the RAM if the total raw storage is only around 1.7TB compared to my current setup which gives me around 3TB. Or... should I just start from scratch and build a new box based around the newer Intel e5 processors with 1+ TB drives?

Any feedback is much appreciated! Thank you.
 
Your request lacks details.

What performance are you attempting to increase? How is your current system configured?

I'm not sure where you get 1.7TB; it should only give you about 1.5TB if you use raidz1 - and it sounds like you are using raidz1 for your SATA disks. That will be the number one performance issue, unless you only use it to watch movies, in which case your likely performance issue is your network.
 
The current system is configured with raidz1 using the 4x1TB SATA disks. The two areas where I'm seeing the worst performance are my VMware lab and TimeMachine backups from 2 Macs. I have a few datastores connecting to the ZFS box using both iSCSI and NFS. The VMs run very slow and laggy, so I started troubleshooting the performance. Storage latency was the cause of the performance issues, so I tried disabling NFS sync to see if that would improve things. It helped a little, but it's still laggy and the latency is >20ms, usually somewhere around 200-800ms. I am currently running a 48-port Cisco switch, with each ESXi host connecting with 4x 1G NICs and the current box using a 1G NIC, but I'm not coming close to saturating any of the NICs.

I would ultimately like to be able to run VMware with <10ms latency and be able to saturate the dual 1G NICs in the new box. The 1.7TB was, as you stated, raw HDD capacity before giving up space to raidz1 or 2.
 
Just picked up an HP DL360 G6, planning on doing an all-in-one setup with a Norco shelf.

Would it be unwise of me to use engineering-sample CPUs for this? I'm just looking for places to save some $$, and RAM and the HBA aren't among them, so I am hoping the CPUs could be?
 
Well, your 1TB drives are not going to give you <10ms; they are rated at somewhere between 8-12ms for local access, not including your network and other overhead.

The SAS disks are likely to do it, as they should be 3-4ms. But I wouldn't want to put them in a raidz to handle VMs.
 
What would be the suggested configuration for the pool to handle VMs? I have 12 drives total, so I could create multiple pools and use the raidz1 just for standard file storage and time machine backups.
 
I'm not sure TimeMachine backups are even that friendly for it, at least looking at my wife's here.

Using mirrors would be best for VMs. Yes, it will cut your 1.7TB down to about 800GB of usable space, but it won't suffer from a lack of random I/O and should keep you under 10ms, or at least close to it.

I would be tempted to store the TimeMachine backups there also, unless you have a lot of stuff you're saving. My wife only needs 100GB, so it fits well in mine, and I reserve the raidz for larger things: VM backups/hibernation/ISOs/...
 
So, looking to see how performance would be with the following setup -

Separate physical zfs san linked to esxi host(s) over 4gbps fc (I got the fibre cards cheap, so even if I wouldn't saturate them it's still fun to use, and allows for overhead if I can push more)

On the zfs box, it will have a simple pool with 4 7200 1tb drives in raid 10 (presumably better performance characteristics than raidz1) with something like a consumer ssd (m4, 830) for the l2arc and < ? > for a zil slog. I can throw somewhere between 16gb and 32gb ram at it, depending on how much they may help.

As far as the l2arc, using linked clones with a ssd slog should make the l2arc provide a decent amount of snappiness for vm's, correct?

For the zil, I'm a little more at a loss. This isn't mission critical storage where I'm concerned if there's issues on power loss, much less since I will at least have a ups for this with clean shutdown. With that in mind, and the fact that I'm over fc, I believe that means I would disable sync writes for best performance. However, does disabling sync writes completely bypass the slog, and then basically limit my writes to what the raid 10 array can deliver, both in throughput and iops?

Essentially, I'd like to be able to use a true write-cache SSD that would remove many of the limitations spindle arrays face when several simultaneous VMs hit them at the same time. But it seems that a ZIL SLOG only gives performance gains for sync writes, and even then has latency issues short of using expensive Zeus drives. Or is my understanding off here? If it would be beneficial to use an SSD for my write workload, which SSDs would actually be good choices, and in what configuration?

For clarification, the usage pattern will have several random low loads from esxi guests stored there with occasional spikes, as well as some continuous writes for media encodes, but its also for my learning so I'd just like to be able to build a well-performing array for shared storage.
 
Last edited:
What would be the suggested configuration for the pool to handle VMs? I have 12 drives total, so I could create multiple pools and use the raidz1 just for standard file storage and time machine backups.

Remember it's mostly the number of vdevs you have that decides your performance. If you have 1 big vdev, like a 10-12 disk raidz2, you will only have that 1 vdev and roughly the random IO performance of a single drive. For sequential large-block workloads like video you may still get high throughput with this setup.

If you want the highest performance and quick response times for small data access, then multiple vdevs are the way to go. 5x 2-disk mirrors (10 disks, with 2 hot spares) would probably be the fastest, but only 5x140GB of usable storage! Note this is 5 vdevs in one pool and not 5 pools. If you do this you would probably have to add a few 3TB drives in a second pool for slow bulk storage as well.
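As a rough sketch (the pool and device names are made up, substitute your own), that 5-mirror + 2-spare layout is created in one go like this:

Code:
zpool create fastpool \
  mirror c3t0d0 c3t1d0 \
  mirror c3t2d0 c3t3d0 \
  mirror c3t4d0 c3t5d0 \
  mirror c3t6d0 c3t7d0 \
  mirror c3t8d0 c3t9d0 \
  spare c3t10d0 c3t11d0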

Also remember that lots of memory will make reads of the main common data very fast from ARC, so they don't hit the disks at all, and you can add a cheap L2ARC SSD to make read performance mostly network/CPU limited. For write performance you can turn off sync and you will get decent write performance, but it's not 100% safe for certain server VMs that host databases etc. A good SSD with power-loss protection as a log device fixes this properly, and it will let VMs complete their writes safely and quickly, which may help apparent performance. Even using one of your 15k drives as a ZIL log disk will probably give you great performance.

It's probably best just to do a bit of a trial by building up a pool and checking its performance with bonnie etc.

Michael
 
So, looking to see how performance would be with the following setup -

Separate physical zfs san linked to esxi host(s) over 4gbps fc (I got the fibre cards cheap, so even if I wouldn't saturate them it's still fun to use, and allows for overhead if I can push more)

On the zfs box, it will have a simple pool with 4 7200 1tb drives in raid 10 (presumably better performance characteristics than raidz1) with something like a consumer ssd (m4, 830) for the l2arc and < ? > for a zil slog. I can throw somewhere between 16gb and 32gb ram at it, depending on how much they may help.

As far as the l2arc, using linked clones with a ssd slog should make the l2arc provide a decent amount of snappiness for vm's, correct?

For the zil, I'm a little more at a loss. This isn't mission critical storage where I'm concerned if there's issues on power loss, much less since I will at least have a ups for this with clean shutdown. With that in mind, and the fact that I'm over fc, I believe that means I would disable sync writes for best performance. However, does disabling sync writes completely bypass the slog, and then basically limit my writes to what the raid 10 array can deliver, both in throughput and iops?

Essentially, I'd like to be able to use a true write-cache SSD that would remove many of the limitations spindle arrays face when several simultaneous VMs hit them at the same time. But it seems that a ZIL SLOG only gives performance gains for sync writes, and even then has latency issues short of using expensive Zeus drives. Or is my understanding off here? If it would be beneficial to use an SSD for my write workload, which SSDs would actually be good choices, and in what configuration?

For clarification, the usage pattern will have several random low loads from esxi guests stored there with occasional spikes, as well as some continuous writes for media encodes, but its also for my learning so I'd just like to be able to build a well-performing array for shared storage.

I think the way ZFS works is that if you disable sync writes it still does the same actions as if it had a ZIL log device, except it does not write to the log. So it's the same performance as if you had the perfect high-speed log device. This means any writes are queued up in RAM and then sent to the disks in batches every 5-10 seconds or so. This means the writes will complete quite fast and it will seem very responsive. If your machine locks up due to a memory/CPU fault/crash, or the power gets cut, then you lose the last 5 seconds of written data, but assuming you have no bad disks/HBAs hiding their internal write cache you will still have a consistent zpool - it will just be a bit out of date. This could result in problems for mail servers, for example, where the last 3 e-mails are just gone.

Note that I think once you write too much data at once to your sync-disabled pool, it will fill up the in-memory write buffers and start blocking on the writes until data is safe on disk, leaving room for new writes. This means that sustained write performance will not be that great - it is limited by your pool disks and number of vdevs etc. - but small bursts of writes will be quite fast.
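If you want to experiment with it, sync is a normal ZFS property, so you can limit it to the VM dataset instead of the whole pool (the dataset name below is just an example):

Code:
zfs set sync=disabled tank/vmstore   # disable sync writes for the VM dataset only
zfs get sync tank/vmstore            # check the current setting
zfs set sync=standard tank/vmstore   # back to the default behaviour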

Michael
 
I think the way ZFS works is that if you disable sync writes it still does the same actions as if it had a ZIL log device, except it does not write to the log. So it's the same performance as if you had the perfect high-speed log device. This means any writes are queued up in RAM and then sent to the disks in batches every 5-10 seconds or so. This means the writes will complete quite fast and it will seem very responsive. If your machine locks up due to a memory/CPU fault/crash, or the power gets cut, then you lose the last 5 seconds of written data, but assuming you have no bad disks/HBAs hiding their internal write cache you will still have a consistent zpool - it will just be a bit out of date. This could result in problems for mail servers, for example, where the last 3 e-mails are just gone.

Note that I think once you write too much data at once to your sync-disabled pool, it will fill up the in-memory write buffers and start blocking on the writes until data is safe on disk, leaving room for new writes. This means that sustained write performance will not be that great - it is limited by your pool disks and number of vdevs etc. - but small bursts of writes will be quite fast.

Michael
Thanks for the response, as it seems to confirm my interpretation. I guess I wish there was a way to use a simple yet intelligent SSD write cache - my concern is that a single sustained write larger than whatever can fit in RAM will quickly reduce all other write performance significantly and will instantly be limited to the spindle array's IOPS/speed. Does anyone know a good way to work around this other than simply adding more vdevs - is just throwing more RAM at it the only solution if I don't want to add more to the array?
 
Thanks for the response, as it seems to confirm my interpretation. I guess I wish there was a way to use a simple yet intelligent SSD write cache - my concern is that a single sustained write larger than whatever can fit in RAM will quickly reduce all other write performance significantly and will instantly be limited to the spindle array's IOPS/speed. Does anyone know a good way to work around this other than simply adding more vdevs - is just throwing more RAM at it the only solution if I don't want to add more to the array?

One solution is to just create a pool out of SSDs. This bypasses most of your problems and you will have great write performance :D

SSD pools cost a lot more per GB, so unless you have money to burn you have to keep them small and then have a second pool of large, slow, cheap disks for backup/bulk/sequential use.

You may also want to consider creating separate dedicated pools, one set up for random access and one for sequential/bulk access, and then splitting your VMs up onto whichever one suits them. Two zpools are independent and won't affect each other's performance, except for sharing network/FC bandwidth, CPU and ARC RAM. Gea has, I think, added support for partitioning drives like SSDs, and I wonder whether it would work with such a setup, so you could share L2ARC and ZIL SSDs between the two pools, which is not supported by default.

Also, I've been working on an idea to create a cheaper ZIL log device that in theory could have the performance and reliability of something like a Zeus drive, but using very cheap consumer SSDs. The setup is as follows:
1 x low-power UPS that is independent of your main UPS
1 x 12V 1A AC adapter
2 x power diodes
1 x 5V DC-to-DC buck converter
1 x high-performance consumer SSD like an Intel 520 etc.
1 x Molex to SATA power adapter
and a pair of Molex male-to-female power plugs (e.g. from an old fan)

The AC adapter plugs into the UPS and is connected through a power diode into the input of the buck converter. A male Molex adapter is plugged into the computer's internal power supply, and its 12V line is connected through a power diode into the same input of the buck converter. The 5V output of the buck converter is connected to a female Molex, which is then converted to SATA power and plugged into the SSD.

The two power diodes mix the external 12V and the internal 12V down to a stable 11.3V or so (assuming a 0.7V drop across each diode), drawing from either the internal or the external supply, which means it should never lose power except during many-hour-long power cuts. The buck converter steps this down to 5V, which keeps the SSD powered up all the time, even if the computer crashes or is powered off etc.

This alone gives us the main advantage of a high-end enterprise SSD with supercapacitors etc., plus the fact that many consumer SSDs have higher write performance than the safer enterprise versions.

But the real big speed boost comes if you can also tweak the SSD to ignore write flush requests: it will then report back to your ZFS server that it has committed its log writes much more quickly, in theory operating at the latency of the SSD's internal RAM buffers. It then has similar performance to some of the battery-backed RAM solutions out there, but at a much lower price.

Anyway, just an idea I'm working on. To make it easier to maintain long term it would be good to add a couple of status LEDs (just a resistor and an LED) so that you can quickly see whether both internal and external power are on, and ideally a buzzer that activates when internal power is on but external power fails, which could happen if your small UPS fails or the AC adapter is unplugged by mistake. Also note the small UPS needs to power its 2-3 watt load for longer than the main UPS keeps the machine alive. If the main UPS fails first and the small UPS fails some time later, the SSD will have had plenty of time to commit the last log entry - it probably only needs an extra 10 seconds of power really.

Edit: Also note that, assuming the SSDs you use have a low power draw, you should be able to power 2 or more of them with this setup, so you could have more safety with mirrored log drives if you wanted. In an ideal world someone would build a version of this that just uses a battery or supercap to keep your SSD online long enough to commit everything. You could even build one that just uses a 9V battery or 5+ AA batteries (the batteries would only drain when computer power is off, so don't leave your computer off!).

Michael
 
Last edited:
Gents,

Sorry for being a total newbie at this, but I get this warning on my newly created system.

Napp-it is at release 0.9a1

System release:

SunOS MediaServer 5.11 oi_151a5 i86pc i386 i86pc

Pool overview:



pool: MediaTank
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: none requested
config:

NAME STATE READ WRITE CKSUM
MediaTank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
c3t50024E900258B3A8d0 ONLINE 0 0 0
c3t50024E900258B3ACd0 ONLINE 0 0 0
c3t50024E900258B3AEd0 ONLINE 0 0 0
c3t50024E900258B3B0d0 ONLINE 0 0 0
c3t50024E90028CE1FDd0 ONLINE 0 0 0
c3t50024E90028CE200d0 ONLINE 0 0 0
c3t50024E90028CE239d0 ONLINE 0 0 0
c3t50024E90028CE247d0 ONLINE 0 0 0
c3t50024E90028F36BAd0 ONLINE 0 0 0
c3t50024E90028F3A0Fd0 ONLINE 0 0 0
c3t50024E90028F50E3d0 ONLINE 0 0 0

errors: No known data errors

Can someone explain this, and what steps, if any, I need to take to correct it?

Also, I have left a disk as a hot spare but can't find how to assign it as such - or is it automatic?

Any help gratefully accepted
 
There is nothing to correct. It's warning you that the booted OI supports pool features your pool is not using. I agree it is overly alarming. IIRC, OI151a5 supported async destroy of objects. You can ignore this if you don't want to make a non-revertible change to your pool.
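If you later decide you do want the newer pool version (remember it is one-way), the upgrade itself is just the standard ZFS commands:

Code:
zpool upgrade             # lists pools that are still on an older on-disk version
zpool upgrade MediaTank   # upgrades this pool; cannot be reverted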
 
Hello guys,

I currently have six WD Black 2TB drives set up as 3 two-way mirrors.

So 3 x 2TB = 6TB.

I went that way because I am hosting both VMs (over NFS) and SMB shares.

I am now running out of space, and my case and controller support 8 disks. (I could expand by 5 more if I had an HDD module + an extra controller.)

So I'm looking at two options:

Buying two SSDs, mirroring them, moving my VMs there, and reorganizing my 2TB drives into raidz2 to go from 6 to 8TB available. Or buying two 4TB 7200rpm Black drives to grow my pool to 10TB.

Any opinions?

Drawbacks I can think of:

How does a pool handle mixed 512b and 4k sector sizes?

Also, having three 2TB mirrors and one 4TB mirror.

How would SSDs perform over time with the lack of TRIM?

Thanks!
 
Last edited:
Hello guys,

Buying two SSDs, mirroring them, moving my VMs there, and reorganizing my 2TB drives into raidz2 to go from 6 to 8TB available. Or buying two 4TB 7200rpm Black drives to grow my pool to 10TB.

Any opinions?

Drawbacks I can think of:

How does a pool handle mixed 512b and 4k sector sizes?

Also, having three 2TB mirrors and one 4TB mirror.

How would SSDs perform over time with the lack of TRIM?

Thanks!

If the pool was created with 512b disks it will probably have an ashift of 9, and adding 4k drives will cause major performance problems unless you blow away and recreate your pool with ashift 12. Also, adding a new vdev to an already very full pool may cause some performance issues, as all the free space will be on just the new vdev, and I don't know if there is any auto-rebalancing option yet, or if you have to delete lots of data and then copy it back to get it balanced again.
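You can check what ashift your existing vdevs were created with using zdb (the pool name is an example):

Code:
zdb -C tank | grep ashift   # ashift: 9 = 512b alignment, ashift: 12 = 4k alignment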

You probably have 1 or 2 VMs that use loads of space, and maybe loads of bulk CIFS data, with your core VMs taking up only a few hundred GB, so migrating your VMs to a new smaller pool may make sense. Another option would be to leave your pool as-is if it is performing well, create a new pool with a few 4TB drives in raidz1 or 2, and move bulk storage onto that. Even two 4TB drives in a mirror give you an extra 4TB of bulk space, but it will be less secure.

As for TRIM, it's probably not as big a problem in practice, but it would be nice if they would add TRIM support to Illumos sometime. Different SSDs react differently to not being TRIMmed, so you could research which is best.

If you are paranoid about TRIM then you can partition your SSDs so only the first half of each is available for use. You lose half your space but should not suffer much performance loss, as the drive will never get over-full. This will also improve the SSD's wear-leveling ability and make it last longer. You could also just plan to move all the data off, secure erase, and recreate the pool again in a year's time when the lack of TRIM starts affecting you. Another interesting option is to use the redundancy of ZFS to manually remove one SSD from the pool, secure erase it in an external machine, and then reintegrate it back into the pool and resilver. This has to be repeated for each drive you want to manually optimize, so it may take a while. It's also best to do this carefully with raidz1, as you will not be redundant in the meantime! With mirrors you can take your hot spare and add it as a third mirrored drive to that vdev, and once it has resilvered, remove a drive and secure erase it, so you never risk any lack of redundancy.
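For the mirror variant, the rotation is roughly this (device names are examples, and check zpool status between each step):

Code:
zpool remove tank c1t9d0          # release the hot spare from the pool
zpool attach tank c1t0d0 c1t9d0   # attach it as a third side of one mirror vdev
zpool status tank                 # wait until the resilver has finished
zpool detach tank c1t0d0          # then detach a disk and secure erase it externally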

Also note: don't worry about the lack of TRIM support for ZIL log SSDs, as they don't need it at all - they only ever keep reusing the same small region at the start of the drive, so they never have anything that needs TRIMming.

Michael
 
Last edited:
I don't think that is accurate, Michael. It depends on the drive of course, but wear leveling should write evenly across all the cells. Background cleanup should be more than fine to handle the rest since, yes, the required size for ZIL data is minimal.
 
Hi!

About the problem of disks reporting the correct physical sector size:

I want to build a new machine with WD RED series disks.

@Gea, or anyone: do you know if the WD30EFRX works without problems and reports the correct size, or is there any trouble expected with these disks?

Greetings schleicher
 
I don't think that is accurate, Michael. It depends on the drive of course, but wear leveling should write evenly across all the cells. Background cleanup should be more than fine to handle the rest since, yes, the required size for ZIL data is minimal.

I assume you are referring to my opinion that ZIL drives don't need TRIM at all. I was curious about this, so I did some testing using VMware. I created a thin-provisioned virtual disk and used it as the ZIL for my OI VM. As soon as I started to write to the pool, this virtual disk started to get some space allocated to it, but it quickly grew to around the 1GB mark, I think, and then just stayed there no matter how much data was written through the ZIL. This tells us that ZFS only allocates the first 1GB or so of what it thinks is a physical drive, and it just constantly loops around writing to the same sectors of the disk. Now, the SSD internally will be wear leveling and remapping those sectors all over the NAND, but its internal sector allocation table - which is the issue that TRIM helps address - will only have this 1GB worth of entries, which will never grow; they just constantly change where they point. Therefore TRIM will not be needed at all.

If your SSDs are pool disks it's another story completely. ZFS is copy-on-write for pool data, and as you write it constantly uses new sectors of the disk to store data that changes, which means it will slowly spread out over the sectors of the disk; this makes the SSD's sector allocation tables grow and performance suffers. This would be worst if you ever let your SSD pool fill up, as once it's full all the sectors will be marked as in use and internally your SSD only has the over-provisioned sectors left to use to even out the wear. If your data is constantly changing this would not be an issue, but many datasets are filled with largely static data, and data that never changes will be written once and then those NAND cells will never be written again, which means the other, active cells get more writes and wear out faster. I'm guessing that some SSDs have better background wear-leveling systems that will remap some of these static cells to more-used cells as needed to fix this problem and wear the whole disk more evenly.

Also, it would be interesting to know how aggressive ZFS is at reusing empty space before it starts allocating new sectors towards the end of the disk. For example, you may have a pool that never gets more than 50% full but has constantly changing data. Would ZFS spread out over the whole disk, or is it good at reusing the unallocated sectors at the start first, so that only, say, 55% of the SSD's sectors would ever be allocated (remember it will be using 100% of the NAND cells for wear leveling, but only 55% of the virtual sectors the SSD publishes would be allocated)? In theory, if someone had the time to bother, they could test this by using VMware thin provisioning to create a set of virtual disks to make up a pool.
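If anyone wants to try this without VMware, a pool on sparse files gives a similar picture - watch how much of the backing file actually gets allocated as you write and delete data (paths are just examples):

Code:
mkdir /var/tmp/ztest
mkfile -n 10g /var/tmp/ztest/disk1          # -n creates a sparse 10GB file
zpool create testpool /var/tmp/ztest/disk1  # file-backed test pool
# ... write and delete some data on testpool, then:
du -h /var/tmp/ztest/disk1                  # real space allocated so far
zpool destroy testpool                      # clean up afterwards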

Michael
 
Hi all,

I have a script which mounts a USB device, checks that it has been mounted (so it doesn't write to a folder in /media), runs some rsync jobs, then disconnects the USB device; if the device is not mounted it will email me.

Can anyone explain why this script will run from the command line (with pfexec) but not when I set up a Napp-it job to run it? I get the error email when run from Napp-it but no problems when run from the console.

http://pastebin.com/LPs5xaYi

Any assistance would be greatly appreciated.

Thankyou
Paul
 
Hi all,

Can anyone explain why this script will run from the command line (with pfexec) but not when I set up a Napp-it job to run it? I get the error email when run from Napp-it but no problems when run from the console.

I'm not an expert on how Napp-it jobs run, sorry, but it could be something to do with the user/permissions the job is run as.

If it just sends the e-mail and never rsyncs, that means the df -H command is not seeing the disk. To help find where the problem is, you can create a script with just two lines like this:

rmmount rmdisk0 /media/usbbak > /home/user/Log/rmmount.log
df -H > /home/user/Log/df.log


Once you run this you will see what errors and output it is giving, and that may help.

Michael
 
If the pool was created with 512b disks it will probably have an ashift of 9, and adding 4k drives will cause major performance problems unless you blow away and recreate your pool with ashift 12.

Michael

Thanks for this, I thought the ashift was on each vdev and not on the pool itself.

Assuming I can re-create the pool entirely and move the data back on, is there an issue with 512b disks and ashift 12 (since I would have 6x 2TB and 2x 4TB)?

The SSD option is kinda out since I would need at least two 480GB+ SSDs, and that gets me into the $1000 range for 500GB... at least for now.
 
Thanks for this, I thought the ashift was on each vdev and not on the pool itself.

Assuming I can re-create the pool entirely and move the data back on, is there an issue with 512b disks and ashift 12 (since I would have 6x 2TB and 2x 4TB)?

Actually I think I was wrong and you are right here :)

In theory ashift is per vdev, and if you make sure you don't add a 4k disk to an existing ashift=9 vdev then you will not get the major performance-hit problem. You have to be careful though, because if the drives are Advanced Format they probably report as 512b and need manual tweaking to force ashift=12 when you add the vdev. I think mixing ashifts in the same pool is not recommended, as it may cause a few minor performance issues and also makes it harder to maintain and upgrade later.
 
Please ignore - all fixed.

I forgot the
Code:
#! /bin/bash -l
command at the start.....

I'm not an expert on how Napp-it jobs run, sorry, but it could be something to do with the user/permissions the job is run as.

If it just sends the e-mail and never rsyncs, that means the df -H command is not seeing the disk. To help find where the problem is, you can create a script with just two lines like this:

rmmount rmdisk0 /media/usbbak > /home/user/Log/rmmount.log
df -H > /home/user/Log/df.log


Once you run this you will see what errors and output it is giving, and that may help.

Michael
 
I'm planning on running an esxi box that will ultimately be storage-centric - I'd like to run basically two vm's on it, a ZFS vm that will share out higher performance storage for other esxi hosts as fc targets, and a secondary simple nas for media storage, and will run something easily expandable that I have in place, like unraid. I'm planning on each vm having a passed-through M1015 hba, and then I also have a 4gb FC hba on this box.

My question here (and please bear with me, as this is an "adventure" in fc for me to learn) - is it possible for the fc hba to be left alone in esxi and zfs export the datastores over fc via some "esxi-managed route" - in the same way that I could export over a virtualized vmxnet3 nic through the standard host ethernet - , or will I need to pass the fc card through to the zfs vm for it to function and present an array as a viable target for other hosts? In either instance, if someone has a little bit of detail more than just "yes" or "no," it'd be greatly appreciated!
 
I'm planning on running an esxi box that will ultimately be storage-centric - I'd like to run basically two vm's on it, a ZFS vm that will share out higher performance storage for other esxi hosts as fc targets, and a secondary simple nas for media storage, and will run something easily expandable that I have in place, like unraid. I'm planning on each vm having a passed-through M1015 hba, and then I also have a 4gb FC hba on this box.

My question here (and please bear with me, as this is an "adventure" in fc for me to learn) - is it possible for the fc hba to be left alone in esxi and zfs export the datastores over fc via some "esxi-managed route" - in the same way that I could export over a virtualized vmxnet3 nic through the standard host ethernet - , or will I need to pass the fc card through to the zfs vm for it to function and present an array as a viable target for other hosts? In either instance, if someone has a little bit of detail more than just "yes" or "no," it'd be greatly appreciated!

Well, from my experience of ESXi I would say that you have to pass the FC HBA through to your SAN VM just like you do with the passthrough of the M1015. That way the VM sees a real physical FC card and can use it to share storage. If an FC HBA is connected to ESXi without passthrough then it just acts as an HBA for ESXi to access remote storage and will not share out any FC LUNs. You will note that if you try to edit the settings of a VM there are simply no options to add this to a VM. You can only really add disks, networking and passed-through whole devices.

Once you have the FC HBA passed through, it's then up to you to set up the drivers and settings on your SAN VM just as if it were a physical machine with that FC HBA connected.

http://docs.oracle.com/cd/E23824_01/html/821-1459/glddq.html <-- this may help with this, and others who have more experience with FC on Solaris may be able to give some better tips.

One limitation with this method is that the local ESXi will not be able to see your FC SAN unless you have a second FC HBA in that machine, and I wouldn't think you could split a dual-channel FC card. You can always use NFS/iSCSI locally though.
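Once the HBA is passed through and in target mode (the Oracle link above covers switching it to the qlt target driver), the COMSTAR side is roughly this - the zvol name is a placeholder and the GUID comes from create-lu:

Code:
svcadm enable stmf                              # make sure the STMF service is running
stmfadm create-lu /dev/zvol/rdsk/tank/fc-lun1   # expose a zvol as a LU, prints its GUID
stmfadm add-view <GUID from create-lu>          # plus -h/-t options for host/target groups
stmfadm list-target -v                          # FC ports should show up as online targets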

Michael
 
I assume you are referring to my opinion that ZIL drives don't need TRIM at all. I was curious about this, so I did some testing using VMware. I created a thin-provisioned virtual disk and used it as the ZIL for my OI VM. As soon as I started to write to the pool, this virtual disk started to get some space allocated to it, but it quickly grew to around the 1GB mark, I think, and then just stayed there no matter how much data was written through the ZIL. This tells us that ZFS only allocates the first 1GB or so of what it thinks is a physical drive, and it just constantly loops around writing to the same sectors of the disk. Now, the SSD internally will be wear leveling and remapping those sectors all over the NAND, but its internal sector allocation table - which is the issue that TRIM helps address - will only have this 1GB worth of entries, which will never grow; they just constantly change where they point. Therefore TRIM will not be needed at all.
This sounds logical.
Also, it would be interesting to know how aggressive ZFS is at reusing empty space before it starts allocating new sectors towards the end of the disk. For example, you may have a pool that never gets more than 50% full but has constantly changing data. Would ZFS spread out over the whole disk, or is it good at reusing the unallocated sectors at the start first, so that only, say, 55% of the SSD's sectors would ever be allocated (remember it will be using 100% of the NAND cells for wear leveling, but only 55% of the virtual sectors the SSD publishes would be allocated)? In theory, if someone had the time to bother, they could test this by using VMware thin provisioning to create a set of virtual disks to make up a pool.

Michael

Good question. I'll see if I can get an answer from a Nexenta kernel guy.
 
Well, from my experience of ESXi I would say that you have to pass the FC HBA through to your SAN VM just like you do with the passthrough of the M1015. That way the VM sees a real physical FC card and can use it to share storage. If an FC HBA is connected to ESXi without passthrough then it just acts as an HBA for ESXi to access remote storage and will not share out any FC LUNs. You will note that if you try to edit the settings of a VM there are simply no options to add this to a VM. You can only really add disks, networking and passed-through whole devices.

Once you have the FC HBA passed through, it's then up to you to set up the drivers and settings on your SAN VM just as if it were a physical machine with that FC HBA connected.

http://docs.oracle.com/cd/E23824_01/html/821-1459/glddq.html <-- this may help with this, and others who have more experience with FC on Solaris may be able to give some better tips.

One limitation with this method is that the local ESXi will not be able to see your FC SAN unless you have a second FC HBA in that machine, and I wouldn't think you could split a dual-channel FC card. You can always use NFS/iSCSI locally though.

Michael

Thanks - that makes sense, as in the meantime I had done just what you mentioned: checked through what options were configurable on the VMs, and couldn't find anything that would suggest a virtualized fibre interface along the same lines as Ethernet and vmxnet.

As a quick follow-up for anyone interested, some quick poking around shows that the 4-port QLogic fibre HBA I have in the storage box does in fact allow separation into 2 dual-port passthroughs; that is to say, I could pass through 2 of the fibre ports (check one off and the second shows up as dependent) out of the 4 total on the card and leave the other 2 for ESXi if I wanted to - to allow a loop-back to other VMs that could run on that box, as awkward as that setup may be, or I suppose even to pass through to another VM on that host if you had a use for it.

On the other hand, the dual-port QLogic FC card I have can only be passed through as a whole (as logic would seem to suggest).
 
As a quick follow-up for anyone interested, some quick poking around shows that the 4-port QLogic fibre HBA I have in the storage box does in fact allow separation into 2 dual-port passthroughs; that is to say, I could pass through 2 of the fibre ports (check one off and the second shows up as dependent) out of the 4 total on the card and leave the other 2 for ESXi if I wanted to - to allow a loop-back to other VMs that could run on that box, as awkward as that setup may be, or I suppose even to pass through to another VM on that host if you had a use for it.

On the other hand, the dual-port QLogic FC card I have can only be passed through as a whole (as logic would seem to suggest).

Interesting to know. I don't think anyone has built 4-port FC chipsets yet, so 4-port cards are really just two PCI Express devices on the same card, probably with a bridge chip connecting them to the one slot. This setup also has slightly better fault tolerance, as the devices are mostly independent.
 
Hi Gea and others!

As some of you know, I use Solaris (OI, OmniOS) as a SAN controller, mostly Fibre Channel. Right now I'm fiddling with snapshotting at the SAN level, and the issue I would like to discuss here is the rollback of snapshots of LUNs (or actually the underlying volumes) in the SAN controller.

The LUN I'm testing with now is the root LUN of an Ubuntu 12.04 server; yesterday I performed the same operation on a Fedora 18 server.

What I do is simply snapshot the volume that the LUN represents:

root@san1:~# zfs snapshot mainpool/os/node1@test2

root@san1:~# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mainpool/os/node1@test2 450K - 3.04G -
rpool/ROOT/openindiana-1@install 69.8M - 2.79G -
rpool/ROOT/openindiana-1@2012-11-18-17:40:45 136M - 3.28G -

Then I make some changes on the Ubuntu server, so after that you can see the snapshot has grown from 450K to 197M:

root@san1:~# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
mainpool/os/node1@test2 197M - 3.04G -
rpool/ROOT/openindiana-1@install 69.8M - 2.79G -
rpool/ROOT/openindiana-1@2012-11-18-17:40:45 136M - 3.28G -

Then I'd like to roll back the snapshot:

root@san1:~# zfs rollback mainpool/os/node1@test2
cannot rollback 'mainpool/os/node1': dataset is busy

OK, perhaps if I offline the LUN:

root@san1:~# stmfadm offline-lu 600144F0CD16CB00000050EF19880001

root@san1:~# stmfadm list-lu -v 600144F0CD16CB00000050EF19880001
LU Name: 600144F0CD16CB00000050EF19880001
Operational Status: Offline
Provider Name : sbd
Alias : os_node1
View Entry Count : 1
Data File : /dev/zvol/rdsk/mainpool/os/node1
Meta File : not set
Size : 10737418240
Block Size : 512
Management URL : not set
Vendor ID : OI
Product ID : COMSTAR
Serial Num : not set
Write Protect : Disabled
Writeback Cache : Enabled
Access State : Active
root@san1:~# zfs rollback mainpool/os/node1@test2
cannot rollback 'mainpool/os/node1': dataset is busy

I try again:
root@san1:~# zfs rollback mainpool/os/node1@test2
cannot rollback 'mainpool/os/node1': dataset is busy

Still busy???

I don't see any other way than removing the view, deleting the LUN, and after that doing the rollback. And after the rollback, I need to create a new LUN of the volume, and a view for the LUN to the hostgroup...

That works, but it seems to me a very complicated route to go.
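For reference, that route boils down to something like this (the GUID is the one from list-lu above; the new GUID after re-creating will differ):

Code:
stmfadm remove-view -l 600144F0CD16CB00000050EF19880001 -a   # drop all views first
stmfadm delete-lu 600144F0CD16CB00000050EF19880001
zfs rollback mainpool/os/node1@test2                         # now succeeds
stmfadm create-lu /dev/zvol/rdsk/mainpool/os/node1           # returns a new GUID
stmfadm add-view <new GUID>                                  # and the host group view again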

Does anyone else here have any input on this matter...?

Regards, Johan
 
Can anyone advise whether these would be / are supported in OI151a5 to enable iSCSI between 2 OI + napp-it servers? (Or any other suggestions on how best to achieve this.)

The plan is to join the 2 servers together like this - or is there a better way of going beyond 20 drives? (I currently have 16 drives in one chassis (maxed out) and 12 in the other, primary chassis, which has space for a further 8 drives.)

I am on a very limited budget at the moment but can obtain 2 x cards and a cable for around the equivalent of $50 here in the UK - if they are any use, that is.

Advice gratefully accepted please, guys.

Doug
 
Last edited: