OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

Hmmm, no idea. Nothing shows up in zpool or zfs properties or man pages. Something napp-it does?
 
I would *assume* that this is a quota limiting the amount of available space you can fill, since a COW filesystem like ZFS does not deal well with no free space - but that's just a guess - I haven't looked at the code to see what that option is doing.
 
Yeah, I get that. What confuses me is I can't find anything the zpool or zfs commands do that relates to any kind of limit like this.
 
Gea might be able to weigh in, given he added the feature. I'm going to wait on that input, but it's looking like I'll be deleting the pools and recreating them. In one case almost 1TB has been set aside?
 
about overflow protection

It's always bad to fill your pool up to 100% with data.
Overflow protection sets a 10% reservation on the pool's ZFS dataset itself.

The available space for your ZFS folder is then 90%.
That means that if you fill your folder to the max, 10% of the pool always remains free.
You can always check/modify that in the menu ZFS folder, under reservation.
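
For reference, a minimal sketch of checking or changing this from the ZFS CLI, assuming a pool named tank (the exact property napp-it sets and the value depend on your setup):

# show the reservation on the pool dataset
zfs get reservation,refreservation tank
# change it (e.g. to 500G) or remove it entirely
zfs set reservation=500G tank
zfs set reservation=none tank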

Gea
 
So filling the pool is problematic but not the folder, if I understand correctly. But why such a large number? Can't you just get by with 0.01%? Or is the idea that the 10% gives you some room to grow in the event you have filled the folder and have to make arrangements to add additional drives/vdevs? Basically just to buy you some time without having to delete any of your data.
 
The pool is the parent dataset of all ZFS datasets that you create.
So the pool is also a ZFS folder, but with some special and predefined properties like case sensitivity.
(naming: a ZFS folder is a dataset, sometimes called a ZFS filesystem; it's like a partition on conventional filesystems)

A reservation on any dataset reduces the available space for all others.

In my understanding, you should not fill a pool to more than 90% (disks are cheap).
If you need other values, you can always change this by reducing/disabling the
reservation on this pool.

Gea
 
Many thanks Gea.

Any recommendations for multiple LAN connections, bonding, etc.? I poked around napp-it and don't see any instructions or anything specific on it.

I'm thinking of doing two bonded 1GbE uplinks for iSCSI on a VLAN for ESXi traffic,
plus a separate pair of bonded 1GbE uplinks for the rest of the network for standard CIFS/NFS file sharing and management.
 

Link aggregation may help a little with a lot of parallel connections,
but mostly it just adds complexity, with a lot of possible problems and
no or minimal performance advantage.

I always follow these rules:

1. Keep it simple.
2. Use 10 Gb Ethernet/FC if you need speed.
3. If you have an All-In-One ESXi/SAN solution, use one VLAN uplink
from ESXi to your physical switch (1 Gbit or 10 Gbit) and divide your LANs there.
Use high-speed virtual vnics to connect your VMs internally with your SAN.
4. On ESXi, use virtual software switches rather than physical NICs, except for failover.


1 Gb aggregation is outdated. 10 Gb is on the way to becoming cheap.
Currently 2 x 10 Gb cards are about 300 Euro, but you can expect them to
be onboard on better mainboards in 2012, or as cheap as good 1 Gb NICs were 5 years ago.

10 Gb on switches is currently available for about 250 Euro per port.
I use HP 2910 switches with up to 4 x 10 Gb ports. They are not really cheap
(about 1300 Euro with 24 x 1 Gb ports, plus 2 x 10 Gb for about 500 Euro),
but affordable if you need the speed.

If you only need high speed between one server and a few clients (for example,
a small video editing pool), you do not need a switch immediately; you can connect
them directly and buy the 10 Gb switch later.

Gea
 
I just found out I have HARD and Transfer errors on one of my hard drives:
Error: S:0 H:17 T:83

What do Hard and Transfer errors mean? Can a hard error be a bad sector? What would the best solution be right now - to take the drive out and check it with the manufacturer's software for defects?

Other drives report no problem.

Regards, Matej
 

Translate it to:
Currently there is no problem with data security, but keep an eye on this disk.
I would suggest using a manufacturer test tool to check this disk.
Gea
 

Hi Gea,
After doing quite a bit of research I am incredibly impressed by ZFS. Re-reading this thread (I remember reading it when you created it) and after speaking to a Solaris admin friend of mine, I am starting to implement ZFS as an AIO ESXi solution at home and standalone at work for testing and development.

I am curious, what problems have you come across with link aggregation? I have a scenario where I want to have a quad-port NIC on the host and dual quad-port NICs in each server, connected to an HP 1800-24G as their own storage network switch. I realize that for read/write speeds I would be maxing out a single 1Gb connection due to physical hardware limitations around 100MB/s, but am curious as to the benefit or waste of having these quad-port NICs bonded. Same goes for 10Gb fiber: if the hardware limitation is at 100MB/s for the disk access, what is the point of a 10Gb link? Just more speed for multiple virtual servers? We currently have the quad NICs and switch (and a spare of each on hand in case of failure). 10Gb equipment is out of budget at this point, but could be purchased as it drops in price over the next few years.


What is your recommended setup for the following drive sets in ESXi? I have read several Oracle blogs about different configs (on Sun hardware) showing several options (RAID-Z, Z2, Z3, mirror, etc.). I have 6 VMs currently and a second development box with an additional 10 test boxes. We currently have the following drives:
10 x 300 GB SAS 15k
6 x 750 GB SATA 7.2k (commercial drives)
2 x 2 TB SATA 7.2k 24/7 drives

My initial thoughts were to create the following:
Pool 1 (10 SAS drives in sets of mirrors (2x2x2x2x2) for high spindle speeds for the VMs)
Pool 2 (RAID-Z of 5 x 750 GB + 1 hot spare, for storage)
Pool 3 (2 TB mirror)
But I really like the error-correction features of RAID-Z2 and the ability to lose up to two disks without data loss. I am fine buying more disks if I can retain similar speeds for my VMs, hence the above networking question. Would investing in SSDs for the scratch files be worth it?

For reference, my hardware for the work deployment is a Supermicro MBD-X8DAH+-F-O with a single quad-core Xeon, 16 GB RAM (soon to be doubled), and an LSI controller.

Thanks!
 
With link aggregation you complicate things for often no or minimal benefit, and you add an extra
problem area, for example together with jumbo frames (mostly not working at all).
In my opinion, it's not worth the effort today.

About your pools:
If you need the best IO and speed for VMs, always use mirrors, so your pool 1 is perfect.
- Add at least one hot spare!
- You may add an SSD read cache and possibly a mirrored write log (hybrid storage).
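
For illustration, a rough sketch of how such a pool could be built from the CLI; pool and device names here are made up, not from this thread:

# striped mirrors for best VM IO
zpool create vmpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0
# SSD read cache (L2ARC)
zpool add vmpool cache c2t0d0
# mirrored log device for sync writes
zpool add vmpool log mirror c2t1d0 c2t2d0
# hot spare
zpool add vmpool spare c1t6d0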


Pool 2:
A hot spare for a RAID-Z is not very efficient.
If you have a failure, you need a rebuild onto a disk that is untested at that moment.
Use the next RAID-Z level, RAID-Z2, instead and you have a 'hot' spare.
Use hot spares with mirrors, and with RAID-Z3 if needed.

Depending on your workload, SSD cache drives can help to improve performance.
For my own use I switched to SSD-only pools as ESXi datastores
(although they are not as reliable as good SAS disks, so I use 3-way mirrors now).

The time for expensive 15k SAS is over for new installations (IMHO).


Gea
 

Today the numbers got higher, H to 19 and T to 90. I guess something is wrong with the hard drive. I will take it out and have a look.

Matej
 
The biggest problem with link aggregation is not a "problem" as much as it is a misconception. People think link aggregation of four 1GbE links gives them a 4Gb link. That's not really true - you still have four 1GbE links - you just get to see them all behind a single IP address.

In reality, any single "flow" on the aggregated link is still limited to the speed of a single link. This means that a server with a 4-link aggregation group speaking to many, many clients can have a net throughput of 4Gb, but any single client will never get more than 1Gb.

Similarly, a client might have a 4-link 1GbE aggregation group and get a net throughput of 4Gb when speaking with multiple servers, but the speed to any single server will never exceed 1Gb.

So in the example above, a server with a 4 x 1GbE aggregation group and a client with a 2 x 1GbE aggregation group will see no more than 1Gb flowing between them. Probably not what was desired...

[There is a white lie in the above. Assuming the use of sophisticated hashing methods or the use of multiple IP endpoints between the client and server, you actually can get more speed - but I'm assuming that anyone with enough understanding of how this all works to set that up probably isn't posting the question here.]
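
On that "sophisticated hashing" point, a hedged side note: on the Solaris/OpenIndiana side the distribution policy of an existing aggregation can be changed so flows are hashed on IP addresses and ports as well as MAC; a sketch, assuming an aggregation named aggr1 already exists:

# hash outgoing traffic over L2, L3 and L4 headers instead of MAC only
dladm modify-aggr -P L2,L3,L4 aggr1
# verify the policy in use
dladm show-aggr aggr1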
 
I would also never ever build a RAID-5 or RAID-Z1 from 8 drives.
Too risky to have a second disk failure during a rebuild.

I would use a RAID-Z2 with 8 disks. If you are thinking about an extra hot spare,
I would build a RAID-Z3 instead. In case of a failure, your RAID is in the same
state as a RAID-Z2 + hot spare AFTER a rebuild. Also, the extra drive is under ZFS control -
no suddenly-dead hot spare when you need it.

A hot spare is best if you have mirrors.
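
To make that concrete, a rough sketch with made-up pool and device names:

# 8-disk RAID-Z2: any two disks may fail
zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0
# with 9 disks, RAID-Z3 instead of RAID-Z2 + hot spare
zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0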


Gea

OK, I have 2 vdevs that are both raidz1. If I were to change that and re-create both of the vdevs, I would have 4 of the 8 drives for parity, which means I only get to use half of the space. Is this correct? If I have 2 vdevs that are both raidz1, doesn't that mean each vdev would have a parity drive? I'm still not clear on multiple vdevs and the raid levels for each. Should I just have 1 big vdev with all 8 drives in raidz2, so then it only takes up 2 of the drives?

Thanks again to everyone helping me on this. I really haven't found a basic tutorial on this matter, so it's still trial and error for me.
 
Not sure about your math. If you take 8 drives and create two 4-disk raidz1 vdevs and stripe them, you have 2 parity drives, not 4. Gea is recommending an 8-disk raidz2, which also gives you 6 drives' worth of data, but any two can fail. If you want really good performance and don't mind losing half the storage, go with four 2-disk mirrors striped together. You will get the best performance this way (especially for reads).
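
For clarity, a rough sketch of the two layouts being compared, with made-up pool and device names (the raidz2 variant is shown above):

# two 4-disk raidz1 vdevs striped together: two parity drives total, six drives of data
zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 raidz1 c0t4d0 c0t5d0 c0t6d0 c0t7d0
# four 2-disk mirrors striped together: half the raw capacity, best performance
zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0 mirror c0t6d0 c0t7d0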
 
Thanks Gea,
I ended up reading for several hours on link aggregation and believe I have found a setup I like. Since ESXi can only actively use one 1Gb link per datastore, it would make sense to me that it is a waste of time on its end. Setting up failover, on the other hand, is beneficial and supported for redundancy. If anything, I would use two dual-port NICs to provide 2 trunks from OpenIndiana to two switches, then connect each switch to one port on each of the servers, giving me failover redundancy. Both my switches support this, and this way I could provide each server a redundant connection to the ZFS host. If this does not give me the speed I need to each host, I can always look into setting up a more complex 1Gb connection for each datastore.

The 15k SAS drives were on hand; I will only be replacing them if they fail.
I will most likely be making pool 1 an 8-disk + 2-spare setup to start.

What have you been recommending for an SSD? I really want to incorporate one in my pool 1 for the added read speed.
 

I guess I didn't do a good enough job with my original post in this thread talking about my vdev setup. I have a pool with 2 vdevs of 4 drives each. Each vdev is a raidz1, meaning 2 parity drives for the pool in total. I read online that it's better to have more vdevs, which is why I set it up as 2 vdevs instead of 1 big vdev in raidz1.
 
Oops, forgot one thing.
I have decided to retire the 750 GB drives to the backup server. I have a spare IBM x3650 I want to turn into a ZFS box and make our backup server (with 6 x 750 GB drives). This lets me buy new drives for pool 2. This is mainly nothing but a storage pool for our fileserver. It needs 2 TB of space with room to easily grow (which is why I love ZFS). Looks like it is time to research drives :).
 
So from my understanding, you do not need a RAID controller for this? People use cards just for the extra drives they can support, if the motherboard cannot support that many? ZFS then pools the drives together?

How do you back up your ZFS config, or move it and all the drives to a new machine?
 
Correct; if you do use a RAID card, use it in JBOD mode. The RAID config is on the drives in the pool...
 
I have problems with Solaris 11 and link aggregation. I have a Dell PowerConnect switch; I put it in LAG mode and set up LACP on Solaris. I set a static IP and don't get any connection to the local network. I checked, and dladm says it is up. I switched it to DHCP and I get an IP from the router, but still no connection to the local network. WHAT'S UP? :eek:
 
The Dell PowerConnect switches require an L2 policy and a static aggregation (no LACP).

If you have 2 interfaces, say ige0 and ige1:

ipadm delete-if ige0
ipadm delete-if ige1
dladm create-aggr -P L2 -l ige0 -l ige1 aggr1
ipadm create-addr -T dhcp aggr1/v4

should work for you. This is assuming you have disabled the nwam service and are using the physical service:


svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default
 
So if you have a bunch of disks in a ZFS pool how can you move them to a new server?

How does rebuilding work when you lose a drive?
 
The Dell PowerConnect switches require an L2 policy and a static aggregation (no LACP).

If you have 2 interfaces, say ige0 and ige1:

ipadm delete-if ige0
ipadm delete-if ige1
dladm create-aggr -P L2 -l ige0 -l ige1 aggr1
ipadm create-addr -T dhcp aggr1/v4

should work for you. This is assuming you have disabled the nwam service and are using the physical service:


svcadm disable svc:/network/physical:nwam
svcadm enable svc:/network/physical:default

Thank you! :D
 
So if you have a bunch of disks in a ZFS pool how can you move them to a new server?

If you have done it as suggested (use an HBA controller, never hardware RAID),
you can just plug your disks into your new computer with any disk controller
and import your pool - no problem.

How does rebuilding work when you lose a drive?

If you have set the ZFS pool property autoreplace=on, you just need to swap
the failed drive; otherwise, plug in a new disk and do a replace failed drive -> new disk.

If your controller does not support hot-plug, you need a reboot after plugging in new disks.
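
For reference, a rough sketch of the corresponding commands; pool and device names are made up:

# let ZFS rebuild automatically onto a disk inserted into the failed disk's slot
zpool set autoreplace=on tank
# manual variant: replace the failed disk with a new one
zpool replace tank c1t3d0 c1t6d0
# watch the resilver progress
zpool status tank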

Gea
 
Hello, I run Solaris 11 Express and napp-it for stuff like SMB and zpools.

I was wondering how I make custom SMB users or groups, like users that can only access some files - they can read in one place and write in another. How does this work?
 

With napp-it you can create users and SMB groups in the menu User.
Connect from Windows as user root and set the desired file and folder ACLs
(works from Win XP Pro, Win 2003 and Win 7 Pro; problems are reported with Home editions
and Win 7 Ultimate).

Problem: Solaris ACLs are order-sensitive - Windows ACLs are not.
Non-trivial ACLs should be set from Solaris.

From Solaris you can set ACLs via the CLI or via the napp-it ACL extension
(in development; currently you can set share-level ACLs and ACLs on shared folders,
not on other files and folders).
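
For reference, a rough sketch of setting ACLs from the Solaris CLI; the user, group and path are made up for illustration:

# show the current NFSv4 ACL on a shared folder
ls -Vd /tank/data
# give group 'office' read access
chmod A+group:office:read_data/read_attributes/execute:allow /tank/data
# give user 'bob' modify rights
chmod A+user:bob:read_data/write_data/append_data/execute:allow /tank/data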


Gea
 
So if you have a bunch of disks in a ZFS pool how can you move them to a new server?

Exactly what Gea said, but to give you the specific commands:

#zpool import
to list all the pools found

#zpool import poolname
to import the pool.

If you had a failure and didn't do a clean export you just need to use the -f (force) switch
#zpool import -f poolname
 

Ok thanks for the info.
 
One remark about problems when moving a pool:

If your pool had a write cache (log) SSD that is missing on import,
you may have a serious problem with the import (the pool may be lost).

If there are problems, it sometimes helps to try importing the pool in read-only mode
or with the newest available ZFS OS (use a bootable live DVD).
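
For reference, a hedged sketch of those import options (pool name is an example; flag support depends on your ZFS version):

# try a read-only import
zpool import -o readonly=on tank
# on newer ZFS versions, import even if a log device is missing
zpool import -m tank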

But usually it is absolutely trouble-free -
it does not matter whether you exported the pool correctly or whether you moved it
from a dead machine without a proper export.


Gea
 

Great job on the project. I finally got my all-in-one build working this weekend and everything has been fantastic so far, except for sharing issues with Windows 7 Ultimate and Home.

I had no problems getting sharing to work on Windows XP or 7 Professional, but I am wondering if there is a suggested workaround for the SMB sharing issues on the other versions of Windows? I have been looking all weekend for a good workaround and have not found an easy solution.
 

Can you explain to me what a cache SSD is? I see no mention of it in the guides. You should add it to the first page. I am pretty sure I know what it does - it is just like a normal HDD cache, but for the entire ZFS pool?

Can you select an SSD to use from the napp-it GUI?
 
Hi,

I would like to say thank you, Gea, for your napp-it.

I have a small question: what is a "raw LU"?
It is under Comstar / Logical Units / create raw LU. The Sun documentation doesn't mention it.

I tried to create a raw LU and share it via iSCSI, but ESXi can't create VMFS on it. ESXi can create VMFS on a thin-provisioned LU or a volume LU successfully.

Best regards.
 