OpenSolaris derived ZFS NAS/ SAN (OmniOS, OpenIndiana, Solaris and napp-it)

unclerunkle

Weaksauce
Joined
Nov 9, 2010
Messages
94
Hello everyone -

This may be a dumb question, but after changing the settings in power.conf, how can I tell whether it's actually working? Also, is there a power log somewhere to see when/why the disks spin up and down? I'm hoping to eliminate any automated processes spinning up the disks.

Here are my settings for reference:
Code:
Status

fmri         svc:/system/power:default
name         power management
enabled      true
state        online
next_state   none
state_time   Sat Mar  1 15:59:13 2014
logfile      /var/svc/log/system-power:default.log
restarter    svc:/system/svc/restarter:default
dependency   require_all/none svc:/system/filesystem/minimal (online)
Code:
device-dependency-property removable-media /dev/fb
autopm                  enable
autoS3                  default
cpu-threshold           1s
# Auto-Shutdown         Idle(min)       Start/Finish(hh:mm)     Behavior
#autoshutdown            30              9:00 9:00               noshutdown
cpupm  disable
#device-thresholds         /dev/dsk/c2t0d0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDDF84Cd0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDD228Ad0     15m
device-thresholds         /dev/dsk/c6t5000C5006D028327d0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDEA757d0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDCE307d0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDDBDADd0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDD03DAd0     15m
device-thresholds         /dev/dsk/c6t5000C5006D16ECBDd0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDD1722d0     15m
device-thresholds         /dev/dsk/c6t5000C5006CDDC3B1d0     15m

device-thresholds         /dev/dsk/c6t50014EE103AB86E4d0     30m
device-thresholds         /dev/dsk/c6t50014EE103AB80FFd0     30m
Note: I don't want CPU throttling, so I turned that feature off.
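(For what it's worth, a hedged sketch of one way to check this yourself, assuming DTrace is available as it normally is on OmniOS: watching the disks shows whether they actually go idle long enough to spin down, and the DTrace io provider can show which processes generate the I/O that wakes them.)
Code:
# watch per-disk activity; a disk that is allowed to spin down should show no reads/writes for the threshold period
iostat -xn 30

# count disk I/O by process name until Ctrl-C (asynchronous writes may be attributed to "sched")
dtrace -n 'io:::start { @[execname] = count(); }'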
 

grendel19

Gawd
Joined
Jun 26, 2009
Messages
579
I couldn't open my shares today and noticed that the SMB server service had suddenly stopped.
Code:
root@omnios:~# svcs -xv svc:/network/smb/server:default
svc:/network/smb/server:default (smbd daemon)
State: offline since Sat Mar 8 19:31:36 2014
Reason: Service svc:/network/rpc/bind:default
is not running because a method failed.
See: http://illumos.org/msg/SMF-8000-GE
Path: svc:/network/smb/server:default
svc:/system/idmap:default
svc:/network/rpc/bind:default
See: man -M /usr/share/man -s 1M smbd
Impact: This service is not running.
So I then checked the rpc/bind service:
Code:
root@omnios:~# svcs -xv svc:/network/rpc/bind:default
svc:/network/rpc/bind:default (RPC bindings)
State: maintenance since Sat Mar 8 20:04:47 2014
Reason: Start method exited with $SMF_EXIT_ERR_CONFIG.
See: http://illumos.org/msg/SMF-8000-KS
See: man -M /usr/share/man -s 1M rpcbind
See: /var/svc/log/network-rpc-bind:default.log
Impact: 6 dependent services are not running:
svc:/network/rpc/gss:default
svc:/network/smb/client:default
svc:/network/smb/server:default
svc:/system/idmap:default
svc:/system/filesystem/autofs:default
svc:/network/rpc/smserver:default
Checking the log file:
Code:
root@omnios:~# tail /var/svc/log/network-rpc-bind:default.log
[ Mar 8 19:54:15 svc.startd could not set context for method: ]
[ Mar 8 19:54:15 chdir: Permission denied ("/root") ]
[ Mar 8 19:54:15 Method "start" exited with status 96. ]
[ Mar 8 20:04:38 Leaving maintenance because disable requested. ]
[ Mar 8 20:04:38 Disabled. ]
[ Mar 8 20:04:47 Enabled. ]
[ Mar 8 20:04:47 Executing start method ("/lib/svc/method/rpc-bind start"). ]
[ Mar 8 20:04:47 svc.startd could not set context for method: ]
[ Mar 8 20:04:47 chdir: Permission denied ("/root") ]
[ Mar 8 20:04:47 Method "start" exited with status 96. ]
I haven't made any changes; this just suddenly happened. Any ideas?
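(A hedged guess based on the log above, not a confirmed fix: the start method is denied a chdir into /root, so it may be worth checking whether the permissions or ownership of /root have changed, and then clearing the maintenance state.)
Code:
ls -ld /root                                  # the method context chdirs here and is denied
# if the mode/owner look wrong, restore them, e.g.:
# chmod 700 /root && chown root:root /root
svcadm clear svc:/network/rpc/bind:default    # take the service out of maintenance and retry
svcs -xv svc:/network/smb/server:default      # check whether the dependents came back online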
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
Man, I could really use some help. I'm getting frustrated, but I can't afford to do anything else right now. napp-it has marked a bunch of drives as unavailable, but if I run fmadm repair it can't find anything wrong with anything. What can I do to get my pools back online, so I can at least figure out some way to get my data off if this is just going to crap out on me :(.
 

cw823

n00b
Joined
Mar 15, 2006
Messages
34
Man, I could really use some help. I'm getting frustrated, but I can't afford to do anything else right now. napp-it has marked a bunch of drives as unavailable, but if I run fmadm repair it can't find anything wrong with anything. What can I do to get my pools back online, so I can at least figure out some way to get my data off if this is just going to crap out on me :(.

Have you tried to export and then import your pool?
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
[Attached screenshot: zfs.PNG]
Won't let me export. I shut down and reseated all the connectors and such between my M1015 and the Intel SAS expander, and still nothing.
 

chune

Weaksauce
Joined
Nov 2, 2013
Messages
70
[Attached screenshot: zfs.PNG]
Won't let me export. I shut down and reseated all the connectors and such between my M1015 and the Intel SAS expander, and still nothing.

There has been much debate about problems with ZFS when using SATA drives behind SAS expanders. Can you bypass the expander and see if the drives show up? Not sure why you are using one anyway if you only have 4 drives...
 
Last edited:

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
Not enough onboard ports to plug that RAID pool in. But yeah, it's not just the 5 drives LOL, it's a full Norco chassis.
 

chune

Weaksauce
Joined
Nov 2, 2013
Messages
70
Not enough onboard ports to plug that RAID pool in. But yeah, it's not just the 5 drives LOL, it's a full Norco chassis.

Not sure I follow; the pool you screencapped only has the four disks... Throw the disks in a different server to rule out hardware?
 

twistacatz

Limp Gawd
Joined
Jan 3, 2005
Messages
182
I didn't screencap the other pools as they are fine-ish; a couple of unavail drives on them as well

If some drives are coming up and some are not it sounds like you have a hardware issue. What is the topology of your system?

Is it possible one of your backplanes has gone bad? I know Norco backplanes are notorious for dying. What Norco case do you have?
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
If some drives are coming up and some are not it sounds like you have a hardware issue. What is the topology of your system?

Is it possible one of your backplanes has gone bad? I know Norco backplanes are notorious for dying. What Norco case do you have?

I've got an M1015 in a PCIe slot, of course, connected to an Intel RES2SV240. From there I have SATA breakout cables to the HDD bays; it's a Norco 4020. I'm strongly suspecting the backplane at this point, because I started moving the SAS connectors around on the expander last night and still got nothing, and even moved it directly to the controller and still got nothing. But this morning the system stopped responding, and HDD activity LEDs all over the place are flashing, making me think it's doing something, but ironically I don't have a monitor I can hook up to see what's going on, and the web interface seems to have stopped working. I'd really like to get my hands on a quality Supermicro chassis, but I don't have a job at the moment so it's not even feasible.
 

twistacatz

Limp Gawd
Joined
Jan 3, 2005
Messages
182
I've got an M1015 in a PCIe slot, of course, connected to an Intel RES2SV240. From there I have SATA breakout cables to the HDD bays; it's a Norco 4020. I'm strongly suspecting the backplane at this point, because I started moving the SAS connectors around on the expander last night and still got nothing, and even moved it directly to the controller and still got nothing. But this morning the system stopped responding, and HDD activity LEDs all over the place are flashing, making me think it's doing something, but ironically I don't have a monitor I can hook up to see what's going on, and the web interface seems to have stopped working. I'd really like to get my hands on a quality Supermicro chassis, but I don't have a job at the moment so it's not even feasible.

Yeah man, I think there's a Norco owners thread somewhere around here. I would check it out and hit up Google to see what you can find. Like I said, you wouldn't be the first person to have issues with a Norco backplane.

I know you can't make the purchase now, but I would keep my eye on the deals thread, as I've seen a lot of people buy used Supermicro rigs for the low.

Good Luck
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
Well, I managed to hook up a monitor today. All 24 drives are seen on boot, so it's purely got to be something with ZFS.

EDIT: Did some checking and all 25 drives are showing up to the OS just fine; I'm even able to use the activity LED to identify a drive from napp-it with no problem, so I really have no clue how to clear the drives and bring my pool back online :(
 
Last edited:

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
Well, I managed to hook up a monitor today. All 24 drives are seen on boot, so it's purely got to be something with ZFS.

EDIT: Did some checking and all 25 drives are showing up to the OS just fine; I'm even able to use the activity LED to identify a drive from napp-it with no problem, so I really have no clue how to clear the drives and bring my pool back online :(

If your hardware is working properly and the WWN IDs have not changed for whatever reason, the pool should be imported on reboot without any action.

The typical ZFS mechanisms for such problems are:
- clear errors (napp-it menu Pools), which clears former errors that are no longer valid
- export + import the pool, where disk IDs are newly assigned if needed

If you cannot export + import, you can unplug the pool disks, boot up, hot-plug the disks and try an import.

As this does not happen, you must think of reasons why the disks are unavailable. These are usually driver + HBA problems, power and cabling problems, SAS expander problems with SATA disks, and single disks that are blocking the controller.

Connect one disk and then another to SATA, then to the IBM without the expander, to check whether that disk is available then (the pool stays unavailable unless enough disks are detected and online).
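(For reference, a hedged sketch of the plain-CLI equivalents of those napp-it menu actions; "tank" is just a placeholder, substitute the real pool name.)
Code:
zpool status -v          # show pool state and which disks are UNAVAIL
zpool clear tank         # clear error counters / old faults
zpool export tank        # release the pool ...
zpool import             # ... scan the disks and list importable pools, then:
zpool import tank        # re-import; device IDs are picked up again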
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
The disks are already directly connected to the IBM without the expander. The OS sees the disks; I just don't know how to clear the suspended I/O state. I'm getting so fed up with ZFS. I thought it would make things easier to manage, but it's been more of a pain than just going with a true RAID setup.
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
The disks are already directly connected to the IBM without the expander. The OS sees the disks; I just don't know how to clear the suspended I/O state. I'm getting so fed up with ZFS. I thought it would make things easier to manage, but it's been more of a pain than just going with a true RAID setup.

You are working with a true RAID setup that is more intelligent than any hardware RAID. But that does not help if your OS knows about the pool while the disks are not available.

Have you tried:

- clear errors (napp-it menu Pools)
- clear the napp-it cache (menu Disks - delete disk buffer, and ZFS filesystems - delete ZFS buffer)
- pool import
- pool export + import
- check whether one single disk or another is online (in case one disk is blocking all the others)
- use SATA (in case the IBM is bad)
- connect directly without the backplane (in case the backplane is bad)
 
Last edited:

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
You are working with a true RAID setup that is more intelligent than any hardware RAID. But that does not help if your OS knows about the pool while the disks are not available.

Have you tried:

- clear errors (napp-it menu Pools)
- clear the napp-it cache (menu Disks - delete disk buffer, and ZFS filesystems - delete ZFS buffer)
- pool import
- pool export + import
- check whether one single disk or another is online (in case one disk is blocking all the others)
- use SATA (in case the IBM is bad)
- connect directly without the backplane (in case the backplane is bad)

If I do clear errors I get the following: cannot clear errors for NFS: I/O error
- clearing the napp-it cache: no errors
- pool import: NFS SUSPENDED (one or more devices are unavailable in response to I/O failures; the pool is suspended)
- pool export + import: cannot export 'NFS': pool I/O is currently suspended
- according to both napp-it and the OS, all disk drives from that pool are online; if one is blocking the rest, I have no idea how to check, but no activity LEDs are lit solid
- can't use SATA, as there aren't enough onboard ports for those drives plus my OS drive, so I wouldn't get anywhere
- if the backplane were bad, wouldn't they not show up in the BIOS either?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
If I do clear errors I get the following: cannot clear errors for NFS: I/O error
- clearing the napp-it cache: no errors
- pool import: NFS SUSPENDED (one or more devices are unavailable in response to I/O failures; the pool is suspended)
- pool export + import: cannot export 'NFS': pool I/O is currently suspended
- according to both napp-it and the OS, all disk drives from that pool are online; if one is blocking the rest, I have no idea how to check, but no activity LEDs are lit solid
- can't use SATA, as there aren't enough onboard ports for those drives plus my OS drive, so I wouldn't get anywhere
- if the backplane were bad, wouldn't they not show up in the BIOS either?

Connect only one disk or another to SATA to check whether these disks are displayed as online; it does not matter if the pool remains offline. Remove the IBM for this test. Be aware that these disks get new short port numbers like c0t1d0 on AHCI SATA (only an LSI SAS2 controller in IT mode shows the disk's unique WWN number as the disk ID).

If the disks are seen, call pool import to check whether the pool is reported as not importable because too many disks are missing. In that case the problem is located at the IBM or the backplane.

If you can, move the disks to another server where you can import the pool (to rule out a power problem).
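(A hedged sketch of what that check could look like from the shell after a reboot; the read-only import at the end is my own assumption-level suggestion for copying data off if the pool will not import normally.)
Code:
format                             # list the disks the OS currently sees (quit at the prompt)
zpool import                       # scan and report which pools are visible and whether enough disks are present
# zpool import -o readonly=on NFS  # last resort: try a read-only import to rescue data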
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
OK, that makes sense. I'll have to give it a shot later tonight/tomorrow and see what I can turn up. I guess when I get a job again I need to look into new hardware, maybe :(
 

mkush

n00b
Joined
Aug 2, 2012
Messages
36
A strange thing happened and I'd like someone to reassure me that nothing is really wrong. It caused me to lose confidence in my box and OmniOS/illumos, so I hope there is a good explanation.

What happened is that I built my box, installed OmniOS and did my tried-and-true command sequence to share out my pool via iSCSI. I was investigating other paths but in the end that is what I've come back to. The iSCSI target is connected to a Mac Pro (OS X 10.9.x). The zpool is about 32TB in size. As you'll see below I initially left about 1TB free, then subsequently 2TB. This was only to keep up performance as the iSCSI volume approaches capacity.

Here is what I did originally:

Code:
zpool create -f zp raidz2 [disk list]        # create the raidz2 pool
zpool set feature@lz4_compress=enabled zp    # enable the lz4 feature flag on the pool
zfs create -s -V 31T zp/iSCSI                # sparse (thin-provisioned) 31T zvol to export over iSCSI
svcadm enable stmf                           # enable the COMSTAR STMF framework
pkg install storage-server                   # install the COMSTAR/iSCSI target packages
svcadm enable iscsi/target                   # enable the iSCSI target service
sbdadm create-lu /dev/zvol/rdsk/zp/iSCSI     # create a logical unit backed by the zvol
itadm create-target                          # create an iSCSI target
stmfadm add-view [GUID from sbdadm above]    # expose the LU to all initiators
I then connected to it from my Mac, formatted it and copied many terabytes of data to it. Finally I scrubbed it, no data errors, cool! I should mention that my data to be copied to it was on individual 4TB drives and it took each one probably 10 hours to copy, maybe a bit more.

Then I realized that I'd left out something important: when I created the filesystem (zfs create), I neglected to turn on lz4 compression. Oops. I figured, no big deal, I'll just recreate it and copy the stuff again.
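(Side note, hedged: compression can also be switched on for an existing volume with zfs set, but it only applies to blocks written after the change, so data already copied stays uncompressed; hence the recreate.)
Code:
zfs set compression=lz4 zp/iSCSI            # new writes are lz4-compressed from here on
zfs get compression,compressratio zp/iSCSI  # verify the setting and the achieved ratio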

So, I got rid of the view, target, lu, and filesystem (in that order). Then I redid the above commands, starting with the zfs create like so:

Code:
zfs create -s -o compression=lz4 -V 30T zp/iSCSI
Followed by the other steps to create the lu/target/view. Note that I did NOT recreate the pool, just the filesystem on it. Reconnected the Mac, formatted, started copying files. Except this time, the Mac said it was going to take a couple of DAYS to copy the files. I left it all day, came back and it was still less than half. So there was no doubt that it was MUCH slower than the first time.

Baffled and doubting my design choices, I just redid the whole thing including the zpool itself. Why I didn't do that before is a mystery to me since it's just one command more, and the pool didn't contain any other data. The Mac is currently copying files and it is again fast, in fact it seems faster than the first time which may make sense given the lz4 compression this time (less data to write to disk).

So after all that blah blah, does anyone see a reason that the performance was so much slower the second time and why recreating the zpool seemed to "fix" it?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
Have you modified any other settings?
Compression needs some CPU power but improves performance in most cases.

I would:
- disable compression and compare performance
- use iostat to check CPU load and compare the busy/wait values of the disks

- check/set the iSCSI blocksize (try higher values like 64KB; see the sketch below)
- check the LU writeback cache setting (on = fast, off = secure but slow without a fast ZIL)

- use a local benchmark like bonnie to check basic values
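(A hedged sketch of how those checks might look on the CLI; the zvol name zp/iSCSI comes from the earlier posts, and wcd is, as far as I know, the COMSTAR write-cache-disable property, so check the stmfadm man page before relying on it.)
Code:
iostat -xn 5                             # compare per-disk busy (%b) and wait columns under load
zfs get volblocksize zp/iSCSI            # the zvol blocksize; it can only be set at creation time
# e.g. recreate the zvol with a larger blocksize:
# zfs create -s -o compression=lz4 -o volblocksize=64K -V 30T zp/iSCSI
stmfadm list-lu -v                       # shows "Writeback Cache Disable" per LU
# stmfadm modify-lu -p wcd=false <GUID>  # assumption: re-enable writeback cache (fast, less safe)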
 

mkush

n00b
Joined
Aug 2, 2012
Messages
36
No other settings modified. In fact, the commands shown are everything I did with the exception of setting up the NIC (an Intel x540-T1 10GbE, jumbo frames enabled). Very vanilla.

The system has plenty of CPU avail, it's a Xeon E5-2620 v2 (6 cores at 2.1GHz), 64GB memory.

To me the question boils down to: why does a "fresh" pool, one that never had a zfs create run on it, perform so much better than a pool which had a filesystem created and deleted on it?

In other words, the same commands run on a pre-used but now-empty pool vs. on a freshly created pool result in vastly different performance. I don't see how lz4 has anything to do with it... that has effectively been "factored out".
 

mkush

n00b
Joined
Aug 2, 2012
Messages
36
Didn't touch that setting, it's always been on.

I think I can summarize best like this:

1. Create vanilla OmniOS install
2. Create pool
3. Create volume (no lz4), share with iSCSI, write lots of data from Mac

-> PERFORMANCE IS GREAT

4. Destroy volume and iSCSI "share"
5. Create volume (with lz4), share with iSCSI, write lots of data from Mac

-> PERFORMANCE IS BAD

6. Destroy volume, iSCSI "share" and POOL
7. Create pool
8. Create volume (with lz4), share with iSCSI, write lots of data from Mac

-> PERFORMANCE IS GREAT

In the above I noted whether or not lz4 was enabled but I do not believe it matters here since it was enabled for both "try 2" and "try 3" above. It seems to me that it is then "factored out" of the equation. The only difference seems to be whether the volume was created on a fresh pool or not.

By the way, I'm now copying my third huge disk of data to the server and it is still very fast.
 

Freak1

Limp Gawd
Joined
Sep 9, 2009
Messages
191
Hi.

I'd like to update my all-in-one. I have ESXi 5.0 and napp-it 0.8h nightly May.03.2012 on OpenIndiana.

Last time, when I updated to ESXi 5.0, you (_Gea) told me to do a fresh installation rather than an update. Is that also recommended this time? Is there a mini how-to for doing the update?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
You need to

1. update ESXi
- download the new ISO, boot from it and select update
- do not update the virtual machine hardware version to v10 if you use ESXi free

2. update OI and the VMware tools,
or better, reinstall OI, install the VMware tools and install napp-it via wget

or use the prebuilt napp-it virtual storage appliance (OmniOS stable),
downloadable from napp-it.org
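(For reference, the napp-it wget installer mentioned above is, as far as I recall, this one-liner, run as root inside the OI/OmniOS VM.)
Code:
wget -O - www.napp-it.org/nappit | perl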
 

kronik8

Limp Gawd
Joined
Oct 16, 2002
Messages
167
So, when I reboot my file server, after unlocking the encrypted volume, SMB needs to essentially be disabled and re-enabled before it'll work. This does not happen with NFS. Anything I can do about this?
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
Connect only one disk or another to SATA to check whether these disks are displayed as online; it does not matter if the pool remains offline. Remove the IBM for this test. Be aware that these disks get new short port numbers like c0t1d0 on AHCI SATA (only an LSI SAS2 controller in IT mode shows the disk's unique WWN number as the disk ID).

If the disks are seen, call pool import to check whether the pool is reported as not importable because too many disks are missing. In that case the problem is located at the IBM or the backplane.

If you can, move the disks to another server where you can import the pool (to rule out a power problem).

I removed the IBM and connected the disks directly, and I'm still not able to import the pool due to too many I/O errors. Would reinstalling the OS just clear everything, or is the suspension state stored on the drives as well? This particular RAID pool is no big loss, but my other one would be; that one responds just fine, though.
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
So, when I reboot my file server, after unlocking the encrypted volume, SMB needs to essentially be disabled and re-enabled before it'll work. This does not happen with NFS. Anything I can do about this?

Have you tried to restart the NFS service?

svcadm disable nfs/server
svcadm enable nfs/server
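(If the same restart trick is wanted for the SMB side, which is what the question was about, the equivalent would presumably be:)
Code:
svcadm disable smb/server
svcadm enable smb/server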
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
I removed the IBM and connected the disks directly, and I'm still not able to import the pool due to too many I/O errors. Would reinstalling the OS just clear everything, or is the suspension state stored on the drives as well? This particular RAID pool is no big loss, but my other one would be; that one responds just fine, though.

There is nothing stored on the disks (the suspended state is not persistent). If the I/O error remains on a different controller, one or more disks are damaged; a damaged disk can block all disks on a controller, even good ones. Another possible problem is the PSU (not enough / bad power).
 

Freak1

Limp Gawd
Joined
Sep 9, 2009
Messages
191
You need to

1. update ESXi
- download the new ISO, boot from it and select update
- do not update the virtual machine hardware version to v10 if you use ESXi free

2. update OI and the VMware tools,
or better, reinstall OI, install the VMware tools and install napp-it via wget

or use the prebuilt napp-it virtual storage appliance (OmniOS stable),
downloadable from napp-it.org

Thanks. I'd like to use the new OmniOS. Do I need to export the pool or anything like that before I shut down OpenIndiana?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,032
It is good practice to export before an import, but if you forget, you can import without a previous export. All pool properties are kept. You may need to recreate users and reassign permissions.
 

moose517

Gawd
Joined
Feb 28, 2009
Messages
640
There is nothing stored on the disks (the suspended state is not persistent). If the I/O error remains on a different controller, one or more disks are damaged; a damaged disk can block all disks on a controller, even good ones. Another possible problem is the PSU (not enough / bad power).

I really hope it's not a PSU problem; I'm on my third one now. I would think 560W would be sufficient for 24 drives, wouldn't it?

EDIT: Managed to get it to recognize 2 of the disks in the "damaged" pool, but now my other pool is suspended and unavailable. WTF.

EDIT2: Whatever. I'm done with this ZFS bullshit. There isn't a damn thing wrong with the drives; garbage RAID system. Screw the 30TB of movies I had stored on it.
 
Last edited:

schleicher

n00b
Joined
Jun 4, 2012
Messages
39
ZFS is not garbage; your hardware maybe is... I personally know a dozen people who are totally happy with ZFS, and this forum and others prove you wrong :)
 