OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

You do not want open Internet access to the management interface unless you use a VPN router, e.g. GL.iNet with WireGuard. With a VPN you can even simply use SMB. If you want to use Apache, you can configure it at /var/web-gui/_my/tools/apache/httpd.conf

Without a VPN, enabling a web-based Amazon S3 compatible cloud (MinIO) is faster and more secure.
 
I have manually replicated a ZFS filesystem with all intermediary snaps to another host using zfs send. Is there any way to convert this manually replicated ZFS filesystem into a scheduled replication using napp-it? Or do I have to start all over?
 
A manual replication and a napp-it replication job do basically the same thing, with one difference: the naming of the snaps. A manual replication can use any snap names, while a napp-it replication job uses snap names that contain "repli", source/target and an incrementing number.

To convert:
- create a replication job with the same source/target (use -I if you want intermediate snaps, or -r for all filesystems/snaps recursively)
- rename the last source and target snaps accordingly
- start the job; it should continue, or give a snap error

If you are unsure about naming, create a test replication and adjust the names accordingly; see the sketch below.
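As an illustration, a minimal sketch of the rename step with plain ZFS commands. The filesystem names, the job id 1037 and the exact snap-name pattern are hypothetical; copy the real pattern from your test replication before renaming anything:

Bash:
# show the newest snap on source and target
zfs list -t snapshot -o name -s creation -r tank/data | tail -1
zfs list -t snapshot -o name -s creation -r backup/data | tail -1
# rename the last manual snap pair to the napp-it scheme (pattern is illustrative)
zfs rename tank/data@manual-final tank/data@1037_repli_zfs_backup_nr_1
zfs rename backup/data@manual-final backup/data@1037_repli_zfs_backup_nr_1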
 
I have already started with this. :-/

It's driving me crazy. After spreading a whole 18 TB across several individual hard drives and USB sticks and almost completely copying it back, it hangs again. 4 resilvers on 5 disks, I have never seen that before...

Bash:
root@aio-pod:~# zpool status -v smallpool
  pool: smallpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 22 04:02:31 2023
        3.47T scanned at 24.1M/s, 3.21T issued at 22.3M/s, 12.1T total
        1.61T resilvered, 26.52% done, 4 days 20:11:42 to go
config:

        NAME          STATE     READ WRITE CKSUM
        smallpool     DEGRADED     0     0   271
          raidz1-0    DEGRADED     0     0   630
            c1t6d0    DEGRADED    23     0     0  too many errors  (resilvering)
            c1t9d0    DEGRADED     2     0     5  too many errors  (resilvering)
            spare-2   ONLINE       0     0     9
              c7t0d0  ONLINE       0     0     0  (resilvering)
              c1t8d0  ONLINE       0     0     0  (resilvering)
            c1t10d0   ONLINE       0     0     0
        spares
          c6t2d0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:


        <0xca21>:<0xc>
        <0x680>:<0xc>
        <0x680>:<0x181>
        <0x680>:<0x18a>
        smallpool/backup_appliance@30min-1578647679_2023.03.18.11.30.16:<0x0>
        smallpool/backup@stuendlich-1578553078_2023.03.23.15.00.17:/Veeam/Backup Job HP/Backup Job HP2019-10-09T071722.vib
        smallpool/backup@stuendlich-1578553078_2023.03.23.15.00.17:/Veeam/Backup Job WS001/Backup Job WS0012019-01-07T040115.vbk
        <0x10193>:<0x181>

The corrupt files are only in snapshots; I don't care about those, and the live files should still be intact. What are these <xyz>:<xyz> entries in the list of corrupt files?

OmniOS runs in a VM with the disks passed through from an LSI 2008 SAS HBA. Should I wait for the resilver or back up the data again? Or pass the disks to the VM via USB? According to an external S.M.A.R.T. test, the disks have no problems.

By the way, judging from the log, the problem probably lies more with my ESXi:

Bash:
2023-03-24T22:04:50.221Z: [scsiCorrelator] 122236088814us: [vob.scsi.scsipath.por] power-on reset occurred at vmhba2:C0:T1:L0
2023-03-24T22:05:01.222Z: [scsiCorrelator] 122247089178us: [vob.scsi.scsipath.por] power-on reset occurred on vmhba2:C0:T1:L0
2023-03-24T22:05:07.222Z: [scsiCorrelator] 122253089360us: [vob.scsi.scsipath.por] power-on reset occurred at naa.5000c500c3ff78ce
2023-03-24T22:05:17.222Z: [scsiCorrelator] 122263089687us: [vob.scsi.scsipath.por] power-on reset occurred at naa.5000c500c3ff78ce
2023-03-24T22:05:57.224Z: [scsiCorrelator] 122303091027us: [vob.scsi.scsipath.por] power-on reset occurred on vmhba2:C0:T1:L0
2023-03-24T22:06:06.224Z: [scsiCorrelator] 122312091328us: [vob.scsi.scsipath.por] power-on reset occurred at vmhba2:C0:T1:L0
2023-03-24T22:06:11.224Z: [scsiCorrelator] 122317091492us: [vob.scsi.scsipath.por] power-on reset occurred at naa.5000c500c3ff78ce
2023-03-24T22:06:20.474Z: [scsiCorrelator] 122326341814us: [vob.scsi.scsipath.por] power-on reset occurred at naa.5000c500c3ff78ce


Thanks in advance
 
Looks like the issue is with your LSI card if you are getting hardware errors at the hypervisor level. Swap the HBA out with a known good one, pass that through to the VM, and you should be able to complete the resilver. Although you said you are using a mix of USB drives and SATA? I would ditch the USB interface and hook them up directly if possible.
 
If you see "too many errors" on some or all disks you have not a single disk problem but a general hardware problem, mostly (in this order) a RAM, PSU/cabling, HBA or backplane problem. As there are no smart errors disks are not the reason. USB is a quite unreliable interface for a server.

I would:
- check the RAM (e.g. with memtest86)
- optionally remove half of the RAM and check; if the problem remains, use the other half; optionally reduce the RAM speed in the BIOS or increase the voltage if possible
- check the cabling

If you have spare parts:
- replace the PSU
- replace the HBA

If not, simplify the setup to reduce the error options:
- boot OmniOS directly (rules out a RAM or PSU problem if ok)
- connect the disks to SATA (rules out an HBA problem)

Errors without a file reference are metadata errors.
Pool performance is also very low; check iostat for a single bad/weak disk that performs worse than the others, e.g. on busy/wait (see the sketch below).
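For example (a minimal sketch; both commands are standard on illumos):

Bash:
# extended per-device statistics every 5 seconds: look for one disk with
# much higher wait/%w/%b values than its siblings
iostat -xn 5
# per-device soft/hard/transport error counters
iostat -En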
 
Not to be misunderstood: the disks are all attached to the LSI SAS HBA. My idea was to complete the resilvering by moving the disks to USB enclosures and passing them through to the VM. I am aware that performance would then be low; the idea was to save the pool. Unfortunately, I don't have a second LSI HBA lying around. I will test the cabling and the RAM. OmniOS starts normally, and I have three other pools that all run normally, but those are local SATA.
 
USB only helps (but can add new problems) if a bad HBA is the reason. More likely is a RAM or power problem: I have not had a single bad LSI HBA in the last 10 years, but quite often bad RAM and sometimes power problems. If you boot OmniOS directly and switch the disks to SATA (if you have enough SATA ports), followed by a resilver/scrub and some heavy load, you can rule this out.
 
The RAM is already ECC RAM on a Supermicro board (A1SAi-2750F). The power supply, yes, could be; I have to debug that individually. I will report my findings. I wish you a quiet evening and a pleasant rest of the weekend.
 
Have you already checked the IPMI event log?
 
Yes I have; the event log does not contain anything relevant, except that I log in incorrectly from time to time. No ECC errors.

Good morning. I would personally rule out the RAM now: after more than 9 hours of testing, no errors. That leaves the cabling (I think), the power supply and the LSI.


[Screenshot, 2023-03-26 07:16]
 
It seems the cables were to blame. In my super-compact NAS, the cables were under a lot of tension. I have now rerouted them and let the zpool resilver. The metadata errors are gone.

Thanks for the tip with the cables !


Bash:
root@aio-pod:~# zpool status -v smallpool
  pool: smallpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 3.02T in 2 days 07:40:19 with 239 errors on Sat Apr  1 21:45:55 2023
config:

        NAME         STATE     READ WRITE CKSUM
        smallpool    ONLINE       0     0     1
          raidz1-0   ONLINE       0     0     4
            c1t6d0   ONLINE       0     0     0
            c1t9d0   ONLINE       0     0     0
            c1t8d0   ONLINE       0     0     0
            c1t10d0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        smallpool/backup_appliance@30min-1578647679_2023.03.18.11.30.16:<0x0>
        smallpool/backup@stuendlich-1578553078_2023.03.23.15.00.17:/Archiv Acronis/Archiv(1)1.TIB
        smallpool/backup@stuendlich-1578553078_2023.03.23.15.00.17:/Veeam/Backup Job HP/Backup Job HP2019-10-09T071722.vib
        smallpool/backup@stuendlich-1578553078_2023.03.23.15.00.17:/Veeam/Backup Job WS001/Backup Job WS0012019-01-07T040115.vbk
root@aio-pod:~#
 
Unfortunately, the shares do not work. :-/
When I cleanly export the pool and import it again, I get errors with the shares. At the file level, all data is there. Also, the directory /smallpool is empty while the pool is exported.

How can I fix this error?



Bash:
root@aio-pod:~# zpool import smallpool
cannot mount '/smallpool/backup_appliance': directory is not empty
SMB: Unable to enable service
cannot share 'smallpool/administration': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/backup': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/cloud': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/documents': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/movies': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/music': smb add share failed
SMB: Unable to enable service
cannot share 'smallpool/scans': smb add share failed
 
I would export the pool.
If there is a regular directory "/smallpool", delete it (if it is not empty and contains needed files, save them elsewhere first, e.g. via WinSCP or Midnight Commander).
Then import and check the shares (see the sketch below).
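A minimal sketch of that sequence, using the pool name from this thread (double-check the directory contents before deleting anything):

Bash:
zpool export smallpool
# with the pool exported, anything left here is a regular directory
ls -la /smallpool
# remove the leftover directory (save needed files elsewhere first!)
rm -rf /smallpool
zpool import smallpool
# verify that the SMB shares are active again
sharemgr show -vp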

Btw:
Are smallpool and its descendant filesystems shared in a nested way (not suggested),
or are shares enabled only on sub-filesystems like movies, music etc. (suggested)?
 
How do you duplicate a pool with ongoing replications?
The new pool should be exactly identical to the old pool but, for example, larger or with a different vdev structure.


You must take care of the following:

1. Transfer ZFS filesystems with a "recursive" job setting
This includes all datasets in the transfer (sub filesystems, snaps, zvols)

2. A ZFS replication creates the new filesystem(s) below the destination ZFS filesystem
pool1 -> pool2 results in pool2/pool1
pool1/fs1 -> pool2 results in pool2/pool1/fs1

If you want an identical structure, you must create (recursive) jobs for each first-level filesystem, e.g.
pool1/fs1 -> pool2 gives you pool2/fs1

If the new pool should be named like the old pool (e.g. pool1):
- destroy the old pool pool1 after the transfers are done, then export pool2 and import it as pool1

3. A replication does not transfer all ZFS properties, e.g. compression or sync.
Filesystem attributes like ACLs are preserved.

If you want the same ZFS properties, you must apply them after the replication.
A better way is to set them on the parent target filesystem, e.g. pool2, prior to the replications.
They are then inherited by the new filesystems.

4. Some ZFS properties can only be set at creation time,
e.g. upper/lowercase behaviour or character sets.

If you use napp-it to create pools, settings are identical.

5. Ongoing replication/backup jobs
You can continue old replication jobs if
- the pool structure remains identical
- you have snap pairs from former replications on both sides, e.g. jobid_repli_source/target_nr_1037

If you want to recreate a replication job that continues incremental transfers:
- recreate the job with the same source/destination settings and the old job id
(the job id is part of the old snap names)

Or rerun an initial transfer:
rename the old destination filesystems, e.g. to filesystem.bak, to preserve them in case of problems. Then rerun a replication (full transfer). After success, destroy the .bak filesystem. Subsequent replications are then incremental again. A command-level sketch of the duplication follows below.
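Outside napp-it, the same duplication can be sketched with plain ZFS commands (a minimal sketch; the snapshot name is hypothetical and pool1/pool2 are the names used above):

Bash:
# set wanted properties on the target first so new filesystems inherit them
zfs set compression=lz4 pool2
# snapshot the old pool recursively and send the complete tree with all snaps
zfs snapshot -r pool1@migrate
zfs send -R pool1@migrate | zfs receive -dF pool2
# after verifying the copy: rename the new pool to the old name
zpool destroy pool1
zpool export pool2
zpool import pool2 pool1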
 
New feature in napp-it 23.dev (Apr 05):
ZFS autosnaps and ZFS replications of ESXi/NFS filesystems with embedded ESXi hot memory snaps.

If you want to back up running VMs on ESXi, you mostly use commercial tools like Veeam that support quiescing (pausing a filesystem during backup) or can include the ESXi hot memory state.

If you use ZFS to store VMs, you can use ZFS snaps for versioning or to save and restore them, either via a simple SMB/NFS copy, Windows previous versions, or ZFS replication. This works well, but only for VMs that are powered off during backup, as a ZFS snap is like a sudden power-off: there is no guarantee that a running VM is not corrupted in a ZFS snap. While ESXi can provide safe snaps with quiescing or hot memory state, you cannot use them alone for a restore as they rely on the VM itself; a corrupt VM cannot be restored from ESXi snaps, while you can restore a VM from ZFS snaps. As ESXi snaps are delta files, they grow over time, so you should under no circumstances keep more than a few ESXi snaps for longer than a few days.

So why not combine both: unlimited ZFS snaps with the recovery options of ESXi snaps. This can be achieved by creating an ESXi snap prior to the ZFS snap, so the ZFS snap includes the ESXi snap. After the ZFS snap is done, the ESXi snap can be destroyed (sketched below).
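The underlying sequence can be sketched manually (a minimal sketch; the vmid 12 and the dataset name are hypothetical; vim-cmd runs on the ESXi host while zfs runs on the filer):

Bash:
# ESXi host: hot-memory snapshot of the VM
# arguments: vmid snapName snapDescription includeMemory quiesced
vim-cmd vmsvc/snapshot.create 12 zfsbackup "pre ZFS snap" 1 0
# ZFS filer: snap the NFS datastore filesystem (now containing the ESXi snap)
zfs snapshot tank/nfs@vm-backup
# ESXi host: remove the ESXi snapshot again
vim-cmd vmsvc/snapshot.removeall 12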

Napp-it 23.dev automates this


Howto setup:

- update napp-it to the current 23.dev
- add the needed Perl modules to OmniOS,
see https://forums.servethehome.com/ind...laris-news-tips-and-tricks.38240/#post-367124
- enter the ESXi settings (IP, root, pw and NFS datastores) in napp-it menu System > ESXi > NFS datastore

- list autosnap or replication jobs in napp-it menu Jobs;
click on the job id to enter its settings and add the IP of the ESXi server
- run the autosnap or replication job
Each ZFS snap will then include an ESXi snap. As each VM is stopped for a few seconds, run this at low-usage times.
- click on replicate or snap in the line of the job to check the log entries

Restore a VM in a running state:
- shut down all VMs
- restore a single VM folder from a ZFS snap, either via SMB/NFS copy, Windows previous versions,
filesystem rollback or replication

ESXi will see the ESXi snaps after a vim-cmd vmsvc/reload vmid (Putty) or a reboot.
- power on the VM and restore the last ESXi snap. The VM is then at the state of backup time, in a powered-on state.


More:
https://www.napp-it.org/doc/downloads/napp-in-one.pdf
https://forums.servethehome.com/ind...news-tips-and-tricks.38240/page-2#post-372432
 
Last edited:
A data risk analysis
The main risks for data loss, listed by relevance:
https://www.napp-it.org/doc/downloads/risks_of_dataloss.pdf


1. OMG, something happened to my data
Human error, ransomware or sabotage by a former employee


- Last Friday I deleted a file. I need it now.
- 6 weeks ago I was infected by ransomware that had already encrypted some data.
- 6 months ago data was modified by a former employee when he left after a dispute.

- Occurrence: very often
- Methods against it:
read-only data versioning, with at least a daily version for the current week,
a weekly version for the last month and a monthly version for the last year

How to protect against it: The best solution is ZFS snaps; you can hold thousands of read-only snaps without problems. They are created without delay, and space consumption is only the amount of datablocks modified since the former snap. Only Unix root, not a Windows admin user, can destroy ZFS snaps, and not remotely but only locally on the server.

- Restore: simple. Connect a ZFS filesystem via SMB and use Windows "previous versions" to restore single files or folders.
On Solaris SMB, "ZFS snaps = previous versions" is zero config; with SAMBA you must take care of the snap folder settings.

Or use ZFS rollback to set a whole filesystem back in time (see the sketch below).
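A minimal sketch of a rollback (filesystem and snap names are hypothetical):

Bash:
# list the snapshots of the affected filesystem, oldest first
zfs list -t snapshot -o name,creation -s creation -r tank/documents
# set the whole filesystem back to the chosen snap
# (-r also destroys any snaps newer than the target)
zfs rollback -r tank/documents@daily-2023-03-17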

Alternative: use tape backups (hundreds of them). A restore is very time consuming and
can be quite complicated with differential backups.

2. OMG, a disaster has happened
Not the daily problems, a real disaster


- A fire destroyed your server(s)
- A thief has stolen your server(s)

Occurrence: maybe never, but be prepared or everything can be lost.
Methods against it: create ongoing external daily disaster backups.
Use disks or tapes, or ZFS replicate to an external site or to a removable pool that you unplug/move after the backup.

- Restore: simple but very time consuming. A restore of a large pool from backup can take days,
and without a current disaster backup the data state is not recent.
As ZFS is a Unix filesystem, Windows AD SMB ACL permissions are not restored automatically, e.g. on SAMBA,
but require a correct mapping of Unix uid -> Windows SID. The Solaris kernel-based SMB server is not affected
by mapping problems as it uses Windows SIDs directly as extended ZFS attributes.

If a disaster restore is not easy and straightforward with a simple copy method, test it before an emergency case occurs.

3. OMG, I missed a hardware failure for too long
Servers are not "set and forget"


- a disk failed, then the next, then the pool
- a fan or the air conditioning failed, and due to overtemperature disks were damaged and data was corrupted

Occurrence: maybe once every few years
Methods against it: monitor the hardware and use mail alerts, or you will end up with a disaster case (see 2.)

4. OMG, I cannot trust my data or backups
Suddenly you discover a corrupted file, an image with black areas, text errors, or problems opening applications or files.


Can you then trust any of your data?

Occurrence: before ZFS sometimes; never with ZFS, as ZFS protects data and metadata with checksums and repairs on the fly
during read from Raid redundancy, or on a regular basis, e.g. once every few months, with a pool scrub.

5. OK, I took care of the OMG problems with ZFS, many snaps and an external daily or weekly disaster backup.
Anything left to consider?


There are indeed some remaining smaller problems, even with ZFS.

- Server crashes during write:
In this case the affected file is lost; ZFS cannot protect you. With small files there is a minimal chance that the data
is already completely in the RAM-based write cache. With sync enabled, the file is written on the next reboot.

Only the writing application can protect whole files against such incidents, e.g. Word with temp files.
ZFS can protect the filesystem and committed writes.

- Incomplete atomic writes. Atomic writes are minimal dependent write operations that must be completed in full.
https://www.raid-recovery-guide.com/raid5-write-hole.aspx
An example is when a system crashes after data is written to storage but prior to the needed update of the metadata, or when a database writes dependent transactions, e.g. moving money from one account to another, with the result of data loss or a corrupted filesystem.

In a Raid, e.g. a mirror, all data is written sequentially disk by disk. A sudden crash results in a corrupted Raid/mirror.

ZFS itself is not affected due to Copy on Write, where atomic writes are either done completely or discarded entirely, so a corrupted filesystem or Raid cannot occur by filesystem design. If you additionally want every committed write to be on safe storage, you must enable sync.

If you have VMs with non-Copy-on-Write filesystems, ZFS can guarantee this for itself but not for guest operating systems; a crash during write can corrupt a VM. Activating sync write on ZFS can guarantee atomic writes and the filesystems of VMs (see the sketch below).
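A minimal sketch (the dataset name is hypothetical):

Bash:
# force synchronous semantics for all writes on the VM datastore
zfs set sync=always tank/vmstore
# verify the setting
zfs get sync tank/vmstore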

- RAM errors
Google "RAM error occurrence" for the risk.

All data is processed in RAM prior to write. A RAM problem can corrupt this data, which is then written to disk.
Even ZFS checksums cannot help, as it can happen that you have bad data with proper checksums.

The risk of RAM errors is a small statistical risk that increases with the amount of RAM. A modern 32 GB server has 64x the risk of a 512 MB server with the same quality of RAM. Only single errors are a problem; bad RAM with many errors mostly results in an OS crash/kernel panic, or in "too many errors" on reads with a disk or pool offline on ZFS. The "myth" of a ZFS scrub-to-death, where ZFS wrongly repairs good data, is just that, a myth.

Anyway, if you care about the data on a filer, always use ECC.
Even without ECC, ZFS offers more security than older filesystems without ECC.

- Silent data errors/bit rot
Google "bit rot".

This affects mostly long-term storage, with a statistical amount of data corruption occurring by chance over time. Some, but not all, can be repaired by the disk itself. ZFS can detect and repair all bit-rot problems during read or during a scrub of all data.
On long-term storage, run regular scrubs, e.g. once every few months, to validate a pool before problems become serious.

- Insufficient redundancy
ZFS holds metadata twice and can hold data twice with a copies=2 setting, even on single disks.
A ZFS raid offers a redundancy that counts in how many disk failures are allowed until the pool is lost.

As this is also a statistical problem, a rule of thumb:
best is when you allow any two disks to fail. This is the case with a raid-Z2 or a 3-way mirror; with more than, say, 10 disks per vdev, consider Z3. Slog or L2Arc do not need redundancy due to a fallback to the pool without data loss; only in case of an Slog failure combined with a system crash do you see a loss of the data in the RAM cache, beside a performance degradation.
If you use special vdevs, always use mirrors, as a lost vdev means a lost pool (see the layout sketch below).
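A minimal sketch of a pool layout following these rules (device names are hypothetical):

Bash:
# raid-Z2 data vdev (any two disks may fail) plus a mirrored special vdev
zpool create tank \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
  special mirror c2t0d0 c2t1d0
# optionally store the data blocks of an important filesystem twice
zfs set copies=2 tank/important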

- SSDs without powerloss protection
A crash during write on such SSDs can result in data loss or a corrupted filesystem. If the SSD is part of a ZFS raid, problems can be repaired based on checksum errors. If the SSD is not in a Raid, e.g. an Slog, SSD powerloss protection is mandatory.
 


I had to delete the ZFS filesystem "backup_appliance"; only then could all filesystems be remounted ;-)
-> Shares are only on sub-filesystems

Thanks for the help with my problem. The tip about the cables was also worth its weight in gold. I hadn't thought about that at all.
 

It seems then that there was a regular folder with files that prevented a ZFS filesystem from being mounted at that mountpoint. The solution is then always (with the pool not mounted) to check for regular folders and delete them prior to a pool import. If the regular folder with files is not at pool level but on a nested ZFS filesystem, remove the regular folder at the filesystem level, e.g. when /pool/backup_appliance is a ZFS filesystem with a regular folder of the same name at the same location.
 
I’m in a bit of a jam.

Upgraded my ESXi and somehow in the process I lost my datastore, which was an NVMe Slog.

I’m booting into OmniOS and am getting errors that an IO device has been retired. I can’t even run a zpool status as the pool is faulted and is unavailable.

Any thoughts on how to fix things?
 
There are two solutions.
If an Slog is missing while the pool is otherwise ok, import with -m (napp-it Pool > Import); see the sketch below.
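On the command line the equivalent is (a minimal sketch; the pool name is hypothetical):

Bash:
# import the pool although its log device is missing
zpool import -m tank
# then remove the vanished slog (device or guid as shown by zpool status)
zpool remove tank <log-device-or-guid>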

"Retired" is a message from the Solaris fault manager.
It means that an error was detected that can affect pool health. To avoid further damage this device is temporarily removed. Check he affected device.

If you have repaired a faulted device or problem, e.g.
Fault class : fault.io.pciex.device-interr
Affects : dev:////pci,0/pci15ad,7a0/pci8086,3901
faulted and taken out of service


you can recover the retired device with the command
fmadm repaired dev:////pci,0/pci15ad,7a0/pci8086,3901

More:
see napp-it menu System > Faults > Repair, or
https://docs.oracle.com/cd/E36784_01/html/E36834/gdryb.html#SAGDFSgdryc
 
Release notes: omnios-build/ReleaseNotes.md at rn · citrus-it/omnios-build · GitHub

OmniOS "On Monday, Mai 1st OmniOS plans to release OmniOS r151046. There are a bunch of pretty cool new features in the upcoming release, but as it is with cool new features, they also tend to cause regressions. Have a look at the preliminary release notes to get an idea of what is in store (and please let us know if you see any errors or omissions there).

At the moment we are testing the release candidate, and in order to have the best quality release possible, we could really use your help!"


To upgrade to the release candidate use the following package repositories:

OmniOS r151046 (release candidate) (omnios)
OmniOS r151046 extra (extra.omnios)

If you upgrade to the release candidate now, you can later upgrade to the final release.

If you want to try the installation media, you can find them here

Index of /media/r151046/.rc/


The unique selling point of Solaris is the kernel-based SMB server, with the following enhancements:
  • SMB now supports 256-bit ciphers.
  • SMB now has a new configuration option to enable support for short names. Only very old applications on old clients need short names, however it is necessary to support running the Windows Protocol Test Suites.
  • ls can show a Domain user or the Windows SID

[Screenshot: ls output]


If you look at the output of ls, it looks strange for a Unix filer.
The file 1.txt was created locally by root; owner/group is Unix root, and the output of ls is as expected.

The file 2.txt was created by a Windows client on OmniOS 151046 (both AD members).
ls -l shows [email protected] (an AD user), ls -n returns the Windows SID of [email protected],
and ls -nn shows the (ephemeral) Unix uid/gid.

This is exactly the same info that you get on Windows under Properties > Security > Owner.

Why can you see the AD user as owner, and not a Unix uid/gid as you would expect on a Unix filesystem like ZFS?
The reason is that the Solaris SMB server uses the Windows SID directly as an extended ZFS attribute. This gives a worldwide unique user identification, as the domain is part of the SID; a Unix uid like 102 is not unique. Even after a backup/restore, AD ACLs remain intact without any of the mappings that you need when you use SAMBA instead of the Solaris kernel-based SMB server.

 
I have been happily running AiO napp-it for years.
I had to update a Windows server VM and thought I would also do other updates.
I updated OmniOS to r151044 and napp-it to 21.06a10.
I tried to update ESXi from 6.7u1 to 8 and got a warning about unsupported hardware, which I suppressed. The update went OK, but I got "no healthy upstream" when trying to load the UI.
I then made a clean installation of 6.7u1.
Everything is up and running now, except for one major concern:
I cannot get napp-it to work using the vmxnet3 adapter.
Everything works fine using e1000, but as soon as I switch to vmxnet3 (I can boot fine) there seems to be no connectivity to ESXi. Hence I cannot ping napp-it from outside ESXi or even browse the NFS datastore from napp-it.
If anyone could help me with advice I would appreciate it.
 
The vmxnet3 vnic driver is part of VMware tools.
They are included in the OmniOS repository and can be installed via
pkg install open-vmware-tools
 
Thx Gea 🙏
However, I get this message when doing "pkg install open-vmware-tools":

pkg install: The following pattern(s) did not match any allowable packages. Try
using a different matching pattern, or refreshing publisher information:
open-vmware-tools

I also thought I already had open-vmware-tools installed and running (that is also what the ESXi host says).

When I do "pkg list" I see a line like this:
driver/network/vmxnet3s 0.5.11-151044.0 i--
 
Thx Gea 🙏
I got it working now.
My older IP settings were still present and I had confused myself about the IP addresses.
 
The new minimalistic Solaris fork OmniOS long-term stable r151046 (Unix) is out:
https://omnios.org/releasenotes.html

Open-ZFS in its genuine Solaris environment:
perfect ZFS/OS integration with boot environments for trouble-free up/downgrades, and the lowest resource needs for ZFS.

Open source, but with a commercial support contract option.
Regular, often biweekly, security and bugfix updates are free.
https://omnios.org/schedule.html

A dedicated software repository per stable release:
no sudden new features or unexpected behaviours, only security and bug fixes.
An update is possible from the last r151038 LTS onwards: switch the repository, pkg update and reboot (see the sketch below).
https://www.napp-it.org/doc/downloads/setup_napp-it_os.pdf
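A minimal sketch of such an update (the repository URL follows the OmniOS scheme for r151046; verify it against the release notes first):

Bash:
# switch the 'omnios' publisher to the new release repository
pkg set-publisher -O https://pkg.omnios.org/r151046/core omnios
pkg refresh --full
# update into a new boot environment, then reboot into it
pkg update -rv
reboot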

File services like iSCSI/FC, kernel-based NFS and the multithreaded SMB server are part of the Solaris OS,
with unique integration of Windows ntfs-alike ACLs, direct support of Windows SIDs for AD users to preserve permissions in backups, local Windows-alike SMB groups, and zero-config ZFS snaps as Windows previous versions. Easy SMB config (no SAMBA config file), just turn it on/off.
 
When trying to set the publisher I get this message:

The origin URIs for 'omnios' do not appear to point to a valid pkg repository.
Please verify the repository's location and the client's network configuration.
Additional details:

Unable to contact valid package repository: https://pkg.omnios.org/r151046/core/
Encountered the following error(s):
Transport errors encountered when trying to contact repository.
Reported the following errors:
Framework error: code: E_COULDNT_CONNECT (7) reason: Failed to connect to pkg.omnios.org port 443 after 0 ms: Couldn't connect to server
URL: 'https://pkg.omnios.org/r151046/core' (happened 4 times)
 
The illumos security team have today published a security advisory concerning CVE-2023-31284, a kernel stack overflow that can be performed by an unprivileged user, either in the global zone or in any non-global zone. A copy of their advisory is below.


ACTION: If you are using any of the supported OmniOS versions (see below, or the recently retired r42), run pkg update to upgrade to a version that includes the fix. Note that a reboot is required. If you have already upgraded to r46, then you are all set as it already includes the fix.


The following OmniOS versions include the fix:
https://omnios.org/releasenotes.html
  • r151046
  • r151044y
  • r151042az
  • r151038cz

If you are running an earlier version, upgrade to a supported version (in stages if necessary) following https://omnios.org/upgrade.



##########################
--- illumos Security Team advisory ---


We are reaching out today to inform you about CVE-2023-31284. We have pushed a commit to address this, which you can find at
https://github.com/illumos/illumos-gate/commit/676abcb77c26296424298b37b96d2bce39ab25e5. While we don't currently know of anyone exploiting this in the wild, this is a kernel stack overflow that can be performed by an unprivileged user, either in the global zone, or any non-global zone.

The following details provide information about this particular issue:

IMPACT: An unprivileged user in any zone can cause a kernel stack buffer overflow. While stack canaries can capture this and lead to a denial of service, it is possible for a skilled attacker to leverage this for local privilege escalation or execution of arbitrary code (e.g. if combined with another bug such as an information leak).


ACTION: Please be on the look out for patches from your distribution and be ready to update.


MITIGATIONS: Running a kernel built with -fstack-protector (the illumos default) can help mitigate this and turn these issues into a denial of service, but that is not a guarantee. We believe that unprivileged processes which have called chroot(2) with a new root that does not contain the sdev (/dev) filesystem most likely cannot trigger the bug, but an exhaustive analysis is still required.

Please reach out to us if you have any questions, whether on the mailing list, IRC, or otherwise, and we'll try to help as we can.

We'd like to thank Alex Wilson and the students at the University of Queensland for reporting this issue to us, and to Dan McDonald for his work in fixing it.

The illumos Security Team
 
Critical Windows SMB security warning

In response to CVE-2022-38023, Microsoft is removing support for RPC Signing in the Netlogon server, instead requiring Sealing when establishing a 'secure channel'. More details can be found here: https://support.microsoft.com/en-us...22-38023-46ea3067-3989-4d40-963c-680fd9e8ee25 and here: https://msrc.microsoft.com/update-guide/vulnerability/CVE-2022-38023

Timeline
June 13: signing remains possible, but sealing cannot be disabled on Windows Server
July 11: sealing is enforced; no authentication without sealing

Action
Update at least every AD member device like Windows, or AD members like OmniOS or SAMBA, prior to July 11!
For the Illumos/OmniOS OS/ZFS kernel-based SMB server as an AD member, the sealing feature is under final approval:

https://www.illumos.org/issues/15670.
https://forums.servethehome.com/index.php?threads/omnios-netlogon-rpc-sealing-support.40075/

The newest SAMBA supports sealing.
 
Security update OmniOS r151046e (2023-05-31)
https://omnios.org/releasenotes.html
Weekly release for w/c 29th of May 2023.
This is a non-reboot update

Security Fixes
Curl has been updated to version 8.1.2, fixing CVE-2023-28319, CVE-2023-28320, CVE-2023-28321, CVE-2023-28322.
OpenSSL has been updated to versions 1.1.1u and 3.0.9, fixing CVE-2023-2650. OpenSSL 1.0.2 has also been patched against this.
 
What is the proper way to clear messages that IO devices have been retired?

Recently I had to restore some VMs because of a hardware failure, and everything is fine, but I am seeing that message when Omni starts.
 
If the Solaris fault manager (fmadm) detects a faulted device, it sets it to retired to avoid further damage.
You can check faulted devices with menu System > Faults > Faulty (fmadm faulty).

If you have repaired a faulted device, e.g. from a
Fault class : fault.io.pciex.device-interr
Affects : dev:////pci,0/pci15ad,7a0/pci8086,3901
faulted and taken out of service


you can recover the device with the command
fmadm repaired dev:////pci,0/pci15ad,7a0/pci8086,3901

more
https://docs.oracle.com/cd/E36784_01/html/E36834/gdryb.html#SAGDFSgdryc
in napp-it: see menu System > Faults > Repair
 
What is the command that napp-it uses to get that repaired list? fmadm faulty pulls up nothing, and when I try to repair what napp-it found, it tells me it has already been fixed.
 
_Gea, for some reason napp-it has stopped running AUTO Service Jobs since Dec 2022!

I can't remember if I updated napp-it or OmniOS back in December, but I think it was one or the other. Before that, AUTO Service Jobs were running just fine!

napp-it v: 21.06a8
OmniOS: r151044
 