OpenSolaris derived ZFS NAS/ SAN (OmniOS, OpenIndiana, Solaris and napp-it)

Question for all you other all-in-one devotees: what is the best way to back up the OpenIndiana VM that hosts your ZFS SAN when you have used PCI hardware passthrough of the HBA? Using VMDirectPath I/O has the trade-off of better performance (it eliminates the virtual I/O overhead) at the cost of losing the following VMware capabilities:
Code:
1. VMotion
2. High availability
3. Suspend and resume
4. Record and replay
5. Fault tolerance
6. Memory overcommitment and page sharing
7. Hot add/remove of virtual devices
8. Snapshot backup

For those of us using the free version of ESXi 5, we lose the first five of that list anyway. It is the loss of VMware snapshot backup that causes our particular problem. We were trying out Veeam Backup and Replication (free version) to back up all our VMs before upgrading the server to ESXi 5.1 and discovered that backing up the OpenIndiana VM failed because of the missing VMware snapshot capability.

So I'm curious how the experts out there back up their OI VM (assuming they don't have a second ZFS installation where they could do ZFS send).

Thanks in advance,
--peter
 
Question for all you other all-in-one devotees: what is the best way to back up the OpenIndiana VM that hosts your ZFS SAN when you have used PCI hardware passthrough of the HBA? ...


You should not look at an all-in-one and its storage VM like a regular VM. If you use it like you would use a separate SAN (iSCSI or NFS), you are fine.

There is also no vMotion/HA scenario, because you need real hardware/disk access for performance.

You should also avoid memory overcommitment with ZFS and with most modern OS's. If you assign RAM to them, they will use it - especially with ZFS and its ARC read-caching.

If you need snapshots of the ZFS OS itself, you have them at the ZFS level as boot environments - much better than those at the ESXi level.
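
For anyone new to boot environments, a minimal sketch of the usual CLI handling on OI/Solaris 11 (the BE name is just an example):
Code:
beadm list                  # show existing boot environments
beadm create pre-upgrade    # keep a bootable copy of the current BE before changes
beadm activate pre-upgrade  # if something breaks later, activate the saved BE and reboot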

If you need disaster recovery for the SAN VM, you can do that at the ESXi local datastore level.
(I use hardware RAID-1 enclosures for ESXi and local datastores for that - and remove one mirror disk after a rebuild)

But the storage VM is not critical.
You can reinstall it within half an hour and import the data pool - similar to ESXi.
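
A rough sketch of that recovery path (the pool name is just an example):
Code:
zpool import          # list pools visible on the passed-through HBA
zpool import -f tank  # import the data pool; -f if it was not exported cleanly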
 
Hi guys ...

I'm having a problem managing automatic snapshots through the napp-it UI.

Jobs > snap > create autosnap job

I'm configuring everything as I'd like it to be and when I hit the submit button the page reloads, but it's not creating/displaying the job in the task list at the top of the page.

At first I thought this meant that it wasn't creating the task for some reason, however looking at my list of snapshots now, I can see that the jobs ARE being run, but they're still NOT being displayed at all on the autosnap page. This means that I've got no way of modifying/deleting these jobs through the napp-it UI.

When I thought that the jobs weren't being created properly, I tried configuring them multiple times, and now I've ended up with identical jobs that are creating a whole bunch of duplicate snapshots.

1. Does anyone know where napp-it actually saves the autosnap configuration when a job is created? In the short term, I want to just manually edit the jobs to remove the duplicate entries.

2. (This is aimed mainly at _Gea) Any ideas why this is happening in the first place, or perhaps more importantly, how I might fix it?

If it's useful at all, I'm running Solaris 11 (fully updated) and was originally using napp-it 0.8k. When I noticed the problem I tried updating napp-it to the latest version 0.8l3, but I'm still having the same problem.


Each job creates two files named with the job-id in /var/web-gui/data/napp-it/_log/jobs/:
one (*.job) contains the timetable in the filename, the other (*.pl) contains the actions.
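
A minimal sketch for cleaning up the duplicates by hand (the job-id shown is made up - pick the ids you want to drop from the listing first):
Code:
ls -l /var/web-gui/data/napp-it/_log/jobs/
# each job = one *.job file (timetable in the filename) + one *.pl file (actions)
rm /var/web-gui/data/napp-it/_log/jobs/1349876543*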
 
You should not look at an all-in-one and its storage VM like a regular VM. ... You can reinstall it within half an hour and import the data pool - similar to ESXi.

Gea,
We do have the datastore for the OI VM mirrored. We do not over-commit memory. And we do not need VMotion (at least for the virtual SAN). So how would one back up a boot environment? That's all we probably need. I think what you are saying is that you can copy the BE to offline storage and later import the BE somehow. Is this different from a zfs import?
 
Gea,
We do have the datastore for the OI VM mirrored. We do not over-commit memory. And we do not need VMotion (at least for the virtual SAN). So how would one back up a boot environment? That's all we probably need. I think what you are saying is that you can copy the BE to offline storage and later import the BE somehow. Is this different from a zfs import?

This may be possible
http://docs.oracle.com/cd/E19963-01/html/821-1448/ghzvz.html#ghzur

For myself, I hot-unplug a mirrored boot disk so that I have working boot disks (ESXi+OI) in case of problems. Disks are so cheap, and any effort spent restoring a BE is too complicated when you need a quick and easy way to restore ESXi and OI or to prepare an update of them.
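
If you do want a file-level backup of a boot environment, a hedged sketch along the lines of that Oracle doc (the BE/dataset name and target path are examples - test a restore before relying on it):
Code:
zfs snapshot -r rpool/ROOT/openindiana@bebackup
zfs send -R rpool/ROOT/openindiana@bebackup | gzip > /backup/oi-be.zfs.gz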
 
Question for all you other all-in-one devotees: what is the best way to back up the OpenIndiana VM that hosts your ZFS SAN when you have used PCI hardware passthrough of the HBA? ... So I'm curious how the experts out there back up their OI VM (assuming they don't have a second ZFS installation where they could do ZFS send).

If your OI VM does not change often, you could power it down and try to export it as an OVF?
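
If you go that route, VMware's ovftool can do the export from the command line - a sketch, assuming ovftool is installed on a workstation and with the host/VM names as placeholders (check the syntax for your ovftool version):
Code:
ovftool vi://root@esxi-host/OpenIndiana-SAN /backup/OpenIndiana-SAN.ovf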

Still curious about:
Guys, was just wondering
Do you use the OI Update Manager on a regular basis to update your OI NAS systems, or do you leave it as is and just update napp-it?

@_Gea congratz on your new documentation looks nice..!
 
I update it every now and then when I have the time... Usually together with Nappit...

Matej
 
For any NFS users out there: is it actually possible to tune an NFS server inside a Solaris VM and its associated client(s) such that performance, seen from the client's perspective, approaches or beats the performance of a local disk? Say, for example, in a specific task like compiling a Linux kernel?

I'm wondering if it is even possible - due to latency and the RTTs needed to get metadata etc.. I have an Ubuntu VM as the client on my desktop, and just pinging my solaris VM yields:

26 packets transmitted, 26 received, 0% packet loss, time 25055ms
rtt min/avg/max/mdev = 0.462/0.906/1.910/0.321 ms

Which seems awfully high. I'm currently using an emulated NIC for Solaris, maybe passing through a physical one would reduce that a bit.

I am running a large build right now and it is dog slow compared to building on another server with a local disk. Are there any standardized benchmarks being used around here for NFS that I could use to compare against and see how far behind this machine is?

My NFS fstab line is:
solaris:/main/home /mnt/nfs nfs4 hard,timeo=10,rsize=65536,wsize=65536 0 0

Any help is much appreciated.
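
Not a real benchmark, but a couple of quick checks from the Ubuntu client that may help separate sequential bandwidth from metadata latency (the paths and the tarball are just examples):
Code:
nfsstat -m                       # show the mount options actually negotiated
dd if=/dev/zero of=/mnt/nfs/ddtest bs=1M count=2048 conv=fdatasync   # sequential write test
time tar xf /tmp/linux-source.tar -C /mnt/nfs/    # small-file/metadata pattern, closer to a kernel build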
 
For myself, I hot-unplug a mirrored boot disk so that I have working boot disks (ESXi+OI) in case of problems. ...

I'm curious how you were able to install ESXi onto a mirrored array?
 
_Gea,
for all napp-it + Solaris 11 users: is it a good idea to hold off for a while on our update enthusiasm for Solaris 11.1?
 
@ Gea

I'm now running OI + Napp-it for more than a year and am now thinking of virtualizing my server!
I have all the needed hardware : SM X9SCM-F, 16GB Ram, E-1230, Norco 4224 case and 3 IBM M1015 flashed HBA cards.
I've been reading your AIO tutorial and was wondering if I can put all the OSes (ESXi, Win7, OI and eventually one more) on a mirrored SSD datastore? Or would it be better to use the onboard SATA connectors on the mobo?
Also, I need to export my 2 pools in ZFS and detach all my data drives, right? Cuz ESXi deletes them?

ty
 
_Gea,
for all napp-it + Solaris 11 users: is it a good idea to hold off for a while on our update enthusiasm for Solaris 11.1?

One may try it and go back to a former system snap if it's not ready for use.
(You should share your experience)
 
I've been reading your AIO tutorial and was wondering if I can put all the OSes (ESXi, Win7, OI and eventually one more) on a mirrored SSD datastore? ... Also, I need to export my 2 pools in ZFS and detach all my data drives, right? ...

The idea behind all-in-one:

- install ESXi on USB or on a SATA disk
- use the SATA disk as a local datastore and install OI on it
- pass through your SAS controllers to OI
- autostart OI with ESXi and share your pools via NFS and SMB;
OI acts as a normal NAS or SAN filer, similar to a barebone server

- mount the NFS share from ESXi
- put your other VMs on this shared NFS storage to have the same NAS/SAN features as with a separate barebone server,
but all in one box and with very fast connectivity up to several GBit (you do not need fast SAN switches - all traffic is ESXi-internal in software)

In this scenario, you just need to import your current pools in OI, just like you would with any other pool move to a new server.
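
A rough sketch of those last two steps (pool, share and host names are examples; the esxcli syntax is from ESXi 5.x):
Code:
# on OI: import the existing pool and NFS-share a filesystem for VMs
zpool import -f tank
zfs set sharenfs=on tank/vmstore
# on the ESXi host: mount that share as a datastore
esxcli storage nfs add -H 192.168.1.10 -s /tank/vmstore -v nfs-vmstore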
 
I keep having errors with my server lately. I've known for a while that when my server restarts, a few disks pop up with too many errors, and it has always been able to clear the errors fine. Now it just says they are unavailable, but the disks are all there and being seen by the OS. Is there something I can try at the command line to get it back up? I really need a file off of it ASAP; after that I can work on getting some shorter SAS cables, as was recommended to me.
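
Without knowing the exact state, a few read-only or low-risk commands that are usually the first stop here (the pool and device names are examples):
Code:
zpool status -v tank       # see which devices are UNAVAIL and why
fmdump -eV | tail -50      # recent fault events (timeouts, transport errors)
zpool clear tank           # clear the error counters and let ZFS retry the devices
zpool online tank c3t2d0   # bring a specific device back online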
 
The idea behind all-in-one ... put your other VMs on this shared NFS storage to have the same NAS/SAN features as with a separate barebone server, but all in one box and with very fast connectivity up to several GBit (you do not need fast SAN switches - all traffic is ESXi-internal in software) ...
I was planning on using an M1015 or maybe an M5016 for the same purpose (SSD RAID-1 for ESXi, OI and the rest of the VMs), so what speed should I expect if I go with an NFS-shared VM datastore (on SSD, of course) as you suggest?
 
I think it is, but you have to "soft" remove it in the CLI I guess, not just unplug it from the computer :)

Matej
 
I found where liam137 and Captainquark were having a slowdown issue with their ZFS storage. Their issue sounds very familiar, as I have been struggling with this difficult problem for a very long time. I read through the pages after that, but it looked like the conversation stopped. Did anyone find out what was causing this? I am really frustrated at this point and am really hoping someone can help me figure this out, as I am stumped. Here is my config:

Supermicro X8DT6-F with onboard LSI SAS2008 flashed to IT mode (version P13.5)
2 x Intel Xeon E5620 @ 2.40GHz
48 GB RAM
2 onboard NICs + 4-port Dell (Intel) PCIe NIC for iSCSI traffic
2 x 300 GB SATA for mirrored rpool (onboard SATA only)
1 x 128 GB OCZ Vertex 3 Max IOPS SSD as L2ARC cache (onboard SAS2008)
2 x 80 GB Intel 320 SSD for mirrored ZIL log (only an 8 GB slice partitioned - onboard SATA only)
6 x 1 TB Seagate 7200 rpm SATA in RAIDZ2 + 1 as a spare (onboard SAS2008)
Right now I have compression and dedupe turned off (no encryption either)

I primarily use iSCSI. I am seeing huge latencies from both the ESXi hosts and a Windows host (like 30 seconds or worse). What is strange is that it seems to have to build up to it: right after a reboot I don't see it, but only after I have transferred approximately 42 GB to it (suspiciously close to my RAM size). After that I get huge latencies that freeze everything up. On the Windows side, if I try a 10 GB transfer it starts at ~130 MB/s but then drops down to as low as 10 MB/s and many times dies. When it dies I have to disconnect from the iSCSI host and reconnect to get the drive back. Sometimes, when it doesn't die, it will start going fast (~80 MB/s) again after a number of minutes.

On the VMware side the VMs just grind to a halt (even when I only have one VM running) and it typically ends in a crash. Local transfers didn't seem to have a problem, though they were a little slower than ~100 MB/s, as I was transferring from the slower OS drives. I've done network captures and disabled TSO offloading, but then I noticed that when the latencies hit over iSCSI, a local transfer dropped too??? Then I tried some limited testing of transfers from a CIFS share, but that didn't seem to be affected when run by itself.

I'm about to pull out precious hair that I need to keep as much as possible.:eek: Does anyone have any ideas?
 
Look for disk errors or timeouts AND check the SMART status of the drives. I've run into similar issues, and in both cases it was a drive that was starting to get flaky.

In the last instance this happened there were no timeouts reported by ZFS, but the drive had a number of reallocated sectors. Replacing the drive fixed all the problems I was having.
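
A couple of commands that are commonly used on Solaris/OI to spot a flaky drive from the shell (no napp-it needed):
Code:
iostat -En        # per-device soft/hard/transport error counters
iostat -xn 5      # watch %b and asvc_t; one disk much slower than its peers is suspect
zpool status -v   # read/write/checksum errors as seen by ZFS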

Riley
 
wheelz,

I was having a different setup and also a different problem, I think.

The setup was different, as I was using NFS for the VMware pool and CIFS/SMB for my Windows data. I never used iSCSI. My NFS pool was always more or less fine as soon as I switched to SAS disks. My CIFS pool had something that resembles your problem, but I don't think it's the same. I solved my CIFS problem by replacing the (poor) Samsung 1 TB disks with 750 GB 24x7 SATA drives from Seagate, attaching them to the mainboard's SATA ports and the SAS drives to the LSI controller, and then rebuilding the VM pool onto the SAS drives. Also, I rebuilt everything from scratch several times. Since then, I reboot the box maybe once per month and it works fine. If I don't reboot it, I sooner or later run into the problem that my CIFS shares are not available anymore. The VM pool on SAS is still fine then, which is a bit odd...
After having suffered through months and months of troubleshooting, I was so frustrated that I can now live happily with this little constraint.

When I copy my VMs from the NFS pool to local disks as a backup, it really performs very well. I don't know the numbers, but I am satisfied with the speed (also with the running VMs, I don't see a problem).
Copying from the NAS CIFS share to my PC for a local backup is also fine. If I copy directly to an SSD, I get almost 100 MB/s consistently. If I copy to my larger SATA2 backup disk, it goes down to ~60-65 MB/s, which I think is probably fine for SATA2 and slow magnetic disks. I have a raidz1 pool, though, with 5 drives and 1 spare.

If you can get hold of a few other drives, I would try to rebuild just to see if it works any better. I have the impression that ZFS is a bit picky with low-cost SATA drives...

One last question relating to your NICs... you have a lot of ports there. Do you do link aggregation of some sort? If so, try disabling that and just use one wire at a time... Gigabit is usually not easy to saturate from a bandwidth point of view. It might help to narrow down the cause.

Please also let us know what OS you run - all-in-one, standalone, OI, Solaris?
 
Hoping someone in this thread or maybe Gea can give me a hand.

I have an OI + napp-it all-in-one setup. One of my VM guests is a Server 2012 machine acting as a Domain Controller. I was able to get OI to join the domain and everything was working great. When I started working on handling power outages and doing graceful shutdowns, I noticed that the OI machine had issues finding the DC on reboot. Of course the DC would not be up waiting for OI, but I assumed that once the DC was up and running, OI would be able to find it. That does not seem to be the case: I have to remove the OI box from the DC and then re-join the domain from the OI box. Below are some of the errors I see on the OI box:

These happen at bootup before the DC is up - expected, but it does not seem idmap ever tries again:
SAN idmap[590]: [ID 280452 daemon.error] Error: smb_lookup_sid failed.
Oct 14 13:55:28 SAN idmap[590]: [ID 455671 daemon.error] Check SMB service (svc:/network/smb/server).
Oct 14 13:55:28 SAN idmap[590]: [ID 174421 daemon.error] Check connectivity to Active Directory.

The following are happening after the DC is up and running:

Oct 14 19:03:05 SAN smbd[701]: [ID 702911 daemon.notice] smbd_dc_monitor: domain service not responding
Oct 14 19:03:20 SAN smbd[701]: [ID 702911 daemon.notice] smbd_dc_update: XXX.XXX: locate failed
Oct 14 19:03:25 SAN vmxnet3s: [ID 654879 kern.notice] vmxnet3s:0: getcapab(0x200000) -> no
Oct 14 19:04:25 SAN last message repeated 5 times
 
Hi Gea,

Can you please tell me how/what napp-it uses to send emails?

I want to use the settings that are already there for my NUT (UPS) alerts rather than re-invent the wheel if possible.

Thank you
Paul
 
Look for disk errors or timeouts AND check the SMART status of the drives. ...

I am using Seagate's higher-end RAID-class drives (I buy the Constellation ES line for my replacements). I have ST31000340NS and ST31000524NS models right now. I do have one drive removed, as it failed (waiting on the replacement to get here). In addition, I've already replaced 2 other drives in the last year. I don't see any errors on the pool currently (outside of the failed drive, of course). Is there somewhere else I should look for errors or timeouts? I looked in /var/adm/messages to see if I could see anything, but I think I just saw references to the failed drive. Also, I can boot to the Seagate tools if I need to, but is there any way to check the SMART status in Solaris 11?
 
... Also, I can boot to the Seagate tools if I need to, but is there any way to check the SMART status in Solaris 11?

There is, under napp-it (I assume it's installed): under Disks -> smartinfo you can check the status.

One thing to note is that checking the SMART status this way will bump up the timeout counters reported for the disks. I think it's because querying the SMART status "locks up" the drive for a moment. I've safely ignored this.

Riley
 
wheelz,

...
One last question relating to your NICs... you have a lot of ports there. Do you do link aggregation of some sort? If so, try disabling that and just use one wire at a time... Gigabit is usually not easy to saturate from a bandwidth point of view. It might help to narrow down the cause.

Please also let us know what OS you run - all-in-one, standalone, OI, Solaris?

The NICs go to 2 different Cisco switches. Each switch has 2 ports in an LACP aggregation, and just recently I was playing around with IPMP (though the problem precedes this). Am I wrong in assuming it is more SAS/SATA-related than network-related, since the local transfer was also affected when the problem occurred over iSCSI?

I'm on Solaris 11 (11/11 I think).
 
I primarily use iSCSI. I am seeing huge latencies from both the ESXi hosts and a Windows host (like 30 seconds or worse). ... I'm about to pull out precious hair that I need to keep as much as possible. :eek: Does anyone have any ideas?

I wonder how quickly you can reproduce your problem. If it is easy enough to reproduce, you could isolate whether it is a disk problem or something else by using 1-2 spare SAS/SATA ports to plug in 1-2 extra drives that you think have no issues (they could even be a different model/make). Create a new single-disk or mirrored pool with just these drives and share it via iSCSI. Move a test VM onto it and try to reproduce your problem. If it still happens, then it is not your main pool disks. As a second test, maybe try removing one of your ZIL log drives from its mirror and adding it to this mini pool, to make sure it isn't that causing your problems. Not sure if testing the L2ARC the same way is worth it or not.

One great thing about this kind of testing is that you can do it in parallel, meaning your main pool can stay online and in use while you do it. It might also be interesting to see, if your main pool faults again, whether the second mini pool has any problems at that point. This may be an easier way to test if it is not easy to make it fault on demand.

Michael
 
Can you please tell me how/what napp-it uses to send emails? ...

You can look at the job-script ("/var/web-gui/data/napp-it/zfsos/15_jobs and data services/03_email/01_status/action.pl")

The mail is sent via the function auto_job.

The basics of the script:
Code:
 use Net::SMTP;

  my $mailer = Net::SMTP->new($server) || &mess("could not connect smtp-server $server");
  if (($user ne "") && ($pw ne "")) {
    $mailer->auth ($user,$pw);
  }
  $mailer->mail($from);


  $mailer->to($t. "\n") || &mess("send error to $t ");
  
  $mailer->data();
  $mailer->datasend("From: $from\n");
  $mailer->datasend("To: $to\n");
  $mailer->datasend("Subject: $sub\n\n");

  $mailer->datasend("$text\n");
  $mailer->dataend();

  $mailer->quit;
 
I noticed that the OI machine had issues finding the DC on reboot. ... I have to remove the OI box from the DC and then re-join the domain from the OI box. ... it does not seem idmap ever tries again.

It should be enough to restart the SMB service after the DC is up.
You can do this manually via menu Services > SMB, or you can write a script or create an "other" job that pings the DC and restarts the SMB service once there is a reply.
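
A minimal sketch of such a script (the DC name and the 30-second interval are just examples; Solaris ping takes a timeout argument rather than -c):
Code:
#!/bin/sh
# wait until the domain controller answers, then restart the SMB service once
DC=dc01.mydomain.local
until ping $DC 2 >/dev/null 2>&1; do
    sleep 30
done
svcadm restart svc:/network/smb/server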
 
Thanks, _Gea. I actually was thinking of making smb a manual-start service and writing a startup script that checks for the DC being up before trying to start smb.

First I am going to test that starting smb AFTER the DC is up fixes the issue, because I thought I had tried just restarting smb on the OI box to rejoin and that was not working. I am new to OpenSolaris services, so I may have messed that up.

svcadm restart smb/server

Is that correct?
 
Okay, so I was able to solve my issue with OI not seeing my DC after a reboot. It turns out it was a DNS issue: I did not realize that my resolv.conf was still being populated by DHCP :( And it apparently is very important that the FIRST DNS entry is your DC - even if the second entry is the DC, it will not work.

However, it seems that I have slowed something down in the process of tweaking my DNS settings, and I am not entirely sure how. Boot-up of the OI box takes about the normal amount of time and ESXi detects VMware Tools in about the same time as before, but ESXi does not detect the NFS share for at least another 2-3 minutes after VMware Tools is up. The other odd thing is that about 30 seconds after Tools shows online, ESXi shows Tools offline until a few minutes later, when Tools comes back and the NFS mount shows up. I also noticed the napp-it service does not start until after that 2-3 minute delay.

Any ideas what to look at? My gut tells me it is a DNS timeout; if so, I could probably solve it with a hosts entry if I knew what it was trying to look up.

Besides the dns changes I made, I did have a number of OI updates that I performed.
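
If it is a DNS timeout, something along these lines usually shows where the time goes (the interface name is an example - adjust to your vmxnet3s instance):
Code:
cat /etc/resolv.conf            # confirm the DC is the first nameserver
grep hosts /etc/nsswitch.conf   # should list files before dns
snoop -d vmxnet3s0 port 53      # watch which names are queried (and which get no answer) during startup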
 
I wonder how quickly you can reproduce your problem. If it is easy enough to reproduce, you could isolate whether it is a disk problem or something else by using 1-2 spare SAS/SATA ports to plug in 1-2 extra drives that you think have no issues ... Create a new single-disk or mirrored pool with just these drives and share it via iSCSI. ...

I can reproduce it if I "preload" over 50 GB of data. I've actually already moved all my data off, so I can use it as a full test bed, and I've set aside today to try to get this resolved. I'm going to try destroying my current pool, then creating a separate pool for each disk and seeing on which disks I can recreate the problem, just to eliminate those possibilities after checking their SMART status.
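
A sketch of the per-disk test pools (the disk ids are examples; the destroy is only reasonable here because the data has already been moved off):
Code:
zpool destroy tank        # existing pool, already emptied
zpool create test1 c5t0d0
zpool create test2 c5t1d0
# ...one pool per disk, then point a test LUN or share at each and try to reproduce the stall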
 
Well, I got a bad-drive hit on one of the drives I was sent as a warranty replacement... the Seagate diagnostic tools found it with the short test. I still have to test after I get a replacement, but this really raises the question: how can I prevent this in the future? Having a single flaky drive take down the whole array, without any indication of what is wrong, kind of defeats the purpose of having all this redundancy. Is there any way to do periodic SMART checks from within Solaris? The weekly scrubs did not pick up on it.
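
One common approach is smartmontools, assuming you can install it (e.g. from a pkg/CSW repository); the device path, the -d option and the schedule below are examples that may need adjusting for disks behind the SAS2008:
Code:
smartctl -a -d sat /dev/rdsk/c5t0d0s0    # one-off check of a SATA disk
# weekly short self-test from cron, e.g. Sundays at 03:00:
# 0 3 * * 0 for d in /dev/rdsk/c5t*d0s0; do smartctl -t short -d sat "$d"; done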
 
wheelz, are your Seagates SAS or SATA? I wonder if the SATA controller was re-trying to get the data and was eventually successful, but slow. I would hope the "enterprise class" Seagates you are using do not do that. If you were using consumer drives, it would be easy to point the finger at a drive not failing early on read errors - and there would be nothing ZFS could do to detect something that is hidden from it.
 
They are SATA - model ST31000340NS - but they are supposed to be enterprise class. This was the Barracuda ES line before they came up with the Constellation line.
 