Setup:
Solaris 11
Supermicro motherboard
8 x 3TB Hitachi SATA drives (mix of 5400 and 7200 RPM)
Sans Digital 8-drive SAS enclosure (no Expander)
Array is ZFS RaidZ1
The array keeps intermittently dropping out and shows a CRC error for each of the 8 drives. Reboot the server and then all is good again for a couple of days/weeks.
Diags from LSIUtil show this:
Code:
Diagnostics menu, select an option: [1-99 or e/p/w or 0 to quit] 12
Adapter Phy 0: Link Up
Invalid DWord Count 18
Running Disparity Error Count 16
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 1: Link Up
Invalid DWord Count 18
Running Disparity Error Count 17
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 2: Link Up
Invalid DWord Count 16
Running Disparity Error Count 15
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 3: Link Up
Invalid DWord Count 63
Running Disparity Error Count 60
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 4: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 5: Link Up
Invalid DWord Count 16
Running Disparity Error Count 15
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 6: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Adapter Phy 7: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
Loss of DWord Synch Count 4
Phy Reset Problem Count 0
Basically a small number of errors spread across ALL drives, with Phy 3 noticeably higher than the rest (63 invalid DWords vs. 16-18 on the others).
Not sure if we have an HBA, cable, enclosure or drive problem.
Thoughts on best approach to isolate?
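One way to make the comparison between phys less eyeball-driven is to parse the LSIUtil counters and flag any phy that stands out from the rest. The following is only a rough sketch: the function names, the regexes, and the "more than 2x the median" threshold are my own assumptions, not anything LSIUtil provides, and the sample text carries just two of the four counters per phy to keep it short.

```python
import re
from statistics import median

# Counters copied from the LSIUtil output above (two counters per phy, for brevity).
SAMPLE = """
Adapter Phy 0: Link Up
Invalid DWord Count 18
Running Disparity Error Count 16
Adapter Phy 1: Link Up
Invalid DWord Count 18
Running Disparity Error Count 17
Adapter Phy 2: Link Up
Invalid DWord Count 16
Running Disparity Error Count 15
Adapter Phy 3: Link Up
Invalid DWord Count 63
Running Disparity Error Count 60
Adapter Phy 4: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
Adapter Phy 5: Link Up
Invalid DWord Count 16
Running Disparity Error Count 15
Adapter Phy 6: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
Adapter Phy 7: Link Up
Invalid DWord Count 16
Running Disparity Error Count 16
"""

PHY_RE = re.compile(r"Adapter Phy (\d+):")
COUNTER_RE = re.compile(r"(.+ Count)\s+(\d+)$")


def parse_phy_counters(text):
    """Return {phy_number: {counter_name: value}} from the pasted LSIUtil text."""
    phys, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = PHY_RE.match(line)
        if m:
            current = phys.setdefault(int(m.group(1)), {})
            continue
        m = COUNTER_RE.match(line)
        if m and current is not None:
            current[m.group(1)] = int(m.group(2))
    return phys


def outlier_phys(phys, counter="Invalid DWord Count", factor=2.0):
    """Phys whose counter exceeds `factor` times the median across all phys.

    The 2x-median threshold is an arbitrary choice for illustration.
    """
    mid = median(p[counter] for p in phys.values())
    return [n for n, p in sorted(phys.items()) if p[counter] > factor * mid]


phys = parse_phy_counters(SAMPLE)
print(outlier_phys(phys))                                   # [3]
print(outlier_phys(phys, "Running Disparity Error Count"))  # [3]
```

Running the same comparison after swapping a cable or moving a drive to another bay should show whether the elevated count follows the drive, the cable lane, or the HBA port.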
On bootup, I get this in /var/adm/messages:
Code:
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event_sync: IOCLogInfo=0x31170000
Mar 3 09:04:45 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:45 zulu04 mptsas_handle_event: IOCLogInfo=0x31170000
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:46 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:46 zulu04 mptsas_check_scsi_io: IOCStatus=0x4b IOCLogInfo=0x31110d00
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:04:47 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:04:47 zulu04 mptsas_check_scsi_io: IOCStatus=0x48 IOCLogInfo=0x31130000
Mar 3 09:06:20 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:20 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:21 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca228c1b5f7 (sd3) multipath status: degraded: path 17 mpt_sas19/disk@w5000cca228c1b5f7,0 is online
Mar 3 09:06:21 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:21 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:21 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca228c1859a (sd6) multipath status: degraded: path 10 mpt_sas14/disk@w5000cca228c1859a,0 is online
Mar 3 09:06:22 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:22 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:23 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca228c1b645 (sd1) multipath status: degraded: path 12 mpt_sas16/disk@w5000cca228c1b645,0 is online
Mar 3 09:06:23 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:23 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:23 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca225eb9c2f (sd7) multipath status: degraded: path 13 mpt_sas17/disk@w5000cca225eb9c2f,0 is online
Mar 3 09:06:25 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:25 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:25 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca225ee29e9 (sd4) multipath status: degraded: path 14 mpt_sas13/disk@w5000cca225ee29e9,0 is online
Mar 3 09:06:25 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:25 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:26 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca225ecbb95 (sd2) multipath status: degraded: path 11 mpt_sas18/disk@w5000cca225ecbb95,0 is online
Mar 3 09:06:26 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:26 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:27 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca228c1aa0a (sd5) multipath status: degraded: path 15 mpt_sas20/disk@w5000cca228c1aa0a,0 is online
Mar 3 09:06:27 zulu04 scsi: [ID 243001 kern.info] /pci@0,0/pci8086,340e@7/pci1000,3080@0 (mpt_sas12):
Mar 3 09:06:27 zulu04 mptsas_access_config_page: IOCStatus=0x22 IOCLogInfo=0x30030116
Mar 3 09:06:27 zulu04 genunix: [ID 483743 kern.info] /scsi_vhci/disk@g5000cca225eddfcf (sd8) multipath status: degraded: path 16 mpt_sas15/disk@w5000cca225eddfcf,0 is online
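To make the IOCLogInfo values in the log above easier to compare, they can be split into their bit fields. This sketch assumes the SAS log-info layout used in the public LSI MPI headers and the Linux mpt drivers (top nibble bus type, next nibble originator, then an 8-bit code and a 16-bit sub-code); the names `LogInfo` and `decode_loginfo` are mine. It deliberately does not try to name what each code means, since that requires the vendor's code tables.

```python
from typing import NamedTuple

# Assumed originator values (IOP = I/O processor, PL = protocol layer,
# IR = integrated RAID), per the public LSI MPI log-info layout.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


class LogInfo(NamedTuple):
    bus_type: int    # 0x3 is SAS in the assumed layout
    originator: str
    code: int
    subcode: int


def decode_loginfo(value: int) -> LogInfo:
    """Split a 32-bit IOCLogInfo word into its four fields."""
    return LogInfo(
        bus_type=(value >> 28) & 0xF,
        originator=ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        code=(value >> 16) & 0xFF,
        subcode=value & 0xFFFF,
    )


# The four distinct values that appear in the messages above.
for raw in (0x31170000, 0x31110D00, 0x31130000, 0x30030116):
    info = decode_loginfo(raw)
    print(f"{raw:#010x}: bus_type={info.bus_type:#x} "
          f"originator={info.originator} "
          f"code={info.code:#04x} subcode={info.subcode:#06x}")
```

Under that layout, the three 0x311x values come from the protocol layer (PL) and the 0x30030116 value comes from the IOP, which at least narrows down which table to look the codes up in.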
I'm thinking I should probably disable multipathing in the OS, as it never seems to cause anything but trouble.