Seagate SAS in SM JBOD - lots of read errors (anyone else?)

levak · Oct 13, 2015

Hello!

I have the following new setup:

server with LSI 9207 HBA firmware P19
Supermicro 837E26-RJBOD1 28bay JBOD

I'm having trouble with a very high "Errors Corrected by ECC" read counter in the SMART output.

Writing to the drives produces no errors, but as soon as I start reading data from them, the counter goes crazy. This is a brand new setup and I'm already seeing these errors.

SMART output:

Code:

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        60 C

Manufactured in week 16 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  133
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  133
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 122906373
  Blocks received from initiator = 0
  Blocks read from cache and sent to initiator = 128
  Number of read and write commands whose size <= segment size = 65
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 10.33
  number of minutes until next internal SMART test = 46

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   171072857        0         0  171072857          0         32.863           0
write:         0        0         0         0          0          0.026           0

So far I have tested with these drives:

Seagate Enterprise capacity 3.5 v4 4TB SAS3 -- ERRORS
Seagate Constellation ES.3 ST4000NM0023 4 TB 3.5 SAS2 -- ERRORS
HGST Ultrastar 7K4000 SAS2 4TB HUS724040ALS640 -- NO ERRORS

So far, HGST is working OK, but both Seagates produce errors.

Does anyone else have these drives in their system and could check their SMART stats?
Do you also see high values?
Could you paste your SMART values here?

Thanks, Matej

_CiPHER_ · Oct 13, 2015

Give us the REAL SMART data, not fake aggregated data.

smartctl -A /dev/<device>

Snowknight26 · Oct 13, 2015

That's all smartctl can show for some SAS drives.

For example, with -a:

Code:

smartctl 6.4 2015-06-04 r4109 [x86_64-w64-mingw32-win7-sp1] (sf-6.4-1)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUSSL4020ASS600
Revision:             A131
Compliance:           SPC-4
User Capacity:        200,049,647,616 bytes [200 GB]
Logical block size:   512 bytes
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca013029ce4
Serial number:        xxxxx
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Oct 13 20:56:01 2015 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     21 C
Drive Trip Temperature:        70 C

Manufactured in week 25 of year 2011
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 9719494918275072

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      30237.980           0
write:         0        0         0         0          0      17248.919           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   12846                 - [-   -    -]

Long (extended) Self Test duration: 33 seconds [0.6 minutes]

With -A:

Code:

smartctl 6.4 2015-06-04 r4109 [x86_64-w64-mingw32-win7-sp1] (sf-6.4-1)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     21 C
Drive Trip Temperature:        70 C

Manufactured in week 25 of year 2011
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 9719500253429760

Seeing as my HBA is similar to what levak has (LSI 9211-8i/IBM M1015 versus his LSI 9207), that could also be a factor.

_CiPHER_ · Oct 13, 2015

This is my SMART output:

Code:

[root@zfsguru /home/ssh]# smartctl -A /dev/ada1
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.2-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       5
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       463
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       346
194 Temperature_Celsius     0x0022   124   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

That is REAL SMART data. I am dismayed that you get basically no SMART output at all from your SAS drive. Where does it get its data from? There is no raw output, only aggregated data. It is useless.

In the REAL SMART data, you can see Raw Read Error Rate. The raw value of my drive here is 0, because the true value has been hidden by the manufacturer. This is because many people will think there is something wrong with their drive if it shows high numbers. What they do not realise is:

1) RAW read errors are normal for any hard drive of several megabytes or larger; no hard drive can do without read errors.

2) RAW read errors are counted before ECC error correction kicks in. Hence the 'raw' part; this is the BER, whereas the uBER is the rate after ECC error correction. This is crucial because only errors that remain after ECC are truly important; those turn into bad sectors - unreadable sectors that show up as Current Pending Sector in the SMART output.

3) The Raw Read Error Rate is a RATE - it is relative to the number of read operations. The absolute number of read errors is HUUUUUUGGGEEEEE! The RATE is all that matters, as in a percentage of total read operations. This gives a rough measure of how much ECC error correction is required to ensure good operation.

4) In general, the raw value is meaningless; you should look at the normalised value, which is often at a good value. The normalised values (Current/Worst/Threshold) work differently than people expect: a HIGHER number is BETTER. If the Current value drops BELOW the Threshold value, the drive is faulty/failed according to the SMART data. The hard drive itself determines the values for Current/Worst/Threshold.
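
The rule in point 4 can be sketched in a few lines of Python. This is only a minimal sketch, not smartmontools' own logic: the regex and the helper names `parse_attributes` and `failing` are made up for illustration, and it assumes that a threshold of 0 means "no threshold" and that a value at or below the threshold counts as failed. It reads the attribute table printed by smartctl -A on a SATA drive and lists the attributes whose normalised value has reached the threshold.

```python
import re

# Matches one row of the ATA attribute table printed by `smartctl -A`, e.g.
#   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)$"
)


def parse_attributes(text):
    """Return one dict per attribute row found in `text`."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),   # normalised current value
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": m.group(9).strip(),
            })
    return rows


def failing(rows):
    """Attributes whose normalised value has reached the threshold.

    Assumption: thresh == 0 means the attribute has no threshold,
    and value <= thresh is treated as failed.
    """
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]


# Three rows copied from the output above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   124   115   000    Old_age   Always       -       28
"""

rows = parse_attributes(SAMPLE)
print([r["name"] for r in failing(rows)])
```

For the sample rows nothing is reported, because every normalised value is well above its threshold even though the raw values say little on their own.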

In your case you have no information at all - basically I would say your drive does not support SMART at all. No useful information to be retrieved here.

_CiPHER_ · Oct 13, 2015

Since you are on Windows, I see now, you can also try CrystalDiskInfo - maybe it can retrieve some real information.

levak · Oct 14, 2015

Hey there,

thanks for your help and input.

smartctl -A returns even less than -a:

Code:

smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-229.14.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     34 C
Drive Trip Temperature:        60 C

Manufactured in week 16 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  133
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  133
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 122906373
  Blocks received from initiator = 0
  Blocks read from cache and sent to initiator = 128
  Number of read and write commands whose size <= segment size = 65
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 19.90
  number of minutes until next internal SMART test = 13

_CiPHER_:
Do you have a SAS drive?
Is your output from a SAS drive or from SATA?
According to the smartmontools site and a FreeNAS discussion, SAS drives report a different set of data.
Quote from FreeNAS forum:

SAS has an entirely different set of "informational exceptions." For example, SAS uses the "grown defects list" instead of "reallocated sector count".

Snowknight26: You get the same output as I do on SAS drives.

If I take a look at the 2.5" hard drives attached to my LSI RAID controller, the output is the same:
Drive 1:

Code:

=== START OF INFORMATION SECTION ===
Vendor:               IBM-ESXS
Product:              HUC109030CSS60
Revision:             J2E8
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Oct 14 09:16:21 2015 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        65 C

Manufactured in week 32 of year 2013
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  40
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  713
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 5711047910490112

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0    1202708      38342.154           0
write:         0        0         0         0       7688      24230.731           0
verify:        0        0         0         0       3059        693.652           0

Non-medium error count:        0

No self-tests have been logged

Drive 2:

Code:

=== START OF INFORMATION SECTION ===
Vendor:               IBM-ESXS
Product:              HUC109030CSS60
Revision:             J2E8
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Device type:          disk
Transport protocol:   SAS
Local Time is:        Wed Oct 14 09:18:49 2015 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        65 C

Manufactured in week 32 of year 2013
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  40
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  713
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 5717097741025280

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0    1227192      38616.368           0
write:         0        0         0         0       4356      24145.014           0
verify:        0        0         0         0       1715        694.774           0

Non-medium error count:        0

No self-tests have been logged

Can anyone else post SMART stats from their drives? Especially if the configuration is the same.

Matej

Faldaani · Oct 14, 2015

Try playing with the -d flags for smartmontools, maybe you can force it into SAT mode, or try forcing an ATA command (hoping the JBOD has passthrough of those), that may net you more information.
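
A minimal sketch of that experiment, assuming smartctl is installed and the script is run with enough privileges. The values "scsi" and "sat" are documented smartctl -d device types; the helper names `build_commands` and `try_device_types` are hypothetical. On a native SAS drive the "sat" attempt will most likely just return an error, since SAT translation is meant for SATA drives behind a SAS HBA, but trying it costs nothing.

```python
import subprocess
import sys

# Device types to try, in order. Whether "sat" gets through a given
# HBA/expander/JBOD combination is something to test, not a given.
DEVICE_TYPES = ["scsi", "sat"]


def build_commands(device, types=DEVICE_TYPES):
    """Build one `smartctl -a -d TYPE DEVICE` command line per device type."""
    return [["smartctl", "-a", "-d", t, device] for t in types]


def try_device_types(device):
    """Run each command and collect (exit status, output) keyed by device type."""
    results = {}
    for cmd in build_commands(device):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[cmd[3]] = (proc.returncode, proc.stdout)
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python try_types.py /dev/sdb
    for dev_type, (status, output) in try_device_types(sys.argv[1]).items():
        print(f"-d {dev_type}: exit status {status}, {len(output.splitlines())} lines of output")
```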

cantalup · Oct 14, 2015

try to run smartctl -l error /dev/sdX

Basically your SAS drive is OK, since the grown defect list is 0, but it is correcting errors on the fly during reads.

Can you check the cable? Or try moving the drive to another slot (the one occupied by your Hitachi drive).

Your smartctl SAS output is correct.

Snowknight26 · Oct 14, 2015

cantalup said:
try to run smartctl -l error /dev/sdX

That'll just show the "error counter log" that he's posted a few times already.

cantalup · Oct 14, 2015

Snowknight26 said:
That'll just show the "error counter log" that he's posted a few times already.

Not the error counter log, the SAS error log.

His output is missing the last line (verify):
read: 171072857 0 0 171072857 0 32.863 0
write: 0 0 0 0 0 0.026 0

The error log should look like this:
read: 0 0 0 0 1227192 38616.368 0
write: 0 0 0 0 4356 24145.014 0
verify: 0 0 0 0 1715 694.774 0

I think -l would show the error log and what is running in the background.

levak · Oct 15, 2015

The output from smartctl -l error is the same as the output of the -a command.

I can see a verify row on the Hitachi drive, but not on the Seagate.

I will try some more configurations today, but I think they will all yield the same problems. I have 3 more JBODs and 1 more controller I can test on. I'm waiting for an LSI 9300 HBA to test with a different controller, but I might have to wait a few more days to get it.

Matej

levak · Oct 15, 2015

Today I tested the same drive in 3 different JBODs with 2 different servers and 3 different controllers (but all same brand/model/firmware).

I remembered I also have a brand new IBM server in the rack with 3.5" hard drives. I plugged my Seagate in and powered it on. When the system booted, I did some dd reads from the disk and checked the SMART stats. Errors were again through the roof, but then I checked the other drives in the server, which are also IBM-branded Seagates, and their counters are high as well.

SMART output from the Seagate I tested in the other JBODs:

Code:

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0034
Revision:             E001
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Lowest aligned LBA:   0
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500837568b7
Device type:          disk
Transport protocol:   SAS
Local Time is:        Fri Oct 16 01:22:58 2015 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 15 of year 2015
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  70
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  96
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 3531092253
  Blocks received from initiator = 2485440232
  Blocks read from cache and sent to initiator = 3952476
  Number of read and write commands whose size <= segment size = 449484
  Number of read and write commands whose size > segment size = 4437

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 167.23
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1956084437        0         0  1956084437          0      32055.540           0
write:         0        0         0         0          0       5678.375           0
verify:      634        0         0       634          0          0.000           0

Non-medium error count:        0

Brand new IBM branded Seagate:

Code:

Vendor:               LENOVO-X
Product:              ST300MM0006
Revision:             L56Q
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
Logical block provisioning type unreported, LBPME=0, LBPRZ=0
Rotation Rate:        10500 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c5008e4287cb
Device type:          disk
Transport protocol:   SAS
Local Time is:        Fri Oct 16 01:22:52 2015 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     31 C
Drive Trip Temperature:        65 C

Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 10.40
  number of minutes until next internal SMART test = 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   167106149        0         0  167106149          0         53.495           0
write:         0        0         0         0          0          2.297           0
verify:     7808        0         0      7808          0          0.002           0

Non-medium error count:        8
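
To compare drives with very different amounts of data read, the rows of the error counter log can be normalised by the "Gigabytes processed" column. The following is a minimal sketch under the assumption that each read/write/verify row has exactly the seven numeric columns shown above; the helper names `parse_error_counter_log` and `corrected_per_gb` are hypothetical, and the sample rows are the read rows already posted in this thread.

```python
def parse_error_counter_log(text):
    """Parse the read/write/verify rows of smartctl's SAS 'Error counter log'."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 8 and parts[0] in ("read:", "write:", "verify:"):
            rows[parts[0].rstrip(":")] = {
                "ecc_fast": int(parts[1]),
                "ecc_delayed": int(parts[2]),
                "rereads_rewrites": int(parts[3]),
                "total_corrected": int(parts[4]),
                "algorithm_invocations": int(parts[5]),
                "gigabytes": float(parts[6]),
                "uncorrected": int(parts[7]),
            }
    return rows


def corrected_per_gb(row):
    """Corrected errors per 10^9 bytes processed, or None if nothing was processed."""
    if row["gigabytes"] <= 0:
        return None
    return row["total_corrected"] / row["gigabytes"]


# Read rows copied from the outputs posted in this thread.
SAMPLES = {
    "Seagate, first post": "read:   171072857        0         0  171072857          0         32.863           0",
    "ST4000NM0034": "read:   1956084437        0         0  1956084437          0      32055.540           0",
    "ST300MM0006": "read:   167106149        0         0  167106149          0         53.495           0",
}

for name, text in SAMPLES.items():
    read_row = parse_error_counter_log(text)["read"]
    print(f"{name}: {corrected_per_gb(read_row):,.0f} corrected per GB, "
          f"{read_row['uncorrected']} uncorrected")
```

With these rows the rate comes out at roughly 5.2 million, 61 thousand and 3.1 million corrected errors per GB respectively, while the uncorrected count is 0 in every case.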

cantalup · Oct 15, 2015

One point to focus on: "Elements in grown defect list".
When the grown defect list is growing fast, your SAS drive is about to give up...

I would not worry much about the counter log, since everything got corrected, including verify.
The read ECC error log tells you that the SAS drive is doing correction on the fly...

Are you using Linux? If yes, look at the kernel messages to see if your HBA is complaining.
You can post them and I can take a look.

I am guessing it is a firmware issue on the Seagate SAS drive that is not playing nicely with your HBA. Can you upgrade?

SAS is pretty good at handling error correction compared with SATA drives.

levak · Oct 15, 2015

OK, I will keep an eye on "Elements in grown defect list", but so far all hard drives have a value of 0, as one would expect for new drives.

It's interesting that there are SO many errors. 25 000 000 errors when reading 10GB of data seems like A LOT. If I run the same transfer on a HGST drive, counter is 0. I have some drives in the other server, which have up to 500TB data read and 80 000 000 errors.

I am using Linux, yes, but there are NO errors in logs. That's what's weird.

I will have a look at the Seagate site to see if I can find a firmware upgrade and test with it.

Matej

_CiPHER_ · Oct 15, 2015

levak said:
It's interesting that there are SO many errors. 25 000 000 errors when reading 10GB of data seems like A LOT. If I run the same transfer on a HGST drive, counter is 0.

Harddrives have long passed the boundary where they can retrieve information without applying ECC. That is why on SATA drives, the first SMART attribute is 'Raw Read Error Rate' - the last word is key here. Rate means it is relative to the number of reads. The absolute number of errors is far too high to be meaningful. A percentage of the total number of reads is much more interesting because that tells you whether the drive needs to apply ECC error correction more than what is usual for that type of drive. So the errors are relative here.

0 errors simply means the manufacturer is hiding the errors. There are no harddrives which have no BER errors. You need to go far into the past when there was no ECC or Hamming code present, multiple decades ago. Before my time.

Note that these are errors that are corrected. Hence RAW read errors. If they cannot be corrected after applying ECC, they become unreadable - bad sectors. They would show up as Current Pending Sector in the SMART output of a SATA drive.

BER -> ECC -> uBER = bad sectors
10^-10 -> ECC -> uBER 10^-14

Funny enough i am not at all familiar with SAS SMART output, it appears SAS has totally different support for SMART and from what i have seen so far not much useful information is being shared. SATA SMART output is much more meaningful.

cantalup · Oct 15, 2015

_CiPHER_ said:
Harddrives have long passed the boundary where they can retrieve information without applying ECC. That is why on SATA drives, the first SMART attribute is 'Raw Read Error Rate' - the last word is key here. Rate means it is relative to the number of reads. The absolute number of errors is far too high to be meaningful. A percentage of the total number of reads is much more interesting because that tells you whether the drive needs to apply ECC error correction more than what is usual for that type of drive. So the errors are relative here.

0 errors simply means the manufacturer is hiding the errors. There are no harddrives which have no BER errors. You need to go far into the past when there was no ECC or Hamming code present, multiple decades ago. Before my time.

Note that these are errors that are corrected. Hence RAW read errors. If they cannot be corrected after applying ECC, they become unreadable - bad sectors. They would show up as Current Pending Sector in the SMART output of a SATA drive.

BER -> ECC -> uBER = bad sectors
10^-10 -> ECC -> uBER 10^-14

Funny enough i am not at all familiar with SAS SMART output, it appears SAS has totally different support for SMART and from what i have seen so far not much useful information is being shared. SATA SMART output is much more meaningful.

SAS and SATA are not the same beast,
but SAS can do SATA tunneling.

For details you can read the smartctl author's site, which explains SAS versus SATA.
In simple words: SAS is more reliable and smart (not SMART).

_CiPHER_ · Oct 15, 2015

SAS is just an interface; how can you say SAS drives are more reliable? Just because SAS is usually enterprise, with a much higher price point per gigabyte, and the uBER spec is usually also better (10^-15 or 10^-16 instead of 10^-14), meaning up to 100 times fewer bad sectors.
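
The "up to 100 times" figure follows directly from the specified rates. As a rough worked example (a sketch only, assuming the uBER is quoted per bit read, that errors are independent and occur at a constant rate, and using one full read of a 4 TB drive as a hypothetical workload):

```python
def expected_unrecoverable_errors(bytes_read, uber):
    """Expected number of unrecoverable read errors for a given amount of data.

    Assumes `uber` is a per-bit rate and errors are independent.
    """
    return bytes_read * 8 * uber


capacity_bytes = 4e12  # one full read of a 4 TB drive (hypothetical workload)

for uber in (1e-14, 1e-15, 1e-16):
    expected = expected_unrecoverable_errors(capacity_bytes, uber)
    print(f"uBER {uber:.0e}: about {expected:.4f} expected unrecoverable errors per full read")
```

Reading 4 TB means reading 3.2 x 10^13 bits, so the expectation is about 0.32 errors at 10^-14, 0.032 at 10^-15 and 0.0032 at 10^-16: a factor of 10 per step and 100 across the whole range.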

Interestingly, SAS drives can be less reliable for consumer usage. The start/stop cycles are specified much lower than for consumer drives. Also, SAS focuses more on rpm and IOPS performance, and 10k and even 15,000 rpm SAS drives can easily be beaten by cheap consumer-grade drives which have much higher data density.

The SAS interface has some advantages. Off the top of my head: a dedicated command channel and 128 queued I/Os instead of 32 for SATA. But the differences are pretty marginal, I think. Generally, if you want reliability, you need to look at the software side (ZFS). Using expensive hardware is usually only required when working with legacy storage (NTFS, Ext4, UFS, RAID).

cantalup · Oct 15, 2015

_CiPHER_ said:
SAS is just an interface; how can you say SAS drives are more reliable? Just because SAS is usually enterprise with much higher price point per gigabyte and uBER is usually also higher (10^-15 and 10^-16) meaning up to 100 times less bad sectors.

Interestingly, SAS drives can be less reliable for consumer usage. The start/stop-cycles are specified much lower than for consumer drives. Also SAS focuses more on rpm and IOps performance, and 10k and even 15.000rpm SAS drives can easily be beaten by cheap consumer-grade drives which have much higher data density.

SAS interface has some advantages. Out of my head dedicated command channel and 128 queued I/Os instead of 32 for SATA. But the differences are pretty marginal i think. Generally, if you want reliability, you need to look at the software side (ZFS). Using expensive hardware is usually only required when working with legacy storage (NTFS, Ext4, UFS, RAID).

SAS is superior to SATA.
I don't buy the uBER argument....

Have you been using both SAS and SATA?
Those are two different beasts.

SAS focuses on error correction too, plus some extra features for redundancy and recovery.

Queue depth is not an issue with SATA, since it can be handled by the SAS HBA queue depth.

I am not talking about reliability; durability and on-the-fly recovery with extra features are the key in SAS.

ZFS? Been using it since 2007

... and it is not related to this thread...

Check the SAS and SATA protocols in detail...

_CiPHER_ · Oct 15, 2015

Well you have not provided one key reason to pay the premium price for disks with SAS interface. There are also disks of the same class with SATA interface, such as the Velociraptor. Basically a SAS drive with SATA interface. There you also get many of the characteristics that SAS drives have to offer.

Reliability is an issue, of course that is related to ZFS. With ZFS, you can use cheap drives and 20 drives with multiple layers of redundancy is still way better than a single expensive SAS disk without redundancy. The price difference per GB is simply huge! You compensate the lower reliability of cheap consumer-grade disks by using intelligent software (ZFS). That was the original thought behind RAID by the way; hence the I for Inexpensive, which funny enough often gets substituted for 'Independent' these days. But that was not the idea of RAID; the idea was that by combining cheap disks with intelligent software techniques one could construct a storage device that was more reliable than expensive hardware can buy, for the same price and on top of that much higher storage capacity.

You do not buy the uBER story? Funny, since this is one of the key reasons to buy SAS disks, which have a better than 10^-14 uBER. That is why they generate bad sectors a lot less often: up to 100 times less, according to the specifications the manufacturer releases. Are they lying to you?

levak · Oct 21, 2015

Hello again...

I got an answer back from Seagate saying that they don't filter what the disk reports back, hence so many errors. They are raw values. All is good and we decided to keep the SAS3 drives.

As far as price goes, SAS drives are not that much more expensive. We pay about 10 EUR more for SAS drives compared to SATA (enterprise grade SATA, not the cheap "green" series).

Why SAS in our case?
* Better error handling (we are having troubles with ZFS locking up - something gets stuck between HBA - SAS expander - SATA drive)
* SAS expander support (we have multiple JBODs with SAS expanders attached to servers)
* Dual-port support (each JBOD/disk is attached to 2 servers, providing HA)
* Support (if we want support, SAS drives are the only supported. No support for SATA drives from "vendor").

Matej

cantalup · Oct 21, 2015

levak said:
Hello again...

I got an answer back from Seagate saying that they don't filter what disk reports back, hence so many errors. They are raw values. All is good and we decided to keep the SAS3 drives.

As far as price goes, SAS drives are not that much expensive. We pay cca 10eur more for SAS drives, compared to SATA (enterprise grade SATA, not cheap "green" series).

Why SAS in our case?
* Better error handling (we are having troubles with ZFS locking up - something gets stuck between HBA - SAS expander - SATA drive)
* SAS expander support (we have multiple JBODs with SAS expanders attached to servers)
* Dual-port support (each JBOD/disk is attached to 2 servers, providing HA)
* Support (if we want support, SAS drives are the only supported. No support for SATA drives from "vendor").

Matej

Typical Seagate answer: raw values

including their SATA S.M.A.R.T. data!

As I mentioned: nothing is wrong with your SAS drives, just watch out for "Elements in grown defect list"

...
When the grown defect list is increasing very fast and uncontrollably, that is the time to replace the drive.

SAS is smart. SATA is S.M.A.R.T.
I believe you know what I am implying.

levak · Oct 21, 2015

Yea, their SATA SMART is... hard to read?

When "Elements in grown defect list" starts to grow, SMART probably also raises an alarm. I will create a nagios check and collectd script anyway, so I won't be surprised one day

Matej
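
A minimal sketch of such a check, assuming the drive prints the "SMART Health Status" and "Elements in grown defect list" lines exactly as in the outputs above. The exit codes are the standard Nagios plugin ones; the `evaluate` helper and the warn/crit thresholds of 1 and 10 are hypothetical and should be tuned to the environment.

```python
import re
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def evaluate(smart_text, warn=1, crit=10):
    """Return (exit_code, message) from the text of `smartctl -a` on a SAS drive.

    warn/crit are hypothetical thresholds for the grown defect list.
    """
    health = re.search(r"^SMART Health Status:\s*(.+)$", smart_text, re.M)
    defects = re.search(r"^Elements in grown defect list:\s*(\d+)", smart_text, re.M)
    if not health or not defects:
        return UNKNOWN, "UNKNOWN - could not parse smartctl output"
    status = health.group(1).strip()
    count = int(defects.group(1))
    # Text after the pipe is Nagios performance data, useful for graphing growth.
    msg = f"health={status}, grown_defects={count} | grown_defects={count}"
    if status != "OK" or count >= crit:
        return CRITICAL, "CRITICAL - " + msg
    if count >= warn:
        return WARNING, "WARNING - " + msg
    return OK, "OK - " + msg


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_sas_smart.py /dev/sdb
    out = subprocess.run(
        ["smartctl", "-a", sys.argv[1]], capture_output=True, text=True
    ).stdout
    code, message = evaluate(out)
    print(message)
    sys.exit(code)
```

Keeping the parsing in `evaluate` separate from the smartctl call makes it easy to test against saved output, and the performance data lets the monitoring system show how fast the list is growing rather than only its current size.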

cantalup · Oct 21, 2015

levak said:
Yea, their SATA SMART is... hard to read?
When "Elements in grown defect list" starts to grow, SMART probably also raises an alarm. I will create a nagios check and collectd script anyway, so I won't be surprised one day

Matej

Hard to read; only they can read it.

On my Barracuda SATA drive's S.M.A.R.T. data, some fields are very large or look like random numbers.

I set up the smartmontools config to monitor S.M.A.R.T. data and created a script that pulls all reallocated/pending/temperature/load cycle values, so I can check them quickly by running the script.

Best solution, in my opinion: create a script to poll the grown defect list at intervals and push it to Nagios.

Have fun!!!

ST3F · Oct 24, 2015

My 2 cents:

Using an expander or external JBOD case with SATA disks:

-> Seagate: KO
-> Western Digital, from the Green series to RE enterprise: KO
-> HGST 5K*** and 7K***: OK
-> Samsung (old 1 TB / 2 TB): OK

FLECOM · Oct 26, 2015

_CiPHER_ said:
legacy storage (NTFS, Ext4, UFS, RAID).

ST3F said:
My 2cts :

Using Expander or external JBOD case with SATA disks :

-> Seagate : ko
-> Western Digital, from Green series to RE enterprise : ko
-> HST 5k*** and 7k*** : ok
-> Samsung (old 1 To / 2 To) : ok

huh? plenty of people (myself included) are using Seagate, WD (green to RE) etc drives in supermicro chassis without issue, but the OP is using SAS drives so....

anyway, I have some Seagate SAS drives in a supermicro 836 chassis and never looked at the SMART data because the drive just worked, and the controllers didn't complain

levak · Nov 25, 2015

So far no problems. In the meantime, one drive did fail, but it was easy to find:
- the drive was 100% busy even when there was almost no traffic
- SMART showed 'Drive failing'

Other drives are humming along nicely and they took some beating by now with various test scenarios...

Matej
