SSD RAID 0 Issue - One of them is bad, but which one?...

Discussion in 'SSDs & Data Storage' started by Semp, Feb 20, 2019.

  1. Semp

    Semp n00b

    Messages:
    5
    Joined:
    Feb 20, 2019
    Let me start off by saying I know RAID 0 is frowned upon, I shouldn't have done this in the first place, etc etc... Data loss wasn't a real concern and I needed the speed bump. Please spare me the posts about what a dumb idea this was.

    Okay, I had 4 Samsung 860 EVOs in a RAID0 config on Ubuntu 16.04. All was well for a few weeks, but after a reboot I had some issues with my XFS partition.

    'dmesg | less' showed the following:

    [ 23.197822] XFS (md2p1): Metadata CRC error detected at xfs_inobt_read_verify+0x6c/0xd0 [xfs], xfs_inobt block 0x101ba9200
    [ 23.197826] XFS (md2p1): Unmount and run xfs_repair
    [ 23.197827] XFS (md2p1): First 64 bytes of corrupted metadata buffer:
    [ 23.197829] ffff881007dd6000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 23.197831] ffff881007dd6010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 23.197832] ffff881007dd6020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 23.197833] ffff881007dd6030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 23.197840] XFS (md2p1): metadata I/O error: block 0x101ba9200 ("xfs_trans_read_buf_map") error 74 numblks 8
    [ 23.197844] XFS (md2p1): xfs_do_force_shutdown(0x1) called from line 315 of file /build/linux-PrHwV2/linux-4.4.0/fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffcb60e8e8
    [ 23.198811] XFS (md2p1): I/O Error Detected. Shutting down filesystem
    [ 23.198813] XFS (md2p1): Please umount the filesystem and rectify the problem(s)
    [ 41.158820] XFS (md2p1): xfs_log_force: error -5 returned.
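
The block address in the log above (0x101ba9200) can hint at which member to look at first. On a plain md RAID0 with equal-size members, consecutive chunks rotate across the drives in order. The sketch below is only an illustration under assumptions: it treats the address as a count of 512-byte sectors from the start of the partition, and the partition start (2048 sectors) and chunk size (512 KiB) are placeholders, not values from this system. The real numbers would come from the partition table and `mdadm --detail /dev/md2`. The function name is made up.

```python
SECTOR_BYTES = 512  # assumption: the logged block number counts 512-byte sectors


def raid0_member_for_sector(part_sector, part_start_sector, chunk_kib=512, n_disks=4):
    """Map a sector inside a partition on an md RAID0 array to
    (member index, sector within that member's data area).

    Assumes every member is the same size, so chunks simply rotate
    member 0, 1, ..., n_disks - 1 and then start over.
    """
    sectors_per_chunk = chunk_kib * 1024 // SECTOR_BYTES
    # Sector relative to the start of the md device, not the partition.
    array_sector = part_start_sector + part_sector
    chunk_index, offset_in_chunk = divmod(array_sector, sectors_per_chunk)
    stripe, member = divmod(chunk_index, n_disks)
    return member, stripe * sectors_per_chunk + offset_in_chunk


if __name__ == "__main__":
    # 2048 is a placeholder partition start, not read from the real system.
    member, sector = raid0_member_for_sector(0x101BA9200, part_start_sector=2048)
    print(f"member index {member}, sector {sector} within that member's data area")
```

The member index is the array's own device order (the RaidDevice column in `mdadm --detail`), which is not necessarily the order of /dev/sda, /dev/sdb and so on. A zeroed block landing on one member does not prove that drive is faulty; it only says where to look first.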

    Running xfs_repair didn't help much, and it was clear something was very wrong.

    A total rebuild of the RAID0 and a fresh install had similar problems again.

    I'm left to assume one of the drives has an issue, but I don't know which one. I've thrown them all onto my desktop system and checked them with Samsung's Magician tool, and they all show as "Good"...

    What can I do to test the drives individually for problems?
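
One generic way to test a drive on its own is a write/read-back check: fill part of it with a reproducible pattern, then read it back and compare. The sketch below is a minimal version of that idea, not a replacement for a proper burn-in tool. The function names, seed, and block size are made up for illustration. Pointing it at a raw device such as /dev/sdX destroys whatever is on it, so a scratch file on a filesystem created on the single drive is the safer target.

```python
import hashlib
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per test block (arbitrary choice)


def _block(seed: bytes, index: int) -> bytes:
    # Deterministic pseudo-random block: same seed and index give the same bytes.
    return hashlib.shake_256(seed + index.to_bytes(8, "big")).digest(BLOCK_SIZE)


def write_pattern(path, n_blocks, seed=b"drive-test"):
    """Write n_blocks of the pattern to path and force it out to the device."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(_block(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, n_blocks, seed=b"drive-test"):
    """Read the pattern back and return the indexes of blocks that differ."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(BLOCK_SIZE) != _block(seed, i):
                bad.append(i)
    return bad
```

Writing and verifying are separate on purpose. A read straight after the write may be served from the page cache instead of the drive, so running `verify_pattern` again after a reboot is the more meaningful check, which also matches when the corruption showed up here.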
     
  2. Rifter0876

    Rifter0876 [H]Lite

    Messages:
    64
    Joined:
    Nov 1, 2017
    What does the SMART data say? Anything obvious?
     
  3. Semp

    Semp n00b

    Messages:
    5
    Joined:
    Feb 20, 2019
    No, nothing at all. They all look the same and aren't showing any errors.
     
  4. Rifter0876

    Rifter0876 [H]Lite

    Messages:
    64
    Joined:
    Nov 1, 2017
    In that case I would assume it's bad data; something got corrupted. Wipe the drives, restore from a known good backup, and carry on.
     
  5. pitingres

    pitingres n00b

    Messages:
    56
    Joined:
    Jul 25, 2018
    Is this the boot filesystem? I'd take one SSD and install it as boot without RAIDing it. If that fails you've found it. If it works, you can try installing an xfs filesystem on each of the remaining three hoping to find the wonky one. If that works, I'd do RAID0 on the 3 paired combinations. And if that works, swap the boot SSD with one of the others and do it again.

    And maybe run a memtest before you mess with all of the above. It could be bad memory.
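
The pair-testing step above can be sketched as a small elimination helper. It assumes exactly one component is bad and that a bad drive makes every array containing it fail, which may not hold for an intermittent fault. The drive labels and function names are hypothetical.

```python
from itertools import combinations


def pair_plan(drives):
    """Every two-drive RAID0 combination to test."""
    return list(combinations(drives, 2))


def suspects(drives, results):
    """Narrow down the bad drive from pair test results.

    results maps a (drive_a, drive_b) tuple to True if that pair's
    RAID0 test passed and False if it showed corruption.
    """
    # Any drive that took part in a passing pair is treated as good.
    cleared = {d for pair, ok in results.items() if ok for d in pair}
    candidates = set(drives) - cleared
    # The bad drive must be present in every failing pair.
    for pair, ok in results.items():
        if not ok:
            candidates &= set(pair)
    return sorted(candidates)
```

For three drives labelled B, C, and D, `pair_plan` gives the three pairs mentioned above. If both pairs containing C fail and the B/D pair passes, `suspects` returns only C. If nothing fails, it returns an empty list, which points away from the drives.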
     
  6. Semp

    Semp n00b

    Messages:
    5
    Joined:
    Feb 20, 2019
    Yes, it was the boot filesystem. I'm going to set up the OS on a separate SSD today or tomorrow, and then try the RAID0 again.

    The RAID was configured via the Ubuntu Server installer.

    After a fresh RAID config and a fresh OS install, I had similar weird issues: the system would hang or get stuck on odd errors. I can't remember exactly what they were, as this was a few weeks ago. That made me think it had to be a bad drive.
     
  7. Rifter0876

    Rifter0876 [H]Lite

    Messages:
    64
    Joined:
    Nov 1, 2017
    Could be a bad SATA cable too, or bad memory.
     
  8. Semp

    Semp n00b

    Messages:
    5
    Joined:
    Feb 20, 2019
    Memory is ECC, and I've run memtest on the system too.

    Never thought of a bad SATA cable though.
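
A cable or connector problem often shows up in SMART as interface CRC errors rather than as a failing drive. Attribute 199 is commonly used for this counter, though its name varies by vendor (for example UDMA_CRC_Error_Count or CRC_Error_Count). The sketch below pulls that raw value out of the `smartctl -A` attribute table. It assumes smartmontools is installed, that the script runs as root, and that the drive reports attribute 199 at all. The function names are made up.

```python
import subprocess


def parse_smart_attributes(text):
    """Parse the attribute table printed by `smartctl -A`.

    Returns {attribute_id: (name, raw_value_string)}.
    """
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[int(parts[0])] = (parts[1], " ".join(parts[9:]))
    return attrs


def crc_errors_from_text(text, attr_id=199):
    """Return the raw CRC error count, or None if the attribute is absent."""
    entry = parse_smart_attributes(text).get(attr_id)
    if entry is None:
        return None
    return int(entry[1].split()[0])


def crc_error_count(device):
    """Run smartctl on a device such as /dev/sda and return the count."""
    # smartctl's exit status is a bitmask, so a non-zero code is not fatal here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return crc_errors_from_text(result.stdout)
```

A count that is non-zero on one port, or that keeps rising between checks, points at the cable, connector, or controller rather than the flash. A drive can look healthy in every other attribute while this happens.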
     
  9. ChRoNo16

    ChRoNo16 [H]ard|Gawd

    Messages:
    1,346
    Joined:
    Feb 3, 2011
    If you're not getting SMART errors and Samsung Magician says you're fine, the SSDs are not your issue. You either have corrupt data in the OS, or a bad cable or bad connection somewhere.
     
  10. Dead Parrot

    Dead Parrot 2[H]4U

    Messages:
    2,337
    Joined:
    Mar 4, 2013
    Nothing wrong with Raid 0 as long as you know the risks.

    Seconding the thought of bad cables, connections, or even controllers.

    Given the speed bump you reported, I wonder if one of the MB chips that handles drive data transfer isn't up to the task.
     
  11. Rifter0876

    Rifter0876 [H]Lite

    Messages:
    64
    Joined:
    Nov 1, 2017
    I agree with you about the RAID 0. If you need the speed, you need the speed; it's literally that simple. And as long as you have a good backup/recovery system in place, there is no risk.
     
  12. Semp

    Semp n00b

    Messages:
    5
    Joined:
    Feb 20, 2019
    I have the same setup on an identical machine and it's never given me an issue, but it's possible this motherboard is defective somehow, I guess.

    I'm running RAID0 with 2 Kingston drives on it right now and all is well. I'm waiting for parts to set up the Samsung drives in RAID0 again on another motherboard, and will use all new SATA cables as well.

    Thanks for all your posts!
     