SSD RAID 0 Issue - One of them is bad, but which one?...

Semp

Let me start off by saying I know RAID 0 is frowned upon, I shouldn't have done this in the first place, etc etc... Data loss wasn't a real concern and I needed the speed bump. Please spare me the posts about what a dumb idea this was.

Okay, I had four Samsung 860 EVOs in a RAID0 config on Ubuntu 16.04. All was well for a few weeks, but after a reboot I had some issues with my XFS partition.

'dmesg | less' showed the following:

[ 23.197822] XFS (md2p1): Metadata CRC error detected at xfs_inobt_read_verify+0x6c/0xd0 [xfs], xfs_inobt block 0x101ba9200
[ 23.197826] XFS (md2p1): Unmount and run xfs_repair
[ 23.197827] XFS (md2p1): First 64 bytes of corrupted metadata buffer:
[ 23.197829] ffff881007dd6000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 23.197831] ffff881007dd6010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 23.197832] ffff881007dd6020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 23.197833] ffff881007dd6030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 23.197840] XFS (md2p1): metadata I/O error: block 0x101ba9200 ("xfs_trans_read_buf_map") error 74 numblks 8
[ 23.197844] XFS (md2p1): xfs_do_force_shutdown(0x1) called from line 315 of file /build/linux-PrHwV2/linux-4.4.0/fs/xfs/xfs_trans_buf.c. Return address = 0xffffffffcb60e8e8
[ 23.198811] XFS (md2p1): I/O Error Detected. Shutting down filesystem
[ 23.198813] XFS (md2p1): Please umount the filesystem and rectify the problem(s)
[ 41.158820] XFS (md2p1): xfs_log_force: error -5 returned.

Running xfs_repair didn't help much, and it was clear something was very wrong.

A total rebuild of the RAID0 and a fresh install had similar problems again.

I'm left to assume one of the drives has an issue, but I don't know which one. I've thrown them all onto my desktop system and checked each one with Samsung's Magician tool; they all show as "Good"...

What can I do to test the drives individually for problems?
 
No, nothing at all. They all look the same and aren't showing any errors.
 
In that case I would assume it's bad data; something got corrupted. Wipe the drives, restore from a known good backup, and carry on.
 
Is this the boot filesystem? I'd take one SSD and install it as boot without RAIDing it. If that fails you've found it. If it works, you can try installing an xfs filesystem on each of the remaining three hoping to find the wonky one. If that works, I'd do RAID0 on the 3 paired combinations. And if that works, swap the boot SSD with one of the others and do it again.

And maybe run a memtest before you mess with all of the above. It could be bad memory.
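If you do put a filesystem on each drive on its own, one rough way to exercise it is to write a known pattern to a scratch file and check that it reads back unchanged. Below is a minimal Python sketch of that idea, not a proper burn-in tool. The function name `write_and_verify`, the seed, and the chunk size are my own choices for illustration. Note the read-back may be served from the page cache rather than the SSD, so use a file larger than RAM or drop caches between the write and the read if you want the drive itself to be tested.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # write and read in 1 MiB pieces


def write_and_verify(path, total_bytes, seed=b"raid0-check"):
    """Write a deterministic pattern to `path`, read it back, compare hashes.

    Returns True when the bytes read back match the bytes written.
    WARNING: overwrites whatever is at `path`; point it at a scratch file.
    """
    def pattern():
        # Deterministic stream derived from the seed, generated chunk by
        # chunk so the whole pattern never has to sit in memory.
        written = 0
        counter = 0
        while written < total_bytes:
            block = hashlib.sha256(seed + counter.to_bytes(8, "little")).digest()
            chunk = (block * (CHUNK // len(block)))[: total_bytes - written]
            yield chunk
            written += len(chunk)
            counter += 1

    expected = hashlib.sha256()
    with open(path, "wb") as f:
        for chunk in pattern():
            f.write(chunk)
            expected.update(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out of Python and the OS buffers

    actual = hashlib.sha256()
    remaining = total_bytes
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(CHUNK, remaining))
            if not data:
                break  # file is shorter than expected; hashes will differ
            actual.update(data)
            remaining -= len(data)
    return expected.hexdigest() == actual.hexdigest()
```

Run it against a file on each single-drive filesystem in turn (the mount point is whatever you chose); a drive that repeatedly returns False is the one to look at more closely.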
 
Yes, it was the boot filesystem. I'm going to set up the OS on a separate SSD today or tomorrow, and then try the RAID0 again.

RAID was configured via the Ubuntu server install.

The fact that I did a fresh RAID config and a fresh OS install and still had similar weird issues with the system made me think that it had to be a bad drive. (I can't remember exactly what they were now, as this was a few weeks ago. The system would hang or get stuck on weird errors.)
 
Could be a bad SATA cable too, or bad memory.
 
Memory is ECC, and I've run memtest on the system too.

Never thought of a bad SATA cable though.
 
If you're not getting SMART errors and Samsung Magician says you're fine, the SSDs are not your issue. You either have corrupt data in the OS, or a bad cable or bad connection somewhere.
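On the cable angle: SMART attribute 199 (shown by smartctl as UDMA_CRC_Error_Count or CRC_Error_Count, depending on the drive) counts interface CRC errors, and a raw value that keeps climbing usually points at the cable or connection rather than the flash. Here is a small Python sketch that pulls that value out of `smartctl -A` output. The function names are mine, and it assumes smartmontools is installed and that the output uses the usual ten-column attribute table.

```python
import subprocess


def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value string) from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


def crc_error_count(text):
    """Return the raw value of attribute 199 as an int, or None if absent."""
    entry = parse_smart_attributes(text).get(199)
    if entry is None:
        return None
    return int(entry[1].split()[0])


def read_smart(device):
    """Run `smartctl -A` on a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return result.stdout
```

Something like `crc_error_count(read_smart("/dev/sda"))` for each of the four drives, before and after a heavy copy, would show whether one port or cable is racking up errors.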
 
Nothing wrong with RAID 0 as long as you know the risks.

Seconding the thought of bad cables, connections, or even controllers.

Given the speed bump you reported, I wonder if one of the motherboard chips that handles drive data transfer isn't up to the task.
 
I agree with you about the RAID 0. If you need the speed, you need the speed; it's literally that simple. And as long as you have a good backup/recovery system in place, there is no risk.
 
I have the same setup on an identical machine and it's never given me an issue, but it's possible this motherboard is defective somehow, I guess.

I'm running RAID0 with 2 Kingston drives on it right now and all is well. Waiting for parts to set up the Samsung drives in RAID0 again on another motherboard; I will use all new SATA cables as well.

Thanks for all your posts!
 