3TB GPT Drives "go RAW" Randomly in SAS Server

VERTIGGO

n00b
Joined
May 15, 2007
Messages
39
First off, I don't want any crap about data recovery, I've exhausted that topic here:
http://forums.anandtech.com/showthread.php?t=2161566
and never recovered more than .1% of the data I lost, but since they were backups I'm moving on.

I want to know why GPT disks are so weak that they will lose their format at the drop of a hat. If no one has any idea, then has anyone else at least experienced this? Essentially, during normal operation (copying to or from), windows will report "error directory does not exist" or whatever, and as soon as I look at the drive it is still mounted with a letter, but just says RAW data and "do you want to format".

I am running an SAS server using the ever popular HP expander and an LSI controller:
http://hardforum.com/showthread.php?p=1036585197

It has worked nearly flawlessly for maybe 7 months, but with two different 3TB drives, one Hitachi and one WD, I have had this format suicide occur and lost about 1500-2000GB of backed-up data each time. I am quite frustrated that I can't seem to find anyone else saying the same thing, but I'm pretty confident that it boils down to the SAS system, since it occurred on two completely different drives, and I had replaced CPU, mobo, RAM, and PSU between times (data corruption worries and the upgrade itch).

I plan to look into the SAS firmware, but I'm using the drives as MBR 2.2TB until I can trust them again. Has anyone experienced this? If so, why isn't there an outcry against this suicidal technology?
 
Are you using any utilities for copying (like Microsoft RichCopy)? I remember I wiped my GPT drives a couple of times when RichCopy copied some hidden file or directory from the root of an NTFS (2TB) drive to a GPT (3TB) drive. The GPT drive became RAW on the next reboot (or instantly). What helped me was checking "don't copy system or hidden files" in the options tab.
I know it sounds absurd but I hope it helps......
 
No, I only use Windows Explorer (Win 7 64 bit). I do use Allwaysync for other backups, but I haven't used it on these drives yet, so it's definitely not related to that. Also, remember that the problem occurs immediately (both times). I'm mid-transfer and Windows says "can't write to that hard drive"; it just pops the cyanide while I'm watching it.
 
OK, what am I looking for? The drives don't ever have health problems or CRC issues, it's a MBR issue...

SMART.png
 
Do you have "show hidden/system files" enabled? As I said, I lost GPT drives twice when I copied something from a non-GPT drive to a GPT drive (root to root). I don't think this is a hardware issue. Something overwrites the MBR/GPT info while you are copying...

As a test try the following:
Create your blank GPT drive (3 TB) - don't copy anything. Then create one directory (let's call it BACKUP). Then when copying to this drive you should copy all data to that directory only (ex E:\BACKUP). Don't copy any files or directories to the root of the GPT drive (E:\). Let's see if you lose this drive again.
 
Yes I do have hidden files enabled, but not system files. Also, the only thing I really put on it were .mkv files and .m2ts files that I was demuxing (compressing my blurays and dvds to store).

So you suspect that copying media files to the root could wipe the drive out? I can try to use them only within directories, but I'm not sure such a long term experiment is worth the risk... it took months of use for this to happen to each drive, and so the only way to get a result is by losing terabytes of data? I would be better off using it as MBR... losing 5-600GB of space (or $30) doesn't compare to the risk of losing thousands of hours of work does it?

I suppose I could set up one as MBR, using 2.2TB and mirror data daily from it to the other using a 3TB GPT partition. I would still have the data provided the test fails before the primary drive does...
 
This does not make sense. I use GPT and I have not had anything like this. However I do not have a single 3TB drive.
 
I am assuming since they are 3TB drives... they are SATA yes?

if so what is the OVERALL length of the cabling?

ie add up the length of cable from controller to the Expander... then from the expander to the backplane or direct to drive...

MAXIMUM length of SATA cabling is 1Meter... not taking into account backplanes or expanders... so I would suggest maybe MAX length of say 70cm...

I am guessing you are on the limit... ??
 
I'm not sure that SATA limit applies to the total cable length in this case, since the expander is essentially acting as a repeater, regenerating the signal as it passes.
 
Note the SAS Expander part... not SATA Expander....
Not sure on the Regenerate and resend the signal part.... I would suggest it EXPANDS the signal... Not Regenerate....

If that was the case I would imagine the SAS Expander would have some decent CPU onboard, a la a network switch, to keep up with all the SAS packets potentially flying back and forth between up to 65,000 devices.

SATA protocol runs encapsulated inside SAS protocol... which would also mean the possibility of the SAS expander having to re-encapsulate the packets as well.... Naturally it is simpler for it with SAS, since it does not have to look inside the packet for SATA traffic to re-encapsulate.

Not that I understand it completely... but I would still be wary of overall length

Even the docs for Intels own SAS expander state

Use cables no longer than ten meters for SAS and one meter for SATA. It is better to use the shortest possible cables. The cable length should be reduced by about one foot (.33 meters) if using a backplane.

Although it does not actually explain whether that is 1 meter each side of the expander or not, they still say to use the shortest length possible.

Can't hurt to provide the cable length I asked for... and or to try shorter cables if you have them...

If nothing else it will write off cable length (or possibly a dicky cable) as your possible problem;)
 
I am pretty sure it has to repeat the signal to expand the ports.
 
Like I said I am not TOTALLY SURE how it works, tho more reading is interesting


Snip

On the other hand, SATA devices can operate very well on SAS hardware through STP or natively, if no expander hardware is used. STP introduces extra latency, as dwords flow through expanders. The expander has to establish the connection, which is a slower process than SATA’s direct communication. Still, the latency is acceptably low.

Domains, Expanders

SAS domains can be compared to tree-like structures similar to what you may know from Ethernet networks. SAS expanders can manage a large number of SAS devices, but they work based on link switching rather than the more common packet switching. Some expanders include SAS devices while others don’t.

Snip
from
http://www.tomshardware.com/reviews/sas-6gb-hdd,2392-2.html



Although SAS has a higher signal voltage than SATA, hence the reason its length can be greater....

Which also as you suggest could mean effectively you get SAS voltage from the SAS card to the EXPANDER... then again to a second Expander if one is used...so that length from controller to expander is not the limit...

But the SATA side of the equation being a lower voltage (hence the 1meter max) would / could be causing a problem..

Looking at the links you posted... and the photo's of your Norco.
your cables from the Expander to the backplane are what... approx somewhere between 50-70cm?

Then take into account IBM and others spec limitations which recommend taking into account approx 33cm for a backplane..

If your SFF-8087 mini-SAS cables are indeed around 50cm then you should be under 1 meter (50cm cable + 33cm for backplane = 83cm).... if they are around 70cm... then you are just over the limit, with an effective length of approx 103cm.
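The length arithmetic here can be sketched as a quick check. This is only an illustration, assuming the 1 meter SATA limit and the roughly 33cm backplane allowance quoted in this thread; the function name and the two example cable lengths are made up for the sketch.

```python
# Rough SATA cable budget check. The 100 cm limit and the 33 cm backplane
# allowance are the figures quoted in this thread, not measured values.
SATA_LIMIT_CM = 100
BACKPLANE_ALLOWANCE_CM = 33

def effective_length_cm(cable_cm, through_backplane=True):
    """Cable length plus the backplane allowance, in centimetres."""
    return cable_cm + (BACKPLANE_ALLOWANCE_CM if through_backplane else 0)

# Hypothetical cable lengths matching the 50-70 cm guess in the post.
for cable_cm in (50, 70):
    total = effective_length_cm(cable_cm)
    verdict = "within" if total <= SATA_LIMIT_CM else "over"
    print(f"{cable_cm} cm cable -> {total} cm effective, {verdict} the {SATA_LIMIT_CM} cm limit")
```

Whether the limit applies per segment or end to end through an expander is exactly what is unclear in the quoted docs, so treat the result as a worst-case estimate for the expander-to-drive leg only.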

Have you tried simple thing like swapping the minisas cables for different ones? ie different brand / shorter.

Have you tried the 3TB drives in different bays?? ... that is, different backplanes.... not different bays within the same backplane (Norco's backplanes go across, not down, yes?)

I know with Solaris and ZFS, you quickly find out how good your HARDWARE is whilst doing a scrub...

That is, people find drives dropping in and out, ZFS spitting errors... where other file systems and Raid cards might not notice. or blindly hide these errors.... all you notice is corrupted data later on.

They start swapping cables around and BANG the issues disappear...

All you need is one new shorter cable; start swapping it with the others to find the problem cable.... and reposition the 3TB drives to find the problem backplane.

Eliminate the hardware errors first when chasing problems I say.;)

If that's not the problem, then move on to other areas... i.e. are the drives SATA 2 or SATA 3... and are the backplanes rated to SATA 2 or SATA 3?..
If not, can you jumper the drives down to SATA 2 or even SATA 1 and then retest the partitioning to see if they still "lose their formatting"?

.
 
I've had a few drives go RAW recently, in the middle of copying like you, and it was caused by poor electric connections (using a poorly made Molex doubler). Most of the time, Windows 7 was able to repair the problem at the next reboot, so definitely something linked to the MBR or the MFT (as I don't use GPT and don't have drives bigger than 2TB).

I'm not sure if this has anything to do with your problem.
 
Going RAW has something to do with my problem, and I can replace power connections or whatever, but why on earth would it single out only 3TB drives, and only GPT partitions?

I have an LSI SAS 3442E-R and the driver is currently this

DriverVer=04/26/2010,1.33.01.00

which is a Vista driver, and there is a Win7 one from December, so I can use that one and see. There is a one meter SAS cable to the Norco box and within that box the SATA cables are all stock (like 8 inches or so).

I will try this driver:

DriverVer=12/07/2010,1.34.02.00

But I would really like some affirmation that this is likely to make a difference... like I've said, I can't know what's wrong, I can only assume it's not wrong if it works for a few years or something...
 
It just happened again, this time on my fourth 3TB drive after about a month of use now. I had just synced it, but I've lost a few days of work. This one was in the SAS box as well, but I just don't have the option to put all my 3TB's on SATA, that was the reason I got a server box.

I'll go through the recovery attempts, but I'm really hoping a solution comes out for this chronic issue. It has to be related to my connections, the expander, or controller, but since it happens every few months, I don't know how to test for it.
 
When I said fourth, I meant the fourth one I own, and I believe it's happened to at least 3 different ones, but it's so spread out unfortunately I don't remember. I want to say it's only happened once for each drive. By the way, initialization requires the sata controller in order to use all 3TB, but normal operation is in the SAS box.

I thought about replacing the LSI 3442e-r with a different controller, but it could be a year or another purchase before I know whether or not it was the issue. What do you suggest?
 
Are you plugging the end of the 8087-SAS/SATA cables into a backplane or directly into the drives?
 
haileris - I've used them via SATA sometimes, but because the issue takes so long to replicate, I can never be sure what works. By "cradle" which component is most suspect? The backplane, expander, controller?

ASUS P8P67 > LSI 3442 > HP Expander > Norco 4220 backplanes (5x4)

I used the included cables, the SAS cable is a 1m
 
I'm considering trying out another controller for a while since I'm planning to migrate to two rigs for physical security anyway, but I'm not sure what to get, since none of them list the chip they're based on. I assume I need an LSI SAS2008. Actually, several of these cards appear to be rebrands, but only the Highpoint explicitly states 3tb compliance:


IBM 46M0907 PCI Express 2.0 x8 SAS...
http://www.newegg.com/Product/Product.aspx?Item=N82E16816145017

HighPoint RocketRAID 2721 PCI-Express 2.0 x8...
http://www.newegg.com/Product/Product.aspx?Item=N82E16816115086

LSI SATA/SAS 9212-4i4e 6Gb/s PCI-Express 2.0
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118133

Come to think of it, the fact that while controlled by the 3442 the drives can't be formatted indicates the controller could be insufficient for 3TB drives...

I found something here, but I'm unsure what it means:
http://kb.lsi.com/KnowledgebaseArticle16399.aspx

Are they saying there's an unofficial firmware for large drives?
 
If I read that link correctly mate it looks like you are stuffed.

The sentence below says to me that you need to use SAS drives to use 3TB drives. Whether it works or not is inconsequential, they won't support your setup even if it works (if it works unfortunately)

3Gb/s SATA+SAS HBAs will detect drives with greater than 2 TB in capacity (in other words, the 3Gb/s SATA+SAS HBAs will work with 3 TB SAS disk drives) if Phase 13 or newer IT firmware only (i.e. not IR firmware)(SAS disk drives only, not SATA) is used.

I hope I am wrong
 
The next line seems to provide better context:

6Gb/s SATA+SAS HBAs will detect drives with greater than 2 TB in capacity (in other words, the 6Gb/s SATA+SAS HBAs will work with 3 TB disk drives) if Phase 10 or newer firmware is used.

It seems they want to make sure that people understand their 3Gb/s HBAs only support 3TB SAS drives while their 6Gb/s HBAs don't have that limitation
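The two quoted KB sentences can be written down as a small yes/no rule. This is a sketch of one reading of those sentences only, not an LSI tool or API; the function name, parameter names, and the example firmware phases are all hypothetical.

```python
def detects_over_2tb(link_speed_gbps, firmware_phase, firmware_type, drive_interface):
    """Encode the two quoted KB sentences as a check (a reading, not an LSI tool)."""
    if link_speed_gbps == 6:
        # 6Gb/s HBAs: Phase 10 or newer firmware, SAS or SATA drives.
        return firmware_phase >= 10
    if link_speed_gbps == 3:
        # 3Gb/s HBAs: Phase 13 or newer IT firmware, SAS drives only.
        return (firmware_phase >= 13
                and firmware_type == "IT"
                and drive_interface == "SAS")
    raise ValueError("only 3Gb/s and 6Gb/s HBAs are covered by the quoted sentences")

# Hypothetical examples: phases are illustrative, not read from a real card.
print(detects_over_2tb(3, 13, "IT", "SATA"))  # 3Gb/s HBA with SATA drives
print(detects_over_2tb(3, 13, "IT", "SAS"))   # 3Gb/s HBA with SAS drives
print(detects_over_2tb(6, 7, "IT", "SATA"))   # 6Gb/s HBA on old firmware
print(detects_over_2tb(6, 10, "IT", "SATA"))  # 6Gb/s HBA on Phase 10
```

Under this reading, a 3Gb/s HBA with 3TB SATA drives never qualifies, whatever firmware it runs, and a 6Gb/s HBA only qualifies once its firmware is at Phase 10 or newer.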
 
Right, so if I replace it with a new one, one of those I linked on the egg, I should be fine? I don't give a damn about tossing the 3442, I just want my data secure.

By the way, thanks guys.
 
About a month ago I replaced the SAS controller with this one (new cable as well):

SAS 9212-4I4E Single 4PORT 6GB SATA+SAS PCIE 2.0 512MB

And just now lost another 3TB format. I just realized the firmware might still be 7. something, so I'm flashing to 11.
 
is it really so easy to wipe out a hard drive?

No. I have never ever seen this happen.

Just a flaky cable or something?

That should be handled by the SATA protocol. I mean on a transmission crc error the packet should be retransmitted.

Also the question is why is the partition table / format being overwritten in the first place? This should be written once and never ever written after that.
 
Maybe a write request is being corrupted and writing crap over the beginning of the disk where the partition table lives?
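One way to test that idea after a drive goes RAW is to look at the first two sectors and see whether the partition structures are still there. The sketch below is an assumption-laden illustration: it assumes 512-byte logical sectors and works on a bytes object rather than a real device, and the function name and the synthetic test images are made up. The offsets it checks (boot signature 0x55AA at bytes 510-511 of LBA 0, a 0xEE protective partition type, and the "EFI PART" signature at the start of LBA 1) are the standard MBR/GPT layout.

```python
SECTOR = 512  # assumes 512-byte logical sectors

def inspect_first_sectors(data):
    """Report what the first two sectors (LBA 0 and LBA 1) look like."""
    if len(data) < 2 * SECTOR:
        raise ValueError("need at least the first two sectors")
    lba0, lba1 = data[:SECTOR], data[SECTOR:2 * SECTOR]
    # Partition type byte of each of the four MBR entries
    # (entries start at offset 446, 16 bytes each, type at +4).
    types = [lba0[446 + 16 * i + 4] for i in range(4)]
    return {
        "mbr_boot_signature": lba0[510:512] == b"\x55\xaa",
        "protective_mbr": 0xEE in types,
        "gpt_header_signature": lba1[:8] == b"EFI PART",
        "lba0_all_zero": lba0 == bytes(SECTOR),
    }

# Synthetic example: a healthy-looking GPT layout, then an image whose
# first two sectors have been wiped.
healthy = bytearray(2 * SECTOR)
healthy[450] = 0xEE                        # protective partition type
healthy[510:512] = b"\x55\xaa"             # MBR boot signature
healthy[SECTOR:SECTOR + 8] = b"EFI PART"   # GPT header signature
clobbered = bytes(2 * SECTOR)

print(inspect_first_sectors(bytes(healthy)))
print(inspect_first_sectors(clobbered))
```

Reading the sectors from an actual disk (raw device access, administrator rights) is left out on purpose. If the start of the disk turns out to hold file data instead of these signatures, that would point at misdirected writes rather than a failing drive.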
 
Possibly some software that directly writes to the disks but does not support / understand >2TB drives?
 
heh - I had similar issues with drives going RAW or file copies suddenly erroring out. The cause: Faulty JMicron S539 chip that cannot handle USB 3.0 properly. The drive enclosure (Verbatim 2 TB USB 3.0 external hdd) would suddenly power off at random intervals.

My guess: Power supply issue, or the hdd is faulty in some way (firmware? bad controller board?), or the cable or the hdd connector is bunk.
 
I have had similar issues with a 3TB GPT partition in an eSATA enclosure. Was robocopying ~ a 700GB database over for backups at a dental office, since my firm's primary backup software would take a shit copying that much data at once for some reason. Got about 2 weeks worth of backups alternating on this drive and a second 2TB backup drive before the 3TB went RAW. Replaced it with a 2TB in the same caddy plugged into the same eSATA port and haven't had any issues in the 6 months since...

Still have it sitting here, tests fine in the WD DLG diagnostic utility, but haven't found a place to use it where I won't mind if it explodes again. May try it as a system drive on my main shop PC in place of the current hard drive, so I won't be super set back if it does melt down.
 
I always had the impression that the LSI1068e (and 1064e) based controllers (such as your 3442E-R) do not support 2TB+ hard drives, and thought this was a nice way for LSI to make sure you will buy newer ones based on SAS2008 or better.

Perhaps the fact that you use expander somehow masked this problem but occasionally the controller (with older firmware/BIOS) will make sure who is the master by nuking the MBR.

Or perhaps these intermittent problems were the reason for LSI to disable the 2TB+ support in the newer versions of its firmware/BIOSes once these 2TB+ drives became popular and affordable...
 
I'm hoping that since my SAS2008's firmware is at version 7, I should be able to fix it by flashing to 11. LSI states that only 10 or better supports 3TB, and it's confirmed by the fact that I have to move the drives to my SATA bus just to format the full 3TB in the first place. I just didn't realize this was such an issue, since 99% of the writes seemed to work fine.
 
I have a problem with 3 TB disks in my HP Z800 workstation. About 5 disks have failed and now I am starting to find a pattern.

I format the disks in a USB enclosure as GPT disks (not MBR disks), then mount them in the caddies inside the Z800, connected to the controller reported as "LSI Adapter SAS 3000 series, 8-port with 1068E-StorPort". Initially all the disks seem to work fine. They show available space of 2794.38 GB.

I have run the following scenario multiple times with Western Digital, Hitachi and Seagate disks, all 3TB. I start filling the disks with data. After I reach about 2 TB of data stored on the disk (i.e. about 700 GB free), the disk crashes. I thought the crash was caused by xcopy and switched to robocopy, but the same happens with robocopy. I am not copying anything to the root directory of the disk when the crash happens.

I was also under the impression that some disks were better, as they were still working fine after weeks, but I now realize that these disks never crossed the 2 TB threshold, i.e. they have less than 2 TB filled, and that's probably the reason why they are still OK.

Any suggestions? This has taken me months of bad experience and is driving me crazy.
 
This whole thing sounds a lot like a 2TB limitation. I would strongly suspect that writes beyond the highest 512-byte block addressable with an unsigned 32-bit integer wrap around to block 0, i.e. 4294967296 * 512 bytes = 2 TiB (about 2.2 TB).

You could test this by writing to the 4294967297th block (LBA 4294967296) with dd or a hex editor and seeing if it ends up at block 0.

It would also fit the description of happening only after months, since I guess filling 2TB takes some time. Just a thought for the future.
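The wrap-around arithmetic can be sketched in a few lines. This only models the hypothesis that something in the path keeps just the low 32 bits of the block address; it is not confirmed controller behaviour, and the function name is made up. The 2794.38 figure is the capacity reported earlier in the thread, assumed here to be in binary gigabytes as Windows displays them.

```python
SECTOR_BYTES = 512
LBA_BITS = 32

def truncated_lba(lba, bits=LBA_BITS):
    """LBA as it would arrive if only the low `bits` bits survived."""
    return lba & ((1 << bits) - 1)

# Size of the region addressable with a 32-bit LBA and 512-byte sectors.
limit_bytes = (1 << LBA_BITS) * SECTOR_BYTES
print(limit_bytes)               # 2199023255552 bytes = 2 TiB (about 2.2 TB)

print(truncated_lba(2**32 - 1))  # 4294967295, last block that still addresses correctly
print(truncated_lba(2**32))      # 0, lands on the first sector (MBR / protective MBR)
print(truncated_lba(2**32 + 1))  # 1, lands on the primary GPT header

# Free space left on a 2794.38 "GB" volume once the 2 TiB mark is reached.
print(round(2794.38 - limit_bytes / 2**30, 2))  # 746.38
```

The last number lines up with the "about 700 GB free" report above. The filesystem does not fill the disk strictly front to back, so the point at which a write first crosses the boundary is only approximate.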
 
Yes, the key with my disks is that the 3TB drives never show up as 3TB unless I move them to the SATA controller for formatting. I thought it was OK since the 3TB showed up after it was formatted, but I guess it's very possible that they all died as soon as the 2TB mark was crossed.
 