Tricky data integrity / corruption issue - incredibly frustrating!

commando

I'm getting data corruption when I transfer files in Windows 7 and 10RC. It mostly shows up in large files (20GB+), but it likely happens in smaller files too. I replaced my motherboard with one based on a different chipset (P67 -> B75) as I guessed that was the issue, but the problem remains. This happens regardless of the disk used - I have five disks in the machine, including two SSDs. It happens when the motherboard is in IDE or AHCI modes.

The really vexing thing is when I ran Windows XP from a rescue DVD using generic drivers it worked perfectly, not one bit of corruption. It also worked perfectly in Ubuntu 14.10. It also works fine in Windows 10 safe mode.

I've run memtest x86+ with all tests going on both old and new motherboards for 48 hours plus, not a single error. I've run the Intel processor diagnostics, it reports perfect. I've run Prime95 for a day without error. I ran another processor load test as well, I forget which one, no issues. Component temperatures are all low to moderate - I have good cooling. SMART status of all drives is good - no errors - there are 5 drives including one ReFS / mirrored storage space so I doubt it's a drive issue. I've also plugged only one drive in (Samsung 840, and separately the OWC SSD) and made sure that it wasn't one bad drive corrupting data on the SATA bus.

Based on this I'm fairly confident that it's a driver or software issue of some kind. I let Windows set up the drivers initially on install, and I've done my best to update the drivers to the manufacturer recommended drivers. Most of those drivers are a couple of years old, and for example finding an updated C216 AHCI driver on the Intel website isn't easy. Doing this thoroughly and systematically is my next and final plan.

Can anyone give me any suggestions? Which drivers do you think are most critical to update? AHCI perhaps, but it happens in IDE mode too. Motherboard drivers of course I'll double check.

I'm not far from throwing this machine out a window and ordering a SuperMicro Xeon E3/ECC RAM system.

--- Background info below ---
Testing methods:
- Copy between disks using Teracopy with validation turned on. It does CRC checks. W7, W10.
- H2Testw, which writes data to a drive then reads it. W7, W10, XP. This shows that in a 100GB test I will typically get 3 - 6 bytes gone bad.
- Ubuntu "badblocks" command. It writes data to the drive with various patterns then reads it back.
- Macrium Reflect - I verify images that I know are good on one disk after copying to another disk. It appears to have a CRC type check built in.
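
For anyone wanting to script the copy-and-validate style of test, here is a minimal sketch in Python. It is only an illustration of the idea, not what Teracopy or Macrium do internally: the function names `file_digest` and `copy_and_verify` are made up for this example, and SHA-256 stands in for whatever CRC those tools use. One caveat: the OS file cache may serve the read-back from RAM rather than from the disk, so a pass here is weaker evidence than a failure.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then re-read both files and compare digests.

    Returns True if the digests match, False if the copy differs.
    """
    shutil.copyfile(src, dst)
    return file_digest(src) == file_digest(dst)
```

Calling `copy_and_verify("D:/big.img", "E:/big.img")` (hypothetical paths) on a 20GB+ file would be the rough equivalent of one Teracopy run with validation on.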


I had a thread about this general topic a while back, but I've done a lot of experimentation and research so I'm starting this as a new topic.


Computer specification:
- GIGABYTE GA-B75M-D3H LGA 1155 Intel B75 rev 1.2 (brand new)
- Previous motherboard Gigabyte GA-P67A-UD3R-B3 Motherboard, Socket 1155, Intel P67 motherboard
- Intel Core i7-2600K 3.4 GHz, Socket 1155 (running at stock speed)
- Gigabyte GV-N520OC-1GI, GeForce GT 520 Video Card, 1024MB
- Corsair Vengeance CML16GX3M4A1600C9, 4x4GB, DDR3-1600
- Antec High Current Gamer HCG-520
- Noctua CPU cooler - I forget the model but it has two 120 or 140mm fans
- Seagate 1TB x 2, NTFS
- HGST 4TB x 2, ReFS with mirror storage space
- Samsung 840 pro 120GB running Windows 10 technical preview
- OWC SSD 120GB
 
check smart on the drives for errors

run memtest, its probably bad RAM

Thanks for your thoughts. As mentioned above memtest x86+ and windows memory testing have both been run for 48h+, not a single error reported. Drive SMART status is clean - given the problem is affecting all five drives that's not likely anyway. I've tried with just one drive plugged in too (Samsung SSD by itself first, then OWC SSD by itself), in case one drive was corrupting data on the bus.
 
What components are the same between the motherboard swap? How about that video card?

Everything other than the motherboard - RAM, CPU, video card, and operating system. W10 wasn't reinstalled, but W7 is a fresh install and is displaying the same issues. The drivers all come from the same place of course. I could take the video card out and use the built in video.

However since XP and Ubuntu work 100% fine on the same hardware I strongly suspect software, specifically drivers.
 
(I told you it wasn't a chipset issue...)

Seriously, I agree that your hardware tests would've caught any CPU/RAM/mobo defects.

Chances that it's related to PSU are slim with alternate-OS testing.

Okay, it's software.

You're talking about local transfers only, correct? This will rule out any CIFS/SMB/NIC issues.

If yes, how do non-Win10 OSes see the ReFS volume?
 
Ok, you were right, but it was easy to try! Agree it's unlikely to be power.

Yes local transfers only. Only W10 can see the ReFS disk, however I have two SSDs and two HDD that are NTFS and the errors happen between them. I don't think it's file system based. Happy to run any tests you suggest that would test this.

I see occasional errors in the logs saying ReFS/SS have corrected an error. If I transfer 1TB in a day I might see zero, one, or five - they are relatively rare.
 
I would try that. I doubt software is the cause, except in the sense that different I/O patterns either cause or avoid the hardware problem.

I doubt it's the video card, but it'll only take me 20 minutes to try it. I'll do it tomorrow, my wife is using the computer right now.
 
The only question I have since you mentioned XP is... are you running 64bit OS? XP wouldn't be accessing any memory above 3.5Gig most likely. I'd try running 2 ram sticks at a time instead of 4 and rerun the drive copy tests. There's some latency adjustments that usually have to be made when running 4 sticks.
 
Disable virtualization settings in the BIOS and try again.

If it works with all the VT-x settings off, then you have an incompatibility between the OS and your hardware.
 
I've run memtest x86+ with all tests going on both old and new motherboards for 48 hours plus, not a single error.

Memtest86+ has not been maintained for over two years now, and is known to have problems. The original Memtest is being supported by Passmark.

There is also HCI Memtest. See this HardOCP thread for comments about running multiple instances of HCI Memtest to catch errors that Memtest86+ doesn't catch at all.

I don't think you're done testing RAM yet. It's such a strong suspect and Memtest86+ was not the best choice for RAM testing.
 
Thanks for all your ideas, you've taken me from thinking it's hopeless to giving me ideas.

Based on your ideas I've done the following:
- Turned RAM from "turbo" to "normal" in the BIOS
- Removed the video card and used the built in graphics
- Removed all but one stick of RAM

I have some generally positive news: in that single stick of RAM configuration I don't get any disk errors :) Unfortunately when I put all four sticks of RAM back in I do get disk errors. I'm running the memory test right now, so that should give more information, but I'm feeling reasonably positive :) Current RAM timing is 9-9-9-24 - I haven't looked at this in a while so not sure what it means, other than I know it's generally about latency.

So now I have to track down which of the fixes actually worked. Plan:
- Add all RAM back, retest disk (done - fail).
- Run the HCI Memtest (underway)
- Try changing RAM speed somehow (suggestions greatly appreciated)
- Add video card back, retest disk


I'll report back in 12 hours or so with results of the ram test recommended above.
 
HCI Memtest has found bad RAM - evilsofa was bang on. I think I owe you a beer.

Their FAQ suggests first seeing if you can use more conservative RAM timings, then testing each DIMM separately to work out which one it is. Of course it can only test free memory, not memory in use by the OS, so an isolation test could take some time.

My RAM timing is set to auto, 9-9-9-24, I've uploaded a gallery showing the BIOS screens/timing here. Can anyone suggest if I should try changing it to something else before I do the isolation test? I understand the physics behind RAM and what the numbers mean but I have little practical experience.
 
It's unlikely to make a difference, either find out which stick it is or buy new ram. Or both.
 
Ok then, I'll do that. Can I happily run with 12GB? I know I won't get the whole dual channel RAM thing working with 12GB, but I'm not too bothered, I don't do anything super performance critical.

I know it's not ideal, but if I wanted to could I buy another 4GB stick, or 2x4GB to get me back to 16GB? Probably not worth it all in all, not really wanting to spend anything on this PC.
 
Well I've spent the entire day running memory tests with HCI Memtest. The first test above showed that there was a memory problem. I've run a series of tests today (to 100%) and I couldn't reproduce the problem with HCI but did with other testing software. It's quite discouraging. I do think RAM is the most likely cause, but I'm not quite sure where to go from here. It's not worth spending money on RAM - I'd build a whole new machine first.

When I ran the tests I closed everything possible, disabled services, etc, in order to maximise the RAM available to test.

Tests run:
- All 4 DIMMS in original configuration (failed HCI memtest)
- Each DIMM individually in slot 1 (passed HCI memtest)
- DIMMS 1 & 2 in slots 1 & 3 - ie dual channel (passed HCI memtest)
- DIMMS 3 & 4 in slots 1 & 3 - ie dual channel (passed HCI memtest)
- DIMMS 3 & 4 in slots 3 & 1 - ie dual channel reversed (passed HCI memtest)
- Each DIMM in a random slot (passed HCI memtest, failed h2testw, failed a 50GB teracopy copy/validate test)
- (I didn't do the obvious DIMMS 1 & 2 in slots 3 & 1 yet)
- (I also didn't try the other DIMM slots)

I guess I should drop the HCI memtest and just use Teracopy and H2testw to test. Seems strange that memtest passes now and the file copy test still fails.
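
A write-then-read-back check of the kind H2testw performs can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not how H2testw actually works: the names `write_pattern` and `count_bad_bytes`, the seed, and the chunk size are all invented for the example. The data comes from a seeded pseudorandom generator, so the same stream can be regenerated for comparison without keeping a second copy in memory.

```python
import os
import random


def _next_chunk(rng, n):
    """Return n pseudorandom bytes from the seeded generator."""
    return rng.getrandbits(8 * n).to_bytes(n, "little")


def write_pattern(path, total_bytes, seed=1234, chunk_size=1024 * 1024):
    """Write total_bytes of reproducible pseudorandom data to path."""
    rng = random.Random(seed)
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(_next_chunk(rng, n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def count_bad_bytes(path, total_bytes, seed=1234, chunk_size=1024 * 1024):
    """Re-read path and count bytes that differ from the expected stream."""
    rng = random.Random(seed)
    bad = 0
    remaining = total_bytes
    with open(path, "rb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            expected = _next_chunk(rng, n)
            actual = f.read(n)
            if actual != expected:
                bad += sum(a != b for a, b in zip(actual, expected))
                bad += len(expected) - len(actual)  # short read counts as bad
            remaining -= n
    return bad
```

At the rate reported earlier in the thread (3 - 6 bad bytes per 100GB), that is roughly one bad byte per 17 to 33GB, so a test file would need to be tens of gigabytes before a clean result means much.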

I'm not willing to invest too much more time in this - I have productive work not being done because of this. I'm really tempted to chuck the lot out and buy a new one. Suggestions very welcome.
 
Since you only got the HCI Memtest failure with all four slots full, do you have the PSU plugged into all eight pins of the ATX_12v_2x4 socket? On the picture of the motherboard, that's the one in the upper right corner. Errors can happen if the RAM is not supplied with enough power, and while that socket mainly powers the CPU, it may also power other components on the motherboard, including the RAM.

Do you have the BIOS on the motherboard updated to the latest version? The changelog for the latest version F15 is "Enhanced SATA compatibility" which is interesting with respect to your issue.

You should also make sure you have the latest Intel Chipset Device Driver, which is up to 10.0.26 now, and the latest Intel RST driver if you are using that.
 
Yes the power appears to be plugged in square and well. I've given it a bit of a jiggle and another push in. BIOS is on the latest version. Intel RST was installed, previous version, it wouldn't update so I uninstalled it.

I've downloaded and run the Intel Chipset Device Driver, but I'm not sure it updated things. Auto update thinks it's up to date as well. You can see a screenshot here.

I ran HCI memtest last night with all 4 RAM sticks in. This morning I had two errors pop up, all memory tests were at around 350% by that time. I did my previous testing to 100% only. I think I may need to do overnight testing to find this issue. I guess I can put a single RAM chip in and run the tests overnight over the next week or so. This morning I have two sticks of RAM in, I've done copy tests and disk read/write tests and everything has passed. I'm not seeing much point of doing HCI tests unless they run overnight, but on two sticks of RAM it seems ok so far.

In the background I'm going to specify myself up a workstation with a Xeon processor and ECC RAM. If I don't sort this by next weekend I'm ordering it.

EvilSofa you're the only thing giving me hope!
 
Back in the DDR2 days, a lot of motherboards had trouble if you filled all four RAM slots. A solution in some cases was to either relax the timings or boost the RAM voltage a bit.

When I googled for the RAM you have and "boost voltage", I got this interesting thread in which Corsair support suggested increasing RAM voltage to 1.55 or 1.6 when using four sticks.
 
Interesting, I'll give that a shot with an 8h memory test tonight :) Fortunately the case has pretty good cooling, the RAM is warmish but not hot right now.
 
Interesting results. I've confirmed that any two sticks of RAM in any two slots works perfectly with RAM at stock 1.5V, with HCI memtest and copy tests. I've also confirmed that at 1.58V it works perfectly with all four sticks of RAM, passing HCI and copy tests - so I guess that's the answer then! The RAM reaches 61 degrees instead of 50 degrees so I may consider RAM cooling.

Should a PC really need to be over volted to work properly? I'm not sure I trust this machine any more, even passing all the tests around.

Would there be value trying looser timings? eg going from 9-9-9-24 to 11-11-11-24? Probably no harm giving it a go.

Thanks especially to EvilSofa, you've been super super helpful :)
 
I've had an ESXi host act up with random write errors, erratic performance and various random issues. After a while, I tested the memory, as it worsened and started giving faulty RAM related issues. I tested the memory with Memtest86, it gave immediate errors with both sticks, but no errors with any single stick. I sent it back for RMA and got a new set, I've had no problems ever since.

If it isn't your RAM, I would also suggest looking at SATA cables, the way they are guided through the case. I've had 3 premium SATA cables give me disk errors due to running on top of each other for a few feet.
 
The thing is MemTest x86 reports everything 100% fine, it's only HCI memtest that finds errors from within Windows, and then only when four sticks of RAM are in use. Two sticks is fine for anything. My RAM is almost four years old, I'll contact Corsair but I'm not that hopeful of a resolution.

I've tried new SATA cables, including only plugging one drive in at a time.
 
Corsair replaced the RAM, which is a bit over two years old, with new RAM that has the same model number. Great service from them I have to say. A couple of hours of testing that easily reproduces the problem with the old RAM has found zero errors. So my conclusion is it's the RAM.

I find it very strange that the old RAM works in pairs but not two pairs, and that memtest x86 doesn't find any problems testing either individually, in pairs, or in fours. HCI Memtest did find errors, but only at 300% - ie after testing the memory three times. The way I found the errors is to use H2TestW and Teracopy in verify mode.

So the moral of the story is: if you get data corruption, replace the RAM, even if it tests fine.

Thanks to everyone here who helped, you guys were invaluable and stopped me throwing the stupid machine out a window!
 