New drive badblocks.

drescherjm

I have a new 2TB F4 (I have not tested the other 2 F4s yet). I decided to run a 4-pass write-mode badblocks test on it (over 2 days of testing), and that resulted in 736 bad blocks. However, there were no errors recorded in SMART and no errors recorded in my dmesg.

Here is the output:

jmd0 ~ # badblocks -svw /dev/sdf -o S2HGJ1BZ836643.txt
Checking for bad blocks in read-write mode
From block 0 to 1953514583
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 736 bad blocks found.
jmd0 ~ # smartctl --all /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD204UI
Serial Number: S2HGJ1BZ836643
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Oct 4 19:47:42 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21060) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   068   068   025    Pre-fail  Always       -       9724
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       56
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
181 Program_Fail_Cnt_Total  0x0022   252   252   000    Old_age   Always       -       0
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       28 (Lifetime Min/Max 22/36)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
The machine is an Intel Core 2 Quad Q9550 running at 3.1 GHz instead of 2.83 GHz. Yes, I know overclocking can cause this. However, the system has been rock stable (24/7/365) at this overclock (not a single kernel panic ...) for nearly 2 years, since I purchased it in November of 2008. I guess I should test the drive on my i7 box. The interesting thing is that the entire first pass returned no errors at all.

Edit: I do have the badblocks list that I can post somewhere, but I will refrain from copying it here since it's too long.
 
You did not post the list of badblocks. Were they distributed throughout the disk, or were most of them clustering in certain locations?

Did you try setting the block size to 4096 (-b 4096)? I don't know if it makes any difference in the time to complete the test, but the default is 1024 on my copy.

If you wrap code tags around the badblocks list, it should be okay to post.
 
Thanks. I forgot about code tags..

Code:
jmd0 ~ # cat S2HGJ1BZ836643.txt 
386596256
386596257
386596258
386596259
386596260
386596261
386596262
386596263
386596264
386596265
386596266
386596267
386596268
386596269
386596270
386596271
386596272
386596273
386596274
386596275
386596276
386596277
386596278
386596279
386596280
386596281
386596282
386596283
386596284
386596285
386596286
386596287
805674976
805674977
805674978
805674979
805674980
805674981
805674982
805674983
805674984
805674985
805674986
805674987
805674988
805674989
805674990
805674991
805674992
805674993
805674994
805674995
805674996
805674997
805674998
805674999
805675000
805675001
805675002
805675003
805675004
805675005
805675006
805675007
1370054432
1370054433
1370054434
1370054435
1370054436
1370054437
1370054438
1370054439
1370054440
1370054441
1370054442
1370054443
1370054444
1370054445
1370054446
1370054447
1370054448
1370054449
1370054450
1370054451
1370054452
1370054453
1370054454
1370054455
1370054456
1370054457
1370054458
1370054459
1370054460
1370054461
1370054462
1370054463
1475864160
1475864161
1475864162
1475864163
1475864164
1475864165
1475864166
1475864167
1475864168
1475864169
1475864170
1475864171
1475864172
1475864173
1475864174
1475864175
1475864176
1475864177
1475864178
1475864179
1475864180
1475864181
1475864182
1475864183
1475864184
1475864185
1475864186
1475864187
1475864188
1475864189
1475864190
1475864191
1476597664
1476597665
1476597666
1476597667
1476597668
1476597669
1476597670
1476597671
1476597672
1476597673
1476597674
1476597675
1476597676
1476597677
1476597678
1476597679
1476597680
1476597681
1476597682
1476597683
1476597684
1476597685
1476597686
1476597687
1476597688
1476597689
1476597690
1476597691
1476597692
1476597693
1476597694
1476597695
1535326368
1535326369
1535326370
1535326371
1535326372
1535326373
1535326374
1535326375
1535326376
1535326377
1535326378
1535326379
1535326380
1535326381
1535326382
1535326383
1535326384
1535326385
1535326386
1535326387
1535326388
1535326389
1535326390
1535326391
1535326392
1535326393
1535326394
1535326395
1535326396
1535326397
1535326398
1535326399
1687842016
1687842017
1687842018
1687842019
1687842020
1687842021
1687842022
1687842023
1687842024
1687842025
1687842026
1687842027
1687842028
1687842029
1687842030
1687842031
1687842032
1687842033
1687842034
1687842035
1687842036
1687842037
1687842038
1687842039
1687842040
1687842041
1687842042
1687842043
1687842044
1687842045
1687842046
1687842047
1826552736
1826552737
1826552738
1826552739
1826552740
1826552741
1826552742
1826552743
1826552744
1826552745
1826552746
1826552747
1826552748
1826552749
1826552750
1826552751
1826552752
1826552753
1826552754
1826552755
1826552756
1826552757
1826552758
1826552759
1826552760
1826552761
1826552762
1826552763
1826552764
1826552765
1826552766
1826552767
79448416
79448417
79448418
79448419
79448420
79448421
79448422
79448423
79448424
79448425
79448426
79448427
79448428
79448429
79448430
79448431
79448432
79448433
79448434
79448435
79448436
79448437
79448438
79448439
79448440
79448441
79448442
79448443
79448444
79448445
79448446
79448447
931251232
931251233
931251234
931251235
931251236
931251237
931251238
931251239
931251240
931251241
931251242
931251243
931251244
931251245
931251246
931251247
931251248
931251249
931251250
931251251
931251252
931251253
931251254
931251255
931251256
931251257
931251258
931251259
931251260
931251261
931251262
931251263
1124616032
1124616033
1124616034
1124616035
1124616036
1124616037
1124616038
1124616039
1124616040
1124616041
1124616042
1124616043
1124616044
1124616045
1124616046
1124616047
1124616048
1124616049
1124616050
1124616051
1124616052
1124616053
1124616054
1124616055
1124616056
1124616057
1124616058
1124616059
1124616060
1124616061
1124616062
1124616063
1305464160
1305464161
1305464162
1305464163
1305464164
1305464165
1305464166
1305464167
1305464168
1305464169
1305464170
1305464171
1305464172
1305464173
1305464174
1305464175
1305464176
1305464177
1305464178
1305464179
1305464180
1305464181
1305464182
1305464183
1305464184
1305464185
1305464186
1305464187
1305464188
1305464189
1305464190
1305464191
1475607008
1475607009
1475607010
1475607011
1475607012
1475607013
1475607014
1475607015
1475607016
1475607017
1475607018
1475607019
1475607020
1475607021
1475607022
1475607023
1475607024
1475607025
1475607026
1475607027
1475607028
1475607029
1475607030
1475607031
1475607032
1475607033
1475607034
1475607035
1475607036
1475607037
1475607038
1475607039
225556384
225556385
225556386
225556387
225556388
225556389
225556390
225556391
225556392
225556393
225556394
225556395
225556396
225556397
225556398
225556399
225556400
225556401
225556402
225556403
225556404
225556405
225556406
225556407
225556408
225556409
225556410
225556411
225556412
225556413
225556414
225556415
444989792
444989793
444989794
444989795
444989796
444989797
444989798
444989799
444989800
444989801
444989802
444989803
444989804
444989805
444989806
444989807
444989808
444989809
444989810
444989811
444989812
444989813
444989814
444989815
444989816
444989817
444989818
444989819
444989820
444989821
444989822
444989823
655023648
655023649
655023650
655023651
655023652
655023653
655023654
655023655
655023656
655023657
655023658
655023659
655023660
655023661
655023662
655023663
655023664
655023665
655023666
655023667
655023668
655023669
655023670
655023671
655023672
655023673
655023674
655023675
655023676
655023677
655023678
655023679
860067424
860067425
860067426
860067427
860067428
860067429
860067430
860067431
860067432
860067433
860067434
860067435
860067436
860067437
860067438
860067439
860067440
860067441
860067442
860067443
860067444
860067445
860067446
860067447
860067448
860067449
860067450
860067451
860067452
860067453
860067454
860067455
1056818592
1056818593
1056818594
1056818595
1056818596
1056818597
1056818598
1056818599
1056818600
1056818601
1056818602
1056818603
1056818604
1056818605
1056818606
1056818607
1056818608
1056818609
1056818610
1056818611
1056818612
1056818613
1056818614
1056818615
1056818616
1056818617
1056818618
1056818619
1056818620
1056818621
1056818622
1056818623
1241773280
1241773281
1241773282
1241773283
1241773284
1241773285
1241773286
1241773287
1241773288
1241773289
1241773290
1241773291
1241773292
1241773293
1241773294
1241773295
1241773296
1241773297
1241773298
1241773299
1241773300
1241773301
1241773302
1241773303
1241773304
1241773305
1241773306
1241773307
1241773308
1241773309
1241773310
1241773311
1415493984
1415493985
1415493986
1415493987
1415493988
1415493989
1415493990
1415493991
1415493992
1415493993
1415493994
1415493995
1415493996
1415493997
1415493998
1415493999
1415494000
1415494001
1415494002
1415494003
1415494004
1415494005
1415494006
1415494007
1415494008
1415494009
1415494010
1415494011
1415494012
1415494013
1415494014
1415494015
1647004384
1647004385
1647004386
1647004387
1647004388
1647004389
1647004390
1647004391
1647004392
1647004393
1647004394
1647004395
1647004396
1647004397
1647004398
1647004399
1647004400
1647004401
1647004402
1647004403
1647004404
1647004405
1647004406
1647004407
1647004408
1647004409
1647004410
1647004411
1647004412
1647004413
1647004414
1647004415
1726351584
1726351585
1726351586
1726351587
1726351588
1726351589
1726351590
1726351591
1726351592
1726351593
1726351594
1726351595
1726351596
1726351597
1726351598
1726351599
1726351600
1726351601
1726351602
1726351603
1726351604
1726351605
1726351606
1726351607
1726351608
1726351609
1726351610
1726351611
1726351612
1726351613
1726351614
1726351615
1862898656
1862898657
1862898658
1862898659
1862898660
1862898661
1862898662
1862898663
1862898664
1862898665
1862898666
1862898667
1862898668
1862898669
1862898670
1862898671
1862898672
1862898673
1862898674
1862898675
1862898676
1862898677
1862898678
1862898679
1862898680
1862898681
1862898682
1862898683
1862898684
1862898685
1862898686
1862898687
jmd0 ~ #

Edit:
Here is the pastebin link I was in the process of creating when I read the last post.

http://pastebin.com/QPURQ6Au
 
That is interesting data. Did you spot the pattern? There are 23 chunks of 32 consecutive bad blocks. You apparently used the default 1024B block size, so the chunks are 32KB in size. I'm not sure why your bad blocks come in 32KB chunks. My copy of badblocks defaults to 64 blocks at a time (1024B blocks), so unless yours did 32 at a time, that should not be the reason.

32KB is 64 512B sectors, but the man page for badblocks says it defaults to 64 "blocks which are tested at a time", not sectors.

The chunks seem to be uniformly randomly distributed over the disk, with an average of 4.15% of capacity between chunks of bad blocks.

I do not understand the reason for the pattern. You might re-run the scan with -b 4096 -c 32, just to see what happens.
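
For anyone who wants to check the "23 chunks of 32" observation without counting by hand, here is a minimal sketch in Python. It assumes the badblocks output file holds one block number per line, in the order badblocks wrote them; the function name group_runs is made up for this example and is not part of any tool.

```python
from typing import Iterable, List, Tuple


def group_runs(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse block numbers (in file order) into (start, length) runs.

    A run is extended only when the next number is exactly one more than
    the last block of the current run; anything else starts a new run.
    """
    runs: List[Tuple[int, int]] = []
    for block in blocks:
        if runs and block == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((block, 1))
    return runs


# Sample input: only the first two regions from the posted list.
sample = list(range(386596256, 386596288)) + list(range(805674976, 805675008))

for start, length in group_runs(sample):
    # With the default 1024-byte block size, length * 1024 is the run size in bytes.
    print(start, length, length * 1024)
```

To run it over the whole list, something like group_runs(int(line) for line in open("S2HGJ1BZ836643.txt")) should work, again assuming one number per line.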
 
Thanks. I did not look at it closely enough to see the pattern. I will have to look further into this. I can't explain any of this..
 
BTW, the man page for my badblocks version (E2fsprogs version 1.41.12) says

-b block-size
Specify the size of blocks in bytes. The default is 1024.

-c number of blocks
is the number of blocks which are tested at a time. The default is 64.
 
I looked at the data and it is corrupted:

Code:
jmd0 ~ # dd if=/dev/sdf bs=1024 skip=1862898656 count=33 | hexdump -C
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00008000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008400
33+0 records in
33+0 records out
33792 bytes (34 kB) copied, 0.000801165 s, 42.2 MB/s

At the 4th pass all bytes on the disk should be 0. FF was the previous pattern. This looks really weird. Kernel bug?

Edit: the same goes for the previous region.
Code:
jmd0 ~ # dd if=/dev/sdf bs=1024 skip=1726351584 count=33 | hexdump -C
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00008000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00008400
33+0 records in
33+0 records out
33792 bytes (34 kB) copied, 0.0218385 s, 1.5 MB/s

For those who cannot do hexadecimal math in their heads (I am a programmer and a sysadmin) or cannot follow Linux commands: I did a hex dump of 33 blocks (of 1024 bytes each) instead of the 32 that badblocks reported as bad, to see what the data looked like both in the blocks reported as bad and in the block immediately following them. hexdump condenses its output so that consecutive lines with the same contents are replaced by a single *. 0x8000 = 32K. The two dumps were taken from the final two regions that badblocks listed. I selected these because I know that at that point all bytes on the disk are supposed to be zero.
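
The offsets in those dumps can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; dd_window is a made-up helper name, and the block size is the bs=1024 value from the dd commands in this post.

```python
BLOCK_SIZE = 1024  # the bs= value passed to dd


def dd_window(skip_blocks: int, count: int, block_size: int = BLOCK_SIZE):
    """Return (first_byte_on_disk, length_in_bytes) for dd skip=/count=."""
    return skip_blocks * block_size, count * block_size


first_byte, length = dd_window(1862898656, 33)

# Offset inside the dump where the 33rd block (the first "good" one) begins.
print(hex(32 * BLOCK_SIZE))
# Total length of the dump, which is the last offset hexdump prints.
print(length, hex(length))
```

So the ff bytes fill exactly the 32 blocks that badblocks flagged, and the zeros begin with the first block after them.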
 
The blocks that were listed as bad seemed random. I'm not sure how a kernel bug could do that. I'm not saying whether it is a kernel bug, just that I cannot imagine what sort of bug would result in what you saw.

Did you try running badblocks again with -b 4096 -c 32, or something else, just to see if the pattern holds or changes?
 
Not yet; each pass takes 10 hours. And the last time there were no bad blocks on the first pass.

Also I have a second identical drive with nothing on it to test as well.
 
Started

Code:
jmd0 ~ # badblocks -svw -c 32 -b 4096 /dev/sdf -o S2HGJ1BZ836643_test2.txt
Checking for bad blocks in read-write mode
From block 0 to 488378645

iotop shows initial disk writes of 117 to 120MB/s
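
As a sanity check on the block ranges badblocks prints, here is a small Python sketch. It takes the "User Capacity" value from the smartctl output earlier in the thread and assumes badblocks simply divides the device size by the block size; last_block is a name invented for this example.

```python
CAPACITY_BYTES = 2_000_398_934_016  # "User Capacity" reported by smartctl


def last_block(capacity_bytes: int, block_size: int) -> int:
    """Highest block number when the device is cut into block_size pieces."""
    return capacity_bytes // block_size - 1


print(last_block(CAPACITY_BYTES, 1024))  # default block size (first run)
print(last_block(CAPACITY_BYTES, 4096))  # -b 4096 (this run)
```

Both values agree with the "From block 0 to ..." lines of the two runs.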
 
New bad blocks:
Code:
22320280
22320281
22320282
22320283
22320284
22320285
22320286
22320287
132451576
132451577
132451578
132451579
132451580
132451581
132451582
132451583
184302392
184302393
184302394
184302395
184302396
184302397
184302398
184302399
234629240
234629241
234629242
234629243
234629244
234629245
234629246
234629247
282862744
282862745
282862746
282862747
282862748
282862749
282862750
282862751
327766616
327766617
327766618
327766619
327766620
327766621
327766622
327766623

So the bad regions are still 32K (8 blocks of 4096 bytes). It's like the kernel is randomly not flushing back a 32K block for some reason. The last pattern was 00 and the new pattern is aa.
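
To compare the two scans in bytes rather than in block numbers, a rough Python sketch follows. The helper name run_info is made up. The check of whether a region starts on a 32 KiB boundary is an extra that was not done elsewhere in this thread, and it only looks at the first region of each scan.

```python
REGION_BYTES = 32 * 1024


def run_info(first_block: int, run_length: int, block_size: int):
    """Return (run size in bytes, start offset modulo 32 KiB) for one run."""
    size = run_length * block_size
    remainder = (first_block * block_size) % REGION_BYTES
    return size, remainder


# First region of the 1024-byte scan (32 consecutive blocks).
print(run_info(386596256, 32, 1024))
# First region of the 4096-byte scan (8 consecutive blocks).
print(run_info(22320280, 8, 4096))
```

A remainder of 0 means the region begins exactly on a 32 KiB boundary of the device.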

Code:
jmd0 ~ #  dd if=/dev/sdf bs=4096 skip=22320280 count=9 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
9+0 records in
9+0 records out
*
36864 bytes (37 kB) copied, 5.2471e-05 s, 703 MB/s
00008000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
*
00009000
 
That is really strange. Note that there are only 6 regions this time, and not in the same location as the previous 23 regions. So I agree with you that it looks like some sort of bug -- either in badblocks, or in the kernel. What distro and kernel are you using?
 
Note that there are only 6 regions this time,
The reading was not 100% finished when I took the result.

What distro and kernel are you using?
This is 64-bit Gentoo. But it's not a common kernel, because I need OpenVZ on this machine. I am using the latest openvz-2.6.32.9.1.

For gentoo the package is:
sys-kernel/openvz-sources-2.6.32.9.1

This kernel was updated last week to the latest 2.6.32 in the mainline kernel. I could back off of that to the previous kernel. Still 2.6.32 based (I have been running 2.6.32 for ~8 months).

http://git.openvz.org/?p=linux-2.6.32-openvz;a=summary

Code:
jmd0 ~ # uname -a
Linux jmd0.comcast.net 2.6.32-openvz-dyomin.1 #1 SMP Wed Sep 22 21:45:36 EDT 2010 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux

Since this is a production system, I am pretty much stuck with this kernel until I move the subversion / cvs server over to the in-kernel lxc containers. I am waiting for some more testing of lxc before I make that switch, since I (and others) use the svn server daily. On top of that, I cannot safely downgrade the kernel to a lower main version for two reasons: the box uses ext4, and it is also my main HTPC backend, so any downtime causes me difficulty.
 
That will be hard to debug. If it were me, I'd probably start by finding a forum / bugtracker web site for badblocks and posting the issue there. The man page for badblocks on my system references http://e2fsprogs.sourceforge.net/

Theodore Ts'o is apparently the maintainer of badblocks, which is good if you can get his attention, since you could not ask for a person more knowledgeable about linux and disks.
 
BTW. Thanks a lot. I really appreciate your help on this.

I posted this same question in gentoo and no one has bitten yet. I can take it to badblocks and possibly lkml (linux kernel mailing list). I may also try other drives in that machine and also put this drive in a different machine.

Theodore Ts'o is apparently the maintainer of badblocks, which is good if you can get his attention, since you could not ask for a person more knowledgeable about linux and disks

Agreed.
 
Hi,
I'm seeing a similar problem. Your post is the only thing I can find that
resembles what I see. I believe it is not a bug in Linux or badblocks, but a
problem with the drive.

hdd:
Code:
=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD204UI
Serial Number:    S2H7JD2ZA14578
Firmware Version: 1AQ10001

Tested in 2 different systems: Asus M2A-VM with Athlon X2 4450e and an Intel
DH55TC with Core i5-750. No overclocking, default BIOS settings. The Asus is
running an Ubuntu 2.6.35-23 kernel and the Intel runs a vanilla 2.6.36.1.

Partition table:
Code:
# sfdisk -dl /dev/sdb
# partition table of /dev/sdb
unit: sectors

/dev/sdb1 : start=        1, size=    32129, Id=83
/dev/sdb2 : start=    32130, size= 58589054, Id=83
/dev/sdb3 : start= 58621192, size=  3903856, Id=82
/dev/sdb4 : start= 62525048, size=3844499016, Id=83

Note that /dev/sdb2 is not aligned on 4kb (32130 % 8 = 2)
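
The same modulo check can be run over the whole partition table. This is a minimal Python sketch that assumes 512-byte logical sectors and 4 KB physical sectors, so an aligned start is any LBA divisible by 8; the start values are copied from the sfdisk output above and the function name is made up.

```python
SECTORS_PER_4K = 8  # 4096-byte physical sector / 512-byte logical sector

# Start LBAs copied from the sfdisk dump.
partition_starts = {
    "/dev/sdb1": 1,
    "/dev/sdb2": 32130,
    "/dev/sdb3": 58621192,
    "/dev/sdb4": 62525048,
}


def misalignment(start_lba: int) -> int:
    """Number of 512-byte sectors past the previous 4 KB boundary (0 = aligned)."""
    return start_lba % SECTORS_PER_4K


for name, start in partition_starts.items():
    off = misalignment(start)
    print(name, start, "aligned" if off == 0 else f"off by {off} sectors")
```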

I first became aware of the problem when I did an md5sum check on some files
that I copied. I couldn't reproduce it with badblocks, but I can with this
script which writes and compares data:

Code:
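#!/usr/bin/perl
# Endless write/verify loop: fill the device with one byte pattern,
# read everything back, and stop at the first block that differs.
use Fcntl;
my $dev = shift @ARGV or die "usage: $0 <device>\n";
for(;;) {
   for my $pattern (0, 0x55, 0xff, 0xaa) {
      # One 4096-byte block filled with the current pattern byte.
      my $buf = pack("C", $pattern) x 4096;
      my $blocks = 0;
      sysopen DEV, $dev, O_WRONLY|O_EXCL or die "sysopen $dev: $!\n";
      printf STDERR "writing pattern 0x%02x... ", $pattern;
      # Keep writing until print fails, i.e. the end of the device is reached.
      while(print DEV $buf) { $blocks++; }
      close DEV;
      printf STDERR "%d blocks\n", $blocks;

      sysopen DEV, $dev, O_RDONLY|O_EXCL or die "sysopen $dev: $!\n";
      printf STDERR "comparing pattern 0x%02x... ", $pattern;
      # Read back the same number of blocks and compare each with the pattern.
      for(my $i = 0; $i < $blocks; $i++) {
         read(DEV, $_, 4096) == 4096 or die "read $dev $i: $!\n";
         $_ eq $buf or die "error at block $i\n";
      }
      close DEV;
      printf STDERR "ok\n";
   }
}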

result:
Code:
# ./mybadblocks /dev/sdb2
writing pattern 0x00... 7323631 blocks
comparing pattern 0x00... ok
writing pattern 0x55... 7323631 blocks
comparing pattern 0x55... error at block 6279263

# perl -e 'seek STDIN, 6279263*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
00000c00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00006000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
^C

21 kbyte of data was wrong, at absolute LBA (6279263*4096 + 0xc00 + 32130 *
512) / 512 = 50266240. This is multiple of 8, i.e., the start of a 4kb
sector.
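
Spelled out in Python, the calculation looks roughly like this. It is a sketch of the arithmetic only; absolute_lba is a made-up helper, and the inputs are the failing block number, the offsets from the hex dump and the partition start given above.

```python
SECTOR = 512  # logical sector size in bytes


def absolute_lba(block_index: int, block_size: int,
                 offset_in_dump: int, partition_start_lba: int) -> int:
    """Translate an offset inside a partition into an LBA on the whole disk."""
    byte_offset = block_index * block_size + offset_in_dump
    return byte_offset // SECTOR + partition_start_lba


lba = absolute_lba(6279263, 4096, 0xC00, 32130)
bad_bytes = 0x6000 - 0xC00  # from the first wrong byte to the first good one

print(lba, lba % 8)            # remainder 0 means start of a 4 KB sector
print(bad_bytes, bad_bytes // 1024)
```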

I then moved partition 2 to start 1 sector later, at LBA 32131, and ran the test again.


result:
Code:
# ./mybadblocks /dev/sdb2
writing pattern 0x00... 7323631 blocks
comparing pattern 0x00... ok
writing pattern 0x55... 7323631 blocks
comparing pattern 0x55... ok
writing pattern 0xff... 7323631 blocks
comparing pattern 0xff... error at block 471799

# perl -e 'seek STDIN, 471799*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00000a00  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
00004000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
^C

13.5 kbyte of data (27 sectors of 512 bytes) was wrong, at absolute LBA
(1932488704 + 0xa00 + 32131 * 512) / 512 = 3806528, which is a multiple of 8
(8 * 475816), again the start of a 4kb sector.

And now with the partition properly aligned, starting at LBA 32136:

Code:
# ./mybadblocks /dev/sdb2
writing pattern 0x00... 7323631 blocks
comparing pattern 0x00... ok
writing pattern 0x55... 7323631 blocks
comparing pattern 0x55... ok
writing pattern 0xff... 7323631 blocks
comparing pattern 0xff... error at block 4349367

# perl -e 'seek STDIN, 4349367*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
*
00000400  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
^C

1 kbyte of data wrong.
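
Instead of reading the extent of the damage off a hex dump, it could be computed. Below is a hedged Python sketch, not what I actually ran: mismatch_extents is a made-up function, and the input is a synthetic buffer shaped like the last dump rather than data read from the drive.

```python
from typing import List, Tuple


def mismatch_extents(data: bytes, expected: int) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of bytes that differ from expected."""
    extents: List[Tuple[int, int]] = []
    start = None
    for offset, value in enumerate(data):
        if value != expected:
            if start is None:
                start = offset  # a new run of wrong bytes begins here
        elif start is not None:
            extents.append((start, offset - start))
            start = None
    if start is not None:  # data ended while still inside a wrong run
        extents.append((start, len(data) - start))
    return extents


# Synthetic stand-in for the last dump: 0x400 stale 0x55 bytes, then 0xff.
dump = bytes([0x55]) * 0x400 + bytes([0xFF]) * 0xC00

for offset, length in mismatch_extents(dump, 0xFF):
    print(hex(offset), length, length // 512, "sectors")
```

On a real device the buffer would come from reading the failing block and a few blocks after it.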

So, it appears the problem is aligned with the 4kbyte sectors of the drive,
and not with the 4kbyte pages of the Linux page cache. That's why I think
it's a problem in the drive.

The above was with the "deadline" I/O scheduler. The cfq and noop schedulers
give the same kind of errors.

I'm probably going to return this drive to Newegg before the 30 days are up.
 
The exact same thing happened with a second Samsung F4 on the same machine. However, it does not happen on my i7 machine, only on the Core 2 Quad with the older 2.6.32 kernel. I did not try a newer kernel on the Core 2 Quad because this is a production HTPC box and I cannot take it down during the week. On the weekend I can, but it's not easy to find time.

BTW, thanks for the perl script. I will look at this as soon as I can..
 
I found out something new: if I disable the write cache of the drive with hdparm -W 0 /dev/sdb, the problem goes away. Before, I would get an error within a few minutes, and now it's been running for several hours with no errors at all.

This supports my theory that the problem is in the drive and not in Linux, because for Linux the situation is exactly as before (apart from the fact that the drive is a little slower for writes).
 
If so, this is a serious flaw. I am surprised that other users have not seen this. Hopefully this is fixable with a firmware upgrade, that is, if Samsung becomes aware of it.

I have purchased 3 F4s and have 2TB of HTPC data on one of them, which is connected to the Core 2 Quad. I have not seen any corruption yet in any of my recorded programs. That 2TB drive is just 1 of the drives in use, though (not using any RAID); recordings should be balanced across the free space.
 
I found some more info: http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

The Linux on my Core i5 where I could reproduce the corruption so easily runs smartctl on all disks every 10 minutes, the AMD runs smartctl every hour, so this explains what I saw. It also says the corruption doesn't occur with the drive cache off.

Here it says a drive firmware update is in the works: http://www.heise.de/newsticker/meld...en-auf-Samsung-Festplatte-Update-1143120.html (German)


edit: oops, I missed your post before this. Oh well.......
 
I've tried the firmware fix and it seems to have corrected the issue. I used a FreeDOS bootable USB to flash the Samsung firmware on all 5x of my HD204UI drives and then ran evildrone's perl script against a 1GB partition on one of the drives; no corruption detected.

I used Ubuntu 9.10 and I had the following command running during the test:
# watch smartctl -i /dev/sdb

This should force an ATA identify command to be issued to the drive every 2 seconds.

I let the script run for at least a half hour over that 1GB partition with no reported corruption.

Here are the hardware details:
Supermicro X8SIA-F motherboard (Intel 3420 chipset)
Intel Xeon X3440 processor (4C Lynnfield @ 2.53 GHz)
8GB ECC DDR3 RDIMMs (2x 4GB)
5x Samsung HD204UI 2TB SATA HDD
1x OCZ Vertex Pro2 60GB SATA SSD

I can confirm what some others have reported: the Samsung firmware update does NOT result in a change to the reported firmware version. This is extremely disappointing to me.

Also, the SMART output from all of the drives has now changed. I think Samsung added some new fields or otherwise changed the data, because now the output is quite messy.

I still plan on running the Samsung ESTOOL complete surface scan on all 5x of the drives before I'll completely trust the rig.
 
First off, on both PCs the new Samsung HD204UI is the only hard drive plugged in.

I downloaded the file, made an MS-DOS boot floppy disk with WinXP and added 181550HD204UI.EXE to it. I tried it in the 2nd PC (an older P4 with SATA 1.5) and it didn't work with two different floppies.

Then I made a USB boot drive and copied the file over to that. I tried this on my main 2-year-old PC (Abit IP35 Pro) with USB set to boot first. It showed the MS Windows Millennium DOS boot and still nothing; it just showed the C:\> prompt.

I noticed the instructions say to extract the .exe file. Well, that didn't work in WinXP with UniExtract or 7-zip. I also downloaded the file to a Win7 laptop and it wouldn't extract with 7-zip there either. I kept getting errors; I downloaded it numerous times with IE and Firefox, checked the MD5 hash, and it was the same every time.

What am I missing? I just did an Asus P4G800-V BIOS update 5 days ago or so. That was a .exe download that I had to rename to a .rom extension. I tried the same thing here and it still didn't work with the floppy drive boot.
At A:\> I also tried just HD204UI as is, without the numbers before it. I have done firmware updates before, so I don't know what I'm missing. The site says 181550HD204UI.EXE is the flash program, so I should just need a DOS bootable floppy disk or USB flash disk.

Please help, what am I missing with this one? It's 5 hours past my bedtime, sorry if this doesn't make total sense.

Thanks for your time.
 
181550HD204UI.EXE is a DOS executable. No, you do not extract it; at least I did not.

If you copied it to your USB boot disk, see if it is on C: after you boot from the USB boot disk. If not, it may be on A: or B:.

Then you just run the dos executable and it looks for the drives.

I used freedos for this.
 
I did the same as drescherjm, copied the executable to my FreeDOS USB flash drive, booted it up and ran the Samsung exe. I didn't even bother to remove my SSD, the executable was smart enough to only update the Samsung drives. I did actually rename the file so that it was in 8.3 format (shortened the name to HD204UI.EXE) but I don't know if that step was even required.

To make the FreeDOS usb, I downloaded the image from here: http://derek.chezmarcotte.ca/?p=188 and then used the dd command from a Linux system to transfer the image to a USB drive. Then I copied the Samsung EXE file over to the USB drive (which now had a FAT partition on it). Finally booted up the server with the Samsung drives from the USB flash drive and ran the exe.
 
Thanks for both of your comments. I did the FreeDOS bootable USB with UltraISO and it worked!!!
Thanks so much.

I can't believe the other methods weren't working. I even tried the HP_USB_Boot_Utility.exe method, and of all of them I thought that one would have worked.

After it was done, I turned off the PC and pulled the drive right away, and I am going to run ESTool 3.01v to see if I get the RAM Error: AJ41.
Well actually I used 3.01p the other day, but if this passes then I'm thinking of getting another one at Micro Center since they are $90 now. Was thinking this weekend but 7-12+ inches of snow in Minneapolis !!! Have to check the weather, if that's really the case I can wait awhile :)
I hope I didn't screw up by pulling the hard drive after it was done and not letting it boot up again on the same motherboard, don't know why that would matter.

I've never had so much trouble. I've flashed DVDRWs, CDROMs, motherboards and a couple of hard drives through the years, but this time it felt like I had never done it before, or ever used a PC before. On both PCs the motherboard BIOS makes it easy to set the first boot drive to FDD or USB-FDD....

Thanks again.

I still don't understand why they couldn't have made the firmware version change; that's really stupid on their part. The only way to fully test it is to do a 6 hour test, which is crazy if you have multiple drives. What if someone only has one main PC? I do commend them on doing a fast fix though.
 
Update: I finished the Samsung ESTOOL.EXE surface scan on all 5x of my drives. 6.5 hours each, one at a time, but they all came back clean.

Hopefully by the end of the holidays I'll have completed all of my other testing, tweaking and configuration so I can put my new rig into production. At least now I have confidence in the drives themselves.
 