New drive badblocks.

Discussion in 'SSDs & Data Storage' started by drescherjm, Oct 4, 2010.

  1. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    13,815
    Joined:
    Nov 19, 2008
    I have a new 2TB F4 (I have not tested the other two F4s yet). I ran a 4-pass write-mode badblocks test on it (over 2 days of testing), and it reported 736 bad blocks. However, no errors were recorded in SMART or in my dmesg.

    Here is the output:

    jmd0 ~ # badblocks -svw /dev/sdf -o S2HGJ1BZ836643.txt
    Checking for bad blocks in read-write mode
    From block 0 to 1953514583
    Testing with pattern 0xaa: done
    Reading and comparing: done
    Testing with pattern 0x55: done
    Reading and comparing: done
    Testing with pattern 0xff: done
    Reading and comparing: done
    Testing with pattern 0x00: done
    Reading and comparing: done
    Pass completed, 736 bad blocks found.
    jmd0 ~ # smartctl --all /dev/sdf
    smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
    Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model: SAMSUNG HD204UI
    Serial Number: S2HGJ1BZ836643
    Firmware Version: 1AQ10001
    User Capacity: 2,000,398,934,016 bytes
    Device is: Not in smartctl database [for details use: -P showall]
    ATA Version is: 8
    ATA Standard is: ATA-8-ACS revision 6
    Local Time is: Mon Oct 4 19:47:42 2010 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status: (0x00) Offline data collection activity
    was never started.
    Auto Offline Data Collection: Disabled.
    Self-test execution status: ( 0) The previous self-test routine completed
    without error or no self-test has ever
    been run.
    Total time to complete Offline
    data collection: (21060) seconds.
    Offline data collection
    capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new
    command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
    SMART capabilities: (0x0003) Saves SMART data before entering
    power-saving mode.
    Supports SMART auto save timer.
    Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 255) minutes.
    SCT capabilities: (0x003f) SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 0
    2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
    3 Spin_Up_Time 0x0023 068 068 025 Pre-fail Always - 9724
    4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 2
    5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
    7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
    8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
    9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 56
    10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
    11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
    12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 2
    181 Program_Fail_Cnt_Total 0x0022 252 252 000 Old_age Always - 0
    191 G-Sense_Error_Rate 0x0022 252 252 000 Old_age Always - 0
    192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
    194 Temperature_Celsius 0x0002 064 064 000 Old_age Always - 28 (Lifetime Min/Max 22/36)
    195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
    196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
    197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
    198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
    200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 0
    223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
    225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 2

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged. [To run self-tests, use: smartctl -t]


    Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
    SMART Selective self-test log data structure revision number 0
    Note: revision number not 1 implies that no selective self-test has ever been run
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Completed [00% left] (0-65535)
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
     
    Last edited: Oct 5, 2010
  2. drescherjm

    The machine is an Intel Core 2 Quad Q9550 running at 3.1 GHz instead of the stock 2.83 GHz. Yes, I know overclocking can cause this. However, the system has been rock stable (24/7/365) at this overclock (not a single kernel panic ...) for nearly 2 years, since I purchased it in November of 2008. I guess I should test the drive on my i7 box. The interesting thing is that the entire first pass returned no errors at all.

    Edit: I do have the badblocks list and can post it somewhere, but I will refrain from copying it here because it's too long.
     
  3. john4200

    john4200 [H]ard|Gawd

    Messages:
    1,537
    Joined:
    Oct 16, 2009
    You did not post the list of bad blocks. Were they distributed throughout the disk, or mostly clustered in certain locations?

    Did you try setting the block size to 4096 (-b 4096)? I don't know if it makes any difference in the time to complete the test, but the default is 1024 on my copy.

    If you wrap code tags around the badblocks list, it should be okay to post.
     
  4. drescherjm

    Thanks. I forgot about code tags..

    Code:
    jmd0 ~ # cat S2HGJ1BZ836643.txt 
    386596256
    386596257
    386596258
    386596259
    386596260
    386596261
    386596262
    386596263
    386596264
    386596265
    386596266
    386596267
    386596268
    386596269
    386596270
    386596271
    386596272
    386596273
    386596274
    386596275
    386596276
    386596277
    386596278
    386596279
    386596280
    386596281
    386596282
    386596283
    386596284
    386596285
    386596286
    386596287
    805674976
    805674977
    805674978
    805674979
    805674980
    805674981
    805674982
    805674983
    805674984
    805674985
    805674986
    805674987
    805674988
    805674989
    805674990
    805674991
    805674992
    805674993
    805674994
    805674995
    805674996
    805674997
    805674998
    805674999
    805675000
    805675001
    805675002
    805675003
    805675004
    805675005
    805675006
    805675007
    1370054432
    1370054433
    1370054434
    1370054435
    1370054436
    1370054437
    1370054438
    1370054439
    1370054440
    1370054441
    1370054442
    1370054443
    1370054444
    1370054445
    1370054446
    1370054447
    1370054448
    1370054449
    1370054450
    1370054451
    1370054452
    1370054453
    1370054454
    1370054455
    1370054456
    1370054457
    1370054458
    1370054459
    1370054460
    1370054461
    1370054462
    1370054463
    1475864160
    1475864161
    1475864162
    1475864163
    1475864164
    1475864165
    1475864166
    1475864167
    1475864168
    1475864169
    1475864170
    1475864171
    1475864172
    1475864173
    1475864174
    1475864175
    1475864176
    1475864177
    1475864178
    1475864179
    1475864180
    1475864181
    1475864182
    1475864183
    1475864184
    1475864185
    1475864186
    1475864187
    1475864188
    1475864189
    1475864190
    1475864191
    1476597664
    1476597665
    1476597666
    1476597667
    1476597668
    1476597669
    1476597670
    1476597671
    1476597672
    1476597673
    1476597674
    1476597675
    1476597676
    1476597677
    1476597678
    1476597679
    1476597680
    1476597681
    1476597682
    1476597683
    1476597684
    1476597685
    1476597686
    1476597687
    1476597688
    1476597689
    1476597690
    1476597691
    1476597692
    1476597693
    1476597694
    1476597695
    1535326368
    1535326369
    1535326370
    1535326371
    1535326372
    1535326373
    1535326374
    1535326375
    1535326376
    1535326377
    1535326378
    1535326379
    1535326380
    1535326381
    1535326382
    1535326383
    1535326384
    1535326385
    1535326386
    1535326387
    1535326388
    1535326389
    1535326390
    1535326391
    1535326392
    1535326393
    1535326394
    1535326395
    1535326396
    1535326397
    1535326398
    1535326399
    1687842016
    1687842017
    1687842018
    1687842019
    1687842020
    1687842021
    1687842022
    1687842023
    1687842024
    1687842025
    1687842026
    1687842027
    1687842028
    1687842029
    1687842030
    1687842031
    1687842032
    1687842033
    1687842034
    1687842035
    1687842036
    1687842037
    1687842038
    1687842039
    1687842040
    1687842041
    1687842042
    1687842043
    1687842044
    1687842045
    1687842046
    1687842047
    1826552736
    1826552737
    1826552738
    1826552739
    1826552740
    1826552741
    1826552742
    1826552743
    1826552744
    1826552745
    1826552746
    1826552747
    1826552748
    1826552749
    1826552750
    1826552751
    1826552752
    1826552753
    1826552754
    1826552755
    1826552756
    1826552757
    1826552758
    1826552759
    1826552760
    1826552761
    1826552762
    1826552763
    1826552764
    1826552765
    1826552766
    1826552767
    79448416
    79448417
    79448418
    79448419
    79448420
    79448421
    79448422
    79448423
    79448424
    79448425
    79448426
    79448427
    79448428
    79448429
    79448430
    79448431
    79448432
    79448433
    79448434
    79448435
    79448436
    79448437
    79448438
    79448439
    79448440
    79448441
    79448442
    79448443
    79448444
    79448445
    79448446
    79448447
    931251232
    931251233
    931251234
    931251235
    931251236
    931251237
    931251238
    931251239
    931251240
    931251241
    931251242
    931251243
    931251244
    931251245
    931251246
    931251247
    931251248
    931251249
    931251250
    931251251
    931251252
    931251253
    931251254
    931251255
    931251256
    931251257
    931251258
    931251259
    931251260
    931251261
    931251262
    931251263
    1124616032
    1124616033
    1124616034
    1124616035
    1124616036
    1124616037
    1124616038
    1124616039
    1124616040
    1124616041
    1124616042
    1124616043
    1124616044
    1124616045
    1124616046
    1124616047
    1124616048
    1124616049
    1124616050
    1124616051
    1124616052
    1124616053
    1124616054
    1124616055
    1124616056
    1124616057
    1124616058
    1124616059
    1124616060
    1124616061
    1124616062
    1124616063
    1305464160
    1305464161
    1305464162
    1305464163
    1305464164
    1305464165
    1305464166
    1305464167
    1305464168
    1305464169
    1305464170
    1305464171
    1305464172
    1305464173
    1305464174
    1305464175
    1305464176
    1305464177
    1305464178
    1305464179
    1305464180
    1305464181
    1305464182
    1305464183
    1305464184
    1305464185
    1305464186
    1305464187
    1305464188
    1305464189
    1305464190
    1305464191
    1475607008
    1475607009
    1475607010
    1475607011
    1475607012
    1475607013
    1475607014
    1475607015
    1475607016
    1475607017
    1475607018
    1475607019
    1475607020
    1475607021
    1475607022
    1475607023
    1475607024
    1475607025
    1475607026
    1475607027
    1475607028
    1475607029
    1475607030
    1475607031
    1475607032
    1475607033
    1475607034
    1475607035
    1475607036
    1475607037
    1475607038
    1475607039
    225556384
    225556385
    225556386
    225556387
    225556388
    225556389
    225556390
    225556391
    225556392
    225556393
    225556394
    225556395
    225556396
    225556397
    225556398
    225556399
    225556400
    225556401
    225556402
    225556403
    225556404
    225556405
    225556406
    225556407
    225556408
    225556409
    225556410
    225556411
    225556412
    225556413
    225556414
    225556415
    444989792
    444989793
    444989794
    444989795
    444989796
    444989797
    444989798
    444989799
    444989800
    444989801
    444989802
    444989803
    444989804
    444989805
    444989806
    444989807
    444989808
    444989809
    444989810
    444989811
    444989812
    444989813
    444989814
    444989815
    444989816
    444989817
    444989818
    444989819
    444989820
    444989821
    444989822
    444989823
    655023648
    655023649
    655023650
    655023651
    655023652
    655023653
    655023654
    655023655
    655023656
    655023657
    655023658
    655023659
    655023660
    655023661
    655023662
    655023663
    655023664
    655023665
    655023666
    655023667
    655023668
    655023669
    655023670
    655023671
    655023672
    655023673
    655023674
    655023675
    655023676
    655023677
    655023678
    655023679
    860067424
    860067425
    860067426
    860067427
    860067428
    860067429
    860067430
    860067431
    860067432
    860067433
    860067434
    860067435
    860067436
    860067437
    860067438
    860067439
    860067440
    860067441
    860067442
    860067443
    860067444
    860067445
    860067446
    860067447
    860067448
    860067449
    860067450
    860067451
    860067452
    860067453
    860067454
    860067455
    1056818592
    1056818593
    1056818594
    1056818595
    1056818596
    1056818597
    1056818598
    1056818599
    1056818600
    1056818601
    1056818602
    1056818603
    1056818604
    1056818605
    1056818606
    1056818607
    1056818608
    1056818609
    1056818610
    1056818611
    1056818612
    1056818613
    1056818614
    1056818615
    1056818616
    1056818617
    1056818618
    1056818619
    1056818620
    1056818621
    1056818622
    1056818623
    1241773280
    1241773281
    1241773282
    1241773283
    1241773284
    1241773285
    1241773286
    1241773287
    1241773288
    1241773289
    1241773290
    1241773291
    1241773292
    1241773293
    1241773294
    1241773295
    1241773296
    1241773297
    1241773298
    1241773299
    1241773300
    1241773301
    1241773302
    1241773303
    1241773304
    1241773305
    1241773306
    1241773307
    1241773308
    1241773309
    1241773310
    1241773311
    1415493984
    1415493985
    1415493986
    1415493987
    1415493988
    1415493989
    1415493990
    1415493991
    1415493992
    1415493993
    1415493994
    1415493995
    1415493996
    1415493997
    1415493998
    1415493999
    1415494000
    1415494001
    1415494002
    1415494003
    1415494004
    1415494005
    1415494006
    1415494007
    1415494008
    1415494009
    1415494010
    1415494011
    1415494012
    1415494013
    1415494014
    1415494015
    1647004384
    1647004385
    1647004386
    1647004387
    1647004388
    1647004389
    1647004390
    1647004391
    1647004392
    1647004393
    1647004394
    1647004395
    1647004396
    1647004397
    1647004398
    1647004399
    1647004400
    1647004401
    1647004402
    1647004403
    1647004404
    1647004405
    1647004406
    1647004407
    1647004408
    1647004409
    1647004410
    1647004411
    1647004412
    1647004413
    1647004414
    1647004415
    1726351584
    1726351585
    1726351586
    1726351587
    1726351588
    1726351589
    1726351590
    1726351591
    1726351592
    1726351593
    1726351594
    1726351595
    1726351596
    1726351597
    1726351598
    1726351599
    1726351600
    1726351601
    1726351602
    1726351603
    1726351604
    1726351605
    1726351606
    1726351607
    1726351608
    1726351609
    1726351610
    1726351611
    1726351612
    1726351613
    1726351614
    1726351615
    1862898656
    1862898657
    1862898658
    1862898659
    1862898660
    1862898661
    1862898662
    1862898663
    1862898664
    1862898665
    1862898666
    1862898667
    1862898668
    1862898669
    1862898670
    1862898671
    1862898672
    1862898673
    1862898674
    1862898675
    1862898676
    1862898677
    1862898678
    1862898679
    1862898680
    1862898681
    1862898682
    1862898683
    1862898684
    1862898685
    1862898686
    1862898687
    jmd0 ~ # 
    
    Edit:
    Here is the pastebin link I was in the process of creating when I read the last post..

    http://pastebin.com/QPURQ6Au
     
  5. john4200

    That is interesting data. Did you spot the pattern? There are 23 chunks of 32 consecutive bad blocks. You apparently used the default 1024B block size, so the chunks are 32KB in size. I'm not sure why your bad blocks come in 32KB chunks. My copy of badblocks defaults to 64 blocks at a time (1024B blocks), so unless yours did 32 at a time, that should not be the reason.

    32KB is 64 512B sectors, but the man page for badblocks says it defaults to 64 "blocks which are tested at a time", not sectors.

    The chunks seem to be uniformly randomly distributed over the disk, with an average of 4.15% of capacity between chunks of bad blocks.

    I do not understand the reason for the pattern. You might re-run the scan with -b 4096 -c 32, just to see what happens.
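
    The chunk pattern can be checked mechanically rather than by eye. A minimal Python sketch, assuming the posted list is reduced to the first block of each chunk (the `RUN_STARTS` values are copied from the list above in posted order; the helper names `expand` and `find_runs` are made up for illustration):

```python
# First block of each chunk, in the order the chunks appear in the posted list.
RUN_STARTS = [
    386596256, 805674976, 1370054432, 1475864160, 1476597664, 1535326368,
    1687842016, 1826552736, 79448416, 931251232, 1124616032, 1305464160,
    1475607008, 225556384, 444989792, 655023648, 860067424, 1056818592,
    1241773280, 1415493984, 1647004384, 1726351584, 1862898656,
]
RUN_LENGTH = 32      # blocks per chunk in the posted list
BLOCK_SIZE = 1024    # badblocks default block size in bytes


def expand(starts, length=RUN_LENGTH):
    """Rebuild the flat list of block numbers, one entry per bad block."""
    return [start + i for start in starts for i in range(length)]


def find_runs(blocks):
    """Group block numbers into (first_block, count) runs of consecutive values."""
    runs = []
    for block in blocks:
        if runs and block == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((block, 1))
    return runs


blocks = expand(RUN_STARTS)
runs = find_runs(blocks)
print(len(blocks), "bad blocks in", len(runs), "runs")
print("distinct run lengths:", sorted({count for _, count in runs}))
print("bytes per run:", RUN_LENGTH * BLOCK_SIZE)
```

    Feeding the real output file through `find_runs` (one integer per line) should give the same grouping, assuming the file matches what was posted.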
     
  6. drescherjm

    Thanks. I did not look at it closely enough to see the pattern. I will have to look further into this. I can't explain any of this..
     
  7. drescherjm

    BTW, the man page for my badblocks version (E2fsprogs version 1.41.12) says

    -b block-size
    Specify the size of blocks in bytes. The default is 1024.

    -c number of blocks
    is the number of blocks which are tested at a time. The default is 64.
     
  8. drescherjm

    I looked at the data and it is corrupted:

    Code:
    jmd0 ~ # dd if=/dev/sdf bs=1024 skip=1862898656 count=33 | hexdump -C
    00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
    *
    00008000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00008400
    33+0 records in
    33+0 records out
    33792 bytes (34 kB) copied, 0.000801165 s, 42.2 MB/s
    
    At the 4th pass all bytes on the disk should be 0. FF was the previous pattern. This looks really weird. Kernel bug?

    Edit: the same goes for the previous region.
    Code:
    
    jmd0 ~ # dd if=/dev/sdf bs=1024 skip=1726351584 count=33 | hexdump -C
    00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
    *
    00008000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00008400
    33+0 records in
    33+0 records out
    33792 bytes (34 kB) copied, 0.0218385 s, 1.5 MB/s
    
    For those who cannot do hexadecimal math in their heads (I am a programmer and a sysadmin) or cannot follow Linux commands: I took a hex dump of 33 blocks (of 1024 bytes each) rather than the 32 that badblocks reported as bad, so that I could see both the data in the bad blocks and the block immediately following them. hexdump collapses consecutive identical lines into a single `*`. 0x8000 = 32K. The two dumps were taken from the final two regions that badblocks reported. I selected these because I know that at that point every byte on the disk is supposed to be zero.
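
    The offsets in the two dumps can be reproduced with a few lines of arithmetic. A small Python sketch, assuming the same `bs=1024` and `count=33` as the dd commands above (the function name `dd_window` is made up for illustration):

```python
BLOCK_SIZE = 1024     # dd bs= value, same as the badblocks default block size
BAD_BLOCKS = 32       # blocks badblocks reported per region
DUMPED_BLOCKS = 33    # dd count= value: the bad region plus one block after it


def dd_window(first_block, count=DUMPED_BLOCKS, block_size=BLOCK_SIZE):
    """Return (byte offset on the disk, byte length) of a dd read with skip=first_block."""
    return first_block * block_size, count * block_size


# Offset inside the dump where the stale 0xff data should stop.
stale_end = BAD_BLOCKS * BLOCK_SIZE

offset, length = dd_window(1862898656)
print(f"dd reads {length} bytes starting at disk byte offset {offset}")
print(f"stale region ends at {stale_end:#x}, dump ends at {length:#x}")
```

    The two hexadecimal values correspond to the `00008000` and `00008400` offsets in the hexdump output: 32 blocks of stale data followed by one block of zeros.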
     
    Last edited: Oct 4, 2010
  9. john4200

    The blocks that were listed as bad seemed random. I'm not sure how a kernel bug could do that. I'm not saying whether it is a kernel bug, just that I cannot imagine what sort of bug would result in what you saw.

    Did you try running badblocks again with -b 4096 -c 32, or something else, just to see if the pattern holds or changes?
     
  10. drescherjm

    Not yet; each pass takes 10 hours. And last time there were no bad blocks on the first pass.

    Also I have a second identical drive with nothing on it to test as well.
     
    Last edited: Oct 4, 2010
  11. drescherjm

    Started

    Code:
    jmd0 ~ # badblocks -svw -c 32 -b 4096 /dev/sdf -o S2HGJ1BZ836643_test2.txt
    Checking for bad blocks in read-write mode
    From block 0 to 488378645
    
    iotop shows initial disk writes of 117 to 120MB/s
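
    The block ranges printed by badblocks and the time per pass both follow from the drive capacity. A rough Python sketch; the capacity is the "User Capacity" value from the smartctl output, while the constant throughput (midpoint of the iotop reading, taken as decimal megabytes) is an assumption and only gives a best-case estimate:

```python
CAPACITY_BYTES = 2_000_398_934_016   # "User Capacity" reported by smartctl


def last_block(block_size):
    """Highest block number badblocks reports for a given -b block size."""
    return CAPACITY_BYTES // block_size - 1


# Assumption: throughput stays at the initial iotop reading for the whole disk.
# Real drives slow down toward the inner tracks, so this is a lower bound on time.
ASSUMED_BYTES_PER_SEC = 118.5e6


def hours_for_one_sweep():
    """Hours to write (or read) the whole disk once at the assumed throughput."""
    return CAPACITY_BYTES / ASSUMED_BYTES_PER_SEC / 3600


print("last 1024-byte block:", last_block(1024))
print("last 4096-byte block:", last_block(4096))
sweep = hours_for_one_sweep()
print(f"one full write or read sweep: about {sweep:.1f} h")
print(f"write plus read-back for one pattern: about {2 * sweep:.1f} h")
```

    Under these assumptions one pattern takes a bit over nine hours at best, which is in the same range as the roughly 10 hours per pass quoted earlier.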
     
    Last edited: Oct 4, 2010
  12. drescherjm

    New bad blocks:
    Code:
    22320280
    22320281
    22320282
    22320283
    22320284
    22320285
    22320286
    22320287
    132451576
    132451577
    132451578
    132451579
    132451580
    132451581
    132451582
    132451583
    184302392
    184302393
    184302394
    184302395
    184302396
    184302397
    184302398
    184302399
    234629240
    234629241
    234629242
    234629243
    234629244
    234629245
    234629246
    234629247
    282862744
    282862745
    282862746
    282862747
    282862748
    282862749
    282862750
    282862751
    327766616
    327766617
    327766618
    327766619
    327766620
    327766621
    327766622
    327766623
    
    So the bad regions are still 32K. It's as if the kernel is randomly not flushing a 32K block for some reason. The last pattern was 00 and the new pattern is aa.

    Code:
    jmd0 ~ #  dd if=/dev/sdf bs=4096 skip=22320280 count=9 | hexdump -C
    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00008000  aa aa aa aa aa aa aa aa  aa aa aa aa aa aa aa aa  |................|
    *
    00009000
    9+0 records in
    9+0 records out
    36864 bytes (37 kB) copied, 5.2471e-05 s, 703 MB/s
    
     
    Last edited: Oct 5, 2010
  13. john4200

    That is really strange. Note that there are only 6 regions this time, and not in the same location as the previous 23 regions. So I agree with you that it looks like some sort of bug -- either in badblocks, or in the kernel. What distro and kernel are you using?
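
    Because the two runs used different block sizes, the region locations are easiest to compare as byte ranges. A minimal Python sketch, assuming the region starts are copied from the two posted lists (the helper names `byte_ranges` and `overlaps` are made up for illustration):

```python
# First block of each bad region in the first run (1024-byte block numbers).
FIRST_RUN_1K = [
    386596256, 805674976, 1370054432, 1475864160, 1476597664, 1535326368,
    1687842016, 1826552736, 79448416, 931251232, 1124616032, 1305464160,
    1475607008, 225556384, 444989792, 655023648, 860067424, 1056818592,
    1241773280, 1415493984, 1647004384, 1726351584, 1862898656,
]
# First block of each bad region in the second run (4096-byte block numbers).
SECOND_RUN_4K = [22320280, 132451576, 184302392, 234629240, 282862744, 327766616]

REGION_BYTES = 32 * 1024   # every region in both lists is 32 KiB long


def byte_ranges(starts, block_size):
    """Convert region start blocks to half-open (start_byte, end_byte) ranges."""
    return [(s * block_size, s * block_size + REGION_BYTES) for s in starts]


def overlaps(a, b):
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


first = byte_ranges(FIRST_RUN_1K, 1024)
second = byte_ranges(SECOND_RUN_4K, 4096)
shared = [(a, b) for a in first for b in second if overlaps(a, b)]
print(len(first), "regions in run 1,", len(second), "in run 2,", len(shared), "overlapping")
```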
     
  14. drescherjm

    The reading was not 100% finished when I took the result.

    This is 64-bit Gentoo, but it is not a common kernel because I need OpenVZ on this machine. I am using the latest, openvz-2.6.32.9.1.

    For gentoo the package is:
    sys-kernel/openvz-sources-2.6.32.9.1

    This kernel was updated last week to the latest 2.6.32 in the mainline kernel. I could back off of that to the previous kernel. Still 2.6.32 based (I have been running 2.6.32 for ~8 months).

    http://git.openvz.org/?p=linux-2.6.32-openvz;a=summary

    Code:
    jmd0 ~ # uname -a
    Linux jmd0.comcast.net 2.6.32-openvz-dyomin.1 #1 SMP Wed Sep 22 21:45:36 EDT 2010 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux
    
    Since this is a production system, I am pretty much stuck with this kernel until I move the subversion / cvs server over to the in-kernel lxc containers. I am waiting for some more testing of lxc before I make that switch, since I (and others) use the svn server daily. On top of that, I cannot safely downgrade the kernel to a lower main version for two reasons: ext4, and the fact that the box is also my main HTPC backend.
     
    Last edited: Oct 5, 2010
  15. john4200

    That will be hard to debug. If it were me, I'd probably start by finding a forum / bugtracker web site for badblocks and posting the issue there. The man page for badblocks on my system references http://e2fsprogs.sourceforge.net/

    Theodore Ts'o is apparently the maintainer of badblocks, which is good if you can get his attention, since you could not ask for a person more knowledgeable about linux and disks.
     
  16. drescherjm

    BTW. Thanks a lot. I really appreciate your help on this.

    I posted this same question in gentoo and no one has bitten yet. I can take it to badblocks and possibly lkml (linux kernel mailing list). I may also try other drives in that machine and also put this drive in a different machine.

    Agreed.
     
  17. evildrone

    evildrone n00bie

    Messages:
    3
    Joined:
    Nov 30, 2010
    Hi,
    I'm seeing a similar problem. Your post is the only thing I can find that
    resembles what I see. I believe it is not a bug in Linux or badblocks, but a
    problem with the drive.

    hdd:
    Code:
    === START OF INFORMATION SECTION ===
    Device Model:     SAMSUNG HD204UI
    Serial Number:    S2H7JD2ZA14578
    Firmware Version: 1AQ10001
    
    Tested in 2 different systems: Asus M2A-VM with Athlon X2 4450e and an Intel
    DH55TC with Core i5-750. No overclocking, default BIOS settings. The Asus is
    running an Ubuntu 2.6.35-23 kernel and the Intel runs a vanilla 2.6.36.1.

    Partition table:
    Code:
    # sfdisk -dl /dev/sdb
    # partition table of /dev/sdb
    unit: sectors
    
    /dev/sdb1 : start=        1, size=    32129, Id=83
    /dev/sdb2 : start=    32130, size= 58589054, Id=83
    /dev/sdb3 : start= 58621192, size=  3903856, Id=82
    /dev/sdb4 : start= 62525048, size=3844499016, Id=83
    
    Note that /dev/sdb2 is not aligned on 4kb (32130 % 8 = 2)
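
    The same modulo check can be applied to every partition in the table. A short Python sketch, assuming 512-byte logical sectors and 4 KiB physical sectors (8 logical sectors each); the start values are copied from the sfdisk output above and the function name `misalignment` is made up for illustration:

```python
SECTORS_PER_4K = 8   # 4096-byte physical sector / 512-byte logical sector

# Partition start sectors from the sfdisk output.
PARTITION_STARTS = {
    "/dev/sdb1": 1,
    "/dev/sdb2": 32130,
    "/dev/sdb3": 58621192,
    "/dev/sdb4": 62525048,
}


def misalignment(start_lba):
    """Number of 512-byte sectors the start lies past a 4 KiB boundary (0 = aligned)."""
    return start_lba % SECTORS_PER_4K


for name, start in PARTITION_STARTS.items():
    off = misalignment(start)
    state = "aligned" if off == 0 else f"misaligned by {off} sector(s)"
    print(f"{name}: start={start} -> {state}")
```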

    I first became aware of the problem when I did an md5sum check on some files
    that I copied. I couldn't reproduce it with badblocks, but I can with this
    script which writes and compares data:

    Code:
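    #!/usr/bin/perl
    # Write a fill pattern over the whole device, read it back, and stop at
    # the first 4096-byte block that does not match. Loops until an error.
    use strict;
    use warnings;
    use Fcntl;
    my $dev = shift @ARGV or die "usage: $0 <device>\n";
    for(;;) {
       for my $pattern (0, 0x55, 0xff, 0xaa) {
          my $buf = pack("C", $pattern) x 4096;   # one 4096-byte block of the pattern
          my $blocks = 0;
          sysopen DEV, $dev, O_WRONLY|O_EXCL or die "sysopen $dev: $!\n";
          printf STDERR "writing pattern 0x%02x... ", $pattern;
          # Keep writing until print fails, i.e. the end of the device is reached.
          while(print DEV $buf) { $blocks++; }
          close DEV;
          printf STDERR "%d blocks\n", $blocks;
    
          sysopen DEV, $dev, O_RDONLY|O_EXCL or die "sysopen $dev: $!\n";
          printf STDERR "comparing pattern 0x%02x... ", $pattern;
          # Re-read every block that was written and compare it with the pattern.
          for(my $i = 0; $i < $blocks; $i++) {
             read(DEV, $_, 4096) == 4096 or die "read $dev $i: $!\n";
             $_ eq $buf or die "error at block $i\n";
          }
          close DEV;
          printf STDERR "ok\n";
       }
    }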
    
    result:
    Code:
    # ./mybadblocks /dev/sdb2
    writing pattern 0x00... 7323631 blocks
    comparing pattern 0x00... ok
    writing pattern 0x55... 7323631 blocks
    comparing pattern 0x55... error at block 6279263
    
    # perl -e 'seek STDIN, 6279263*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
    00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
    *
    00000c00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
    *
    00006000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
    *
    ^C
    
    21 kbyte of data was wrong, at absolute LBA (6279263*4096 + 0xc00 + 32130 *
    512) / 512 = 50266240. This is a multiple of 8, i.e., the start of a 4kb
    sector.

    I then moved partition 2 to start 1 sector later, at LBA 32131, and ran the test again.


    result:
    Code:
    # ./mybadblocks /dev/sdb2
    writing pattern 0x00... 7323631 blocks
    comparing pattern 0x00... ok
    writing pattern 0x55... 7323631 blocks
    comparing pattern 0x55... ok
    writing pattern 0xff... 7323631 blocks
    comparing pattern 0xff... error at block 471799
    
    # perl -e 'seek STDIN, 471799*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
    00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
    *
    00000a00  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
    *
    00004000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
    *
    ^C
    
    13.5 kbyte (27 sectors) of data was wrong, at absolute LBA (1932488704 + 0xa00 +
    32131 * 512) / 512 = 3806528, which is a multiple of 8 (8 * 475816), again the
    start of a 4kb sector.

    And now with the partition properly aligned, starting at LBA 32136:

    Code:
    # ./mybadblocks /dev/sdb2
    writing pattern 0x00... 7323631 blocks
    comparing pattern 0x00... ok
    writing pattern 0x55... 7323631 blocks
    comparing pattern 0x55... ok
    writing pattern 0xff... 7323631 blocks
    comparing pattern 0xff... error at block 4349367
    
    # perl -e 'seek STDIN, 4349367*4096, 0; while(read STDIN, $_, 4096) {print}' </dev/sdb2 | hd
    00000000  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|
    *
    00000400  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
    *
    ^C
    
    1 kbyte of data wrong.

    So, it appears the problem is aligned with the 4kbyte sectors of the drive,
    and not with the 4kbyte pages of the Linux page cache. That's why I think
    it's a problem in the drive.
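
    The absolute LBA calculations for the three test runs can be reproduced as follows. A minimal Python sketch, assuming 512-byte logical sectors and 4 KiB physical sectors; the block numbers and offsets are read off the error messages and hex dumps above, and the function name `first_bad_lba` is made up for illustration:

```python
SECTOR = 512               # bytes per logical sector (the LBA unit)
BLOCK = 4096               # block size used by the test script
SECTORS_PER_PHYSICAL = 8   # assumes 4 KiB physical sectors, as stated in the post


def first_bad_lba(partition_start_lba, block_index, offset_in_block):
    """Absolute LBA of the first wrong byte seen in a hex dump."""
    byte_in_partition = block_index * BLOCK + offset_in_block
    return partition_start_lba + byte_in_partition // SECTOR


# (partition start LBA, failing block, offset of the first wrong byte in the dump)
CASES = [
    (32130, 6279263, 0xC00),
    (32131, 471799, 0xA00),
    (32136, 4349367, 0x000),   # here the wrong data starts at the top of the block
]

for start, block, offset in CASES:
    lba = first_bad_lba(start, block, offset)
    print(f"partition start {start}: first bad LBA {lba}, "
          f"LBA % 8 = {lba % SECTORS_PER_PHYSICAL}, "
          f"4 KiB sector index {lba // SECTORS_PER_PHYSICAL}")
```

    For the second case the LBA comes out as 3806528, which is 8 times 475816, so 475816 is the index of the 4 KiB sector rather than the LBA itself.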

    The above was with the "deadline" I/O scheduler. The cfq and noop schedulers
    give the same kind of errors.

    I'm probably going to return this drive to Newegg before the 30 days are up.
     
  18. drescherjm

    The exact same thing happened with a second Samsung F4 on the same machine. However, it does not happen on my i7 machine, only on the Core 2 Quad with the older 2.6.32 kernel. I did not try a newer kernel on the Core 2 Quad because this is a production HTPC box and I cannot take it down during the week. I can on the weekend, but it's not easy to find time.

    BTW, thanks for the perl script. I will look at this as soon as I can..
     
    Last edited: Nov 30, 2010
  19. evildrone

    I found out something new: if I disable the write cache of the drive with hdparm -W 0 /dev/sdb, the problem goes away. Before, I would get an error within a few minutes, and now it's been running for several hours with no errors at all.

    This supports my theory that the problem is in the drive and not in Linux, because for Linux the situation is exactly as before (apart from the fact that the drive is a little slower for writes).
     
  20. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    13,815
    Joined:
    Nov 19, 2008
    If so, this is a serious flaw. I am surprised that other users have not seen this. Hopefully it is fixable with a firmware upgrade. That is, if Samsung becomes aware of it.

    I have purchased 3 F4s and have 2TB of HTPC data on the one that is connected to the Core 2 Quad. I have not seen any corruption yet in any of my recorded programs. That said, the 2TB drive is just one of the drives that recordings go to (I am not using any RAID), and recordings should be balanced across the drives according to free space.
     
  21. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    13,815
    Joined:
    Nov 19, 2008
  22. evildrone

    evildrone n00bie

    Messages:
    3
    Joined:
    Nov 30, 2010
    I found some more info: http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

    The Linux install on my Core i5, where I could reproduce the corruption so easily, runs smartctl on all disks every 10 minutes, while the AMD machine runs smartctl only every hour. That explains what I saw. The page also says the corruption doesn't occur with the drive cache off.

    Here it says a drive firmware update is in the works: http://www.heise.de/newsticker/meld...en-auf-Samsung-Festplatte-Update-1143120.html (German)


    edit: oops, I missed your post before this. Oh well.......
     
  23. somedude1234

    somedude1234 n00bie

    Messages:
    3
    Joined:
    Dec 9, 2010
    I've tried the firmware fix and it seems to have corrected the issue. I used a FreeDOS bootable USB to flash the Samsung firmware on all 5x of my HD204UI drives and then ran evildrone's perl script against a 1GB partition on one of the drives: no corruption detected.

    I used Ubuntu 9.10 and I had the following command running during the test:
    # watch smartctl -i /dev/sdb

    This should force an ATA identify command to be issued to the drive every 2 seconds.

    I let the script run for at least a half hour over that 1GB partition with no reported corruption.
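
    For anyone who wants to see the shape of such a test, here is a minimal Python sketch of a write-then-verify pattern loop. It is not evildrone's perl script: the function names, the 4096-byte block size and the scratch file are assumptions for illustration only. Be aware that on a small file the read-back will most likely be served from the page cache rather than from the drive, so this shows the logic, not a real drive test.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size, matching the 4 kbyte blocks discussed in this thread


def write_pattern(path, pattern, num_blocks):
    """Fill the first num_blocks blocks of path with the given byte value."""
    block = bytes([pattern]) * BLOCK_SIZE
    with open(path, "r+b") as f:
        for _ in range(num_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push its buffers to the device


def first_bad_block(path, pattern, num_blocks):
    """Return the index of the first block that does not match, or None."""
    expected = bytes([pattern]) * BLOCK_SIZE
    with open(path, "rb") as f:
        for i in range(num_blocks):
            if f.read(BLOCK_SIZE) != expected:
                return i
    return None


def run_test(path, num_blocks, patterns=(0x00, 0x55, 0xFF)):
    """Write and verify each pattern; return (pattern, block) on the first error, else None."""
    for pattern in patterns:
        write_pattern(path, pattern, num_blocks)
        bad = first_bad_block(path, pattern, num_blocks)
        if bad is not None:
            return (pattern, bad)
    return None


if __name__ == "__main__":
    # Demo on a scratch file. Pointing this at a partition would destroy its contents.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        scratch = tmp.name
    try:
        result = run_test(scratch, num_blocks=256)
        if result is None:
            print("no corruption detected")
        else:
            print(f"error: pattern {result[0]:#04x}, block {result[1]}")
    finally:
        os.unlink(scratch)
```

    A real test would need a test area larger than RAM, or some way to bypass the cache, and would have to run while the smartctl loop above is going.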

    Here are the hardware details:
    Supermicro X8SIA-F motherboard (Intel 3420 chipset)
    Intel Xeon X3440 processor (4C Lynnfield @ 2.53 GHz)
    8GB ECC DDR3 RDIMMs (2x 4GB)
    5x Samsung HD204UI 2TB SATA HDD
    1x OCZ Vertex Pro2 60GB SATA SSD

    I can confirm what some others have reported: the Samsung firmware update does NOT result in a change to the reported firmware version. This is extremely disappointing to me.

    Also, the SMART output from all of the drives has now changed. I think Samsung added some new fields or otherwise changed the data, because now the output is quite messy.

    I still plan on running the Samsung ESTOOL complete surface scan on all 5x of the drives before I'll completely trust the rig.
     
    Last edited: Dec 10, 2010
  24. SB1

    SB1 [H]Lite

    Messages:
    119
    Joined:
    Jan 21, 2009
    First off, on both PCs the new Samsung HD204UI is the only hard drive plugged in.

    I downloaded the file, made an MS-DOS boot floppy disk with WinXP and added 181550HD204UI.EXE to it. I tried it in my 2nd PC (older P4 with SATA 1.5) and it didn't work with two different floppies.

    Then I made a USB boot drive and copied the file over to that. I tried this on my main 2-year-old PC (Abit IP35 Pro) with USB set to boot first. It showed the MS Windows Millennium DOS boot and then a C:\> prompt, but still nothing.

    I noticed the instructions say to extract the .exe file, but that didn't work in WinXP with UniExtract or 7-zip. I also downloaded the file to a Win7 laptop and it wouldn't extract with 7-zip there either. I kept getting errors. I downloaded it numerous times with IE and Firefox, checked the MD5 hash, and it was the same every time.

    What am I missing? I just did an Asus P4G800-V BIOS update 5 days ago or so. That was a .exe download that I had to rename to a .rom extension, so I tried that here too, and it still didn't work with the floppy drive boot.
    At A:\> I also tried just HD204UI as is, without the numbers before it. I have done firmware updates before and don't know what I'm missing. The site says 181550HD204UI.EXE is the flash program, so I should just need a DOS bootable floppy disk or USB flash disk.

    Please help, what am I missing with this one? It's 5 hours past my bedtime, so sorry if this doesn't make total sense.

    Thanks for your time.
     
  25. drescherjm

    drescherjm [H]ardForum Junkie

    Messages:
    13,815
    Joined:
    Nov 19, 2008
    181550HD204UI.EXE is a DOS executable. No, you do not extract it; at least I did not.

    If you copied it to your USB boot disk, see if it is on C: after you boot from the USB boot disk. If not, it may be on A: or B:

    Then you just run the DOS executable and it looks for the drives.

    I used FreeDOS for this.
     
    Last edited: Dec 10, 2010
  26. somedude1234

    somedude1234 n00bie

    Messages:
    3
    Joined:
    Dec 9, 2010
    I did the same as drescherjm, copied the executable to my FreeDOS USB flash drive, booted it up and ran the Samsung exe. I didn't even bother to remove my SSD, the executable was smart enough to only update the Samsung drives. I did actually rename the file so that it was in 8.3 format (shortened the name to HD204UI.EXE) but I don't know if that step was even required.
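
    As a side note on the 8.3 rename, here is a tiny Python sketch of the length rule (a base name of 1 to 8 characters plus an optional extension of up to 3). `is_8_3` is a hypothetical helper and only checks lengths, not which characters DOS allows.

```python
def is_8_3(name):
    """True if name fits the DOS 8.3 length rule: 1-8 char base, optional 1-3 char extension."""
    base, dot, ext = name.partition(".")
    if "." in ext or " " in name:
        return False  # more than one dot, or embedded spaces
    if dot and not ext:
        return False  # trailing dot with no extension
    return 1 <= len(base) <= 8 and len(ext) <= 3


print(is_8_3("181550HD204UI.EXE"))  # False: the base name is 13 characters
print(is_8_3("HD204UI.EXE"))        # True: 7 + 3
```

    So the name as downloaded does not fit 8.3, which is why I renamed it. Whether plain DOS would have coped through a shortened alias is something I did not test.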

    To make the FreeDOS usb, I downloaded the image from here: http://derek.chezmarcotte.ca/?p=188 and then used the dd command from a Linux system to transfer the image to a USB drive. Then I copied the Samsung EXE file over to the USB drive (which now had a FAT partition on it). Finally booted up the server with the Samsung drives from the USB flash drive and ran the exe.
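
    If dd is not at hand, the image transfer step can be sketched in Python as a plain chunked copy. This is only an illustration of what that step does: `write_image` is a made-up name, and the paths in the usage comment are placeholders, not the ones I used.

```python
import os


def write_image(image_path, target_path, chunk_size=1024 * 1024):
    """Copy image_path onto target_path in chunks, much like dd with bs=1M.

    Returns the number of bytes copied.
    """
    copied = 0
    with open(image_path, "rb") as src, open(target_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # make sure everything has left the OS buffers
    return copied


# Hypothetical usage (needs root, and overwrites whatever is on the target):
# write_image("freedos.img", "/dev/sdX")
```

    Double-check the target device name before running anything like this, since it overwrites the target without asking.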
     
  27. SB1

    SB1 [H]Lite

    Messages:
    119
    Joined:
    Jan 21, 2009
    Thanks for both of your comments. I made the FreeDOS bootable USB with UltraISO and it worked!!!
    Thanks so much.

    I can't believe the other methods weren't working. I even tried the HP_USB_Boot_Utility.exe method, and of all of them I thought that one would have worked.

    After it was done, I turned off the PC and pulled the drive right away, and I'm going to run ESTool 3.01v and see if I get the RAM Error: AJ41.
    Well, actually I used 3.01p the other day, but if this passes then I'm thinking of getting another one at Micro Center since they are $90 now. I was thinking this weekend, but 7-12+ inches of snow in Minneapolis!!! I have to check the weather; if that's really the case I can wait a while :)
    I hope I didn't screw up by pulling the hard drive after it was done and not letting it boot up again on the same motherboard. I don't know why that would matter.

    I've never had so much trouble. I've flashed DVD-RWs, CD-ROMs, motherboards and a couple of hard drives through the years, but this felt like I'd never done it before, or ever used a PC before. On both PCs the motherboard BIOS makes it easy to set the first boot drive to FDD or USB-FDD....

    Thanks again.

    I still don't understand why they couldn't have changed the firmware version number; really stupid on their part. The only way to fully test it is to do a 6-hour test, which is crazy if you have multiple drives. What if someone only has one main PC? I do commend them on doing a fast fix though.
     
    Last edited: Dec 10, 2010
  28. somedude1234

    somedude1234 n00bie

    Messages:
    3
    Joined:
    Dec 9, 2010
    Update: I finished the Samsung ESTOOL.EXE surface scan on all 5x of my drives. 6.5 hours each, one at a time, but they all came back clean.

    Hopefully by the end of the holidays I'll have completed all of my other testing, tweaking and configuration so I can put my new rig into production. At least now I have confidence in the drives themselves.
     
  29. imomonkey

    imomonkey n00bie

    Messages:
    12
    Joined:
    Jun 16, 2008
    wrong thread
     
    Last edited: Dec 22, 2010