Can I streamline this backup strategy?

mrwhitethc · Weaksauce · Joined: Jul 12, 2007 · Messages: 124
I started doing this when I had 500GB of data, and now I'm up to 10TB, so I'm wondering if there's a program or a better way to do this. I have three sets of data: my HTPC data, my gaming PC data, and my wife's data, all running Windows 7 Home. This is rather lengthy, which is why I hope there's an easier solution.

Windows 7 HTPC
8TB of data across four 2TB hard drives, plus a single 500GB OS drive.
Backup is as follows:

500GB OS drive: cloned using Clonezilla. Very little changes on it, so I have the image burned to DVDs plus a copy on one of the 2TB drives.

Each 2TB drive has a clone: AV001a is online in the HTPC and AV001b is the cold backup. I keep a single MD5 hash digest at the root of each drive covering all of its files, and every time I add data I add it to the digest. Once a month I run the hashes on AV001a and verify it's good, copy the changes over to AV001b using SyncBack, run the hashes against AV001b, then copy the digest to the root of AV001b. Rinse and repeat each week with the next pair: AV002a, AV003a, etc. If something doesn't match, I know there's something wrong with the file, the drive, bit rot, etc.
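In script form, that monthly check amounts to roughly this (just a sketch, not what I actually run; the drive letters and file names are placeholders, and the digest uses md5sum-style "hash  path" lines):

```python
import hashlib
import os
import sys

CHUNK = 1024 * 1024  # read files in 1 MB chunks

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()

def build_digest(root, digest_file):
    # Write an md5sum-style "hash  relative/path" line for every file under root.
    digest_abs = os.path.abspath(digest_file)
    with open(digest_file, "w", encoding="utf-8") as out:
        for dirpath, _, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                if os.path.abspath(full) == digest_abs:
                    continue  # don't hash the digest file itself
                rel = os.path.relpath(full, root)
                out.write(f"{md5_of(full)}  {rel}\n")

def verify_digest(root, digest_file):
    # Re-hash every file listed in the digest and report mismatches or missing files.
    bad = []
    with open(digest_file, encoding="utf-8") as f:
        for line in f:
            expected, rel = line.rstrip("\n").split("  ", 1)
            full = os.path.join(root, rel)
            if not os.path.exists(full):
                bad.append((rel, "missing"))
            elif md5_of(full) != expected:
                bad.append((rel, "checksum mismatch"))
    return bad

if __name__ == "__main__":
    # e.g.  python drive_digest.py build E:\ E:\digest.md5
    #       python drive_digest.py verify F:\ F:\digest.md5
    mode, root, digest = sys.argv[1], sys.argv[2], sys.argv[3]
    if mode == "build":
        build_digest(root, digest)
    else:
        for rel, why in verify_digest(root, digest):
            print(f"{why}: {rel}")
```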

The Windows 7 gaming PC and my wife's PC both have two drives, 1TB each, partitioned at 500GB. The master drive holds only the OS and programs, and the slave holds My Documents. Then I cross-copy the data: C: copies to Z:, the second partition on the slave drive, and D: copies to X:, the second partition on the master drive. I run this once every two weeks, again using SyncBack. I don't hash these, mostly because of the amount of changes, but it would be nice to check for bit rot or file corruption. Ideally I would like to copy the master drive as an image to the HTPC so I can do a file restore, or maybe to another offline drive; I'm not sure.
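The cross-copy jobs themselves amount to something like this (a rough equivalent, not my actual SyncBack profiles; the folder paths are placeholders, and robocopy's /MIR will delete extras in the destination, so it should only point at dedicated backup folders):

```python
import subprocess

# Cross-copy mirror jobs: C: -> Z: and D: -> X:, as described above.
JOBS = [
    (r"C:\Users", r"Z:\Backup\C_Users"),
    (r"D:\MyDocs", r"X:\Backup\D_MyDocs"),
]

for src, dst in JOBS:
    subprocess.run(
        ["robocopy", src, dst, "/MIR", "/R:1", "/W:1"],
        check=False,  # robocopy uses exit codes 0-7 for success/partial success
    )
```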

The main issue is really the hashing on the HTPC. The data grows quickly with all the movies my wife and I buy and rip, 5 a month on average, sometimes more. I had thought about compressing them to save space, but that takes time, and then watching the movies a few times to make sure they compressed correctly eats up more time. I also think I need a new strategy for the gaming and wife's PCs, but I'm not sure of the best way to handle all this. Like I said, I started off just copying data from one drive to another, then figured I would hash it to make sure it was good, and it has kind of grown into a monster at this point. So if anyone has any input it would be greatly appreciated. Again, sorry this was long; I'm not sure how else to explain it.
 
The description of your setup is a little confusing, but I think it is fantastic that you are checksumming your files. I would like to do the same with my data but don't yet know an effective way to manage it.
 
(I digress: I've only recently been getting into FreeNAS with ZFS, but overall it seems light years ahead of file systems like NTFS.)

Two separate, low-power FreeNAS boxes running ZFS with ECC memory.

Each box has 4x3TB disks in RAIDZ.

Configure ZFS replication to run daily from the server to the backup box.

Set each machine to do daily or weekly ZFS scrubs. (This checksums every file and, if necessary, corrects flipped bits from parity data.)
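Day to day, the replication step boils down to something like this (just a sketch; the pool/dataset names, the backup hostname, and SSH trust between the two boxes are all assumptions):

```python
import datetime
import subprocess

DATASET = "tank/media"          # dataset on the main server (placeholder)
BACKUP_HOST = "backupnas"       # the second FreeNAS box (placeholder)
BACKUP_DATASET = "backup/media" # receiving dataset on the backup box (placeholder)

def replicate():
    today = datetime.date.today().isoformat()
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    new_snap = f"{DATASET}@{today}"
    old_snap = f"{DATASET}@{yesterday}"

    # Take today's snapshot.
    subprocess.run(["zfs", "snapshot", new_snap], check=True)

    # Send only the delta between yesterday's and today's snapshots over SSH.
    # (The very first run has no previous snapshot and needs a full
    # "zfs send new_snap" instead of an incremental.)
    send = subprocess.Popen(["zfs", "send", "-i", old_snap, new_snap],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", BACKUP_HOST, "zfs", "receive", BACKUP_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    replicate()
```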
 
No need to back up the videos. You have a hard copy of them.

Ideally, yes, but copying all of those back would take forever, especially since it's all manual: ejecting discs, copying, naming, and adding multiple subtitles. That's another reason I'm looking to streamline this process so it takes even less time.


The description of your setup is a little confusing, but I think it is fantastic that you are checksumming your files. I would like to do the same with my data but don't yet know an effective way to manage it.

Thanks. Like I said, it did start out simple but kind of grew into this monster.

(I digress: I've only recently been getting into FreeNAS with ZFS, but overall it seems light years ahead of file systems like NTFS.)

Two separate, low-power FreeNAS boxes running ZFS with ECC memory.

Each box has 4x3TB disks in RAIDZ.

Configure ZFS replication to run daily from the server to the backup box.

Set each machine to do daily or weekly ZFS scrubs. (This checksums every file and, if necessary, corrects flipped bits from parity data.)

I'm not opposed to this; my only issue is that I would now have 3 boxes instead of 1 and would incur the cost of parts. I was really hoping there was software to do this, but I will look into the cost of parts for a dual ZFS build. Thank you for the suggestion.

*hrm, I thought I had heard that you could get one of those HP home server shells with no drives or OS cheap, but I'm not seeing that. From a quick search it looks like FreeNAS 8.0.1 or later would work. Does anyone know where to find those HP EX490 shells?
 
(I digress: I've only recently been getting into FreeNAS with ZFS, but overall it seems light years ahead of file systems like NTFS.)

Two separate, low-power FreeNAS boxes running ZFS with ECC memory.

Each box has 4x3TB disks in RAIDZ.

Configure ZFS replication to run daily from the server to the backup box.

Set each machine to do daily or weekly ZFS scrubs. (This checksums every file and, if necessary, corrects flipped bits from parity data.)

Does the OP really have to worry about bit rot or even ECC, when if either of these occurred only 1 to 3 movies would have been corrupted, and the data could just be reconstructed from the originals?

I would just get a few 4TB externals when they go back on sale for $150 US and use those to back up the data. Although that doesn't help with strategy, sorry. I use a Linux-based network backup that could work well for this, but I consider that overkill as well.
 
Does the OP really have to worry about bit rot or even ECC, when if either of these occurred only 1 to 3 movies would have been corrupted, and the data could just be reconstructed from the originals?

I would just get a few 4TB externals when they go back on sale for $150 US and use those to back up the data. Although that doesn't help with strategy, sorry. I use a Linux-based network backup that could work well for this, but I consider that overkill as well.

It's true I'm a bit paranoid. This all came about after I lost a whole HD worth of data to an errant chkdsk, and since then it just became what it is. I would like to hear about your Linux solution, or even a link to more info if you know of one.
 
ZFS seems like an awesome file system but don't forget you need to verify any files you transfer between boxes.
 
ZFS seems like an awesome file system but don't forget you need to verify any files you transfer between boxes.

The best I can do is use the MD5 hashes I already have stored for the files, unless there is another way?
 
It's true I'm a bit paranoid. This all came about after I lost a whole HD worth of data to an errant chkdsk, and since then it just became what it is. I would like to hear about your Linux solution, or even a link to more info if you know of one.

I use a program called Bacula at home and at work. At work I have used it since the late 1990s, and I have run over 30,000 backup jobs in that time with very few problems. I currently have around 50TB stored in my backups at work; most of it is on my 2-drive, 24-slot LTO2 archive (yes, I know LTO6 would be better). At home (I started a year or so later than at work) I use Bacula plus the Bacula vchanger to do disk backups of my Linux-based HTPC, my source code, databases, docs, VMs, and so on. For my HTPC I do not back up my 5,000 recordings, mainly just the OS and the HTPC database. At home I use disk-based backups with 2 drives and a virtual disk autochanger provided by the Bacula vchanger (available on SourceForge). In both cases, home and work, I do daily incrementals, weekly differentials, and monthly fulls. Also, in both cases the backup is pretty much set it and forget it; although I get daily emails confirming my jobs were successful, I rarely ever have to do any maintenance.

Links follow:

http://www.bacula.org/en/
http://sourceforge.net/projects/vchanger/

I am sorry I have to be short on this; I am in a crunch for time.
 
I use a program called Bacula at home and at work. At work I have used it since the late 1990s, and I have run over 30,000 backup jobs in that time with very few problems. I currently have around 50TB stored in my backups at work; most of it is on my 2-drive, 24-slot LTO2 archive (yes, I know LTO6 would be better). At home (I started a year or so later than at work) I use Bacula plus the Bacula vchanger to do disk backups of my Linux-based HTPC, my source code, databases, docs, VMs, and so on. For my HTPC I do not back up my 5,000 recordings, mainly just the OS and the HTPC database. At home I use disk-based backups with 2 drives and a virtual disk autochanger provided by the Bacula vchanger (available on SourceForge). In both cases, home and work, I do daily incrementals, weekly differentials, and monthly fulls. Also, in both cases the backup is pretty much set it and forget it; although I get daily emails confirming my jobs were successful, I rarely ever have to do any maintenance.

Links follow:

http://www.bacula.org/en/
http://sourceforge.net/projects/vchanger/

I am sorry I have to be short on this; I am in a crunch for time.

Thank you for the links, I'll look into this. It sounds pretty awesome if you've been using it for that long.
 
I have the same concerns as the OP regarding integrity and checksum verification for my backups, so I thought I would share my setup, which does this automatically for me and seems to scale very well without additional management.

I have the following setup for my private system:

- My old file server, serving as my backup server in a server room at work. It automatically backs up all my hosts via rsync and IPsec tunnels (I have several networks).
- The backup server runs mdadm RAID6 with 8x2TB disks with monthly scrubbing. This is the first layer of protection against bit rot.
- I use the --link-dest feature of rsync, which means it creates a hard-linked, complete backup every time the backup runs, even though only the delta is actually copied (see the sketch after this list).
- Status is sent to me by mail (with a link to the logfile, accessed through the VPN tunnel).
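The snapshot step boils down to something like this (a rough sketch, not my actual script; the source path, backup root, and host name are placeholders):

```python
import datetime
import os
import subprocess

SOURCE = "backupuser@fileserver:/data/"   # placeholder rsync source
BACKUP_ROOT = "/backup/fileserver"        # placeholder snapshot directory

def run_snapshot():
    today = datetime.date.today().isoformat()
    dest = os.path.join(BACKUP_ROOT, today)
    latest = os.path.join(BACKUP_ROOT, "latest")  # symlink to the previous snapshot

    cmd = ["rsync", "-a", "--delete", SOURCE, dest + "/"]
    if os.path.isdir(latest):
        # Unchanged files are hard-linked against the previous snapshot,
        # so each snapshot looks complete but only the delta uses new space.
        cmd.insert(2, "--link-dest=" + os.path.realpath(latest))

    subprocess.run(cmd, check=True)

    # Repoint "latest" at the snapshot we just made.
    if os.path.islink(latest):
        os.unlink(latest)
    os.symlink(dest, latest)

if __name__ == "__main__":
    run_snapshot()
```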

Now for the checksumming part:

I did a lot of thinking to figure out how I could implement efficient checksumming of each backup for later verification. I implemented this by calculating the checksum of every file in the backup after each run and storing them in a dedicated checksum file per backup. This worked well for most backups, but not for my then 5TB backup of my file server (which is now 10TB). The delta backup and snapshot creation using rsync over the internet takes only seconds if no data has changed, but calculating the checksums took about 6 hours, with all my crappy Seagate 2TB LP disks working at full speed (6 of my original 8 have either died or are dying according to SMART), each and every backup.

- After a while I figured out how to reuse the checksums of files that have not changed since the last backup and only calculate checksums for new or changed files. Checksum calculation normally takes only a second or two now.
- Every Saturday I verify the backup using the checksum file. It automatically selects the newest backup of each computer and re-calculates everything. If there are any errors at all, it will notify me by mail.

I originally implemented this functionality as a bash wrapper script around rsync. I have since rewritten the wrapper script in Python.

A couple of months ago I implemented an additional feature made possible by the checksum file. Since I know the checksum of every file, it was suddenly possible to add file-level deduplication to the script after the backup has finished. It scans for identical checksums, checks that the sizes and permissions are identical, and then hard-links the files together if they are not already linked. This process normally takes less than a second. The cool thing is that this automatically carries over to the next snapshot, as --link-dest preserves the links.
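Conceptually the deduplication pass is just this (again a sketch, not the actual code; the checksum map is assumed to come from the backup's checksum file):

```python
import os
from collections import defaultdict

def dedupe(snapshot_root, checksums):
    """checksums maps a relative path to its md5 hex digest."""
    # Group file paths by checksum.
    by_hash = defaultdict(list)
    for rel, digest in checksums.items():
        by_hash[digest].append(os.path.join(snapshot_root, rel))

    for digest, paths in by_hash.items():
        if len(paths) < 2:
            continue
        keeper = paths[0]
        keep_stat = os.stat(keeper)
        for other in paths[1:]:
            st = os.stat(other)
            # Only link files that really look identical and are not
            # already the same inode.
            if st.st_ino == keep_stat.st_ino:
                continue
            if st.st_size != keep_stat.st_size or st.st_mode != keep_stat.st_mode:
                continue
            tmp = other + ".dedupe-tmp"
            os.link(keeper, tmp)     # create the new hard link first
            os.replace(tmp, other)   # then atomically swap it into place
```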

The rsync script does not rely on any metadata history, databases or the like, so there is nothing that can go to hell. If you delete something you shouldn't have, it will automatically be corrected on the next run. I have been using this for several years now, and it is extremely stable, efficient, and reliable for my purposes.
 
Hi marsboer, how do you calculate checksums only for the files that have changed? It sounds like a great setup, and I hope you can share more details. I would like to implement something like this but am not sure how or where to start.
 
The algorithm I have implemented is something like this (a rough sketch in code follows the list):

1. Create a list of the files in the current backup.
2. Create a list of the files that have changed between the current backup and the previous snapshot. The list is created by running rsync in dry-run mode, comparing the snapshots with --itemize-changes activated; that output is parsed to get the changed files.
3. Create a list of the files in the current backup, excluding the files in the list of changed files.
4. Add every file from the list in #3 that actually has a checksum in the previous checksum file to a new list.
5. Add every file from the current backup that is not in the list generated in #4 to a list of files that need their checksum calculated or re-calculated.
6. Add the newly calculated checksums to the same list as #4. This is your new checksum file.
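Roughly, the reuse logic looks like this (a sketch of the numbered steps above, not my actual script; snapshot paths and file names are placeholders, and the checksum files are assumed to live outside the snapshots):

```python
import hashlib
import os
import subprocess

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def list_files(root):
    out = set()
    for dirpath, _, names in os.walk(root):
        for name in names:
            out.add(os.path.relpath(os.path.join(dirpath, name), root))
    return out

def changed_files(prev_snap, curr_snap):
    # rsync dry run with --itemize-changes: any itemized file in the current
    # snapshot that differs from the previous one is treated as changed.
    res = subprocess.run(
        ["rsync", "-a", "--dry-run", "--itemize-changes",
         curr_snap.rstrip("/") + "/", prev_snap.rstrip("/") + "/"],
        capture_output=True, text=True, check=True)
    changed = set()
    for line in res.stdout.splitlines():
        parts = line.split(None, 1)
        # Simplified parse: keep only itemized file entries, skip directories.
        if len(parts) == 2 and parts[0][:1] in "<>ch.*" and not parts[1].endswith("/"):
            changed.add(parts[1])
    return changed

def load_checksums(path):
    sums = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            digest, rel = line.rstrip("\n").split("  ", 1)
            sums[rel] = digest
    return sums

def build_checksum_file(prev_snap, curr_snap, prev_sums_path, new_sums_path):
    current = list_files(curr_snap)                     # step 1
    changed = changed_files(prev_snap, curr_snap)       # step 2
    unchanged = current - changed                       # step 3
    prev_sums = load_checksums(prev_sums_path)
    sums = {rel: prev_sums[rel] for rel in unchanged if rel in prev_sums}  # step 4
    to_hash = current - set(sums)                       # step 5
    for rel in to_hash:                                 # step 6
        sums[rel] = md5_of(os.path.join(curr_snap, rel))
    with open(new_sums_path, "w", encoding="utf-8") as out:
        for rel in sorted(sums):
            out.write(f"{sums[rel]}  {rel}\n")          # md5sum-compatible format
```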

I also do some additional checking to verify the checksum files during this operation (to somewhat protect against programming errors and to detect tampering); see the small check after this list:
1. Make sure all lines in the checksum file are unique, and count the lines.
2. Count the number of actual files in the associated backup snapshot and compare that number with #1. If there is a mismatch, either your script logic is wrong or something has tampered with your snapshot or your checksum file.
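The check itself is tiny, something like this (paths are placeholders):

```python
import os

def sanity_check(snapshot_root, checksum_file):
    # 1. All lines unique, and count them.
    with open(checksum_file, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    if len(lines) != len(set(lines)):
        raise RuntimeError("duplicate lines in checksum file")

    # 2. Compare against the actual file count in the snapshot.
    file_count = sum(len(names) for _, _, names in os.walk(snapshot_root))
    if file_count != len(lines):
        raise RuntimeError(
            f"snapshot has {file_count} files but the checksum file has {len(lines)} lines")
```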

I make sure the checksum file created by my script uses exactly the same format as the standard Linux CLI utility md5sum. This makes it easy to verify the backups even if my script logic fails or the script gets separated from the backup files. Normally I use my script to verify the backups because of its built-in mail reporting.

These operations were very tricky to implement in bash in a way that is 100% correct but still efficient (when the file counts get high), as it required some intricate awk operations with multiple file inputs. In Python this is much easier and clearer, as the built-in set type provides all of these operations in intuitive ways.
I don't know if this answers your question, but anything more detailed than this would require the actual script code.
 
Hi marsboer, your program looks really thorough. Would you consider releasing your script? I think it would be useful to a lot of people, myself included.
 