Tips for creating and maintaining offline backup?

SirMaster

Hey guys I finally decided to create a real backup of my ZFS array and I have opted for an offline backup for a few reasons.

By offline I mean individual hard drives sitting in a drawer.

  • It's "cheap".
  • They will be powered off nearly all the time so they will last longer.
  • I will keep them at work so they aren't at home where my server is.

Anyway, I get the drives in a couple of days, and I am trying to plan out the best way to transfer my array to them initially and the best way to manage them going forward.

Here are some of the ideas I have come up with so far. I am looking for suggestions or tips to improve on them. Maybe some of you have done something like this before, or just have some good ideas or good software to use.

Pre-info
My array is a 20TB usable ZFS pool with currently 13TB used.

I have 3x 4TB drives and 1x 3TB drive to use as backup, so 15TB of space.

Part 1: Initially filling my drives.

So the problem I see is that I have lots of data in few folders. For example, my Movies folder is about 9TB alone. How should I easily copy this to multiple individual disks?

My current idea is to just start rsyncing the first directory (Movies) until the drive is full. Then output the list of files that copied successfully into a text file, and feed that as an --exclude-from list into the next rsync command for the next disk. Then once Movies is done, move on to my 3TB TV Show folder, and so on.
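Concretely, I'm picturing something like this (completely untested sketch; /tank and the /mnt/backupN mount points are just placeholders, and it assumes filenames without rsync wildcard characters):

# Copy Movies to backup disk 1 until it fills up (rsync will eventually stop
# with "No space left on device"); clean up the partial last file it leaves.
rsync -a /tank/Movies/ /mnt/backup1/Movies/

# List everything that actually made it onto disk 1, anchored to the transfer
# root so rsync treats each line as an exact-path exclude.
(cd /mnt/backup1/Movies && find . -type f | sed 's|^\./|/|') > /root/backup1-movies.txt

# On the next disk, copy the rest of Movies, skipping what disk 1 already has.
rsync -a --exclude-from=/root/backup1-movies.txt /tank/Movies/ /mnt/backup2/Movies/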

Think this will work well, or do you see a better way to do it?

Part 2: Keeping the backup updated.

One piece of information to keep in mind is that my ZFS pool is 90% media, so I very rarely delete or modify files. 90%+ of the time changes to my data are simply file additions.

I was thinking of updating my backup monthly, or possibly more often. Once I finish my initial backup, I am going to take a zfs snapshot. Then, when it's time to update the backup, I will take another zfs snapshot and run zfs diff to quickly and easily tell me the file differences. Again, 90% of these would be additions, so I would just manually copy those files to the backup disk that has room. I was going to do this by hand, but perhaps I can parse the zfs diff output to a file and then create a script that rsyncs each file to a disk of my choosing. Then for any file deletions or modifications (the rare case), pop in the other backup disks and make those changes.
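If I do end up scripting it, I'm imagining something along these lines (rough sketch; the pool name, snapshot names, and mount point are made up):

zfs snapshot tank@backup-2014-01          # taken right after the initial backup
# ...a month later...
zfs snapshot tank@backup-2014-02
zfs diff tank@backup-2014-01 tank@backup-2014-02 > /root/diff.txt

# Copy every addition (+) to whichever backup disk I pick; deletions (-) and
# modifications (M) are rare enough to handle by hand on the other disks.
grep '^+' /root/diff.txt | cut -f2 | while IFS= read -r path; do
    # the /./ marker tells rsync -R to recreate only the part of the path after /tank
    rsync -aR "/tank/./${path#/tank/}" /mnt/backup2/
done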

I'm OK with this process and I don't think it will take too long to do once a month or so. But again, maybe there's a much easier way I'm missing.

Part 3: Extra data maintenance

I was also thinking that since my backup isn't going to have any redundancy or parity, I should "scrub" my backups periodically. I was thinking of using SFV or some similar tool where I can just tell it to checksum all the files on my disks, and it maintains a text file on the disk with all the checksums. This way, if a file becomes corrupted or bit-rotted, or a sector needs to be remapped, I will see which file didn't pass the checksum and can restore it fresh from my array.
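Instead of a dedicated SFV tool, something as simple as sha256sum might do the job. Just a sketch, run from the root of each backup disk (the mount point and list filename are placeholders):

# Build/refresh the checksum list for everything on the disk.
cd /mnt/backup1
find . -type f ! -name CHECKSUMS.sha256 -print0 | xargs -0 sha256sum > CHECKSUMS.sha256

# Later, to "scrub": re-read every file and compare it against the list.
# Only mismatches and read errors get printed.
sha256sum -c --quiet CHECKSUMS.sha256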


Part 4: Random thoughts

I was thinking of using NTFS as the filesystem for my backup disks. I want them to be readable anywhere (mainly Windows and Linux), and NTFS has built-in compression that works in both places. That is, unless you guys see a compelling reason not to use NTFS and have something else you'd suggest instead. Maybe a different filesystem would make the ideas above easier?

Thanks.
 
I broke my media into 1 TB size chunks, e.g. Movies-1, Movies-2, TV-1, TV-2, etc. Then in SyncBackSE, I create profiles that group these folders together depending on the sizes of my backup drives. I rotate a local backup and off-site backup monthly, and every couple of months, I turn on file contents comparison to make sure the backups are still fully readable and consistent with my local data. I use NTFS for all my drives. All are single partition and TrueCrypted. You mentioned NTFS compression, but media files are already compressed, and I wouldn't use it.
 
Well, I do have at least 1TB that I think will compress pretty well, so I figured I'd turn it on. At worst it slows down performance a tad, which I thought would be OK.

I've seen a lot of people say to turn on transparent compression on your filesystem these days; usually it's LZ4. NTFS uses LZ77-based compression, which I read is pretty good too. Since NTFS compression is per folder/file, perhaps I'll just enable it for the folder containing my non-media data.
 
Did you ever find a solution to this? I'm currently just mirroring each drive, but I started out doing a copy and then running MD5Hash on the files; it took forever to compare, and adding the new files to the original md5 file wasn't fun. However, I just got in a few 4TB drives, so now I'll use the original 2x 2TB drives as backup.

I could swear someone wrote a program and posted it to the hard forums here that kept a database of all the checksums of your backup files, but I can't for the life of me find it now.
 
I use rsync on my ZFS server. Every day it runs a basic rsync, and then once a week it runs a full checksum-based rsync. The checksum takes a long time, which is why I only run it once a week. I have three drives in backup rotation, one stays on the machine. They are USB 3.0, which occasionally has some issues, but for the most part it works OK.
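The two jobs boil down to roughly this (paths are placeholders; -c is what makes the weekly run so slow):

# daily: quick pass, decides what to copy from size + mtime
rsync -a /tank/ /mnt/usb-backup/

# weekly: -c reads and checksums every file on both sides
rsync -ac /tank/ /mnt/usb-backup/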
 
My current solution: I came across a free computer and stuck all my backup drives in it.

Since they are all inside one PC, I use a software pooling program to pool the disks together. This way I can do a single compare for the update sync from my large ZFS pool.

I also opted to stick one extra drive in the backup PC to use as parity with SnapRAID, so at least there is a little redundancy in case a drive should fail during a mass restore of my ZFS array.

SnapRAID also maintains checksums, and I run a check every month.
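For reference, the SnapRAID side is basically just the standard commands, run after each backup sync and then monthly for the check:

snapraid sync    # update parity and record checksums for the newly added files
snapraid check   # monthly: re-read the data and verify it against checksums and parity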

Otherwise, my always-on ZFS server boots my backup server via wake-on-LAN once a week; when the backup server powers on, it syncs its pool with my ZFS array, updates the SnapRAID parity, and then powers off until the next week.

I'm pretty happy with the setup now.
 
Yeah, there is really no other way to do this; single drives are a pain to manage. Now if only I could lay my own fiber from my home to another place where I would keep the backup server!

I'm still in single drive mode and my plan is another OI/OmniOS machine with data backed up with ZFS send.
 
I back up my data to two locations. The first is a smaller HP Microserver that I keep upstairs, while my large array is in the basement. The second is my brother's house. He has a small Microserver as well that he uses as his main NAS, and he has afforded me some space on it. We have a VPN tunnel between our homes and I run backups across it.

My biggest concern is my digital photography. If my array failed and I lost my music and movies it would suck, but I could re-download everything for the most part. The pictures, not so much.
 
Yeah, I wish I could keep my backup array remote, but it's just too much data to realistically manage across the Internet.

All the really important stuff I do have backed up a third/fourth time remotely, at my parents' house and on CrashPlan, but that's only a couple hundred GB.
 
I do offline drive backups also. I run the primaries as single-disk zfs pools and use mhddfs to get a unified view. Then I rsync the disks 1:1 monthly. mhddfs fills the disk with the most free space, so most of the changes end up on one disk. I also zfs scrub both the primary and backup disks to verify integrity and exercise the heads.

I use a 3.5" drive dock for the backups; it works really well.

Think of it as a broken mirror.
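The moving parts look roughly like this (sketch; the pool names and the union mount point are placeholders):

# pool the single-disk zfs mounts into one union view
mhddfs /disk1,/disk2,/disk3 /storage -o allow_other

# monthly: mirror each primary disk 1:1 onto its matching backup disk in the dock
rsync -a --delete /disk1/ /bdisk1/

# scrub both sides now and then (each disk is its own pool)
zpool scrub disk1
zpool scrub bdisk1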
 

I thought about using individual ZFS disks, but all the pooling software on Linux always seemed terrible.

How's your performance with mhddfs?

I only just set up my backup server and I opted for Server 2008 R2 with NTFS disks and am using FlexRAID pooling. I get over 100MB/s from my file copies across the network into the pool.

I was under the impression mhddfs would be a lot slower. And AUFS seems to not be as smart about where it puts the files and what happens when a disk is full.

I'm certainly open to switching to something else like ZFS+MHDDFS if performance is good enough.

To be clear, I already do use ZFS in RAIDZ2 on my server for the main array. I was just talking about my backup.
 
I have no problem maxing out gigabit Samba with mhddfs. The processor is a quad-core Ivy Bridge i5, but it doesn't get hit too hard. I actually also have zfs on top of a LUKS encrypted layer. It works great; you should try it out to see if it meets your performance needs. The disks are WD Red 3TB and I typically see 100-150MB/s.

Another nice thing is you can still access the direct drive mounts if you want to avoid any mhddfs overhead for a specific scenario.

I ran a quick test with two 1GB ram disks in an mhddfs pool and ran some very rudimentary benchmarks.
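The setup was along these lines (reconstructed from the mounts below, so treat it as a sketch):

mkdir -p /t1 /t2 /tmptest
mount -t tmpfs -o size=1G tmpfs /t1
mount -t tmpfs -o size=1G tmpfs /t2
mhddfs /t1,/t2 /tmptest -o allow_other
cd /tmptest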

tmpfs 1.0G 0 1.0G 0% /t1
tmpfs 1.0G 0 1.0G 0% /t2
/t1;/t2 2.0G 0 2.0G 0% /tmptest

time sh -c "dd if=/dev/zero of=test1 bs=1M count=1024 && sync"
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.969309 s, 1.1 GB/s

real 0m0.994s
user 0m0.003s
sys 0m0.176s

echo 3 > /proc/sys/vm/drop_caches
free
total used free shared buffers cached
Mem: 16244000 1429820 14814180 0 724 1114892
-/+ buffers/cache: 314204 15929796
Swap: 0 0 0

dd if=./test1 of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.588599 s, 1.8 GB/s

Overall it looks like there is plenty of headroom. I also ran a test creating 10,000 files with touch and deleting them, and it only took a few seconds, so it shouldn't be a problem unless you're doing time-sensitive compiling or some such.

I'm running mhddfs 1.39 with the cflags and uthash patches.
 
I split it up into different folders by type.
Then I use a script to back up sets of folders to the right removable drive.
If a folder set gets its backup drive down to 10% free, I shuffle folders around until everything fits.
I don't fill a drive past 90%.

What I use to script the copy varies by OS: xcopy with verify on Windows, cp or rsync in bash on Linux.
Something similar will work with any OS and is quick to set up.
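On the Linux side it's nothing fancier than this kind of thing (sketch; the folder and mount names are just examples):

# each removable drive gets a fixed set of folders
for d in Movies-Action Movies-Comedy; do
    rsync -a --delete "/data/$d" /mnt/backupA/
done

for d in TV Documents Photos; do
    rsync -a --delete "/data/$d" /mnt/backupB/
done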
 
Haha this would absolutely not work for me. My movies folder is about 8TB alone.
 
FWIW, there is a pretty crappy script I posted here long ago:
http://hardforum.com/showthread.php?t=1366907

If nothing else the method it uses at least works:
- use 'find' on your array to get a list of all files and their sizes
- split this big list up into smaller lists, one for each backup drive
- every time you attach a backup drive, use rsync to sync just the files on the smaller list that is associated with that backup drive.
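A stripped-down version of that flow (sketch only; the drive size, paths, and list names are placeholders, and the linked script handles the details more carefully):

# 1. list every file in the pool with its size
cd /tank
find . -type f -printf '%s\t%P\n' > /root/all-files.txt

# 2. greedily pack the list into per-drive lists that fit under the drive size (bytes)
awk -F'\t' -v limit=3800000000000 -v prefix=/root/drive '
    BEGIN { n = 0 }
    { if (used + $1 > limit) { n++; used = 0 }
      used += $1
      print substr($0, index($0, "\t") + 1) > (prefix n ".list") }' /root/all-files.txt

# 3. whenever backup drive 0 is attached, sync only its list
rsync -a --files-from=/root/drive0.list /tank/ /mnt/backup0/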

It is a little tricky to handle deleted files. Rsync's --delete option does not work if you are also giving it a list of files to sync, but see the linked script, because it uses a workaround to handle deleted files.

If you really want to mess with checksum files, which is just a huge pita, take a look at cfv. It has lots of options; I do not recall exactly whether any of them make updating lists of checksums easier.

Alternatively, instead of checksum files, just make each backup drive a single-disk zfs pool, or put btrfs on them, and let the filesystem keep the checksum data. When you want to verify data integrity you can then just run 'zpool scrub' on each backup drive. It will be no slower than using checksum files, but you won't have any lists to maintain.
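Per backup drive that's just something like this (the pool name and device path are placeholders):

zpool create backup1 /dev/disk/by-id/ata-EXAMPLE_DRIVE_SERIAL   # one pool per backup drive

zpool scrub backup1        # re-reads everything and verifies the block checksums
zpool status -v backup1    # shows the scrub result and names any damaged files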
 
Haha this would absolutely not work for me. My movies folder is about 8TB alone.
Split it into different folders: drama, comedy, chick flicks, etc., or a, b, c, d... by name.
I have done both at different times.
 
Haha this would absolutely not work for me. My movies folder is about 8TB alone.

Same scenario here: 4x 3TB drives storing mostly video data, with some pictures and audio files thrown in.

A major downside of the fancy rsync scripts is the complexity and the higher chance of human or script error. I've had three data losses in the past 5 years. One was a 1TB drive that wasn't backed up; it died and I couldn't recover 1.5GB of it. The other two were total losses during attempted backups, where a buggy script or command destroyed the source. I've learned since then to keep it as simple as possible and only back up read-only sources (zfs snapshots).
 
Good idea. Or re-mount your filesystem read-only during backup.
 