Cheapest and Most Practical Technique for Archiving Large-ish Amounts of Data

lemonZest

n00b
Joined
Jun 10, 2014
Messages
1
I have roughly 25 TB of mostly incompressible data (videos and pictures) that is currently growing by about 8 TB per year, and that growth rate keeps rising. All of it is personal data of which I hold the only copy, so data integrity (i.e. being able to read the data back years later with no non-recoverable read errors) is paramount. Internet storage is not a great fit because my upload speed is terrible, only about 3 Mbps. I do not need fast access, so offline/cold storage is acceptable, but I absolutely do need to be able to read this data back without non-recoverable errors for decades to come.

I have a few ideas so far, but all of them have some pretty hefty drawbacks:
1. optical media: all accounts say this isn't great for archiving since optical discs decay in about 5 years.
2. large RAID with ZFS: each expansion of the array would require a rather hefty investment. It also requires electricity to keep running and I'd have to routinely keep replacing disks over the years to avoid data loss. I'd also have to buy additional hardware for this setup and internal SATA ports are actually quite expensive.
3. tape media: although tapes have the cheapest price per GB of all the methods I've come across, newer tape reader/writers for LTO5/6 are crazy expensive. I've also read about the concern of reader/writers failing and not being able to get a compatible replacement unit. Also, it appears that the tape medium itself is very fragile and would require a special storage environment to best ensure the tapes stay readable in the future.

Each also has some unique benefits:
1. optical media: Smaller size allows me to more accurately concentrate additional redundancy if I so choose.
2. RAID: more or less real time status of the integrity of my data.
3. tape media: absolutely the cheapest option once my data pool gets above a certain size.

I'd also probably use par2 files to help fight against things like bitrot.

I seem to be in this weird spot between typical home users with only a handful of TBs of data to worry about at most and enterprises who have lots more data than I, but also have a lot more money than I to spend.

In short, I'm looking for the cheapest solution that can guarantee that I can recover all of my data 10, 20, 30+ years down the road without any non-recoverable corruption. I'd also like to favor higher upfront costs for reduced recurring costs, provided that it is reasonable and actually more economical.

If you have some alternative solutions in mind, please post a rough estimate of upfront and recurring costs.
 
When comparing the costs, keep in mind that independent storage of redundant copies and periodic cloning to new media will be needed. So you don't just need 25+ TB of storage, you need at least 50+ TB, possibly more depending on the importance of the data. I also wouldn't trust anything to store data for decades without periodic transfers of the data to new media, so keep that in mind too.
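To put rough numbers on that, here is a minimal sketch of the raw capacity needed over time. It assumes the figures from the first post (25 TB now, 8 TB per year) and constant growth, which is optimistic since the original poster says the rate is still rising; the function name and the `copies` parameter are just illustrative.

```python
def required_capacity_tb(years, initial_tb=25.0, growth_tb_per_year=8.0, copies=2):
    """Raw capacity needed after `years`, assuming constant yearly growth.

    The defaults come from the thread; constant growth is an assumption,
    so treat the result as a lower bound.
    """
    return copies * (initial_tb + growth_tb_per_year * years)


if __name__ == "__main__":
    for years in (0, 5, 10):
        print(f"after {years:2d} years: {required_capacity_tb(years):.0f} TB raw")
```

With two copies this works out to 50 TB today, 130 TB after 5 years, and 210 TB after 10 years, before counting any extra parity or spare media.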
 
If you want to keep this data offline, you'll want AT LEAST 2 copies of everything AND you'll want to read it back every few years or so. No storage medium you can buy on a consumer budget will keep your data intact for decades without maintenance.

I think your best bet is large RAID arrays with huge disks, and just spinning the drives down. A spun-down 4TB drive shouldn't use more than a few watts, you can scrub it as often as you like, and you don't need to mirror everything. You can't create automatic parity with tapes or optical media like you can with RAID-Z/RAID5, so you'll basically have to mirror all your data with those types of storage.
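For anyone wondering why parity means you don't have to mirror everything, here is a toy sketch of single-parity XOR, the basic idea behind RAID5. It is a simplification (real RAID5 and RAID-Z stripe data and rotate parity across disks), and the function names are made up for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"disk", b"data", b"here"]   # three "disks" holding one block each
    parity = xor_parity(data)            # a fourth "disk" holds the parity
    # Pretend the second disk died and rebuild it from the rest.
    rebuilt = rebuild_missing([data[0], data[2]], parity)
    print(rebuilt)  # b'data'
```

One extra disk's worth of parity protects against any single disk failing, instead of doubling the disk count as a mirror would.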
 
If you don't use ZFS, you should run a manual SHA-256 checksum over all files every week or so to ensure data integrity. (BTW, ZFS checksums every block automatically and verifies them with the "scrub" command; the default algorithm is fletcher4, and SHA-256 can be selected with the checksum property.) ZFS has been shown to be safe in research (read the Wikipedia article to see all the different research papers); other solutions have not been validated by researchers in the same way.
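A manual checksum pass could look something like the sketch below. It only uses Python's standard library; the function names and the idea of keeping the manifest as a dict (which you would save somewhere safe, e.g. as JSON next to each copy) are assumptions for the example, not a standard tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large videos never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(relative)
    return damaged
```

Build the manifest once when you write the archive, then run `verify` on each copy at every check. Note this only detects corruption; repairing it still needs a second copy or parity data.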

If you are concerned about lots of spinning disks, you could use snapraid/flexraid/unraid/etc. on top of ZFS: put ZFS on each individual disk, then pool the disks together with the snap/flex/unraid solution. ZFS will always detect data corruption, but it cannot correct it on a single disk (unless you set "copies=2"), so you will need to rely on the snap/flex/unraid correction capabilities, and those solutions are not proven safe by researchers. Researchers at CERN say that even very expensive storage solutions are not able to detect/correct data corruption, which is why CERN has switched to ZFS to get full data protection. All these research papers are linked from the Wikipedia article on ZFS, just read it.
 
8TB is a couple thousand hours of video. I would get rid of most of it. Do a lot of editing.

Buy 4TB hard drives for backups. Store them in a bank safe deposit box. Make a new backup every year.

Cost for the box about $25/year. Cost for hard drives $1000/year.
 
My media collection is 30TB and I have it in two places in my basement. External drives are likely the cheapest; you can get 5TB Seagates now and I am sure 6TB are just around the corner. Fill them up and put them away until the following year, when you can update them and test them.
 
I think that tape is actually your best bet. It is like $2-3k for the LTO 6 drive and then like $50-60 per tape (6.25TB is the compressed rating; native capacity is 2.5TB, which is what you should plan on for video and pictures). If you really are going for ~8TB a year increase for some time then tape will be your easiest upkeep (tapes last 15+ years), be much cheaper per TB as space grows, and have lower power cost. You will end up saving at least $200 a year just on media, not counting any controller cards or power costs. Tapes are designed for long term storage whereas HDDs are not.
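Here is a rough sketch of the tape arithmetic. LTO-6 is rated at 2.5 TB native and 6.25 TB at 2.5:1 compression, and videos and pictures barely compress, so the native figure is the safer planning number. The prices are midpoints of the ranges quoted in this post ($2,500 drive, $55 per tape), and the two-copy default follows the earlier advice in the thread; all of these are assumptions you should replace with real quotes.

```python
import math


def tapes_needed(data_tb, tape_capacity_tb):
    """Whole tapes required to hold data_tb on one copy."""
    return math.ceil(data_tb / tape_capacity_tb)


def tape_total_cost(data_tb, tape_capacity_tb, copies=2,
                    drive_cost=2500.0, tape_cost=55.0):
    """One drive plus enough tapes for `copies` full copies (assumed prices)."""
    return drive_cost + copies * tapes_needed(data_tb, tape_capacity_tb) * tape_cost


if __name__ == "__main__":
    for label, capacity in (("native 2.5 TB", 2.5), ("compressed 6.25 TB", 6.25)):
        cost = tape_total_cost(25.0, capacity)
        print(f"{label}: {tapes_needed(25.0, capacity)} tapes per copy, ${cost:.0f} total")
```

For the current 25 TB that is 10 tapes per copy and about $3,600 at native capacity, versus 4 tapes per copy and about $2,940 if you trusted the compressed rating, so the drive dominates the upfront cost either way.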
 
As someone who has managed thousands of enterprise tapes during the years, I'll just inform you of this: Tapes fail. Data become corrupt.

Difference between corrupt data on a disk and corrupt data on a tape? The disk is usually part of a RAID array, which means that there is parity data, which means the corruption can be repaired. On a tape? Your data is gone, better make sure you have an extra copy.

So yes, a simple $$$/TB comparison will put tapes at the top of your bang-for-the-buck list, but unfortunately, it's not that simple.

Tapes are a pain in the behind. The drive stations are insanely expensive, they require frequent cleaning, the data on the tapes is difficult to manage unless you have enterprise $oftware assisting you, your data is offline, they have horrible seek times (doesn't matter if you're just using it for archiving).
 
Amazon Glacier is set up for precisely what you want: cheap and copious storage that you don't need access to immediately. So long as you aren't downloading a large portion of your data regularly, it's cheap and you won't have to worry about reliability. As for your low upload speeds, you can mail pretty much any sort of media to Amazon and they'll process it and send it back. My suggestion would be to get a USB external hard drive and send a few TB at a time.
 
If you're going to use tape for archiving...

DO NOT use compression, and do not back up compressed files to tape.

Reason why?

If you just plain stream the files to tape using tar (any Unix variant), then even if a tape breaks you will be able to stream the data off the tape up until the break, and also after it, because it's just reading back a stream of data (i.e. your files). So you will get back the bulk of the files.

The only corruption will be whatever file overlaps the physical break in the tape media.

Compare that to having to recover one big compressed archive from the tape, which you naturally can't do completely with a break in the tape media.

E.g. if you remember the old days of floppy drives (maybe?), you had two ways to back up your data to floppies:

Option 1. Copy files till the floppy is full, insert the next floppy, rinse, repeat.
Option 2. Zip up all your files into one big zip file, then write that zip file across (usually fewer) floppies.

To recover files from Option 1:
Insert a floppy and read back the files, eject, insert the next floppy, rinse, repeat.
If you encountered a BAD floppy, you still had all the files/data on the remaining floppies to work with.

To recover files from Option 2:
Insert the 1st floppy, start the unzip process, eject, and insert the next floppy in the ZIP set.
If you encountered a BAD floppy, you could kiss your data goodbye.
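If you would rather script the uncompressed tar stream than type the command by hand, a minimal sketch with Python's standard `tarfile` module might look like this. The `"w|"` mode writes a plain, uncompressed stream and does not need a seekable destination; pointing `destination` at a tape device (something like /dev/nst0 on Linux) is an assumption about your setup and is not tested here. The function names are made up for the example.

```python
import tarfile


def write_plain_tar(paths, destination):
    """Stream files into an uncompressed tar archive ("w|" = no compression)."""
    with tarfile.open(destination, mode="w|") as archive:
        for path in paths:
            # Each file is stored as its own header plus raw data, back to back.
            archive.add(path)


def list_members(source):
    """Read the stream front to back and return the stored member names."""
    with tarfile.open(source, mode="r|") as archive:
        return archive.getnames()
```

This only shows how to write and list the stream; it does not demonstrate recovery past a damaged section, so test that on a scratch tape before trusting it.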
