Need advice on tape backup for large home office data (50TB+ and growing)

hikarul

Hi All,

I'm a long-time reader and first-time poster. Some posts here really helped me build my last file server. Thanks! Now I have a new problem I hope to get some advice on.

I have a very LARGE (for a home user) data collection for my personal data-mining project. It's basically financial data going all the way back, including PDFs, text files, and pictures that were available from various public sources. The total size today is about 50+ TB, spanning 20x 3TB mixed WD and Seagate drives (I fill each drive to 90%), and it's growing at a rate of 5TB/month. Compression doesn't help much as a lot of the files are PDF scans.

My home-grown Norco 4224 FlexRAID dual-parity file server is close to being completely full, and I'm about to build a second JBOD enclosure to attach to the head file server. However, given the sheer amount of data that I have (I was actually shocked when I realized how much crap I accumulated in the last two years), I'm beginning to worry that I may one day lose it all and have no way to get it back. So I'd like to design a backup/restore system in case of fire/flood/system corruption etc.

Here are a few points:
1. The data is not too important. I could theoretically get it back if I lost it, but I'd have to pay more subscription fees (which cost major $$$). The bottleneck is the download speed. It took me TWO YEARS to accumulate this much data. There is no way to speed up this process because the limit is imposed on the other side (e.g. the SEC only allows bulk downloads at night and limits speed per IP).

2. I'm a HOME user. I don't make money on these things (at least not directly, I trade stocks), so my budget is limited (<10K total). That means no SANs or expensive pre-boxed NAS solutions (e.g. Drobo). Can't afford them. The file server is off most of the time, no need for 24x7 operation. I'm the only user.

3. I live in a Condo so I can't have LOUD servers or fans running all the time.

I've only recently heard about LTO tapes. I'm surprised they still exist. But after some research, it looks like they fit my requirements for backup / archive. I'd never heard of them before and know very little about them. The media seems cheaper (~$40/1.5TB vs ~$130/3TB disk), but the drive is a large expense (~$2000 for LTO5).
I did some calcs: if I don't count the cost of HBA cards or another case/motherboard/PSU etc., the break-even between the tape and disk solutions is at ~70TB. Beyond that tape is cheaper.
If I do count all that (i.e. build another fileserver for replication and offline storage), the breakeven is ~40TB.
So economically it makes sense.
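
A rough sketch of that break-even arithmetic, using only the prices quoted in this post. The function name, the `disk_overhead` parameter, and the $1000 overhead value are illustrative assumptions. The ~70TB and ~40TB figures above presumably fold in costs not listed here (parity disks, 90% fill, and so on), so this sketch will not reproduce them exactly.

```python
def break_even_tb(drive_cost, tape_price, tape_tb, disk_price, disk_tb,
                  disk_overhead=0.0):
    """Capacity in TB beyond which tape is cheaper than disk.

    drive_cost    -- one-time cost of the tape drive
    disk_overhead -- one-time cost of extra disk-side hardware
                     (case, HBA, PSU...); hypothetical parameter
    """
    tape_per_tb = tape_price / tape_tb
    disk_per_tb = disk_price / disk_tb
    if disk_per_tb <= tape_per_tb:
        raise ValueError("tape media is not cheaper per TB, so no break-even")
    # Tape starts out behind by the drive cost (minus whatever the disk
    # solution needs up front) and catches up by the per-TB saving.
    gap = max(drive_cost - disk_overhead, 0.0)
    return gap / (disk_per_tb - tape_per_tb)


# Prices from the post: $2000 drive, $40 per 1.5TB tape, $130 per 3TB disk.
print(round(break_even_tb(2000, 40, 1.5, 130, 3)))
# Same prices with an assumed $1000 of extra disk-side hardware.
print(round(break_even_tb(2000, 40, 1.5, 130, 3, disk_overhead=1000)))
```

With the raw prices alone the break-even lands near 120TB, and any up-front cost on the disk side pulls it lower, which is the direction the two estimates above move in.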

But what are the downsides? Is this a good idea for my situation? Should I rent a bank deposit box to store all these (40+) tapes?
How is the support for these tape drives? Say I get a Quantum/HP/IBM/Dell one from Newegg, how's the support from Quantum/HP/IBM/Dell?

Also, since LTO6 just came out, should I hold out a few months to get into LTO6 instead of LTO5 right now? How fast would LTO6 drive and media prices go down?

Again, any insights comparing the two solutions for HOME users would be greatly appreciated.

Sorry for the long post.
 
How about something like this used TL-2000 tape library?

We've used them as a vendor in the past and we've been very happy with the prices. You could even get the dual LTO-5 drive and still be under $6000. It has a 24-tape robotic library in it, so just get some software to drive the backups and you are good to go. Tapes are fairly inexpensive as well.

I would probably stay away from LTO-6 for the moment since it's new, which means it will carry a premium. Plus many companies will upgrade, and you can get hardware like the above out of some datacenter pulls.
 
Thanks for the reply man. Couple more questions.
What's the purpose of a tape library? To store tapes so you don't have to manually change tapes? Sorry still new at this.
If you are storing these tapes inside the case, wouldn't you defeat the purpose of offsite backup?
Say if I only want to have off-site archive, should I just get a bare drive?
 
To answer your questions:

The purpose of the library is to handle automatically changing tapes. With the amount of data you are backing up it will need to span tapes and this library will handle changing them in and out as needed. For instance, our backup jobs span 5 LTO-5 tapes and BackupExec (or any other library aware software) just changes out a tape when the current tape hits the end. You will not be able to get all of your data on one tape and keeping them organized can be a pain doing incremental backups. The library along with the software will catalog the contents and make it easier to restore files and folders since it keeps the tape contents in a database for retrieval.

You can still do offsite backups. There is a drawer on the front where a tape can be dropped for removal. You can schedule a monthly job to drop the tapes and then you can move those offsite.

If you get a bare drive you will end up going mad with tape swapping or you will lose track of the data through incremental backups.

Also, the library has a barcode reader to help catalog tapes in case you get them mixed up. You can print these out yourself and use them as part of the cataloging. If you want to extend your offsite backups you can just eject the used tapes and put in new ones with new barcodes. The software will catalog the new barcode and index the data.

Another option for offsite backups is to set up a CrashPlan account and use their drive service to help prime your initial backup. The software will then stream the remaining data automatically, and if you ever need to do a wholesale restore you can do it from their servers or have them ship drives to you.

Realize that the tape drive is really the most expensive part of the library. Last time I checked, a SAS LTO-5 drive was still $1500 and that's just for a single tape drive. Considering the automation, the TL-2000 is a pretty good deal. At least it has been for us. We went with the iSCSI version, which I think was ultimately a bad idea since it really increased the cost. We've had this running for 2 years now and it works like a champ.

I can't remember the noise levels though. Ours is sitting inside of a datacenter so the entire place already sounds like the business end of a jet engine.

Just something to think about.
 
Wow, that makes perfect sense, thanks for such detailed information!
I'll give these guys a call and see how much it costs to ship to Canada.
It's a shame there aren't many second-hand enterprise hardware vendors in Canada.
I really envy people in Silicon Valley. :(

@Brahmzy, the cost of cloud backup can actually be very cheap. Backblaze charges $4/month for unlimited* data. * I have no way of knowing if it's actually unlimited for 50TB though, although they theoretically have a few PB of space available.

The problem is a HOME cable connection at 100 Mbps down / 10 Mbps up (which btw is already the fastest you can get in Canada, besides expensive business services). It will probably take me years to do a full backup lol. In fact, the SEC's public database *IS* the ultimate backup, just slow as hell.

Amazon Glacier / Google equivalent is actually cost prohibitive for 50TB though.
 
Oh, I don't actually need incremental backups.
The files are organized by SEC's submission dates. They are not edited in any way other than compression to zip as required.
My program unzips them on the fly in memory when using them.

I'm imagining each tape will probably store a half-year's worth of data from the SEC's database. So I'd just label them YYYY-H1 or H2 or Q1 etc.
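
A minimal sketch of that labelling scheme: group files by submission date into half-year buckets and flag any bucket that would not fit on one tape. The function names and the `(date, size)` input shape are hypothetical, and the 1.5TB figure is the LTO-5 native capacity mentioned earlier in the thread.

```python
from collections import defaultdict
from datetime import date

TAPE_CAPACITY_BYTES = 1_500_000_000_000  # LTO-5 native capacity, 1.5 TB


def half_year_label(d: date) -> str:
    """Return a label such as '2012-H1' for a submission date."""
    return f"{d.year}-H{1 if d.month <= 6 else 2}"


def plan_tapes(files):
    """Sum file sizes per half-year label.

    files -- iterable of (submission_date, size_in_bytes) pairs
    """
    totals = defaultdict(int)
    for submitted, size in files:
        totals[half_year_label(submitted)] += size
    return dict(totals)


def oversized(totals, capacity=TAPE_CAPACITY_BYTES):
    """Labels whose data exceeds one tape and needs splitting (e.g. by quarter)."""
    return sorted(label for label, total in totals.items() if total > capacity)


sample = [(date(2012, 3, 1), 900_000_000_000),
          (date(2012, 6, 30), 800_000_000_000),
          (date(2012, 7, 1), 400_000_000_000)]
totals = plan_tapes(sample)
print(totals)
print(oversized(totals))
```

Any label reported by `oversized` would need a finer split (Q1/Q2, or numbered tapes within the half-year) before writing.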
 
@Brahmzy, the cost of cloud backup can actually be very cheap. Backblaze charges $4/month for unlimited* data. * I have no way of knowing if it's actually unlimited for 50TB though, although they theoretically have a few PB of space available.

No, they aren't renting anyone 50TB of space ($2500 worth of hard disks) for $4/month, especially while they're so strapped they've been resorting to having employees, friends and family scour Costco and online sites for drives. But putting that aside, if you ever needed to restore that data, that's when it's going to cost a fortune.
 
Ya I thought it was too good to be true:)
We all know unlimited normally means throttle after 3GB:D
 
I have a very LARGE (for a home user) data collection for my personal data-mining project. It's basically financial data going all the way back, including PDFs, text files, and pictures that were available from various public sources. The total size today is about 50+ TB, spanning 20x 3TB mixed WD and Seagate drives (I fill each drive to 90%), and it's growing at a rate of 5TB/month. Compression doesn't help much as a lot of the files are PDF scans.

I was actually shocked when I realized how much crap I accumulated in the last two years

Here are a few points:
1. The data is not too important.

2. I'm a HOME user, I don't make money on these things (at least not directly, I trade stocks)

Should I rent a bank-deposit box to store all these (40+) tapes?
Most of your data is crap and you should not be downloading it in the first place. No need to back it up.

If you use it for trading stocks, you only need recent information not stuff that goes back decades. I trade stocks also. Not much. I do maybe 10 sell/buy order pairs a year. But I only have 7 figures of assets. I have never had a need for SEC filings.

You should not buy any tapes. So there is no need to get a bank-deposit box.

You should delete most of your data and use the space you now have to make copies of the data that is important.

---

My wife converted our office to paperless 5 or 6 years ago. We scanned in every piece of paper we had in our files, shredded the documents, and recycled the shreds. The stuff we scanned in has never been accessed and never will be. But we have it.


5 or 6 years later I am still spending a couple of days a year shredding old documents - stuff going back to 1990. I have no idea where the documents come from. The office is too small to hold them. But we keep finding this stuff.
 
I have a very LARGE (for a home user) data collection for my personal data-mining project. It's basically financial data going all the way back, including PDFs, text files, and pictures that were available from various public sources. The total size today is about 50+ TB, spanning 20x 3TB mixed WD and Seagate drives (I fill each drive to 90%), and it's growing at a rate of 5TB/month. Compression doesn't help much as a lot of the files are PDF scans.

Um I'm intrigued. How many disks are committed for parity in your setup?

5TB per month is an insane burn through rate for home use and even in business settings I would at least do a double take.

While a tape backup solution would normally be the answer, there is so much about your setup that seems questionable. I would first see whether I could get a handle on the amount of data you are committing to disk. 3 x 20 is 60 TB, and if your data already takes up 50 TB you couldn't possibly have more than about 2 disks committed to parity.

I could be off, but I'm calculating about 3 days of backup time at the very minimum if you are using one tape drive. You're in enterprise storage territory with the number of disks and amount of storage, but you don't have an enterprise setup.
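
A rough sketch of that backup-time estimate. The 140 MB/s default is the LTO-5 native (uncompressed) transfer rate, which is an assumption brought in here rather than a number from the thread, and it ignores tape changes, verification passes, and whether the file server can keep the drive streaming.

```python
def backup_days(data_tb, drive_mb_per_s=140.0, drives=1):
    """Days needed to stream data_tb terabytes to tape at a sustained rate.

    Uses decimal units (1 TB = 1,000,000 MB), as tape vendors do.
    """
    total_mb = data_tb * 1_000_000
    seconds = total_mb / (drive_mb_per_s * drives)
    return seconds / 86_400  # seconds per day


print(f"one drive:  {backup_days(50):.1f} days")
print(f"two drives: {backup_days(50, drives=2):.1f} days")
```

At the native rate a full 50TB pass on one drive comes out nearer four days than three, so "about 3 days at the very minimum" is the right order of magnitude.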

I would re-evaluate the setup you have here. Take a look at your burn-through rate, think about how many disks you are committing to parity, and then come to a determination on what needs to be backed up.
 
@GeorgeHR, I know I have tons of crap but that's the way this data is. The purpose is NOT just trading; it's a crazy experiment to employ home-grown OCR technology on PDFs to interpret and analyze data (i.e. like I said, data-mining).

Sadly my technology is still evolving, so I cannot afford to discard the originals as my code changes each week.
Also I cannot afford lossy compression because OCR depends on high-res scans.

The data include financials, press releases, technical papers, oil well raster logs etc. going way back.
I coded automatic data-download routines to source data from all over the world.
Sadly most of the open world still only uses PDF for public filings.

@kac77, I use FlexRAID RAID-6, so 2 disks are parity. Since most of the data doesn't change, I only re-sync every week.
I only have two more slots left in the 4224. I only fill to 90% because otherwise the SBS2011 connector keeps popping up those annoying disk-full errors.
Actually I've never had a single disk fail on me in the last 2 years. The only time sh*t happened was when my PSU crapped out and burned a 2TB backup drive.

As I said, this data is not precious, it just takes insanely long to gather, that's all.
To be honest, considering the cable bill I pay, having a backup is probably cheaper than re-downloading everything.

This project has nothing to do with my day job (engineering), so I can't afford to deploy an enterprise setup.
Also those servers are too loud for my condo. (not to mention power requirements)
 
If the data isn't important then tape backup it is. It's going to take a while to back up, so do it about once a month. It's going to be cheaper than backing up to disk, so go for it.
 
Datamining and topic modeling text sets.
No deduplication.
Storing more data than CitiBank.

Any questions?
 
@paret0, well said my friend :) I'm just a mad scientist trying to conquer the world that's all.

About to put down the money for a brand new drive from Dell. I just prefer their customer service and return policies. HP and IBM are virtually non-existent in Canada.

One last question, where should I source the tapes in bulk? Can I trust eBay? What's the likelihood that tapes are damaged or demagnetized?
Also I'm confused about this WORM thing. Are LTO5 tapes re-writable at will? Say I want to update a few files on a tape, do I need to wipe it clean and re-write the entire tape (e.g. DVD-RW), or is it like a USB key where modifying just a single file is possible?

Again, thank you all for helping out!
 
I stand corrected.
 
LTO tapes are re-writable, but they are not like the normal USB drives where you replace just a single file that changes. They are not directly accessed storage like a hard disk or USB drive. This is the reason why you need some front end software (Tivoli, BackupExec, Amanda, etc.) to catalog the backups correctly so you know what file is on what tape. This is an archival medium and not a random access medium.

If you do end up getting a library (and with this much data I hope you do) then don't forget to purchase a cleaning tape. They can help with the longevity of the drive.
 
I'm using an HP LTO-4 drive [inside a Tandberg 24 Storage Library] with Win2003's built-in NTBackup + Removable Storage. For pure archival purposes the combination does the job.
But some types of backup [network] need more powerful software.
For the OP's needs, an LTO-5 drive would be the ideal solution. It usually comes with some simple archive/backup tools. 50 TB fits on roughly 40 LTO-5 tapes. A storage library is overkill for him.
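
A quick sketch of where a figure like 40 tapes comes from. The 1.5TB capacity and ~$40 tape price are the numbers quoted earlier in the thread; the `fill` factor (how full each tape is actually written) and its 85% value are assumptions for illustration.

```python
import math


def tapes_needed(data_tb, tape_tb=1.5, fill=1.0):
    """Number of tapes for data_tb, writing each tape to `fill` of capacity."""
    return math.ceil(data_tb / (tape_tb * fill))


full = tapes_needed(50)                 # every tape written to 100%
padded = tapes_needed(50, fill=0.85)    # assumed 15% headroom per tape
print(full, "tapes, about $", full * 40)
print(padded, "tapes, about $", padded * 40)
```

Filled completely, 50TB needs 34 tapes. Leaving some headroom per tape, or splitting along date boundaries, pushes the count toward 40.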
 
I would like to come back to the cloud storage thing, if you don't mind.

At the moment I have all my data (10TB) backed up to the cloud.
The service I use is clouddrive.nl (which is a Dutch provider for Livedrive).
For just 4 euro per month I'm getting unlimited storage on an unlimited number of computers with unlimited bandwidth.
And so far it has been 100% unlimited for me. (I upload at 9MB/s to their server.)

I also didn't expect it to be unlimited but so far it is :)
 
@morbid, the internet here in Canada is stupidly slow for uploads. It's unlimited 100 Mbps down / 10 Mbps (i.e. ~1 MByte/s) up. It would take me a lifetime to upload it all.
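
A back-of-envelope sketch of that upload problem, assuming the 10 Mbps uplink runs flat out around the clock with no protocol overhead (an optimistic assumption). The 50TB total and 5TB/month growth are the figures from the first post; the function names are made up for illustration.

```python
def upload_days(data_tb, up_mbps):
    """Days to upload data_tb terabytes over an up_mbps (megabit/s) link."""
    bits = data_tb * 1e12 * 8
    return bits / (up_mbps * 1e6) / 86_400


def monthly_upload_tb(up_mbps, days=30):
    """Terabytes the link can push in one month of continuous uploading."""
    bytes_per_s = up_mbps * 1e6 / 8
    return bytes_per_s * 86_400 * days / 1e12


print(f"existing 50TB: {upload_days(50, 10):.0f} days")
print(f"link capacity: {monthly_upload_tb(10):.2f} TB/month vs 5 TB/month of growth")
```

The existing 50TB alone would take well over a year, and because the link moves less per month than the collection grows, the upload would never actually catch up.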

This is the fastest I can get for non-business. Business plans require a business address and are normally capped at 1TB/month.

Also I'm paying over $120/month for this shitty service. (Shaw raised it to $195(!)/month, but my plan was grandfathered)

I really envy those of you who live in California or somewhere in Europe. You can get cheap servers and stupidly fast internet for pennies.

Oh, even if I could back up to the cloud, considering how much data I have it would cost me more to get it back from the cloud (i.e. they ship drives to me, ~24 of them @ $300+ each) than to just back it up to tape (~$40 per 1.5 TB) and store the tapes in a safe somewhere.

BTW, does anyone know the best way to store ~40 LTO5 tapes? In multiple carry cases? Is there a moisture requirement?
 
I would like to come back to the cloud storage thing, if you don't mind.

At the moment I have all my data (10TB) backed up to the cloud.
The service I use is clouddrive.nl (which is a Dutch provider for Livedrive).
For just 4 euro per month I'm getting unlimited storage on an unlimited number of computers with unlimited bandwidth.
And so far it has been 100% unlimited for me. (I upload at 9MB/s to their server.)

I also didn't expect it to be unlimited but so far it is :)

lol for 4$ a month i would pretty much assume the data will be gone at any moment in time.
 
I understand what you are going through.
I have 40TB myself and always think about tape as a backup, but negatively.

I use bluray disks for the very important stuff so I can place them into a spindle and store them in a separate location. Discs are cheap and I can reload the data quickly, if needed. Plus, Fry’s carries drives and discs. If your tape drive dies or you run out of tapes, will you be able to find parts easily?
For the remainder of the data, I store on separate hard drives and offline them when the backup is complete.
I've worked in tech for fifteen years. I’ve experienced so many problems with tape that I would never suggest it even if some technology today makes it appear better.

If you expect to spend over $3,000 on an aged tech like tape, you should talk to a storage vendor.
A very low-end DataDomain starts around $3,000 and could probably dedup all of your data easily. This is a hard-drive-based backup system. Many storage vendors do block-level dedup, making files very easy to dedup. I’ve seen crappy dedup numbers on large video streams but it was still half or a quarter of the original size. I have some databases and file servers deduping at 52x. Some home-grown ZFS systems can dedup.
For dedup, think of it as worst case, 2x would be 40TB of data on 20TB worth of hard drives. 4TB HDs for $199, $1000 for 20TB of hard drives and a Dedup capable OS could save you some money.
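
A small sketch of that sizing arithmetic as a function of the dedup ratio. The 40TB, 4TB and $199 figures come from this post; the function name is hypothetical, and a ratio of 1.0 means dedup saves nothing, which is possible for already-compressed data.

```python
import math


def dedup_disk_cost(data_tb, dedup_ratio, disk_tb=4, disk_price=199):
    """Return (stored TB, disks needed, total disk cost) for a dedup ratio."""
    if dedup_ratio < 1:
        raise ValueError("dedup ratio cannot be below 1.0")
    stored_tb = data_tb / dedup_ratio
    disks = math.ceil(stored_tb / disk_tb)
    return stored_tb, disks, disks * disk_price


print(dedup_disk_cost(40, 2))   # the 2x case described above
print(dedup_disk_cost(40, 1))   # no savings at all
```

The result is very sensitive to the ratio, so it is worth measuring dedup on a sample of the real data before budgeting around it.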

This is just one man’s opinion, but please do not go tape and do not use an internet service. Tape will die, internet takes forever, and if your online backup company dies, so does your data. It’s your data, own it and keep it local so you can move with it. I have a few friends that have lost data to lightning or a home fire. A failed restore is harder to accept than the initial data loss.
 
For dedup, think of it as worst case, 2x would be 40TB of data on 20TB worth of hard drives. 4TB HDs for $199, $1000 for 20TB of hard drives and a Dedup capable OS could save you some money.

The worst case is dedupe does not save any space at all like in the case where all the data is mpeg video (where you will not expect much duplication and you will not gain from compression either). Although that will be highly dependent on your data.

If you expect to spend over $3,000 on an aged tech like tape, you should talk to a storage vendor.

Remember that tape has several advantages over adding additional raid servers as a backup.
 
@nicholasfarmer
I did try to use 25GB Blu-rays for backup 1 year ago. They are somewhat competitive compared to HDDs.
Each 25GB disk is about $1 when bought in spindles.
50GB disks are still way too expensive @ $4 each.

Problem is that the sheer number of disks adds up so quickly and it takes FOREVER to burn.
Each disk takes over 20 minutes to burn and verify, so to do them all would take a whole MONTH 24x7.
Also the # of disks is 50TB / 25GB = 2000 disks!!
Imagine switching the disks manually two-freaking-thousand times! You can't pay me enough to do that. :D
Shit, my Blu-ray movie library is smaller than this (~300 titles including TV series, not too sure how many disks), and my GF complains that it takes up too much shelf space.
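
The disc count and burn time above can be checked with a short sketch. It uses the numbers from this post (25GB discs, 20 minutes per burn and verify, about $1 per disc) and assumes a single burner running non-stop with instant disc swaps, which is optimistic.

```python
import math


def bluray_plan(data_tb, disc_gb=25, minutes_per_disc=20, price_per_disc=1.0):
    """Return (disc count, days of continuous burning, media cost)."""
    discs = math.ceil(data_tb * 1000 / disc_gb)
    days = discs * minutes_per_disc / (60 * 24)
    return discs, days, discs * price_per_disc


discs, days, cost = bluray_plan(50)
print(f"{discs} discs, {days:.1f} days of burning, ${cost:.0f} in media")
```

That confirms roughly 2000 discs and close to a month of round-the-clock burning for 50TB.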

ZFS is a definite NO. I tried it, but it's too much hassle for a backup system that's offline almost 24x7. There is absolutely no benefit.
Also most of the files are PDF and JPG scans, so dedup does not save any additional space; they're already compressed formats. This data isn't your typical corporate Word documents, spreadsheets and emails.
Finally, the backups need to be offsite. Who knows when my house will catch on fire? (knock on wood).

The way I look at it, if Amazon Glacier uses LTO6 tape, there has to be a reason, although they have an enormous robotic tape library. They must have done the study and determined tape is the optimal media for strictly "backup-only" data access, in terms of power, hardware, scalability, storage, manageability, etc.
 
One more thing for you guys to comment on:

Looks like Quantum is pushing out their LTO6 tape drive (Quantum TC-L62AN-EY Half Height) in early 2013, for about $2000ish (there are already eBay sellers for it).

Compared to a typical new LTO5 drive at about $1500+ today, that is only $500 more, and it can take advantage of LTO6 tapes. An LTO6 drive can also read and write LTO5 tapes anyway.

The way I look at it, I should just spend the extra $500 to be future-proof, but for now use the cheap LTO5 tapes to do the initial backup. I see a lot of people advocate doing a double backup, so I might do that on LTO6 tape when the prices go down to nominal levels ($110 -> ~$80 each) in the future (if I feel like it). When you have that much data you can get really paranoid. :eek:
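
A small sketch for comparing the media on a per-TB basis. The tape prices are the ones mentioned in this thread; the 2.5TB native capacity for LTO6 is an assumption taken from the LTO-6 specification, not from the thread, and the function names are made up for illustration.

```python
LTO5_TB = 1.5   # native capacity quoted in the thread
LTO6_TB = 2.5   # assumed native capacity from the LTO-6 spec


def cost_per_tb(tape_price, capacity_tb):
    """Media cost per native terabyte."""
    return tape_price / capacity_tb


def parity_price(ref_price, ref_tb, new_tb):
    """Price a larger tape must hit to match the reference tape's cost per TB."""
    return ref_price / ref_tb * new_tb


print(f"LTO5 @ $40:  ${cost_per_tb(40, LTO5_TB):.2f}/TB")
print(f"LTO6 @ $110: ${cost_per_tb(110, LTO6_TB):.2f}/TB")
print(f"LTO6 @ $80:  ${cost_per_tb(80, LTO6_TB):.2f}/TB")
print(f"LTO6 matches LTO5 at about ${parity_price(40, LTO5_TB, LTO6_TB):.2f} per tape")
```

Under those assumptions LTO6 media stays more expensive per TB than $40 LTO5 tapes even at $80, so starting on LTO5 media in an LTO6 drive is the cheaper route for the first full backup.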

What do you guys think?
 
A bit of a resurrection, but...

If it helps any, I have a much smaller library for our multimedia (movies / TV), MMO recorded video backups (mostly near-lossless re-encoded FRAPS files), and our DSLR RAW images (each around 30 MB apiece), totalling around 22 TB as of this post. For that, I took our second HP ProLiant DL180 G6, synchronized it, and put it into the datacenter where my other HP server is colocated. You can use another system, or possibly multiple systems, to do offsite backups, and then you only have to sync the newest data to the remote server. There are larger servers out there that support more drives than you specified, or you can use multiple servers. With the right datacenter, you can possibly colocate the server for under $100 with an unmetered 10 or 100 Mbit line (unmetered Gbit lines are never that cheap), and it can even be fairly close by so you can physically retrieve the backup data in case of an emergency.

Along with that, you can use the bandwidth to download additional data.
 
Sorry for contributing to the thread necromancy...

From almost 20 years in IT, LTO is fantastic IMO and, depending on media quality and storage conditions, can archive for up to 30 years. I seriously doubt writeable Blu-ray lasts that long. Another thing that will surprise you is just how fast LTO is: well north of 100MB/sec for the more modern incarnations.

Anyway, that aside, why don't you just OCR the stuff once and store the OCR text? I get wanting to keep the original documents in case something is missed, but when you're getting into enterprise-size storage requirements you may want to consider making choices...
 