Serious thread: Best way to store 40 petabytes

This on its own is a ridiculous idea. In 1000 years they'll store all 40 PB on a single "hard drive," or whatever hard drives are by then. You only have to worry about 50 years tops, and transfer it from there.

A single hard drive in 1000 years? Whatever they use as drives then, I doubt it is within the realm of our understanding; heck, 100 years from now I don't think we will be able to conceive of what they will be like. At this point we have 4 terabyte drives, and we achieved that in under 20 years. In 10 more we will have exponentially bigger drives, think 100 or 1000 times what we have now: 400 TB to 4 PB drives in 10 years is feasible with technology as it is now. Those drives will likely be quantum state; my guess is they will use some method of storing bit patterns in a frozen state. I could be wrong, and it could take us longer to go full quantum storage, but the density of solid state will pass spinning media at some point.
 
If you are concerned about longevity of data for 100+ years, you'll definitely want 3+ copies in geographically diverse locations around the world and a mechanism to keep all copies in sync as the data grows.
ZFS already has this mechanism. You use "zfs send | receive" to all 3+ copies, and ZFS only transmits the recent changes (much like rsync) while also automatically checking that no bits have been altered along the way. That is partly why I think that ZFS is the only solution for large amounts of data.
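If it helps to picture it, here's a minimal sketch of driving an incremental send/receive to one mirror from a script. The dataset tank/archive, the host mirror1, and the snapshot names are all made up for illustration; only the zfs and ssh commands themselves are standard:

```python
# Minimal sketch: incremental ZFS replication to one mirror host.
# "tank/archive" and "mirror1" are hypothetical names, not from this thread.
import subprocess
from datetime import datetime, timezone

DATASET = "tank/archive"                  # local dataset holding the archive
REMOTE = "backup@mirror1"                 # one of the 3+ offsite copies
PREV_SNAP = "tank/archive@2024-01-01"     # last snapshot the mirror already has

def replicate() -> None:
    # Take a new snapshot named after today's date.
    new_snap = f"{DATASET}@{datetime.now(timezone.utc).date()}"
    subprocess.run(["zfs", "snapshot", new_snap], check=True)

    # Send only the blocks changed since the previous snapshot and receive
    # them on the mirror; ZFS checksums every block it reads and writes.
    send = subprocess.Popen(["zfs", "send", "-i", PREV_SNAP, new_snap],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")

if __name__ == "__main__":
    replicate()
```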

CERN agrees too. Studies from CERN show that large amounts of data will contain randomly flipped bits, that is, data corruption. That is why CERN is switching to ZFS for long-term storage of all their data. Also, large supercomputers such as the new IBM Sequoia, which stores large amounts of data using Lustre, have modified Lustre to rely on ZFS. I suggest you read the research papers on data corruption here:
http://en.wikipedia.org/wiki/ZFS#Data_integrity
 
Oh, absolutely, ZFS would be the only way to fly for this project! But that's also assuming all the data is in online storage. My point with geographic redundancy was more to guard against war/strife and political climate changes than pure data integrity.

I'd really like to hear back from the OP at this point in the discussion.
 
The thing to understand is that the correct approach to such a project is not really about choosing a single solution with current technology.

It is about creating an infrastructure and organization that will withstand the test of time.

I would divide the problem into two pieces:

1) How to organize at least three groups of people to maintain archives in sync, in three widely geographically separated locations

2) How to systematically copy or regenerate the data to new media (including new technology) and check the integrity every X number of years (where X is probably 5 to 10, but determining X is part of the planning required)
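For point 2, the integrity check every X years could be as simple as keeping a checksum manifest alongside the data and re-verifying it during every migration. A rough sketch (the /archive path and manifest filename are my own invention):

```python
# Sketch of a checksum manifest for periodic integrity checks.
# The /archive path and manifest.json name are illustrative assumptions.
import hashlib
import json
from pathlib import Path

ARCHIVE = Path("/archive")
MANIFEST = ARCHIVE / "manifest.json"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest() -> None:
    """Record a checksum for every file; run when the archive is first written."""
    manifest = {str(p.relative_to(ARCHIVE)): sha256(p)
                for p in ARCHIVE.rglob("*")
                if p.is_file() and p != MANIFEST}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify() -> list[str]:
    """Re-hash everything and return the files that no longer match."""
    manifest = json.loads(MANIFEST.read_text())
    return [name for name, digest in manifest.items()
            if sha256(ARCHIVE / name) != digest]

if __name__ == "__main__":
    bad = verify()
    print(f"{len(bad)} files failed verification" if bad else "archive verified clean")
```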
 
It needs to be determined if this is actually a legitimate request for info or, as others have stated, a thought exercise or source of info for a senior thesis paper on massive storage requirements over time. It's already been stated that the organization requiring this is orders of magnitude older than Amazon. Granted, Amazon is not really that old (20 years?), but he is implying an organization that is hundreds or thousands of years older. There are but a handful in existence, and the most logical would be the Vatican. Even then, how could they have amassed such a large repository, and how the hell (LOL) are they storing it now? This is all pure conjecture, but in all honesty this seems a bit odd.
 
Storing digital data for a millennium is inconceivable; there are so many things that can go wrong. Unmaintained hard drives will demagnetize and degrade, and recorded media will rot: with tape and DVD/Blu-ray, the organic dye you record to decays and the disc is unreadable after 3-5 years. They say there are 300-year discs that are supposed to last that long, but making a disc that can survive time's savage beat-down is one thing; in 1000 years, would we even have a working Blu-ray player left? Likely not. The people of the future would probably have to create a device to translate the old format into whatever newer way of storing data they use.

So what would be needed is a company that can last for over 1000 years. The database of information to be saved will need to be upgraded and monitored for the full 1000 years, so as new tech comes out, the data will need to be transferred to new drives with each advancement, and its integrity will need to be monitored the whole time.

A voice of reason.

Sounds like a bunch of monks copying manuscripts.
 
The OP says the data is audio and video. That alone makes the project some type of "thought" project - not real.

As to your question "How are they storing it now?": clearly they are collecting, or are about to collect, the data. Over time they expect to collect some vast amount.

Equally important: how valuable is the information? I expect it has little value. I would just dump it as soon as possible.
 
Another place is the National Archives: lots of 200-year-old documents, and last I heard they were digitizing the archive as well.
 
Flight one of the data has little economic value but tremendous cultural value, and per contract it can never be destroyed.

Flight two of the data will be worth a lot. I don't know exact figures, and I couldn't post them on the forums if I did. But it's worth a lot.
 
I guess "an" order of magnitude older. Not quite as old as the Vatican :)
 
I think you have the right general idea. I'd guess we're no more than 20 years away from multiple petabyte or exabyte hard drives, and I'd also suggest planning for a 50 year time frame with the idea of transitioning to a better, more affordable medium as technology improves.

Hard drives would have to be replaced every ~2-3 years, maybe as much as 5; you would probably want to replace them every 2 years, full or not, failed or not, and you would also want enough spare drives on hand to just pop one in and hit rebuild as they fail. This is why I would probably go with 3-4 data centers: one in the main building where the main additions and revisions are made, and two child data centers to which backups are pushed and checked...
 
There are a few key issues with this project that I can see:
  • It sounds like you are planning it for a single 5-10 year implementation with offsite tape backups.
  • Tape backups are just that (backups). They are not for primary storage.
  • You need to think about the overall lifecycle of the data. If it is as important as you say it is, it should be online at all times. Consistency should be checked continuously. Your hardware should be from multiple vendors across multiple geographical sites. Data should be duplicated and replicated to multiple sites.
  • We need more information about the data. What is it? Structured/unstructured? Where/how is the index stored? How is the data lifecycle defined (i.e., is everything 1000-year retention)?
The short answer is that your budget of a few million per year is insufficient for the startup costs to host this yourself. I would suggest working to leverage multiple cloud vendors in conjunction with tape-based offsite backups to arrive at a continuous operational cost.

The real mind-boggling question about this project should be how to manage the metadata, indexing, and retention requirements for petabytes of data. This platform needs to be agnostic of the underlying hardware.
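To make that last point a bit more concrete, here is one possible shape for a hardware-agnostic metadata record, one per asset, kept separately from the media itself. The field names and values are entirely my own assumptions, not a standard schema:

```python
# Illustrative, assumed metadata record for one audio/video asset.
from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    asset_id: str                       # permanent identifier, never reused
    media_type: str                     # "audio" or "video"
    sha256: str                         # content checksum for integrity checks
    size_bytes: int
    retention_years: int                # e.g. 1000 for the permanent tier
    copies: list[str] = field(default_factory=list)   # site/medium of each replica
    last_verified: str = ""             # ISO date of the last successful check

record = AssetRecord(
    asset_id="A-000001",
    media_type="video",
    sha256="<hex digest>",              # placeholder value
    size_bytes=42_000_000_000,
    retention_years=1000,
    copies=["site-a:disk", "site-b:disk", "site-c:tape"],
)
```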
 
You might notice that Google has a reasonable indexing method for text. This data is audio and video.

Somehow this huge amount of priceless data has suddenly appeared.

Somehow we know only 2 PB of the data needs to be online.

Somehow we know that the data will be important in 1000 years.

Who in their right mind asks a question on the internet for a project this important and expensive?

---

Just publish the data on the internet and it will last forever.
 
That is just it, though: in 1000 years no data from today will be so vitally important that anyone will care. Humans probably won't even resemble humans as we are today; over the last 100 years the average shoe size has gone from 6 to 12, we are taller and fatter, and who knows how that will continue. And the current trend is that we will make this planet unusable in the next 200 years, so pick somewhere watertight, or high up enough that if the sea rises 30 feet your data centers won't be harmed.
 
Did you know it would take 10,008 4 TB drives and 139 of those 4U rack-mount units?
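For anyone checking those numbers, they work out if you assume 72-bay 4U chassis (the bay count is my guess; the post doesn't say which units it means):

```python
# Back-of-the-envelope check of the drive and chassis counts above.
import math

total_tb = 40 * 1000          # 40 PB expressed in TB
drive_tb = 4
bays_per_4u = 72              # assumed chassis size; 10,008 / 139 = 72

raw_drives = total_tb // drive_tb                  # 10,000 drives of raw capacity
units = math.ceil(raw_drives / bays_per_4u)        # 139 4U chassis
drives_if_full = units * bays_per_4u               # 10,008 drives with every bay populated

print(units, drives_if_full)                       # -> 139 10008
```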
 
That's 12 racks full of spinning disks and power, which is damn sure not cheap, especially with HVAC for all of these things, and then there's the likelihood of 5% of the drives failing (I sprang for extra drives just in case).
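Rough numbers behind that, assuming 48U racks and taking the 5% figure as an annual failure rate (both of which are my assumptions):

```python
# Rack count and yearly spare-drive estimate under assumed figures.
import math

units_4u = 139
rack_units = units_4u * 4                        # 556U of chassis
racks = math.ceil(rack_units / 48)               # ~12 racks at 48U, before switches/PDUs

drives = 10_008
annual_failure_rate = 0.05                       # assumed, per the 5% figure above
spares_per_year = math.ceil(drives * annual_failure_rate)   # ~501 replacement drives a year

print(racks, spares_per_year)
```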

I do not see the OP coming back with any updates or other requirements.
 