Petabyte Storage

Hello Forum,
I work for a film post-production company. At peak times we have 20 Win7 workstations and 16 render nodes accessing our Fedora/Samba3 RAID system via 1GbE. The server hits an I/O limit, mainly because of network saturation. Apart from these deadline problems, the system has been running stably for some years.

We are now planning our next storage system and love the simplicity our current solution has. But we're aiming for 1 PB by the end of 2013 and would like to add hardware on demand, starting with maybe 0.25 PB. I understand that this will give us different problems to solve than what we had before.

Looking at the market, we see systems like Isilon with virtualised storage and massive I/O. It looks impressive, but it's a bit over our budget. So we want to develop the system ourselves.

Software:
We would like to use ZFS with RAIDZ3, preferably on CentOS (if it gets stable soon). We like ZFS's caching concept with the ARC in RAM, the ZIL, and L2ARC. File sharing would be over CIFS. We have a separate AD controller (Win2008 Server) for user management.

Hardware:
Supermicro
Xeon, whatever is needed
RAM, plenty
48 SAS bays (software RAID)
SSD cache
10GbE (or FC)
InfiniBand interconnect (if we need to cluster)

The plan we have right now is to start with 2 or 3 servers and add whatever is needed, when it's needed.
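To make that growth path concrete, here is a rough back-of-the-envelope capacity sketch (the 15-disk RAIDZ3 vdev width, 4 TB drive size, and spare count below are illustrative assumptions, not decisions we have made):

```python
# Rough capacity sketch for the proposed build (assumptions: 48-bay chassis,
# 4 TB drives, 15-disk RAIDZ3 vdevs -- 3 parity disks per vdev, 3 vdevs per
# chassis, 3 bays left for hot spares / SLOG / L2ARC). Numbers are illustrative.

DRIVE_TB = 4            # assumed drive size
VDEV_DISKS = 15         # disks per RAIDZ3 vdev
PARITY = 3              # RAIDZ3 = triple parity
VDEVS_PER_CHASSIS = 3   # 45 data/parity bays used, 3 spare bays

def usable_tb(chassis):
    """Approximate usable capacity, ignoring ZFS metadata/slop (a few %)."""
    data_disks = chassis * VDEVS_PER_CHASSIS * (VDEV_DISKS - PARITY)
    return data_disks * DRIVE_TB

for n in (2, 3, 7):
    print(f"{n} chassis -> ~{usable_tb(n)} TB usable")

# 2 chassis -> ~288 TB usable   (close to the 0.25 PB starting point)
# 7 chassis -> ~1008 TB usable  (roughly the 1 PB target)
```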

So my questions:
1. Is a clustered filesystem necessary?
2. What is your experience in "virtualising" storage?
3. How would you try to achieve this - I would really like to hear your opinion.
4. Performance is critical - how can we avoid bottlenecks?
5. Any hints on OS, clusters, etc. are much appreciated.
6. How does Samba scale? Is CTDB a must? Which underlying filesystem works well?
7. How are you managing your solution?

I can provide more information if necessary (Failover, IOPS ...). Thank you for your time.
Johannes
 
What is your budget and have you spoken to other Storage Vendors?

I know [H] is all about home brew, but there are some times where you need to talk to the pros with Site Surveys, usage cases, and gobs of IOPS charts.

Off the top of my head, Dell's Compellent line has merit. We were looking at it for our storage needs because of the technology that moves data from fast disks (SSD/15k) down to slower/larger drives (3TB/7200) that handle retention. We have 500TB of data on our SAN and wanted to see about adding more, so we talked to EMC, Compellent, and one more vendor (the name escapes me) that talked more about backup needs than storage.


The one thing we got out of all of these talks was that moving data that wasn't being used off the small/fast drives onto big/slow drives was key to increasing our performance. We hadn't even thought of that idea, but it gave us something to tweak on the backend.
 
EMC's FAST is good at doing this. It will move data in 1GB stripes between storage tiers based on how often it is accessed.

The EMC could also share storage out via CIFS for you as well.

But you'd be looking at a VNX7500 or VMAX for the storage capacity you want and those are very spendy.

You should strongly consider going with a vendor on a project of this magnitude. If shit hits the fan and any data is lost, would you like to be on the hook for that?
 
Personally I would look at Isilon rather than build your own for a project of this size, too many things can go wrong when you try to do it entirely on your own.

Edit: Not to mention adding storage with Isilon is stupid simple, plug another node in and press 2 buttons. Takes less than 10 minutes.
 
Although I'm with the posters above regarding commercial solutions (easier support), I wouldn't rule out ZFS. You could also go with a commercial ZFS provider like Area Data Systems (http://areasys.com/). But I would definitely get quotes from EMC and the other big players too.
 
There are many reasons to use EMC or NetApp or Oracle/Nexenta for ZFS.
But the fact that you need hundreds of terabytes is not enough on its own; the question is whether you need their support and their service level agreements, and whether you are able to spend a really large amount of money.

For a 200 TB SMB server, you can also use a 50-bay case from Chenbro with 6 cheap IBM M1015 SAS controllers on a 7x PCIe SuperMicro mainboard with 128GB+ RAM and up to 50 x 4 TB SATA disks, running the open-source OpenIndiana OS, and it may do the same job, with supposedly better performance, for less than 20k USD.

Upgrading for more disks is only a matter of external JBOD cases and SATA or, better, SAS disks.
In such a case, the only problem is: when you have problems, you must solve them yourself.
But a lot of people have gone this way. ZFS is not too hard to learn.
 
@OP

We would like to use ZFS with RAIDZ3, preferably on CentOS (if it gets stable soon).

This would be your first mistake. Right now the performance of ZFS on Linux simply isn't there, and since the project isn't very widespread, it hasn't really been put to the test. I would NOT trust my data to it.

Having said that, I can see nothing wrong with using a number of ZFS nodes running an OpenSolaris variant like NexentaStor with the zvols exposed via iSCSI and then perhaps using CentOS nodes in front of them running Lustre for a clustered file system. You could also look at RedHat Storage Server, which uses GlusterFS.

I'd get some expert advice though, if only to broaden your awareness of what's involved. :)
 
Before we start throwing out ideas, what is your proposed budget over a standard depreciation cycle (do you use 3 years or 5 years for depreciation)? Include in these costs the physical hardware buildout; backup infrastructure, including lots of tapes for historical as well as rotating backups (a system of this size, with a normal commercial backup window, will most likely need disk-to-disk-to-tape to manage the time); offsite storage; possible physical network link improvements; ongoing onsite service (since this sounds mission-critical, a maximum 4-hour response); possible offsite DR infrastructure; and possible power upgrades.
 
I am all for a DIY system.

But you don't seem to provide enough usage details to allow any designing.

---

I would put 40-50GB of storage on each of several servers and work the problem from there. But you don't provide enough usage details.
 
First, thank you to everyone for getting involved.
The budget is around 100,000 EUR (~$130,000) before taxes, excluding switches and networking. We are currently in contact with Nexenta and Red Hat. I also want to get estimates from SGI and EMC. E.g. Nexenta has serious licensing costs in a bigger setup, but we still have no numbers yet. Normally first setups are quite cheap; that changes once you're in the vendor lock-in.

Traditionally our company has tried to avoid vendor lock-in and support plans. But we're growing, and I can see the benefit of having a support hotline. On the other hand, I still believe in "engineer driven management" and in educating our staff; outsourcing our core components is not an easy choice.
I'm still not convinced that a blinking-blue-LED black box with a major marketing budget will, in general, fulfill our needs better than the self-made approach where you have to get your hands dirty. It's a matter of tradeoffs.

Some more facts:
- Video data is growing rapidly, but we are now at 4K resolution. This may change to 8K in the next few years, but compression codecs and block-storage compression keep getting better too, so my vague estimate is that this storage could last at least 5 years.
- This system should be rock solid, but it doesn't necessarily need an active failover system. If we're back online "some" hours later, that's still acceptable. (Hot) spare hardware should be available on site.
- Right now we don't have a particular plan for how to solve backups, as we are still considering constructing and maintaining it ourselves. I could see a simple cloning backup with zfs send / rsync etc., disk to disk, plus a long-term (tape?) backup (a rough sketch of what that replication could look like follows below). I know this question is fundamental, but we are still at the beginning.
- We will reuse the 1GbE network for all 3D workstations. The compositing workstations and the render farm will be on a 10GbE network. If necessary, additional workstations will get hooked up for faster I/O. Time is our friend: 10GbE networking is getting cheaper every day, although the switches are still expensive.
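To make the zfs send idea concrete, here is a minimal sketch of nightly snapshot replication to a second box, with Python just driving the zfs commands; the pool, dataset, and host names are placeholders, not our real setup:

```python
# Minimal snapshot-replication sketch: snapshot the dataset, then send the
# increment since the previous snapshot to a backup host over SSH.
# Pool/dataset/host names below are placeholders, not a real config.
import datetime
import subprocess

DATASET = "tank/projects"              # hypothetical dataset
BACKUP = "backup-host"                 # hypothetical receiving box
PREVIOUS = "tank/projects@2013-01-01"  # last snapshot already on the backup

today = datetime.date.today().isoformat()
new_snap = f"{DATASET}@{today}"

subprocess.run(["zfs", "snapshot", new_snap], check=True)

# Equivalent of: zfs send -i <old> <new> | ssh backup zfs receive -F <dataset>
send = subprocess.Popen(["zfs", "send", "-i", PREVIOUS, new_snap],
                        stdout=subprocess.PIPE)
subprocess.run(["ssh", BACKUP, "zfs", "receive", "-F", DATASET],
               stdin=send.stdout, check=True)
send.stdout.close()
send.wait()
```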
Thanks for your time.
 
If performance is critical, then you may be interested in a two tier approach, where your active projects are on the high performance storage, but your inactive/finished work is on the lower performance/higher capacity storage (perhaps in the future you could also move to a three tier approach with archival storage for historic data). There is an admin overhead involved (though you can automate much of it) - but very large installations of high performance storage are expensive (and sometimes not really necessary for all the storage requirements).
You could also look at repurposing your current storage for other uses - it might be fine for archive purposes (though obviously that would depend on what it is and its age/condition etc.)
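Much of that tier movement can be scripted. A minimal sketch, assuming both tiers are mounted as plain directories (the paths and the 90-day threshold are placeholders) and that last-access time is a good enough signal for your workflow:

```python
# Minimal two-tier migration sketch: move files not accessed for N days from
# the fast tier to the archive tier, preserving the directory layout.
# Assumes both tiers are plain mounted filesystems (paths are placeholders).
import os
import shutil
import time

FAST = "/mnt/fast"        # hypothetical fast-tier mount point
ARCHIVE = "/mnt/archive"  # hypothetical slow-tier mount point
MAX_IDLE_DAYS = 90

cutoff = time.time() - MAX_IDLE_DAYS * 86400

for root, _dirs, files in os.walk(FAST):
    for name in files:
        src = os.path.join(root, name)
        if os.stat(src).st_atime < cutoff:
            dst = os.path.join(ARCHIVE, os.path.relpath(src, FAST))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)   # real tooling would verify checksums first
```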

Also, with 36 clients currently accessing the storage, I'm not sure RAIDZ3 would satisfy your performance requirements - though really this is heavily dependent on the way these clients access the storage.

Finally, you hit the nail on the head by saying the backup is fundamental - it certainly is (and your requirements may be quite expensive too, so you need to budget for that)
 
~$130,000
Unless prices have drastically changed since I last looked, you're not going to get a petabyte from a vendor for 130K USD.

Heck, for a homebrew setup, you'd barely be able to squeeze in at that price:
34 3U chassis with 15 drives per 3U (10 data, 3 parity, 2 hot spares)
510 raw drives. Best $/GB is the 3TB drives; since I'm a Seagate fan, we'll use those at $150 each.
That's $76,500 in drives alone.
Then add in about $1,500 for each box you put the drives in: $51,000.
You're sitting at $127,500 and you still don't have racks, power, network, or backups.

I would start by trying to convince your boss to double the budget for a homebrew setup, or quadruple it for a vendor solution.
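For reference, the math above written out (drive and chassis prices are my rough assumptions, not quotes):

```python
# The back-of-the-envelope math from the post above (all prices assumed).
chassis = 34                 # 3U boxes
drives_per_chassis = 15      # 10 data + 3 parity + 2 hot spares
drive_price = 150            # assumed $ per 3 TB drive
chassis_price = 1500         # assumed $ per 3U box, excluding drives

drives = chassis * drives_per_chassis          # 510 raw drives
drive_cost = drives * drive_price              # $76,500
chassis_cost = chassis * chassis_price         # $51,000
usable_tb = chassis * 10 * 3                   # 10 data drives x 3 TB each

print(f"{drives} drives, ~{usable_tb} TB usable, "
      f"${drive_cost + chassis_cost:,} before racks/power/network/backup")
# -> 510 drives, ~1020 TB usable, $127,500 before racks/power/network/backup
```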
 
Kinda not clear: the 100k EUR is for the startup, 250TB of data, right?
So if you go for 400-ish TB raw: 4x 45-bay SM JBODs at 8k EUR + 180x 3TB drives at 27k EUR = 35k EUR for the data side.
That should leave you with more than enough for networking and two head nodes.
 
The two tier approach is something we should seriously consider.

The 100k EUR is a virtual number my boss gave me; it helps us work out how to approach this task. The storage will grow over time. In the end we need the best solution for the least amount of EUR. Where TeeJayHoward is definitely right is that a vendor solution would be an order of magnitude more expensive.
My task is to present several technically reasonable solutions. If at the end of the day we decide to buy an EMC setup, then at least we know why.
 
The vendor cost will also differ greatly depending on the tier of performance you are aiming for; for example, on the EMC side the Tier 1, 2, and 3 solutions vary drastically in price. I believe the Dell, EMC, and IBM SAN solutions take similar approaches.

A mid-tier solution like the EMC VNX7500 can take up to 1000 drives in mixed formats such as SSD, SAS and more. You would probably be starting with maybe 200 TB; you might be able to come in around 200k with a support contract and have the ability to expand to well over 500 TB. Even though it's a mid-tier solution, you will still get smoking performance with 10G infrastructure on the back end.
 
He said he wants 1PB of storage. The cost of that with the letters EMC anywhere involved rapidly runs into seven figures.

I also want to get estimates from SGI and EMC. E.g. Nexenta has serious licensing costs in a bigger setup, but we still have no numbers yet. Normally first setups are quite cheap; that changes once you're in the vendor lock-in.

Have you ever worked with EMC before? If you're balking at the licensing cost of Nexenta, I can't wait to see your reaction to an EMC quote.
 
Yes, buying appliances is expensive. However, I would love to hear more ideas on how you would design this system with the specs I have given - especially how you would do backups and the two- or three-tier approach.
Thank you for your time
Johannes
 
Originally Posted by GeorgeHR:

I would put 40-50GB (my mistake, should be TB) of storage on each of several servers and work the problem from there. But you don't provide enough usage details.

Why exactly this amount?

I filled in the details by assuming that your workload is a few people working on a single project, with several projects running at the same time.

40-50TB is a reasonable amount of disk space for a small group to share - that is, a group small enough to share data over a network of reasonable cost.
 
Yeah, anyone who is suggesting that EMC could possibly come in for this is living in fairyland. Not to mention their low-end stuff is actually crap.

Compellent is pretty interesting: ZFS-based, but with tiering rather than a standard L2ARC implementation. Plus you get the support, and as you grow you can keep adding shelves of cheap storage and let the heads do the work of shifting your old data down to the slow disks when it becomes outdated or unused, while keeping it instantly available.
 
Not that I have a clue what Compellent costs, but it sure as hell won't be the lunacy pricing that EMC has.
 
With your budget you're not doing backups; you just can't afford to. With petabyte-sized datasets, backups become tricky to manage - that is a boatload of data. There is software out there to make it manageable, but it can cost as much as your entire storage budget.

1) Avoid SATA drives.

You're going to get a lot of opinions and people linking you to articles where some guy got SATA drives to work behind expanders just fine and everything was ducky... but you're talking hundreds of drives here. SATA behind expanders is/can be finicky on smaller systems, and the one you're talking about building is anything but small. Avoid the headache and use SAS drives. Yes, it's more expensive per drive, but it will work.

2) Use LSI only: LSI HBAs and LSI controllers in the JBODs. If you mix another vendor into the SAS chain you're asking for trouble.

1. Is a clustered filesystem necessary?
ZFS is a mount-once filesystem, so you can't do clustered anyway. Are you asking about high availability, though? If you are, then I would suggest that yes, it is worth it. The HA license from Nexenta is around $4,000 US, which is a drop in the bucket considering how much you're spending on drives. However, Nexenta WILL NOT certify HA with SATA drives. <-- period

2. What is your experience in "virtualising" storage?

Something of this size I wouldn't virtualize; there's just no point. Even if you wanted to carve up certain drives for specific roles, you would be better off looking into SAS switch zoning and using multiple controller heads.

4. Performance is critical - how can we avoid bottlenecks?
This question needs answering first because it dictates #3. There are two types of performance, though: low-latency random workloads and high-throughput sequential workloads. They are not the same and have to be attacked differently, although they can mostly be solved within the same system.

That said, off the top of my head I would suggest looking into Quanta's new 4U 60-drive JBOD and pairing it with their new quad-socket 4U server. This gives you the most PCIe 3.0 lanes currently possible, and into those PCIe slots you plug at least 5 LSI 9207 HBAs, with 256GB of RAM in each server head. I can't really suggest what you need/want for SSDs as you haven't gone into any detail about what the workload looks like, but I can't recommend STEC SSDs highly enough.
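To put a rough number on the sequential side for a film post house, here is a quick per-stream bandwidth sketch for uncompressed playback - the frame geometry, bit depth, and frame rate are assumptions, so substitute your actual formats:

```python
# Rough sequential-throughput sketch for uncompressed playback.
# All parameters are assumptions -- substitute your real formats.
def stream_mb_per_s(width, height, bytes_per_pixel, fps):
    """Approximate sustained read rate for one uncompressed stream."""
    return width * height * bytes_per_pixel * fps / 1e6

# e.g. 4K DPX, 10-bit RGB packed into 4 bytes/pixel, 24 fps
per_stream = stream_mb_per_s(4096, 2160, 4, 24)     # roughly 850 MB/s
print(f"one stream  : ~{per_stream:.0f} MB/s")
print(f"four streams: ~{4 * per_stream / 1000:.1f} GB/s")
# A single 10GbE link (~1.2 GB/s) tops out at one or two such streams,
# which is why the random-vs-sequential question above matters so much.
```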

3. How would you try to achieve this - I would really like to hear your opinion.
can't properly answer this without knowing #4 in great detail.

5. Any hints on OS, clusters etc ... is much appreciated
Nexenta. There are some NDA things I wish I could tell you, but I just can't yet. You will want Nexenta; the things they have coming in 4.0 are absolutely amazing.

6. How does samba scale is ctdb a must? Which underlying filesystem works well?
With Nexenta you don't need Samba; you get a native CIFS server. As for Samba running on FreeBSD or Linux, honestly I have never attempted to make SMB (or CIFS, for that matter) scale much beyond department-level sharing.

7. How are you managing your solution?
My solution is 800TB (raw) of Nexenta-powered storage. I also get a lot of use out of a product called SANTools, well worth its very reasonable cost. For backup I currently replicate between datacenters, though we're getting a tape library here fairly soon. SNMP and log monitoring, internal Nexenta SMTP alerting, etc. are all part of the management.

I'm a big fan of Nexenta. They still have a few rough edges, but those are being smoothed out. Seriously, the stuff they will be announcing soon is absolutely amazing; keep an eye on their announcements in early November.
 
Just want to make sure everyone is clear... $1m EMC/ NetApp list price does not mean you actually spend $1m (or anywhere close to that.)

Not to say they are inexpensive options... but it is no secret that nobody pays list price, especially in that size of system.
 
It's also no secret that the features listed in that $1m list price dwindle as the points come off the margin.
 
Just want to make sure everyone is clear... $1m EMC/ NetApp list price does not mean you actually spend $1m (or anywhere close to that.)

Not to say they are inexpensive options... but it is no secret that nobody pays list price, especially in that size of system.

Yes, we get 65% off list price.:p:p
But... even with that discount, getting to 1PB of storage is still way beyond the OP's budget.
 
They offered me 90% off list once they realized I was building my own Nexenta solution. That quote had no HA, a lot less SSD, and a lot less usable space... while still being $130k more than what I built.
 
They offered me 90% off list once they realized I was building my own Nexenta solution. That quote had no HA, a lot less SSD, and a lot less usable space... while still being $130k more than what I built.

When we were bidding the most recent system (we eventually went with DDN), EMC was offering 75% across the board on everything, including service, and up to 85% on particular packages.
 
DDN is awesome hardware, no doubt. I don't need JBOD controllers that can sustain 96GB/s though, lol.

BTW, their hardware may or may not find its way onto the Nexenta HCL soon.
 
DDN is awesome hardware, no doubt. I don't need JBOD controllers that can sustain 96GB/s though, lol.

BTW, their hardware may or may not find its way onto the Nexenta HCL soon.

Without violating any NDAs, I would not be surprised at all :D
 
Here are some high end ZFS servers:
http://www.slideshare.net/NexentaWebinarSeries/nexenta-at-vmworld-handson-lab



"Well, if it is thoroughput limitations you are worried about....
Let me cite an example. And it is a very large ZFS system, which they expect will scale to over 10PB
http://www.lsi.com/downloads/Public/Direct Assets/LSI_CS_PittsburghSupercomp_041111.pdf
each rack consists of two servers. Each of the two servers has a single LSI-9201-16e JBOD HBA (x8) connected to the LSI SAS 6160 switch which are then connected to 495 3TB SATA drives So that is 1 x8 handling 247 odd drives. Screaming 4GB/s sequential speed is not the rule in installs like these, it is high IOPs and lots of high queue depth random access requests. Oh yeah, it uses SATA drives with those evil SAS HBAs and Expanders "



Some more discussion of high-end storage vs. ZFS here:
http://hardforum.com/showthread.php?t=1548145&page=8&highlight=zfs+high+end
 
You mentioned that you're not even fully utilizing the IOPS of your current setup due to network saturation. Have you thought about adding additional network cards and teaming/bonding them? This would split the workload over multiple network cards and lower the overall saturation of any single card, thus enabling you to reach the IOPS limit of your current server.

As for the data side: I agree you should look at a hardware solution to make it easier to expand later on, without having to deal with creating additional zpools like ZFS requires. Most hardware RAID controllers will allow you to expand an existing array on the fly. For speed I would recommend RAID 60 on a card that has 2-4GB of memory to handle the parity duties. You will lose more drives to parity overall, but your read/write speeds will also increase - at least that's my understanding. And it means that if you do lose multiple drives at once, you are less likely to lose the whole array.

As an example, for my LAN server I have planned out 54 SSDs: 6 RAID 6 arrays of 9 drives each, then striped. I lose 12 of the 54 drives to parity, but now I could in theory lose 12 drives without loss of data, provided it is 2 drives per RAID 6 array. Just some thoughts.
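Spelling out the parity arithmetic in that hypothetical 54-SSD layout:

```python
# RAID 60 layout from the example above: 6 RAID-6 sets of 9 drives, striped.
drives_total = 54
sets = 6
drives_per_set = drives_total // sets        # 9
parity_per_set = 2                           # RAID 6 = double parity

parity_drives = sets * parity_per_set        # 12 drives "lost" to parity
data_drives = drives_total - parity_drives   # 42 drives of usable capacity

# Worst case: 3 failures in one RAID-6 set kills the array.
# Best case: up to 2 failures per set, i.e. 12 failures total, survive.
print(f"usable drives: {data_drives}, parity drives: {parity_drives}")
print(f"survivable failures: 2 per set (up to {parity_drives} total), "
      f"but any 3 in the same set is fatal")
```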
 
Personally I would look at Isilon rather than build your own for a project of this size, too many things can go wrong when you try to do it entirely on your own.

Edit: Not to mention adding storage with Isilon is stupid simple, plug another node in and press 2 buttons. Takes less than 10 minutes.

There's a reason Isilon has a really strong position in the Media and Entertainment space. Super easy to use, works well, and scales well as data needs grow.

OP: If money is a big factor, maybe consider some of the scale-out storage options. You can always go with 250TB today and scale to a PB very easily.
 
As for the data side: I agree you should look at a hardware solution to make it easier to expand later on, without having to deal with creating additional zpools like ZFS requires.

You're confusing zpools and vdevs. To expand a ZFS pool you have two options: swapping drives for bigger ones, or adding vdevs. For a system of the size we're talking about, you add entire shelves of disks at a time, so it's not a limitation at all. It's "small scale" use where it is a problem, since you can't just add one drive at a time if you have raidz vdevs.
 
I'll add that in large configurations you add RAID sets very similarly in hardware RAID environments: you don't add one or two more drives to your 6-disk RAID 6 sets, you add one or two more 6-drive RAID 6 sets.
 
There's a reason Isilon has a really strong position in the Media and Entertainment space. Super easy to use, works well, and scales well as data needs grow.

OP: If money is a big factor, maybe consider some of the scale-out storage options. You can always go with 250TB today and scale to a PB very easily.


One possible problem for the OP here, though, is whether he can obtain even a 250TB storage server system for the budget he has in mind.
 
One possible problem for the OP here, though, is whether he can obtain even a 250TB storage server system for the budget he has in mind.


As shown by a poster above, getting even close to the required amount would require nearly doubling the budget.

I've had some dealings with Compellent; they are very aggressive in going after EMC customers, even buying back old EMC equipment when you purchase new Compellent hardware. The warranty, service, and support are similar to EMC's, with the monitoring and phone-home equipment.

Prices are going to come in lower than EMC's for sure, with the tiering everyone is mentioning above. My company has 1/2 PB of storage on two EMCs and we are actively looking to move off this platform due to $$$.
 