Fastest Theoretical SSD Build?

joel96

I'm trying to find out what the fastest performing SSD setup would be, regardless of price and risk of data loss, limited only by technical considerations, purely out of scientific curiosity.

My rolling component build of an optimal system:
https://docs.google.com/spreadsheet/ccc?key=0ApzIz1YSh_e7dEZYb3RORFZxU21jUDh0MWRRSkhhcUE&usp=sharing

Enterprise spam: The fastest single unit is an enterprise PCI-e based SSD from OCZ, according to cpubenchmark.net (the OCZ ZD4CM88-FH-3.2T). My theory is that larger-capacity SSDs generally perform better than lower-capacity drives, which has been reflected in a few benchmark tests weighing larger drives against smaller drives in RAID 0. I want to know if the ZD4CM88-FH can be set up in a striped array or not, since the controller would differ from the discrete card-based controllers used for SATA 6. If the controller is slower than a discrete card controller used for SATA 6 based arrays, then a multi-SSD system might be able to out-perform a system with multiple PCI-e enterprise SSDs in an array.

SSD spam: Another potentially faster build would be one that maximizes a motherboard's PCI-e slots for RAID controller cards and outfits them with the fastest SATA 6 SSDs in RAID 0. This could potentially hamper overall system performance due to having to get a motherboard with more controller card capacity rather than multi-GPU support, or with those slots taken up by controller cards rather than GPUs. But again, this is a theoretical discussion of SSD performance, with overall system performance postponed until some comparative benchmarks are available.

HDD spam: After reading the thread on hardforum regarding builds with greater than 60TB of usable storage, I'm curious as to how a system running in RAID 0 might compare with one that uses the same chassis and number of drives, but replaces the large-capacity drives with SSDs. houkouonchi's build is meant for capacity rather than speed, but if it were configured for speed it would fall into this category. Supercomputers used as renderers or for other non-real-time applications don't count, both because they aren't a single system and because they aren't for real-time applications.

non-RAID higher level OS arrays: The ones I've seen used on hardforum most frequently (although not exclusively) are unRAID, ZFS RAID-Z2, Synology Hybrid RAID, and Microsoft Storage Spaces. unRAID is best run off of a USB 3.0 drive, and apparently can't be run off of a SATA 6 or PCI-e based drive. Synology apparently requires a proprietary system, even though the software isn't proprietary. My theory is that most software based array management will be slower than a hardware based system. However, a large capacity array might perform faster than a smaller one by the same logic as two larger drives in an array out-performing two smaller drives in an identical configuration.

On a side note, the most difficult benchmark for a real-time centered system I can think of (but have not heard any analysis) for a single system would be running ARMA III at max settings at 8,192×4,320 in 3D (if that's even a supported resolution), streaming uncompressed at a minimum of 144fps using Open Broadcaster Software, recording mic audio separately from in-game audio in 256DSD, and recording video at 10-bit 4:4:4 YUV uncompressed h265.

My realistic system is probably just going to be walking the tightrope without a net: a five-year-old 750GB drive and a 1TB drive for game and application storage, with occasional optical media backups of documents and game saves whenever I get around to it. But this isn't about my rig or budget. It's about SCIENCE!
 
joel96 said:
I'm trying to find out what the fastest performing SSD setup would be, regardless of price and risk of data loss, limited only by technical considerations, purely out of scientific curiosity.

  1. My rolling component build of an optimal system: https://docs.google.com/spreadsheet/ccc?key=0ApzIz1YSh_e7dEZYb3RORFZxU21jUDh0MWRRSkhhcUE#gid=0
  2. Enterprise spam: The fastest single unit is an enterprise PCI-e based SSD from OCZ, according to cpubenchmark.net (the OCZ ZD4CM88-FH-3.2T). My theory is that larger-capacity SSDs generally perform better than lower-capacity drives, which has been reflected in a few benchmark tests weighing larger drives against smaller drives in RAID 0. I want to know if the ZD4CM88-FH can be set up in a striped array or not, since the controller would differ from the discrete card-based controllers used for SATA 6. If the controller is slower than a discrete card controller used for SATA 6 based arrays, then a multi-SSD system might be able to out-perform a system with multiple PCI-e enterprise SSDs in an array.
  3. SSD spam: Another potentially faster build would be one that maximizes a motherboard's PCI-e slots for RAID controller cards and outfits them with the fastest SATA 6 SSDs in RAID 0. This could potentially hamper overall system performance due to having to get a motherboard with more controller card capacity rather than multi-GPU support, or with those slots taken up by controller cards rather than GPUs. But again, this is a theoretical discussion of SSD performance, with overall system performance postponed until some comparative benchmarks are available.
  4. HDD spam: After reading the thread on hardforum regarding builds with greater than 60TB of usable storage, I'm curious as to how a system running in RAID 0 might compare with one that uses the same chassis and number of drives, but replaces the large-capacity drives with SSDs. houkouonchi's build is meant for capacity rather than speed, but if it were configured for speed it would fall into this category. Supercomputers used as renderers or for other non-real-time applications don't count, both because they aren't a single system and because they aren't for real-time applications.
  5. non-RAID higher level OS arrays: The ones I've seen used on hardforum most frequently (although not exclusively) are unRAID, ZFS RAID-Z2, Synology Hybrid RAID, and Microsoft Storage Spaces. unRAID is best run off of a USB 3.0 drive, and apparently can't be run off of a SATA 6 or PCI-e based drive. Synology apparently requires a proprietary system, even though the software isn't proprietary. My theory is that most software based array management will be slower than a hardware based system. However, a large capacity array might perform faster than a smaller one by the same logic as two larger drives in an array out-performing two smaller drives in an identical configuration.
  6. On a side note, the most difficult benchmark for a real-time centered system I can think of (but have not heard any analysis) for a single system would be running ARMA III at max settings at 8,192×4,320 in 3D (if that's even a supported resolution), streaming uncompressed at a minimum of 144fps using Open Broadcaster Software, recording mic audio separately from in-game audio in 256DSD, and recording video at 10-bit 4:4:4 YUV uncompressed h265.
  7. My realistic system is probably just going to be walking the tightrope without a net: a five-year-old 750GB drive and a 1TB drive for game and application storage, with occasional optical media backups of documents and game saves whenever I get around to it. But this isn't about my rig or budget. It's about SCIENCE!

#1 no good uploading a doc if it is restricted access.

#2 Bigger will be faster as they usually have more small drives added to the RAID-0 pool; higher chance of failure too. For the fastest speeds WITH some redundancy, you will need something like Storage Spaces - Double Mirror.

#3 A single M1015 will put 8 (not 6, not sure where that figure was plucked from) SATA-III or SAS2 SSDs in RAID-0, 1 or 10. Easy to fit in with GPUs.

#4 More drives only means more speed for bigger files; small files (OS, SQL or other databases) will go backwards fast, limited by I/O. Using hybrid HDDs is a wise move for those wanting to have their cake and eat it too. If you need speed, you use RAM.

#5 Too many variables.

#6 That's not a benchmark, that's personal preference.

#7 Sounds more like dreaming/tire kicking; sorry if I offend, but that's about what it is to me.
 
#6, raw video at that rate is about 7.3 GB per second and while you can make storage that fast, you're going to be limited by other factors on the machine (that's assuming you can even make a system to run a game at 8k/144fps)
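
For what it's worth, here's a rough sketch of where a number like that comes from. The ~7.3 GB/s figure roughly matches 8-bit 4:2:0 at that resolution and frame rate (an assumption on my part); the 10-bit 4:4:4 capture described in the OP would be considerably higher:

```python
# Rough raw-video bandwidth estimate; bits-per-pixel values are assumptions.
def raw_rate_gb_per_s(width, height, fps, bits_per_pixel):
    """Uncompressed video data rate in GB/s (decimal GB)."""
    bits_per_second = width * height * bits_per_pixel * fps
    return bits_per_second / 8 / 1e9

print(raw_rate_gb_per_s(8192, 4320, 144, 12))  # ~7.6 GB/s, 8-bit 4:2:0
print(raw_rate_gb_per_s(8192, 4320, 144, 30))  # ~19.1 GB/s, 10-bit 4:4:4
```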
 
Here's the working link (updated in the older post too):
1 https://docs.google.com/spreadsheet/ccc?key=0ApzIz1YSh_e7dEZYb3RORFZxU21jUDh0MWRRSkhhcUE&usp=sharing

2 It sounds like the only way an array would work with PCIe based SSDs is a software managed system, in which case there should be a performance hit from moving off of hardware and into an emulated environment. The question then is whether the performance hit of a card based enterprise array in a software environment is big enough to place it below a single enterprise card.

3 SATA 6 aka SATA 600 aka SATA III. The question is whether a system with seven controller cards (hypothetically) with eight SSDs each, the entire system striped without redundancy, taking up slots that could otherwise hold four GPUs, will perform better than the same system with, say, four controller cards with eight SSDs each and four GPUs.

4a It sounds like the HDD spam technique can't equal the performance of an SSD spam system overall, although I would like to figure out whether the large-file performance of a massive HDD storage system will outperform a massive SSD storage system in benchmarks.

4b The same applies to hybrid HDDs--the goal is not capacity here, it's speed, so I want to know where massive hybrid HDD systems fall in benchmarks compared with the other builds.

5 The question here is which software delivers the highest benchmark scores. Variables can be removed by using as similar hardware as possible: same motherboard, same drive models, and starting off with the same number of drives. The variables are going to be the chassis, the software, and the drive that is running the software.

6 I'm looking for a non-artificial rubric for measuring performance of a system that can be used alongside artificial benchmarks. I'd like recommendations on how to take the personal preference out of the benchmark and make it about what the system can deliver in real time. Testers usually do multiple runs with different games at different resolutions. I'm just picking the most taxing stuff I can think of off the top of my head (ARMA III seems like it would be more taxing due to the number of objects on screen versus CRYSIS and Metro 2033, but again it should really just be one demo among several). I'm open to suggestions on what the toughest real-world tasks would be to weigh a system's performance. And about the raw video: h265 does offer compression options, and I don't know if rates higher than 120fps are supported, or whether the wiki page indicates a hard upper limit of the codec.

7 I have resources to build a system with, and so do others reading this, if the thread on large-capacity builds is any indication. What I'm missing is the knowledge and benchmark results for the different builds I'm talking about, which are probably out there on the internet somewhere, but I'd like to avoid reinventing the wheel by consulting the experts here, since the search is distributed across multiple users who have already done the research and found the articles and explanations.
 
Buddy, you still have missed the limits of anything on the shelf. You only have limited PCI-E lanes per CPU.

A dual socket board loaded with 512GB of RAM will yield results that cannot even be reliably benched.
 
I don't know the technicalities but could you theoretically RAID like 1000+ SSD's together for like terabytes of I/O?
 
Lost-Benji said:
Buddy, you still have missed the limits of anything on the shelf. You only have limited PCI-E lanes per CPU.

A dual socket board loaded with 512GB of RAM will yield results that cannot even be reliably benched.

For those just tuning in, Lost-Benji is probably referring to my rolling build list linked in the OP.

The number of PCIe lanes supported by the CPU isn't important, since the motherboard PCIe chipset will handle it (the maximum number of lanes is forty on the 3970x and E5-2690). PCIe 3.0 has a maximum transfer speed of roughly 1GB/s per lane, which is still 400MB/s more than the 600MB/s of SATA 3.0. The maximum number of PCIe lanes supported by the best available current PEX chips is 48.

The current PCIe lane limitations mean a maximum of six x8 enterprise cards if crappy integrated or PCI graphics were used; it is unknown if there are any motherboards that offer octuple PCIe x8. Since PCIe 3.0 transfers at 2GB/s per link, eight controller cards each in PCIe x1 slots could potentially be supported on a motherboard that also supports x16/x16 SLI, with 16GB/s total for the controller cards. That amounts to twenty-six SATA 3.0 drives (with a remainder, but putting in a twenty-seventh drive would risk hitting the bandwidth bottleneck) spread over the optimal number of controller cards. If there were a PCIe x16 controller card that had at least twenty-six SATA 3.0 ports, it would work (though it seems unlikely); so would two PCIe x8 SATA 3.0 controller cards with thirteen SATA 3.0 ports each (still unlikely); and so would four PCIe controller cards that are at least x4 and have at least seven SATA 3.0 ports (some of the controller cards I've seen in the forum have eight ports).
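
To make that lane math easier to play with, here is a small sketch. The per-lane figure is an assumption: PCIe 3.0 moves roughly 1GB/s per lane in each direction, so the 2GB/s-per-link figure above only holds if both directions are counted.

```python
# Sketch: how many SATA 3.0 SSDs fit under a given PCIe lane budget.
# All figures are assumptions that can be adjusted.
PCIE3_GB_PER_LANE = 1.0   # per direction, ~0.985 GB/s after 128b/130b encoding
SATA3_GB = 0.6            # 600 MB/s per SATA 3.0 link

def drives_before_saturation(lanes):
    budget = lanes * PCIE3_GB_PER_LANE
    return budget, int(budget // SATA3_GB)

for lanes in (8, 16, 48):  # one x8 card, one x16 card, a 48-lane PEX budget
    budget, drives = drives_before_saturation(lanes)
    print(f"{lanes:2d} lanes -> {budget:4.1f} GB/s -> {drives} SATA 3.0 drives before the bus saturates")
```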

The Rampage IV only has eight lanes left over for a single PCIe x8 enterprise card or controller card, assuming the VGA setup runs dual x16 SLI. It has four slots left over after two x16 GPUs are in place, but only three of those are x8 or greater; the last one is x1, which means it could only hold three drives without hitting the 2000MB/s bandwidth cap, leaving a remainder of twenty-three drives over three slots. Two of the controller cards could be x4 and hold seven drives each, but to maximize the bandwidth, one of the cards would have to be x8 and have at least nine ports (which is unlikely), unless the extra two drives were hosted by the motherboard, and I doubt that the controller on the motherboard would be faster than a controller card. It might be enough to make up for the performance hit that would be caused by missing two of the drives in the single RAID volume, though.

From what I've read, dual socket boards are only available with Xeon procs. Due to the way the buffers are set up, they actually perform worse than the 3970x in realtime tasks like gaming. I'd like an explanation as to why having a larger buffer would hurt performance, since I've only seen it mentioned peripherally.

The only board I know of that holds 512GB of RAM is a dual proc one (the Asus Z9PE-D16), and as previously mentioned, would probably perform worse because of incompatibility with the 3970x and because it has only two PCIe slots. Even though it would support full dual x16 SLI, the benchmarks of quad SLI have scored mostly higher than dual SLI rigs.

JoeUser said:
I don't know the technicalities but could you theoretically RAID like 1000+ SSD's together for like terabytes of I/O?
The theoretical limit of x64 OSes is 16 exabytes, or roughly sixteen million 1TB drives. The largest capacity SSD I've seen is 3.2TB, so it would take roughly five million of those SSDs before the OS cap is hit. The real limitations on capacity come from the number of controller cards supported by the motherboard and the number of ports on those controller cards. I don't know how houkouonchi got around that issue, but massive storage builds are built for capacity, and do not get around the SSD bandwidth saturation issue of packing every lane, slot, and controller card with the most bandwidth possible.
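
The capacity ceiling is quick to sanity-check (treating 16 exabytes as sixteen million terabytes and ignoring the binary/decimal distinction):

```python
# 64-bit address-space ceiling versus drive counts, in decimal units.
ADDRESS_LIMIT_TB = 16_000_000        # 16 EB expressed in TB
print(ADDRESS_LIMIT_TB / 1.0)        # ~16,000,000 one-terabyte drives
print(ADDRESS_LIMIT_TB / 3.2)        # ~5,000,000 3.2TB drives
```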

-Dragon-, your link is broken and I'd like to see it, but I'd also like to see if you have any thoughts on my theory of SSD PCIe bandwidth maximization.
 
Thanks for clearing that up Joel, yes, you are on the money with where I was going with it.

OK, as for mainboards, most people who are serious and into systems won't be stuffing around with desktop/consumer gear. Asus makes some great stuff, but when talking Pro, leave it to Supermicro, Tyan and Intel, who all have offerings that hit 512GB+ in 2P or 4P configs and 256GB in 1P solutions.

As for the 1P Vs. 2P/4P argument, it is a little bit of a BS argument that goes on. Desktop OSes were never intended for 2P or 4P (Windows actually won't work on 4P). To get the most out of 2P or 4P, server OSes are best suited and can be modded to work as a Workstation OS.
The other side of the argument that also gets missed is that people test with synthetic tools that rarely work at the full potential of multiprocessor systems. Plenty of testing and ideas come from older platforms that relied on MCH/north-bridge chipsets to control the RAM and placed extra burdens on desktop/synthetic workloads.
2P systems work very well and, in many cases with newer software/hardware, smash single-socket consumer gear even for the same dollar outlay.

Consumer CPUs only support 1P systems; you need Xeon or Opteron for 2P, and 4P-designed CPUs of the aforementioned to run 4P.

OP was aiming at a theoretical balls-to-the-wall system; consumer boards and CPUs just don't come close. If they did, you would see them in data centres, supercomputers and the like.
 
You have to take the spaces out, [H] strips The SSD Review links. Maybe this will work, forgive the condescending link, most link shorteners are blocked too :p

If you used a motherboard like this you could use 6 of those Adaptec cards at ~6GB/s each for an absolute maximum of probably >30GB/s (the speed increase doesn't seem to be linear across multiple RAID cards; according to that review, one card got 6.4GB/s read (advertised 6GB/s) and 2 cards got 11GB/s read), and if you throw a few more SSDs at it you might get another GB or two out of it. Since there's currently no such thing as a quad 10Gbit NIC (or none that I've found) you'd need 3 to be able to get the data capacity of even one RAID card "out" of the box, so if you had 2 onboard 10Gbits, then 4 dual 10Gbit NICs with 2 Adaptec cards, you'd be able to get 80-90% of those cards out to the network. Can't think of a whole lot of applications that would require 30GB/s and around 3 million IOPS locally. Heck, the new Adaptec 7H series of HBA cards are advertised at 1 million IOPS each, so if you used that and some sort of software RAID (zfs or whatever) you could probably get >30GB/s and 5-6 million IOPS.

6GB/s out of one RAID card is pretty neat, 11-12GB/s out of two is even neater esp at 600k and 1M IOPS each respectively, but due to CPU processing capabilities and network speeds much past that starts to hit the realm of "neat but useless" pretty quick.
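
As a rough sketch of the disk-versus-network balance described above (the card and NIC numbers are the review/advertised figures quoted in this post, not new measurements):

```python
# How many dual-port 10Gbit NICs it takes to ship one RAID card's read
# throughput out of the box, using the figures quoted above.
import math

raid_card_gbit = 6.4 * 8          # ~6.4 GB/s read per Adaptec card, in Gbit/s
dual_10g_nic_gbit = 2 * 10        # two 10Gbit ports per NIC

nics_per_card = raid_card_gbit / dual_10g_nic_gbit
print(nics_per_card, math.ceil(nics_per_card))   # ~2.56 -> 3 NICs per RAID card
```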
 
I think I can see where you're going with this, Dragon; the whole thread is really just an e-penis extension to synthetic benchmarks and results that are totally unusable in the real world.

RAID cards and mountains of SSDs are e-Viagra: nice to have a hard-on about the end result, but the time it takes to happen is a bummer. Cards need to get the command then start working; now you need to get small files across loads of disks, which still takes time even on SSDs; now you need to get them flowing into a mainboard/CPU/cache/RAM (more processing time and handling) and then where????
Where is good old ATTO on that benching?

P.S. I have a feeling that the test bed they used in that review is a little off from the norm and the ideal. 16GB of 1333 ECC?????
A quad-channel CPU that is designed to run 1600 optimally; that would suggest they used 8x 2GB DIMMs of cheap, slow and sluggish RAM?
 
Lost-Benji said:
[...] when talking Pro, leave it to Supermicro, Tyan and Intel, who all have offerings that hit 512GB+ in 2P or 4P configs and 256GB in 1P solutions.
Did you have any particular boards in mind? There are several dozen from each manufacturer. For the purposes of this thread, the emphasis is on SSD performance, so one of the first criteria would be the number of PCIe lanes dedicated to the SSD cards (ideally, there would be one chip for the SSD PCIe slots, and one chip for two full PCIe x16 SLI, but any SLI support is probably going to be absent from a server board).

I haven't seen any controllers that use more than a PCIe x8 interface; the limitation of 48 lanes would mean six controller cards for a total of 30GB/s, as -Dragon- mentions. While the math of the maximum theoretical saturation of the SATA 3.0 bandwidth would project that it takes only ten drives to saturate the bandwidth, the SMART Optimus drives didn't get to that point until thirteen drives were added.

I'm still talking theoretically with optimization of the board that will host the controllers in mind, so I'll keep it at ten drives per card, for a maximum total of sixty SSDs. This is a long-winded way of getting to the latter half of your sentence regarding RAM, which -Dragon- brings up again with the link to the 768GB board; with that much data being transferred between all of those drives, I assume it will take a lot of RAM. I've read on Microsoft's website that the maximum amount of RAM the OS can address is 192GB. Lost-Benji mentions that
Desktop OS's were never intended for 2P or 4P (Windows actually wont work on 4P). To get most out of 2P or 4P, Server OS's are best suited and can be modded to work as a Workstation OS.
Since (#1) only server boards have 768GB of RAM, (#2) Windows 8 x64 cannot address even a quarter of 768GB of RAM, (#3) there is theoretically a board available that has 48 lanes split evenly between six x8 PCIe slots, and (#4) gaming, streaming, and recording aside, this thread is for finding the fastest SSD setup, a modded server OS must be chosen. Which one would reduce the software overhead enough that it wouldn't get in the way of SSD performance?

I'm not coming from a background that requires enterprise-level SSD performance, so I don't know what kind of real-world applications would require that kind of performance from a single unit; I'm not familiar with server-level applications, so it would probably be several hosting and data processing applications I've never even heard of.

If it were something other than synthetic testing meant for SSD-speed heavy application (and to get on to what I am familiar with), like the Let's Play-centered benchmark, it would require sticking with an OS that will support the software and deliver the lowest overhead, which would probably be a Linux-based OS, but would probably end up being Win7 for compatibility with the greatest number of applications.

The 1P/2P/4P debate seems a little distant from the SSD conversation, and I don't think the CPU would have much impact on the SSDs, aside from being able to deliver commands to them faster, since the controller is a discrete chip. It is related to the amount of RAM a board will support, since I've only seen 2P boards with RAM capacities of more than 64GB. If there is a 4P/2P board that delivers better performance in both synthetic and real-world, real-time applications, or a site that compares the highest-performing ones, I'd like to read the articles and test results.
OP was aiming at a theoretical balls-to-the-wall system; consumer boards and CPUs just don't come close. If they did, you would see them in data centres, supercomputers and the like.
With some of the supercomputers, the tasks will be non-real-time, like CG renders or scientific data processing, in which case the best consumer unit will never perform as well as several units with lower performance specifications but with compatibility in tasks optimized for multi-unit systems. The motherboards designed for real-time use are the ones that will be more suited to SSD spam builds.

-Dragon-, I read the article from the link; the ASR-72405 sounds like it would be better than the 7H series, since the 7H series cannot act as a controller for RAID arrays. Compared with a single OCZ ZD4RM88-FH-3.2T, the benchmarks score higher for the twenty-four-drive setup. Then again, they never tested six of the OCZ ZD4RM88-FH-3.2T in a single RAID (if that's even possible with PCIe SSDs), and they didn't test the higher-capacity version of the ZD4RM88. For now, it looks like the superior build is the SSD spam one.
-Dragon- said:
Since there's no such thing currently as quad 10Gbit NICs (or that I've found) you'd need 3 to be able to get the data capacity of even one RAID card "out" of the box, so if you had 2 onboard 10Gbits, then 4 dual 10Gbit NICs with 2 adaptec cards you'd be able to get 80-90% of those cards out to the network.
It sounds like it would need at least six dual 10Gbit NICs to make use of 100% of the SSD bandwidth. The only dual 10Gbit NIC I've seen on hardforum is the Intel X540-T2 adapter, let alone knowing which the highest-performing one is; it would be easier if there were a site that specializes in discussion of NIC cards the way The SSD Review does with SSD tech, johnnyguru does with PSUs, or frostytech does with air coolers; if not, articles and general threads are fine too. The setup you mention would require another four PCIe x1 slots (assuming the board has two onboard dual 10Gbit NICs), or six PCIe slots to maximize the SSD bandwidth if they were all to be discrete cards (I doubt integrated dual 10Gbit could outperform a discrete card); this is in addition to the six PCIe x8 slots, and assumes graphics would be integrated, moved to the CPU (probably the worst-case scenario), or done over a PCI card if a different PCIe chip weren't on the board.

It looks like the highest-performing SSD The SSD Review found was the HGST Ultrastar SSD800MM, although if the reviews that have weighed capacity are any indication, the HGST Ultrastar SSD800MR 1TB would outperform the MM 800GB in a RAID, even though the initial ratings of the MR are lower than the MM and MH series. The drive isn't on the market, so the fastest SSD for use in the SSD spam system that cpubenchmark reports is the Intel SSDSC2BW240A3 (aka the 520 480GB). I've read several positive reviews of the OCZ Vertex (the original series).

Lost-Benji, you mention limitations from the MHz rating of the RAM in the test bench used by The SSD Review. I'm thinking again about server boards, and whether overclocking and CAS changes might be limited on server boards to the point that the large-capacity advantage wouldn't be able to overcome the per-quad-set speed advantage of consumer boards. The fastest RAM tested on CPU benchmark is the Corsair CMD16GX3M4A2666C10, which comes in sets of four 4GB sticks, with settings of 2666MHz and 10-12-12-31 out of the box. As a side note, I've heard RAM and CPUs vary by unit (like produce), and the settings the manufacturer sells the sticks under are only an approximation of what the sticks can achieve. There are other kits Corsair is selling that have different speed and latency settings and a higher price, but the set listed above is the only one benchmarked so far.

Since the SSD review relies on synthetic benchmarks, what are some real-time applications that will stress the SSD portion of a system? How about overall real-time system benchmarks (or at least a better version of a potential benchmark test I posted in the OP)?
 
A NIC test site would be kind of boring; most network chips are integrated, and the ones that are actual cards generally all perform close enough to their design specs that there's not much to test. As far as the x540-T2s go, I can vouch that they can get really, really close to the rated 10Gbps speeds. The problem is there are no PCIe 3.0 10Gbit NICs out there yet; they're all PCIe 2.0, and a single PCIe 2.0 lane only gives you 500MB/s of bandwidth, while the 2 10Gbit ports on the x540 can theoretically do 2.5GB/s, so you'd need at least 5 PCIe 2.0 lanes per NIC, which rounds up to an x8 slot. So if you have a standard 6-slot board the best you can do is 2 RAID cards and 4 NICs, with the onboard NIC giving ~12GB/s network throughput and 11-12GB/s disk throughput, which is reasonably balanced. Even if there were PCIe 3.0 10Gbit NICs, that wouldn't really change anything, as the 10Gbit jacks tend to be quite a bit larger than the 1Gbit jack and most NICs try to stay in a low-profile form factor, so you'd still need 4 full slots, and since nobody does PCIe x4 slots it would still be 4 PCIe x8 slots.
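
The lane arithmetic in that last bit, written out (500MB/s per PCIe 2.0 lane is the figure used above):

```python
# PCIe 2.0 lanes needed to feed a dual-port 10Gbit NIC, rounded up to a real
# slot width; bandwidth figures are the ones used in the post above.
import math

pcie2_mb_per_lane = 500                # ~500 MB/s per PCIe 2.0 lane
nic_mb = 2 * 10_000 / 8                # two 10Gbit ports -> 2500 MB/s

lanes_needed = math.ceil(nic_mb / pcie2_mb_per_lane)        # 5 lanes
slot = next(w for w in (1, 4, 8, 16) if w >= lanes_needed)  # rounds up to x8
print(lanes_needed, f"x{slot}")
```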

The absolute highest I could find with not too much looking at off the shelf components would be this board which would let you add another NIC and RAID card for ~15GB/s network and ~15-16GB/s disk, and it's probably better to have a little more disk than network for internal usage and such.
 
I just tested my x540-T2s to see just what they were capable of via iperf. With -d -r (bidirectional sequential), with 2 iperfs running at the same time, one on each NIC, for 60 seconds and 10 threads, speeds were >9.7Gbit/sec for each NIC on each test. However, when I just did -d (simultaneous bidirectional), the speeds dropped down to 8.7 to 9.1Gbit/sec, except for one that managed to still send at 9.7Gbit/sec. I think the bidirectional test may have been nearly CPU limited on my 4.6GHz 2600k, as CPU utilization on all 8 threads was over 90%. The server at the other end fared a little better as it has 24 2GHz threads and only hit ~50% CPU overall; 2 CPU cores were completely pegged though, so the bottleneck could have been there too. Either way, the CPU interrupts caused by 40Gbps of network traffic are quite severe (though I do have my NICs configured to minimize latency, not throughput, at the cost of CPU usage, but even still), so 100+Gbps network IO + 10+GB/s disk IO might not be reasonably attainable by any current system. That board I linked earlier might get close if it had 4x 3.0+GHz 8-core Xeons in it, but that's an awfully expensive experiment, like more expensive than a new car.
 
Also, doing those bidirectional tests reminded me that while a RAID card that can read at 11GB/s and write at 12GB/s can't both read and write at the same time at those speeds, a 10Gbps NIC can, which means a 2-port 10Gbps card has a real capacity of 40Gbps, so long as the workload has equal read and write parts. So theoretically, with the last board I linked, if you had 4 NICs + 2 onboard for a total of 100Gbps send and 100Gbps receive, you could use the other 4 PCIe slots for RAID cards for ~20GB/s throughput (read/write combined), so if your workload was 50% read 50% write you could get 20GB/s out of that box, assuming the CPUs were beefy enough and could handle the interrupts.
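
A quick sketch of that full-duplex accounting, using the figures from this post (NIC ports can send and receive at line rate simultaneously; the RAID cards cannot):

```python
# Full-duplex accounting for the six-NIC / four-RAID-card split above.
ports = 4 * 2 + 2                    # four dual-port NICs plus two onboard ports
net_each_way_gbit = ports * 10       # 100 Gbit/s send and 100 Gbit/s receive
net_total_gb = 2 * net_each_way_gbit / 8

disk_total_gb = 20                   # ~20 GB/s combined read+write over 4 RAID cards
print(net_each_way_gbit, net_total_gb, disk_total_gb)
# With a 50% read / 50% write workload, the ~20 GB/s of disk is the ceiling.
```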
 
My responses follow each quoted passage below (they were originally in red).

Did you have any particular boards in mind? There are several dozen from each manufacturer. For the purposes of this thread, the emphasis is on SSD performance, so one of the first criteria would be the number of PCIe lanes dedicated to the SSD cards (ideally, there would be one chip for the SSD PCIe slots, and one chip for two full PCIe x16 SLI, but any SLI support is probably going to be absent from a server board).

I can think of a few, but the trick is the way each talks to the slots. Most now will drive half or a third of them off the first CPU and the rest off the other CPU. You need to look closely at the manuals for the flowcharts and diagrams for the best choices. Crossfire will work on most server boards and I have seen SLI as well. I would suggest putting the RAID/HBA cards on one CPU and the video cards on the other for best throughput.

I haven't seen any controllers that use more than a PCIe x8 interface; the limitation of 48 lanes would mean six controller cards for a total of 30GB/s, as -Dragon- mentions. While the math of the maximum theoretical saturation of the SATA 3.0 bandwidth would project that it takes only ten drives to saturate the bandwidth, the SMART Optimus drives didn't get to that point until thirteen drives were added.

PCI-E expansion is something you may want to research.
http://www.magma.com/catalog
http://www.cyclone.com/products/expansion_systems/


I'm still talking theoretically with optimization of the board that will host the controllers in mind, so I'll keep it at ten drives per card, for a maximum total of sixty SSDs. This is a long-winded way of getting to the latter half of your sentence regarding RAM, which -Dragon- brings up again with the link to the 768GB board; with that much data being transferred between all of those drives, I assume it will take a lot of RAM. I've read on Microsoft's website that the maximum amount of RAM the OS can address is 192GB.

Must be an old site, looksy at this one: http://msdn.microsoft.com/en-us/lib...=vs.85).aspx#physical_memory_limits_windows_8

Lost-Benji mentions that desktop OSes were never intended for 2P or 4P. Given that:
#1 Only server boards have 768GB of RAM; Pretty much
#2 Windows 8 x64 cannot address even a quarter of 768GB of RAM; See above link, 512GB for Win 8
#3 There is theoretically a board available that has 48 lanes split evenly between six x8 PCIe slots; No, up to 40 lanes per CPU, the rest drive onboard devices
#4 Gaming, streaming, and recording aside, this thread is for finding the fastest SSD setup, this means that a modded server OS must be chosen--which one would reduce the software overhead enough that it wouldn't get in the way of SSD performance? http://www.win2012workstation.com/

If it were something other than synthetic testing meant for SSD-speed heavy application (and to get on to what I am familiar with), like the Let's Play-centered benchmark, it would require sticking with an OS that will support the software and deliver the lowest overhead, which would probably be a Linux-based OS, but would probably end up being Win7 for compatibility with the greatest number of applications.

Windows 8 seems to handle anything I had on 7.

Lost-Benji, you mention limitations from the MHz rating of the RAM in the test bench used by The SSD Review. I'm thinking again about server boards, and whether overclocking and CAS changes might be limited on server boards to the point that the large-capacity advantage wouldn't be able to overcome the per-quad-set speed advantage of consumer boards. The fastest RAM tested on CPU benchmark is the Corsair CMD16GX3M4A2666C10, which comes in sets of four 4GB sticks, with settings of 2666MHz and 10-12-12-31 out of the box. As a side note, I've heard RAM and CPUs vary by unit (like produce), and the settings the manufacturer sells the sticks under are only an approximation of what the sticks can achieve. There are other kits Corsair is selling that have different speed and latency settings and a higher price, but the set listed above is the only one benchmarked so far.

Overclocking and server boards don't mix well; they are intended for stability, not screaming their titties off. Xeons are designed to run native with DDR-III 1600. You would be wise to run @ 1600 but with the tightest timings you can source.

Since the SSD review relies on synthetic benchmarks, what are some real-time applications that will stress the SSD portion of a system? How about overall real-time system benchmarks (or at least a better version of a potential benchmark test I posted in the OP)?

If you want the fastest speeds for games, apps and workstation duties then you would look at 500+GB of RAM and a RAM drive. Mount a partition in it and enjoy the fastest speeds under the sun.
 
@-Dragon-: Going by the theoretical numbers, a single port on a 10Gbit/s card should deliver 1.25GB/s, and a dual-port 10Gbit/s card should deliver 2.5GB/s. For a 30GB/s SSD spam system spread over six PCIe x8 slots, it would require twelve cards to break even with the bandwidth. You're right with the proportions of two dual-NIC cards for every RAID card. I'm assuming that your tests were made per NIC on each card and fall in line with the theoretical numbers, except that in the best-case scenario of 9.7Gbit/s it would take at least thirteen cards to keep up with the RAID bandwidth: 30GB/s ÷ ((9.7Gbit/s × 2 ports) ÷ 8 bits per byte) ≈ 12.37 cards. The extra card would be a precaution against any possible network bottleneck, given that six cards holding ten SSDs each could go near 6GB/s apiece, for a possible 36GB/s maximum (although due to RAID scalability, and because the fastest individual drive I've seen hasn't hit 560MB/s, 30GB/s is the transfer rate used here). The 2:1 (+1) ratio of dual-NIC cards to RAID cards (plus one for bottleneck avoidance) probably holds up roughly to scale (i.e. one RAID card : three dual NICs; two RAID cards : five dual NICs). The mention of NICs brings up a possible practical use of a system, and would dictate the way testing is done. The focus of the thread is the best SSD setup, so I'm probably going to fork the spreadsheet into one for an SSD spam build, and one with a real-time task focus, specifically gaming/recording/streaming.
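
That card-count formula, written out as a sketch so the assumptions (9.7Gbit/s measured per port, two ports per card, a 30GB/s disk target) stay explicit:

```python
# Dual-port NIC cards needed to keep up with a target disk throughput,
# using the formula from this post: 30 / ((9.7 * 2) / 8).
import math

def dual_nics_needed(disk_gb_s, gbit_per_port=9.7, ports=2):
    per_card_gb_s = gbit_per_port * ports / 8
    return disk_gb_s / per_card_gb_s

print(dual_nics_needed(30))             # ~12.37
print(math.ceil(dual_nics_needed(30)))  # 13 cards to avoid a network bottleneck
```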

The reason I've been using six RAID cards is the 48-lane limit of the best PEX chip; after reading the link and looking at some of the other motherboard manufacturers' boards, I noticed that there was an Intel 4P board with 128 PCIe lanes. This would mean sixteen RAID cards and one hundred sixty SSDs. If a board were to support that number of drives and have the network transfer capacity, it would be 128 PCIe lanes for the SSDs and 104 PCIe lanes for the dual NICs, so a single chip running all of the PCIe lanes would need a minimum of 232 lanes. Following the dual-NIC bandwidth proportions mentioned above and ignoring the number of PCIe slots and lanes on the board, it would be five RAID cards and ten dual-NIC cards (assuming the theoretical board had dual-NIC x540-T2s). This would get it near the theoretical 30GB/s in disk and network I/O, assuming that four E5-2690s could handle it (again, right now it's not network utilization, it's SSD performance I'm trying to figure out).

All that is assuming that the work has equal read and write parts. I don't fully understand how a dual NIC card can read and write at the same time without losing any of its transfer capacity--it seems like if that's how the NIC is measured, a drive would have the same measurement; they're both serial--it's not like the NIC can send and receive optic signals at the same time. If that were the case, the RAID card to dual NIC card ratio would be about 1:1. As much as I'd like a simpler formula, I'd also like to be using correct measurements.

Lost-Benji, I read your post and looked at the links, but I'm going to have to post my response later.

As a side note, after looking at a few reviews, it looks like the Samsung 840 Pro Series 512GB would be the best one to use due to how well it scales performance in RAIDs; there aren't many reviews that compare SSD spam setups, and few review the largest capacity drive consistently.
 
There is some good information on this topic in this thread posted by AndyE. He is/was doing a study along the lines of fast IO workloads.
 
If you have a work load that's 50% read/50% write then a 2 port 10Gbit NIC can actually do 4-5GB/s of total network traffic. That x540-T2 seems to be capable of sending and receiving at ~97+% full line speed simultaneously
 
it depends what you want

if you want the fastest 512b / 4k queue depth 1 experience, fusion io, hands down.
 
LostBenji said:
I can think of a few but were the trick is the way each talk to the slots. Most now will drive half or a third off the first CPU and the rest off the other CPU. You need to look closely at the manuals for the flowcharts and diagrams for best choices.Crossfire will work on most server boards and I have seen SLI as well. I would suggest putting the RAID /HBA cards on the same CPU and put the video cards on the other for best throughput.

1 Supermicro:
Supermicro's socket 1567 4P boards don't support PCIe 3.0. One of the socket 2011 4P boards I looked at from Supermicro, the X9QR7-TF, assigns two PCIe slots to each CPU, with CPU1 dedicating 8 of its lanes to its onboard X540 controller and 8 to its RAID controller. It has a total of 120 lanes assigned to the PCIe slots. It apparently doesn't allow users to assign lanes in any other configuration, so one of the board's eight PCIe slots is locked at a maximum of x8: "CPU1 Slot1 PCI-E Link Width This feature allows the user to set the PCI-Exp bus speed between CPU1 and the PCI-e port. The options are x4, and x8."

The 2P series of boards from Supermicro, the X9DRH, apparently do allow reassignment of PCIe lanes on the board, but since the maximum width of each slot (except for a single x16 slot) on the X9DRH-iTF is x8, it doesn't appear to be a feature that would better distribute lanes to the slots that need them: "IIO 1 PCIe Port Bifurcation Control This submenu allows the user to configure PCIe Port Bifurcation Control settings for the IIO 1 PCI-Exp port. This feature determines how to distribute the available PCI-Express lanes to the PCI-E Root Ports." I could see it being applicable if there were more slots than there were PCIe lanes, but for the 2P and 4P boards, there are more lanes available than there are slots on the board.

The 4P X9QR7-TF only uses 32 of the 40 lanes on each processor, except for CPU1, which brings down the speed of one of the PCIe slots from x16 to x8 to drive its onboard X540 and RAID controller. If the lanes could be reassigned, the x8 lanes going to the LSI controller could go to an additional x8 slot left off the board that could be running a higher-performing Adaptec RAID card, and the x8 lanes going to the X540 could be assigned to one of the other CPUs that is only using 32 lanes total. Quad x16 SLI could run on the board; if septuple SLI or Crossfire were such a thing, it could do that too. PCIe lanes are run entirely off of the PCIe controllers on the CPUs. The C602 chipset handles IO tasks other than PCIe management on the X9QR7-TF.

For the purposes of SSD spam builds, the X9QR7-TF is probably the best 4P board, as the TF+ has a disadvantage of four fewer PCIe slots. The advantage the TF+ has is that it has room for 256GB more RAM than the TF. Even if RAM were a factor in the speed of an SSD spam system, I can't see having four x8 slots worth of RAIDs being replaced by RAM giving it higher IO results than the TF. It might be different if the system had no drives and was running off of RAM, but if that were the case, there are other boards with higher RAM capacity from other manufacturers.

2 Intel
Intel doesn't make any 4P boards that support the E7-4870, which (along with the 8P version) is produced solely for the purpose of running in boards with more than 2P. Any 4P board is going to have to run the E5 server series. The 4P Intel S4600LT2 has capacity for 1.5TB of RAM. It has two fewer slots than the X9QR7-TF; the manual doesn't mention any configuration options for the PCIe lanes, whether any are run by the C602 chipset, or even which lanes are going to which components. It uses riser cards for the PCIe slots; since it's a proprietary card, it's possible that there's a bottleneck of less than the approximately 96GB/s a single riser card with x16 on each slot would require. http://download.intel.com/support/motherboards/server/r2000lh-lt/sb/G42282-005_R2000LH2LT2_SG.pdf

I couldn't find any 4P boards manufactured by Intel that had as many PCIe slots as the Supermicro boards, although six PCIe x16 slots combined with 1.5TB of RAM might outweigh the extra one or two x8 slots gained by moving to a Supermicro board, assuming that all of the Intel x16 slots are running at the full 32GB/s bandwidth.

3 Tyan
Tyan doesn't have any 4P boards. They have plenty of 2P LGA2011 and LGA1356 boards. The greatest number of PCIe x8 slots I've seen on Tyan 2P or 4P boards was four.

The processor to be used for a 4P board would be the E5-4650. The E5-2690 has higher clock speeds than the E5-4650, but has 768GB capacity versus the E5-4650's 1.5TB. The 3970X can only run in single CPU configurations, can only address 64GB, and has higher clock speeds than either of the Xeon CPUs. It depends on the intended software application, but ideally the applications would be optimized for 4P; as mentioned in this thread previously, applications are frequently biased in favor of single CPU configurations. This could mean that a single CPU configuration in single-CPU optimized applications could outperform a multi-CPU configuration due to the higher clock speeds and Extreme Edition features. My theory is that the higher IO bandwidth of multi-CPU rigs will surpass any of the disadvantages of lower single-threaded and single-CPU biased applications. This includes real-time GPU and CPU-intensive applications like the ones I favor (games, recording, and streaming).

PCI-E expansion is something you may want to research.
Since the bottleneck with the RAID and dual 10Gbit cards is x8, and since they need most of the x8 slot, an x16 to dual x8 board would help with boards limited by PCIe slots but not by lanes, so I'm hoping there are other manufacturers out there that have the available tech. The ones listed on the sites of Magma and Cyclone don't appear to remove any bandwidth restrictions, since they support PCIe 2.0, and since they at best place multiple x4 cards into a single x8 slot. Even if they filled the x8 bandwidth, it would still be equivalent to a single x8 card's bandwidth.

I've changed the rolling build's OS to Windows Server 2012 x64, due to 4P support and ability to address RAM capacities greater than 512GB.

No, up to 40 lanes per CPU, the rest drive onboard devices
The PEX chipsets can control 48 PCIe lanes, so a CPU could be left out of the equation, and leave the PCIe controlling to the chipsets, correct? What does the CPU have to do with it if there is a discrete PCIe controller?

Xeons are designed to run native with DDR-III 1600. You would be wise to run @ 1600 but with the tightest timings you can source. [...]If you want the fastest speeds for games, apps and work station duties then you would look at 500+GB RAM and a RAM drive. Mount partition into it and enjoy the fastest speeds under the sun.
The Intel S4600LT2 has 16 channels, according to the manual: "48 DIMM slots -- Three DIMMs/Channel -- Four memory channels per processor," i.e. it supports quad-channel RAM with three DIMMs per channel. The configuration manual for the S4600LT2 has no information on latency configuration. The Supermicro X9Q TF board manual refers to installing RAM in "2 slots per channel"; this might or might not mean it supports fewer DIMMs per channel, but it's never explicitly stated on the site. For Supermicro at least, the RAM doesn't have to be ECC to run at 1600MHz (although 1600MHz is the processor-locked upper cap).

The RAM drive setup appears to perform better than an SSD setup, so the system performance comparison returns to RAM spam and CPU configuration versus a 64GB RAM setup with a single OC'd 3970x and Corsair Dominator Platinum sticks with tighter latency and higher clocks. It's a side issue to the best SSD setup, but is related due to the RAM capacity versus clock settings discussion. A RAM drive setup would ideally have SSDs anyway, since SSDs would be used for booting and for saving on shut down.

EDIT: After looking around, I could only find one site that had DDR3 sticks that were both 32GB and 1600MHz (it would take 32GB sticks to make full use of a 1.5TB board). Going with 1333MHz sticks would set it at half the speed available to the RAM that single-CPU systems are running. The benchmarks of a 1.5TB setup with 1600MHz sticks are unknown, but the performance of a 64GB setup with 2800MHz sticks is frequently published on OC forums.
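
Since the RAM drive idea keeps coming up, here is a minimal sketch of how write throughput on a RAM-backed path could be compared with an SSD path. The paths are placeholders (on Linux, /dev/shm is RAM-backed; "ssd_test.bin" stands in for a file on the SSD), and OS caching means this is only a rough indicator, not a proper benchmark.

```python
# Minimal write-throughput comparison: RAM-backed path vs SSD path.
# Paths are placeholders; OS caching makes this a rough indicator only.
import os, time

def write_throughput(path, total_mb=512, chunk_mb=4):
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())        # push the data out of the page cache
    elapsed = time.perf_counter() - start
    os.remove(path)
    return total_mb / elapsed       # MB/s

print("RAM-backed:", write_throughput("/dev/shm/ram_test.bin"))  # tmpfs on Linux
print("SSD path:  ", write_throughput("ssd_test.bin"))           # file on the SSD
```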

Sgraffite, I read the thread started by AndyE. He mentions in the most recent post to date wanting to get a minimum of 40GB/s out of the system. A PCIe 3.0 slot (rounded) at x1 has a bandwidth of 1GB/s; x2 is 2GB/s; x4 is 4GB/s; x8 is 8GB/s; x16 is 16GB/s. AndyE's different rigs use sixteen drives per controller rather than the ten 600MB/s drives that would max out the theoretical bandwidth of a single PCIe 3.0 x8 RAID card; thirteen drives was where The SSD Review saw results start to level off with the SMART drives. A single Samsung 840 Pro 512GB averaged about 535MB/s in bandwidth, according to the TweakTown review; it would take fifteen of those drives to hit the bandwidth cap, and if you wanted to be symmetrical (or needed it for RAID technicalities), it would be sixteen drives per controller. If each controller hit the 7880MB/s limit, with eight controllers each in x8 slots, the maximum would be 63,040MB/s, or ~63GB/s. This is with sixteen SSDs on every controller, one hundred twenty-eight SSDs total.
http://superuser.com/questions/5868...e-enough-bandwidth-for-a-dual-qsfp-40gbit-nic
http://www.hardwaresecrets.com/printpage/Everything-You-Need-to-Know-About-the-PCI-Express/190
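
The drive-count arithmetic above, as a sketch (7880MB/s per x8 PCIe 3.0 card and ~535MB/s per Samsung 840 Pro are the figures used in this post):

```python
# Drives per x8 PCIe 3.0 controller, and the eight-controller total.
import math

card_limit_mb = 7880     # per-card ceiling used in this post
drive_mb = 535           # average Samsung 840 Pro 512GB sequential figure

drives_per_card = math.ceil(card_limit_mb / drive_mb)   # 15 (16 if kept symmetrical)
total_mb = 8 * card_limit_mb                            # eight x8 controllers
print(drives_per_card, total_mb, total_mb / 1000)       # 15, 63040 MB/s, ~63 GB/s
```
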
AndyE mentions the 5V limitation, but there's no reason why he can't short one or more PSUs so he can have one PSU per array. It's what I've had in the rolling build list since it started; I've seen systems with at least two PSUs on over drive's forums. It might have to do with the way the folding scores are determined. The SSD spam build here doesn't take energy budgets or budgets of any kind into consideration.

-Dragon-, you mention the x540 moves 4-5GB/s. For an eight-slot motherboard with each slot at x8, with RAID cards moving 6GB/s (realistically, according to different reviews of massive SSD arrays, not theoretically), with a single x540 on board, with the average for the x540 taken as 4.5GB/s, and with priority given to moving the maximum amount of traffic over the network, it would take five x540 cards plus the one onboard to keep up with three Adaptec 71605s (27GB/s network : 18GB/s disk); with priority given to local IO, the ratio would be four x540 cards plus the one onboard to four Adaptec 71605s (22.5GB/s : 24GB/s).
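
Checking those two eight-slot splits as a sketch (4.5GB/s per x540 and 6GB/s per Adaptec 71605 are the working numbers from this post):

```python
# Network vs disk totals for the two eight-slot splits described above.
X540_GB = 4.5      # average dual-port 10Gbit figure used in this post
ADAPTEC_GB = 6.0   # per-card disk figure used in this post

def split(x540_cards, adaptec_cards, onboard_x540=1):
    network = (x540_cards + onboard_x540) * X540_GB
    disk = adaptec_cards * ADAPTEC_GB
    return network, disk

print(split(5, 3))   # network priority: (27.0, 18.0) GB/s
print(split(4, 4))   # local IO priority: (22.5, 24.0) GB/s
```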

gjs278 said:
"it depends what you want [...] if you want the fastest 512b / 4k queue depth 1 experience, fusion io, hands down.
Which fusion io enterprise PCIe drive are you referring to? The drives operate at PCIe 2.0 speeds, so the x16 drives can only run in an x16 slot at PCIe 3.0 speeds of 8GB/s (I don't know if they are physically backwards compatible with x8 slots or not). From benchmarks I've seen, the OCZ ZD4CM88-FH-3.2T performed above the drives from fusion io, but that's could be from a small sample size and unprofessional reviewing results. The difficulty in comparing builds that use enterprise spam versus SDD spam is that the RAID capability of enterprise PCIe drives is unknown, as are comparitive benchmarks for large capacity single unit SATA 3.0 SDDs. Do you know of any reviews that have benchmarked the 512b and 4k queues at depth 1 of fusion io's 10TB drive? How about alongside several of the popular or top-performing SSDs, like the OCZ Vertex 512GB (the original one), SanDisk Extreme II 480GB, or Samsung Pro 840 512GB? I'd like to see how a one hundred twenty-eight SSD setup with priority to speed rather than data preservation does against a single or multiple fusion io PCIe drives in multiple benchmarks (including 512b and 4k queue depth 1). I'm aware of the rock-paper-scissors of SSD reviews (the link below is a case in point); for the SSD spam scenario, the goal would be to test the fastest possible performance any SSD can hit given ideal conditions before it hits any kind of bottleneck; what is the fastest speed a drive can hit with any data whatsoever, even with one bit? (unrelated side note--you double posted)
http://www.storagereview.com/samsung_ssd_840_pro_review
Personally (and not scientifically), I'd be looking for the best drive performance for gaming, screen and mic recording, and streaming; it looks like a RAM drive is the best tool for that anyway, with SSDs to boot the OS and save files on shut down. HDDs would factor in somehow for bulk storage, with a second HDD system for backups.
 
Those PEX chips from PLX are bridges, not controllers. At some point they have to be connected to the system. And since all newer CPUs only have PCIe as peripheral bus, only PCIe can be used. In fact the PEX chip uses 16 lanes from the CPU and provides 32 to devices, limiting its overall raw bandwidth between host and devices to 16 GB/s.

What you call chipset is only connected by measly 4 PCIe 2.0 lanes to the CPU.
 
Those PEX chips from PLX are bridges, not controllers. At some point they have to be connected to the system. And since all newer CPUs only have PCIe as peripheral bus, only PCIe can be used. In fact the PEX chip uses 16 lanes from the CPU and provides 32 to devices, limiting its overall raw bandwidth between host and devices to 16 GB/s.

What you call chipset is only connected by measly 4 PCIe 2.0 lanes to the CPU.

I concur.




Joel:
This thread is more of a homework question for a school project. I have spent more time on this thread than I wanted to and won't be entertaining any more activity in it. You don't come across as being an idiot, but you do come across as someone who reads too many specs but fails to put them into a real-world case. I have strong suspicions your info comes solely from forums rather than actual hands-on builds.

Sorry to be blunt, just hit that point where I don't wish to keep arguing with you when you keep missing the point.

I will leave you with the following last tit-bits but that's it.

PCI-E expansion can be used for the video and other devices, keep data I/O on the main system. PCI-E v3 is coming and supported, look harder.

Forget about 4P boards; you can't/won't use them. You have already excluded RAM drives, so having more than 8GB of RAM won't do you any good.
 
1 Supermicro:
joel96 said:
Which Fusion-io enterprise PCIe drive are you referring to? [...]

there is no drive in existence that outperforms the read 4k QD1 on fusion-io iodrives.

I've seen quite a few pci-e reviews of the enterprise stuff and have never seen 20k iops on 4k qd1. according to http://www.storagereview.com/ocz_zdrive_r4_enterprise_pcie_ssd_review the fusion still wins.

you can't improve the speed of 4k qd1 or 512b qd1 with raid. the file is too small to split into chunks at that point; you can only improve 4k at higher queue depths with raid. 4k qd1 performance can never be increased by adding raid.
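a minimal sketch of why that is, assuming a hypothetical 4-drive raid 0 with a 128 KiB stripe unit (real controllers differ, but the mapping idea is the same):

```python
# Which RAID 0 member services a request, for an assumed 4-drive array
# with a 128 KiB stripe unit. A 4 KiB QD1 read falls inside one stripe
# unit, so exactly one drive does the work and its latency is unchanged.
STRIPE = 128 * 1024   # bytes per stripe unit (assumption)
DRIVES = 4            # members in the array (assumption)

def members_touched(offset: int, length: int) -> set[int]:
    """Set of member drives a request of `length` bytes at `offset` must touch."""
    first = offset // STRIPE
    last = (offset + length - 1) // STRIPE
    return {s % DRIVES for s in range(first, last + 1)}

print(members_touched(offset=1_000_000, length=4096))        # one drive
print(members_touched(offset=1_000_000, length=1024 * 1024)) # all four drives
```

only the second, larger request can be split across members, which is why striping helps sequential and high-queue-depth work but not 4k qd1.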

your best build possible will be purchasing an ssd to boot from, a fusion card to hold all of your programs (I would go with the iodrive duo 640gb, only $800 or less if you wait long enough on ebay), and an HDD for data, sitting behind an L2 cache that uses either a portion of the SSD or the fusion card for read/write caching via FancyCache.

you can't boot purely off a fusion card or I would recommend that. you can buy the highest-end raid controller possible (I have a 9265-8i with 5 ssds) and your performance will never come close to a fusion iodrive duo. you can hit 1.5gb/s sequential (you won't ever do this in practice though) and you can read 4k qd1 at 80mb/s.
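if you want to see what 4k qd1 actually measures, here is a rough linux-only timer (not iometer, just an illustration; python overhead keeps the numbers lower than a real tool would report, /dev/sdX is a placeholder you must change, and raw devices need root):

```python
#!/usr/bin/env python3
# Rough QD1 4 KiB random-read timer: one request in flight at a time,
# so the result is essentially per-request latency expressed as IOPS.
import mmap, os, random, time

DEV = "/dev/sdX"          # hypothetical device path, replace before running
BS = 4096                 # 4 KiB request size
SECONDS = 5

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
size = os.lseek(fd, 0, os.SEEK_END)
buf = mmap.mmap(-1, BS)                        # page-aligned buffer, needed for O_DIRECT

ios, deadline = 0, time.perf_counter() + SECONDS
while time.perf_counter() < deadline:
    off = random.randrange(0, size // BS) * BS  # 4 KiB-aligned random offset
    os.preadv(fd, [buf], off)                   # exactly one outstanding request = QD1
    ios += 1
os.close(fd)

print(f"{ios / SECONDS:,.0f} IOPS ({ios / SECONDS * BS / 1e6:.1f} MB/s) at QD1")
```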
 
AnandTech is good for benchmark comparisons. They cover SATA SSDs mostly; there aren't many PCIe test results.
omniscience said:
the PEX chip uses 16 lanes from the CPU and provides 32 to devices, limiting its overall raw bandwidth between host and devices to 16 GB/s.
The documentation for the bridges, from the 48-lane 8747 up to the 96-lane 8796, says they support PCIe 3.0. http://www.plxtech.com/products/expresslane/gen3
The 8764, 8780, and 8796 bridges can support two full x16 slots; the 8747 supports only one x16 slot; the 8748 and 8749 don't appear to support even one x16 slot (just splitting one between two x8 slots at best); the 8750 supports a single x16 slot and can divide the lanes from that point. There's no mention of the bridge being connected to the CPU as a PCIe 2.0 device. Since PEX bridges currently have one x16 uplink at best, boards with PEX bridges should be avoided for builds that would use PCIe for more than one device, whether the lanes are taken up by GPUs, NICs, or SSDs.
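For reference, a rough per-direction PCIe bandwidth calculation (line-coding overhead included, protocol overhead ignored; all figures approximate):

```python
# Back-of-the-envelope PCIe bandwidth per direction, to sanity-check the
# "16 lanes upstream, ~16 GB/s" point about the PEX uplink.
GT_PER_LANE = {"2.0": 5.0, "3.0": 8.0}           # gigatransfers/s per lane
ENCODING    = {"2.0": 8 / 10, "3.0": 128 / 130}   # usable fraction after line coding

def gb_per_s(gen: str, lanes: int) -> float:
    """Approximate usable GB/s in one direction."""
    return GT_PER_LANE[gen] * ENCODING[gen] * lanes / 8  # bits -> bytes

print(f"PCIe 2.0 x8 : {gb_per_s('2.0', 8):.1f} GB/s")    # ~4.0 GB/s
print(f"PCIe 3.0 x8 : {gb_per_s('3.0', 8):.1f} GB/s")    # ~7.9 GB/s
print(f"PCIe 3.0 x16: {gb_per_s('3.0', 16):.1f} GB/s")   # ~15.8 GB/s, the PEX uplink
```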

For RAM drives (I don't mean SSDs that use RAM slots as an interface), mainstream gamer boards outperform server boards thanks to the gamer boards' latency adjustments, quad-channel support, and clock speed adjustments. Applications that use more than 64GB of RAM are likely to be non-mainstream ones whose work doesn't have to happen in real time, like 3D rendering or image processing. Boards with OC capabilities aren't available in 4P, but there are several available in 2P. The gaming boards only support socket 2011, and since the 3970X can only run in a 1P configuration, the best CPU available is the E5-2687W (which has the fastest clock speeds of the E5 line). This means a total of 80 PCIe lanes, enough for five PCIe x16 links or ten x8 links.

The Fusion-io ioDrive Octal 5.12TB has lower latency, better mixed-workload IOPS, and higher write bandwidth than the 10.24TB version. The 10.24TB version has higher raw read and write IOPS and better read bandwidth (although that's 6.7GB/s, and through a PCIe 2.0 x8 slot its read bandwidth would end up the same as the 5.12TB version's 6.0GB/s). I would pick the 5.12TB version, since the majority of real-world traffic is mixed reads and writes. Fusion-io states in their product forum that the drives can run in RAID 0. http://community.fusionio.com/products/f/34/t/345.aspx
It would probably be software RAID, which according to the thread below is inferior to running a single drive. http://www.tomshardware.com/forum/265357-32-hardware-raid-software-raid-raid If the build were intended for practical use, an HDD would make complete sense (as would a discrete NAS system in RAID 10 for regular backups connected via the x540). Fusion-io confirmed the Octal cannot host the OS. http://community.fusionio.com/products/f/34/p/566/1510.aspx As such, the system would need both a RAID card and a PCIe drive whether it uses a RAM drive or a PCIe SSD for most operations other than the OS. The only advantage I can see of having a PCIe SSD that cannot boot would be capacity during already-booted operation, which is something non-real-time applications would require, including games reading from huge installation folders (although a game or real-time application that takes up 20GB could still run off of a RAM drive with at most 44GB remaining for operating memory). There would be no speed advantage with more than one PCIe SSD, so the only way a RAM drive system would see higher speeds would be to increase the number of SSDs.
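To put numbers on the Octal comparison above: effective read bandwidth is the smaller of the card's rating and what the slot can move. A minimal sketch; the 4 GB/s slot figure assumes a PCIe 2.0 x8 electrical link, and the device figures are the spec numbers quoted above:

```python
# Effective bandwidth = min(device rating, link capacity), one direction.
def effective(device_gbps: float, link_gbps: float) -> float:
    return min(device_gbps, link_gbps)

PCIE2_X8 = 4.0  # GB/s usable through a PCIe 2.0 x8 link (approx.)
for name, rated in [("Octal 5.12TB", 6.0), ("Octal 10.24TB", 6.7)]:
    print(f"{name}: rated {rated} GB/s, usable ~{effective(rated, PCIE2_X8)} GB/s")
```

Both versions end up pinned at the same link ceiling, which is the point being made about the read-bandwidth difference washing out.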

A maximized SSD system would have nine SSD RAID cards, one PCIe SSD, and would run off of a RAM drive. The next step to finding the maximized SSD system would be to find a 2P socket 2011 board that allows RAM overclocking, with ten PCIe x8 slots.

In keeping with the 4.5GB/s:6GB/s storage:NIC ratio, a dual E5-2687W system focused on getting the most IO out of the system would have one x540-T2 onboard, one PCIe SSD, three SSD RAID cards, and five x540 cards.

A system with a graphics focus and four Titans would only leave 16 lanes for NICs, the PCIe SSD, and the PCIe RAID card. A system with three Titans would leave 32 lanes for IO: 8 lanes would be taken up by the single PCIe SSD, another 8 by at least one PCIe RAID card, and the remaining 16 by an onboard x540 and an x540 NIC, but that wouldn't be able to handle 3GB/s of the bandwidth. A system that maximizes network bandwidth would have two x16 GPUs taking up 32 lanes, leaving 48 lanes for IO (i.e. six x8 links), a ratio of approximately 21 lanes for the SSDs to 27 for the NICs; for the maximum amount of network headroom, this would mean four PCIe storage cards (three SSD RAID cards and one PCIe SSD), three NIC cards, and one x540 onboard (it doesn't necessarily have to be onboard, and most gaming boards aren't going to have one onboard, so my spreadsheet assumes installing four x540 cards).

The graphics-intensive, network-IO-maximized system would be used for streaming; a board that supports it would need eight PCIe slots, two of which would need to be PCIe 3.0 x16 (every one would need to be full height, and the x16 slots would need double-slot spacing; most gaming board manufacturers place the slots flush with each other--why they don't just space every slot a double-slot width apart might be due to the reduction in speed from a longer trace, or possibly the extra cost of additional materials, but any reduction in speed would be negligible compared to going without the extra cards crowded out by poor slot placement). A quick search of Google Shopping turned up the Asus Z9PE-D8 WS, with seven tightly packed PCIe slots. Reviews of 2P 2011 gaming boards have been scarce, but I'm going to keep looking for one with eight to ten PCIe slots.
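A quick lane-budget check of those GPU/IO splits, assuming x16 per GPU and the 80-lane total from the dual E5-2687W setup:

```python
# Lanes left for NICs, RAID cards and PCIe SSDs after the GPUs are placed.
TOTAL = 2 * 40  # two E5-2687W CPUs, 40 PCIe 3.0 lanes each

def io_lanes_left(gpus: int, gpu_width: int = 16) -> int:
    return TOTAL - gpus * gpu_width

for n in (4, 3, 2):
    left = io_lanes_left(n)
    print(f"{n} GPUs at x16 -> {left} lanes for IO ({left // 8} x8 slots)")
# 4 GPUs -> 16 lanes, 3 -> 32, 2 -> 48 (six x8 links), matching the figures above.
```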
 
Almost a year ago, I started to build a 2P system with >20 GB/s transfer speed from a 48-SSD array and 6 raid controllers, hitting a few architectural limits.

The idea, selection of components, lessons learned and some results are written up in a thread at servethehome.
http://forums.servethehome.com/f12/intro-built-notes-799.html

Be warned, it's a long thread ;)
If questions still remain, just ask :)

BTW,
your #1 enemy is limited main memory bandwidth and #2 the virtual memory subsystem in the OS and its HW dependencies (AT, TLB, CC, ...).
Before I forget: forget most SSD benchmark tools; they won't scale appropriately due to internal limitations. Use IOMeter instead.
Networking: Use FDR Infiniband instead of 10 Gbit/s ethernet, which is too slow for this performance envelope.

Andy

PS/update:
If you just want to read a summary, my intro thread here at [H] 2 months ago might be a good start ...
 
Joel,
some quick comments on your elaborations.

Current Intel E5 CPUs only support 1600 MHz with 2 DIMMs per channel. With 3 DIMMs, the transfer rate drops to 1333 MHz, so the 1.5 TB config you mentioned can only run at 1333 MHz.

Adding up the numbers of individual components will seldom produce the results you are arguing for. Take 6 RAID controllers, put them in one system, add as many SSDs as you can, and try to extract the theoretical peak performance - there is a long way to go. Filesystem config impacts it, driver issues impact it, HW conflicts impact it, etc., etc.

With SB and IB, PCI grew significantly faster than the memory subsystem. One LGA 2011 CPU can provide about 40 GB/s of memory bandwidth. A fully loaded PCIe subsystem with 40 lanes can deliver 40 GB/s read and 40 GB/s write concurrently, overwhelming the memory bandwidth available. Add to this the unavoidable cache thrashing (CPU and IO) and your IO performance goes down. Expect that keeping I/O to 25-30% of the available memory bandwidth leaves enough headroom for the CPU to actually do something useful - which, in the case of an LGA 2011 CPU, is about 10-12 GB/s of I/O.
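In rough numbers (the ~985 MB/s per gen3 lane figure already includes 128b/130b coding; everything here is approximate):

```python
# Memory-vs-PCIe budget for one LGA 2011 CPU, using the figures above.
MEM_BW   = 40.0                 # GB/s usable main-memory bandwidth (approx.)
PCIE_BW  = 40 * 985e-3          # GB/s per direction from 40 gen3 lanes (~39.4)
IO_SHARE = (0.25, 0.30)         # fraction of memory bandwidth left for I/O

print(f"PCIe can move ~{PCIE_BW:.0f} GB/s each way; memory tops out near {MEM_BW:.0f} GB/s.")
lo, hi = (MEM_BW * s for s in IO_SHARE)
print(f"Sustainable I/O while the CPU still computes: roughly {lo:.0f}-{hi:.0f} GB/s.")
```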

Adding up the individual metrics of multisocket systems is purely theoretical. In practice, I have seen many situations where moving the same IO config from a single LGA 2011 CPU to a 2P or 4P system leads to a decline in the absolute performance of the solution. Just to be clear about this statement: the 1P system is faster than the 2P and 4P systems (wall clock time).

You completely ignored the implications of well or badly written application software. While OSes can deal with hundreds or thousands of CPUs, IO channels, etc., the most likely core bottleneck comes from the application software. The days of the implicit speedups the benchmarking community enjoyed for the last 30 years are over. Out. Gone. To reap the benefit of current and future parallel systems, applications have to be rewritten more or less from scratch.

As a first example: how many applications have you seen which try to utilize the QPI to the fullest extent? (QPI is a core bottleneck in NUMA apps.) As a second example: my system delivered 2.2 mio IOPS (4 KB each), but was ultimately bottlenecked by 95%+ CPU utilization just dealing with the I/O interrupts the driver architecture and OS triggered. So yes, I had an 8.6 GB/sec random IO transfer rate, which is fantastic, but it represents only 50% of the transactions the 48 SSDs per se were capable of. Third example: unless your app is only streaming data from the SSD to memory, the most likely bottleneck will be the size of the Translation Lookaside Buffer of the CPU. Fourth example: only a handful of RAID controllers can deal with massive SSD arrays. Fifth example: specialized HW like Fusion IO might be nice, but (a) it is very expensive and (b) it was limited last year by PCIe v2 connectors; they didn't have PCIe v3 support (I don't know if they have already switched to PCIe v3).
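The arithmetic behind the second example is simply throughput = IOPS x request size; whether it reads closer to 8.4 or 9.0 depends on binary versus decimal gigabytes, which is why quoted figures vary a little:

```python
# IOPS-to-bandwidth conversion for 4 KiB requests.
def iops_to_bandwidth(iops: float, block_bytes: int = 4096) -> tuple[float, float]:
    byte_rate = iops * block_bytes
    return byte_rate / 1e9, byte_rate / 2**30   # (GB/s decimal, GiB/s binary)

gb, gib = iops_to_bandwidth(2.2e6)
print(f"2.2M x 4 KiB = {gb:.1f} GB/s ({gib:.1f} GiB/s)")
```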

There is much more to this, but just wanted to write about a few things which need to be considered.

Andy
 
Asking a question like: "Fastest Theoretical" and "irregardless of price and risk of data loss"

Personally I would hand Intel $20million and say I wanted a one-off custom array that was fast enough to melt my face off.

question answered.
 
.... replace Intel with DDN and add a zero to your budget for a really unique system ....

They power 7 of the top10 High Performance Computers of the Top500 list.

From their website:
The Worldwide Leader in Supercomputing Storage
  • World's fastest file storage systems in production
  • Most production Petascale systems (4 now, more TBA)
  • More Top 100 systems than all others combined
  • More 100GB/s systems than all others combined
  • Partner of choice to HPC systems providers
  • 800% Better Performance; Fastest In The World
  • 40%+ Better Data Center Density
  • TB/s throughput in only 25 Systems
  • World's broadest HPC storage solutions portfolio
  • HPCWire Editor's Choice Award 8 years running

Andy
 
The board mentioned in the spreadsheet is marketed as being able to run up to 2133 overclocked RAM. http://www.asus.com/Motherboards/Z9PED8_WS/#specifications It uses the E5-2600 family of processors. I can't think of how they could market it as OC-capable unless it ran in a single LGA 2011 1P configuration. I'm going to ask Asus about it. OC capability is one of the main reasons I would choose it over the Z9PE-D16.

@Andy:
"your #1 enemy is limited main memory bandwidth and #2 the virtual memory subsystem in the OS and its HW dependencies (AT, TLB, CC, ...)." Translation Lookaside Buffer (TLB): You need a big one for a system with a large number of high capacity drives. This is because the TLB needs to map more virtual memory addresses than a system with a single low capacity drive. From the Wikipedia article on TLBs, it sounds like the maximum size of the TLB required depends on the maximum number of addresses for an individual drive multiplied by the number of drives in the system; this would be the case regardless of how many hardware or software RAIDs are used. The TLB stores the translated virtual addresses as physical addresses. I'm assuming you're referring to the Page Attribute Table (PAT), which manipulates memory caching no a per-page basis. I don't think you're referring to allocation tables, since those are an on-disk file systems. I'm assuming that CC refers to Cache Coherence or the Cache Controller; either way, when a system has a large number of drives, ensuring coherency would be dependent on the speed and bandwidth capacity of the cache controller.

"the most likely bottleneck will be the size of the Translation Lookaside Buffer of the CPU." The question then becomes which CPU configuration has the largest TLB in consideration of other disadvantages. Multi-processor configurations would have the advantage at least in the TLB aspect.

"With SB and IB, PCI grew significantly faster than the memory subsystem." I'm assuming you're referring to PCIe capacities and PCIe speeds independent of the bandwidth the Sandy Bridge (SB) and Ivy Bridge (IB) can handle.

I've updated the rolling component build to use the Mellanox MCB194A-FCAT dual-port 40Gb/s card. It's marketed as 56Gb/s per port, which would give it a total of 14GB/s per card, but it is more likely to run around 5GB/s per port if the 40Gb/s ratings are accurate. It is a PCIe 3.0 x16 card, which gives it a clear advantage over the x540, even if adapters are used for compatibility.
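The conversion I'm using is just line rate divided by eight, optionally discounted for link encoding (the factors below are the standard ones for 8b/10b QDR and 64b/66b FDR links; real throughput also loses some protocol overhead):

```python
# Per-port line-rate conversion: Gb/s -> GB/s, with optional encoding efficiency.
def port_gb_per_s(gbit: float, efficiency: float = 1.0) -> float:
    return gbit * efficiency / 8

print(f"40 Gb/s port: ~{port_gb_per_s(40):.1f} GB/s raw, "
      f"~{port_gb_per_s(40, 8 / 10):.1f} GB/s after 8b/10b")
print(f"56 Gb/s port: ~{port_gb_per_s(56):.1f} GB/s raw, "
      f"~{port_gb_per_s(56, 64 / 66):.1f} GB/s after 64b/66b")
```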

In regards to which RAID card you're using, have you found better performance from 24 drives on the 24 port version of the Adaptec 71605 or from 16 drives on the 16 port version?

What is it about DDN that makes their systems superior to Intel's supercomputers? Part of it appears to be custom database software. http://www.ddn.com/images/SFA10K-M_0.jpg They also prefer object replication storage over RAID, mostly because of the time it takes to rebuild a system after drive loss, but this thread doesn't take drive failure or cost into consideration. It looks like they also use RAM drives, SSD flash storage, and HDDs (sort of like what I proposed in my previous post, except with the SATA SSDs only hosting the OS for the transfer to the RAM drive at boot, and with a Fusion-io Octal used for applications). They mention in the SFA7700 brochure that the cache is "64GB - mirrored power-loss safe." This sounds like DDN is using either RAM, a PCIe SSD, or a SATA SSD as the cache.

DDN claims their systems can exceed 50 mio IOPS, but this is with multiple systems--it's like saying a distributed computing client spread over thousands of systems is faster than a single system. This thread is about realtime performance, so if a multiple-system configuration can perform better than a single system, that's fine, as long as the system is the fastest (i.e., the multi-system option isn't limited to two systems; it's whatever number of systems can theoretically exist given OS, CPU, and other limitations, even if that exceeds the number of SSDs on the planet). I'd like to know which of DDN's systems performs best in realtime applications in IOMeter, and at 4k QD1.

You are correct that newer RAID cards run PCIe 3.0 whereas Fusion-io's PCIe SSDs run on 2.0, but the servethehome thread mentions that saturation of the CPU cores occurs before the drives can fill the PCIe bandwidth, hence the need to break things down into fewer drives per RAID array. So which is better overall, or at least at 4k QD1--SSD RAID spam or the 5TB Octal? Since the servethehome thread also mentions software RAID as a viable way of reducing the bandwidth load on the CPU, it should be possible to use the PCIe enterprise drives in a software RAID, which might end up being faster than a single PCIe drive.

ehorn mentions in the servethehome thread, "When I did this on an 1155/3570K platform (4 ramdisks, 4 workers, 1MB seq @ 32QD), I saw right around that level and could get no more regardless of how much more I stacked." It sounds like RAM disks work similarly to SSD spam systems, where bandwidth must be distributed in a balanced way--it makes sense that a quad-channel system with eight slots would perform better if it were set up as two RAM disks in RAID rather than one. Is it possible to set up RAM disks in RAID, and does it increase performance? This would also mean that a system with 512GB capacity might perform better than a single-CPU system, due to the greater number of RAM disks involved in the RAM disk RAID.

Why isn't research on CPUs moving toward increasing the on-die cache size to the point that it could act like a RAM drive? It seems that fault tolerance and latency worsen as cache size increases; I've read that modern cache designs handle specific tasks and aren't a general memory pool.

Post #49 in the servethehome thread mentions, "BTW, my numbers are in IR mode, raid0 of 8 SSDs per controller, SW striping of the 4 logical drives." Does this mean RAID 0'ing the different controllers can only be done in software, or is there some bottleneck that makes software striping between the cards faster than hardware striping? That part of the thread took place before the move to Adaptec cards, so I'm hoping it's a different story now. Continuing with #85: "Reflecting on the "bottleneck" of the LSI 9207-8i past 5 drives and the inevitable higher CPU load in a pure HBA setting with 32 SSDs passed through the HBA's to establish a software raid-0, I configured the 4 LSI HBAs with 2 raid0 each (4 SSDs per raid0). 2 software raid0 on top of that with each taking one 4-SSD raid per LSI HBA. This arrangement is in my case faster than the "classic" arrangement of connecting all 16 SSDs of 2 LSI adapters into one raid0 (2 x LSI based 8-SSD raid0, with one software raid0 on top)." That sounds like two RAIDs within a software RAID. If performance is better with that kind of pyramid, wouldn't a faster RAID scheme be one that takes two drives, puts them in a RAID, puts that RAID in another RAID (four drives), puts that RAID in an HBA or controller RAID (eight), puts that card in a software RAID with another card (sixteen), and so on with software RAIDs until the motherboard runs out of x8 slots or the CPU runs out of bandwidth? Post #108 mentions SSD saturation of the PCIe 3.0 x8 with the Adaptec 24-port card at three SSDs, but later, post #115 mentions 24 drives on one card, with an unresolved bottleneck due to the drives:
"I would not use the 72405 it in a pure SSD environment (unless you need capacity without expanders):
1) The 24 6gbps ports have too much bandwidth vs. the PCI bus interface"
Why is sending too much bandwidth traffic across the PCI bus interface (I'm assuming you mean the PCIe bus) something to avoid? Wouldn't the drives simply wait until they could send the data?
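A toy model of the pyramid question, with illustrative figures only (600 MB/s per SATA SSD, ~6.5 GB/s usable per x8 gen3 slot; CPU and memory limits ignored):

```python
# Each level of a raid hierarchy can deliver at most the smaller of its own
# link speed and the sum of what sits below it.
def level(link_gbps: float, children: list[float]) -> float:
    return min(link_gbps, sum(children))

X8, SSD = 6.5, 0.6
card_8  = level(X8, [SSD] * 8)                   # 4.8 GB/s -- the drives are the limit
card_24 = level(X8, [SSD] * 24)                  # 6.5 GB/s -- the slot is the limit
stripe  = level(float("inf"), [card_8, card_8])  # OS stripe over two cards: 9.6 GB/s
nested  = level(float("inf"), [stripe])          # extra raid layer on top: still 9.6 GB/s
print(card_8, round(card_24, 1), stripe, nested)
```

Two things fall out of it: adding raid layers never lifts the ceiling set by the controller's slot, and drives behind a saturated slot do just wait, which is exactly the wasted bandwidth being warned about with 24 drives on one card.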

Once the Samsung 840 EVO 1TB drive is released, it will supplant the 840 currently listed in my rolling component build. It should be noted that the current listing is weighted toward the gaming/streaming configuration, and has not yet been forked into two or more spreadsheets, one of which would be the realtime IO SSD build discussed in this thread.

There are a few things mentioned in the thread that seem like physical logistics issues rather than direct performance issues:
You mentioned 5V line power issues, but I don't see why you couldn't use multiple PSUs (there's a way to short the motherboard cable so that it still generates power), unless it goes against the scoring rubric of the folding team.

Cooling for the drives is mentioned at one point in the thread. Overclock dot net has several examples of computer systems entirely submerged in mineral oil circulation systems for cooling. I don't see any reason why the SSDs could not be stacked tightly without a backplane, unless it's for cable management. They don't have traditional HDD moving parts that would be unable to function with liquid in the way.
 
1) OC RAM
The motherboard would theoretically support it. Check out the Hyper-Speed series from Supermicro, where a user (Movieman) at www.extremesystems.org showed a 6% CPU overclock and 1940 MHz RAM speed on a dual Xeon system.

2) TLB
Your interpretation of the TLB is wrong. TLB size is independent of the number of drives in the system. TLB performance is often an issue when a lot of random I/O is happening AND the CPU needs to sweep randomly through RAM. Think of sorting data. (A quick reach calculation follows at the end of this post.)

3) Infiniband:
You need the appropriate OS to leverage the full performance of these cards (with low CPU overhead)

4) 24 drives and Raid controller
No.
As the bottleneck is the same in both controllers (the x8 PCIe bus), there is no performance difference between 16 and 24 drives.

5) DDN
software makes the difference

6) FusionIO vs SSD/HBA
For sequential transfers, the PCIe speed is critical. PCIe v2.0 cards (HBA, raids, PCIe SSDs) have to deliver data through this bottleneck, independent of the internal speed.

7) ehorn /1155 platform
2 RAM disks in a raid? Why the additional overhead? It is nonsense, unless there is a software/driver-related bottleneck per RAM disk.

8) CPU research / on die cache as RAM disk
Because it is cost prohibitive to use the second most expensive storage layer after registers for such a mundane thing as a disk cache. It is probably off by a factor of 1000.

9) STH post #49
Raid 0 over multiple LSI 9207-8i HBA controllers can only be done in the OS
There is no additional value in generating unnecessary raid hierarchies, stick with the minimum for the desired performance level.

10) STH Post #108
The Adaptec driver is unstable with high performing SSDs. Out of curiosity I tried a 24 SSD setup with one raid controller - not recommended. Not even good performance due to raid controller bottlenecks

24 x 600 MB/s = 14,400 MB/s total bandwidth from the SSDs to the raid controller. The x8 PCIe interface can transfer only 6-7 GB/s. That is less than 50% of the SSD-side bandwidth. If the system should maximize transfer speed, 24 SSDs don't make sense on a single raid controller; 12 SSDs per controller is more cost effective for high transfer rates.

11) Samsung EVO drive
I would not expect good raid performance from EVO drives - unless proven otherwise. I have not found many consumer drives with good, predictable, and proactive garbage collection. Given the TLC design of the EVO, I'd be heavily surprised if Samsung managed to solve the holy grail of write performance in a design that is also optimized for super high capacity (at good cost efficiency). Don't take any well-known benchmark values for granted; they are mostly useless in high-performance raid systems.

12) 5 V
Yeah, multiple PSUs can be used, but you need to tinker to get the additional PSU started without it being connected to the motherboard. Can be solved, but not out of the box.

13) Cooling
Feel free to solve the many unresolved challenges of submerged cooling. Alternatively, a single silent 12V fan is sufficient for the SSDs if the airflow is unobstructed.

14) Other
For gaming, this system you are "envisioning/designing" is probably never going to get to the specified performance you are collecting from spec sheets, unless you completely rewrite the games you are interested in. There is a reason why this is the case: parallel programming for computation is hard, and parallel I/O is hard as well. Feel free to solve this "grand challenge" :)

Recommended reading for more realistic assumptions:
Hennessy, Patterson: Computer Architecture, Fifth Edition: A Quantitative Approach
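Back to point 2, a quick illustration of TLB reach; the entry counts below are illustrative assumptions of the right order of magnitude, not a spec quote:

```python
# TLB reach = number of entries x page size; it depends on pages, not on drives.
def tlb_reach_mb(entries: int, page_bytes: int) -> float:
    return entries * page_bytes / 2**20

print(f"{tlb_reach_mb(512, 4096):.0f} MB reach with 512 entries of 4 KiB pages")     # 2 MB
print(f"{tlb_reach_mb(32, 2 * 2**20):.0f} MB reach with 32 entries of 2 MiB pages")  # 64 MB
# Random accesses outside the covered region force page-table walks, which is why
# random sweeps through a large RAM working set hurt regardless of drive count.
```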
 
AndyE said:
.... replace Intel with DDN and add a zero to your budget for a really unique system ....
They power 7 of the top10 High Performance Computers of the Top500 list. [...]
Lustre + ZFS is used in the IBM Sequoia supercomputer. It also has good stats, with 55PB of data storage and 1TB/sec of bandwidth. There is interesting info on this; google it for more. But this is very new: Lustre used ext4 earlier and had to be rewritten for ZFS due to bad performance with ext4. Maybe this is faster than DDN?

But these are clustered solutions. Not a single server.
 
The X9DAX-7/i(T)F line from Supermicro are the boards that support OC'd RAM. The only differences between the four boards are the onboard drive ports and network ports. The X9DAX-7TF had the fullest feature set of the four I looked at, but board features aren't relevant when you plan on providing your own Infiniband and RAID cards (ideally, you'd be able to disable any PCIe lanes routed to unused board ports and assign them to only the slots that are used). Supermicro only tested up to 1959MHz with 8 DIMMs, and 1680MHz with 16 DIMMs (there is no mention of latency configuration).
http://www.supermicro.com/products/nfo/files/Hyper-Speed/Hyper-Speed_Memory_Testing_List.pdf
I was never able to get a straight answer from Asus or their ROG forums regarding the OC abilities of the Asus Z9PE-D8 WS, so that gives Supermicro an advantage of more reliable customer support and tested backing of their speed claims, even if the speeds are slower. If overclocking is permitted on the board, I'm guessing it could go further than Supermicro's testing. I couldn't find the thread with Movieman's test bed.

"TLB performance is often an issue when a lot of random I/O is happening AND the CPU need to sweep randomly through RAM. Think of sorting data." Seems like this would be more for read-intensive applications.

"You need the appropriate OS to leverage the full performance of these cards (with low CPU overhead) [...] software makes the difference." Which OS is fastest? It doesn't look like DDN licenses their software apart from their machines, so it would take a quote to find out how much some OEM part or a barebones system that includes their software would cost.

"For sequential transfers, the PCIe speed is critical. PCIe v2.0 cards (HBA, raids, PCIe SSDs) have to deliver data through this bottleneck, independent of the internal speed." -AndyE
"there is no drive in existence that out performs the read 4k QD1 on fusion-io iodrives.

I've seen quite a few pci-e reviews of the enterprise stuff and have never seen 20k iops on 4k qd1. according to http://www.storagereview.com/ocz_zdr...cie_ssd_review the fusion still wins.

you can't improve the speed of 4k qd1 or 512b qd1 with raid. the file is too small to split into chunks at that point, you can only improve the 4k at higher queue depths with raid. 4k qd1 can never be increased due to raid." -gjs278
I agree with gjs278 regarding the 4k capabilities of PCIe 2.0 cards, but I also agree with AndyE that the PCIe 2.0 bottleneck would hurt performance--maybe if the card were saturated with 4k files or with the less frequent larger files.

"Raid 0 over multiple LSI 9207-8i HBA controllers can only be done in the OS [...] The Adaptec driver is unstable with high performing SSDs. [...] stick with the minimum [raid heirarchies] for the desired performance level." So if OS or software RAIDs are inferior performers with single volume RAIDs, then a single card would be faster than a multiple card RAID (whether with the Adaptec or the LSI card). Does the instability of the Adaptec with high performing drives lead to lower speeds in short-term testing?

"Can be solved, but not out of the box" Paperclip in the 24V motherboard connector.

"For gaming this system you are "envisioning/designing" is probably never going to get to specified performance you are collecting from spec sheets, unless you completely rewrite the games you are interested in." It's true that gaming lags behind software written to specifically take advantage of certain hardware features, which is why this thread primarily emphasizes raw performance under ideal temporary circumstances (similar to LN2 benchmarks--unless users can afford several thousand kiloliters of liquid hydrogen annually and a machine to pump unrecyclable liquid into a vat, users are probably going to stick with a peltier system and a water-based liquid cooling solution long term). Systems that are multithreaded outperform single thread machines due to tech other than the multithreaded processing features. I figure that games that are old enough not to have a "enable multithreaded processing" checkbox are going to be old enough that they wouldn't be able to take advantage of faster single-threaded performance anyway. You're right about games not taking advantage of multiple processors, as I've only seen mention in games of multiple threads or occassionally multiple cores rather than multiple processors. "OK, as for mainboards, most people who are serious and into systems won't be stuffing around with desktop/consumer gear." -Lost-Benji I'll have to take a look at 1P configurations of server boards, but those won't be ideal for this thread's discussion of SSDs. It also doesn't consider that most games are GPU-intensive rather than CPU-intensive, which means that a 2P or 4P configuration that supports full x16 on every slot will outperform a 1P system.

Computer Architecture, Fifth Edition looks like it's worth reading. It was published in 2011, but I'm sure the fundamentals are still relevant to understanding newer technology. It might be easier than trying to cull information from Wikipedia and random forum posts.
 
Andy,

I've been trying to maximize performance in an sTec SSD array of 24 drives. It has three LSI 9207-8e controllers going to a JBOD from DataOn (1640) with dual SAS expanders.

We can hit 2.1 GB/sec to any four drives with each of the three controllers. But, each controller appears limited to 2.6 GB/sec. That's suspiciously close to the "bottleneck of the LSI 9207-8i past 5 drives" that you mentioned in an earlier post and was referenced earlier in this thread.

We should be able to hit 4 GB/sec. Testing with one controller, we have the A port cable assigned to 4 drives and the B port cable assigned to 4 different drives. We're manually assigning drives to controllers and ports in order to avoid any conflicts that would create bottlenecks. We're using two instances of our test software, one on the A side and one on the B side. It's the same software and methodology that gets us to 2.1 GB/s * 3 controllers. When we try to use both ports (8 drives) on a single controller, we bottleneck. When we use 8 or even 12 drives spread over three controllers, we don't.

2.6 GB/s is not at a PCIe2 vs. PCIe3 boundary. The only thing I can point to is the controller, but it seems like other folks have achieved 4 GB/sec with this controller in online benchmarking (see http://www.thessdreview.com/our-rev...8i-pcie-3-0-host-bus-adapter-quick-preview/5/)

Any insights would be greatly appreciated. If you can help solve this, I should be able to get from my current 7.65 GB/s for three controllers up to 12GB/s...which puts me squarely back on topic for fast SSD rates.

RR
 