brutalizer
[H]ard|Gawd
> Linux is used in the majority of the world's supercomputers, which have far more than 16 sockets. Also, since when has Linux been limited to 8 sockets? I think you're thinking of Windows. Replace "Linux" with "Windows" in everything you just said, and you'd be 100% correct. Even the top 10 supercomputers in the world are all running Linux: http://en.wikipedia.org/wiki/TOP500
> Unless I'm wrong, do you have any links to back up what you're saying about Linux, >8 socket systems, and mainframes? Or at least elaborate on why Linux isn't good with >8 socket systems? I'm not sure why the number of sockets would hold Linux back, since it scales across multiple processor cores (and sockets) perfectly fine.

Linux scales quite badly in some circumstances, actually. There are (at least) two different types of scaling:
-Horizontal scaling. Scale out. This is a cluster, and you scale by simply adding a new node to a fast switch. Linux scales excellently on clusters, that is, HPC clusters. For instance, Google's server park is such a cluster, with 900,000 servers. There is no single fat server with 900,000 cpus; all "large servers" with hundreds of cpus are clusters. These clusters run HPC workloads, that is, "embarrassingly parallel" workloads such as number-crunching scientific computations: CFD, solving PDEs, etc. Another example would be compressing a movie to MPEG: each thread can compress a few pixels, and if you have 1,000 cpus, each cpu compresses its own small share. These problems are easy to scale; the more cpus you add, the faster it gets (see the parallel-loop sketch after these two points). This domain is where Linux's strength is. Just look at Google or any other big company: they typically have large clusters of Linux PCs.

For instance, World of Warcraft runs a separate world on each individual server. WoW is easy to parallelize. It would be much, much harder to create one single huge world instead of many separate worlds: you would need either a single huge server (see below) or a very tight cluster, because the latency between the nodes would otherwise be too bad. Most probably this would not work, and this is why WoW has many separate worlds and not a single huge world.

These HPC servers are relatively cheap; it is just a bunch of PCs on a fast switch, built from standard components. A Beowulf Linux cluster is an example of a cheap Linux "server" (which is a cluster). You can even do it yourself: just get a few PCs and a fast switch, install Beowulf on them, and voila, you have built yourself a huge Linux server (which is actually a cluster). Sure, Beowulf can only do one thing, HPC number crunching of easily parallelized problems, but it is a huge Linux server. Or, as I prefer to say: a cluster. And a 32 socket Beowulf cluster costs as much as 32 PCs and a switch, which is very cheap.
-Vertical scaling. Scale up. This is a single fat server running problems that are not easy to parallelize. For instance, if a chef makes a sauce, he cannot finish it until the steak is done; he must wait, and you cannot add more chefs to complete the steak faster (see the speedup sketch below). These servers are not HPC; instead they are called SMP servers. Such an SMP server is typically a single large server in a cabinet weighing 1,000 kg or so, and they typically have 16 or 32 cpus. Some SMP servers even have 64 cpus, and they are all Unix servers or Mainframes. Mainframes are SMP servers; the largest IBM Mainframe today has 24 cpus. These SMP servers cost many millions because they are really hard to manufacture; you cannot just add another PC. For instance, the largest IBM P595 Unix server used for the old TPC-C record had 32 POWER6 cpus, and it cost $35 million, list price.
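To make the "each cpu compresses its own pixels" idea concrete, here is a minimal sketch of an embarrassingly parallel loop using OpenMP. The frame size and the darken() stand-in for real compression work are my own, purely for illustration:

```c
/* Embarrassingly parallel: every pixel is processed independently, so
 * adding cores speeds the loop up almost linearly.
 * Build: gcc -fopenmp -O2 ep.c && ./a.out                             */
#include <stdio.h>
#include <stdlib.h>

#define NPIXELS (1920L * 1080L)

/* Stand-in for real per-pixel compression math. */
static unsigned char darken(unsigned char p) { return p / 2; }

int main(void)
{
    unsigned char *frame = malloc(NPIXELS);
    if (!frame) return 1;
    for (long i = 0; i < NPIXELS; i++)
        frame[i] = (unsigned char)(i % 256);

    /* No pixel depends on any other pixel, so OpenMP can hand each
     * thread its own chunk with zero communication between threads. */
    #pragma omp parallel for
    for (long i = 0; i < NPIXELS; i++)
        frame[i] = darken(frame[i]);

    printf("first pixel after processing: %d\n", frame[0]);
    free(frame);
    return 0;
}
```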
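The chef-and-sauce point is exactly what Amdahl's law describes: the serial fraction of a job caps the total speedup no matter how many cpus you throw at it. A tiny sketch, with a made-up 5% serial fraction:

```c
/* Amdahl's law: speedup(n) = 1 / ((1 - p) + p/n), where p is the
 * fraction of the work that can run in parallel. With p = 0.95 even
 * 1024 cpus give under 20x. Build: gcc amdahl.c && ./a.out           */
#include <stdio.h>

static double speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.95;                  /* 5% of the job is serial (the sauce) */
    int cpus[] = { 2, 8, 32, 1024 };
    for (int i = 0; i < 4; i++)
        printf("%4d cpus -> %5.1fx speedup\n", cpus[i], speedup(p, cpus[i]));
    return 0;
}
```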
You cannot just take a few PCs and wire them together; that does not make an SMP server. No, it must be built from the ground up, and they are all very, very expensive. Typical contemporary SMP server examples are IBM AIX P795 (32 sockets), Oracle Solaris M6 (32 sockets), Fujitsu Solaris M10-4S (64 sockets), HP HP-UX Integrity/Superdome (64 sockets), and the IBM z/OS Mainframe (24 sockets). Building a huge 32 socket SMP server is not easy, and no Linux vendor has ever done it. Sometimes they have tried to compile Linux for a Unix server, with horrible results. HP tried with their "Big Tux" (google it) and ran Linux on top of a Superdome server, with ~40% cpu utilization; that means every other cpu idled under full load. This is quite bad.

Scaling well on an SMP server is hard, and Linux cannot do it. Why? Because no huge SMP Linux servers exist. So how can Linux developers optimize the Linux kernel for 32 sockets? They cannot.

These SMP servers are typically used by Enterprise companies, large investment banks, etc. A typical SMP workload would be running a huge database on many cpus; this is what Enterprise does all the time. HPC workloads are typically number-crunching scientific computing. There are distributed databases that run on clusters, but that is not the same thing as a monolithic database running on a single SMP server.
The main problem is that latency in a cluster can be bad, so when you program for a cluster you need to design your software for the cluster: allocate memory on nearby nodes, etc. Typically you use MPI to pass information between nodes (a minimal MPI sketch follows below). You cannot just copy binaries from a small standard server to a large cluster; you need to reprogram the binary to use MPI etc.
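For example, here is a minimal sketch of the cluster style of programming with MPI: each node computes a partial result on its own slice of the data, and the partial results are explicitly passed over the interconnect. The MPI calls are standard; the workload itself is made up:

```c
/* Cluster-style programming: explicit message passing with MPI.
 * Each rank sums its own disjoint slice, then MPI_Reduce combines
 * the partial sums. Build: mpicc sum.c && mpirun -np 4 ./a.out      */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank sums its own disjoint slice of 1..1000000. */
    long long total = 1000000, local = 0;
    for (long long i = rank + 1; i <= total; i += nprocs)
        local += i;

    /* The partial results travel over the network to rank 0. */
    long long global = 0;
    MPI_Reduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %lld\n", global);

    MPI_Finalize();
    return 0;
}
```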
On an SMP server, the latency is always low, so you don't need to care about any of this when you write your software. You can copy binaries from a small server to a large SMP server without any problems; the SMP server acts like a small server, just a bit more powerful.
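For contrast, the same sum on a shared-memory SMP machine needs no message passing at all; the threads simply write their results into one shared address space. Again just an illustrative sketch, using plain pthreads:

```c
/* SMP-style programming: no MPI, no explicit communication - all
 * threads share one address space and just write their results.
 * Build: gcc -pthread smp_sum.c && ./a.out                          */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define TOTAL    1000000LL

static long long partial[NTHREADS];   /* shared memory, visible to all */

static void *worker(void *arg)
{
    long id = (long)arg;
    long long sum = 0;
    for (long long i = id + 1; i <= TOTAL; i += NTHREADS)
        sum += i;
    partial[id] = sum;                 /* just store it - no messages */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    long long global = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        global += partial[i];
    }
    printf("sum = %lld\n", global);
    return 0;
}
```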
Regarding Linux: there are no big SMP Linux servers, and there never have been any for sale. Sure, you can buy a hugely expensive IBM P795 Unix server and compile Linux for it, but it is still a Unix server. There are no pure Linux SMP servers. This is the reason Linux does not make it into the large Enterprise server sector, which is the most profitable sector, where everyone wants to be. If Linux scaled well on large SMP servers, then Unix and Mainframes would die out. Until someone sells a cheap Linux SMP server, for only a few million USD, Linux will never venture into the high end Enterprise sector.

PS. I suspect there are no true SMP servers on the market today. A true SMP server has the same latency between all cpus: no matter which cpu you access, it is equally fast.

Today, all huge SMP servers are really a bit of a NUMA server (which is a kind of cluster):
http://en.wikipedia.org/wiki/Non-uniform_memory_access#NUMA_vs._cluster_computing
but they are so well designed that latency is very low no matter which cpu you access. This means you don't need to use MPI; you just program as normal. But on an HPC cluster, the latency to far-away cpus can be extremely bad, so you need to handle that worst-case latency, which might otherwise grind the system to a halt; this never happens on an SMP server.

So a cluster might connect its nodes in a very simple way, maybe a single line via a switch, and if you need to access some node, you may have to go through many other nodes first. That means quite bad latency in the worst case.

In contrast, an SMP server connects every cpu to every other cpu. You don't need to go through another cpu, so it will be fast. In reality there will be some hops, but a well designed SMP server minimizes the hops needed to reach another cpu. Look at the very bottom of this article on the new SMP SPARC M6 server, and see how each cpu is connected to the others in a very intricate way:
http://www.theregister.co.uk/2013/08/28/oracle_sparc_m6_bixby_interconnect/
There are only 2-3 hops at most to reach any other cpu, so latency is very low - almost like a true SMP server.
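On such NUMA-ish SMP servers, software can also help by keeping memory close to the cpus that use it. Here is a small sketch using Linux's libnuma; the library and its calls are real, while the node number and buffer size are just illustrative:

```c
/* NUMA-aware allocation: place a buffer on a specific NUMA node so
 * the cpus on that node get local (low) latency, while cpus on other
 * nodes pay the interconnect hops.
 * Build: gcc numa_demo.c -lnuma && ./a.out                           */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t size = 64UL * 1024 * 1024;   /* 64 MB, illustrative */

    /* Pin the buffer's pages to NUMA node 0. */
    void *buf = numa_alloc_onnode(size, 0);
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    printf("NUMA nodes available: 0..%d\n", numa_max_node());
    numa_free(buf, size);
    return 0;
}
```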
Regarding the huge ScaleMP Linux server with 1000s of cores and gobs of terabytes of RAM: yes, it is a cluster that is tricked into believing it is one single huge fat SMP server running a single-image Linux kernel. It cannot run SMP workloads, only the easier HPC number crunching:
http://www.theregister.co.uk/2011/09/20/scalemp_supports_amd_opterons/
Since its founding in 2003, ScaleMP has tried a different approach. Instead of using special ASICs and interconnection protocols to lash together multiple server nodes together into a SMP shared memory system, ScaleMP cooked up a special software hypervisor layer, called vSMP, that rides atop the x64 processors, memory controllers, and I/O controllers in multiple server nodes....vSMP takes multiple physical servers and – using InfiniBand as a backplane interconnect – makes them look like a giant virtual SMP server with a shared memory space. vSMP has its limits.
...
The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit
And regarding the SGI Altix and UV1000 Linux servers with 1000s of cores and gobs of RAM: they are also HPC number-crunching servers. They are not used for SMP workloads, because they do not scale well enough to handle such difficult workloads. SGI says explicitly that their Linux servers are for HPC only, and not for SMP:
http://www.realworldtech.com/sgi-interview/6/
The success of Altix systems in the high performance computing market is a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,
...
However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time
Ergo, there are no big Linux servers doing SMP workloads on the market, and there never have been. All large Linux servers are actually clusters running HPC number-crunching workloads. If these Linux servers were SMP servers, they would immediately venture into the high-profit, very expensive SMP market - but they cannot. They stay in the low-profit small-server market, with at most 8 sockets (just a normal standard IBM or HP x86 server).

So, now I have answered your question about whether Linux scales well or not. Linux scales excellently on clusters. But it scales very badly on a single fat huge server - because such servers do not exist, so the kernel developers cannot optimize Linux for them.

OTOH, Unix servers have had 32 sockets for decades, and Unix scales well on such servers. Oracle is creating a 96 socket SMP server with Bixby. It will be brutal, with 96TB of RAM, running databases extremely quickly from RAM.

EDIT: If someone knows of a Linux SMP server with 16 sockets, or even 32 sockets, I invite them to post links here. I have never seen such a large server. Sure, there are Linux servers with 100s of cpus, but they are clusters; see the links above.