New IBM POWER8 CPU

Discussion in 'All non-AMD/Intel CPUs' started by Red Falcon, Sep 9, 2013.

Thread Status:
Not open for further replies.
  1. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    Linux scales quite badly in some circumstances, actually. There are (at least) two different types of scaling:

    -Horizontal scaling. Scale out. This is a cluster, and you scale by simply adding a new node to a fast switch. Linux scales excellently on clusters, that is, HPC clusters. For instance, Google's server park is such a cluster with 900,000 servers. There is no single fat server with 900,000 cpus; all large servers with 100s of cpus are clusters. These clusters run HPC workloads, that is, "embarrassingly parallel workloads", such as number-crunching scientific computations like CFD or solving PDEs. Another example would be compressing a movie to MPEG - each thread can compress a few pixels. If you have 1000 cpus, each cpu can compress 10 pixels. These problems are easy to scale: the more cpus you add, the faster it gets. And this domain is where Linux is strongest; just look at Google or any other big company, they typically have large clusters of Linux PCs.

    For instance, World of Warcraft runs a separate world on each individual server. WoW is easy to parallelize. It would be much, much harder to create one single huge world instead of many separate worlds - you would need either a single huge server (see below) or a very tightly coupled cluster, because the latency between the nodes would otherwise be too bad. Most probably this would not work, and this is why WoW has many separate worlds and not a single huge world.

    These HPC servers are relatively cheap; it is just a bunch of PCs on a fast switch. You take standard components. A Beowulf Linux cluster is an example of a cheap Linux server (which is a cluster). You can even do it yourself: just get a few PCs and a fast switch and install Beowulf on them, and voila, you have built yourself a huge Linux server (which is actually a cluster). Sure, Beowulf can only do one thing: HPC number crunching of easily parallelized problems. But it is a huge Linux server. Or, as I prefer to say: cluster. And a 32-socket Beowulf cluster costs as much as 32 PCs and a switch, which is very cheap.
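
    A minimal sketch of what "embarrassingly parallel" means for the pixel example, assuming a toy compress_chunk function (hypothetical, it just halves each pixel value) and Python threads standing in for cluster nodes. The point is only the structure: each chunk is processed without looking at any other chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def compress_chunk(pixels):
    """Toy stand-in for real compression: halve each pixel value."""
    return [p // 2 for p in pixels]


def split(pixels, n_workers):
    """Cut the frame into at most n_workers contiguous, independent chunks."""
    size = max(1, -(-len(pixels) // n_workers))  # ceiling division, at least 1
    return [pixels[i:i + size] for i in range(0, len(pixels), size)]


def compress_parallel(pixels, n_workers=4):
    # No chunk needs data from any other chunk, so the workers never have
    # to talk to each other. That independence is what makes scale-out easy.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(compress_chunk, split(pixels, n_workers)))
    return [p for part in parts for p in part]


frame = list(range(256)) * 4
same_as_serial = compress_parallel(frame) == compress_chunk(frame)
```

    Adding more workers only changes how the frame is cut up, not the result, which is why this kind of job can be spread over as many nodes as you have.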

    -Vertical scaling. Scale up. This is a single fat server running problems that are not easy to parallelize. For instance, if a chef makes a sauce, he can not do it until the steak is done; he must wait. You can not add more chefs and complete the steak faster. These servers are not HPC; instead they are called SMP servers. Such an SMP server is typically a large single server in a cabinet weighing 1000 kg or so, and it typically has 16 or 32 cpus. Some SMP servers even have 64 cpus - and they are all Unix or Mainframes. Mainframes are SMP servers; the largest IBM Mainframe today has 24 cpus. These SMP servers cost many millions as they are really hard to manufacture; you can not just add a new PC. For instance, the largest IBM P595 Unix server, used for the old TPC-C record, had 32 POWER6 cpus and cost $35 million, list price.
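
    The chef example can be put into numbers with Amdahl's law, which is not mentioned above but is the standard way to express it: if only a fraction of the work can run in parallel, the serial rest limits the speed-up no matter how many cpus you add. A small sketch, with the fractions below chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Speed-up over one cpu when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# Illustrative fractions only, not measurements of any real workload.
fully_parallel = amdahl_speedup(1.0, 32)   # perfectly parallel job
half_serial = amdahl_speedup(0.5, 32)      # half the work must wait, like the sauce
all_serial = amdahl_speedup(0.0, 64)       # nothing can overlap
```

    With half the work serial, 32 cpus give less than a 2x speed-up, which is the "you can not add more chefs" situation.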

    You can not just take a few PCs and add them together; that does not make them an SMP server. No, it must be built from the ground up, and they are all very, very expensive. Typical contemporary SMP server examples are IBM AIX P795 (32 sockets), Oracle Solaris M6 (32 sockets), Fujitsu Solaris M10-4S (64 sockets), HP HP-UX Integrity/Superdome (64 sockets), IBM z/OS Mainframe (24 sockets). Building a huge 32-cpu SMP server is not easy, and no Linux vendor has ever done this. Sometimes they have tried to compile Linux onto a Unix server, with horrible results. HP tried with their "Big Tux" (google it) and ran Linux on top of a Superdome server, with ~40% cpu utilization, which means roughly every other cpu idled under full load. This is quite bad.

    To scale well on an SMP server is hard, and Linux can not do that. Why? Because no huge SMP Linux servers exist. So how can Linux developers optimize the Linux kernel for 32 sockets? They can not.

    These SMP servers are typically used in Enterprise companies, large investment banks, etc. A typical SMP workload would be to run a huge database on many cpus - this is what Enterprise does all the time. HPC workloads are typically number-crunching scientific computing. There are distributed databases that run on clusters, but that is not the same thing as a monolithic database running on a single SMP server.

    The main problem is that latency in a cluster can be bad, so when you program for a cluster you need to design your software for the cluster, allocate memory on nearby nodes, etc. Typically you use MPI to pass information between nodes. You can not copy binaries running on a small standard server to a large cluster - you need to rewrite the program to use MPI etc.

    On an SMP server, the latency is always low, so you don't need to care when you write your software. You can copy binaries from a small server to a large SMP server without any problems - the SMP server acts as a small server, just a bit more powerful.
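
    A rough sketch of the two programming styles, using only Python threads and queues as stand-ins (this is not MPI, and all function names are made up for illustration). In the shared-memory version every worker reads the same list directly, as on an SMP server. In the message-passing version each worker only sees what is explicitly sent to it, as on a cluster.

```python
import queue
import threading


def shared_memory_sum(values, n_workers=4):
    """SMP style: all workers read the same list in place."""
    partial = [0] * n_workers

    def work(k):
        partial[k] = sum(values[k::n_workers])

    threads = [threading.Thread(target=work, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(values, n_workers=4):
    """Cluster style: data must be sent to each worker and results sent back."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()

    def work(inbox):
        chunk = inbox.get()      # explicit receive
        results.put(sum(chunk))  # explicit send

    threads = [threading.Thread(target=work, args=(q,)) for q in inboxes]
    for t in threads:
        t.start()
    for k, q in enumerate(inboxes):
        q.put(values[k::n_workers])  # explicit copy of this worker's share
    for t in threads:
        t.join()
    return sum(results.get() for _ in range(n_workers))


data = list(range(1000))
smp_total = shared_memory_sum(data)
cluster_total = message_passing_sum(data)
```

    Both give the same answer, but the second version had to be restructured around sends and receives, which is the extra work a cluster demands of the programmer.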

    Regarding Linux, there are no big Linux SMP servers for sale, and there never have been. Sure, you can buy an insanely expensive IBM P795 Unix server and compile Linux for it - but it is still a Unix server. There are no pure Linux servers. This is the reason Linux does not make it into the large Enterprise server sector - which is the most profitable sector, where everyone wants to be. If Linux scaled well on large SMP servers, then Unix and Mainframes would die out. Until someone sells a cheap Linux SMP server, for only a few million USD, Linux will never venture into the high-end Enterprise sector.

    PS. There are no true SMP servers on the market today, I suspect. A true SMP server has the same latency between all cpus. No matter which cpu you access, it goes equally fast.

    Today all huge SMP servers are to some degree NUMA servers (which are a kind of cluster):
    http://en.wikipedia.org/wiki/Non-uniform_memory_access#NUMA_vs._cluster_computing
    but they are so well designed that latency is very low, no matter which cpu you access. This means you don't need to use MPI; just program as normal. But on an HPC cluster, the latency to far-away cpus can be extremely bad, so you need to handle extremely bad latency, which might grind the system to a halt - which never occurs on an SMP server.

    So a cluster might connect the nodes in a very simple way, maybe a single line via a switch, and if you need to access some node, you go through many nodes - or something like that. Quite bad latency in the worst case.

    In contrast, an SMP server connects every cpu to every other cpu. You don't need to go through another cpu, so it will be fast. In reality there will be some hops, but a well-designed SMP server minimizes the hops needed to reach another cpu. Look at the very bottom of this article on the new SMP SPARC M5 server, and see how the cpus are connected to each other in a very intricate way:
    http://www.theregister.co.uk/2013/08/28/oracle_sparc_m6_bixby_interconnect/
    There are only 2-3 hops at most to reach any other cpu, so latency is very low. Almost as a true SMP server.
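
    To illustrate why the wiring matters, here is a small sketch that counts the worst-case number of hops between any two cpus for three made-up 32-node topologies (a chain, a ring, and a fully connected mesh). These are hypothetical layouts for illustration, not the actual interconnect from the article.

```python
from collections import deque


def max_hops(links):
    """Worst-case hop count between any two nodes, via breadth-first search."""
    worst = 0
    for start in links:
        dist = {start: 0}
        todo = deque([start])
        while todo:
            node = todo.popleft()
            for nxt in links[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    todo.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


def chain(n):
    """Nodes in a line: each node only talks to its direct neighbours."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def ring(n):
    """Like a chain, but the two ends are joined."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def fully_connected(n):
    """Every node has a direct link to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


hops_chain = max_hops(chain(32))
hops_ring = max_hops(ring(32))
hops_mesh = max_hops(fully_connected(32))
```

    The fully connected layout needs a single hop but also 31 links per cpu, which is why real designs settle for a few hops instead.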


    Regarding the huge ScaleMP Linux server with 1000s of cores and gobs of terabytes of RAM: yes, it is a cluster that is tricked into believing it is a single huge fat SMP server running a single-image Linux kernel. It can not run SMP workloads, only the easier HPC number crunching:
    http://www.theregister.co.uk/2011/09/20/scalemp_supports_amd_opterons/

    And regarding the SGI Altix and UV1000 Linux servers with 1000s of cores and gobs of RAM, they are also HPC number-crunching servers - they are not used for SMP workloads, because they do not scale well enough to handle such difficult workloads. SGI says explicitly that their Linux servers are for HPC only, and not for SMP.
    http://www.realworldtech.com/sgi-interview/6/

    Ergo, there are no big Linux servers doing SMP workloads on the market, and there never have been. All large Linux servers are actually clusters running HPC number-crunching workloads. If these Linux servers were SMP servers, they would immediately venture into the high-profit, very expensive SMP market - but they can not. They stay in the low-profit small-server market, with max 8 sockets (just a normal standard IBM or HP x86 server).

    So, now I have answered your question whether Linux scales well or not. Linux scales excellently on clusters. But it scales very badly on a single fat huge server - because such servers do not exist for Linux, the kernel developers can not optimize Linux for them.

    OTOH, Unix servers have had 32 sockets for decades and scale well on such servers. Oracle is creating a 96-socket SMP server with Bixby. It will be brutal, with 96TB RAM running databases extremely quickly from RAM.



    EDIT: If someone knows of a Linux SMP server with 16 sockets, or even 32 sockets, I invite them to post links here. I have never seen such a large server. Sure there are Linux servers with 100s of cpus, but they are clusters, see the links above.
     
    Last edited: Nov 19, 2013
  2. Activate: AMD

    Activate: AMD [H]ard|Gawd

    Messages:
    1,984
    Joined:
    Nov 6, 2004
    ^^ an interesting read, but you wouldn't happen to go by the handle kebabbert elsewhere would you?
     
  3. Red Falcon

    Red Falcon [H]ardForum Junkie

    Messages:
    9,760
    Joined:
    May 7, 2007
    brutalizer knows his stuff!

    In fact, I just built a small 4-node cluster out of Raspberry Pi systems for the Hadoop File System (HDFS).
    So that would be horizontal scaling then, interesting.

    Vertical scaling would be more CPUs within one system, essentially mainframes, nice.
    Thank you brutalizer for sharing all of this information, you've made my night! :D
     
    Last edited: Nov 19, 2013
  4. niomosy

    niomosy Limp Gawd

    Messages:
    247
    Joined:
    Nov 21, 2005
    I think the largest Linux servers I'm aware of are 8 sockets. Sun currently has the Sun X2-8, HP has the DL980, and Unisys has the ES7000.

    The cost on these types of servers ends up high, hence the dwindling interest by corporations to buy them. Even we've moved from buying p595 servers to smaller P7s like 750s and 740s as we just didn't have any heavy workloads that needed a high enough CPU count to spend the money on larger servers. We usually just splice up workloads into smaller ones and carve out more small hosts rather than fewer large hosts.
     
  5. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    Yes, these are all 8-socket x86 servers using the Intel Xeon E7 cpu, probably. I actually know of a 16-cpu Linux server; it was launched just recently, in April 2013:
    http://www.bulldirect.eu/2013/04/23...modernization-of-critical-it-infrastructures/

    So, it seems the first 16-socket Linux SMP server was launched this year. But it is not that efficient. It connects groups of four cpus to each other. Not every cpu is connected to every other cpu; instead, there are four groups of cpus connected to each other. This introduces high latency. So, they have run an easily parallelizable HPC benchmark, SPECint, and got a score of 4,110, which is quite bad for 16 sockets:
    http://www.v3.co.uk/v3-uk/news/2207260/bull-claims-x86-enterprise-server-performance-crown

    If we look at an 8-socket POWER7 server, we have 2,770 SPECint. And if we look at an 8-socket SPARC T5 server, which has the world's fastest cpu, it is almost as fast as this 16-socket Linux server, achieving 3,750 SPECint.
    https://blogs.oracle.com/BestPerf/entry/20130326_sparc_t5_speccpu2006_rate

    So, no, the Linux server is not that efficient. It remains to be seen how well the first 16-socket Linux SMP server fares on SMP benchmarks. I suspect SMP performance will be much worse than on easy HPC benchmarks, because 8-socket Linux scales quite badly on SMP workloads, with low cpu utilization. Just look at how groups of cpus are connected to each other; this is a less optimal design. Compare this design with the Oracle Solaris M6 server with 32 sockets; see the topology map in my previous post.
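
    Dividing the scores quoted above by the socket count gives a rough per-socket comparison. This is only a back-of-envelope sketch: it ignores core counts, clock speeds and everything else that differs between the machines.

```python
# (SPECint score, sockets) as quoted in this post
scores = {
    "16-socket x86 (Bull)": (4110, 16),
    "8-socket POWER7": (2770, 8),
    "8-socket SPARC T5": (3750, 8),
}

# Score per socket for each machine
per_socket = {name: score / sockets for name, (score, sockets) in scores.items()}

for name, value in per_socket.items():
    print(f"{name}: {value:.1f} per socket")
```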

    Here is an interesting read from the Linux camp on Linux "superior" scalability. Do you spot the error he is making when he bashes the ZFS main architect Jeff Bonwick?
    http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/04/10


    Yes, these huge SMP servers are very expensive, and if you don't need the power, it is better to buy cheaper, smaller 4- or 8-socket servers. Today, a standard commodity 8-socket x86 server gives decent performance and you get quite far with one x86 server. You seldom need something more powerful than 8 sockets today; 32- or 64-socket servers are not often needed.
     
    Last edited: Nov 22, 2013
  6. Red Falcon

    Red Falcon [H]ardForum Junkie

    Messages:
    9,760
    Joined:
    May 7, 2007
    brutalizer, you should have the title "[H] Cluster Bomb" or "No RISC, no reward". :cool:
    This thread just gets better and better!
     
  7. AndyE

    AndyE Limp Gawd

    Messages:
    277
    Joined:
    May 13, 2013
    @brutalizer,
    you asked me to join this thread. Happy to do so.

    Given the extensive posts in the past 3 pages, I will not try to address all aspects in one post. Let's have a debate in pieces.

    #1: What is the performance objective to consider?
    Total system performance or CPU subsystem performance? These are different beasts.
    My initial reading would rather suggest CPU subsystem. Which is interesting in itself, but doesn't reflect total system performance or value for money considerations.

    #2: What is the relevant workload?
    The main reason we still have multiple CPU architectures and system architectures is the different needs and requirements of workloads. A CPU architecture optimized for one workload might excel there but completely drown with other types of apps. How to measure the versatility of an architecture?

    #3: The relevancy of benchmarks:
    Beware of benchmarks unrelated to your workload. They are not only irrelevant, they can strongly suggest wrong system choices and configurations if not considered appropriately. Be especially wary if one benchmark result is completely out of bounds; it needs much deeper analysis before being accepted as relevant (your Siebel example).

    #4: What constitutes a balanced system?
    You mentioned quite often archetypes of system design choices (SMP, NUMA, cache sizes, memory bandwidth). BTW, I didn't see much about latency optimization :) (Old saying: You can buy bandwidth, but you have to design for latency.) I'll go deeper into this in other posts.

    #5: Scale-up vs. scale-out
    You rightfully pointed out some of the differences. There is a much deeper fundamental shift going on.

    #6: Explicit parallelism
    Entering developer territory here. For the last 30 to 40 years, developers mostly got a free ride on performance improvements. Most of the speed-up was delivered by a very specialized group of engineers in CPU architecture - optimizing frequency (CPU, memory), hiding latency (cache hierarchy, branch prediction, speculative execution) and exploiting ILP (instruction-level parallelism, the ability to launch multiple instructions, dynamically or statically, in the same CPU cycle). These days are over. We (the CPU designers) hit the energy wall.
    New approaches to speed-up require explicit parallelism (at the application SW layer) to leverage increased capabilities in new systems, be it NUMA, multicore, SMT, etc.
    So the burden of delivering more speed is now not only in the hands of a few thousand high-profile HW engineers, but in the hands of hundreds of thousands of application software developers, probably millions.

    rgds,
    Andy
     
  8. niomosy

    niomosy Limp Gawd

    Messages:
    247
    Joined:
    Nov 21, 2005
    The sad part about increased speed being partly in the hands of software engineers is the sheer number of developers content to write poor code. These were the developers content to fix their performance problems with more hardware.

    In at least some cases, this still works so there's little need for them to address the problem via software. Honestly, I think they're afraid to touch the code at that level a lot of times.
     
  9. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    How do you support this assertion? Do you have statistics ("numbers")?
    You're overlooking the OS and tools developers.
     
    Last edited: Nov 27, 2013
  10. AndyE

    AndyE Limp Gawd

    Messages:
    277
    Joined:
    May 13, 2013
    I was using my words carefully.

    You are right that OS and dev/tool developers play an important role in facilitating this new world. At the scale of parallelism discussed in this thread, the work this group of professionals is doing is imho a prerequisite for the apps developers to create faster and more reliable high-performing applications - not the solution in itself.

    There are edge cases where a tool framework or OS can magically leverage the existing hardware to the maximum extent, but in general, there is still a long way to go to decompose the original problem domain fully automatically and at maximum efficiency on the available hardware architecture and resources.

    And don't get me started on the famous "speed-up claims". ;)

    Currently, most large supercomputer apps are written for portability (to preserve the huge investment and to maintain the flexibility to leverage the latest and greatest new systems). The next step towards a simpler life for apps developers is to address the issues raised in performance portability (example: paper on performance portability in OpenCL), and this is "only" addressing the problem in one environment, less so in multiple.

    regards,
    Andy
     
  11. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    For non-trivial tasks, nothing (and nobody) can reach the maximum performance. "Optimal" is really a theoretical limit that's unattainable. Improving parallelism can be automated with tools, and that's helpful. It's also helpful to have newer tools that make declarative parallelism easier to implement, test, and debug. I figure it's that work you're calling prerequisite and I agree; but we can never expect an optimal and general-purpose solution from tools or systems. Only enabling features.
     
  12. niomosy

    niomosy Limp Gawd

    Messages:
    247
    Joined:
    Nov 21, 2005
    I'm not referring to the people writing the operating systems or shrink-wrapped software but the developers writing apps in languages like Java and writing poor SQL code that destroys a database server because they don't know how to write a proper SQL query to get only the results that they want. I've listened to DBAs on the phone with developers telling them they're going to need to change their queries as they kill the query that's bogging the system.

    I see it regularly with many Java developers and application development managers: solve the performance problems by throwing more hardware at the situation rather than looking at the code that was written. In many cases, the people that originally developed that code are long gone and no one wants to do more than patch things here and there or add new code in. Re-writes end up continually pushed out because the existing code works and they've got new code on some other piece to crank out in a hurry. And if it's not developers writing poor code, it's managers letting poor code stay in place. In the end, throwing hardware at the problem ends up easier for many companies. The code may well be a Pandora's box whereas hardware is a known up-front cost.
     
  13. AndyE

    AndyE Limp Gawd

    Messages:
    277
    Joined:
    May 13, 2013
    There is some truth in this statement, but it is a broader (and more rational) issue.

    Most programmers these days were first trained in sequential programming. With all the bells and whistles, starting from "do" and "while" loops, throwing exceptions, etc.

    These developers built up a tremendous inertia (aka experience) in this sequential execution thinking. It was supported by the great progress in the tool and OS areas, where many of the invisible parallelizations happened under the hood. To name a few: instruction reordering, branch prediction, overlap of memory access and computation, multiple compute pipes (e.g. MMX), and a whole bunch of compiler optimizations that do not impact the accuracy of the original application as written by the apps developer.

    The transition to constant parallel excellence is a hard one. First, not all problems are easy to parallelize (How would you use 100 cores efficiently with vi or TeX?). Second, the key reason for using parallel approaches is simple - increased speed.


    To get back closer to the original topic:
    SMP machines are great at hard-to-decompose applications, or applications whose components interact frequently and are latency dependent. SMP machines quickly reach their limits (well below maximum performance claims) when the computational density is low and lots of data needs to be transferred from main memory to be processed.


    A hypothetical example:
    A CPU socket has 16 cores
    Each core has the ability to process 4 threads
    Each thread can do 2 multiplications per cycle
    The clock frequency is 5 GHz

    The memory interface is 8x 64 bit channels, 2GHz / channel

    The theoretical compute performance is:
    16 x 4 x 2 x 5 = 640 GFlop/s per socket

    The maximum memory bandwidth is:
    8x8x2 = 128 GB/s (usually 80% is the real max perf)


    Take a simple task:
    Create 3x arrays in main memory with 128 GB each (we use in this example 8 Byte DP elements)
    Run the indices from 0 to 16 billion
    Add the respective elements from array 0 and 1 and store the result in array 2

    For one addition you need 24 bytes of data transfer (two 8-byte loads and one 8-byte store). With the unusually high memory bandwidth of our hypothetical processor (128 GB/s), we are able to load the cores of this socket to 128/24 = 5.333 GFlop/s.
    That is 5.3 GFlop/s out of a maximum of 640 GFlop/s, less than 1% utilization.
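
    The same arithmetic as a short script, using only the numbers of the hypothetical socket above (variable names are mine). It takes the attainable rate as the smaller of the compute peak and what the memory interface can feed.

```python
# Hypothetical socket from the example above
cores, threads_per_core, flops_per_cycle, clock_ghz = 16, 4, 2, 5
peak_gflops = cores * threads_per_core * flops_per_cycle * clock_ghz

# 8 channels, 64 bit (8 bytes) wide, 2 GHz each
channels, bytes_per_transfer, channel_ghz = 8, 8, 2
bandwidth_gb_s = channels * bytes_per_transfer * channel_ghz

# c[i] = a[i] + b[i] with 8-byte elements: two loads and one store per operation
bytes_per_op = 3 * 8

# The memory interface can feed at most this many operations per second
memory_bound_gflops = bandwidth_gb_s / bytes_per_op
attainable_gflops = min(peak_gflops, memory_bound_gflops)
utilization = attainable_gflops / peak_gflops

print(f"peak {peak_gflops} GFlop/s, attainable {attainable_gflops:.3f} GFlop/s, "
      f"utilization {utilization:.2%}")
```

    Raising the number of operations done per byte loaded (blocking, reuse from cache) is the only way to move the attainable number closer to the peak.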


    The limitation in memory bandwidth is one of the key reasons why many HPC machines are used without SMT (or Intel's term, HT). If the CPU has SMT capability, it is turned off for maximum performance. The bottleneck is memory, not CPU. Unless you start to code differently: in the highly optimized Linpack benchmark, for example, the simple matrix algorithm is heavily modified (with blocking, cache-line optimization, etc.).

    Another example: Mergesort
    Neither the CPU nor the memory interface will most likely be maxed out. The usual suspect limiting performance is the TLB (Translation Lookaside Buffer) in the virtual memory subsystem of the CPU, which is needed to efficiently translate virtual addresses (as seen by the sort program) into the physical addresses the memory subsystem needs to work properly. If the requested virtual address is in the TLB, the physical address can be issued almost immediately. If not, a TLB entry has to be purged, new translation entries need to be brought in from memory, etc., potentially costing hundreds of CPU cycles per virtual address miss. If this is your defining application, what is the TLB architecture of your targeted system?

    The good thing in commercial workloads is the typical high locality of memory references (helping the caches to be effective) and the irregularity of loads on single CPU components vs. HPC workloads (see memory example above).

    Maximal throughput, not minimal latency is the major goal in commercial servers (vs. minimized latency in interactive workloads like in desktops and workstations). With many outstanding parallel memory references, SMT seems to be one of the architectural choices to hide the memory access latency.

    Cray's MTA-2 (originally developed by Tera Computer, which later acquired Cray) is a great example of an extreme design. This machine has up to 8192 cores, and each core can run up to 128 threads. In total, about 1 million threads per machine. With 128 threads per core, there is no need for any cache hierarchy between the CPU and the memory subsystem. Caches are in a system to optimize frequently used memory accesses, but they are a scalability nightmare for large systems (mostly driven by cache directory synchronization overhead). Get rid of the need for caches and you get a very scalable system architecture for problems with huge memory requirements. Does the app have high irregularity in its memory access pattern? No problem, there are no caches, so there are no cache misses. A very powerful architectural approach for certain classes of problems like social graph analysis with massive datasets. Optimized for throughput, not latency.

    Looking at the GHz of a CPU is an interesting exercise.
    If a CPU can do one instruction per cycle at 1 GHz, it is as fast as another CPU which takes 5 cycles per instruction at 5 GHz. So the GHz alone isn't a very useful metric (but marketing departments love it). For a better assessment, use cycles per instruction and the frequency in combination.
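
    As a tiny sketch of that combination: instruction throughput is the clock frequency divided by the cycles per instruction. The two CPUs below are the hypothetical ones from the paragraph above.

```python
def instructions_per_second(freq_hz, cycles_per_instruction):
    """Instruction throughput from clock frequency and CPI."""
    return freq_hz / cycles_per_instruction


slow_clock = instructions_per_second(1e9, 1)  # 1 GHz, 1 cycle per instruction
fast_clock = instructions_per_second(5e9, 5)  # 5 GHz, 5 cycles per instruction
```

    Both come out at one billion instructions per second, despite the 5x difference in GHz.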

    more to come,
    Andy
     
    Last edited: Nov 27, 2013
  14. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    I guess I wasn't directing that question at you -- sorry, I thought it was apparent from the quotation blocks. What I was asking you was how you substantiate your claim. From your more recent post, it sounds like it's a matter of personal experience extrapolated to the rest of the industry, and you have no "sheer numbers". Is that correct? It would be very interesting if you did, as measuring and quantifying such things is very challenging.
     
  15. banedox

    banedox [H]Lite

    Messages:
    110
    Joined:
    May 11, 2011
    so what are these used for?
     
  16. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,108
    Joined:
    Jun 22, 2004
  17. The_Moves

    The_Moves Limp Gawd

    Messages:
    183
    Joined:
    Oct 8, 2008
    And databases, and web servers, and hosting JVMs, and running warehouses, and supporting Banks
     
  18. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    No, the IBM Mainframes use a very slow cpu; it is really, really inefficient. The latest IBM Mainframe cpu runs at 5.26GHz and has something around ~250MB cpu cache, and still it is much, much slower than a high-end x86 cpu. I don't know what IBM has done, but they failed miserably with their transistor budget. The largest IBM Mainframe with 24 cpus, costing very much, is beaten by 8-12 Intel Xeons.

    These IBM POWER8 cpus are used in Unix servers, and the Unix servers are vastly faster than Mainframe cpus. For instance, the old IBM 32-socket P595 with POWER6 cpus cost $35 million list price. These Mainframes are much more expensive; in fact, Mainframes make up a big part of IBM's revenues. Very few are sold each year, and still IBM earns a shitload of money on these few. You can imagine the price. This is the reason IBM sued the open source Mainframe emulator "TurboHercules", which allows you to emulate an IBM Mainframe on your laptop. IBM is deadly afraid of anything that threatens their big cash cow, and will go to great lengths to stop all attempts. Monopoly abuse is the word for it.
     
  19. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,108
    Joined:
    Jun 22, 2004
  20. jimmyb

    jimmyb 2[H]4U

    Messages:
    3,172
    Joined:
    May 24, 2006
    250MB cache will consume an enormous number of transistors.
     
  21. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    Six transistors per cell for static RAM. But I don't think the information presented by brutalizer is accurate.
     
  22. jimmyb

    jimmyb 2[H]4U

    Messages:
    3,172
    Joined:
    May 24, 2006
    It's possible that they're using embedded dram, which can be as low as 1 transistor + a cap - but it still seems absurdly large.
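
    For a sense of scale, a back-of-envelope count of the storage-cell transistors for the ~250MB figure claimed earlier in the thread, with six transistors per bit for SRAM and one per bit for embedded DRAM. This counts only the cells themselves (no tags, ECC, decoders or sense amplifiers) and takes 1 MB as 10^6 bytes, so treat it as a rough lower bound.

```python
def cell_transistors(cache_bytes, transistors_per_bit):
    """Transistors in the storage cells alone: 8 bits per byte."""
    return cache_bytes * 8 * transistors_per_bit


CACHE_BYTES = 250 * 10**6  # the ~250MB claimed in the thread, as 10^6-byte MB

sram_cells = cell_transistors(CACHE_BYTES, 6)   # six-transistor SRAM cell
edram_cells = cell_transistors(CACHE_BYTES, 1)  # one transistor (plus a capacitor)

print(f"SRAM:  {sram_cells / 1e9:.0f} billion transistors")
print(f"eDRAM: {edram_cells / 1e9:.0f} billion transistors")
```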
     
  23. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    DRAM is possible, I suppose, but is generally regarded as too slow for cache. Thing is, I can't find any POWER8 or "IBM Mainframe cpus" (like the Z196, for instance ...) that have anything like "~250MB cache". I figure such a cache would have pretty high latency just because of its size and the cost of the selection and coherency implementations.
     
  24. FLECOM

    FLECOM Modder(ator) & [H]ardest Folder Evar Staff Member

    Messages:
    15,623
    Joined:
    Jun 27, 2001
    I'd be willing to risk saying that the number of websites sitting on $35 million dollar servers is tiny

    I have never run across any of these machines, everyone I have run across scales out... cheaper and gets it done... imagine planning redundancy with $35 million dollar boxes? hard to sell that to a client when you can fill an entire data center cage with pretty decent hardware for less than that, depending on application I guess...

    these things have their place in the world... but it's in los alamos and such places
     
  25. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,108
    Joined:
    Jun 22, 2004
    This is about vertical scaling. No need for classic redundancy when the whole thing is built around redundancy.
     
  26. FLECOM

    FLECOM Modder(ator) & [H]ardest Folder Evar Staff Member

    Messages:
    15,623
    Joined:
    Jun 27, 2001
    and if the building becomes uninhabitable/unpowered/disconnected from networks because of a hurricane?

    ask everyone that went through Sandy about that
     
  27. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    Why ask someone who built a data center on low land about disaster planning?
     
  28. FLECOM

    FLECOM Modder(ator) & [H]ardest Folder Evar Staff Member

    Messages:
    15,623
    Joined:
    Jun 27, 2001
    I'm in Miami, we don't get disasters here! ;)

     
  29. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    I don't have the link where I read about the Mainframe cpu's huge cache size (L1+L2+L3+L4 caches), but I made the calculation earlier and it was something like ~200MB cache or so. A quick google showed these links:
    http://hardware.slashdot.org/comments.pl?sid=3078075&cid=41149487
    Somewhere around 200MB cpu cache, another person confirms.

    At the bottom they talk about caches.
    http://www.theregister.co.uk/2012/09/05/ibm_z12_mainframe_engine/

    If you emulate an IBM Mainframe on an old 8-socket x86 server, each cpu having 8 cores, you will get the equivalent of 3,200 Mainframe MIPS. Emulation is 5-10x slower than running native code. If you could run the Mainframe code natively on the x86 server, it would not be 3,200 MIPS, it would be 16,000-32,000 MIPS. The largest IBM Mainframe today gives 60,000-70,000 MIPS or so. This number of 3,200 MIPS is from a Mainframe expert who wrote a Mainframe emulator for x86.
    http://en.wikipedia.org/wiki/TurboHercules#Performance
    Today's x86 servers are much faster and would give much higher emulated MIPS. So you would need two or three 8-socket x86 servers to match the largest IBM Mainframe with 24 sockets, cpu-wise.

    Here is another guy who ported Linux to Mainframes and could compare the same software on x86 and on Mainframes. He says that 1 MIPS equals 4MHz on x86. So, the largest IBM Mainframe with 75,000 MIPS would equal 300GHz on x86. But an 8-core Xeon running at 2.5GHz gives 20GHz. So again, you would need two 8-socket x86 servers to match the largest IBM Mainframe cpu-wise.
    http://www.mail-archive.com/linux-390@vm.marist.edu/msg18587.html
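
    As a rough sketch of that conversion in Python: the names are hypothetical, the 4 MHz per MIPS rule of thumb is taken from the linked mailing-list post as quoted above, and adding up clock speed across cores is the poster's simplification rather than a real performance model.

```python
import math

# Inputs quoted in the post; the MHz-per-MIPS rule of thumb is an assumption.
MHZ_PER_MIPS = 4
MAINFRAME_MIPS = 75_000
XEON_CORES = 8
XEON_CLOCK_GHZ = 2.5
SOCKETS_PER_SERVER = 8

# Mainframe capacity expressed as aggregate x86 clock speed.
mainframe_ghz = MAINFRAME_MIPS * MHZ_PER_MIPS / 1000

# Aggregate clock speed of one Xeon and of one 8-socket server.
xeon_ghz = XEON_CORES * XEON_CLOCK_GHZ
server_ghz = xeon_ghz * SOCKETS_PER_SERVER

servers_needed = math.ceil(mainframe_ghz / server_ghz)

print(f"Mainframe: {mainframe_ghz} GHz, one server: {server_ghz} GHz")
print(f"8-socket servers needed: {servers_needed}")
```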

    Here is a consulting firm that compared a single-core 900MHz Xeon to a Mainframe z9 cpu. "we found that each [z9] mainframe CPU performed 14 percent less work than one [single core] 900 MHz Intel Xeon processor running Windows Server 2003." The z10 is 50% faster than the z9, the z196 is 50% faster than the z10, and the zEC12 cpu is 25% faster than the z196 cpu, which means a z12 is 1.5 x 1.5 x 1.25 = 2.8 times faster than a z9. This means a z12 corresponds to a 2.8 x 900MHz = 2.5GHz Intel Xeon. So, again you need only a few x86 cpus to match the largest IBM Mainframe.
    http://www.microsoft.com/en-us/news/features/2003/sep03/09-15linuxstudies.aspx
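
    The chained speedup can be written out as a short Python sketch. It assumes the three speedups apply one generation after another (z9 to z10 to z196 to zEC12), which is what the multiplication in the post implies; the names are hypothetical and the factors are the poster's, not measured values.

```python
# Figures quoted in the post; the per-generation speedups are the poster's.
XEON_MHZ = 900
Z9_RELATIVE_WORK = 1 - 0.14              # "14 percent less work" than the 900 MHz Xeon
GENERATION_SPEEDUPS = [1.5, 1.5, 1.25]   # z9 -> z10 -> z196 -> zEC12 (assumed chain)

# Multiply the per-generation factors together.
speedup = 1.0
for factor in GENERATION_SPEEDUPS:
    speedup *= factor

# The post's estimate: treat one z9 cpu as equal to the 900 MHz Xeon.
equivalent_mhz = speedup * XEON_MHZ
# Variant that also applies the 14 percent deficit from the quoted study.
equivalent_mhz_with_deficit = equivalent_mhz * Z9_RELATIVE_WORK

print(f"speedup over z9: {speedup}")
print(f"Xeon-equivalent clock: {equivalent_mhz:.0f} MHz")
print(f"with the 14% deficit applied: {equivalent_mhz_with_deficit:.0f} MHz")
```

    The post rounds the first figure to 2.5GHz; carrying the 14 percent deficit through would lower it to roughly 2.2GHz.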

    Ergo, as I have proven, the IBM Mainframes have very inefficient cpus. They have preserved backwards compatibility to the 1970s or so. Which means they are very bloated with legacy instructions eating up the transistor budget. Start from a clean slate, and they would be wicked fast. But as it is now, no way. Today there exist 12-core x86 cpus which are probably 50% faster than 8-core cpus. So these numbers are a bit inflated.

    Sure, but there are some workloads that cannot be run on a cluster; you need a single huge server for those.
     
  30. gigatexal

    gigatexal [H]ardness Supreme

    Messages:
    7,108
    Joined:
    Jun 22, 2004
  31. FLECOM

    FLECOM Modder(ator) & [H]ardest Folder Evar Staff Member

    Messages:
    15,623
    Joined:
    Jun 27, 2001
    like? (genuinely curious)
     
  32. jimmyb

    jimmyb 2[H]4U

    Messages:
    3,172
    Joined:
    May 24, 2006
    I seriously doubt supporting legacy instructions (and more generally backwards compatibility) is eating up significant transistor/area. Like most modern processors it is almost certainly internally RISC with instruction to microcode decoding.

    FWIW modern x86 still supports all sorts of esoteric instructions which date as far back as the 70s.
     
  33. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    Read the part about SMP workloads here, and read the SGI interview at the bottom for more information on workloads suitable for huge servers instead of clusters. You cannot run SMP workloads on an HPC cluster.
    http://hardforum.com/showpost.php?p=1040393845&postcount=41

    Proven and proven, maybe a better phrasing would be "plausible". There is a reason IBM never releases benchmarks of Mainframes vs x86. No matter how much you look, you will never find any such benchmarks - why? IBM is well known for bragging a lot when they have good tech, like the POWER7 cpu: IBM had P7 vs x86 vs SPARC benchmarks all over the internet. But there are no Mainframe benchmarks. The reason is that they are very slow in comparison to an x86. Until IBM releases benchmarks themselves showing how slow the Mainframe cpus are, it will never be proven with hard numbers. But IBM will never ever reveal the slowness of Mainframes; they are one of IBM's largest cash cows. Something like 100 Mainframes sold each year generate a huge portion of IBM's multi-billion revenue. Compare that to the millions of x86 and Unix servers sold each year, generating much less revenue proportionally. IBM shuts down every effort to question Mainframes.

    And yes, the x86 instruction set is a catastrophe. Even Intel thinks so, too. Otherwise Intel would not have seen the need to wipe the slate clean and begin anew with the Itanium instruction set. But x86 has had so much research poured into it that it performs quite well even though it is a catastrophe. But if Intel started anew, the new instruction set would be very efficient and lean, much better than the old x86 instruction set. Mainframes are even older, and do not have half the research resources that Intel and AMD pour into x86.

    The x86 architecture has over 1000 instructions today. That is just hilarious. On x86 you need more transistors to decode its instructions than a whole UltraSPARC CPU needs. That is why one RISC with 10 million transistors is faster than a CISC with >50 million transistors (10 million transistors are needed just to figure out which x86 instruction you just read and where the next instruction starts, 20 million transistors are never used but needed for backwards compatibility, 20 million are used for cache, etc). To create an x86 you need to support over 1000 instructions today. Bloat is a BAD thing. Mainframes are much worse on this.

    http://www.anandtech.com/show/3593
    ..."The total number of x86 instructions is well above one thousand" (!!)
    "CPU dispatching ... makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs."
    "the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are"
    "The costs of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution...."
     
  34. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    These links both point out that the L4 cache is not on-chip. The forum posts (which don't "prove" anything) say that the cache is "per server" -- but it's not clear to which chips they're referring. The Register site says that the z12 parts have L4 cache which is off-chip. It explains the available memory at each cache layer. L1 and L2 are available per-core; L3 cache is shared and is 48 MB of DRAM memory.

    The zEC12 has 6 cores, so that gives 160k * 6 = 960 kB of L1 mixed instruction/data cache total for the chip, plus 12 MB of L2 mixed instruction/data cache for the chip. 13 MB cache, rounding up, plus the 48 megs of L3 cache gives us a total of 61 MB of cache on-chip. Maybe there's some way you can prove that this adds up to "around 200 MB cpu cache", but I don't see it.
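
    The on-chip total above can be reproduced with a small Python sketch. The names are illustrative only, the cache sizes are the ones quoted in this post, and 1 MB is assumed to be 1024 KB.

```python
# Cache sizes as quoted in the post; 1 MB = 1024 KB is an assumption.
KB_PER_MB = 1024
CORES_PER_CHIP = 6
L1_KB_PER_CORE = 160     # mixed instruction/data L1 per core
L2_MB_PER_CHIP = 12      # L2 total for the chip
L3_MB_PER_CHIP = 48      # shared L3 on the chip

# L1 summed over all six cores.
l1_kb_per_chip = L1_KB_PER_CORE * CORES_PER_CHIP

# Everything that sits on the chip itself (the off-chip L4 is excluded).
on_chip_mb = l1_kb_per_chip / KB_PER_MB + L2_MB_PER_CHIP + L3_MB_PER_CHIP

print(f"L1 per chip: {l1_kb_per_chip} KB")
print(f"on-chip cache: {on_chip_mb:.1f} MB")
```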
     
  35. jimmyb

    jimmyb 2[H]4U

    Messages:
    3,172
    Joined:
    May 24, 2006
    I'm not sure if the numbers you're quoting are accurate or not, but 30 million transistors for backwards compatibility is negligible on modern high performance microprocessors.

    This is a die shot of a modern i7; how much of it do you think is used for decode? How much is the decode logic affecting IPC and clock frequency (if at all!)?
    http://img.hexus.net/v2/cpu/intel/Haswell/4770K/HVK/haswell-02b.jpg

    OK. I guess this number is meant to seem very large.

    This is a statement about the difficulties of software development for proprietary ISA extensions, not about the impact of legacy support on the size and performance of a microprocessor.

    This is purely speculation. Only those involved in the actual physical implementation could comment on whether it is a bottleneck.

    Not really true with modern microarchitectures. Complex (and simple) instructions are decoded into microprograms which run on standard execution resources.


    Note, I'm not trying to say that backward compatibility for old, complex ISAs is a good thing, or that x86 is well thought out - just that for big modern processors it isn't consuming a lot of area or significantly affecting performance.
     
  36. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    I'm not sure I understand your use of "hilarious" here. I don't think the x86 has more than 1000 instructions unless you're counting individual opcodes. I'd think of "MOV" as one instruction, for instance. With that definition, you've less than 1000 instructions. If you think of the different sources and targets, then the combinations start taking over. "MOV AX,BX" is different than "MOV BX,AX"; and different still than "MOV EAX,EBX" or "MOV AL, BL".

    Since most variants with the more explosive counting approach are just changes to comparisons or targets, bits in the instruction's actual opcode end up driving a demultiplexer to select a particular register. Different addressing and indexing modes similarly drive a demultiplexer to pick up an effective address. I think your estimates for transistor count usage in the decoder for various features are pretty far off.
     
  37. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    If you count the L1+L2+L3+L4 cache for a cpu, you will find that the total cpu cache for a zEC12 is huge. The L4 is off-chip, true. And L4 is huge, like... 384MB or so. So add up the total L1+L2+L3+L4 cache for all cpus, divide by the number of cpus, and you will end up somewhere around ~200MB of cache for each cpu.

    But does the exact cpu cache size matter? My point is that even though the zEC12 has a huge cpu cache (much bigger than normal cpus, for instance a Xeon might have 16MB of cpu cache) - the IBM Mainframe cpus are much slower than Xeons. Are you trying to evade this slowness problem by focusing on the exact amount of cpu cache? Is it relevant if the zEC12 has 200MB cache or if it has 216MB cache? It still runs at 5.26GHz and still is much slower than a Xeon. What do you want to discuss? The exact cache size, or why the Mainframe cpus are so slow? What is the interesting thing here?



    It is true that 30 million is negligible. But that adds up to bloat. As a person from the US Senate said "you take one billion here, and one billion there, and soon you have found some serious money". 30 million transistors here and there - it does sound plausible that such a mindset makes bloat, yes? "A common PC has 4GB RAM, so it is not that important to optimize the code" - and suddenly we have 1-2GB RAM OSes. A standard Linux distro is quite bloated, much more than WindowsXP - which was the epitome of bloat.

    What do you think? 1000 instructions? Have you read the Anandtech article?

    This is not speculation. If you had read more on this, you would have seen that it is a big problem to find out where one instruction ends and the next one starts. One of the points of RISC cpus was to avoid this and, by keeping it simpler, make them run much faster.

    So are you saying that supporting 1000, or even 10,000 instructions is not something to avoid? It is not a problem? Lean and mean, is not important? If you develop software, it is well known it is better to have as few LoC as possible. Bloat is a bad thing.

    There are a lot of experts saying the opposite. They say that bloat is a bad thing, that having to support instructions from the intel 4004 (?) from the 1970s is bad. Bloat just adds to complexity, and complexity just eats up transistors, power, introduces bugs, etc. If you don't know about this, it makes it hard to talk about this further. Every developer says bloat is a bad thing, if you don't agree on this - well, what can I say?


    I don't know what the article author at Anandtech meant, I have not read the original paper. But I know that x86 is a mess, I don't have to read the paper to know that. When we developed emulators for different cpus at university, we studied x86 in depth, and yes, it was a mess. Even Intel thought so, and wanted to start anew with Itanium. Sure, x86 is fast, but if they could start anew, maybe they could cut the die size in half and halve the power usage - and still get the same performance, or even higher.
     
    Last edited: Dec 29, 2013
  38. mikeblas

    mikeblas [H]ard|DCer of the Month - May 2006

    Messages:
    12,782
    Joined:
    Jun 26, 2004
    Sounds like you've changed your mind as one of your original points was that IBM "failed miserably with their transistor budget". I'm doing nothing more than trying to ascertain your reasoning for that claim. Since you're admitting that you don't know exactly how big the caches are, I'm not sure if you understand the claim yourself.

    It's also unclear if you're counting sockets or cores, but it makes no sense to multiply shared caches by the number of CPUs that share them, then divide the number back down to come up with cache per CPU.

    The 384MB external L4 cache is shared by all six sockets on the MCM. That is, in the zEC12, there are six hex-core processors -- 6 sockets each with 6 cores, so 36 cores per book. The book has 384MB external L4 cache, not each processor. Since the transistors in that cache aren't in the package, they don't count towards the transistor budget for the CPU.

    The L3 cache is in the processor package, but is only 48MB -- shared by all 6 cores within the package. If we do the math you're asking for, we don't end up anywhere near 200 MB for each CPU: 160 KB L1, plus 2048 KB L2, plus 8 MB L3 (48 MB shared among 6 cores), plus 10922 KB L4 (384 MB shared among 36 cores, outboard) gives 21,322 KB of cache per CPU.
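
    Spelled out as a minimal Python sketch, the per-core sum looks like this. The names are illustrative, the sizes are the ones quoted above, 1 MB is taken as 1024 KB, and each shared cache is simply split evenly among the cores that share it, rounded down to whole KB.

```python
# Sizes quoted in the post; an even split of shared caches is the assumption.
KB_PER_MB = 1024
CORES_PER_CHIP = 6
CHIPS_PER_BOOK = 6
L1_KB = 160            # per core
L2_KB = 2048           # per core
L3_SHARED_MB = 48      # shared by the 6 cores on one chip
L4_SHARED_MB = 384     # shared by all 36 cores in one book, off-chip

# Each core's even share of the shared caches, in whole KB.
l3_share_kb = L3_SHARED_MB * KB_PER_MB // CORES_PER_CHIP
l4_share_kb = L4_SHARED_MB * KB_PER_MB // (CORES_PER_CHIP * CHIPS_PER_BOOK)

per_core_kb = L1_KB + L2_KB + l3_share_kb + l4_share_kb

print(f"L3 share: {l3_share_kb} KB, L4 share: {l4_share_kb} KB")
print(f"cache per core: {per_core_kb} KB")
```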

    Are you doing some sort of different math? How do you reconcile these counts to support your transistor budget assertion? What was the transistor budget for the zEC12, and how do you believe IBM failed in utilizing that budget?

    What it sounds like you're saying is that you think the x86 instruction set is "a mess", but you can't specifically state why it is a mess, or what "a mess" really means. Can you provide any substance to your claims? You've not provided any relative numbers, either; measuring "instructions" the same way across all platforms, how many instructions do x86, Itanium, Power Architecture, and UltraSPARC have? How much latency do their corresponding decoding implementations contribute to their architectures? What transistor budgets do they spend on combating those latencies -- predictive branching, pipelining, and so on?

    You've asked "what's interesting". To me, what's interesting is not learning facts; it's learning why those facts are so.
     
  39. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    I am certainly doing some sort of different math, yes. For instance, 48MB of L3 cache shared by all 6 cores: you explained that the zEC12 cpu has 6 cores. So, I assume that one zEC12 cpu has 48MB of L3 cache. Not 8 MB according to your calculations.

    And, if one book has 384MB of L4 cache, and if a book has six cpus, this means each cpu has 384MB / 6cpus = 64MB of L4 cache. You calculate it as 384MB / 36 cores = 10.9MB cache.

    It seems that when you speak of one "cpu" you actually mean one "core". Well, I am talking about the zEC12 processor, one cpu - not one core. I am not talking about benchmarking one zEC12 core vs a Xeon core. I am talking about benchmarking a zEC12 processor with 6 cores, vs a Xeon cpu with 8 or 12 cores. What kind of terminology are you using? Mixing cores with cpus? It seems that you calculate that one zEC12 core has 21,322KB of cache. And because one zEC12 cpu has six cores, it means one zEC12 cpu utilizes in total 6 x 21,322KB = 127.9MB, so roughly 128MB of cpu cache.
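
    The per-cpu figure can be cross-checked two ways in a short Python sketch: by multiplying the per-core number by six, and by adding the levels up directly with the 64MB L4 share from the paragraph above. The names are hypothetical and the inputs are the sizes quoted in the thread; whether a megabyte is counted as 1000 KB or 1024 KB is left visible because it changes the headline number.

```python
# Sizes quoted in the thread; names are illustrative only.
KB_PER_MB = 1024
CORES_PER_CHIP = 6
CHIPS_PER_BOOK = 6
PER_CORE_KB = 21_322   # per-core figure from the earlier post

# Route 1: six cores times the per-core figure.
per_chip_from_cores_kb = CORES_PER_CHIP * PER_CORE_KB

# Route 2: add the levels up directly for one six-core chip.
l1_kb = CORES_PER_CHIP * 160
l2_kb = CORES_PER_CHIP * 2048
l3_kb = 48 * KB_PER_MB
l4_kb = 384 * KB_PER_MB // CHIPS_PER_BOOK   # 64 MB share of the book's L4
per_chip_direct_kb = l1_kb + l2_kb + l3_kb + l4_kb

print(f"route 1: {per_chip_from_cores_kb} KB")
print(f"route 2: {per_chip_direct_kb} KB")
print(f"route 2 in MB (1024 KB each): {per_chip_direct_kb / KB_PER_MB:.1f}")
print(f"route 1 in MB (1000 KB each): {per_chip_from_cores_kb / 1000:.1f}")
```

    Under these inputs both routes agree to within a few KB: about 125MB with binary megabytes, or about 128MB if a megabyte is counted as 1000 KB.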

    I admit it, this IS confusing. IBM is trying to hide and conceal facts so it will be difficult to make a direct comparison to commodity cpus, such as the x86 cpu. I have actually tried to find this out, and asked IBM persons, but haven't got any good answer. They are all evading my questions. "Is it true that you can emulate a Mainframe on a laptop using TurboHercules?", etc. I had to find this out the hard way, by trying to read between the lines from the IBM persons.

    But now that I have you online, someone who knows quite a lot about IBM Mainframes, maybe you can help us pinpoint the exact numbers, so I don't have to say "somewhere around ~200MB cpu cache" anymore. I would like to have an exact number to write down. It is not that satisfactory to talk about "~200MB cpu cache", I would like to quote a more specific number.

    So let us begin: how many books does the zEC12 have, and how many cpus are on each book? And how much cpu cache does each level have? "The system can support 120 cores" - does this mean that it has 120/6 = 20 cpus? And another extra four, dedicated to zOS?

    So, you think that a cpu that runs at 5.26GHz and uses somewhere around ~128-200MB of cpu cache - and still is much slower than an x86 cpu - is not a big failure? The predecessor, using four cores at 5.26GHz, gobbled up 300 Watt:
    http://www.tomshardware.com/news/ibm-mainframe-server-z-power,16716.html

    This zEC12 cpu is actually clocked at 5.5GHz (not 5.26GHz as I thought), and has six cores, so I would not be surprised if it uses 300 Watt too, at minimum. Or even more. And still the zEC12 is much slower than a decent x86 cpu at 130 Watt. Is this not a big failure? What has IBM done with all the transistors, wattage, and Hz? How can IBM fail so miserably? Imagine a Xeon clocked at 5.5GHz and allowed to use 300 Watt! Intel would have to work hard to lower the performance down to IBM's level. Intel could introduce even more massive bloat into the x86 architecture that eats up even more transistors and wattage, so the bloat would be on par with Mainframes, which are backwards compatible back to the 1960s.

    IBM proudly claims the predecessor z196 was the world's fastest cpu:
    http://www-03.ibm.com/press/us/en/pressrelease/32414.wss
    The z196 is slower than the zEC12 cpu, which is far slower than a decent x86 cpu.

    IBM also claimed the predecessor z196 can replace 1,500 x86 servers. That is interesting. How can 20 slow Mainframe cpus replace 1,500 faster cpus? I dug a bit, and it turned out that IBM assumes all the x86 servers idle while the Mainframe is fully loaded! Well, what would happen if a few of the x86 servers started to do some work? Then the Mainframe would choke. I could boot up three Mainframes on my laptop, and let them idle - would it be fair if I claimed that my laptop can replace three Mainframes? Would IBM consider this a lie?


    I am not going to lecture you on x86 architecture. If you don't understand that paper at anandtech, and if you don't understand why bloat is a bad thing - what can I say? Someone who doesn't develop code - it will take a long time to explain to them.


    And you think that it is more important to discuss details than to discuss the big picture? Some people are detail oriented, and some are big picture oriented. If IBM's Mainframes have very slow cpus - you are more interested in discussing the exact figures and details than in debunking the IBM Mainframe myth? Something does not add up in IBM's claim of the world's fastest cpus and the might of Mainframes. If you study the big picture you will see there is a discrepancy. If you study details, you will not see the flaws. It is only when you try to puzzle it together that you will see the glaring holes.
     
  40. jimmyb

    jimmyb 2[H]4U

    Messages:
    3,172
    Joined:
    May 24, 2006
    I don't disagree that bloat adds up, but you were talking about legacy support specifically - not the accumulation of bloat. 30 million transistors in itself is not a lot in modern processes.

    I don't have a reaction to the number without information on the corresponding performance metrics. How is this affecting critical timing paths? Did it grow the logic such that an additional pipeline stage was necessary? In what other ways did it affect the architecture?

    Also, I tend to believe Mike in that the number is probably inflated with all the addressing mode and register operand/target combinations.

    As I mentioned earlier, modern microprocessors are RISC internally with instruction decode to micro-ops.

    Lean and mean is a catchphrase, not real performance metrics.

    Aside from the fact that we're not talking about software development, this isn't even true.


    Which experts? Which processor design experts are saying that supporting legacy instructions is having a significant negative impact on performance and area in modern high speed microprocessors?

    I spent many years doing integrated circuit design at an x86 producer (among other things), and none of the experts I worked with suggested this was a problem. In fact, it was specifically suggested that given the huge transistor budgets of modern processes, and the fact that processors are designed as RISC internally with a micro-op decoder, supporting legacy instructions has negligible impact on performance and area.
     