New IBM POWER8 CPU

mikeblas · Dec 29, 2013

brutalizer said:
I am certainly doing some sort of different math, yes. For instance, 48MB of L3 cache shared by all 6 cores: you explained that the zEC12 cpu has 6 cores. So, I assume that one zEC12 cpu has 48MB of L3 cache. Not 8 MB according to your calculations.

My calculations were normalized to be per-core. 48 MB of L3 cache across 6 cores means 8 MB per core. Of course, a single core might be using more, and reuse can happen per core.

brutalizer said:
And, if one book has 384MB of L4 cache, and if a book has six cpus, this means each cpu has 384MB / 6cpus = 64MB of L4 cache. You calculate it as 384MB / 36 cores = 10.9MB cache.

That's because there are six sockets per book and six cores per socket: 384 MB / 6 = 64 MB per socket, and 64 / 6 is 10.67 MB per core.
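For anyone who wants to check the arithmetic, here is a minimal sketch of the per-socket and per-core normalization. The constants are the figures quoted in this thread (48 MB L3 per chip, 384 MB L4 per book, six sockets per book, six cores per socket), not values verified against the Redbook, and the helper name `per_unit` is made up for illustration.

```python
# Cache figures as quoted in this thread; treat them as assumptions,
# not as verified zEC12 specifications.
L3_PER_SOCKET_MB = 48    # L3 shared by the cores on one chip
L4_PER_BOOK_MB = 384     # L4 shared by the sockets in one book
SOCKETS_PER_BOOK = 6
CORES_PER_SOCKET = 6


def per_unit(total_mb: float, units: int) -> float:
    """Evenly divide a shared cache across the units that share it."""
    return total_mb / units


l3_per_core = per_unit(L3_PER_SOCKET_MB, CORES_PER_SOCKET)
l4_per_socket = per_unit(L4_PER_BOOK_MB, SOCKETS_PER_BOOK)
l4_per_core = per_unit(l4_per_socket, CORES_PER_SOCKET)

print(f"L3 per core:   {l3_per_core:.2f} MB")
print(f"L4 per socket: {l4_per_socket:.2f} MB")
print(f"L4 per core:   {l4_per_core:.2f} MB")
```

An even split is only a normalization for comparison; a shared cache is not physically partitioned this way, so a single busy core can occupy more than its share.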

brutalizer said:
It seems that when you speak of one "cpu" you actually mean one "core". Well, I am talking about the zEC12 processor, one cpu

Sure. Either way, each CPU socket doesn't have 200-some megabytes of cache available to it.

brutalizer said:
- not one core. I am not talking about benchmarking one zEC12 core vs a Xeon core. I am talking about benchmarking a zEC12 processor with 6 cores, vs a Xeon cpu with 8 or 12 cores. What kind of terminology are you using? Mixing cores with cpus? It seems that you calculate that one zEC12 core has 21,322KB cache. And because one zEC12 cpu has six cores, it means one zEC12 cpu utilizes in total, 6 x 21,322KB = 127,8MB probably 128MB cpu cache.

Yep, that's closer to the truth. Thing is, only 10,400 KB is in the processor package, and it's about 1/20th of the 200-something MB number you were using when you made your claim about IBM's engineers and their transistor budget management. Now that you've learned some of the facts, I wonder if you're thinking of re-evaluating your position.

brutalizer said:
I admit it, this IS confusing. IBM is trying to hide and conceal facts so it will be difficult to make a direct comparison to commodity cpus, such as the x86 cpu.

This is a very strong assertion for which you offer no real evidence.

brutalizer said:
So let us begin: how many books does the zEC12 have, and how many cpus on each books? And how much cpu cache does each level have? "The system can support 120 cores" - does this mean that it has 120/6 = 20 cpus? And another extra four, dedicated to zOS?

There are six sockets per book, and the number of books in the system depends on how many were ordered or installed in the machine. The Redbook for the system describes its architecture; I'm surprised to think you haven't read it.

brutalizer said:
So, you think that a cpu that runs at 5.26GHz and uses somewhere ~128-200 cpu cache - and still is much slower than a x86 cpu - is not a big failure?

You're putting words into my mouth in support of your own position. I've made no claim about the success or failure of zEC12 machines. In fact, I'm asking you why you think they're such a failure to educate myself.

brutalizer said:
I am not going to lecture you on x86 architecture. If you dont understand that paper at anandtech, and if you dont understand why bloat is a bad thing - what can I say? Someone who dont develop code - it will take a long time to explain to them.

I've written plenty of code -- certainly enough to know that someone who's trying to minimize lines of code is working to minimize the wrong thing. (And we're talking about hardware anyway, not software.) You'll want to familiarize yourself with my resume before you make further foolish presumptions.

I'm not trying to understand the paper at AnandTech -- there's no paper at AnandTech, just an article reporting that someone made a blog post. I'm trying to understand the claims you've made in your posts. I'm disappointed that your responses to my questions don't bring any clarity.

brutalizer said:
And you think that it is more important to discuss details, than to discuss the big picture? Some people are detail oriented, and some are big picture oriented. If IBM's Mainframes have very slow cpus - you are more interested to discuss the exact figure and details, than debunking the IBM Mainframe myth? Something does not add up, in IBM's claim of worlds fastest cpus and the might of Mainframes. If you study the big picture you will see there is a discrepancy. If you study details, you will not see the flaws. It is only when you try to puzzle it together you will see the glaring holes.

What is "the IBM mainframe myth"? I guess I'm not asking about that because it hasn't come up before this point.

niomosy · Dec 30, 2013

brutalizer said:
No, the IBM Mainframes use a very slow cpu, it is really really inefficient. The latest IBM Mainframe cpu runs at 5.26GHz and has something around ~250MB cpu cache, and still it is much much slower than a high end x86 cpu. I dont know what IBM has done, but they failed miserably with their transistor budget. The largest IBM Mainframe with 24 cpus, costing very much, is beaten by 8-12 Intel Xeons.

A lot of your mainframe software cost is going to be in software licensing given the way IBM and most every mainframe software vendor licenses their mainframe software. You're paying per MIPS for a lot of this software so there's a huge desire to not get into a higher MIPS quantity than you really need for business needs.

As for the mainframe itself, it's been more I/O focused than specifically CPU focused for some time. Most shops will use the mainframe as the system of record with other systems pulling from and putting into the mainframe.

King of Heroes · Dec 31, 2013

niomosy said:
As for the mainframe itself, it's been more I/O focused than specifically CPU focused for some time. Most shops will use the mainframe as the system of record with other systems pulling from and putting into the mainframe.

This. I work at an IBM i shop, in which I started as an intern, and an important thing that I've learned is that IBM machines are built to be I/O monsters, NOT computation monsters. We use IBM i because its integration with DB2 makes it an enterprise database that's only seriously rivaled by Oracle.

niomosy · Dec 31, 2013

King of Heroes said:
This. I work at an IBM i shop, in which I started as an intern, and an important thing that I've learned is that IBM machines are built to be I/O monsters, NOT computation monsters. We use IBM i because its integration with DB2 makes it an enterprise database that's only seriously rivaled by Oracle.

Interesting as I see fewer and fewer i (AS/400) shops out there. Our i hardware is pretty old and handles specific apps that eventually feed to the mainframe. Same as our UNIX/Linux databases.

Funny thing, we almost ended up with the i servers since they can run on the pSeries hardware. My boss ended up suggesting the Mainframe/Tandem team would be better suited to handling it.

brutalizer · Jan 1, 2014

jimmyb said:
I don't disagree that bloat adds up, but you were talking about legacy support specifically - not the accumulation of bloat. 30 million transistors in itself is not a lot in modern processes.

Sure, 30 million transistors is not a lot, but it adds up. There are at least two different schools out there. Microsoft belongs to one school saying that bloat is not something terrible. The cpus are so fast today, and we have so much RAM, it does not matter if a software used 100KB more or less. It does not matter. And another 30 million transistors does not matter.

The other school says that bloat is a bad thing, you should try to minimize the code and keep it lean and mean, this is best from an engineering view point.

If you have two pieces of code, one code using 1000 LoC, and another using 600 LoC - and they are doing the same thing - then you should almost always choose the 600 LoC alternative. The difference between an experienced programmer and a beginner, is that the experienced will use much less code to do the same thing as the beginner. And the risk of bugs is lower, as a side effect. The fewer LoC, the fewer bugs you will see. Complexity is a bad thing.

And this must apply to cpus too. If you have one cpu using 1,000 million transistors and another using 600 million transistors doing the same thing - you should choose the fewer transistor cpu. More unnecessary transistors just eats up more watt, and you will see more bugs too.

In circuit complexity it is very important to try to reduce the number of transistors. If you can reduce 5% of all transistors - that is better than not trying to optimize at all.

Obviously, you belong to the same school as MS.

I don't have a reaction to the number without information on the corresponding performance metrics. How is this affecting critical timing paths? Did it grow the logic such that an additional pipeline stage was necessary? In what other ways did it affect the architecture?

How will bloat affect the architecture? What kind of question is that? Are you serious with that question?

Well, first of all, we all know that the x86 is buggy, with lot of bugs. For instance, the FDIV bug.
http://en.wikipedia.org/wiki/Pentium_F00F_bug
http://en.wikipedia.org/wiki/Pentium_FDIV_bug
There have been many stories about Intel and AMD having bugs in x86, forcing them to replace cpus at a huge cost, sometimes. Is this not a problem? This is a consequence of being bloated, and you dont see it as a problem? If you do admit it is a problem, how can you ask me how being bloated is a problem?

Second, if you compare something truly bloated, such as Windows kernel, to a not as bloated kernel, such as a Linux/Unix kernel - we all know that bloated Windows has problems with instability and performance problems, in comparison to a Linux/Unix kernel. In WinXP, Microsoft has no idea what was going on in the kernel. No one knew what was happening in that huge mess, causing lot of instability and performance problems. It was Sinofsky at MS, who first started to try to understand the Windows kernel. And he created the MinWin kernel, the "tiny" Windows kernel, which was only 200MB or so. And after his work Win7 is much more stable and well structured kernel than WinXP ever was. Win7 has a more stripped down kernel, trying to remove bloat, improving stability and performance. I heard that Windows 2000 kernel had something like... 1,500 API calls or even more? Many of them legacy and not in use anymore, but you still have to have them in Windows. In comparison, Linux kernel has something like ~150 API calls. So the effect of bloat, is instability and performance problems. And talking about CPUs, more transistors eat more wattage.

Really, I dont get it. How can anyone (except people at MS) ask me why bloat is a problem? Are you serious?

Lean and mean is a catchphrase, not real performance metrics.

...

Aside from the fact that we're not talking about software development, this isn't even true.

In software development, reducing LoC is important. And so it is in circuitries too. Reducing transistors is a big field of research. If you have all these transistors, can you find a smaller set, producing the same output? Have you missed this? You never try to optimize anything you construct, because removing 30 million transistors here and there, is not a problem?
http://en.wikipedia.org/wiki/Circuit_complexity

I spent many years doing integrated circuit design at an x86 producer (among other things), and none of the experts I worked with suggested this was a problem. In fact, it was specifically suggested that given the huge transistor budgets of modern processes, and the fact that processors are designed risc internally with a micro-op decoder, that supporting legacy instructions has negligible impact on the performance and area.

And there are developers at Microsoft saying that producing code with low LoC is not important, because Windows is so big in the first place, another 1000 LoC does not matter. And what has this resulted in? Windows being unstable and having low performance. And yes, it IS a problem, no matter how much the MS developer will assure the customers it is not a problem.

And yes, the x86 architecture IS buggy and inefficient - and that IS a problem.

brutalizer · Jan 1, 2014

mikeblas said:
My calculations were normalized to be per-core. 48 MB of L3 cache across 6 cores means 8 MB per core. Of course, a single core might be using more, and reuse can happen per core.

Normalized per core? Why? Are we discussing how slow the Mainframe cpus are, or how slow the Mainframe cores are? Why are you shifting focus from cpu to core? That is something IBM is very fond of, distorting facts.

Here is an example on this thing that IBMers love to do:
http://whywebsphere.com/2013/04/29/...uble-the-cost-of-the-websphere-on-ibm-power7/
"Oracle announced their new SPARC T5 processor with much fanfare and claiming it to be the “fastest processor in the world”. Well, perhaps it is the fastest processor that Oracle has produced, but certainly not the fastest in the world. You see, when you publish industry benchmarks, people may actually compare your results to other vendor’s results. This is exactly what I would like to do in this article."

And he "analyzes" (or twists the facts) and concludes that IBM POWER7 is a faster cpu at WebSphere, because POWER7 has higher EjOPS per core. There is a slight problem. SPARC T5 does have the world record. Each SPARC T5 core might be less powerful than POWER7, but T5 has many more cores - so the end result is T5 is the world's fastest cpu on this.

IBM says:
POWER7 has faster cores than SPARC T5 -> might be true
And therefore:
POWER7 is a faster cpu than SPARC T5 -> not true.

So, this is a lie. By shifting focus from CPUs to cores, IBM draws a conclusion, and applies the same conclusion back on the cpus again. Which is no longer true. Besides, SPARC T5 often has faster cores too, and many more of them too.
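A small sketch of why a per-core ranking and a per-socket ranking can disagree. The two chips and their scores below are entirely hypothetical, not POWER7 or SPARC T5 results, and the model assumes throughput scales perfectly with core count, which real benchmarks rarely do.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    cores: int
    score_per_core: float  # hypothetical benchmark units, not real data

    @property
    def score_per_socket(self) -> float:
        # Assumes perfect scaling across cores (an idealization).
        return self.cores * self.score_per_core


chip_a = Chip("chip A", cores=8, score_per_core=100.0)
chip_b = Chip("chip B", cores=16, score_per_core=70.0)

fastest_core = max((chip_a, chip_b), key=lambda c: c.score_per_core)
fastest_socket = max((chip_a, chip_b), key=lambda c: c.score_per_socket)

print(f"Faster per core:   {fastest_core.name}")
print(f"Faster per socket: {fastest_socket.name}")
```

With these made-up numbers chip A wins per core and chip B wins per socket, so any comparison has to say which unit it is normalized to.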

So, I dont understand why are you talking about cores? I am talking about cpus.

Yep, that's closer to the truth. Thing is, only 10,400 KB is in the processor package, and it's about 1/20th of the 200-something MB number you were using when you made your claim about IBM's engineers and their transistor budget management. Now that you've learned some of the facts, I wonder if you're thinking of re-evaluating your position.

I said that I would like you to help me to figure out the exact numbers. And that means I am willing to re-evaluate my position, yes. I said so. The thing is, when I try to ask IBM people about this, they never help me. They duck and evade all my questions. I never get a clear answer.

So, maybe you can help me. Again, how many books does the largest zEC12 have? I am comparing the largest and most powerful IBM Mainframe to x86 servers here. One book apparently has six sockets. And one book has 384 MB L4 cpu cache. Each cpu has 48MB L3 cache. These are facts?

NB, when I say "cpu" I mean, one socket. Not one core. If I mean "core", I say "core". I will not say "cpu" and mean "core". That is only confusing people.

And no, I have not read the RedBook, I dont have time to read books on this. Now when I have you, an IBM Mainframe person, online maybe you can help me figuring out this. Neither you nor I, would want me to say untrue things. So let us figure out how much cpu cache each Mainframe cpu has.

This is a very strong assertion for which you offer no real evidence.

So, IBM is not trying to hide and conceal facts, to make it difficult to assess performance of Mainframes vs x86?

Well, how do you expect me to prove this? By linking to a document by IBM where they confessed they do try to hide facts? What kind of proof do you want?

Facts are, IBM dont ever release benchmarks comparing x86 to Mainframes. There are no benchmarks, with say, SPECint2006 of zEC12 vs x86 cpus. Why is that? Of course, if you can show me such a benchmark, then I was wrong on this, and will stop say this. But fact is, IBM will not allow you to compare Mainframes to x86 or SPARC. No way. And I will not be able to show a written confession from IBM on this. I just state facts, IBM has not released benchmarks vs x86 or SPARC.

You're putting words into my mouth in support of your own position. I've made no claim about the success or failure of zEC12 machines. In fact, I'm asking you why you think they're such a failure to educate myself.

They are a failure, because you can software emulate a big Mainframe on a x86. If Mainframes were that much faster, then it would not be possible to use software emulation to replace a Mainframe.

I've written plenty of code -- certainly enough to know that someone who's trying to minimize lines of code is working to minimize the wrong thing. (And we're talking about hardware anyway, not software.) You'll want to familiarize yourself with my resume before you make further foolish presumptions.

You should know that an experienced developer writes less code, achieving the same result as a beginner. And the less code, the better - says every developer I met.

I'm not trying to understand the paper at AnandTech -- there's no paper at AnandTech, just an article reporting that someone made a blog post. I'm trying to understand the claims you've made in your posts. I'm disappointed that your responses to my questions don't bring any clarity.

Please ask your questions again, and I will try to answer them. Maybe you can answer my questions so we can finally figure out how much cpu cache there is in total in the largest configured Mainframe. How many books is the largest supported configuration?

What is "the IBM mainframe myth"? I guess I'm not asking about that because it hasn't come up before this point.

The myth is that they are so much more powerful, the most powerful SMP computers in the world. Well, they are not. They are in fact very slow cpu wise. Worst in class.

brutalizer · Jan 1, 2014

King of Heroes said:
This. I work at an IBM i shop, in which I started as an intern, and an important thing that I've learned is that IBM machines are built to be I/O monsters, NOT computation monsters. We use IBM i because its integration with DB2 makes it an enterprise database that's only seriously rivaled by Oracle.

I have never denied that Mainframes are superior at I/O. This IS true, they are superior at I/O, and that is why many use them. One Mainframe can have 296.000 I/O channels, I heard. And they have lot of I/O co processors. The question is, how much I/O would you get from a x86 server, if you used that many I/O co processors?

But it seems that you agree with me, they are not good cpu wise, but very good at I/O?

mikeblas · Jan 1, 2014

brutalizer said:
The difference between an experienced programmer and a beginner, is that the experienced will use much less code to do the same thing as the beginner. And the risk of bugs is lower, as a side effect. The fewer LoC, the fewer bugs you will see. Complexity is a bad thing.

This is a pretty vast oversimplification. Arbitrary complexity is bad, but users demand features -- and features add complexity. It's rare that simply useless code exists; code exists because someone wanted it and added it to the project. It serves some purpose, for someone.

No user is concerned with lines of code; they're concerned with features, performance, and stability. The correlation between lines of code and any of those deliverables is indirect at best.

brutalizer said:
And this must apply to cpus too.

Why must it apply? Just because you have said so?

brutalizer said:
If you have one cpu using 1,000 million transistors and another using 600 million transistors doing the same thing - you should choose the fewer transistor cpu. More unnecessary transistors just eats up more watt, and you will see more bugs too.

Except: you won't. If the code (or circuitry) is bloat, then it's not used. If it's not used, then the bugs won't be exercised and end up being irrelevant. If the code (or circuitry) is useful, then it's not bloat and the bugs have come from required complexity and aren't a result of bloat, anyway.

brutalizer said:
In circuit complexity it is very important to try to reduce the number of transistors. If you can reduce 5% of all transistors - that is better than not trying to optimize at all.

Most people who buy processors are concerned with performance, not any notion of "bloat". A design that uses more transistors and performs better is more desirable than a design that uses fewer transistors and is slower.

brutalizer said:
Well, first of all, we all know that the x86 is buggy, with lot of bugs.

Are there any processors which are completely bug free? Will you enumerate them?

brutalizer said:
This is a consequence of being bloated, and you dont see it as a problem?

Bugs aren't a consequence of bloat; they're a consequence of complexity. The instructions you cite aren't unnecessary waste; they're useful and necessary. Their implementation isn't trivial. Intel and AMD (and all other successful processor manufacturers) do incredible verification and validation testing. (Lots of LoC in the tests -- that's bloat too, right?)

brutalizer said:
In software development, reducing LoC is important.

Really, it isn't. No developer will spend a few days deleting code. They might re-factor code, but that usually ends up in more lines of code than less. There are other metrics in software development that are more important.

brutalizer said:
Normalized per core? Why?

Because cache has locality to a core, and that made the math easier to perform. Further, shared cache isn't as impactful for performance as core-local cache.

brutalizer said:
I said that I would like you to help me to figure out the exact numbers. And that means I am willing to re-evaluate my position, yes. I said so. The thing is, when I try to ask IBM people about this, they never help me. They duck and evade all my questions. I never get a clear answer.

You can get a clear answer from the redbook for the systems you're interested in. IBM people wrote the redbook, and it pretty specifically explains the structure of the system, so the personal attacks you're making are as false as they are irrelevant.

brutalizer said:
And no, I have not read the RedBook, I dont have time to read books on this.

That must be why you're so far from the truth and unable to knit together any solid reasoning.

brutalizer said:
Now when I have you, an IBM Mainframe person, online maybe you can help me figuring out this.

What is there to figure out? Just read the redbook; it explains the architecture both quantitatively and structurally. I'm not an IBM Mainframe person; I haven't used big iron since the late 1980s.

brutalizer said:
You should know that an experienced developer writes less code, achieving the same result as a beginner. And the less code, the better - says every developer I met.

You must know a lot of short-sighted developers. How do your acquaintances justify sacrificing reuse just to reduce their LoC metrics? Are their managers really measuring them based on LoC and not delivering features to customers?

Red Falcon · Jan 1, 2014

mikeblas said:
You can get a clear answer from the redbook for the systems you're interested in. IBM people wrote the redbook, and it pretty specifically explains the structure of the system, so the personal attacks you're making are as false as they are irrelevant.
That must be why you're so far from the truth and unable to knit together any solid reasoning.

brutalizer hasn't made any personal attacks, and has been quite logical and reasonable throughout this whole discussion.
It would be nice if you actually answered his questions, as I am interested in this information as well.

jimmyb · Jan 1, 2014

brutalizer said:
Sure, 30 million transistors is not a lot, but it adds up. There are at least two different schools out there. Microsoft belongs to one school saying that bloat is not something terrible.

Again, we're talking about hardware design here.

And this must apply to cpus too. If you have one cpu using 1,000 million transistors and another using 600 million transistors doing the same thing - you should choose the fewer transistor cpu. More unnecessary transistors just eats up more watt, and you will see more bugs too.

1) The extra logic to support decoding legacy instructions is doing something.
2) A lower transistor count is not necessarily the better design, and does not necessarily use less power. A very big example of this is in digital logic where the industry has almost universally moved to CMOS implementations due to their lower power, despite typically having twice as many transistors as their equivalent NMOS implementation.
3) Having smart designs, BIST components, modern DFT techniques, and thorough verification reduces bugs more than *anything* else; usually this means using more transistors than the smallest design possible.

How will bloat affect the architecture? What kind of question is that? Are you serious with that question?

It's a very legitimate question. Tell me specifically how "bloat" in the decode logic affects the microarchitecture. Tell me why it's making x86 slow, buggy, and inefficient - don't just make general statements, tell me something specific.

Is the decode logic:
1) A critical timing path in the design? This would put an upper bound on the clock frequency. If it's not the critical timing path it's not slowing down the design.
2) Is it necessitating an extra pipeline stage that would otherwise not be required? If it's not, then it isn't affecting the decode latency.
3) Is it so slow and large that it has caused significant design decisions to accommodate it? What were these decisions specifically?

If the "bloat" isn't slowing down the design and is providing some function, then it's not really bloat.
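To make the critical-path point concrete, here is a toy sketch. The stage names and delays are invented for illustration and do not describe any real x86 pipeline; the only idea shown is that the maximum clock is set by the slowest stage, so shrinking a stage that is not on the critical path changes nothing.

```python
# Hypothetical per-stage delays in nanoseconds (illustrative only).
stage_delay_ns = {
    "fetch": 0.18,
    "decode": 0.15,
    "execute": 0.20,   # slowest stage, so it is the critical path here
    "writeback": 0.12,
}


def max_clock_ghz(delays: dict) -> float:
    """Clock period must cover the slowest stage; 1 / ns gives GHz."""
    return 1.0 / max(delays.values())


baseline = max_clock_ghz(stage_delay_ns)

# A "leaner" decoder that is faster but was never the critical path.
leaner_decode = dict(stage_delay_ns, decode=0.10)

# A decoder slow enough to become the critical path.
slower_decode = dict(stage_delay_ns, decode=0.25)

print(f"baseline:      {baseline:.2f} GHz")
print(f"leaner decode: {max_clock_ghz(leaner_decode):.2f} GHz")
print(f"slower decode: {max_clock_ghz(slower_decode):.2f} GHz")
```

In this toy model the decode logic only limits frequency in the third case, which is exactly the situation question 1 asks about.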

Well, first of all, we all know that the x86 is buggy, with lot of bugs. For instance, the FDIV bug.

All ASICs have bugs. Neither of those that you mentioned were due to "bloat" in the decode logic. They don't support your point.

Intel and AMD's x86 processors generally have fewer errata than other vendors in my experience.

Really, I dont get it. How can anyone (except people at MS) ask me why bloat is a problem? Are you serious?

Your argument is to label features you don't find useful as "bloat", and then generalize all "bloat" as having specifically detrimental effects. Some "bloat" causes issues, but not all does. You are claiming that supporting legacy instructions *specifically* is causing significant increases in die area, but your "proof" is only through generalizations.

You never try to optimize anything you construct, because removing 30 million transistors here and there, is not a problem?

If they aren't slowing down the design, aren't consuming significant area, don't have any known bugs, and are providing a useful (albeit infrequently used) feature why would you "optimize" the design out?

http://en.wikipedia.org/wiki/Circuit_complexity

I don't need a wikipedia link to a page on circuit complexity. How much experience do you have doing integrated circuit design? My designs have been taped out in dozens of billion+ transistor products; ranging from CPUs, GPUs, supercomputers, telecom, datacom, etc. - in multiple leading edge submicron processes.

All your arguments are generalizations without any specific data or analysis.

And yes, the x86 architecture IS buggy and inefficient - and that IS a problem.

What do you mean buggy and inefficient? Provide some data on this please.

How many errata does Intel's and AMD's x86 microarchitecture have compared to ARM, PowerPC, etc? How severe are the errata? What is the comparative efficiency per watt? per area? etc.? You have provided no data on this, and yet are making claims.

jimmyb · Jan 1, 2014

Red Falcon said:
brutalizer hasn't made any personal attacks, and has been quite logical and reasonable throughout this whole discussion.

He also hasn't substantiated any of his claims about hardware design.

mikeblas · Jan 1, 2014

Red Falcon said:
brutalizer hasn't made any personal attacks, and has been quite logical and reasonable throughout this whole discussion.

Imagine that you work (or have worked) at IBM or Microsoft, and re-read his posts. Then, tell me what you think. People who don't agree with him "aren't experienced", on top of it. Rhetoric like this might not be precisely what you think of as a personal attack, but to people who are personally involved in the technologies or companies in question (as I have been) it certainly is an unpleasant and ineffective way to make a point.

Red Falcon said:
It would be nice if you actually answered his questions, as I am interested in this information as well.

The questions about cache size, book architecture, and book count are all answered in the redbook. The redbook can be freely downloaded from the IBM website. It's quite well written, though it uses terms that most PC users aren't too familiar with (like "book" for a CPU card, and a stack of acronyms specific to the Z-series architecture).

Section 2.2.1 explains that four books can be in the processor cage. Section 2.2 explains how physical memory is arranged, and shows a block diagram of a book. Section 2.4 draws a functional diagram of the processors and their cache, plus the off-chip cache on the book itself.

Section 2.4.2 explains some of the processor's technical design characteristics, including the chip-level features that x86-class machines simply don't have.

Section 2.4.3 describes Processor Units (cores, essentially -- this is another reason why it's natural to assume "CPU" means "core" instead of "socket") that are available on the different configurations.

The off-package L4 cache is discussed in Sections 2.4.4 and 2.4.5.

Section 2.5 explains how the system addresses memory. Note that the systems can be configured with 3 terabytes of memory.

Red Falcon · Jan 2, 2014

^ Great info, thank you for the link.

niomosy · Jan 2, 2014

brutalizer said:
They are a failure, because you can software emulate a big Mainframe on a x86. If Mainframes were that much faster, then it would not be possible to use software emulation to replace a Mainframe.

I want to see a bit more dive-down on this from our mainframe-familiar people. Yes, you can emulate a mainframe on PC hardware and get CPU-based performance easily (I've run Hercules with the available OSes at home to play around). I work with an ex-mainframer who put VSE on Hercules and supported that in production for a shop that needed a small mainframe for some code still there.

Yet what I recall about the mainframes is less the CPU importance and more the I/O importance. How is the PC going to be at handling the I/O loads that a mainframe would handle for large financial organizations, for example?

mikeblas · Jan 2, 2014

niomosy said:
How is the PC going to be at handling the I/O loads that a mainframe would handle for large financial organizations, for example?

x86 hardware can't handle the IO as well. It also doesn't have the redundancy, fail-safe, and security features the mainframe hardware has. There are some Itanium systems that are competitive, but I don't know of any contemporary x86 gear that's close. (There have been some entries into the market -- the last I know about was the HP Superdome boxes, and the Unisys ES7000-series, but I don't keep up to date on that market much anymore.)

I think the conclusion that the ability to emulate one on the other means the other is superior in all aspects is quite flawed.

FLECOM · Jan 3, 2014

brutalizer said:
Read the part about SMP workloads here, and read the SGI interview at the bottom for more information on workloads suitable on huge servers, instead of clusters. You can not run SMP workloads, on a HPC cluster.
http://hardforum.com/showpost.php?p=1040393845&postcount=41

that's all nice and generic but name an ACTUAL task that these things do that a linux cluster cannot do (and that apparently would not require redundancy)

not "SMP" tasks vs non-smp tasks that's nice, but not an answer, I mean REAL applications

niomosy · Jan 4, 2014

FLECOM said:
that's all nice and generic but name an ACTUAL task that these things do that a linux cluster cannot do (and that apparently would not require redundancy)

not "SMP" tasks vs non-smp tasks that's nice, but not an answer, I mean REAL applications

I would guess OLTP databases as an opener since he's going for HPC versus non-HPC.

Red Falcon · Feb 3, 2014

FLECOM said:
that's all nice and generic but name an ACTUAL task that these things do that a linux cluster cannot do (and that apparently would not require redundancy)

I too would like to know this, as it is an interesting point.

Ins0mnyteq · Feb 17, 2014

I just started working with a guy here in NC who works for IBM directly on this project. Some of the info he gave me led me to believe these chips were going to be the future of computing. He cited various facts and math that I don't know anything about, and seemed super passionate about it. He told me that they were planning on selling the x86 side of IBM so they could compete with Intel.

Red Falcon · Feb 17, 2014

^ If that's true, that would make things very interesting.
If you find out anything more, let us know!

FrankD400 · Feb 17, 2014

I used to test/diagnose POWER6/POWER7 mainframes (which are NOT all on zArchitecture, though I did work with z9/z10/zEnterprise with z196 MCMs) .. the specs on those chips were insane, especially POWER7. I'd kill for a POWER8 box at home!

I did mainly 9119-FHA, 9119-FHB and 9125-F2C. More commonly known as Power595, Power795, and Power 775. 775 was especially ridiculous. I really should've been mining BTC on them.. they did have OpenCL libraries.

iSeries software on Power595/795 was a pain in the junk. How do people use that crap?

The big caches on z196 are eDRAM by the way.

FrankD400 · Feb 17, 2014

Edit: IBMer informed me I was wrong about the L4, it's on the SC chips at 96MB per. I have to bug him for some documentation

He says the L3 is eDRAM but he's not 100% sure about the L4 on the SCs. I did waaaay more pSeries/775 than I did with zSeries, guess my memory got a little foggy!

http://www.vosizneias.com/wp-content/uploads/2010/09/savbg.jpg

There's a picture from Poughkeepsie, that looks like either the node room in Building 007 or Mariner test in Building 052. The 775 supercomputer is/was referred to internally as 'Mariner'.

Possibly interesting side note: We tested the 775 nodes (a super wide/long 2RU high box with water-cooled RAM, MCMs, and optical hubs) with 8400GS cards in the PCIE slots, save for 2 slots with SAS cards which were hooked up to drives bolted to a shelf powered by a Corsair PSU of all things. No disk IO drawers at that stage..

Ins0mnyteq · Feb 18, 2014

Red Falcon said:
^ If that's true, that would make things very interesting.
If you find out anything more, let us know!

I have another meeting with him tomorrow to plan the up-fit on his new building (he runs a property management company as a 2nd job... wtf). I planned on asking him a bunch about it. The way he described it reminded me of when I discovered computers as a kid. I'll try and pry some hard facts from him, so that I can actually add to the thread lol

FLECOM · Feb 22, 2014

FrankD400 said:
I used to test/diagnose POWER6/POWER7 mainframes (which are NOT all on zArchitecture, though I did work with z9/z10/zEnterprise with z196 MCMs) .. the specs on those chips were insane, especially POWER7. I'd kill for a POWER8 box at home!

I did mainly 9119-FHA, 9119-FHB and 9125-F2C. More commonly known as Power595, Power795, and Power 775. 775 was especially ridiculous. I really should've been mining BTC on them.. they did have OpenCL libraries.

iSeries software on Power595/795 was a pain in the junk. How do people use that crap?

The big caches on z196 are eDRAM by the way.

is this post in english?

Red Falcon · Feb 22, 2014

FLECOM said:
is this post in english?

It's nerdy as hell, but the guy is totally [H]ard, nonetheless!

FrankD400 · Feb 22, 2014

http://www.flickr.com/photos/glennklockwood/11095625885/in/photostream/

Found that, check out the 775 node/water cooled DIMMS -- dem memories! Dat nerd pr0n! 94TFLOPS in a rack seems like a pittance when we have 5TFLOPS video cards, but they were real ALU/decoders .. wonder how fast it could encode x264 on its general purpose hardware

niomosy · Feb 24, 2014

FLECOM said:
that's all nice and generic but name an ACTUAL task that these things do that a linux cluster cannot do (and that apparently would not require redundancy)

not "SMP" tasks vs non-smp tasks that's nice, but not an answer, I mean REAL applications

FrankD400 said:
I used to test/diagnose POWER6/POWER7 mainframes (which are NOT all on zArchitecture, though I did work with z9/z10/zEnterprise with z196 MCMs) .. the specs on those chips were insane, especially POWER7. I'd kill for a POWER8 box at home!

I did mainly 9119-FHA, 9119-FHB and 9125-F2C. More commonly known as Power595, Power795, and Power 775. 775 was especially ridiculous. I really should've been mining BTC on them.. they did have OpenCL libraries.

iSeries software on Power595/795 was a pain in the junk. How do people use that crap?

The big caches on z196 are eDRAM by the way.

As a Unix guy, I would never call any POWER-based system a mainframe. I reserve that, within IBM, for the z/Series servers. It's really odd to hear a non-z/Series IBM referred to as a mainframe unless they're referring to z/Series predecessors like the 390, 370, and 360 architectures.

Those POWER7 systems can get really massive but even at that, this is the first time I've heard them called mainframes.

FrankD400 · Feb 26, 2014

IBM calls them mainframes internally in most departments, even though they're not of 360/390 descent. I think it kind of sticks in IBM because of the culture and the fact that the computational/IO heart sits in a big rack (a frame). At a high level they're much the same, with a large midplane for interbook IO, very high availability, hotswappable books, and an IO orientation bias. They're all expensive as hell, even if the z is pricier.

What POWER systems do you work with?

niomosy · Feb 26, 2014

FrankD400 said:
IBM calls them mainframes internally in most departments, even though they're not of 360/390 descent. I think it kind of sticks in IBM because of the culture and the fact that the computational/IO heart sits in a big rack (a frame). At a high level they're much the same, with a large midplane for interbook IO, very high availability, hotswappable books, and an IO orientation bias. They're all expensive as hell, even if the z is pricier.

That seems really odd that IBM thinks of them internally as mainframes. Now if they'd just release a mainframe ADCD hobbyist package so I could play with z/OS and z/VM on an officially blessed mainframe emulator, that would be great.

FrankD400 said:
What POWER systems do you work with?

These days, mostly p740 and p750 systems running AIX. We've still got some P6 595s left but are migrating to p750 systems mostly. I'm hoping we pick up some P8s eventually. Then we'll go through moving all the P7 stuff to P8 hardware.

FrankD400 · Feb 26, 2014

Well if you compare a zEnterprise and a 795 at a physical level they're not all that different. Nodes/books with some MCMs and RAM that communicate over a wide midplanar and connect to a crapload of IO via Infiniband. You use Tres drawers on your 595s?

niomosy · Feb 27, 2014

No tres drawers for us. Storage is fibre channel to NetApp mostly.

Deleted member 82943 · Apr 25, 2014

http://www.extremetech.com/computing/181102-ibm-power8-openpower-x86-server-monopoly

The OpenPOWER initiative is interesting. Apparently they are trying to oust Intel from the datacenter.

niomosy · May 1, 2014

Unseating Intel from the data center is going to be a rather difficult thing. Good luck, IBM, you're going to need it.

brutalizer · Jun 25, 2014

Earlier I talked about reading a link about Mainframe cpus using a lot of cache, and I stumbled on this link again! Here it is! So now we can continue the discussion. Could someone help me interpret these numbers, and please no "read this book if you want to know". Has somebody read the book and can help me out here?
http://www.tomshardware.com/news/z196-mainframe-microprocessor,11168.html
"Additionally, a 4-node system uses 19.5MB of SRAM for L1 private cache, 144MB for L2 private cache, 576MB of eDRAM for L3 cache, and 768MB of eDRAM for a level-4 cache."

I interpret this "4-node system" as a Mainframe with four sockets. Hence it has
19.5 + 144 + 576 + 768 = 1507.5MB cache in total. That is, about 377MB of cache per socket. It all revolves around what a "4-node system" is. If it is a 4-socket system, then I think most people agree that closer to half a GB of cache per socket is hilarious. But if it is not 4 sockets, I am willing to reevaluate my standpoint. No problem.
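For anyone who wants to check the arithmetic, here is a minimal sketch that sums the four figures from the quoted article instead of hard-coding a total. The divisor of 4 is only the "4-node means 4 sockets" reading described above, which the article itself does not confirm, and the names `cache_mb` and `units` are made up for illustration.

```python
# Cache sizes in MB, copied from the Tom's Hardware quote about a
# "4-node system".
cache_mb = {"L1": 19.5, "L2": 144.0, "L3": 576.0, "L4": 768.0}

# Assumption: the system splits evenly into 4 units (the "4 sockets"
# interpretation). Change this if a node turns out to hold several chips.
units = 4

total_mb = sum(cache_mb.values())
per_unit_mb = total_mb / units

print(f"total cache: {total_mb} MB")
print(f"per unit when split {units} ways: {per_unit_mb} MB")
```

If a node is really a book holding several processor chips, only `units` changes and the per-chip figure shrinks accordingly.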

jimmyb said:
He also hasn't substantiated any of his claims about hardware design.

My claims about hardware design are as follows. In engineering, you should try to avoid bloat. That is a general principle, and it is very important to keep unnecessary bloat down. Any programmer knows that less bloat is superior to another source that uses way more code to do the same thing. Among other things, the attack surface for compromising the system is larger. There are more bugs in more code, etc etc. These things are obvious to any engineer.

FLECOM said:
that's all nice and generic but name an ACTUAL task that these things do that a linux cluster cannot do (and that apparently would not require redundancy) not "SMP" tasks vs non-smp tasks that's nice, but not an answer, I mean REAL applications

I posted a link from SGI. Read that again? Here it is for your convenience:
http://www.realworldtech.com/sgi-interview/6/
"...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time..."

So you will not see benchmarks on large enterprise systems performed on a cluster. For instance, such as this SAP benchmark done on an Oracle M6-32 SMP server:
https://blogs.oracle.com/BestPerf/entry/20140327_m6_32_sap_sd
All these servers on the benchmark are SMP servers, no clusters.

mikeblas said:
x86 hardware can't handle the IO as well. It also doesn't have the redundancy, fail-safe, and security features the mainframe hardware has. There are some Itanium systems that are competitive, but I don't know of any contemporary x86 gear that's close. (There have been some entries into the market -- the last I know about was the HP Superdome boxes, and the Unisys ES7000-series, but I don't keep up to date on that market much anymore.)

The reason the x86 hardware can not handle IO as well as Mainframes is simple: Mainframes have a lot of helper IO cpus. Imagine if you tucked that same amount of helper IO cpus on an x86 server...

mikeblas said:
I think the conclusion that the ability to emulate one on the other means the other is superior in all aspects is quite flawed.

Well in math it is not. I have a heavy math background, and in math it is like this: if you can emulate a machine with another well enough - but not the other way around - then one system is superior, and the other inferior. I think this sounds reasonable?

What is "the IBM mainframe myth"? I guess I'm not asking about that because it hasn't come up before this point.

Here is the Mainframe myth. I've posted this link before, and post it again for your convenience:
"Perpetuing myths about the zSeries"
http://www.mail-archive.com/[email protected]/msg18587.html

I've written plenty of code -- certainly enough to know that someone who's trying to minimize lines of code is working to minimize the wrong thing
Brutalizer: "In software development, reducing LoC is important."
Really, it isn't. No developer will spend a few days deleting code.

Forgive me for saying this, but I don't believe you have done a lot of development. It seems you only have some basic understanding of development. If we are going to talk about our resumes as you do, I have a double Masters, one in theoretical computer science from a top-ranked university and another in Math (not as good a university, but one of the best in my country), and I work with high frequency trading as a researcher at an investment bank.

And I can promise you, no developer I know would say these things you say. And I know that you have not studied higher math, because of the things you say. No mathematician would for instance, consider a proof that is more bloated as good as a smaller proof - mathematicians (and developers) consider bloat a very very bad thing. You do not. Hence you are not a real mathematician nor a real developer.

All developers try to refactor the code to make it smaller whenever they can. Here is for instance, a _real_ developer that deletes code at Apple. The Apple managers didn't consider less code as better, instead they encouraged bloat and made no fuss of bloat. Just like you who don't consider bloat a very bad thing. Maybe you are more of a manager these days? You are not a developer, that I can tell with certainty. This real developer considers more LoC a bad thing, just as I have said all the time. Talk to a real developer and they all say the same thing as I do - bloat is a very bad thing. Anyone saying the opposite is not a real developer.
http://www.folklore.org/StoryView.py?story=Negative_2000_Lines_Of_Code.txt

"In early 1982, the Lisa software team was trying to buckle down for the big push to ship the software within the next six months. Some of the managers decided that it would be a good idea to track the progress of each individual engineer in terms of the amount of code that they wrote from week to week. They devised a form that each engineer was required to submit every Friday, which included a field for the number of lines of code that were written that week.

Bill Atkinson, the author of Quickdraw and the main user interface designer, who was by far the most important Lisa implementor, thought that lines of code was a silly measure of software productivity. He thought his goal was to write as small and fast a program as possible, and that the lines of code metric only encouraged writing sloppy, bloated, broken code.

He recently was working on optimizing Quickdraw's region calculation machinery, and had completely rewritten the region engine using a simpler, more general algorithm which, after some tweaking, made region operations almost six times faster. As a by-product, the rewrite also saved around 2,000 lines of code.

He was just putting the finishing touches on the optimization when it was time to fill out the management form for the first time. When he got to the lines of code part, he thought about it for a second, and then wrote in the number: -2000.

I'm not sure how the managers reacted to that, but I do know that after a couple more weeks, they stopped asking Bill to fill out the form, and he gladly complied."

mikeblas · Jun 25, 2014

brutalizer said:
Well in math it is not. I have a heavy math background, and in math it is like this: if you can emulate a machine with another well enough - but not the other way around - then one system is superior, and the other inferior. I think this sounds reasonable?

I guess the problem is that I have a more practical definition of superiority.

Your car has an ECU on it which controls engine function. That ECU contains a computer which listens to sensors, makes some decisions, and controls the engine. That ECU can't emulate a large computing cluster, but the large computing cluster can certainly emulate the ECU.

Is the ECU superior to the computing cluster? It certainly is: nobody wants to buy a car that requires a genset to run a computing cluster that weighs a few orders of magnitude more than the car itself.

I might not argue that the cluster is more capable in most respects, but even then it's not capable of being portable or running in constrained heat or power environments. Things like capability and superiority are situational, not absolute; a fact that makes your comparisons of minicomputers to mainframes very brittle.

brutalizer said:
All developers try to refactor the code to make it smaller whenever they can. Here is for instance, a _real_ developer that deletes code at Apple. The Apple managers didn't consider less code as better, instead they encouraged bloat and made no fuss of bloat. Just like you who don't consider bloat a very bad thing.

I never said that bloat wasn't a bad thing. I just said that you're wrong in claiming that all developers try to actively refactor code; and that you're wrong when you claim smaller code is always better; and that you're wrong in your belief that minimization is most important.

fixedmy7970 · Jun 26, 2014

wait a sec.

when is larger code better than smaller code?

i was under the impression that size doesn't matter,

mikeblas · Jun 26, 2014

fixedmy7970 said:
wait a sec.

when is larger code better than smaller code?

i was under the impression that size doesn't matter,

You're asking about code size, but I think brutalizer is making claims about code complexity.

Code size matters at a small level. Larger code runs slower when it starts to not fit in cache. OTOH, larger code might be faster because it accesses memory less, implements a more efficient algorithm, or offers more features.
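A rough illustration of the "more efficient algorithm" case, as a sketch and not a benchmark: the binary search below is more lines than the linear scan, yet it inspects far fewer elements on sorted input. The function names, the step counters, and the test data are all made up for this example.

```python
def linear_search(items, target):
    """Shorter code: check every element in order.

    Returns (index, elements_inspected); index is -1 if not found.
    """
    inspected = 0
    for i, value in enumerate(items):
        inspected += 1
        if value == target:
            return i, inspected
    return -1, inspected


def binary_search(items, target):
    """Longer code: halve the search range each step (needs sorted input).

    Returns (index, elements_inspected); index is -1 if not found.
    """
    lo, hi = 0, len(items) - 1
    inspected = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        inspected += 1
        if items[mid] == target:
            return mid, inspected
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, inspected


# Hypothetical test data: one million sorted integers, worst case for the scan.
data = list(range(1_000_000))
target = 999_999

lin_index, lin_steps = linear_search(data, target)
bin_index, bin_steps = binary_search(data, target)
print(f"linear: {lin_steps} elements inspected, binary: {bin_steps}")
```

Counting inspected elements is only a proxy. Actual run time also depends on memory access patterns and cache behavior, which is the other half of the point above.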

Some of those changes also involve (or, at least, imply) adding complexity. While we measure code size in bytes (or maybe lines of code), there's no real consensus about how to measure complexity, so comparisons of complexity are harder than comparisons of size. More complex code is better when the complexity is required, and worse when the complexity is arbitrary or not required.

Claims like "The difference between an experienced programmer and a beginner, is that the experienced will use much less code to do the same thing as the beginner" or "All developers try to refactor the code to make it smaller whenever they can" simply don't reflect the reality of commercial software development.

jimmyb · Jun 26, 2014

brutalizer said:
My claims about hardware design are as follows. In engineering, you should try to avoid bloat. That is a general principle.

It is a *general* principle. That means you can't use it to back *specific* claims and comparisons about hardware.

You made a claim about how legacy instruction support was significantly slowing down mainframe and x86 CPUs. "Bloat" is not an argument. There are thousands of engineers doing actual analysis and research on CPU design and verification, and they aren't basing their arguments solely (or even substantially) on the concept of bloat.

jimmyb · Jun 26, 2014

brutalizer said:
No mathematician would for instance, consider a proof that is more bloated as good as a smaller proof

This just isn't true. Sometimes a longer and more complex mathematical proof provides greater intuition into *why* something is true. Lots of mathematicians will avoid proofs by exhaustion for this very reason, even though they can be quite concise.

brutalizer · Jun 30, 2014

jimmyb said:
This just isn't true. Sometimes a longer and more complex mathematical proof provides greater intuition into *why* something is true. Lots of mathematicians will avoid proofs by exhaustion for this very reason, even though they can be quite concise.

You are dead wrong on this. Smaller is better. Talk to any mathematician, you don't have to take my word for it. Just ask anyone who has studied math at a high level, not some college or undergrad level.

A shorter proof is considered more elegant than a bloated longer proof. A short clever proof is more beautiful, and math is all about beauty. Just ask any serious mathematician what beautiful math is. Mathematicians are obsessed with beauty. All serious mathematicians are.

This also applies to source code.
