Have we reached a dead-end in CPU microarchitecture performance?

butterfliesrpretty · Feb 19, 2018

Yeah we have to rethink cpu's at this point. Sandy Bride might actually be fine for another ten years for gaming alone unless something changes.

SvenBent · Feb 19, 2018

PhaseNoise said:
As for a document, you can google it yourself. Try something like "hardware threads", or "Hyperthreading" (HT) or "simultanteous multithreading" (SMT). Note that all are referring to threads. In hardware.

Of course microsoft is talking about OS threads. They make an OS.
AMD is using the term threads for a CPU because they design CPUs, and that's what we call it when designing CPUs. I'm sorry you don't care for the term, but it is what it is.

Software people tend to call them logical cores, hardware people call them threads or hardware threads. And that's pretty

accurate, since the HW thread is the unit which will execute a SW thread.

I did any anything mentions threads as being part of software directly or indirectly.
even Intels own description from hyperthreading. wiki etc etc

https://en.wikipedia.org/wiki/Thread_(computing)
https://www.intel.com/content/www/u...per-threading/hyper-threading-technology.html
https://en.wikipedia.org/wiki/Hyper-threading
https://en.wikipedia.org/wiki/Simultaneous_multithreading
https://msdn.microsoft.com/en-us/library/windows/desktop/ms684841(v=vs.85).aspx

etc etc etc

now please I've shown you my hand. please show me the documentation you where talking about otherwise you claim is just a moot point.

Software has threads. A thread goes to a core (logical or physical) calling the logical cores for threads make no sense

How would you know what I mean if i tell you that thread 7 is running on tread 3 ?
I'm a taking about software thread 7 on core 3 or software thread 3 on core 7?
There is a difference.

OutOfPhase · Feb 19, 2018

SvenBent said:
I did any anything mentions threads as being part of software directly or indirectly.
evne Intels own description from hyperthreading. wiki etc etc

Of course. A "thread" is a stream of instructions in both the OS and the HW. Both are tracking the things they need to in order to achieve a correct execution of that stream, and each manages a different set of resources to achieve that.

The OS generates a thread of execution which it schedules on the processor. From the CPU's perspective, this happens on a HW thread - the unit which executes a coherent stream of instructions. Now what happens after that is very implementation dependent.

The most common implementation as you know is the dual issue/decode to one grouped set of execution units ("physical core"). But that is by no means the only design in a highly threaded processor. It could easily feed N threads into a single large pool of common resources. There has been a lot of research on that front, and in many ways, this is how a GPU works. That's partially why we shy away from discussing "cores". It's really just threads on the top end, and some grouping of execution units on the bottom.

I apologize for being possibly overly pedantic on this point, but hey, I spent a couple decades on this exact construct (a long time before it was productized!), so I feel obligated to pipe up sometimes.

None of my nit-picking should detract from your correct points about core utilization vs. what people see in task manager.

Edit: You've updated your post. Please read the links you posted.
SMT - Simultaneous MultiTHREADING
HT - HyperTHREADING
Also, see docs for CPUID (https://en.m.wikipedia.org/wiki/CPUID), discussing thread/core/package heirarchy.
These refer to the CPU having multiple threads per core. I'm not sure how else to make the point. I AM a CPU designer. I know what terms we use. I worked on SMT for years.

pututu · Feb 19, 2018

My opinion is current CPU silicon based architecture/technology is hitting some asymptotic limits in terms of size, speed, power and scalability from performance and economic sense.

The future is probably something like this as one example. Perhaps we may not talk about bits but qubits, or perhaps not electrons but photons or perhaps some form combination of physical-biological computing.

Hopefully I will see at least one major evolutionary computing change in my lifetime!

stormy1 · Feb 19, 2018

PhaseNoise said:
The most common implementation as you know is the dual issue/decode to one grouped set of execution units ("physical core"). But that is by no means the only design in a highly threaded processor. It could easily feed N threads into a single large pool of common resources. There has been a lot of research on that front, and in many ways, this is how a GPU works. That's partially why we shy away from discussing "cores". It's really just threads on the top end, and some grouping of execution units on the bottom.

The larger the pool of resources the more costly a miss would be so that is not very practical for mixed work loads or multiple workloads on the same resources, it works for a gpu because of the nature of the work load.
You will quickly hit the point of diminishing returns when you have to flush that deep pool and refill it.

OutOfPhase · Feb 19, 2018

stormy1 said:
The larger the pool of resources the the more costly a miss would be so that is not very practical for mixed work loads or multiple workloads on the same resources, it works for a gpu because of the nature of the work load.
You will quickly hit the point of diminishing returns when you have to flush that deep pool and refill it.

Yes and no. The big penalty for a "miss" (if I'm understanding you correctly) is what drove the HT implementation in the P4. The damn pipeline was so long that misses just utterly destroyed performance. So, how do we improve the utilization of the internal processing units during this time? Feed them an alternate instruction stream which is hopefully not stalled as well. To be fair, the P4 had many issues, but this was a reasonable attempt at handling some of them.

With the top-heavy issue vs. execution you actually mitigate misses a lot, as the execution units are still likely to get used by alternate instruction streams even if one stalls. Now of course, this requires a specialized and very parallel workload - not at all suited to consumer usage. If that's what you meant, you're spot on, and that's why consumer CPUs are the way they are. I think this is what you are getting at.

But there is an emerging middle-ground between the very strong per-thread general usage desktop CPU and the specialized but massively parallel GPU. Particularly for some scientific tasks, there are some workloads which are quite parallel but weirdly branchy and eclectic for lack of a better word, and this makes it tough to use GPUs effectively. Some of the Xeon Phi stuff is in this direction.

stormy1 · Feb 19, 2018

PhaseNoise said:
Yes and no. The big penalty for a "miss" (if I'm understanding you correctly) is what drove the HT implementation in the P4. The damn pipeline was so long that misses just utterly destroyed performance. So, how do we improve the utilization of the internal processing units during this time? Feed them an alternate instruction stream which is hopefully not stalled as well. To be fair, the P4 had many issues, but this was a reasonable attempt at handling some of them.

With the top-heavy issue vs. execution you actually mitigate misses a lot, as the execution units are still likely to get used by alternate instruction streams even if one stalls. Now of course, this requires a specialized and very parallel workload - not at all suited to consumer usage. If that's what you meant, you're spot on, and that's why consumer CPUs are the way they are. I think this is what you are getting at.

But there is an emerging middle-ground between the very strong per-thread general usage desktop CPU and the specialized but massively parallel GPU. Particularly for some scientific tasks, there are some workloads which are quite parallel but weirdly branchy and eclectic for lack of a better word, and this makes it tough to use GPUs effectively. Some of the Xeon Phi stuff is in this direction.

Yes that is what I was getting at but its not just consumer loads, Virtual machine loads act the same way. Do something for this vm then do something for that one that requires a cache dump then go back and forth wash rinse and repeat.

There are a lot of tasks that will never be doable on gpu or a cpu that acts like a gpu but your are right there are some that are in the middle.
The xeon Phi stuff is a kludge to make something into something it is not.
It helps but it will never be as fast as a specifically designed processor.

Morlock · Feb 19, 2018

OP was TL;DR for me at the moment, but I gave this a moment's thought a few days ago so I'll throw in my tuppence:

I dunno about a dead end, but we seem to have hit a much shallower slope. I just spent more money to get only 40% (IIRC) more performance, according to at least one benchmark*, in my current CPU than I had in my previous CPU. I bought my previous CPU five years ago (again, IIRC).

5 years is not 18 months. I don't think Moore's Law is a law anymoore (hahahaha, get it?).

* Feel free to correct me: previous CPU was an i5-4440, new one is i5-8400.

Edit: re-checked my source:

http://cpu.userbenchmark.com/Compare/Intel-Core-i5-8400-vs-Intel-Core-i5-4440/3939vs1993

It says +45% effective speed, +60% average user bench, and +52% peak overclocked bench.

So let's be nice and use the 60% figure; that's still far short of a doubling every 18 months.

Edit2:

i5-4440 transistor count: 1.4m
i5-8400 transistor count: n/a.

Intel doesn't want us to know.

defaultluser · Feb 19, 2018

The short answer is "Yes."

The long answer is "the discussion will never be finished." Everyone has their own ideas of how to improve performance at "lowest cost," but very few have access s to a design team the size of AMD/Intel/Apple/Qualcomm.

You can improve IPC more slowly by improving efficiency of existing systems (Skylake), or just making things wider (Haswell).

Or you can improve IPC more quickly by using unproven designs, like AMD's Ryzen branch predictor. But unproven can sometimes backfire (Netburst trace cache, Bulldozer's shared front-end), so you have to use care before you commit.

juanrga · Feb 19, 2018

Revenant_Knight said:
Mostly because Intel hasn’t designed for clock speed in a long time. The Pentium 4 was originally designed to scale to 10GHZ. Thermal solutions at the time couldn’t keep up, and Intel changed their game plan. However, a non LN2 cpu wouldn’t be out of the question with modern coolers. What would drop though is the IPC ( due to the longer and longer pipelines required) and core count.

Prescott also had the ability to run some of its ALUs at double bass clock speed. Effectively, this meant parts of the core were running at 7.6GHZ. However, it had an 31 stage pipeline that ultimately had poor error prediction.

At the end of the day, it wouldn’t be very efficient. And with computers trying to get smaller and smaller, there is no point.

My original post was a bit tongue and cheek for those who recall the Prescott days.

Honestly, most cpu advances are in the cell phone market right now.

P4 was designed to scale to 10GHz because engineers believed that silicon would continue to scale following the classic models they used for former process nodes. But once traditional scaling laws for silicon ceased to apply, the whole CPU model was wrong and the physical couldn't never achieve those 10GHz achieved by the model.

There is no way to get 10GHz today with actual cooling solutions in current nodes. Even with LN2 it is hard to get 8GHz. We did hit a frequency wall, as is easily observed in the frequency graph that I have given in this same thread. This is not Intel problem, any other company is bound by the same wall, because it is a silicon issue, not a microarchitecture issue.

Sun/Oracle continue designing simpler cores to push for higher clocks but still they are limited to 5GHz.

juanrga · Feb 19, 2018

Shintai said:
The problem with high clock speeds is you need to have the entire clock domain in sync.

Indeed, and the maximum frequency achieved by a design is inversely proportional to the size. The larger the domain, the lower the frequency.

Filiprino · Feb 19, 2018

Amdah'ls Law graphs depending on parallelizable fraction: https://askldjd.com/2010/01/30/some-plots-on-amdahls-law/
Algorithms are usually 90%+ parallelizable. Typical HPC workloads are 99% parallel.
And people try to design algorithms to have 99% parallel fractions.

On top of that there's FPGAs.

Define me "better". 1% better? 5% better?

Runtime libraries manage resources dynamically to adapt the code. What you describe is old school programming. That is outdated, as I said.

juanrga said:
As explained before, adding more cores is limited by Amdahl law. This is not a programing model problem, but just a consequence of the sequential nature of some algorithms.

A big.LITTLE approach does not solve the performance problem, because the sequential portions of code will have to be executed on the big cores, whose performance will continue being limited by both frequency and IPC, as it happens on current cores.

Moreover, those heterogeneous approaches have the additional problem that the partition of silicon into big and small is static and made during the design phase, whereas different applications require different combinations of latency and throughput: one application would work better on a 16 BIG + 256 LITTLE configuration, whereas another application would work better on a 4 BIG + 1024 LITTLE configuration. If your heterogeneous CPU is 8 BIG + 512 LITTLE, then those two applications will run inefficiently compared to the respective optimal silicon cases.

Filiprino · Feb 19, 2018

PhaseNoise said:
It's not a programming model problem at all. It's that most of the problems we're solving are not widely parallel at all, even from a logical standpoint (totally independent of programming).

You either have problem sets with independent data, or you need to access common data atomically (from the perspective of core interaction) thus must serialize access. Decrying things used to serialize access as "old" or something is neither here nor there. It doesn't matter what that mechanism is, once you gate access (which you must, for data coherency), your parallelism is gone.
And sadly, most general purpose computing does not have a bunch of completely independent things to do.

Lock-free Linux kernel exists since a long time ago. Windows NT still struggles with kernel locks slowing down systems with high core counts.

You can design an algorithm and your data structures to access data without using locks. Coordination can be made only between relevant groups of cores instead of locking a full data structure.

And even better, you can use data dependencies to access data only when it is available instead of using mutexes on multiple data structures or making a team of cores to wait for the whole group to finish something.

OutOfPhase · Feb 19, 2018

Filiprino said:
Lock-free Linux kernel exists since a long time ago. Windows NT still struggles with kernel locks slowing down systems with high core counts.

You can design an algorithm and your data structures to access data without using locks. Coordination can be made only between relevant groups of cores instead of locking a full data structure.

And even better, you can use data dependencies to access data only when it is available instead of using mutexes on multiple data structures or making a team of cores to wait for the whole group to finish something.

I have no doubt there are examples of inefficient locking out there. That's not at all what I'm getting at. Even with zero-cost serialization techniques, you run into Amdahl's Law because of your problem space - not the locking mechanism. Tasks are not infinitely parallel, and when they are not, there will be an upper bound to how wide you can spread it.

For even a very parallel task which is 75% parallel, you're going to smash into diminishing returns right at 4 cores. This is logic, and independent of any implementations.

juanrga · Feb 19, 2018

Filiprino said:
Amdah'ls Law graphs depending on parallelizable fraction: https://askldjd.com/2010/01/30/some-plots-on-amdahls-law/
Algorithms are usually 90%+ parallelizable. Typical HPC workloads are 99% parallel.
And people try to design algorithms to have 99% parallel fractions.

I don't know from where you got those percentages but they are wrong.

Ghost of Cyrix · Feb 19, 2018

SvenBent said:
Did you mean threads or did you mean logical cores? it's different. CPU's does not have threads they have logical cores that can run a thread. It's a general misconception that most people do due to bad marketing and tech sites not really being that techie anymore.

if you did mean threads I would like to see that.
if you did mean logical cores (As seen in task manager, msi afterburner etc) you have to remember that just because the load is distributed among all cores does not mean the load benefits from it or is in anyway or shape multithreaded.
due to the simple fact that you are seeing an average OVER TIME.

-snip-

I meant logical cores/hardware threads and by "heavily-spread", I meant seemingly distributing the workload evenly across all 12 of the 8700K's logical cores and still pinning each of them to simultaneous, real-time, near-max usage. Like so (overall CPU usage in top left showing 88%):

In some other games, I'll see a uniform ~50-60% usage on each logical core, but upon disabling every other one in the task manager (effectively disabling hyperthreading), the remaining 6 non-HT cores are also simultaneously pushed to like 80-90%+ usage (and disabling any of those further results in performance losses of some kind). I assume that reported spread is just due to some limits of hyperthreading because as far as I can tell, a good number of modern games are indeed quite strongly multithreaded and will successfully attempt to use multiple cores simultaneously for something intensive (but not nearly all, as I mentioned in my OP; got plenty which hit one or two cores abnormally hard and barely seem to do much with the rest). And the gains, despite the reported usage, are of course not anywhere linear nonetheless.

stormy1 · Feb 20, 2018

Interesting thread.
Something to keep in mind is that IPC is workload dependent.
Just because there is no gain in your chosen test does not mean there is not an IPC gain in others.
That is a huge problem with using one cpu design across mobile, desktop, workstation and server.
Even just in servers alone there are several hugely different workloads.
You have to make a lot of compromises to make it work across all those workloads and variations that is going to hurt IPC in all of them.

OrangeKhrush · Feb 20, 2018

This thread has covered it rather well but the issues remains the inability of x86 to effectively scale parallelism well, this is why I think x86 will be phased out once a more parallel architecture arises. One where 128 cores/256 threads at 1ghz will handsomely outperform high frequency. what we need is a true scalar architecture to see notable CPU performance gains, x86 is completely inefficient at it.

From other standpoints coders and one particular coder on another forum was rather outgoing about the fact that coders like to do a little as possible, essential copy pasta, do little for good pay or do much more for that same pay seems inefficient use of skillsets so they adopt the mendoza line of just enough when it comes to coding on x86 level. It is why when someone says something is parallel and it really only uses 1 or 2 cores while everything else sits on low usage is a rather baffling thing, application test benches like blender and cinebench that are supposedly parallel still are rather poor in terms of using total threads available efficiently and this is coder level to, the excuse of "it will be a mountain upon a mountain of individual code stacked on more of the same is just to much work." At every bottleneck the big issue is human evolution, we remain a lazy species.

Shintai · Feb 20, 2018

The only alternative currently is IA64, else you have the exact same issue on every other uarch. But even then you wont really solve the issues either. Just postpone it a bit while you sit on 15-20 issue wide cores instead of 6-8 issue wide.

Armenius · Feb 20, 2018

pututu said:
My opinion is current CPU silicon based architecture/technology is hitting some asymptotic limits in terms of size, speed, power and scalability from performance and economic sense.

The future is probably something like this as one example. Perhaps we may not talk about bits but qubits, or perhaps not electrons but photons or perhaps some form combination of physical-biological computing.

Hopefully I will see at least one major evolutionary computing change in my lifetime!

Optical transistors are certainly interesting, but I'm afraid their fate is tied to the development of quantum computing which I believe is still decades off from being within practical application for us normies.

I think in the more immediate future they should stop racing to have the smallest node possible and come up with a solution to the underlying archaic x86 instruction set. One major sacrifice or another will need to be made to move computing forward in a significant way.

defaultluser · Feb 20, 2018

Shintai said:
The only alternative currently is IA64, else you have the exact same issue on every other uarch. But even then you wont really solve the issues either. Just postpone it a bit while you sit on 15-20 issue wide cores instead of 6-8 issue wide.

Well, obviously we need to go Mill. That solves all problem, by creating an entirely distinct set of new problems

There is no perfect architecture. I think there's still more efficiency to be gained by trying risky shit in existing x86 designs, because there's a billion different ways to design an x86 processor.

But it will still be slow-going, because such innovations don't happen overnight. All the easy shit has been conquered.

Dick Johnson · Feb 20, 2018

Dan_D said:
The problem isn't that Intel and AMD can't make CPU's that perform better. Raw performance is relatively easy to achieve. The problem, is that they can't do so while maintaining a reasonable power envelopes. On the desktop this isn't really a problem, but the desktop is now a niche market product and not worth Intel or AMD spending tons of money on it.

BINGO!

Desktop PC sales are steadily declining every year and the number of people who don't even own a desktop PC and who do everything on a phone or tablet is increasing every year. I don't like it, because I have zero interest in trying to use a phone or tablet as a real computer and less than zero interest in anything with a tiny screen, but unfortunately, that's what is happening.

After decades of performance increases and everything getting bigger and faster we're now going backwards. The "computer" used by most people today has a wimpy CPU, little RAM and a tiny screen. I am constantly amazed by this.

Shintai · Feb 21, 2018

Dick Johnson said:
BINGO!

Desktop PC sales are steadily declining every year and the number of people who don't even own a desktop PC and who do everything on a phone or tablet is increasing every year. I don't like it, because I have zero interest in trying to use a phone or tablet as a real computer and less than zero interest in anything with a tiny screen, but unfortunately, that's what is happening.

After decades of performance increases and everything getting bigger and faster we're now going backwards. The "computer" used by most people today has a wimpy CPU, little RAM and a tiny screen. I am constantly amazed by this.

Laptops sales are replacing desktop sales. Its not going to phones or tablets. Just PC mobility. Even in gaming laptops have overtaken desktops.

Anyone developing anything directly for the desktop got a loser case. It has to be servers or laptops.

memememe · Feb 21, 2018

was there more examples than video games?

Shintai · Feb 22, 2018

profiled said:
was there more examples than video games?

In terms of? Needing faster cores? Pretty much everything outside a sub 1% section that includes renders and encoders.

OutOfPhase · Feb 22, 2018

Shintai said:
In terms of? Needing faster cores? Pretty much everything outside a sub 1% section that includes renders and encoders.

And since those scale with faster cores too - it is literally everything which could ever be CPU bound which would like faster cores.

Have we reached a dead-end in CPU microarchitecture performance?

2[H]4U

2[H]4U

Supreme [H]ardness

[H]ard DC'er of the Year 2021

[H]ard|Gawd

Supreme [H]ardness

[H]ard|Gawd

Limp Gawd

[H]F Junkie

2[H]4U

2[H]4U

Limp Gawd

Limp Gawd

Supreme [H]ardness

2[H]4U

n00b

[H]ard|Gawd

[H]ard|Gawd

Supreme [H]ardness

Extremely [H]

[H]F Junkie

Weaksauce

Supreme [H]ardness

Limp Gawd

Supreme [H]ardness

Supreme [H]ardness