Yeah, we have to rethink CPUs at this point. Sandy Bridge might actually be fine for another ten years for gaming alone unless something changes.
As for a document, you can google it yourself. Try something like "hardware threads", "Hyper-Threading" (HT), or "simultaneous multithreading" (SMT). Note that all of them are referring to threads. In hardware.
Of course Microsoft is talking about OS threads. They make an OS.
AMD uses the term threads for a CPU because they design CPUs, and that's what we call it when designing CPUs. I'm sorry you don't care for the term, but it is what it is.
Software people tend to call them logical cores; hardware people call them threads or hardware threads. And that's pretty accurate, since the HW thread is the unit which will execute a SW thread.
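For what it's worth, here's a minimal C++11 sketch (just a toy, nothing vendor-specific) showing how software can ask how many of those hardware threads exist:

```cpp
#include <iostream>
#include <thread>

int main() {
    // std::thread::hardware_concurrency() reports the number of hardware
    // threads the implementation sees, e.g. 8 on a 4-core CPU with SMT.
    // It is only a hint and may return 0 if the value isn't computable.
    unsigned int hw_threads = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << hw_threads << "\n";
    return 0;
}
```

Notably, even the standard library names this "hardware concurrency", sidestepping the core-vs-thread naming fight entirely.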
Everything I've seen mentions threads as being part of software, directly or indirectly.
Even Intel's own description of Hyper-Threading, the wiki, etc., etc.
The most common implementation, as you know, is dual issue/decode into one grouped set of execution units (a "physical core"). But that is by no means the only design for a highly threaded processor. It could easily feed N threads into a single large pool of common resources. There has been a lot of research on that front, and in many ways this is how a GPU works. That's partially why we shy away from discussing "cores". It's really just threads on the top end, and some grouping of execution units on the bottom.
The larger the pool of resources, the more costly a miss would be, so that is not very practical for mixed workloads or for multiple workloads on the same resources. It works for a GPU because of the nature of the workload.
You will quickly hit the point of diminishing returns when you have to flush that deep pool and refill it.
Yes and no. The big penalty for a "miss" (if I'm understanding you correctly) is what drove the HT implementation in the P4. The damn pipeline was so long that misses just utterly destroyed performance. So, how do we improve the utilization of the internal processing units during this time? Feed them an alternate instruction stream which is hopefully not stalled as well. To be fair, the P4 had many issues, but this was a reasonable attempt at handling some of them.
With the top-heavy issue vs. execution you actually mitigate misses a lot, as the execution units are still likely to get used by alternate instruction streams even if one stalls. Of course, this requires a specialized and very parallel workload, not at all suited to consumer usage. I think that's what you are getting at, and if so, you're spot on: it's why consumer CPUs are the way they are.
But there is an emerging middle-ground between the very strong per-thread general usage desktop CPU and the specialized but massively parallel GPU. Particularly for some scientific tasks, there are some workloads which are quite parallel but weirdly branchy and eclectic for lack of a better word, and this makes it tough to use GPUs effectively. Some of the Xeon Phi stuff is in this direction.
Mostly because Intel hasn't designed for clock speed in a long time. The Pentium 4 was originally designed to scale to 10 GHz. Thermal solutions at the time couldn't keep up, and Intel changed their game plan. A non-LN2 high-clock CPU wouldn't be out of the question with modern coolers, though. What would drop is the IPC (due to the longer and longer pipelines required) and the core count.
Prescott also had the ability to run some of its ALUs at double the base clock speed. Effectively, this meant parts of the core were running at 7.6 GHz. However, it had a 31-stage pipeline that ultimately suffered from poor branch prediction.
At the end of the day, it wouldn’t be very efficient. And with computers trying to get smaller and smaller, there is no point.
My original post was a bit tongue-in-cheek for those who recall the Prescott days.
Honestly, most CPU advances are in the cell phone market right now.
The problem with high clock speeds is you need to have the entire clock domain in sync.
As explained before, adding more cores is limited by Amdahl's law. This is not a programming-model problem, but simply a consequence of the sequential nature of some algorithms.
A big.LITTLE approach does not solve the performance problem, because the sequential portions of code will have to be executed on the big cores, whose performance will continue being limited by both frequency and IPC, as it happens on current cores.
Moreover, those heterogeneous approaches have the additional problem that the partition of silicon into big and small is static and made during the design phase, whereas different applications require different combinations of latency and throughput: one application would work better on a 16 BIG + 256 LITTLE configuration, whereas another application would work better on a 4 BIG + 1024 LITTLE configuration. If your heterogeneous CPU is 8 BIG + 512 LITTLE, then those two applications will run inefficiently compared to the respective optimal silicon cases.
It's not a programming model problem at all. It's that most of the problems we're solving are not widely parallel at all, even from a logical standpoint (totally independent of programming).
You either have problem sets with independent data, or you need to access common data atomically (from the perspective of core interaction) thus must serialize access. Decrying things used to serialize access as "old" or something is neither here nor there. It doesn't matter what that mechanism is, once you gate access (which you must, for data coherency), your parallelism is gone.
And sadly, most general purpose computing does not have a bunch of completely independent things to do.
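A toy C++ sketch of that gating (a contrived shared counter, purely for illustration): eight threads all need the same datum, so a mutex serializes them and the extra cores mostly wait.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
long shared_total = 0;

// Every increment must go through the lock for coherency, so no matter
// how many threads we spawn, only one at a time is inside the critical
// section: the "parallel" program scales like a single core.
void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> gate(m);
        ++shared_total;
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return 0;
}
```

Time it with 1 thread vs. 8 and the wall clock barely moves; it can even get worse from lock contention and cache-line ping-pong.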
Lock-free code has existed in the Linux kernel for a long time. Windows NT still struggles with kernel locks slowing down systems with high core counts.
You can design an algorithm and your data structures to access data without using locks. Coordination can be limited to the relevant group of cores instead of locking a full data structure.
And even better, you can use data dependencies to access data only when it is available, instead of using mutexes on multiple data structures or making a team of cores wait for the whole group to finish something.
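As a toy example of the lock-free flavor (the same contrived counter as above, just without the mutex), C++11 atomics let the hardware serialize only the one word instead of gating a whole structure:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> counter{0};

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)
        workers.emplace_back([] {
            for (int j = 0; j < 100000; ++j)
                // One atomic read-modify-write; no thread ever blocks
                // holding a lock, so a stalled thread can't wedge the rest.
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : workers) t.join();
    std::cout << counter.load() << "\n";  // always 800000
    return 0;
}
```

Real lock-free structures (queues, RCU-style readers) are far hairier than this, but the principle is the same: forward progress doesn't depend on any one thread holding and releasing a lock.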
Amdahl's law graphs for various parallelizable fractions: https://askldjd.com/2010/01/30/some-plots-on-amdahls-law/
Algorithms are usually 90%+ parallelizable. Typical HPC workloads are 99% parallel.
And people try to design algorithms to have 99% parallel fractions.
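To put numbers on those fractions, here's a quick sketch of Amdahl's law, where the speedup for parallel fraction p on n cores is S(n) = 1 / ((1 - p) + p/n):

```cpp
#include <cstdio>

// Amdahl's law: if fraction p of the work is parallelizable,
// the best speedup on n cores is 1 / ((1 - p) + p / n).
double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double cores[] = {4, 16, 64, 256, 1024};
    for (double p : {0.90, 0.99})
        for (double n : cores)
            std::printf("p=%.2f  n=%5.0f  speedup=%6.2f\n", p, n, amdahl(p, n));
    return 0;
}
```

At p = 0.90 you cap out just under 10x no matter how many cores you throw at it; even at p = 0.99, 1024 cores buy only about 91x.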
Did you mean threads or logical cores? They're different. CPUs do not have threads; they have logical cores that can each run a thread. It's a common misconception, driven by bad marketing and tech sites not really being that techie anymore.
If you did mean threads, I would like to see that.
If you did mean logical cores (as seen in Task Manager, MSI Afterburner, etc.), you have to remember that just because the load is distributed among all cores does not mean the load benefits from it or is in any way, shape, or form multithreaded.
That's due to the simple fact that you are seeing an average OVER TIME.
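A contrived demo of that averaging effect: run this strictly single-threaded busy loop and watch Task Manager or htop. The scheduler migrates the one thread between cores, so over time every logical core shows some load even though nothing here is multithreaded.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    // One thread, one instruction stream. Any "load on all cores" you
    // see in a monitoring tool is just the OS moving this thread around,
    // averaged over the tool's sampling interval.
    volatile unsigned long long x = 0;
    auto end = std::chrono::steady_clock::now() + std::chrono::seconds(30);
    while (std::chrono::steady_clock::now() < end)
        x = x + 1;
    std::printf("%llu\n", x);
    return 0;
}
```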
Optical transistors are certainly interesting, but I'm afraid their fate is tied to the development of quantum computing, which I believe is still decades away from practical application for us normies. My opinion is that current silicon-based CPU architecture/technology is hitting some asymptotic limits in size, speed, power, and scalability, in both a performance and an economic sense.
The future is probably something like this, as one example. Perhaps we won't talk about bits but qubits, or not electrons but photons, or perhaps some combination of physical and biological computing.
Hopefully I will see at least one major evolutionary computing change in my lifetime!
The only alternative currently is IA-64; otherwise you have the exact same issue on every other uarch. But even then you won't really solve the issue either, just postpone it a bit while you sit on 15-20-issue-wide cores instead of 6-8-issue-wide ones.
BINGO! The problem isn't that Intel and AMD can't make CPUs that perform better. Raw performance is relatively easy to achieve. The problem is that they can't do so while maintaining a reasonable power envelope. On the desktop this isn't really a problem, but the desktop is now a niche market and not worth Intel or AMD spending tons of money on.
Desktop PC sales are steadily declining every year and the number of people who don't even own a desktop PC and who do everything on a phone or tablet is increasing every year. I don't like it, because I have zero interest in trying to use a phone or tablet as a real computer and less than zero interest in anything with a tiny screen, but unfortunately, that's what is happening.
After decades of performance increases and everything getting bigger and faster we're now going backwards. The "computer" used by most people today has a wimpy CPU, little RAM and a tiny screen. I am constantly amazed by this.
Were there more examples than video games?
In terms of what? Needing faster cores? Pretty much everything outside a sub-1% slice that includes renderers and encoders.