AMD possibly going to 4 threads per core

Yes, Intel does. Quite a bit better.



The comparison made above was SSE / SSE2 at introduction on the Pentium III / IV versus now. AVX is still relatively new across the board, but it provides significant benefits, it is absolutely usable in real-world applications, and developers are including it in their code, so it has statistical relevance now. Going forward, AVX performance will likely be a differentiator for compute code that is run on the CPU.

99% of what applications? I draw comparisons to the uptake of SSE and SSE2 for a reason: they were out for five or six years before they really came into their own and became a deciding factor for CPU performance. We're what, a few years in for AVX, and we're already seeing developer interest and commercial uptake? Yeah, that's the same path that SSE took, and we have no reason to believe that the market will not put it to use. To wit: AMD is including successive AVX improvements in Zen. AVX is statistically relevant.
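As a concrete illustration of what "including it in their code" looks like at the source level, here is a minimal sketch of hand-written AVX in C using compiler intrinsics (assumes GCC or Clang with -mavx; the function and array names are made up for illustration):

```c
#include <immintrin.h>  // AVX intrinsics (compile with -mavx)
#include <stddef.h>

// Add two float arrays 8 elements at a time using 256-bit AVX registers.
// For brevity, n is assumed to be a multiple of 8.
void add_arrays_avx(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);  // load 8 floats (unaligned is fine)
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&out[i], _mm256_add_ps(va, vb));  // 8 adds per instruction
    }
}
```

In practice most applications get this indirectly, via auto-vectorizing compilers or AVX-enabled libraries, which is a big part of how uptake spreads without every developer writing intrinsics.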

AVX is 8 years old and has been available on everything from Sandy Bridge / Bulldozer onward. AVX2 is 6 years old, available from Haswell / Excavator on. AVX512 is three years old and is limited to Xeon Phi accelerators and some very basic support on Skylake-X / Cannon Lake.

Nobody is claiming that it's not utilized, and AMD's implementation of AVX/AVX2 is as good as or better than Intel's. The outlier is AVX512, which doesn't share nearly the support base that the earlier 128/256-bit versions do.

So when you say that AVX performance for Intel eclipses AMD, that's only true of AVX512, a relatively new and (currently) mostly unused extension.

Excluding that extremely new extension, which is a drastic best case for Intel, shows AMD as equal or better. Hanging your hat on AVX512 performance as an indicator of either current or future performance is a poor choice, but yours to make.


There were few core-level and IPC-level improvements, but Intel has made pretty large improvements to the package. Higher average overclocks for enthusiasts, more cores per socket in every market including mobile, and lower-power parts for mobile are all improvements made during the tenure of 14nm Skylake releases. Note that AMD doesn't have a competitive mobile part that could hang with a two-generation-old Skylake 15W quad-core, let alone the new 10nm 15W Ice Lake quads, whose graphics match AMD's APUs :D.

When it comes to mobile, and entry-level desktops, SMT4 might actually be pretty useful in keeping cost and power draw down while still maintaining usability for productivity users.

Redirection. Nobody here is talking about mobile; it hasn't been a part of this conversation. But since we're talking about power, how about that power consumption!

I forget, is this one of those scenarios where power consumption matters or not? It's hard to keep track with you.

[attached graph: power consumption comparison]
 
In order to increase IPC, what needs to be done is to increase the pipelines. The more pipelines, the more code you can execute, and hopefully branch prediction gets it right. I'm not sure how many pipelines Ryzen has, as no amount of Googling shows it, but I'd imagine it's more like Netburst. Too many pipelines means the CPU becomes less efficient at executing code, so SMT and Hyper-Threading fix this by cutting the pipelines up to make more logical cores. I imagine AMD's next big CPU is going to have a lot more pipelines, and therefore 4 threads per core makes sense.
 
AVX is 8 years old and has been available on everything from Sandy Bridge / Bulldozer onward. AVX2 is 6 years old, available from Haswell / Excavator on. AVX512 is three years old and is limited to Xeon Phi accelerators and some very basic support on Skylake-X / Cannon Lake.

Nobody is claiming that it's not utilized, and AMD's implementation of AVX/AVX2 is as good as or better than Intel's. The outlier is AVX512, which doesn't share nearly the support base that the earlier 128/256-bit versions do.

So when you say that AVX performance for Intel eclipses AMD, that's only true of AVX512, a relatively new and (currently) mostly unused extension.

Excluding that extremely new extension, which is a drastic best case for Intel, shows AMD as equal or better. Hanging your hat on AVX512 performance as an indicator of either current or future performance is a poor choice, but yours to make.




Redirection. Nobody here is talking about mobile; it hasn't been a part of this conversation. But since we're talking about power, how about that power consumption!

I forget, is this one of those scenarios where power consumption matters or not? It's hard to keep track with you.

[attached graph: power consumption comparison]

Source of the graph, please. Was it a review somewhere, or some AMD fan posting on another forum?

Was that 3D rendering again? AMD, king of using the CPU for something better done on the GPU.
 
Source of the graph, please. Was it a review somewhere, or some AMD fan posting on another forum?

Was that 3D rendering again? AMD, king of using the CPU for something better done on the GPU.

It was posted by TheStilt in the same piece that has been linked multiple times in this thread (kind of seems like people should just go read the post.)
 
Says the person who hasn't even read the post. So sure, why not. It doesn't seem like you particularly care if the data is meaningful or not since you've already formed an opinion and, by God, you're going to stick to it!

No. I just don't like biased testing.

A fan post on another forum is questionable data. Choosing 3D rendering as the basis for a perf/watt test is biased stacking of the deck to get the results he wants to see.
 
Just like the SMT testing, eh?

If you are supposedly showing the "average" SMT advantage, and then you cherry-pick a bunch of benchmarks that benefit more than average while excluding those that don't, then yeah, that is also highly suspect. Many of the SMT tests that went into it weren't even real applications, but small synthetic benchmarks that showed extremely high SMT responses.
 
If you are supposedly showing the "average" SMT advantage, and then you cherry-pick a bunch of benchmarks that benefit more than average while excluding those that don't, then yeah, that is also highly suspect. Many of the SMT tests that went into it weren't even real applications, but small synthetic benchmarks that showed extremely high SMT responses.

To be fair, it's hard to "highlight" AMD's SMT advantage if you show benches on the below-average end.
Not everyone has time to dig up all the results and compile them into an easy-to-read post for you.
 
If you are supposedly showing the "average" SMT advantage, and then you cherry-pick a bunch of benchmarks that benefit more than average while excluding those that don't, then yeah, that is also highly suspect. Many of the SMT tests that went into it weren't even real applications, but small synthetic benchmarks that showed extremely high SMT responses.

There was no attempt to show the "average SMT advantage"; that's something you made up. They were a selection of tests chosen to show, in workloads that benefit from SMT, the difference in SMT yield between Coffee Lake-S, Skylake-X, and Zen 2. Picking a bunch of applications that don't scale with SMT or show performance regressions would be a pretty stupid thing to do in that case, don't you think? That would be like driving a couple of sports cars around in rush hour traffic and then trying to draw conclusions about their top speeds.
 
Source of the graph, please. Was it a review somewhere, or some AMD fan posting on another forum?

Was that 3D rendering again? AMD, king of using the CPU for something better done on the GPU.

All that and you don't even bother to read the thread.

Stilt is an overclocker / BIOS modder going way back. He does fairly involved reviews, and his IPC comparisons are very in-depth.

He's not a fan, and instead of reading the thread, you dismiss any info you don't agree with out of hand.
 
All that and you don't even bother to read the thread.

Stilt is an overclocker / BIOS modder going way back. He does fairly involved reviews, and his IPC comparisons are very in-depth.

He's not a fan, and instead of reading the thread, you dismiss any info you don't agree with out of hand.

That doesn't stop him from choosing biased tests. If you base your perf/watt on something AMD is known to have the upper hand at (3D rendering), then the perf/watt results are simply going to reflect the advantage that AMD has at 3D rendering.
 
He should have chosen things like Powerpoint and Chrome and run them on laptops! Real world performance!

Snowdog is going to be really confused if Zen 3 Epycs come with SMT4 and reviewers focus a lot on multithreaded performance.
 
SMT4 sounds good for highly scalable software, but bad for multithreaded software with a low number of threads, AKA games.
It's just one more way for threads to end up on the same physical core rather than each getting a full physical core.


But hey, no worries. As soon as this is established as true, I will update Project Mercury to handle SMT4 as well, for max gaming performance.
 
SMT4 sounds good for highly scalable software, but bad for multithreaded software with a low number of threads, AKA games.
It's just one more way for threads to end up on the same physical core rather than each getting a full physical core.


But hey, no worries. As soon as this is established as true, I will update Project Mercury to handle SMT4 as well, for max gaming performance.

I believe the Windows scheduler is aware of the difference between physical and logical cores and will attempt to distribute threads among cores evenly to avoid that sort of thing.
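Whether the scheduler actually uses that information is the question being debated here, but Windows does expose the physical-versus-logical distinction to anyone who asks. A minimal C sketch (error handling omitted; uses the classic GetLogicalProcessorInformation API rather than the newer Ex variant) that counts both:

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    // First call fails by design and reports the required buffer size.
    GetLogicalProcessorInformation(NULL, &len);
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
    GetLogicalProcessorInformation(info, &len);

    int cores = 0, logical = 0;
    for (DWORD i = 0; i < len / sizeof(*info); i++) {
        if (info[i].Relationship == RelationProcessorCore) {
            cores++;
            // Each set bit in the mask is one logical processor on this core.
            for (ULONG_PTR m = info[i].ProcessorMask; m; m >>= 1)
                logical += (int)(m & 1);
        }
    }
    printf("%d physical cores, %d logical processors\n", cores, logical);
    free(info);
    return 0;
}
```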
 
That doesn't stop him from choosing biased tests. If you base your perf/watt on something AMD is known to have the upper hand at (3D rendering), then the perf/watt results are simply going to reflect the advantage that AMD has at 3D rendering.
I agree that it's not indicative of average real-world performance, but that doesn't mean it is biased. It is simply a best-case scenario for multithreaded applications, which AMD just happens to be good at right now.

Same reason people do CPU benchmarks on old games like Lost Planet at 720p. It is a test to show what the theoretical max performance (or max perf/watt) would be in the best-case scenario.
 
I believe the Windows scheduler is aware of the difference between physical and logical cores and will attempt to distribute threads among cores evenly to avoid that sort of thing.
Only since 1903.
And when SMT4 comes out, it might take MS another 2 years to fix their scheduler.
 
I believe the Windows scheduler is aware of the difference between physical and logical cores and will attempt to distribute threads among cores evenly to avoid that sort of thing.

Every test shows otherwise, and nothing from Microsoft supports it doing anything besides round-robin distribution, with some exceptions to avoid thread-priority deadlocks (a high-priority thread waiting on data from a low-priority thread that is starved due to high activity on a middle-priority thread), and with the exception of the CCX-related update in Win10 1903, which I have not tested or seen tested.
Do you have anything that supports the claim that it does take this into account?

This takes less than 5 minutes to test with 7-Zip.
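For reference, a rough C sketch of that kind of manual test (the busy-work loop is a stand-in for a real workload like a 7-Zip benchmark, and the mask assumes the common layout where SMT siblings are enumerated as adjacent pairs, which can vary by CPU and firmware). Run it once pinned and once unpinned and compare the times:

```c
#include <windows.h>
#include <stdio.h>

// Busy-work stand-in for a real workload such as a 7-Zip benchmark run.
static DWORD WINAPI worker(LPVOID arg)
{
    (void)arg;
    volatile unsigned long long x = 0;
    for (unsigned long long i = 0; i < 2000000000ULL; i++)
        x += i;
    return 0;
}

int main(int argc, char **argv)
{
    // Mask 0x55 = binary 01010101 = logical CPUs 0, 2, 4, 6: one thread per
    // physical core on a typical 4-core/8-thread part, assuming siblings
    // pair up as (0,1), (2,3), and so on.
    if (argc > 1)
        SetProcessAffinityMask(GetCurrentProcess(), 0x55);

    HANDLE t[4];
    DWORD start = GetTickCount();
    for (int i = 0; i < 4; i++)
        t[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(4, t, TRUE, INFINITE);
    printf("elapsed: %lu ms\n", (unsigned long)(GetTickCount() - start));
    return 0;
}
```

If the unpinned run is noticeably slower, the scheduler placed at least some of the four threads on SMT siblings instead of separate physical cores.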
 
Only since 1903.
And when SMT4 comes out, it might take MS another 2 years to fix their scheduler.

I believe the mentioned update was for CCX only, not SMT. (I might be wrong.)


But I'm updating a test system to 1903 as we speak to retest it
 
I agree that it's not indicative of average real-world performance, but that doesn't mean it is biased. It is simply a best-case scenario for multithreaded applications, which AMD just happens to be good at right now.

Same reason people do CPU benchmarks on old games like Lost Planet at 720p. It is a test to show what the theoretical max performance (or max perf/watt) would be in the best-case scenario.

It needs some heavy asterisks then. Perf/watt in general usage and in 3D rendering are two different things. 3D rendering on a CPU is such a fringe activity that it is really more like a synthetic benchmark for most people.

Yay! Perf/watt champion at something only a tiny minority of people do.
 
In order to increase IPC, what needs to be done is to increase the pipelines.

It's been a while since it mattered, but 'pipeline stages' are what I think you're getting at. Pipelines might be... number of threads? In any case, Netburst had an absurd number of pipeline stages, as it was designed for high-bandwidth media processing, something it actually was very good at. Where it failed was when the branch predictor missed and the whole pipeline had to be restarted. This negated Netburst's clockspeed advantage for the most important work a CPU does: chewing through branching code. Thus, while Intel wasn't uncompetitive, and they certainly had a better platform around their CPUs at first, once AMD and their partners stabilized the platform and then moved the memory controller onboard, it was game over for Netburst. Except for Blu-ray players. The first ones ran Pentium IVs, because they were good at that!

With respect to 'more pipelines', if you mean more execution units per core, then that can increase IPC when properly balanced. However, that balance is important, because increasing execution units only works if they can be fed by the rest of the CPU and the system. Typically, more cache is needed, better resource allocators are needed, and even more threads given to the OS as in SMT4.

And that brings up a central question: is AMD going to outfit their SMT4-enabled cores for a variety of workloads? If they are, could that be useful to the consumer / outside of the enterprise? If not, are they going to keep the 'leaner' cores they have now and continue to improve them for the consumer?
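The branch-miss cost described above is easy to demonstrate with the classic sorted-versus-unsorted experiment: the identical branchy loop typically runs much faster over sorted data, purely because the predictor stops missing. A small C sketch (note that at higher optimization levels the compiler may replace the branch with a conditional move and hide the effect):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

// Sum only the large elements; the `if` is the branch under test.
static long long branchy_sum(const int *data)
{
    long long sum = 0;
    for (int i = 0; i < N; i++)
        if (data[i] >= 128)  // ~50/50 and unpredictable on random data
            sum += data[i];
    return sum;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = branchy_sum(data);        // random order: frequent mispredicts
    clock_t t1 = clock();

    qsort(data, N, sizeof *data, cmp_int);   // same data, now sorted
    long long s2 = branchy_sum(data);        // predictor is nearly perfect here
    clock_t t2 = clock();

    printf("unsorted: %ld ticks, sorted: %ld ticks (sums %lld == %lld)\n",
           (long)(t1 - t0), (long)(t2 - t1), s1, s2);
    free(data);
    return 0;
}
```

The deeper the pipeline, the more work each of those mispredicts throws away, which is exactly why Netburst's very long pipeline hurt it on branchy code.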
 
It's been a while since it mattered, but 'pipeline stages' are what I think you're getting at. Pipelines might be... number of threads? In any case, Netburst had an absurd number of pipeline stages, as it was designed for high-bandwidth media processing, something it actually was very good at. Where it failed was when the branch predictor missed and the whole pipeline had to be restarted. This negated Netburst's clockspeed advantage for the most important work a CPU does: chewing through branching code. Thus, while Intel wasn't uncompetitive, and they certainly had a better platform around their CPUs at first, once AMD and their partners stabilized the platform and then moved the memory controller onboard, it was game over for Netburst. Except for Blu-ray players. The first ones ran Pentium IVs, because they were good at that!

With respect to 'more pipelines', if you mean more execution units per core, then that can increase IPC when properly balanced. However, that balance is important, because increasing execution units only works if they can be fed by the rest of the CPU and the system. Typically, more cache is needed, better resource allocators are needed, and even more threads given to the OS as in SMT4.

And that brings up a central question: is AMD going to outfit their SMT4-enabled cores for a variety of workloads? If they are, could that be useful to the consumer / outside of the enterprise? If not, are they going to keep the 'leaner' cores they have now and continue to improve them for the consumer?


Agreed, especially on the number of pipelines. Having more does give more IPC for a given core, but you still need the instructions to feed them within the same thread (software). At some point you hit diminishing returns, and you are just wasting transistors that would have been better used for more cores with fewer pipelines. It's all a balancing game.
 
Only since 1903.
And when SMT4 comes out, it might take MS another 2 years to fix their scheduler.

I wouldn't say that Microsoft's scheduler needs "fixing"; it would just need to be made aware of the new hardware architecture.

But UMA and NUMA have been around a while now; implementing optimizations for a new hardware layout would not be difficult, as they are familiar with that already. It's what gave rise to needing a scheduler.
 
It needs some heavy asterisks then. Perf/watt in general usage and in 3D rendering are two different things. 3D rendering on a CPU is such a fringe activity that it is really more like a synthetic benchmark for most people.

Yeah... though since we're talking largely about commercial uses, doing compute-heavy work on the CPU is either a stop-gap to using dedicated hardware, done out of convenience / laziness, or done because using dedicated hardware would incur other penalties, e.g. latency where the result is needed for further decision-making on the CPU.

To wit: if either Intel or AMD (or anyone else) wanted to make a CPU better at rendering using its own ISAs, they could -- but watt-for-watt it likely wouldn't be in the same league for what CPUs are supposed to do. Essentially, AMD has struck a different balance with Zen versus Skylake, which is cool, but as seen in real-world benchmarks their balance has no meaning to the consumer on a single-core basis. AMD wins on cost, in that they provide close performance on the same number of cores as Intel at a lower cost of entry, and more cores for significantly less, though that's largely not useful to consumers.

In either case, unless one needs HEDT connectivity, even more cores, or the fastest available single-core performance, on the desktop AMD now gets the recommendation.
 
It needs some heavy asterisks then. Perf/watt in general usage and in 3D rendering are two different things. 3D rendering on a CPU is such a fringe activity that it is really more like a synthetic benchmark for most people.

Yay! Perf/watt champion at something only a tiny minority of people do.

What do you define as general usage?
How you use your computer is most likely not the same as how I use mine. Most of my CPU time is spent on highly threaded software, so your suggestion would be biased from my perspective if I applied the same viewpoint.
This is nothing new: benchmarks measure what you benchmark.
That you don't know how to relate it properly to your usage is on you, not the test.

If I use a screwdriver as a hammer, it's not a bad hammer. I'm just an idiot.
 
It needs some heavy asterisks then. Perf/watt in general usage and in 3D rendering are two different things. 3D rendering on a CPU is such a fringe activity that it is really more like a synthetic benchmark for most people.

Yay! Perf/watt champion at something only a tiny minority of people do.

https://www.anandtech.com/show/14605/the-and-ryzen-3700x-3900x-review-raising-the-bar/19
https://www.thefpsreview.com/2019/09/05/amd-ryzen-7-3700x-cpu-review/4/

All you have to do is look at their overall power usage under full load. The Zen 2 chips pull a lot less power than Zen+ and perform on par with or better than Intel. There is really no doubt that AMD right now has Intel beat on performance per watt in ALL workloads. It's not like it's hard to go find Zen 2 reviews that include power consumption. It's pretty easy to see the 3700X draws significantly less power while being within a few percentage points of Intel on basically everything. The 3900X draws around the same power as a 9900K while powering 4 more cores and 8 more threads.

If Intel ever gets a 10nm or 7nm part out the door, I have a feeling they will take the lead again in performance per watt... though at that point they will probably be competing against Zen 3 at 7nm+.
 
I wouldn't say that Microsoft's scheduler needs "fixing"; it would just need to be made aware of the new hardware architecture.

It's quite fair to cite Microsoft's slow reaction to both Bulldozer and Zen with respect to supporting each architecture properly. It's also understandable why that wasn't at the top of their 'list', but still, they took their damn time.
 
I wouldn't say that Microsoft's scheduler needs "fixing"; it would just need to be made aware of the new hardware architecture.

But UMA and NUMA have been around a while now; implementing optimizations for a new hardware layout would not be difficult, as they are familiar with that already. It's what gave rise to needing a scheduler.

This makes no sense; UMA and NUMA have nothing to do with SMT thread-distribution optimization.
And the last sentence looks just weird; the scheduler has been there since forever. You must mean something different from what you wrote.
 
I believe the mentioned update was for CCX only, not SMT. (I might be wrong.)


But I'm updating a test system to 1903 as we speak to retest it


Great. Stayed an hour extra at work so I could update my work computer (using 7 at home).
Started testing with 7-Zip and was impressed that performance no longer showed a penalty from SMT.
Then realized my i5-4690S has 4 physical cores with no SMT :banghead:

I will test this on my laptop instead.
 
Every test shows otherwise, and nothing from Microsoft supports it doing anything besides round-robin distribution, with some exceptions to avoid thread-priority deadlocks (a high-priority thread waiting on data from a low-priority thread that is starved due to high activity on a middle-priority thread), and with the exception of the CCX-related update in Win10 1903, which I have not tested or seen tested.
Do you have anything that supports the claim that it does take this into account?

This takes less than 5 minutes to test with 7-Zip.

Nope, that was just my belief. Doing some testing with Prime95 and Linpack doesn't seem to show that behavior (easier to test than with 7-zip), so I guess it's not a thing.

Great. Stayed an hour extra at work so I could update my work computer (using 7 at home).
Started testing with 7-Zip and was impressed that performance no longer showed a penalty from SMT.
Then realized my i5-4690S has 4 physical cores with no SMT :banghead:

I will test this on my laptop instead.

Are you saying you've traditionally seen a performance penalty with HT and 7-zip?
 
Great. Stayed an hour extra at work so I could update my work computer (using 7 at home).
Started testing with 7-Zip and was impressed that performance no longer showed a penalty from SMT.
Then realized my i5-4690S has 4 physical cores with no SMT :banghead:

I will test this on my laptop instead.
I loved using your Mercury tool, as it definitely helped pre-1903.
One thing it helped with was running Quake Live (the only competitive game I play). Gotta get those 250 FPS lol!
After updating to 1903, I found Mercury didn't give me any benefit in Quake Live anymore, as it could already hit 250 FPS without it.
 
I wouldn't say that Microsoft's scheduler needs "fixing"; it would just need to be made aware of the new hardware architecture.

But UMA and NUMA have been around a while now; implementing optimizations for a new hardware layout would not be difficult, as they are familiar with that already. It's what gave rise to needing a scheduler.

"Improved", "Fixed" whatever you want to call it...
 
It's been a while since it mattered, but 'pipeline stages' are what I think you're getting at. Pipelines might be... number of threads? In any case, Netburst had an absurd number of pipeline stages, as it was designed for high-bandwidth media processing, something it actually was very good at. Where it failed was when the branch predictor missed and the whole pipeline had to be restarted. This negated Netburst's clockspeed advantage for the most important work a CPU does: chewing through branching code. Thus, while Intel wasn't uncompetitive, and they certainly had a better platform around their CPUs at first, once AMD and their partners stabilized the platform and then moved the memory controller onboard, it was game over for Netburst. Except for Blu-ray players. The first ones ran Pentium IVs, because they were good at that!

With respect to 'more pipelines', if you mean more execution units per core, then that can increase IPC when properly balanced. However, that balance is important, because increasing execution units only works if they can be fed by the rest of the CPU and the system. Typically, more cache is needed, better resource allocators are needed, and even more threads given to the OS as in SMT4.

And that brings up a central question: is AMD going to outfit their SMT4-enabled cores for a variety of workloads? If they are, could that be useful to the consumer / outside of the enterprise? If not, are they going to keep the 'leaner' cores they have now and continue to improve them for the consumer?
I never realized that early Blu-ray players used P4s. I wonder how power-hungry they were... I figured they were fixed-function hardware instead.
 
Nope, that was just my belief. Doing some testing with Prime95 and Linpack doesn't seem to show that behavior (easier to test than with 7-zip), so I guess it's not a thing.


Are you saying you've traditionally seen a performance penalty with HT and 7-zip?

1: I'm not sure if you're saying you're not seeing the fix for the SMT penalty, or that you're not seeing the penalty from SMT.
How did you control the number of threads in Prime95 and Linpack?


2: Yes. You can measure it on all multithreaded software with threads = number of physical cores or lower. 7-Zip makes it easy to adjust the number of threads.
You can use 7-Zip, wPrime, or games (you just need to analyze the thread load distributions first).
It's a native penalty from how SMT works on a multicore CPU, be it Intel or AMD, so it's not just HT.
 
Anyway, my laptop is "not ready for 1903", so I can't run a test to see if 1903 has improved the CPU scheduler's SMT awareness.
 
1: I'm not sure if you're saying you're not seeing the fix for the SMT penalty, or that you're not seeing the penalty from SMT.
How did you control the number of threads in Prime95 and Linpack?

I'm saying that I don't see the scheduling behavior I thought was present where the scheduler would be intelligent enough to distribute threads among physical cores first before loading up each core with a second thread. I'm controlling the number of threads by specifying the number of threads to spawn in both Prime95 and LinX.

2: Yes. You can measure it on all multithreaded software with threads = number of physical cores or lower. 7-Zip makes it easy to adjust the number of threads.
You can use 7-Zip, wPrime, or games (you just need to analyze the thread load distributions first).
It's a native penalty from how SMT works on a multicore CPU, be it Intel or AMD, so it's not just HT.

I think maybe I'm misinterpreting what you mean by "SMT penalty" (and I understand that it's not just HT, but you were talking about your Intel processors.) When you say "penalty" do you just mean that executing multiple threads with SMT does not scale as well as running multiple threads on physical cores or do you mean that you're actually seeing a performance regression when executing with more threads than you have cores?
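To pin down the distinction in numbers: for an N-core part with benchmark score S(n) at n threads, one reasonable (informal, not any official metric) definition of SMT yield is

\[ \text{SMT yield} = \frac{S(2N)}{S(N)} - 1 \]

A positive yield means SMT helped (the workloads in the comparison discussed earlier were picked to sit at the high end), roughly zero means no scaling, and a negative value is the actual regression case.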
 
I'm saying that I don't see the scheduling behavior I thought was present where the scheduler would be intelligent enough to distribute threads among physical cores first before loading up each core with a second thread. I'm controlling the number of threads by specifying the number of threads to spawn in both Prime95 and LinX.



I think maybe I'm misinterpreting what you mean by "SMT penalty" (and I understand that it's not just HT, but you were talking about your Intel processors.) When you say "penalty" do you just mean that executing multiple threads with SMT does not scale as well as running multiple threads on physical cores or do you mean that you're actually seeing a performance regression when executing with more threads than you have cores?


I was talking about SMT in general; it just happens that the only CPU I have running Windows 10 to test the 1903 version was an Intel one.
I just wanted to emphasize, before the fanboys go crazy, that the SMT issue applies to both Intel and AMD.

Anyway, I did a quick test on Windows 10 1803 on my laptop, and it clearly shows the SMT penalty.


> Cinebench r15 <

4 threads / normal
357
360
363
371
366
= 363.4

4 threads / Affinity 0 2 4 6
383
383
379
383
381
= 381.8


It's clear that Windows 10 1803 does NOT take SMT into account.

Going from 363.4 to 381.8 (381.8 / 363.4 ≈ 1.051) is slightly above a 5% boost from handling SMT correctly manually.

Side note: the two non-383 results in Cinebench were due to me not having Task Manager ready to set affinity, so it had to load Task Manager and set affinity first, which naturally reduces the benchmark result.
So avoiding SMT thread conflicts also gives way more stable results with Cinebench.


On your first point:
If you are using Task Manager or similar to see anything about thread load distributions, you are doing it wrong to begin with. This is not meant as an insult, but to clarify a very common misunderstanding on the CPU load scheduling topic.
This was even done by Kyle testing thread scalability in some games, so don't feel bad about it.
This is simply not how you see thread distribution/load, because what you see is an average over time. So even though you see, let's say, 8% load across all cores in Task Manager,
it does not mean that more than 1 core was in use at a given time.

You need to know the difference between "instant" and "over time" situations.


Some people might say 5% is not a big deal, but seeing how people went crazy about a 100 MHz turbo boost on Ryzen, it's kinda fun to see them losing 5% performance from not being SMT-aware and up to 20% from not being CCX-aware.



In short: the Windows 10 1803 CPU scheduler does not seem to be SMT-aware or SMT-optimal
(and you need to be using Project Mercury to get max performance).

Will test 1903 as soon as I can
 