AMD possibly going to 4 threads per core

Ready4Dis · Oct 3, 2019

GoodBoy said:
It's not stupid to include those workloads if AMD is trying to sell me on SMT4 in a desktop processor.... and someone writing an article on SMT should be including those cases where SMT is NOT beneficial, or not as beneficial, or a detriment, otherwise it is an incomplete picture as I stated in my previous post...

Nobody said they were trying to sell you on it. This is simply a discussion; some people may want or use it and others won't. It could be EPYC or TR only, or used for very low-power devices that don't have as many physical cores to keep up. They could release a 3-core/12-thread APU for a laptop, or a 2-core/8-thread one, and it would use the SMT in a lot of current workloads. They could put it in a 64-core/256-thread package for enterprise. There are target audiences; if you aren't one and don't care, great. I like discussing technical features and possibilities. I run servers at work and a personal dual-Xeon server at my house. This interests me more than it would someone who just games at as high an fps as possible. I don't buy a $1k GPU, but a $1k CPU I would, depending on its merits.

IdiotInCharge · Oct 3, 2019

Rockenrooster said:
Most everyone here knows that most games benefit very little from SMT so why show benchies for them...

The thing is, games can benefit from SMT -- both by utilizing the extra threads themselves, and by the OS putting less important stuff on those extra threads and keeping it out of the way of game threads.

This doesn't always show in benchmarking, of course, since you're trying to isolate things: but the basic point is that not having enough threads will choke games and cause increased frametimes, and just adding SMT in this situation can help alleviate the problem.

See: 2500K vs. 2600K, 7600K vs. 7700K, and for an alternative, 9700K vs. 8700K, where it's a tossup, and 9700K vs. 9900k, where there are enough hardware cores to satisfy nearly all available gaming scenarios today.

Rockenrooster · Oct 4, 2019

IdiotInCharge said:
The thing is, games can benefit from SMT -- both by utilizing the extra threads themselves, and by the OS putting less important stuff on those extra threads and keeping it out of the way of game threads.

This doesn't always show in benchmarking, of course, since you're trying to isolate things: but the basic point is that not having enough threads will choke games and cause increased frametimes, and just adding SMT in this situation can help alleviate the problem.

See: 2500K vs. 2600K, 7600K vs. 7700K, and for an alternative, 9700K vs. 8700K, where it's a tossup, and 9700K vs. 9900k, where there are enough hardware cores to satisfy nearly all available gaming scenarios today.

I didn't say they can't, I just said most.
Although most people also have a bunch of crap installed (like me), so adding SMT alone can help a lot of things, like you said. A lot of these people that say "I don't need more cores/threads, they're useless to me!" don't realize that reviewers use very barebones systems with no crap running in the background besides FPS capture tools, or they run barebones systems themselves. CPUs with more cores/threads have better real-world performance over time as things get loaded up and bogged down, unless you reformat every 6 months... A 7700K might be faster than a 1700 for gaming... But would it be faster in gaming on my system with all the crap I have running? Probably not lol.

BrotherMichigan · Oct 4, 2019

GoodBoy said:
It's not stupid to include those workloads if AMD is trying to sell me on SMT4 in a desktop processor.... and someone writing an article on SMT should be including those cases where SMT is NOT beneficial, or not as beneficial, or a detriment, otherwise it is an incomplete picture as I stated in my previous post...

So what you're saying is you didn't actually read the post I linked and you're just here to whinge. Got it.

SvenBent · Oct 4, 2019

Grimlaking said:
Ok after seeing all of this conversation about your tool... I want it where is a link? And does it work on server OS's or do they not have this issue? And do you license for enterprise use if it does work?

I usually don't post a link because I feel like I'm doing too much self-advertisement, and I would really like to get the technical point out first, lol.
Anyway, my file server is down (I am awaiting a replacement part from another [H] member).

But Major Geeks and Softpedia should have pretty recent mirrors.
However, I am unable to provide direct links, as both are blocked at my work.

So search for Project Mercury on those 2 sites and you should find it.

SvenBent · Oct 4, 2019

GoodBoy
Despite multiple requests to explain what exactly got fixed between XP and 2k in regard to SMT, you have yet to do so.
I've tried googling it but found nothing.
Can you please elaborate with technical examples of the difference?

GoodBoy · Oct 4, 2019

Windows 2000 had no awareness or support of SMT. It thought those logical cores were physical cores.

Windows XP, SMT support was included. First (desktop) SMT processor was Xeon and P4's released in 2002.

Wasn't exactly anything being "fixed", but support and awareness added, and the OS understood they were logical cores.

Between then and now, support for newer cpu's has been added, and "fixed" when needed. Server 2019, Windows 10 build 1903 properly understand the hardware of the newer AMD cpu's.

Are you trying to say that even with the latest patch in Windows 10, that there are still thread scheduling issues on AMD cpu's? Or SMT cpu's in general? (Haven't yet looked at your Project Mercury but it sounds useful)

Rockenrooster · Oct 4, 2019

GoodBoy said:
Windows 2000 had no awareness or support of SMT. It thought those logical cores were physical cores.

Windows XP, SMT support was included. First (desktop) SMT processor was Xeon and P4's released in 2002.

Wasn't exactly anything being "fixed", but support and awareness added, and the OS understood they were logical cores.

Between then and now, support for newer cpu's has been added, and "fixed" when needed. Server 2019, Windows 10 build 1903 properly understand the hardware of the newer AMD cpu's.

Are you trying to say that even with the latest patch in Windows 10, that there are still thread scheduling issues on AMD cpu's? Or SMT cpu's in general? (Haven't yet looked at your Project Mercury but it sounds useful)

He said previously that he has yet to test 1903...
I would also be interested if project mercury has any benefit still with 1903.

funkydmunky · Oct 5, 2019

SvenBent said:
I usually don't post a link because I feel like I'm doing too much self-advertisement, and I would really like to get the technical point out first, lol.
Anyway, my file server is down (I am awaiting a replacement part from another [H] member).

But Major Geeks and Softpedia should have pretty recent mirrors.
However, I am unable to provide direct links, as both are blocked at my work.

So search for Project Mercury on those 2 sites and you should find it.

Major Geeks offers only the 32-bit version, with the 64-bit one only at the author's site, which is unavailable. May I ask what the 32-bit version hampers?

SvenBent · Oct 5, 2019

GoodBoy said:
Windows 2000 had no awareness or support of SMT. It thought those logical cores were physical cores.

Windows XP, SMT support was included. First (desktop) SMT processor was Xeon and P4's released in 2002.

Wasn't exactly anything being "fixed", but support and awareness added, and the OS understood they were logical cores.

Between then and now, support for newer cpu's has been added, and "fixed" when needed. Server 2019, Windows 10 build 1903 properly understand the hardware of the newer AMD cpu's.

Are you trying to say that even with the latest patch in Windows 10, that there are still thread scheduling issues on AMD cpu's? Or SMT cpu's in general? (Haven't yet looked at your Project Mercury but it sounds useful)

Windows XP, SMT support was included. First (desktop) SMT processor was Xeon and P4's released in 2002.
Support, as explained before, does not mean proper thread scheduling to avoid thread conflicts on the physical cores.
Also, can you back up the claim that Windows 2k does not support SMT with anything?

Wasn't exactly anything being "fixed", but support and awareness added, and the OS understood they were logical cores.
Our definitions of "understanding the logical cores" are different, then, because Windows XP and everything after it, up to Windows 10 1803, does NOT treat the logical core setup differently from the physical core setup in regard to thread scheduling, as proven earlier in this thread.
If it did, we would not gain a performance increase from manually assigning threads correctly.
So, exactly like I said, you are using a different definition of "SMT aware" than I was.

Between then and now, support for newer cpu's has been added, and "fixed" when needed. Server 2019, Windows 10 build 1903 properly understand the hardware of the newer AMD cpu's.
Again, you like to use vague, non-technical descriptions.
As I clearly stated about 1903,
I believe it fixes the CCX issue and not the SMT issue. That would live up to your vague, non-technical description but still leave the SMT issue where it is.
I even brought up the reason why MS would fix the CCX issue before SMT.

Do you have any proof for this claim that it is fixed for thread conflicts with SMT?
Just repeating yourself when asked for evidence does not make it right.
On the contrary, it tends to show that you just don't have the evidence to back it up.

Are you trying to say that even with the latest patch in Windows 10, that there are still thread scheduling issues on AMD cpu's? Or SMT cpu's in general? (Haven't yet looked at your Project Mercury but it sounds useful)
As clearly written in a prior post, I tested on 1803 because I did not have a 1903 machine available.
Maybe try to read what people are actually writing instead of putting your fingers in your ears and just repeating yourself.

It seems more and more like you are the kind of person who wants to be right on a forum rather than figure out what is right,
and instead of using the time to bring forth evidence you just repeat yourself.

If you have a 1903 build, why don't you test it and show us the proof?
Everyone here is interested in figuring this out.

SvenBent · Oct 5, 2019

funkydmunky said:
Major Geeks offers only the 32-bit version, with the 64-bit one only at the author's site, which is unavailable. May I ask what the 32-bit version hampers?

I originally made it in just 64-bit because I didn't think anyone interested would still have a 32-bit OS.
I was wrong, so I made the 32-bit version.

There should not be any difference in functionality.
However, the 64-bit version uses around 15% fewer CPU instructions in its full outer loop.
This loop can be run in full a couple of million times a second on a Core 2 Duo, so reducing it by 15% in a loop that happens every time you click a new window is hardly even measurable.
Or, said differently, if you click around 1000+ windows a second you might have slightly more CPU utilization.
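
The back-of-the-envelope estimate above can be written out as a small calculation. This is only a sketch: "a couple of million" is taken as 2,000,000 loops per second, that rate is assumed to belong to the slower (32-bit) build, and the function name is made up for illustration.

```python
def extra_cpu_fraction(loops_per_second, saving, events_per_second):
    """Fraction of one core's time lost by NOT having the faster loop.

    loops_per_second  -- how many full loops the slower build can run per second
    saving            -- fraction of the loop the faster build avoids (0.15 = 15%)
    events_per_second -- how often the loop actually runs (one per window click)
    """
    seconds_per_loop = 1.0 / loops_per_second
    saved_per_event = seconds_per_loop * saving
    return events_per_second * saved_per_event


# Assumed figures: 2 million loops/s, 15% saving, 1000 window clicks per second.
fraction = extra_cpu_fraction(2_000_000, 0.15, 1000)
print(f"{fraction * 100:.4f}% of one core")
```

Under those assumptions the difference stays far below a hundredth of a percent of one core even at 1000 clicks per second, which is the sense in which it is "hardly even measurable".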

SvenBent · Oct 5, 2019

GoodBoy

OK, let's take your earlier claim that since Windows XP all SMT CPU support has been optimal.
Back then SMT was only on single-core CPUs,
so the thread conflict situation would not be present. The P4 was a horrible design to put Hyper-Threading in, due to how it was designed to overcome its long pipelines, meaning it overused the pipeline in a way that prevented optimal SMT gain.
But it would have no penalties, as the extra thread would never be in a position where it could have gotten another physical core; there was only one.

The SMT thread conflict issue only came around with the i7 9xx, because it was the first multicore CPU with SMT (in the consumer market), and the issue arrived there.
So it would be most logical (but not 100% sure) that the fix in Windows XP does nothing about the SMT thread conflict issue I'm talking about (and have proven).
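
The thread-conflict argument can be illustrated with a toy placement model. This is only a sketch: the function names are made up, it assumes sibling logical CPUs are numbered adjacently (0 and 1 on core 0, 2 and 3 on core 1, and so on), and the "naive" policy is a deliberately simple strawman, not a description of what any Windows version actually does.

```python
from collections import Counter


def core_of(logical_cpu, threads_per_core=2):
    """Physical core owning a logical CPU (assumes adjacent sibling numbering)."""
    return logical_cpu // threads_per_core


def conflicts(placement, threads_per_core=2):
    """Count busy threads that share a physical core with another busy thread."""
    per_core = Counter(core_of(cpu, threads_per_core) for cpu in placement)
    return sum(n for n in per_core.values() if n > 1)


def naive_placement(n_threads):
    """Fill logical CPUs in index order, ignoring which ones are siblings."""
    return list(range(n_threads))


def smt_aware_placement(n_threads, n_cores, threads_per_core=2):
    """Use one logical CPU per physical core first, then fall back to siblings."""
    order = [core * threads_per_core + sibling
             for sibling in range(threads_per_core)
             for core in range(n_cores)]
    return order[:n_threads]


# 4 busy threads on a 4-core/8-thread CPU.
print(conflicts(naive_placement(4)))         # two cores doubled up, two idle
print(conflicts(smt_aware_placement(4, 4)))  # one thread per physical core

# Single core with SMT: there is no other core to move to.
print(conflicts(smt_aware_placement(2, 1)))
```

On a single core with two logical CPUs both policies produce the same placement, which matches the point that the conflict can only show up once there are several physical cores to choose from.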

OutOfPhase · Oct 5, 2019

The P4 actually had HT added because of its very long pipelines. It was not a horrible decision to add it at all.

Without extra threads per core, a missed branch prediction, cache miss, etc just utterly ended performance. That's it - the core is doing nothing until the (long) pipeline flows again. With another thread however, one can stall (still bad) but then there's a good chance the other thread is not, and can thus be serviced by the core which would otherwise be doing absolutely nothing.

Remember, SMT is a tool to increase core utilization.
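
A toy model of that utilization argument, as a sketch only: each thread is described by a fixed, repeating pattern of cycles in which it is ready to work ("W") or stalled ("S"), and the core has a single issue slot per cycle. Real pipelines are far more complicated; the patterns and the function name are assumptions for illustration.

```python
def utilization(threads, cycles):
    """Fraction of cycles in which the core has something to issue.

    threads -- list of pattern strings; 'W' = ready to work, 'S' = stalled
    cycles  -- how many cycles to simulate
    """
    busy = 0
    for cycle in range(cycles):
        # The core does useful work if at least one hardware thread is ready.
        if any(pattern[cycle % len(pattern)] == "W" for pattern in threads):
            busy += 1
    return busy / cycles


one_thread = utilization(["WWSS"], 8)             # idle whenever the thread stalls
two_threads = utilization(["WWSS", "SSWW"], 8)    # stalls do not overlap
same_stalls = utilization(["WWSS", "WWSS"], 8)    # both stall at the same time
print(one_thread, two_threads, same_stalls)
```

The second thread only helps when its ready cycles fall into the first thread's stalls; if both stall together, the core is just as idle as before.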

erek · Oct 5, 2019

http://www.redgamingtech.com/amd-confirms-zen-3-milan-details-64-cores-smt-2-and-8-cores-per-ccx/

Mega6 · Oct 5, 2019

"appears to dispell the rumours that AMD planned to release Milan with a 4x SMT implementation, which alleged that Zen 3 would offer users four threads per CPU core. It looks like the main source of performance improvements from Zen 3 will come from IPC enhancements and clock speed gains, rather than increases in core/thread count. Hopefully, this means that Zen 3 will focus on single-threaded performance and core architecture improvements."

https://www.overclock3d.net/news/cp..._architecture_details_and_zen_4_genoa_plans/1

1_rick · Oct 5, 2019

Tom's Hardware, I think it was, had an article about this too. One of the things Zen 3 will do is unify the two CCXes so that there's one 8-core complex with a unified L3 cache. That will eliminate CCX-CCX latency.

Mega6 · Oct 5, 2019

1_rick said:
Tom's Hardware, I think it was, had an article about this too. One of the things Zen 3 will do is unify the two CCXes so that there's one 8-core complex with a unified L3 cache. That will eliminate CCX-CCX latency.

Saw that too.. these Zen Cores will just keep getting better and better. AMD is consistently addressing the weakest points of the design. Unified L3 Cache coming too.

IdiotInCharge · Oct 5, 2019

Mega6 said:
Saw that too.. these Zen Cores will just keep getting better and better. AMD is consistently addressing the weakest points of the design. Unified L3 Cache coming too.

It's a very good sign -- they're not going to keep up with Intel if they don't have major refinements coming down the pipe!

SvenBent · Oct 5, 2019

PhaseNoise said:
The P4 actually had HT added because of its very long pipelines. It was not a horrible decision to add it at all.

Without extra threads per core, a missed branch prediction, cache miss, etc just utterly ended performance. That's it - the core is doing nothing until the (long) pipeline flows again. With another thread however, one can stall (still bad) but then there's a good chance the other thread is not, and can thus be serviced by the core which would otherwise be doing absolutely nothing.

Remember, SMT is a tool to increase core utilization.

From what I've read,
due to the long pipelines, and thereby the big penalty for a prediction miss as you mention,
the P4 was very aggressive in filling up the pipeline, thereby leaving very little free space for the extra threads to be executed in.

I'll try to see if I can find the article again.

If you have anything on the design of the P4 being optimal for Hyper-Threading, I would be interested in reading it.

-- edit --
Oh, I stumbled on it on the wiki

https://en.wikipedia.org/wiki/Simultaneous_multithreading
The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and since introduced into a number of their processors. Intel calls the functionality Hyper-threading, and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement [3] compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent; however, when running two programs that require full attention of the processor it can actually seem like one or both of the programs slows down slightly when Hyper-threading is turned on.[4] This is due to the replay system of the Pentium 4 tying up valuable execution resources, increasing contention for resources such as bandwidth, caches, TLBs, re-order buffer entries, equalizing the processor resources between the two programs which adds a varying amount of execution time. The Pentium 4 Prescott core gained a replay queue, which reduces execution time needed for the replay system. This is enough to completely overcome that performance hit

So yeah, parts of the P4 were not optimal for Hyper-Threading, but it got fixed with the Prescott core.

https://en.wikipedia.org/wiki/Replay_system
The replay system came about as a result of Intel's quest for ever-increasing clock speeds. These higher clock speeds necessitated very lengthy pipelines (up to 31 stages in the Prescott core). Because of this, there are six stages between the scheduler and the execution units in the Prescott core. In an attempt to maintain acceptable performance, Intel engineers had to design the scheduler to be very optimistic.[2]

The scheduler in a Pentium 4 processor is so aggressive that it will send operations for execution without a guarantee that they can be successfully executed. (Among other things, the scheduler assumes all data is in level 1 "trace cache" CPU cache.) The most common reason execution fails is that the requisite data is not available, which itself is most likely due to a cache miss. When this happens, the replay system signals the scheduler to stop, then repeatedly executes the failed string of dependent operations until they have completed successfully.[2][3]

So I was right, according to the wiki: the P4 was too aggressive in filling up the pipeline, thereby using up CPU resources that would have been better freed up for HT to fit in another thread.

Not saying HT was not beneficial, but there were parts of the P4 that directly worked against the effectiveness of SMT.
However, I did not remember that it got fixed in the Prescott version, as the wiki mentions.

GoodBoy · Oct 7, 2019

SvenBent said:
Windows XP, SMT support was included. First (desktop) SMT processor was Xeon and P4's released in 2002.
Support, as explained before, does not mean proper thread scheduling to avoid thread conflicts on the physical cores.
Also, can you back up the claim that Windows 2k does not support SMT with anything?

SvenBent, on your Windows 2000 question: https://devblogs.microsoft.com/oldnewthing/20040913-00/?p=37883

He also mentions the Scheduler, and this was back in the XP days.

For the majority of these SMT/HT posts, I thought we were talking about the operating system. In your last post you are talking about a CPU (the P4 and its HT performance/operation). Which is it?

SvenBent said:
It seems more and more like you are the kind of person who wants to be right on a forum rather than figure out what is right,
and instead of using the time to bring forth evidence you just repeat yourself.

Not at all.

SvenBent said:
If you have a 1903 build, why don't you test it and show us the proof?
Everyone here is interested in figuring this out.

I do have 1903, but I run an Intel CPU, not an AMD Ryzen... I haven't run into HT issues on my CPU, but the issues you mention would need to be tested on an AMD processor. If something is getting lost in translation, and if I have misunderstood part of your posts, I apologize.

We all know that Windows wasn't properly scheduling threads on the first generation of the Ryzen 1/2/Zen. As I understand it, you are trying to place this blame on the OS. And it is true, it did not understand how to be most efficient on the new layout, I believe because of the way the CCXs were arranged. But the OS has been updated to be aware of this. If you have a Ryzen CPU, why not patch your OS and test it yourself instead of singling out my posts?

AMD has redesigned the CCX layout completely in the newest Zen (see video posted above) to be more efficient. Sounds like the first design wasn't the best.

Grimlaking · Oct 7, 2019

GoodBoy said:
SvenBent I do have 1903, but I run an Intel CPU, not an AMD Ryzen... I haven't run into HT issues on my CPU, but the issues you mention would need to be tested on an AMD processor. If something is getting lost in translation, and if I have misunderstood part of your posts, I apologize.

I've been following this thread for a while now and the back and forth asking the same questions with insanely detailed responses.

To answer your question the Hyper-threading issue is NOT the CCX issue. The Hyper-threading issue can be experienced on ALL cpu's that offer Hyper-threading or SMT.

So, to be very clear: you should be able to test whether the 1903 build of Windows 10 is properly utilizing your Hyper-Threaded cores, i.e. not unnecessarily overloading a single physical core when more are available. Do it by running the tests done before, with the app in question that fixes scheduling. Run the computationally heavy workload with and without the program working. If you see the SAME results, you are good. If you see WORSE results without it, then Windows still isn't using your Intel Hyper-Threaded cores in the most optimal fashion. If it is BETTER without the app, or EQUAL, then you are set, because 1903 is already doing it right.

The CCX issue is the only AMD specific issue and that was addressed in a previous update to 1903.
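
The with/without comparison described in this post can be sketched as a small decision helper. Everything here is an assumption for illustration: the function name, the use of the median, the 2% noise tolerance, and the sample timings, which are invented and not measurements from this thread.

```python
import statistics


def scheduling_verdict(with_tool, without_tool, tolerance=0.02):
    """Compare run times (seconds, lower is better) of the same heavy workload.

    with_tool    -- timings with the affinity-fixing tool running
    without_tool -- timings with the OS scheduler left alone
    tolerance    -- relative difference treated as measurement noise (assumed 2%)
    """
    fixed = statistics.median(with_tool)
    stock = statistics.median(without_tool)
    if stock > fixed * (1 + tolerance):
        # Noticeably slower without the tool: the OS is still placing threads badly.
        return "worse_without_tool"
    # Same or better without the tool: the OS scheduler is already doing it right.
    return "no_gain_from_tool"


# Invented example timings, for illustration only.
print(scheduling_verdict([10.0, 10.1, 9.9], [12.0, 12.1, 11.9]))
print(scheduling_verdict([10.0, 10.1, 9.9], [10.0, 10.1, 9.9]))
```

Running each configuration several times and comparing medians keeps a single noisy run from deciding the outcome.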

os2wiz · Oct 7, 2019

IdiotInCharge said:
Do you have a few comprehensive resources for this point? I've seen it repeated and I'm a bit curious.

Specific SMT setups where the OS and hardware are developed closely in concert show linear scaling, but that's not what we're talking about here- it's just the 'best case' scenario. IBM with their own operating systems (UNIX-based) with their Power CPU line is a strong example here, and a weaker example would be Apple developing IOS, their flavor of ARM CPUs, and enforcing strict application guidelines. Not really SMT related to my knowledge, but that's the level of hardware - operating system - software coupling that's needed.

With respect to x86 as the base instruction set and Windows and Linux as the predominant operating systems, the decision point between 'wider' cores with more resources and SMT versus just adding more cores is more difficult.

If AMD wants to pile more threads on to each core, cool, but the big problem with SMT happens when a core has multiple threads fighting over exhausted resources causing overall execution speed to decline. That's where we don't want to end up.

4-way SMT will NOT scale well, is totally useless for home use and will only be available for server chips. AMD is not wasting its time on 4-way SMT for consumers.

IdiotInCharge · Oct 8, 2019

os2wiz said:
4-way SMT will NOT scale well, is totally useless for home use and will only be available for server chips. AMD is not wasting its time on 4-way SMT for consumers.

Which you can prove using multiple expert opinions that you can link...?

The same was said about HT / SMT2 when it arrived. Most theorized that it would make things worse, and well, it did in a number of cases. But it worked enough way back in 2002 that it was worth keeping around, such that AMD used SMT2 on Zen fifteen years later.

Moving from SMT2 to SMT4 and making it useful in the same manner that the move from a single-thread per core to SMT2 was useful is only a matter of making sure that the CPU cores are afforded the appropriate resources and the OS is updated to take advantage of them.

SvenBent · Oct 8, 2019

GoodBoy said:
SvenBent, on your Windows 2000 question: https://devblogs.microsoft.com/oldnewthing/20040913-00/?p=37883

He also mentions the Scheduler, and this was back in the XP days.

For the majority of these SMT/HT posts, I thought we were talking about the operating system. In your last post you are talking about a CPU (the P4 and its HT performance/operation). Which is it?

Not at all.

I do have 1903, but I run an Intel CPU, not an AMD Ryzen... I haven't run into HT issues on my CPU, but the issues you mention would need to be tested on an AMD processor. If something is getting lost in translation, and if I have misunderstood part of your posts, I apologize.

We all know that Windows wasn't properly scheduling threads on the first generation of the Ryzen 1/2/Zen. As I understand it, you are trying to place this blame on the OS. And it is true, it did not understand how to be most efficient on the new layout, I believe because of the way the CCXs were arranged. But the OS has been updated to be aware of this. If you have a Ryzen CPU, why not patch your OS and test it yourself instead of singling out my posts?

AMD has redesigned the CCX layout completely in the newest Zen (see video posted above) to be more efficient. Sounds like the first design wasn't the best.

Thank you for the relevant info.
However, a blog post is not technical evidence, especially not when we can measure and see that what you concluded from the blog post is not correct (in regard to the thread conflict issue I'm bringing up),
unless the author is only talking about a dual-socket setup (which I believe he is, from the wording).
In that case I have no testing or opinion on that, and he might absolutely be correct that Windows XP can distribute load to two different sockets correctly to avoid SMT conflicts.

But on a single-socket setup with multiple cores and SMT, the CPU scheduler pre-Windows 10 1903 does not handle thread distribution optimally. We see this easily in testing.

This is still interesting to see though, and something I would love to test. However, my 4-socket Citrix server has a burned-out PSU and it's too expensive to get a new one.

Not running into a problem and whether the problem is there are two very different statements.
Did you do any testing of it? I would like to see the results.

Again, I'm not sure why you keep bringing Ryzen into it. The thread conflict issue I'm bringing up is in regard to SMT on both Intel and AMD, all the way back to the i7 9xx series.
I already said from the very beginning that my current understanding (but not tested) is that 1903 fixed the Ryzen CCX issue but NOT the SMT issue (which is not Ryzen specific).
The Ryzen-specific CCX issue is a different story.

I'm not sure why the Pentium 4 is confusing to you. It was, again, a different situation I was talking about.
I did spell it out for you:
"Back then SMT was only on single core CPU's
So the thread conflicts situation would not be present"
It seems like you are trying too hard for a "gotcha" moment here but not taking the time to actually understand what is being said.

The P4's greedy pipe-filling issue with SMT has nothing to do with the SMT thread conflict issue.
As explained earlier, the SMT thread conflict issue can only happen on a multicore system, which is NOT the case with a P4.
The P4 issue is a DIFFERENT story.
The Northwood core was simply not getting optimal performance gains out of SMT due to its instruction scheduler design, which was filling the pipe with junk.

To repeat:
Please show us the testing you have done. It would be nice to see.

OutOfPhase · Oct 8, 2019

IdiotInCharge said:
Which you can prove using multiple expert opinions that you can link...?

The same was said about HT / SMT2 when it arrived. Most theorized that it would make things worse, and well, it did in a number of cases. But it worked enough way back in 2002 that it was worth keeping around, such that AMD used SMT2 on Zen fifteen years later.

Moving from SMT2 to SMT4 and making it useful in the same manner that the move from a single-thread per core to SMT2 was useful is only a matter of making sure that the CPU cores are afforded the appropriate resources and the OS is updated to take advantage of them.

Exactly. While I have some skepticism about how useful this particular implementation will be for home use, it's really just a balancing tool. If their data shows the cores aren't fully occupied a significant amount of the time, it may be worth doing. It does depend upon the workload, which is why I'm personally skeptical about the utility to the home market (fewer high-thread applications overall) - but I have zero actual data there.

I worked on research a long time ago on an architecture which had many threads and one core (technically). Now that one core as you might imagine had a lot of processing units - far more than traditional cores. It was an attempt to avoid siloing and give maximum flexibility. It didn't pan out at the time but the idea itself isn't fundamentally flawed, and things have changed.

SvenBent · Oct 8, 2019

PhaseNoise said:
Exactly. While I have some skepticism about how useful this particular implementation will be for home use, it's really just a balancing tool. If their data shows the cores aren't fully occupied a significant amount of the time, it may be worth doing. It does depend upon the workload, which is why I'm personally skeptical about the utility to the home market (fewer high-thread applications overall) - but I have zero actual data there.

I worked on research a long time ago on an architecture which had many threads and one core (technically). Now that one core as you might imagine had a lot of processing units - far more than traditional cores. It was an attempt to avoid siloing and give maximum flexibility. It didn't pan out at the time but the idea itself isn't fundamentally flawed, and things have changed.

I have long wondered if a design like this would be able to balance out the parallel workload (lots of logical cores) and still maintain a high physical core speed for low-threaded situations.
But I guess that once you put in enough transistors for the massive amount of sub-core components, you end up in the same place as with multiple cores.

What I would love to see would be some kind of cross-communication SMT,
like 2 physical cores sharing 4 logical cores.

So each of the 2 physical cores can execute threads assigned to any of the 4 logical cores.
This could improve thread balancing among the 2 physical cores and eliminate part of the thread conflicts from SMT in hardware.

Not sure if the added management overhead would eat into the transistor budget and/or create too much latency or other speed penalties to be worth it.

IdiotInCharge · Oct 8, 2019

SvenBent said:
I have long wondered if a design like this would be able to balance out the parallel workload (lots of logical cores) and still maintain a high physical core speed for low-threaded situations.
But I guess that once you put in enough transistors for the massive amount of sub-core components, you end up in the same place as with multiple cores.

This is my fundamental question between multiple cores, 'hybrid' cores a la Bulldozer (two INT pipelines with one FLOAT pipeline per 'module'), and various SMT implementations.

Perhaps we'll see a 'shrink' of per-core resources, an elimination of SMT, and many more cores in future designs?

I think that all of these approaches can work very well, but that they're all dependent on the software stack above them to be useful versus other options.

SvenBent · Oct 8, 2019

IdiotInCharge said:
This is my fundamental question between multiple cores, 'hybrid' cores a la Bulldozer (two INT pipelines with one FLOAT pipeline per 'module'), and various SMT implementations.

Perhaps we'll see a 'shrink' of per-core resources, an elimination of SMT, and many more cores in future designs?

I think that all of these approaches can work very well, but that they're all dependent on the software stack above them to be useful versus other options.

The issue I personally see with broad parallel performance (i.e. many cores, physical or logical) is that some tasks simply cannot be mathematically split up into multiple parts.

One of my favourite examples is LZMA from 7-Zip: almost every piece of calculation depends on the result of the previous calculation.
There is no way to scale it to multiple threads.

What happened in LZMA2 was that the modeling and entropy coding got separated, because they are separate sub-processes.
Say data A goes to modeling (thread 1), and then the output from that goes to entropy encoding (thread 2).
That is as much serial splitting up of the process as could be done, so only 2 threads.

So how come 7-Zip LZMA2 can do 16 threads?

Because instead of splitting up the process, it splits up the data into chunks and works on each chunk separately.
Since this means multiple compression dictionaries, the memory consumption increases linearly with the number of threads.
It also means efficiency goes down, as the chunks are not compared with the other chunks being processed: there are now multiple parallel "channels" of chunks that don't see the redundancy in separate "channels".

So the drawbacks are:
1: increase in memory footprint
2: reduced efficiency
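
To illustrate the efficiency drawback, here is a minimal sketch using Python's standard `lzma` module. This is not 7-Zip's actual implementation, and it runs single-threaded; it only models the effect of giving each chunk its own dictionary. The helper names (`make_redundant_data`, `compress_whole`, `compress_chunked`) and the test data (one random block repeated several times, so all redundancy sits between chunks) are assumptions made up for the example.

```python
import lzma
import random


def make_redundant_data(block_size=64 * 1024, repeats=8, seed=0):
    """Build a buffer whose only redundancy is *between* blocks."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    return block * repeats


def compress_whole(data):
    """One stream, one dictionary: later blocks can reference earlier ones."""
    return len(lzma.compress(data))


def compress_chunked(data, chunks):
    """Split the data and compress each chunk with its own dictionary.

    Each chunk could be handed to a separate thread, but no chunk can
    see matches that live in another chunk.
    """
    size = -(-len(data) // chunks)  # ceiling division
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    return sum(len(lzma.compress(part)) for part in parts)


if __name__ == "__main__":
    data = make_redundant_data()
    print("whole  :", compress_whole(data), "bytes")
    print("chunked:", compress_chunked(data, 8), "bytes")
```

With this deliberately worst-case data the chunked total comes out several times larger than the single-stream result, because every chunk has to store the random block again. Real files usually lose far less, but the direction of the effect is the same.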

Bottom line is really that on a semi philosophical level: multicore is a patch solution to not getting faster core speeds

IdiotInCharge · Oct 8, 2019

SvenBent said:
Bottom line is really that on a semi philosophical level: multicore is a patch solution to not getting faster core speeds

For purely serial workloads, this will absolutely remain the case.

Multi-core / multi-threading (and supporting infrastructure) is quite useful for multiple serial workloads of course -- and that would at least include an application and supporting operating system and hardware drivers.

But the case could be made for, say, a very, very fast quad-core setup. It's just not feasible to manufacture such a device today.

SvenBent · Oct 8, 2019

IdiotInCharge said:
For purely serial workloads, this will absolutely remain the case.

Multi-core / multi-threading (and supporting infrastructure) is quite useful for multiple serial workloads of course -- and that would at least include an application and supporting operating system and hardware drivers.

But the case could be made for, say, a very, very fast quad-core setup. It's just not feasible to manufacture such a device today.

Yeah, but if you have the same total speed on one core, it doesn't matter whether your workload is multithreaded, since one core can handle multiple threads.

On the other hand, if you have the same total performance on a multicore CPU, you NEED your software to be multithreaded.

Of course this is strongly theoretical, on more of a philosophical level.

But it is "always" better to have the same performance on as few cores as possible, as it's easier to balance out.

And I agree it is not feasible to try to put a Ryzen 2700's total performance into a single-core CPU, but IF we could, at the same price, that CPU would be better (again, very roughly).
But that's what I mean by a patch solution: it got too expensive/hard/complex/difficult to continue increasing performance with just one core, so we added another,
and another and another, because it was more feasible.

Grimlaking · Oct 8, 2019

SvenBent said:
Yeah, but if you have the same total speed on one core, it doesn't matter whether your workload is multithreaded, since one core can handle multiple threads.

On the other hand, if you have the same total performance on a multicore CPU, you NEED your software to be multithreaded.

Of course this is strongly theoretical, on more of a philosophical level.

But it is "always" better to have the same performance on as few cores as possible, as it's easier to balance out.

And I agree it is not feasible to try to put a Ryzen 2700's total performance into a single-core CPU, but IF we could, at the same price, that CPU would be better (again, very roughly).
But that's what I mean by a patch solution: it got too expensive/hard/complex/difficult to continue increasing performance with just one core, so we added another,
and another and another, because it was more feasible.

This depends on your code. The odds of you having... 4... 16 GHz cores... are VERY low. It would be very fast and snappy, especially for single-threaded apps, but in theory its performance advantage on multithread-aware, optimized code would be minimal. The brass tacks are that we can talk about what WOULD be awesome as opposed to what IS bad-ass today.

It is more cost/energy efficient today to make heavily multi-cored processors rather than lower-count, high-GHz-capable CPUs, only because of the limits of modern technology. (Unless you want to talk about quantum computing, but I don't think many of us could afford that cooling bill.)

SvenBent · Oct 8, 2019

Grimlaking said:
This depends on your code. The odds of you having... 4... 16 GHz cores... are VERY low. It would be very fast and snappy, especially for single-threaded apps, but in theory its performance advantage on multithread-aware, optimized code would be minimal. The brass tacks are that we can talk about what WOULD be awesome as opposed to what IS bad-ass today.

It is more cost/energy efficient today to make heavily multi-cored processors rather than lower-count, high-GHz-capable CPUs, only because of the limits of modern technology. (Unless you want to talk about quantum computing, but I don't think many of us could afford that cooling bill.)

You are repeating the point. The reason we got multicore is because it's "easier" than having the same performance in a single core.

It doesn't change that a single-core CPU with the same total performance as a multicore CPU will be superior. It's just not feasible to make a single-core CPU with the same total performance as we can get with multicore, due to cost, material limitations, etc.

If we had a single-core CPU that can do 8x the amount of "work" of one core on an 8-core CPU,
the scaling would go like this:

1 thread
Single core is 8x as fast

2 threads
Single core is 4x as fast

3 threads
Single core is 2.67x as fast

4 threads
Single core is 2x as fast

etc., all the way down to

8 threads and above
Single core is 1x as fast

So yeah, a single-core CPU with the same total performance as a multicore CPU is superior in regards to scaling,
since a single core runs single-threaded and multithreaded work equally well.
However, a multicore CPU will need multithreaded software to get all of its performance out.

However, it's not feasible or cost-effective, or maybe even possible, to make a single core with the same total power,
but that is beside the point that fewer cores with the same total performance scale better across workloads with different numbers of threads.
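
The scaling table above can be reproduced with a short sketch. It assumes an idealized model: the multicore CPU uses exactly one core per software thread up to its core count, the threads scale perfectly, and the hypothetical single core always delivers the full combined throughput. The function name `single_core_advantage` is made up for the example.

```python
def single_core_advantage(threads, cores=8):
    """How many times faster the hypothetical single core is.

    The single core has the throughput of `cores` ordinary cores, while
    the multicore CPU can only use as many cores as there are threads.
    """
    if threads < 1:
        raise ValueError("need at least one thread")
    busy_cores = min(threads, cores)
    return cores / busy_cores


if __name__ == "__main__":
    for threads in (1, 2, 3, 4, 8, 16):
        advantage = single_core_advantage(threads)
        print(f"{threads:>2} threads: single core is {advantage:.2f}x as fast")
```

In this model the advantage is simply cores / min(threads, cores), so it falls to 1x as soon as the software supplies at least as many threads as the multicore CPU has cores.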

GoodBoy · Oct 9, 2019

Yeah, I think the single core performance we have now is at its maximum already. New CPUs come out with typically single-digit performance increases (IPC).

So they are giving us all these cores... we eat it up. But the gain we get from that is diminishing...

The whole '4 way SMT' (if useful) would really just mean that the core isn't being fed fast enough, and/or is complex enough that various sections of the single core's pipeline can all be kept occupied when the 2 (3, or 4) threads running each need different sections of that pipeline. So they keep more parts of the single core busy with SMT. 2 way SMT is pretty useful with these complex cores. 4 way SMT would be less useful. It's diminishing returns, to the point where it actually hinders performance. Even 2 way SMT can hinder performance (depends on the workload/code). I can't see AMD or Intel spending transistors/TDP/die space to add it for a few percent gain, paid for at the expense of performance of well-written single threads, for desktop or HEDT processors. It can make sense in a server chip: software that is licensed per socket can be a money saver if you have 8 threads per socket, and cloud providers might bill on a per-thread or virtual-core basis, so they can accommodate more customers with more virtual cores, which is a win for them.

Not so much for us.

Grimlaking · Oct 9, 2019

GoodBoy said:
Yeah, I think the single core performance we have now is at its maximum already. New CPUs come out with typically single-digit performance increases (IPC).

So they are giving us all these cores... we eat it up. But the gain we get from that is diminishing...

The whole '4 way SMT' (if useful) would really just mean that the core isn't being fed fast enough, and/or is complex enough that various sections of the single core's pipeline can all be kept occupied when the 2 (3, or 4) threads running each need different sections of that pipeline. So they keep more parts of the single core busy with SMT. 2 way SMT is pretty useful with these complex cores. 4 way SMT would be less useful. It's diminishing returns, to the point where it actually hinders performance. Even 2 way SMT can hinder performance (depends on the workload/code). I can't see AMD or Intel spending transistors/TDP/die space to add it for a few percent gain, paid for at the expense of performance of well-written single threads, for desktop or HEDT processors. It can make sense in a server chip: software that is licensed per socket can be a money saver if you have 8 threads per socket, and cloud providers might bill on a per-thread or virtual-core basis, so they can accommodate more customers with more virtual cores, which is a win for them.

Not so much for us.

Actually, there is a case where SMT4+ is useful for consumers, especially once MS figures out how to assign busy threads to physical cores ahead of SMT or Hyper-Threaded virtual cores.

That is the lower-powered desktop systems or even laptops. Why put an octa-core processor in a laptop if you can just use a 4 core with SMT4? You get 16 threads to work with. All of your background tasks get assigned to those, and your gaming or more intensive productivity apps get distributed to your physical cores. Sure, it might not be a screamer for something that really needs all of the threads available... but it IN THEORY would be an amazing laptop to own. Your power use is much lower, your heat is much lower, and you get a crap ton of threads to utilize, so the excess power and cooling that would be used for a higher physical core count CPU can now be allocated to cooling a better video solution. Couple that with a PCIe-based storage solution and you're talking about a rather bad-ass laptop and, technologically speaking, not a lot of cost (in heat and/or power) from the CPU at least. (Oh, and don't forget about those PCIe lanes. You can drive every kind of ultra-fast port you want on that bad boy.)

GoodBoy · Oct 9, 2019

I agree with your theory in general, but I suspect what would make that seem like it works well is when the majority of the threads are idle. And a few cores can have many idle threads running on them just fine, without SMT.
I'm not sure the end user would see or notice any real benefit. And of course in certain workloads there likely would be some improvement, but the workloads I can think of where that would be the case are not things people would typically do on a laptop.

For desktops I think it would be really niche workloads that might scale well with 4 way SMT.

Bankie · Oct 9, 2019

Rockenrooster said:
I didn't say they can't, I just said most.
Although most people also have a bunch of crap installed (like me). Adding SMT alone can help a lot of things, like you said. A lot of these people that say "I don't need more cores/threads, they're useless to me!" don't realize that reviewers use very barebones systems with no crap running in the background besides FPS capture tools, or they run barebones systems themselves. CPUs with more cores/threads have better real-world performance over time as things get loaded up and bogged down, unless you reformat every 6 months... A 7700K might be faster than a 1700 for gaming... But would it be faster in gaming on my system with all the crap I have running??? Probably not lol.

HardwareUnboxed or someone did this test recently. Having background processes made no difference. Unless you're doing something taxing like streaming using the CPU or encoding video it's not going to matter.

SvenBent · Oct 9, 2019

Since nobody took the time to run a simple test, I guess it's up to me to put out some real numbers.

Windows 1903 is finally installed on my laptop, and this is the SMT thread conflict test I just ran:

4 threads, no affinity
591
591
598

4 threads, affinity 1357
602
599
602

4 threads, affinity 0246
597
599
597

There are no winners.
This does seem to indicate (again, can't prove a negative) that the Windows thread scheduler in 1903 has been improved to better understand SMT, as the SMT penalty is, if not gone, then almost gone in this test.
This is the first time this test has come back with a negative result since I started testing SMT on the i7 920 on Windows 7.

Again: my short test seems to show that 1903 has improved scheduling for SMT CPUs.

This was on an Intel i5 8300H.

P.S. The numbers are higher than in the previous test because I ran it with the power adapter connected.
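
For anyone who wants to compare the runs numerically, here is a small sketch that averages the scores above and builds the affinity bitmasks. It assumes that "affinity 0246" and "affinity 1357" mean logical CPUs 0, 2, 4, 6 and 1, 3, 5, 7; the benchmark itself is not reproduced, and the helper names are made up for the example.

```python
from statistics import mean

# Scores copied from the three runs posted above.
runs = {
    "no affinity": [591, 591, 598],
    "affinity 1357": [602, 599, 602],
    "affinity 0246": [597, 599, 597],
}


def affinity_mask(cpus):
    """Bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def summarize(results):
    """Average score per configuration."""
    return {name: mean(scores) for name, scores in results.items()}


def relative_spread(means):
    """Gap between best and worst average, relative to the worst."""
    lowest, highest = min(means.values()), max(means.values())
    return (highest - lowest) / lowest


if __name__ == "__main__":
    print("mask 0246:", hex(affinity_mask([0, 2, 4, 6])))
    print("mask 1357:", hex(affinity_mask([1, 3, 5, 7])))
    averages = summarize(runs)
    for name, avg in averages.items():
        print(f"{name}: {avg:.1f}")
    print(f"spread: {relative_spread(averages):.2%}")
```

The averages differ by a little over one percent, and the individual runs within one configuration already vary by about that much, which fits the "no winners" reading.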

os2wiz · Oct 10, 2019

IdiotInCharge said:
Which you can prove using multiple expert opinions that you can link...?

The same was said about HT / SMT2 when it arrived. Most theorized that it would make things worse, and well, it did in a number of cases. But it worked enough way back in 2002 that it was worth keeping around, such that AMD used SMT2 on Zen fifteen years later.

Moving from SMT2 to SMT4 and making it useful in the same manner that the move from a single-thread per core to SMT2 was useful is only a matter of making sure that the CPU cores are afforded the appropriate resources and the OS is updated to take advantage of them.

The fact is there is no pressing need for 4 way SMT any time soon except in the enterprise. When the need for it exists, then and only then will it become a part of consumer CPUs.

SvenBent · Oct 10, 2019

os2wiz said:
The fact is there is no pressing need for 4 way SMT any time soon except in the enterprise. When the need for it exists, then and only then will it become a part of consumer CPUs.

Speak for yourself pleb.

On the serious side, my CPU spends more of its CPU hours on highly scalable stuff that could probably use SMT4.
If SMT4 would just give me a 10% boost, that would mean finishing some jobs days sooner.

Some of my scripts easily scale the workload to above 1000 threads with almost no loss of efficiency.

IdiotInCharge · Oct 10, 2019

GoodBoy said:
Yeah, I think the single core performance we have now is at its maximum already. New CPUs come out with typically single-digit performance increases (IPC).

I think your premise here doesn't really take into account recent history -- Intel's Ice Lake architecture is somewhere around four years late. It represents a jump in IPC as tested, while Zen 2 is the culmination of AMD's first new architecture since they went backward with IPC with Bulldozer.

Let's say that Intel gets stuff back on schedule, which they're historically likely to do, and AMD continues to improve IPC with future Zen iterations (something they're not likely to do, given their history, but let's run with it), we should continue to see ~10% IPC increases every few years.

Further, what no one can really predict is where the application of machine learning to compiler and CPU design will take us. We could very well start to see large speedups just based on being smarter about how written code is compiled into machine code to be executed, using current CPU architecture paradigms, and we could also see new CPU paradigms emerge due to related research.

Bankie said:
HardwareUnboxed or someone did this test recently. Having background processes made no difference.

1. Please link.
2. 'Background processes', while broadly applicable, is also pretty nebulous in definition because it is different for every user and even different from instance to instance for most users.

And what you're really looking for is how the longest frametimes are affected, because that's what you 'feel'.
