cageymaru

Wendell at Level1Techs has done extensive testing of AMD Ryzen Threadripper 2990WX systems in an attempt to find out what is causing performance regressions under Windows. When the AMD Ryzen Threadripper 2990WX first arrived on the scene, the observed regressions were blamed on its NUMA design and its use of only 4 memory channels. But further testing showed that when running apps native to both Windows and Linux, the regressions appeared under Windows while the same system was extremely fast under Linux. If memory bandwidth were the issue, the regressions should have appeared under Linux as well.

Even testing an AMD EPYC 7551, a 32 core/64 thread monster with 8 memory channels, revealed the same performance regressions. After conferring with other hardware testers, Wendell now believes he has found the culprit: the Windows kernel! Wendell and another brilliant tech enthusiast, Jeremy at Bitsum, collaborated to create a utility called CorePrio to work around the Windows kernel behavior that can effectively confine a process to a single NUMA node. They say it gave them double the performance in their testing with Indigo. The article is a deep dive into the technical aspects of the problem and the solution, and it is a highly recommended read!
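For those wondering what a user-mode tool can even do about scheduling: the basic mechanism is simply enumerating another process's threads and re-applying affinity masks on a timer. Below is a minimal C++ sketch of that mechanism using public Win32 calls; it is not CorePrio's actual logic (which prioritizes specific threads against a chosen NUMA mask), and the PID and 500 ms interval here are placeholder assumptions.

// Sketch only: periodically re-apply an affinity mask to every thread of one
// target process, so the scheduler is free to place them on any logical CPU
// in that mask. Illustrates the user-mode mechanism, not CorePrio's policy.
// Note: DWORD_PTR masks cover a single processor group (up to 64 logical CPUs).
#include <windows.h>
#include <tlhelp32.h>

static void ReapplyThreadAffinities(DWORD pid, DWORD_PTR mask)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return;

    THREADENTRY32 te = { sizeof(te) };
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;
        HANDLE th = OpenThread(THREAD_SET_INFORMATION | THREAD_QUERY_INFORMATION,
                               FALSE, te.th32ThreadID);
        if (th) {
            SetThreadAffinityMask(th, mask);   // let this thread run anywhere in 'mask'
            CloseHandle(th);
        }
    }
    CloseHandle(snap);
}

int main()
{
    const DWORD pid = 1234;                    // placeholder: PID of the target process
    DWORD_PTR procMask = 0, sysMask = 0;
    HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!proc || !GetProcessAffinityMask(proc, &procMask, &sysMask)) return 1;
    CloseHandle(proc);

    for (;;) {                                 // crude placeholder refresh loop
        ReapplyThreadAffinities(pid, procMask);
        Sleep(500);
    }
}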

This is most likely related to a bugfix from Microsoft for 1- or 2-socket Extreme Core Count (XCC) Xeons, wherein a physical Xeon CPU presents two NUMA nodes. In the past (with Xeon v4, and maybe v3), one of these NUMA nodes had no access to I/O devices (but did have access to memory through the ring bus). If that's true, then the work-around that keeps such a process on the "ideal CPU" in the same socket has no idea what to do when there is more than one other NUMA node in the same package to "fail over" to.
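If that reading is right, the easiest way to see the topology the scheduler has to cope with is to enumerate the NUMA nodes Windows exposes. A rough sketch using standard Win32 calls (the expectation that a 2990WX in NUMA mode reports four nodes, two of them with no local memory, is an assumption based on AMD's published topology, not something I have verified here):

// Sketch: list each NUMA node's processor mask and available local memory.
#include <windows.h>
#include <cstdio>

int main()
{
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest)) return 1;

    for (ULONG node = 0; node <= highest; ++node) {
        GROUP_AFFINITY aff = {};
        ULONGLONG availBytes = 0;
        GetNumaNodeProcessorMaskEx((USHORT)node, &aff);        // CPUs on this node
        GetNumaAvailableMemoryNodeEx((UCHAR)node, &availBytes); // local memory left
        printf("node %lu: group %u, cpu mask %016llx, available memory %llu MB\n",
               node, (unsigned)aff.Group,
               (unsigned long long)aff.Mask,
               (unsigned long long)(availBytes >> 20));
    }
    return 0;
}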
 
My main complaints with this issue have been x264/x265 encoding in HandBrake and Premiere Pro. Some initial tests did not show this "fix" being effective for that. However, that was on a 2990WX system set up on a previous Win 10 install from an Intel box, which was likely a month+ old, so I am doing a fresh Win 10 Pro install now to go back and check whether this makes any difference. The x264/x265 encode issues with the 2990WX were the only reason I recently built my new system with the 2950X. Will report back later Thursday, hopefully.
 
Well you can download Handbrake for Ubuntu - might be worth firing up for an A/B test against Handbrake on a fresh W10.
Why do I need Ubuntu to tell me something is wrong with HandBrake in Windows? 2990WX is slower than a 2950X and about the same speed as a 2700X. I have been talking to AMD about this being broken since the 2990WX was launched.
 
Not shocked in the least; the Windows scheduler has been optimized a lot in recent years, and has had some really *odd* performance in NUMA systems in some situations for a while now. I outright called a NUMA scheduler problem about a month ago.
 
Interesting stuff, and kudos if they're correct. After reading some of the sly conspiracy comments on Level1 I had to smile, because it is so easy to introduce a bug like this. It's so easy it sometimes makes you wonder how anything works. But it's equally easy to read the researcher's explanations, then look at a big corporation like Microsoft and say, "How could you let something like this happen without spotting it?"

Complex systems are complex.
 
Fixing problems with quick and dirty code is so common elsewhere that I'm not surprised to see it in kernels as well.
 
Interesting stuff, and kudos if they're correct. After reading some of the sly conspiracy comments on Level1 I had to smile, because it is so easy to introduce a bug like this. It's so easy it sometimes makes you wonder how anything works. But it's equally easy to read the researcher's explanations, then look at a big corporation like Microsoft and say, "How could you let something like this happen without spotting it?"

Complex systems are complex.

I think they should release a kernel patch as a hotfix for anyone who needs this. Doing it as a regular update would not be necessary.
 
Out of schedule Windows patching is usually reserved for CVEs and beta code. I bet the fix is slated for the next "patch Tuesday".

I think they should release a kernel patch as a hotfix for anyone who needs this. Doing it as a regular update would not be necessary.
 
And this is one of many reasons I believe that Microsoft is going to make a Windows that runs on the Linux kernel, a "Windows X." Besides the whole Embrace, Extend, and Extinguish angle, it'll probably be cheaper for them to use code that actually works and that isn't written by idiots.
 
Interesting stuff, and kudos if they're correct. After reading some of the sly conspiracy comments on Level1 I had to smile, because it is so easy to introduce a bug like this. It's so easy it sometimes makes you wonder how anything works. But it's equally easy to read the researcher's explanations, then look at a big corporation like Microsoft and say, "How could you let something like this happen without spotting it?"

Complex systems are complex.

The MS thread scheduler is kinda horrid and does not take a lot of newer technology into account, like:

SMT
CMT
CCX
Chiplets

It just sees all logical cores as equal, and that's not always the case.

E.g. on an SMT/CMT system: if the scheduler filled out quanta on every even (or every odd) logical core first, and only spilled onto the remaining logical cores when there were still active threads left over, you would gain close to a 20-30% performance boost in CPU-heavy workloads where the CPU has too many logical cores (a rough user-space sketch of that placement follows this post).

That's why I made Project Mercury; it really improves min FPS and average FPS in a lot of games, like a 23% FPS boost in Battlerite on Ryzen.
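As a rough user-space sketch of that placement, assuming the common Windows layout where logical CPUs 2n and 2n+1 are SMT siblings of the same physical core (a robust version would confirm the layout with GetLogicalProcessorInformationEx and handle multiple processor groups):

// Sketch: spawn one worker per physical core and pin each to an even-numbered
// logical CPU (0, 2, 4, ...), leaving the SMT siblings free. Assumes siblings
// are adjacent and the machine has a single processor group.
#include <windows.h>
#include <vector>

DWORD WINAPI Worker(LPVOID) { /* CPU-heavy work goes here */ return 0; }

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const DWORD workers = si.dwNumberOfProcessors / 2;   // one per physical core

    std::vector<HANDLE> threads;
    for (DWORD i = 0; i < workers; ++i) {
        HANDLE t = CreateThread(nullptr, 0, Worker, nullptr, CREATE_SUSPENDED, nullptr);
        if (!t) break;
        SetThreadAffinityMask(t, DWORD_PTR(1) << (2 * i));  // even logical CPUs first
        ResumeThread(t);
        threads.push_back(t);
    }
    WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
    for (HANDLE t : threads) CloseHandle(t);
    return 0;
}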
 
Out of schedule Windows patching is usually reserved for CVEs and beta code.
Depends. Microsoft might publish a hotfix sooner if pressed enough by media reports. Or it could take another round or two of the Windows 10 release cycle.

As AMD claimed to be in talks with Microsoft on the topic of 2990WX performance, it would be nice if Kyle or someone else could ask their contacts at AMD/Microsoft to what extent they were already aware of the results from Level1Techs.
 
The MS thread scheduler is kinda horrid and does not take a lot of newer technology into account, like:

SMT
CMT
CCX
Chiplets

It just sees all logical cores as equal, and that's not always the case.

If the scheduler filled out quanta on every even (or every odd) logical core first, and only spilled onto the remaining logical cores when there were still active threads left over, you would gain close to a 20-30% performance boost in CPU-heavy workloads where the CPU has too many logical cores.

That's why I made Project Mercury; it really improves min FPS and average FPS in a lot of games, like a 23% FPS boost in Battlerite on Ryzen.

Downloading now as I have an OC'd 1700!
 
As AMD claimed to be in talks with Microsoft on the topic of 2990WX performance, it would be nice if Kyle or someone else could ask their contacts at AMD/Microsoft to what extent they were already aware of the results from Level1Techs.
I had been discussing this at length with AMD and Adobe back in November, and they were not onto this line of thinking at all.
 
Oh? So, you mean optimizing for the Intel platform was a bad move after all, eh? Looks to me like the FX series would have performed better if they had actually optimized for it back then. Things happen, people figure out what is going on, fixes occur, wash, rinse, repeat.
Nah, but it was a bit of a shortsighted optimization. Sure, more than two NUMA nodes might be rare, but when there are, you would probably expect to lose a good bit of performance in the target workloads of that CPU.
 
Here are my results this morning using CorePrio. Our main issues have been with x264/x265 encoding in HandBrake and Premiere Pro. With both applications, our encode times with the 2990WX are longer than with the 2950X and not much faster than with a 2700X. HandBrake has limited scaling above 8 cores, but using a 2990WX incurs a penalty with HandBrake in our experience. I did a fresh install of Win 10 Pro with all updates as of 2am today. The latest versions of PP (v13.0.2) and HB (v1.2.0) were used. We also threw in turning off Core 0 through the affinity settings.

You can see some of our past results here, which are NOT directly comparable to the information below, but will give you an idea of what we have been looking at since launch.

[Charts: past 2990WX encode-time results referenced above]

System reboots between all runs.


PPro - Using CorePrio here actually gave us worse encode times.

Fully stock settings: 9min 32sec / 9min 36sec.
CorePrio: 12min 24sec / 11min 43sec.
No Core 0 Affinity: 8min 23sec / 9min 04sec / Crash. (You can see where slipping in Ian Cutress' "0 Affinity" method did once give us a faster time by 1min, the second run was close to stock, and the third run crashed.)

HandBrake - Using CorePrio made no discernible difference.
Fully stock: 5min 37sec.
CorePrio: 5min 25sec.
No Core 0 Affinity: 5min 33sec.

While I VERY MUCH wanted to see this work with PPro, that is just not the case. :(
 
I mean, if your CPU is performing as expected on Linux but not on Windows, then it's pretty obvious the CPU is not the weak link. Wintel is a thing.
 
Phoronix also tested Windows Server 2016/2019 and overall it wasn't faster.
https://www.phoronix.com/scan.php?page=article&item=windows-server-2990wx
Microsoft goes out of their way to tell everyone "one kernel," one whatever, blah blah blah. There are some software config differences, e.g. 4 sockets are only supported on the Workstation/Server products, not Win10 Pro. Microsoft needs to experiment in the kernel here, is likely scared to death to do so, and will instead experiment on anyone that picks up Windows 10 for Workstations, would be my guess.
 
Well, there are always alternative OSes...

Although I hope that a fix can be found. One day I would like to upgrade my TR and not worry about a performance hit.
 
Microsoft needs to experiment in the kernel here, is likely scared to death to do so, and will instead experiment on anyone that picks up Windows 10 for Workstations, would be my guess.
Actually, I think the problem is rather a lack of a robust scheduler implementation. The "ideal_cpu" hack is so typical of Microsoft culture (or DNA?). When I learned about it from your article, I was immediately reminded of many other things coming out of Microsoft in the past:

Chromium developers rocking 24 core Xeon workstations and running into all sorts of strange scaling problems: https://randomascii.wordpress.com/category/investigative-reporting/
Edge not performing well on YouTube after video gets overlaid with transparent <div>, a Google engineer responds: https://news.ycombinator.com/item?id=18702383
And it is not a recent trend. This continues down to OOXML specification lacking generality, extensibility and flexibility, and probably much further: http://www.robweir.com/blog/2007/01/how-to-hire-guillaume-portes.html
 
Is there any chance you can fiddle with the timing box in CorePrio? Try like... 75ms? We noticed that for apps like 7zip, absurdly small numbers seemed to help, but performance was all over the place.
12min 03sec run at 75ms.

8min 40sec at 64T/500ms

Also, if you can, use bitd -p processname to produce dumps similar to https://level1techs.com/sites/default/files/uploads/indigo-numa-slow-new.txt -- which should immediately tell us if the thread affinity is actually being reset, or if the process is resetting/protecting it somehow.

C:\Program Files\coreprio>coreprio bitd -p Adobe Premiere Pro

CorePrio (c)2018 Bitsum LLC - https://bitsum.com
build 0.0.3.6 Jan 1 2019

BEGINNING THREAD MANAGEMENT - PRESS CTRL+C TO SIGNAL STOP

Refresh rate is 500 ms
Managing up to 32 software threads simultaneously
System has 64 logical CPU cores and an active mask of FFFFFFFFFFFFFFFF
Using prioritized affinity of 00000000FFFFFFFF
NUMA Fix is NOT applied

Managing 0 threads (of 1159 in 160 processes) ...
Managing 32 threads (of 1159 in 160 processes) ...
Managing 15 threads (of 1156 in 160 processes) ...
Managing 25 threads (of 1155 in 160 processes) ...
Managing 18 threads (of 1153 in 160 processes) ...
Managing 25 threads (of 1151 in 160 processes) ...
Managing 18 threads (of 1151 in 160 processes) ...
Managing 28 threads (of 1151 in 160 processes) ...
Managing 20 threads (of 1151 in 160 processes) ...
Managing 24 threads (of 1151 in 160 processes) ...
Managing 20 threads (of 1151 in 160 processes) ...
Managing 20 threads (of 1153 in 160 processes) ...
Managing 23 threads (of 1153 in 160 processes) ...
Managing 25 threads (of 1161 in 160 processes) ...
Managing 22 threads (of 1159 in 160 processes) ...
Managing 18 threads (of 1160 in 160 processes) ...
Managing 20 threads (of 1158 in 160 processes) ...
Managing 20 threads (of 1158 in 160 processes) ...
Managing 24 threads (of 1158 in 160 processes) ...
Managing 20 threads (of 1158 in 160 processes) ...
Managing 18 threads (of 1159 in 160 processes) ...
Managing 21 threads (of 1157 in 160 processes) ...
Managing 22 threads (of 1157 in 160 processes) ...
Managing 23 threads (of 1157 in 160 processes) ...
Managing 17 threads (of 1157 in 160 processes) ...
Managing 29 threads (of 1156 in 160 processes) ...
Managing 14 threads (of 1157 in 160 processes) ...
Managing 27 threads (of 1157 in 160 processes) ...
Managing 17 threads (of 1155 in 160 processes) ...
Managing 24 threads (of 1155 in 160 processes) ...
Managing 24 threads (of 1156 in 160 processes) ...
Managing 27 threads (of 1156 in 160 processes) ...
Managing 20 threads (of 1156 in 160 processes) ...
Managing 25 threads (of 1156 in 160 processes) ...
Managing 22 threads (of 1156 in 160 processes) ...
Managing 22 threads (of 1156 in 160 processes) ...
Managing 21 threads (of 1156 in 160 processes) ...
Managing 20 threads (of 1156 in 160 processes) ...
Managing 22 threads (of 1156 in 160 processes) ...
Managing 23 threads (of 1156 in 160 processes) ...
Managing 20 threads (of 1153 in 160 processes) ...
Managing 23 threads (of 1153 in 160 processes) ...
Managing 21 threads (of 1153 in 160 processes) ...
Managing 18 threads (of 1153 in 160 processes) ...
Managing 25 threads (of 1153 in 160 processes) ...
Managing 16 threads (of 1154 in 160 processes) ...
Managing 23 threads (of 1153 in 160 processes) ...
Managing 20 threads (of 1153 in 160 processes) ...
Managing 23 threads (of 1152 in 160 processes) ...
Managing 23 threads (of 1151 in 160 processes) ...
Managing 17 threads (of 1151 in 160 processes) ...
Managing 24 threads (of 1150 in 160 processes) ...
Managing 22 threads (of 1150 in 160 processes) ...
Managing 28 threads (of 1150 in 160 processes) ...
Managing 21 threads (of 1150 in 160 processes) ...
Managing 23 threads (of 1150 in 160 processes) ...
Managing 17 threads (of 1150 in 160 processes) ...
Managing 28 threads (of 1150 in 160 processes) ...
Managing 16 threads (of 1150 in 160 processes) ...
Managing 24 threads (of 1150 in 160 processes) ...
Managing 18 threads (of 1150 in 160 processes) ...
Managing 25 threads (of 1154 in 160 processes) ...
Managing 21 threads (of 1154 in 160 processes) ...
Managing 26 threads (of 1154 in 160 processes) ...
Managing 25 threads (of 1154 in 160 processes) ...
Managing 23 threads (of 1141 in 160 processes) ...
Managing 22 threads (of 1141 in 160 processes) ...
Managing 23 threads (of 1141 in 160 processes) ...
Managing 21 threads (of 1141 in 160 processes) ...
Managing 20 threads (of 1137 in 160 processes) ...
Managing 20 threads (of 1136 in 160 processes) ...
Managing 22 threads (of 1137 in 160 processes) ...
Managing 18 threads (of 1135 in 160 processes) ...
Managing 26 threads (of 1134 in 160 processes) ...
Managing 19 threads (of 1134 in 160 processes) ...
Managing 23 threads (of 1134 in 160 processes) ...
Managing 25 threads (of 1134 in 160 processes) ...
Managing 22 threads (of 1135 in 160 processes) ...
Managing 22 threads (of 1135 in 160 processes) ...
Managing 22 threads (of 1135 in 160 processes) ...
Managing 16 threads (of 1134 in 160 processes) ...
Managing 25 threads (of 1134 in 160 processes) ...
Managing 23 threads (of 1134 in 160 processes) ...
Managing 18 threads (of 1134 in 160 processes) ...
Managing 23 threads (of 1134 in 160 processes) ...
Managing 19 threads (of 1134 in 160 processes) ...
Managing 23 threads (of 1134 in 160 processes) ...
Managing 19 threads (of 1134 in 160 processes) ...
Managing 24 threads (of 1134 in 160 processes) ...
Managing 15 threads (of 1129 in 160 processes) ...
Managing 24 threads (of 1129 in 160 processes) ...
Managing 20 threads (of 1130 in 160 processes) ...
Managing 26 threads (of 1130 in 160 processes) ...
Managing 22 threads (of 1130 in 160 processes) ...
Managing 25 threads (of 1130 in 160 processes) ...
Managing 23 threads (of 1130 in 160 processes) ...
Managing 24 threads (of 1131 in 160 processes) ...
Managing 25 threads (of 1131 in 160 processes) ...
Managing 17 threads (of 1131 in 160 processes) ...
Managing 24 threads (of 1131 in 160 processes) ...
Managing 24 threads (of 1131 in 160 processes) ...
Managing 23 threads (of 1131 in 160 processes) ...
Managing 19 threads (of 1131 in 160 processes) ...
Managing 19 threads (of 1132 in 160 processes) ...
Managing 24 threads (of 1132 in 160 processes) ...
Managing 18 threads (of 1133 in 160 processes) ...
Managing 24 threads (of 1133 in 160 processes) ...
Managing 18 threads (of 1133 in 160 processes) ...
Managing 21 threads (of 1130 in 160 processes) ...
Managing 22 threads (of 1129 in 160 processes) ...
Managing 24 threads (of 1127 in 160 processes) ...
Managing 17 threads (of 1126 in 160 processes) ...
Managing 25 threads (of 1126 in 160 processes) ...
Managing 17 threads (of 1123 in 160 processes) ...
Managing 27 threads (of 1123 in 160 processes) ...
Managing 21 threads (of 1124 in 160 processes) ...
Managing 21 threads (of 1124 in 160 processes) ...
Managing 15 threads (of 1124 in 160 processes) ...
Managing 21 threads (of 1124 in 160 processes) ...
Managing 21 threads (of 1124 in 160 processes) ...
Managing 19 threads (of 1124 in 160 processes) ...
Managing 21 threads (of 1124 in 160 processes) ...
> Ctrl-C event
Managing 22 threads (of 1125 in 160 processes) ...
Unmanaging all threads
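
For reference, a bare-bones per-thread dump along those lines can also be produced with plain Win32 calls. This is just a sketch (not how bitd itself works; the PID comes in as a command-line argument) that prints each thread's ideal processor and current affinity mask:

// Sketch: print each thread's ideal processor and current group affinity for
// one process, to check whether anything is actually resetting them.
#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    const DWORD pid = (argc > 1) ? (DWORD)atoi(argv[1]) : 0;
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    THREADENTRY32 te = { sizeof(te) };
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;
        HANDLE th = OpenThread(THREAD_QUERY_INFORMATION, FALSE, te.th32ThreadID);
        if (!th) continue;

        GROUP_AFFINITY aff = {};
        PROCESSOR_NUMBER ideal = {};
        GetThreadGroupAffinity(th, &aff);        // current affinity mask
        GetThreadIdealProcessorEx(th, &ideal);   // the scheduler's "ideal CPU"
        printf("tid %6lu  ideal %u:%u  affinity %016llx\n",
               te.th32ThreadID, (unsigned)ideal.Group, (unsigned)ideal.Number,
               (unsigned long long)aff.Mask);
        CloseHandle(th);
    }
    CloseHandle(snap);
    return 0;
}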
 
Dang it... there goes my master plan for getting a Threadripper at a discount price. Problem wasn't supposed to be exposed until later this year... sigh...
 
You can still get a Threadripper. Not all of them suffer from this....
 
I just wanted 6-8 cores with close to 5GHz and quad-channel 3600 CR1. When will this happen?!



This indeed is an interesting situation. I'm very curious how AMD didn't have this all lined out with MS before release.
 
I'm very curious how AMD didn't have this all lined out with MS before release.

I'd assume that they wagered that the number of people who wanted >US$1000 24-core+ CPUs with low clockspeeds as desktop toys was fairly slim.

Even here, Kyle could dual-boot Ubuntu to get the work done in Davinci. Which is what most non-[H] people would do ;).
 
I'd assume that they wagered that the number of people who wanted >US$1000 24-core+ CPUs with low clockspeeds as desktop toys was fairly slim.

Even here, Kyle could dual-boot Ubuntu to get the work done in Davinci. Which is what most non-[H] people would do ;).

Wouldn't the fact remain that even the EPYC processors would suffer from this issue?
 
Wouldn't the fact remain that even the EPYC processors would suffer from this issue?

Do they? I thought the issue was down to the latencies involved in having CCX modules on Threadripper that don't have local memory access, whereas the EPYC CCXs all do?
 
Do they? I thought the issue was down to the latencies involved in having CCX modules on Threadripper that don't have local memory access, whereas the EPYC CCXs all do?


Ahh. I know I need to do some more reading on them, but I thought all but that big bastard were the same as the TR.
 
Wouldn't the fact remain that even the EPYC processors would suffer from this issue?
Do they? I thought the issue was down to the latencies involved in having CCX modules on Threadripper that don't have local memory access, whereas the EPYC CCXs all do?
You haven't seen the video by level1techs.com; Epyc suffers too.
 
Gave CorePrio a run with my 1950X @ 3.9 and Indigo.

Setting CorePrio to NUMA Dissociator mode took me from 1.008 to 1.7x in the bedroom scene. Don't recall the other numbers, as I was playing with it quite a few hours ago, before I read this article/thread.

Leaving it in standard mode or any other settings resulted in no change in performance from stock settings.

It also gave me some extra performance in BF5, I think, but it might be kicking in a bug with Vega that I thought was dead: the card for some reason drops usage (frequency stays up, but the blue lights on the side of the card drop and performance drops as well). It used to be that a reboot would fix the issue, so I'm going to try that and see if it's just an issue that came back with a recent patch or is being caused by CorePrio.
 
Yeah, I doubt this is a big enough priority for them to do an out-of-schedule hotfix for a performance issue. It's not a security issue, it isn't deleting users' files, and the systems _do_ work, just not as well. I highly doubt this will come out of schedule.

Depends. Microsoft might publish a hotfix sooner if pressed enough by media reports. Or it could take another round or two of the Windows 10 release cycle.

As AMD claimed to be in talks with Microsoft on the topic of 2990WX performance, it would be nice if Kyle or someone else could ask their contacts at AMD/Microsoft to what extent they were already aware of the results from Level1Techs.
 