Looking for 2990WX Answers

Seems like that would be a software issue not utilizing the cores available, should be an easy fix.
 
Seems like that would be a software issue not utilizing the cores available, should be an easy fix.
Well, I have been pointing this out for two months now. Premiere rolled out a new version this week; about to test it. However, it seemed to greatly hobble another CPU I happened to be reviewing at the moment.
 
It looks like the 2990WX is leaning too heavily on those cores that only access memory through the Infinity Fabric. Can you try blocking it away from those cores (cores 8-15 & 24-31 appear to be the ones with no direct memory access) with Process Lasso's core affinity and giving it a shot?
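If you'd rather script it than click through Process Lasso, here's a rough Python sketch of the same idea using psutil. The core numbers and the process name are just assumptions based on this thread's 16-cores-with-direct-access layout; with SMT on, Windows exposes 64 logical CPUs, so the list would need to be doubled up accordingly.

```python
# Rough sketch of the same idea without Process Lasso: restrict a running
# process to the cores that have directly attached memory. The core numbers
# (0-7 and 16-23 "direct", 8-15 and 24-31 "orphan") are taken from this
# thread and are an assumption, not a verified mapping.
import psutil

DIRECT_MEMORY_CORES = list(range(0, 8)) + list(range(16, 24))

def pin_to_direct_memory_cores(pid: int) -> None:
    """Keep the scheduler from placing this process on the memory-less dies."""
    proc = psutil.Process(pid)
    print("old affinity:", proc.cpu_affinity())
    proc.cpu_affinity(DIRECT_MEMORY_CORES)
    print("new affinity:", proc.cpu_affinity())

if __name__ == "__main__":
    # Hypothetical target: any running process whose name looks like Premiere Pro.
    for p in psutil.process_iter(["name"]):
        if p.info["name"] and "Premiere" in p.info["name"]:
            pin_to_direct_memory_cores(p.pid)
```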
 
It looks like the 2990WX is leaning too heavily on those cores that only access memory through the Infinity Fabric. Can you try blocking it away from those cores (cores 8-15 & 24-31 appear to be the ones with no direct memory access) with Process Lasso's core affinity and giving it a shot?
I would think this is the problem, or part of the problem, since the 2950X does not exhibit this. You would think after a couple of months AMD would have addressed an issue that is specific to workstation usage.
 
Just checked PP v13...it is even more broken. Actually, all CPU encode times are up from v12.1.2.

 
Conspiracy or laziness? Or avoiding a fix because of low market penetration?
 
Conspiracy or laziness? Or avoiding a fix because of low market penetration?

I speculated before the TR2 launch that there might be some workloads where there could be scaling issues due to the way memory is accessed on the CCXs. Likely due to the scheduler being unaware of which CCXs have direct access.
 
I speculated before the TR2 launch that there might be some workloads where there could be scaling issues due to the way memory is accessed on the CCXs. Likely due to the scheduler being unaware of which CCXs have direct access.

That's exactly what 'Dynamic Local Mode' does: it prioritizes memory-hungry threads onto the CCX clusters with direct memory access.

One would wonder if it would have been better to run one memory channel per die like 1+1+1+1 instead of the 2+0+2+0 system they have. Amazing chip, shitty software integration.
 
I speculated before the TR2 launch that there might be some workloads where there could be scaling issues due to the way memory is accessed on the CCXs. Likely due to the scheduler being unaware of which CCXs have direct access.
Of course this is the issue and it can be optimised around, so e.g. it should be no slower than a 2-die TR configuration (2950X), but it doesn't explain why performance has regressed further with an update now...
 
Kyle, try Process Lasso.

I use it and LOVE it.....
So cutting the affinity down to the cores that touch the memory, basically turning the 2990WX into a 2950X, brings it back to within about 10% of the 2950X.


Using Process Lasso with full affinity brings the encode time down to 665 seconds.
 
Kyle, trade me your 2990WX, I'll ship you my 2950X and we'll both be happy. Ha ha
If I had purchased one for myself already, I would. Alas I only have one to test with and I do not feel like shelling out the $1800 for the one I was going to buy to use primarily with Premiere Pro.
 
I agree with your tweets that effectively turning off half the cores of the CPU to get at least close to the times provided by a cheaper chip is 100% unacceptable. I believe the root of the problem is that PP is using those (I like to call them orphan) cores that have no direct memory access, something that is pretty unique to the AMD WX chips. As I understand it, the big brother Epyc doesn't have that problem since all cores have access to a couple of channels of local memory (8 channels total). Even multi-socket systems have their own local channels of memory to work with.

PP appears to be very tied to memory latency if that is the case, and I am not sure there is a fix beyond Adobe changing PP to use those orphan core packages for tasks that are not memory-latency sensitive. Honestly, I doubt that will ever happen since the 2990WX is a pretty niche-market chip and the R&D costs would be huge. As for AMD fixing it, the only way I see that happening is if they effectively make it an Epyc chip by turning on all 8 of the memory channels.
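As an aside, on Linux those orphan dies are easy to see: they show up as NUMA nodes that have CPUs but essentially no local memory. Here is a quick sketch that just reads the standard sysfs layout; nothing in it is 2990WX-specific, it simply prints what the kernel reports.

```python
# Print each NUMA node's CPUs and local memory (Linux sysfs). On a 2990WX-style
# topology, the "orphan" dies should appear as nodes with CPUs but ~0 MB local.
import glob
import re

for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(node_dir + "/cpulist") as f:
        cpus = f.read().strip()
    total_kb = 0
    with open(node_dir + "/meminfo") as f:
        for line in f:
            m = re.search(r"MemTotal:\s+(\d+)\s+kB", line)
            if m:
                total_kb = int(m.group(1))
    node = node_dir.rsplit("/", 1)[-1]
    print(f"{node}: cpus={cpus}  local_mem={total_kb // 1024} MB")
```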
 
That's exactly what 'Dynamic Local Mode' does: it prioritizes memory-hungry threads onto the CCX clusters with direct memory access.

One would wonder if it would have been better to run one memory channel per die like 1+1+1+1 instead of the 2+0+2+0 system they have. Amazing chip, shitty software integration.

Memory-hungry does not mean bandwidth-hungry (although there is a correlation). I'm assuming it's using cache hits or something similar to determine accesses, which may be inaccurate in certain situations. IMO, 1+1+1+1 is better than 2+0+2+0. Any time you have uneven resource access, you run into potential problems. But the problem with the former is that any 2-die TR/TR2 would only have 1+1 instead of 2+2.

Of course this is the issue and it can be optimised around, so e.g. it should be no slower than a 2-die TR configuration (2950X), but it doesn't explain why performance has regressed further with an update now...

Speaking as a software engineer, it is extremely irritating to code around architectural deficiencies. As I said before the TR2 launch, it was a huge oversight NOT to include full memory support. I've linked an ATX-sized 1S Epyc board, so it's not even an issue of making a board layout. This is very similar (but to a lesser extent, because Infinity Fabric is faster than QPI) to having a 2S system but not populating memory in the second socket. We've had customers that do that (and wonder why their apps don't scale well), and I wish I could punch them in the face remotely.
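Just to illustrate what "coding around it" means in practice, here is a minimal Python sketch of the kind of thing an application would have to do: pin latency-sensitive work to a die with local DRAM and let pure number-crunching run on the orphan cores. os.sched_setaffinity() is Linux-only, and the core ranges are again just the assumption from earlier in this thread.

```python
# Minimal sketch of per-thread pinning: keep the memory-heavy worker on a die
# with directly attached DRAM, let the compute-only worker use the other cores.
# os.sched_setaffinity() wraps the per-thread syscall, so pid 0 affects the
# calling thread. Core ranges are assumptions from this thread, not verified.
import os
import threading

DIRECT_MEMORY_CPUS = set(range(0, 8)) | set(range(16, 24))
ORPHAN_CPUS = set(range(8, 16)) | set(range(24, 32))

def memory_heavy_worker(buf: bytearray) -> int:
    os.sched_setaffinity(0, DIRECT_MEMORY_CPUS)   # stay close to local DRAM
    return sum(buf)                               # stand-in for latency-sensitive work

def compute_only_worker(n: int) -> int:
    os.sched_setaffinity(0, ORPHAN_CPUS)          # fine on the memory-less dies
    return sum(i * i for i in range(n))           # stand-in for pure arithmetic

if __name__ == "__main__":
    t1 = threading.Thread(target=memory_heavy_worker, args=(bytearray(64 * 1024 * 1024),))
    t2 = threading.Thread(target=compute_only_worker, args=(10_000_000,))
    t1.start(); t2.start()
    t1.join(); t2.join()
```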

I've suggested that people interested in high-core-count systems test a TR2 (4-die version) before committing to deployment, because their software may have scaling issues.
 
I don't think these problems exist on Linux.

I am truly thinking this is 100% the Windows 10 scheduler.

I have seen reports of Linux performing astronomically better across the board with the 32-core.
 
I got the workload I am using over to AMD last night. It seems that, after a bit of press on this, they are taking it very seriously. They have Adobe involved as well. Not sure about Microsoft yet.
 
IMO the WX models are more of a "holy shit, we didn't realize how well things were going and there is demand, let's see what we can shove into an existing platform".

I seriously doubt they planned for full-die TR; even TR 1xxx itself started as the engineers' love letter to enthusiasts, by their own account.

I personally expect the next AMD HEDT platform to straight up use the latest Epyc socket as-is, now that they know there is actual demand and growth potential there.

They can still differentiate on features, RAM support, etc. through motherboard/chipset tiers and firmware. Intel did it for ages, which some of us could make use of in cool ways (X58, X79, X99), until they started closing doors (fuck the polished turd known as LGA 2066).

Anyway, back on topic: if what you need isn't on Linux, Process Lasso or similar is great until Windows catches up, lets you override stupid programs, and offers true fine-grained user control of the scheduler. Don't hold your breath too long.
 
 
Kyle. A man of action. If I could give you a fist bump all the way from Sydney, I would.
I am not so sure there are not some things "broken" in PPro, but I am going to redo my workload to get rid of some irregularities, and will start out simple and see how the numbers line up again. One thing that is for sure is that AMD has been able to reproduce some of my results in their lab with my PPro project that I sent over to them. So we all know that I am not crazy now....at least about this.
 