AMD Zen Rumours Point to Earlier Than Expected Release

Yeah, and those points seem intentionally vague, like where Jim Keller talks about using the DNA of Jaguar and Bulldozer. And the cross-talk about K12 and Zen also makes it harder to come to firm conclusions.

But he starts by talking about making a big leap, and you can't make a big leap if you just copy and paste things and give them a new name...


It's definitely not going to be a copy and paste. But AMD still has good IP it can use; it's just a question of how to fully utilize it with the rest of the chip. Let's take BD as an example: it's fine with multithreaded code that can utilize all its cores, and the only time it really falls down is when IPC is needed along with multithreaded code.
 

Thank you. I know there are other areas where I do not agree with you, but this is definitely one where I do. Folks here seem to be going off the deep end with little to no proof behind what they are saying. One even claims that Zen will be only 10% faster than Bulldozer, even though Piledriver is already beyond that.

Truly, none of us here know how it is going to turn out. Heck, that is why I decided not to wait and picked up a 6700K at home, although I found it not to be any faster at 4K when all is said and done. It is faster in regards to the chipset, though.
 
I think the idea that the core count will stay the same is exceedingly optimistic too. People talk about a high-end 8C/16T part in the $400 range, but I'd put money on most parts being 4C/8T just like Intel, with maybe a high-end 6C/12T part depending on thermals. Honestly, perf/W is really what AMD needs to compete on, since the mobile world is booming and a slow CPU with high power draw has a snowball effect on the quality of the rest of the system, regardless of how cheap it is.

So I stand by my statement.

BD + 40% IPC + 20% SMT = super neat
BD / 2 (cores cut in half) + 40% IPC + 20% SMT - 10% (or more) clock speed = zendozer
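Spelled out as plain arithmetic (a rough sketch that just takes the rumored figures above at face value, nothing more):

```c
#include <stdio.h>

int main(void) {
    /* Baseline: Bulldozer's multithreaded throughput, normalized to 1.0.
       Every multiplier below is a rumored/speculated figure from this thread,
       not anything AMD has confirmed. */
    double bd = 1.0;

    /* Same core count, plus the rumored +40% IPC and +20% from SMT. */
    double same_cores = bd * 1.40 * 1.20;

    /* Cores cut in half, same IPC/SMT gains, and ~10% lower clocks. */
    double half_cores = (bd / 2.0) * 1.40 * 1.20 * 0.90;

    printf("same core count: %.2fx BD\n", same_cores);  /* ~1.68x */
    printf("half core count: %.2fx BD\n", half_cores);  /* ~0.76x -> "zendozer" */
    return 0;
}
```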

[AMD roadmap image]


I wish people would actually do like half a second of research before hopping on the doomsday train. It's been known for ages that Zen will basically be 8C/16T tops on the consumer platform. This is the latest roadmap from AMD and it clearly says "high core count." Unless someone will seriously try to argue that AMD thinks 4 cores is a "high" count when they spent years selling Bulldozer and marketing the fact that they had 8 cores...

Here's another roadmap from almost a year ago that explicitly says "up to 8 cores"
[AMD roadmap image]


Also, your math at the end of that post makes no sense. You can't just take "BD" and divide it by two as if that would be a clear-cut comparison to a Zen core. You could take Hypothetical Zen with its 40% increase over Excavator (not Bulldozer), and even with 10% lower clocks it would still mop the floor with Bulldozer, just like Intel's current stuff does.

Also, mind blown @ the guy who refuted what Jim Keller himself said about Zen/K12 being clean-sheet designs. You clearly know nothing about AMD's older architectures, because if you did, you'd know that Greyhound (10h, i.e. Phenom I & II) topped out ages ago, and simply going back to it and buffing it up would get them nowhere. They already shrank Greyhound down to 32nm with Llano and it didn't really shatter any records. They made minor tweaks here and there and got a 7% IPC increase in best-case scenarios, but the clocks were lower than Phenom II's were on average, so it didn't matter.

Greyhound traces its lineage all the way back to K7. K8 was literally just a slightly reworked K7 core with an on-die memory controller, which greatly reduced latencies, among a few other additions. 10h was built on top of K8. So no, Zen is NOT some mythical reworked Greyhound core. I have no idea where people are pulling a lot of the nonsense they are spouting in this thread from.

When Keller said clean sheet, he meant they were making a new design with no restrictions (basically the antithesis of what Bulldozer was). They will re-use certain IP they already have to make the design process go quicker. That's what he meant when he said they'd take the best of both [Jaguar and Bulldozer]. As to what they re-used, who knows, but he stressed that they know how to do low-power, energy-efficient designs as well as high-frequency designs.

I actually nailed most of BD's performance dynamics: I predicted that in single-threaded workloads it would likely be slower than Phenom II (it was, and NO ONE else predicted that outcome).

Is that anything to brag about when actual AMD engineers who worked on Bulldozer came out and told everybody outright that it wouldn't be faster in terms of per-core performance than Phenom II :eek:? I remember they said this a good number of months before the first chips even hit the market. But everybody stayed in their hype bubble as usual.
 
Also, mind blown @ the guy who refuted what Jim Keller himself said about Zen/K12 being clean-sheet designs. You clearly know nothing about AMD's older architectures, because if you did, you'd know that Greyhound (10h, i.e. Phenom I & II) topped out ages ago, and simply going back to it and buffing it up would get them nowhere. They already shrank Greyhound down to 32nm with Llano and it didn't really shatter any records. They made minor tweaks here and there and got a 7% IPC increase in best-case scenarios, but the clocks were lower than Phenom II's were on average, so it didn't matter.

Jim Keller is on record saying they used modules that worked and clean sheet'd what needed fixing. If I must, I'll come back and edit this post to find the quote. I suppose that's what everyone is saying, though.
 
Jim Keller is on record saying they used modules that worked and clean sheet'd what needed fixing. If I must, I'll come back and edit this post to find the quote. I suppose that's what everyone is saying, though.

I think what he mentioned was that it wasn't a clean new design but used existing IP to make Zen. It isn't an iteration like, say, Bulldozer -> Piledriver -> Excavator, but more like Phenom to Bulldozer, as in a new design using preexisting elements.
 
Jim Keller is on record saying they used modules that worked and clean sheet'd what needed fixing. If I must, I'll come back and edit this post to find the quote. I suppose that's what everyone is saying, though.

Parts of it are not a new design. Things like the memory controller, the PCIe bus controller (potentially), and the general infrastructure around the "core".
 
I'm hoping that with the launch of AM4 motherboards we get some (bite-size) information about Zen every few months until its launch later this year.
 
Parts of it are not a new design. Things like the memory controller, the PCIe bus controller (potentially), and the general infrastructure around the "core".

Okay, I did remember correctly. Hopefully I didn't completely miss Naroon's point. Obviously there was no way that they were going to start over wholesale--there are plenty of modules that are "good enough" to move forward with and then be touched up later. Or, frankly, sub-optimal, but not sub-optimal enough (or too large in scope for the extent of the project) to get the attention they needed. At some point you've got to sell the product and not let scope creep eat you alive (see: an AMD that was wildly unprepared for a lack of node shrinks).

I'm going to stick to my adage that, right now in the consumer space, ~4 high-performance cores is really the best place to focus one's energies. Scale up/down as you need, but let those be (sensibly) compromised. Chicken/egg in terms of simultaneous threads vs. physical cores, but that's where we are right now.
 
Why would we? PCI-E 3.0 isn't even remotely tapped out yet.

We can't be too sure of this yet. HSA, for one, could make use of the extra bandwidth. And we can't be too sure DX12 won't need it, as it allows for simultaneous communication between multiple cores and the GPU.
 
We can't be too sure of this yet. HSA, for one, could make use of the extra bandwidth. And we can't be too sure DX12 won't need it, as it allows for simultaneous communication between multiple cores and the GPU.

HSA-compatible hardware can use HSA software. HSA does not bridge PCIe ports.

So far HSA is made for a single CPU/APU solution, not a single CPU/APU plus (multiple) GPU solution.
 
I think Zen is going to be an unusual release.

AMD's probably going to release 4-, 6-, and 8-core processors that are priced to beat Intel's at the same performance level. They'll have some sort of turbo function as well and will probably top out on turbo at or below 4GHz.

What will be interesting is that I fully expect AMD to put the hurt on i3 processors by selling cheap, superior quad-core processors. AMD will fully leverage that in the mobile market, basically bringing native quad-core i5s down to the $650-$750 market. Intel will be faster, but AMD will lower the entry point for pricing. I expect Intel to pivot a little and start selling cache-reduced native quad-core i5 processors with slightly lower turbo speeds in the mobile sector, and high-turbo quad/hex-core + Hyper-Threading chips at the higher end.

Later on, I expect AMD to release both 8-core and 16-core + SMT processors in the dual-socket market, priced to cause Intel pricing pain.
 
HSA-compatible hardware can use HSA software. HSA does not bridge PCIe ports.

So far HSA is made for a single CPU/APU solution, not a single CPU/APU plus (multiple) GPU solution.

HSA is actually designed for multiple different chips connected to each other and sharing memory. Currently the only implementation is with AMD's APUs, but the end goal is to allow ARM, x86, and maybe GPUs to all be connected and share resources. It's possible that a later implementation will allow different cores to talk to different GPUs and for GPUs to talk directly to one another.
http://www.hsafoundation.com/html/H...ion%201.0.1%20|Chapter%201.%20Overview|_____1
 
HSA is actually designed for multiple different chips connected to each other and sharing memory. Currently the only implementation is with AMD's APUs, but the end goal is to allow ARM, x86, and maybe GPUs to all be connected and share resources. It's possible that a later implementation will allow different cores to talk to different GPUs and for GPUs to talk directly to one another.
http://www.hsafoundation.com/html/H...ion%201.0.1%20|Chapter%201.%20Overview|_____1

Thank you, I didn't have to point it out. hUMA/NUMA (I never can remember which one) would decrease CPU-to-GPU (discrete) latency by a substantial amount.
 
Later on, I expect AMD to release both 8-core and 16-core + SMT processors in the dual-socket market, priced to cause Intel pricing pain.

Dual socket, if even hinted at for release, will make my decision to hold on a little longer much easier.
I would love to see a dual-socket 16-core Zen motherboard.
Especially with that uber memory bandwidth.
 
Thank you, I didn't have to point it out. hUMA/NUMA (I never can remember which one) would decrease CPU-to-GPU (discrete) latency by a substantial amount.

So that's actually a false statement (and it's hUMA you're thinking of). hUMA gives a single address space to all memory, rather than having one for main memory, one for each GPU, one for any other accelerator card, etc.

In the case of an APU, where the GPU memory is the main memory, the CPU passes a pointer and the GPU has direct access: no copy required, and thus faster memory access. In the case of a discrete GPU in a hUMA system you still have one address space, but either the GPU needs to copy/map what it needs from main memory or it needs to do memory accesses over the PCIe bus. The first case is what we have today, and the second case is much worse. The benefit of hUMA in the discrete GPU case would be strictly ease of programming (for compiler writers).
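A rough sketch of that difference in C; the gpu_* functions are made-up stand-ins (stubbed out so the sketch compiles), not any real GPU API:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for a GPU runtime, stubbed so this compiles.
   They are NOT a real API; the point is just the shape of the two paths. */
static void *gpu_alloc(size_t n)                                  { return malloc(n); }
static void  gpu_copy_to_device(void *d, const void *s, size_t n) { memcpy(d, s, n); }
static void  gpu_launch(const float *buf, size_t n)               { printf("kernel over %zu floats (first = %.1f)\n", n, buf[0]); }

/* Discrete GPU today: stage the data into VRAM ahead of time, then launch. */
static void run_discrete(const float *host_data, size_t n) {
    float *vram = gpu_alloc(n * sizeof *vram);               /* pretend VRAM            */
    gpu_copy_to_device(vram, host_data, n * sizeof *vram);   /* the explicit PCIe copy  */
    gpu_launch(vram, n);
    free(vram);
}

/* hUMA-style APU: one address space, so handing over the pointer is enough. */
static void run_apu(const float *shared_data, size_t n) {
    gpu_launch(shared_data, n);               /* no copy; GPU reads main memory directly */
}

int main(void) {
    float data[4] = {1, 2, 3, 4};
    run_discrete(data, 4);
    run_apu(data, 4);
    return 0;
}
```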

Edit:
On the dual-socket thing: I have no doubt that AMD will be targeting dual-socket systems (that's where the margins are).
 
Dual socket, if even hinted at for release, will make my decision to hold on a little longer much easier.
I would love to see a dual-socket 16-core Zen motherboard.
Especially with that uber memory bandwidth.

I can't afford a $6,000+ workstation, even at work.
 
HSA is actually designed for multiple different chips connected to each other and sharing memory. Currently the only implementation is with AMD's APUs, but the end goal is to allow ARM, x86, and maybe GPUs to all be connected and share resources. It's possible that a later implementation will allow different cores to talk to different GPUs and for GPUs to talk directly to one another.
http://www.hsafoundation.com/html/H...ion%201.0.1%20|Chapter%201.%20Overview|_____1

The ARM HSA-compatible chips are due this year. But what I meant (maybe I wasn't clear enough) by single CPU/APU is just a single-chip solution rather than something with a separate GPU.

Btw, HSA is not limited to any platform; it is just that some of the design has to be streamlined in hardware before you can get HSA software to work.
 
So that's actually a false statement (and it's hUMA you're thinking of). hUMA gives a single address space to all memory, rather than having one for main memory, one for each GPU, one for any other accelerator card, etc.

In the case of an APU, where the GPU memory is the main memory, the CPU passes a pointer and the GPU has direct access: no copy required, and thus faster memory access. In the case of a discrete GPU in a hUMA system you still have one address space, but either the GPU needs to copy/map what it needs from main memory or it needs to do memory accesses over the PCIe bus. The first case is what we have today, and the second case is much worse. The benefit of hUMA in the discrete GPU case would be strictly ease of programming (for compiler writers).

Edit:
On the dual-socket thing: I have no doubt that AMD will be targeting dual-socket systems (that's where the margins are).

Yeah, for discrete GPUs, the only real benefit of hUMA is a minor simplification on the compiler side. The benefits have been oversold by AMD.

I'd also argue that the discrete GPU case is the RIGHT way to handle rendering anyway; you are always going to have power/thermal constraints with large on-die GPUs, not to mention the memory bus becomes a major problem, as you can optimize for bandwidth OR latency, not both. And CPU access really cares about low latency. It's not even like the copy across PCIe is hurting you, given that most of the data you need is transferred ahead of time.
 
Yeah, for discrete GPUs, the only real benefit of hUMA is a minor simplification on the compiler side. The benefits have been oversold by AMD.

I'd also argue that the discrete GPU case is the RIGHT way to handle rendering anyway; you are always going to have power/thermal constraints with large on-die GPUs, not to mention the memory bus becomes a major problem, as you can optimize for bandwidth OR latency, not both. And CPU access really cares about low latency. It's not even like the copy across PCIe is hurting you, given that most of the data you need is transferred ahead of time.

HSA has huge benefits, just not for gaming with a discrete GPU (fun story: there's more to computing than that ;)). Also, I disagree with a discrete GPU being the 'right' way to do graphics; if you look at either of the major consoles you'll see that integration is what everyone else thinks is 'right'.
If you want low-latency gaming, you want an HSA system with integrated graphics.
 
Correct me if I'm wrong:
The most correct and fastest solution is to have everything directly connected to everything else. CPU straight to RAM. GPU straight to RAM and CPU. Minimize the physical distance a signal has to travel. Basically a seriously badass APU with on-board HBM segmented for purpose (nGB acting as standard RAM, and nGB for VRAM), yet directly accessible by all nodes on the package.
 
In the future, yes; for now the bottlenecks are not holding back performance...
 
Correct. A fully integrated solution will minimize communication overhead. One step better than HBM would be to have RAM on die as well (we'll get there someday, I hope).
 
Well, HBM can be integrated on die. Licensing though... that's the bitch! Also, yields. That would suck to have good yields on the processing side, then have some stupid yield problems on the storage side. I imagine we are quite some distance away from that. 32GB of L5 cache basically. lol
 
Intel has been using eDRAM (in small amounts) for the past 2 or 3 generations (only on CPUs with integrated GPUs). Putting HBM on die is kind of pointless (with the complexity and die size involved, they would need a good amount of RAM to make it worth accessing).
 
No point in having HBM on-die, especially on a consumer part. They're gonna be putting them on interposers just like what we saw with Fury.
 
Correct me if I'm wrong:
The most correct and fastest solution is to have everything directly connected to everything else. CPU straight to RAM. GPU straight to RAM and CPU. Minimize the physical distance a signal has to travel. Basically a seriously badass APU with on-board HBM segmented for purpose (nGB acting as standard RAM, and nGB for VRAM), yet directly accessible by all nodes on the package.

Actually, connecting everything to main memory causes a host of problems that currently don't exist.

Problem 1 is you now have two separate computing devices, each potentially trying to access main memory at the same time. As anyone who works in software knows, this is a no-no; someone has to wait in order to ensure data integrity. You also need to spend significant time ensuring the CPU and GPU sides of the house remain in perfect sync. Throw in the fact that the CPU and GPU portions of your device will have different memory access units (they work different workloads, so you won't have one unit for both), and you have a potentially costly loss in performance in shared workloads.

Problem 2 is simply memory size requirements. There's a reason why GPUs are starting to have 4GB+ VRAM built in: they need that much space. Now, if you suddenly have GPUs using main system memory instead, take your normal memory requirements and, optimistically, double them. 16GB wouldn't be excessive anymore; it would be standard. As a result, your PC cost just went up a few hundred dollars.

Problem 3 is the bandwidth/latency problem. You can optimize memory access to either be super low latency (required for CPUs) or high bandwidth (required for GPUs). Improving one aspect makes the other worse. This is why Intel slaps very expensive eDRAM on their integrated GPUs, as they are basically using it as a super fast L4 cache. Problem is, you need to manage pre-loading it ahead of time, which ironically is EXACTLY what discrete GPUs are already doing. And no, you won't ever see HBM as main system memory, simply because of the previously mentioned latency concerns, never mind that the cost is significantly higher than DDR-based RAM.

The way dedicated GPUs work now is about as optimized as you can get, performance-wise. Vector workloads are generally easy to predict ahead of time, so in most instances you can preload a GPU's VRAM with whatever data you need. The only cost of doing this transfer is the inherent latency of writing back to RAM once you are finished, but when your target render time is 16ms (an eternity from a computing standpoint), the few extra microseconds it takes to do this copy are insignificant. There's a reason why APUs are always starved for data (main memory simply doesn't have the necessary bandwidth), while dedicated GPUs can easily be pushed to 99% loading (because VRAM does have the necessary bandwidth).
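To put rough numbers on that copy cost (the bandwidth and per-frame data size below are illustrative assumptions, not measurements):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative figures only: effective PCIe 3.0 x16 bandwidth and the amount
       of dynamic data uploaded per frame vary widely by system and game. */
    const double pcie_gb_per_s = 12.0;           /* assumed effective transfer rate */
    const double upload_mb     = 8.0;            /* assumed per-frame dynamic data  */
    const double frame_ms      = 1000.0 / 60.0;  /* ~16.7 ms budget at 60 fps       */

    double copy_ms = (upload_mb / 1024.0) / pcie_gb_per_s * 1000.0;
    printf("copy: %.2f ms of a %.1f ms frame (%.1f%%)\n",
           copy_ms, frame_ms, copy_ms / frame_ms * 100.0);   /* ~0.65 ms, ~4% */
    return 0;
}
```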

So yeah, the actual software engineer here has a slightly different opinion than the talking heads.
 
You left out a lot of what is necessary to make these work. You speak of the here and now but act as if there is no other way. There is a reason for HSA, hUMA, and NUMA: each negates problem 1. Problem 2 really isn't a problem, as most power users are using 16GB+. Problem 3 is only a problem with certain usage requirements. Say one is buying a laptop for gaming: then the latency would be negligible, whereas for a business workhorse it might not be as wise.

PROBLEM: every person, mostly self-proclaimed engineers/software devs, claims it isn't done that way. Sure, maybe not now, but we have seen how HSA and Mantle/DX12 change the landscape, which then requires CHANGE to how it has been done. But that's OK, keep doing it the way you have been; give new meaning to software engineer = dinosaur.
 
You left out a lot of what is necessary to make these work. You speak of the here and now but act as if there is no other way. There is a reason for HSA, hUMA, and NUMA: each negates problem 1. Problem 2 really isn't a problem, as most power users are using 16GB+. Problem 3 is only a problem with certain usage requirements. Say one is buying a laptop for gaming: then the latency would be negligible, whereas for a business workhorse it might not be as wise.

PROBLEM: every person, mostly self-proclaimed engineers/software devs, claims it isn't done that way. Sure, maybe not now, but we have seen how HSA and Mantle/DX12 change the landscape, which then requires CHANGE to how it has been done. But that's OK, keep doing it the way you have been; give new meaning to software engineer = dinosaur.


Err, not really; problem 1 will always be a problem, lol, there is no way around that.

Problem 2, there is an issue with this as well: everything that is using memory on a graphics card also has to be stored in system memory. Now, if you are using a combined pool of memory for both GPU and CPU, at times you will need to access the same memory space with both processors, and this is a no-no; it's better to have different pools. (This is just one specific example, something that can happen quite often; there are many other issues too.) Another problem is the bit size of the memory as well: if both processors use the same pool of memory, they have to be tailored for similar packet transfer and access sizes, and this is something the silicon has to be designed for, since the register file is a part of it too. That is a fixable problem, but not always that easy to mesh, because you also have to look at what is better for performance and scaling for each processor type.

Problem 3: he isn't talking about the same latency you are talking about....
 
Err, not really; problem 1 will always be a problem, lol, there is no way around that.

Problem 2, there is an issue with this as well: everything that is using memory on a graphics card also has to be stored in system memory. Now, if you are using a combined pool of memory for both GPU and CPU, at times you will need to access the same memory space with both processors, and this is a no-no; it's better to have different pools. (This is just one specific example, something that can happen quite often; there are many other issues too.) Another problem is the bit size of the memory as well: if both processors use the same pool of memory, they have to be tailored for similar packet transfer and access sizes, and this is something the silicon has to be designed for, since the register file is a part of it too. That is a fixable problem, but not always that easy to mesh, because you also have to look at what is better for performance and scaling for each processor type.

Problem 3: he isn't talking about the same latency you are talking about....

You didn't even come close to what I was talking about, and yes, you are wrong. You keep talking about how it is done = dinosaur. HSA isn't how it is done; it is how it will be done. I am sorry that you can't step out of yesterday and see that you are constrained by archaic standards. Memory access is HSA's big selling point and where it garners its greatest advantage.

Problem 2 isn't a problem even with the way it has been done; as I said, most power users have 16GB minimum.

Problem 3, latency, I understand quite well, as VRAM/HBM are bandwidth-centric and RAM is latency-centric. This doesn't change my argument, which is why I said a gaming laptop won't care a great deal about latency whereas a business/workhorse laptop would.

You need to stop telling us how it is done when everyone is talking about how it will or could be done.
 
You didn't even come close to what I was talking about, and yes, you are wrong. You keep talking about how it is done = dinosaur. HSA isn't how it is done; it is how it will be done. I am sorry that you can't step out of yesterday and see that you are constrained by archaic standards. Memory access is HSA's big selling point and where it garners its greatest advantage.

Problem 2 isn't a problem even with the way it has been done; as I said, most power users have 16GB minimum.

Problem 3, latency, I understand quite well, as VRAM/HBM are bandwidth-centric and RAM is latency-centric. This doesn't change my argument, which is why I said a gaming laptop won't care a great deal about latency whereas a business/workhorse laptop would.

You need to stop telling us how it is done when everyone is talking about how it will or could be done.


Well, you just need to read more then; there is something called Google, and you can google many of these things if you know the correct keywords, which I doubt you do, because you have no idea what you are posting about.

Latency has two components, VRAM latency vs. inter-chip latency; go read up.

Now, 16 gig minimum: I thought you were talking about HBM in an integrated solution, at least that's what it sounded like, so...

Bit rate of VRAM and bit size of register space, and how they have to correlate with each other: go read up.

You need to stop posting. Or at least post after reading, or at least phrase your posts as questions instead of statements, because the statements you make are ludicrous.
 
Well, you just need to read more then; there is something called Google, and you can google many of these things if you know the correct keywords, which I doubt you do, because you have no idea what you are posting about.

Latency has two components, VRAM latency vs. inter-chip latency; go read up.

Now, 16 gig minimum: I thought you were talking about HBM in an integrated solution, at least that's what it sounded like, so...

Bit rate of VRAM and bit size of register space, and how they have to correlate with each other: go read up.

You need to stop posting. Or at least post after reading, or at least phrase your posts as questions instead of statements, because the statements you make are ludicrous.

Actually, I do know; you keep trying to pull anything out of your south end to make yourself seem relevant. You don't specifically reference any point of mine, but bring up non-HSA aspects to counter my HSA aspects. I don't care about how you have been doing it; people like you are why most think devs are lazy. You won't think outside the box; you live comfortably in it. Maybe later today I will waste more time on you, but for now I am out of it.
 
You didn't even come close to what I was talking about, and yes, you are wrong. You keep talking about how it is done = dinosaur. HSA isn't how it is done; it is how it will be done. I am sorry that you can't step out of yesterday and see that you are constrained by archaic standards. Memory access is HSA's big selling point and where it garners its greatest advantage.

HSA offers ZERO performance benefit by itself. The only thing it does is eliminate a copy to a dedicated GPU's VRAM, which is essentially free since it's done ahead of time. The only real advantage is that you remove the need for APUs to carry on-die memory, with the obvious disadvantage that you are going to hammer the main memory bus half to death and likely stall out the CPU anyway.

Problem 2 isn't a problem even with the way it has been done; as I said, most power users have 16GB minimum.

"Power Users" make up 0.01% of the market. Sorry, the needs of the 0.01% aren't going to drive the industry. The last thing the PC market needs is baseline PC prices increasing $150, given how sales are already falling off a cliff.

Problem 3, latency, I understand quite well, as VRAM/HBM are bandwidth-centric and RAM is latency-centric. This doesn't change my argument, which is why I said a gaming laptop won't care a great deal about latency whereas a business/workhorse laptop would.

Yes it will, and the fact you say latency doesn't matter for a gaming laptop shows you have zero idea how computers actually work internally.

Think of latency as a constant delay built into memory access. No matter how much data you want to access at a time, there's a minimum delay built in.

Now, remember you have several thousand threads running at a time on Windows, each of which will need to do at least some memory access when it executes. In addition, every kernel thread, of which there are easily several hundred, will interrupt any user-space thread, without exception. Which means that if a kernel thread gets scheduled on a CPU core that your application is running on, your application gets bumped and stops running until it gets scheduled again [this is why threads jump across CPU cores so much on Windows]. And most of the time, that thread will be re-scheduled on a different CPU core than the one it was on before, which often means the contents it needs to execute aren't in the local L2 cache, which means another memory read needs to be performed before the application can continue to run.

Now you triple your memory access latency, and wonder why your game just lost half its performance.
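If you want to see that fixed delay for yourself, here's a minimal pointer-chasing sketch in C (single-threaded; the shuffle quality and timer handling are rough, so treat the output as ballpark only):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a randomly permuted linked list so every step is a dependent,
   hard-to-prefetch load; average time per step approximates memory latency. */
int main(void) {
    const size_t n = 1u << 24;                /* 16M slots * 8 bytes = 128 MB, far past any cache */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Build a single random cycle over all n slots (Sattolo's algorithm). */
    for (size_t i = 0; i < n; i++) next[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;        /* j < i keeps it one big cycle */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t steps = 20 * 1000 * 1000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < steps; s++) p = next[p];   /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / steps, p);
    free(next);
    return 0;
}
```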

Access latencies to the CPU cache are one reason why AMD's BD line sucks so much compared to Intel, and those are minute compared to the latencies involved in accessing main system memory. CPUs require low latency; GPUs require high bandwidth. Sure, you can double GPU performance by using high-bandwidth system RAM, but you'll halve CPU performance in the process.


But hey, I'm just a CS who's got over a decade of experience creating software for and designing new embedded systems.
 
Actually, I do know; you keep trying to pull anything out of your south end to make yourself seem relevant. You don't specifically reference any point of mine, but bring up non-HSA aspects to counter my HSA aspects. I don't care about how you have been doing it; people like you are why most think devs are lazy. You won't think outside the box; you live comfortably in it. Maybe later today I will waste more time on you, but for now I am out of it.


HSA is itself only a software component, but it isn't just about programming: hardware has to be tailored to get the most out of it. Going forward (not this gen, but in the future, sooner rather than later), hardware will in all likelihood be created to take full advantage of HSA, but in the interim, legacy hardware will still have issues with everything you stated, because a programmer will not have access to what they need to do everything we have been chatting about.
 
Falling off a cliff is quite dramatic. You should read about PC sales more in depth. The fact is that there was growth this past year. PCs like the Surface and other comparable systems were not included with desktop and traditional laptop sales, and were counted in the same category as Android and iOS tablets. Once this adjustment is made, there was growth.
Even if those are unaccounted for, a 2% decline is nowhere near as dramatic as the words you chose.
 
JustReason, just a little tidbit for you: you might want to read the press releases from the HSA Foundation.

http://www.hsafoundation.com/hsa-an...w-guide-to-heterogeneous-system-architecture/

This is from December of last year


  • How performance-bound programming algorithms and application types can be significantly optimized by using HSA hardware and software features;
  • Ideal mapping of processing resources from CPUs to many other heterogeneous processors, in compliance with HSA specifications;
  • Clear and concise explanations of key HSA concepts and fundamentals provided by expert HSA specification contributors.

What does the red (specifically the purple) part mean? Hardware has to be tailored to the specs to get the most out of HSA.

http://www.hsafoundation.com/standards/ (read up: there is one spec for system architects and one for programmers, and they are not the same).

You are using lazy developers as an excuse for not understanding what others have been talking about.

Now, I can pull up exactly what that statement means in engineering papers, but I think you should look it up, as you don't seem to want to do any legwork and want to be spoon-fed while talking smack.
 
Problem 3 is the bandwidth/latency problem. You can optimize memory access to either be super low latency (required for CPUs) or high bandwidth (required for GPUs). Improving one aspect makes the other worse. This is why Intel slaps very expensive eDRAM on their integrated GPUs, as they are basically using it as a super fast L4 cache. Problem is, you need to manage pre-loading it ahead of time, which ironically is EXACTLY what discrete GPUs are already doing. And no, you won't ever see HBM as main system memory, simply because of the previously mentioned latency concerns, never mind that the cost is significantly higher than DDR-based RAM.

Memory access could be both low latency and high bandwidth; it's more a question of waste. 1024 bits is obviously overkill if you only need a 32-bit value. If your CPU were doing 1024-bit math within a single thread, HBM would likely be low latency. With multiple threads and some register space you could relax those requirements a bit, the same way GPUs do. The signaling for HBM vs. DDR is obviously more complex than just that, but the argument is there. The differences are in power consumption from aggressive clocks. I'd imagine the thermal characteristics of 8-high HBM2 with high clocks and low latency would be unappealing.
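A quick sketch of the waste argument (the latency and widths below are illustrative placeholders, not real HBM or DDR timings):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative figures only, not real HBM/DDR timings. */
    const double access_ns  = 100.0;   /* assumed round-trip latency per transaction */
    const double bus_bytes  = 128.0;   /* one 1024-bit wide transfer                 */
    const double need_bytes = 4.0;     /* a single 32-bit value                      */

    /* Useful data moved per second if transactions are fully serialized. */
    double full_width = bus_bytes  / (access_ns * 1e-9) / 1e9;  /* every byte useful   */
    double scalar     = need_bytes / (access_ns * 1e-9) / 1e9;  /* only 4 bytes matter */

    printf("full-width bursts: %.2f GB/s useful\n", full_width); /* ~1.28 GB/s */
    printf("32-bit pokes     : %.2f GB/s useful\n", scalar);     /* ~0.04 GB/s */
    return 0;
}
```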

For Zen I'm sort of expecting HBM to be integrated, or nearby, acting as an L4 cache like what Intel did. Even for the CPU, dumping pages 1024 bits at a time would have its advantages. 32GB of HBM2 acting like L4 on a high-end server CPU would be interesting for a lot of database apps and some scientific workloads.

In regards to replacing system memory, that may be a more complex discussion. Intel's 3D XPoint and some other memory technologies might allow for some changes to conventional wisdom there. System memory only exists because massive L1 caches aren't practical.

Problem 2, there is an issue with this as well: everything that is using memory on a graphics card also has to be stored in system memory. Now, if you are using a combined pool of memory for both GPU and CPU, at times you will need to access the same memory space with both processors, and this is a no-no; it's better to have different pools.

That's not 100% accurate, as GPU memory doesn't have to be mirrored in system memory. That only happens because most workloads won't fit in GPU memory. Same thing with all data ideally fitting in an L1 cache. Ideally a video card would have enough memory to hold all your resources (textures, meshes) without streaming them on and off all the time. The only moving data should be something required by another processor. This is one of the things DX12/Vulkan should help with.

And in the interest of being picky, synchronization is only a concern if more than one thread will modify the data, not merely access it.
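To make that concrete, a minimal pthread sketch (generic C, nothing HSA-specific): any number of threads can read shared data without a lock; it's the shared write that has to be synchronized.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Read-only shared table: any number of threads may read it with no locking. */
static const int table[4] = {1, 2, 3, 4};

/* Shared value that gets *modified* concurrently: this is the case that needs
   synchronization (an atomic here; a mutex would also do). */
static atomic_long total;

static void *reader(void *arg) {
    (void)arg;
    long sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += table[i & 3];              /* pure reads: no race, no lock needed */
    atomic_fetch_add(&total, sum);        /* the shared write is what must be atomic */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, reader, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("total = %ld\n", atomic_load(&total));   /* 4 threads * 1,000,000 * avg 2.5 */
    return 0;
}
```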
 