Project Denver Maxwell

I was curious, since I haven't read anything about it in the recent leaks of the GTX 9xx series: are the GTX 970 and 980 going to have the ARM CPU on board, or is this delayed till the refresh?
 
Nope. Basically turned into a dud. Project Denver is basically dead as a concept, and/or has been pushed to Pascal, which will most likely include an ARM processor for some use in the Unified Memory Architecture.
 
AFAIK there was no Nvidia announcement of Maxwell having an ARM core. It was only a rumor.

And Project Denver lives on as the Tegra K1 with dual 64-bit Denver cores.
 
AFAIK there was no Nvidia announcement of Maxwell having an ARM core. It was only a rumor.

And Project Denver lives on as the Tegra K1 with dual 64-bit Denver cores.

Ummm... it was in the old roadmaps but got scrapped, for obvious reasons.
Also, Echelon shows they were actively working on it, along with a slight variant of that project.
 
It's not clear if they scrapped the idea altogether or have delayed it. Giving up on the idea means giving up on unified memory, which puts them at a disadvantage.

Then again, it was intended for compute purposes, so this could be a sign that they're not seeing a strong future for themselves in that market segment.
 
It's not clear if they scrapped the idea altogether or have delayed it.

Sorry, I meant that they scrapped unified memory for Maxwell, seeing as how UM got pushed back to Pascal and merged with Volta.
 
Does anyone actually have a source, one that isn't just a rumor site speculating, that actually pointed out that Project Denver would be more than what it ended up being (Nvidia's new ARM CPU architecture for its Tegra line)?

I've tried to find old roadmaps, and from what I've found and recall, nothing explicitly states that there would've been an ARM CPU and GPU pairing for a performance consumer desktop GPU product.

This is what Nvidia's CEO said in 2011 -
The second thing we announced was Project Denver. We’ve been working on a CPU internally for about three and half years or so. It takes about five years to build any full custom CPU. And Project Denver has a few hundred engineers working on it for this period of time and our strategy with Project Denver was to extend the reach of ARM beyond the mobile, the handheld computing space. To take the ARM processor, partner with them to develop a next-generation 64 bit processor to extend it so that all of computing can have the benefits of that instruction set architecture. It is backward-compatible with today’s ARM processors.

Otherwise it could just be people "reinterpreting" (hearing what they want to hear) for the consumer GPU market, or a lost-in-translation type situation.

For example -

1) ARM CPU paired with GPU. Denver K1 is an SoC which uses Kepler cores for the GPU (the first Nvidia SoC sharing its core GPU architecture).
2) Maxwell refresh will have ARM CPU. The Denver CPU will ship with Kepler cores, then be updated to Maxwell cores.
3) Translating code for the GPU for efficiency reasons. Denver K1 does translation/code morphing for the CPU for efficiency reasons.
 
I've tried to find old roadmaps, and from what I've found and recall, nothing explicitly states that there would've been an ARM CPU and GPU pairing for a performance consumer desktop GPU product.

Otherwise it could just be people "reinterpreting" (hearing what they want to hear) for the consumer GPU market, or a lost-in-translation type situation.

What do you think Unified Memory is?
Why do you think they developed NVLink?

Nvidia doesn't split their designs for desktop/consumer and HPC.
The GeForce line is what allows them to develop these massive ASICs to target HPC.
This is detailed in an interview with Scott, which also states that they are going to merge CPUs with GPUs for Tesla.

For a source, read just about any of their research papers detailing Echelon.
Any talks with Bill Dally on the future of GPU computing, aka Exascale.
 
Yes, I know they don't split between consumer and HPC (same chips). I'm referring to people interpreting HPC-oriented developments from a consumer angle. For example, do you recall hearing that speculation where Maxwell would include ARM CPU cores to provide lower API overhead, mimicking the benefits of lower-level APIs (such as Mantle), specifically for gaming benefits?

They do talk about the eventual convergence of a CPU and GPU, but I'm asking about specific references to "Project Denver" or "Maxwell" achieving these goals. There might be a misunderstanding here, but I am specifically referring to Denver and Maxwell only, since this is being brought up (and has come up several times recently) with people wondering what happened to Denver or Maxwell. I'm not arguing or questioning an eventual CPU/GPU convergence.

For example in the first link you posted -
Scott: Our Denver project is really aimed at putting out a high-performance ARMv8 processor. Our Denver 64-bit ARM core will be higher performance than anything you can buy from ARM Holdings. That core is going to show up in Tegra, but it won't show up in all of the Tegra processors. We will still have Tegra processors that use stock ARM cores as well, like we use Cortex-A9 cores today, but Denver will show up in the high end.

As an architecture licensee, the thing to remember is that you can tweak an ARM core to change its performance, but you can't change the architecture one lick. You have to conform to the ISA, and they are quite disciplined about that.

This specifically singles out Project Denver as an ARM CPU core project to be used in Tegra.

Also regarding Maxwell -
So we can now develop the "Maxwell" family of GPUs, and that will go into the Tesla line and into the "Parker" family of Tegra processors.

Research papers detailing Echelon (only had time to skim through this so far)-
At a 10 nm process technology in 2017, the Echelon project’s initial performance target is a peak double-precision throughput of 16 Tflops, a memory bandwidth of 1.6 terabytes/second, and a power budget of less than 150 W. In this time frame, GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture.
 
Well, that's disappointing; it most likely got pushed back into the next architecture. I plan on getting a GTX 970 or 980 anyway to replace my aging GTX 280.
 
There might be a misunderstanding here, but I am specifically referring to Denver and Maxwell only, since this is being brought up (and has come up several times recently) with people wondering what happened to Denver or Maxwell. I'm not arguing or questioning an eventual CPU/GPU convergence.

Denver specifically? No, it was assumed since it kept getting delayed and was "ready" in the same time frame as Maxwell.

Maxwell, yes. Unified Memory means an Nvidia CPU, which meant they had to put ARM (ETA: or PowerPC) cores somewhere, whether in the same ASIC, on the same package/substrate, or some other solution.

There is a reason Maxwell got pushed back 1.5 years and we aren't seeing GM10x, other than GM107.
 
Denver specifically? No, it was assumed since it kept getting delayed and was "ready" in the same time frame as Maxwell.

Maxwell, yes. Unified Memory means an Nvidia CPU, which meant they had to put ARM (ETA: or PowerPC) cores somewhere, whether in the same ASIC, on the same package/substrate, or some other solution.

There is a reason Maxwell got pushed back 1.5 years and we aren't seeing GM10x, other than GM107.

Ya, if it got delayed a full 18 months with no ARM CPU onboard as planned, it would be pretty lame if this unified architecture ends up getting pushed into next-generation Pascal or a Maxwell refresh.
 
What exactly are your expectations for ARM cores on a GPU and unified memory for a discrete GPU? I mean from a desktop consumer standpoint, not from an HPC standpoint.

Denver specifically? No, it was assumed since it kept getting delayed and was "ready" in the same time frame as Maxwell.

Maxwell, yes. Unified Memory means an Nvidia CPU, which meant they had to put ARM (ETA: or PowerPC) cores somewhere, whether in the same ASIC, on the same package/substrate, or some other solution.

There is a reason Maxwell got pushed back 1.5 years and we aren't seeing GM10x, other than GM107.

I understand the inherent synergy of unified memory with CPU/GPU integration, as they would be accessing the same physical memory. What I'm not understanding is the hard link/requirement here for the GPU and CPU to be on the same silicon, or even the same board, as a prerequisite for unified memory. And in turn, why the inference that Maxwell would have integrated ARM cores on silicon simply due to unified memory support (well, "unified virtual memory" as a feature point on an old roadmap).

Unified memory relates to having to manage one shared pool of memory as opposed to multiple separate pools. Nvidia has already implemented unified memory in CUDA 6 for Kepler and newer GPUs -
http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
From what I recall the blurb under Maxwell was "Unified Virtual Memory," so was this not implemented?
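
To make that concrete, here is a minimal sketch of what the CUDA 6 feature from that link looks like in practice, assuming a Kepler-or-newer GPU; the kernel and sizes are illustrative, not taken from any Nvidia material:

// Minimal sketch of CUDA 6 Unified Memory: one pointer from
// cudaMallocManaged() is dereferenced by both the CPU and the GPU,
// and the runtime migrates pages instead of explicit cudaMemcpy calls.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                        // GPU writes via the shared pointer
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));    // single allocation, single pointer
    for (int i = 0; i < n; ++i) x[i] = 1.0f;     // CPU initializes directly, no memcpy
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();                     // must finish before the CPU touches x again
    printf("x[0] = %f\n", x[0]);                 // CPU reads the GPU's result in place
    cudaFree(x);
    return 0;
}

Note that on current (pre-Pascal) hardware the runtime still copies pages under the hood; the single pointer is a programming convenience, not free bandwidth.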

Yes, there can be hardware developments, such as NVLink, designed to support unified memory by lowering the actual technical barriers to transferring data. One of the specific reasons for NVLink with Pascal is to give the GPU the same speed of access to system memory as the CPU -

http://devblogs.nvidia.com/parallelforall/nvlink-pascal-stacked-memory-feeding-appetite-big-data/

Starting with CUDA 6, Unified Memory simplifies memory management by giving you a single pointer to your data, and automatically migrating pages on access to the processor that needs them. On Pascal GPUs, Unified Memory and NVLink will provide the ultimate combination of simplicity and performance. The full-bandwidth access to the CPU’s memory system enabled by NVLink means that NVIDIA’s GPU can access data in the CPU’s memory at the same rate as the CPU can. With the GPU’s superior streaming ability, the GPU will sometimes be able to stream data out of the CPU’s memory system even faster than the CPU.

So in turn there does not seem to be an indication of ARM integration with Pascal either. If anything something like ARM core integration seems like it would be a rather important feature point that would be mentioned separately.

AMD's heterogeneous unified memory, as part of its HSA initiative, does not preclude support for separate discrete GPUs (or other hardware) and will not be limited only to its APUs either.
 
Hehe, and people laughed at me when my day-one PowerColor 290 unlocked into a 290X for $399 with a free game, as "Maxwell bruh!!!"

Oh well. Realistically Maxwell is a letdown for most things except its awesome power consumption. But that alone is a feat. And the pricing sucks here in Canada. Used 280s and 290s go for $300 less off the bat.
 
So in turn there does not seem to be an indication of ARM integration with Pascal either. If anything something like ARM core integration seems like it would be a rather important feature point that would be mentioned separately.

Agreed. In the more recent presentations at GTC, etc., it seems like Project Denver was boxed up as a Tegra SoC announcement only and mysteriously ended there. I mean, the TK1 is nice and all, but making a custom ARM core is not exactly innovative business.

We got "unified virtual memory" in a previous CUDA release, which is nothing more than some nice programmer shorthand. There is, of course, no performance benefit.

NVLink is an interesting technology, but I suspect it's nowhere near as well-performing as a truly unified memory would be. If you think of a CUDA application, one of the biggest bottlenecks is when you need to take data you're working with (in GPU memory) and send it back to the CPU to do a little processing that only the CPU can handle, then bring it back to the GPU. NVLink would be much faster than PCIe, but I can't imagine it being faster than accessing GPU memory. And then there is the overhead of doing a memory copy from the process on the GPU to the process running on the CPU and back again. A completely inefficient waste of time, just because you need to do some doofy serial operation or whatever that the GPU can't handle. It's not just about the GPU accessing system memory; the CPU needs to access what's in GPU memory also.
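
Roughly what I mean, as a sketch in plain explicit-copy CUDA; the buffer names and the serial step are made up for illustration:

// The round trip described above: pull the data out of GPU memory, do a
// serial step on the CPU, push it back. Every byte crosses the bus twice,
// which is exactly the cost a truly unified memory would remove.
#include <cuda_runtime.h>

void cpu_round_trip(float *d_buf, float *h_buf, size_t n) {
    // 1. Copy device -> host over PCIe (or NVLink).
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 2. The doofy serial operation the GPU can't handle well
    //    (placeholder: a dependent running sum).
    for (size_t i = 1; i < n; ++i) h_buf[i] += h_buf[i - 1];

    // 3. Copy host -> device so the GPU kernels can continue.
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}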

It would be amazing if we could have a single process with a single memory space that both the CPU and GPU can work with. Obviously they will have to coordinate access so that one doesn't stomp on the other, so there needs to be some serious microcode and driver management. Just the software / firmware side of it would take an enormous technological brainstorm. I'm sure AMD is going through all of this too.

Whether or not there would be any benefits for game programmers, I'm not familiar with game development so I wouldn't have a clue.

But if NVIDIA has truly given up on unified memory then I take that as a pretty clear signal that they intend to leave the HPC market and do not wish to invest further resources in the project. It could be that they see an onslaught of FPGAs coming in the future and decided their offerings will not be competitive, or could not be sold at margins that make their efforts worthwhile.

Just a couple of years ago, all we would hear about was HPC, CUDA, and how GPU accelerators were going to change the world. Now all I'm hearing about from the company is this Shield gaming platform. I dunno.
 
I'm not seeing this as being related to Nvidia giving up on HPC.

HPC growth as a business for Nvidia was huge with the launch of Kepler; I believe they experienced more than double the revenue growth compared to Fermi.

Also, as stated, they are developing things such as NVLink to better support a unified memory architecture; it will allow the GPU to access system memory much faster (and vice versa). Progress does need to happen one step at a time. With NVLink and stacked memory, Pascal would also be comparable with Intel's upcoming Knights Landing co-processor (stacked memory and OPCIe?).

Shield and Tegra, though, are more consumer-facing, which is why you'll see more consumer-oriented media coverage for them. Also, right now that's the next market they are trying to grow.
 
NVLink isn't a step towards unified memory though. It's just a path faster than PCIe. You still have two disjoint memory spaces, which means you have to copy all the data back and forth between the two processors, which is a real ball-buster. Even if you can make it faster, it's not going to be anywhere near as fast as just sharing the same chunk of memory between the two.

We really want it to be like the mobile SoCs are today. There's one chunk of memory shared between the CPU and GPU.
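
For what it's worth, mapped (zero-copy) host memory is the closest thing CUDA offers to that today, and it's an approximation at best: on an SoC like the TK1 the pointer really is one chunk of shared DRAM, while on a discrete card the GPU still reaches it across the bus. A sketch, assuming the standard runtime calls:

// One allocation visible to both processors via mapped host memory.
// On an SoC this is genuinely shared DRAM; on a discrete GPU the device
// alias still goes over PCIe, so it removes the copies, not the distance.
#include <cuda_runtime.h>

int main() {
    float *h = nullptr, *d = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);           // enable mapping before context creation
    cudaHostAlloc(&h, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);              // device-side alias of the same memory
    // ...kernels take d, the CPU uses h, with no cudaMemcpy in between...
    cudaFreeHost(h);
    return 0;
}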
 
Well unified memory doesn't preclude situations where the memory pool itself is not physically unified.

For HPC applications, GPUs are currently used as coprocessors (there are other types of coprocessors as well), essentially to accelerate workloads that are best done on them. This is why something like NVLink giving the GPU faster access (so the actual link from memory to the GPU is not the bottleneck) is important. It compares favorably to the current limitations: slow access over PCIe, or copying over to GPU memory (again subject to PCIe speed limits).

Also, from a practical perspective there are (and might always be) trade-offs between different memory systems, so some might be a better fit than others depending on the application. If you look at the current situation, GPU memory is very high bandwidth (compared to what serves the CPU) while lagging in capacity and latency by comparison. It isn't as simple as saying one is better than the other; rather, one is more suitable for its given application.
 