Dual-Interlagos Benchmarks

4x Xeon 2GHz = 32 int cores/32 FPUs 13.47
4x MC 1.9GHz = 48 int cores/48 FPUs 15.43
2x BD 1.8GHz = 32 int cores/16 FPUs 25.97
2x MC 1.9GHz = 24 int cores/24 FPUs 30.61
Fixed.
Fixed. ;) The Xeon in the link is a 6 core Westmere-based model. And it essentially shows what's been benchmarked over and over: It takes roughly 2 K10.5 cores to equal a Nehalem core across a wide range of benchmarks, especially when the clock deficit is taken into account on the 10+ core chips. ;)

C-Ray scaling *per FPU unit*, *per clock*, normalized to MC 1.9GHz:
1.9x 2xBD 1.8GHz
1.5x 4xXeon 2GHz
1.0x 4xMC 1.9GHz
1.0x 2xMC 1.9GHz

Meaning: not much. :p BD clock speed is not announced yet, the 10 core Xeon (Westmere, Nehalem-based) is coming soon, and new 8 core Sandy Bridge-based Xeons will be out later this year. But it does show that BD got some beefier FPUs and each FPU unit essentially surpasses the performance of Westmere's FPU (SB improves on Westmere FP performance even without AVX). Actual performance in a 2S server or workstation has other variables. What's clear, though, is that to be competitive with SB in FPU performance, BD needs to at least match SB's frequency. Integer performance doesn't have a prayer.

I can compare the integer benchmarks the same way, but it isn't going to be pretty for BD.
 
Because that's how AMD defines it: one FPU per BD module, which is 4 FPUs per BD die. 4 BD dies are in a 2S Interlagos system. 4 x 4 = 16.

No they define it as two FPU modules that can be combined to run 256bit AVX code.
 
If you look at it per socket, what would the performance be? In the end, the max you can get from the Xeon is 32 cores in 4 sockets (8x4); however, you can get 64 cores from AMD in the same socket count. So when you compare those two, I bet price and power will be similar. Which one will perform better?
 
which one will perform better?

And the answer is most likely it depends on the application.

I mean, with HT the Xeons would have the same 64 threads, with AMD's physical cores being faster than the Xeons' virtual cores but slower than Intel's physical cores.
 
No they define it as two FPU modules that can be combined to run 256bit AVX code.
Yeah, you must've read John Fruehe's stuff. He's a marketing guy after all. Go to realworldtech.com and read the analysis there instead.
 
Yeah, you must've read John Fruehe's stuff. He's a marketing guy after all. Go to realworldtech.com and read the analysis there instead.
Even if he's a marketing guy, he can't flat out lie or make statements with no basis (due to legal team vetting).
 
Yeah, you must've read John Fruehe's stuff. He's a marketing guy after all. Go to realworldtech.com and read the analysis there instead.

I read the article, the 6 month old article where the writer finishes up stating that a lot of it is still conjecture and that there are still tons of mysteries to be solved in regards to BD. Maybe it's my lack of specific FP communication knowledge, but I didn't read anything that implied the FP or ALU units wouldn't work similarly to those in any other CPU, certainly nothing as drastic as half the modules being useless, as posters earlier implied. There was a lack of understanding on the writer's part of the actual execution of 256-bit AVX code, but as far as I know this hypothetical benchmark isn't touching AVX anyway.

I am trying to read more on the discussions portion but the flash is freaking out on my work computer so I am going to hold off on saying anything else before I have a chance to read this at home.
 
No they define it as two FPU modules that can be combined to run 256bit AVX code.
Nothing has changed recently: http://tech.icrontic.com/uploads/2009/11/amd_bulldozer_2010-2.png and http://blogs.amd.com/work/2011/02/21/amd-at-isscc-bulldozer-design-solutions/

It was and is called a single shared or dedicated "Floating Point Unit" in the BD module. You're confusing the FPU with the pipelines inside it. The shared FPU contains 2 pipelines. But AMD does not call it 2 FPUs per BD module.

If you look at it per socket, what would the performance be? In the end, the max you can get from the Xeon is 32 cores in 4 sockets (8x4); however, you can get 64 cores from AMD in the same socket count. So when you compare those two, I bet price and power will be similar. Which one will perform better?
Yeah, it's important to look at the whole system. Since you talk about 64 cores for BD (unreleased), why not bring in the 10 core Westmere EX that should be launching a bit before BD? :p Westmere EX supports up to 8 sockets (glueless), which is a maximum of 80 cores/160 threads per system. The problem of course is that a BD core is not equal to a Westmere core; it won't take 8 Westmere EX sockets to match 4 BD sockets.

Having more cores is good (cheap GFLOPS and bandwidth for HPC, java/web servers) and sometimes it's bad (taking 2x more silicon to match the competition, or paying more in licensing costs for server software licensed per core). I think there's a general appeal to having "more" even if it isn't best for desktop performance. AMD desktop marketing has its work cut out for sure. AMD has never had a core so much better suited to servers than to desktops as BD.

Talk about MP servers is pretty irrelevant to most people here. What's important is how much performance you'll get from one socket in desktop apps/games and what kind of performance will be available in mobile parts. Looks like mobile is getting a Fusion half-dozer next year. :p
 
C-Ray scaling *per FPU unit*, *per clock*, normalized to MC 1.9GHz:

What the hell is this? I have yet to see a more meaningless metric. It's like saying HP/cylinder. What does that tell you?

But it does show that BD got some beefier FPUs and each FPU unit essentially surpasses the performance of Westmere's FPU (SB improves on Westmere FP performance even without AVX).

You do realize that there is no such thing as a BD FPU unit, since there are 2 of them? Each BD module has 2 128-bit FPUs, which can perform 4 DP FLOPs per cycle in SSE mode and 8 with AVX/FMA. Same as Westmere (for the SSE part) and SB.

If I divide by 2, as you should, things don't look very special at all:

1.5x 4xXeon 2GHz
1.0x 4xMC 1.9GHz
1.0x 2xMC 1.9GHz
0.95x 2xBD 1.8GHz

Which means that in SSE code, BD's FPUs are more or less the same performance as MC's (accounting for the speed difference). Extra performance can be gained by using FMA, AVX and/or increasing frequency.
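As a rough illustration of what those per-cycle figures imply, here is a minimal sketch of peak double-precision throughput. Assumptions: the 4 and 8 DP FLOPs/cycle per module quoted above, the 16 FP units counted earlier in the thread for a 2-socket Interlagos, and the 1.8GHz engineering-sample clock from the first table; these are theoretical peaks only, and shipping clocks are unknown.

```python
# Peak DP GFLOPS sketch: FP units x FLOPs-per-cycle x clock (GHz).
# All inputs are figures quoted in this thread, not measured values.
def peak_gflops(fp_units, flops_per_cycle, ghz):
    return fp_units * flops_per_cycle * ghz

print(peak_gflops(16, 4, 1.8))  # SSE mode: 115.2 GFLOPS peak (2S Interlagos)
print(peak_gflops(16, 8, 1.8))  # AVX/FMA:  230.4 GFLOPS peak (2S Interlagos)
```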

Actual performance in a 2S server or workstation has other variables. What's clear, though, is that to be competitive with SB in FPU performance, BD needs to at least match SB's frequency. Integer performance doesn't have a prayer.

I'd say it is the other way round. It is more likely that BD will trail SB FP performance, especially in AVX code (FMAs are pretty rare), and will have a better chance in INT, since there it has a real, tangible advantage in execution units.
 
What the hell is this? I have yet to see a more meaningless metric. It's like saying HP/cylinder. What does that tell you?
It tells you the throughput per FPU, normalized for frequency. You can use it for a rough relative estimate in heavy FP apps.
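For anyone who wants to reproduce the chart, here is a minimal sketch of the normalization, assuming the raw figures in the first table are run times in seconds (lower is better); rounding and the choice of baseline can shift the results a little relative to the hand-made chart above.

```python
# "Per FPU, per clock" normalization sketch for the C-Ray results above.
# Assumption: raw figures are run times in seconds, so throughput ~ 1/time.
systems = {
    # name: (run_time_s, fpu_count, clock_ghz) -- values from the first table
    "4x Xeon 2GHz": (13.47, 32, 2.0),
    "4x MC 1.9GHz": (15.43, 48, 1.9),
    "2x BD 1.8GHz": (25.97, 16, 1.8),
    "2x MC 1.9GHz": (30.61, 24, 1.9),
}

def per_fpu_per_clock(time_s, fpus, ghz):
    return 1.0 / (time_s * fpus * ghz)

baseline = per_fpu_per_clock(*systems["2x MC 1.9GHz"])
for name, cfg in systems.items():
    print(f"{name}: {per_fpu_per_clock(*cfg) / baseline:.2f}x")
```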

If I divide by 2, as you should, things don't look very special at all :
True, but chopping an FPU in half tells you even less. ;) Micro-benchmarks are more than slightly interesting. Just look at this thread.

The takeaway is that Interlagos has 8 FPUs/16 FP pipelines which, when clock normalized, perform a bit better than the 8 in Westmere. SB is a different beast, since the 8 core model will run at higher clocks and the FPUs are a bit faster in C-Ray (quad core SB gets ~1.75 in the chart I made above).
 
Nothing has changed recently: http://tech.icrontic.com/uploads/2009/11/amd_bulldozer_2010-2.png and http://blogs.amd.com/work/2011/02/21/amd-at-isscc-bulldozer-design-solutions/

It was and is called a single shared or dedicated "Floating Point Unit" in the BD module. You're confusing the FPU with the pipelines inside it. The shared FPU contains 2 pipelines. But AMD does not call it 2 FPUs per BD module.

Yeah, it's important to look at the whole system. Since you talk about 64 cores for BD (unreleased), why not bring in the 10 core Westmere EX that should be launching a bit before BD? :p Westmere EX supports up to 8 sockets (glueless), which is a maximum of 80 cores/160 threads per system. The problem of course is that a BD core is not equal to a Westmere core; it won't take 8 Westmere EX sockets to match 4 BD sockets.

Having more cores is good (cheap GFLOPS and bandwidth for HPC, java/web servers) and sometimes it's bad (taking 2x more silicon to match the competition, or paying more in licensing costs for server software licensed per core). I think there's a general appeal to having "more" even if it isn't best for desktop performance. AMD desktop marketing has its work cut out for sure. AMD has never had a core so much better suited to servers than to desktops as BD.

Talk about MP servers is pretty irrelevant to most people here. What's important is how much performance you'll get from one socket in desktop apps/games and what kind of performance will be available in mobile parts. Looks like mobile is getting a Fusion half-dozer next year. :p

Is the Intel 8P platform even ready? The 4P platform for Bulldozer is, so let's use the 10c chips: 40/80 vs 64/64. I'm simply speaking from a platform perspective; people who buy a 4-socket system are going to shell out a shitload of money, so when you compare a 4P system vs a 2P system there is a huge difference in cost, which is why I put it as 4P vs 4P.

And on cores vs silicon, the whole theory we were fed by AMD is that it takes less silicon to get 2 of these cores vs 2 traditional cores, and the cost to implement the cores this way is much less. So we'll see if BD lives up to that promise.

Will BD support 8 socket systems like the current 4/6 core chips do?
 
Whereas a core in the traditional sense would mean integer and FP execution units plus lots of supporting logic to buffer, decode, reorder and execute instructions, a core in the case of Bulldozer is basically an integer scheduler, behind which a group of integer execution units and AGUs sit. Basically it's just an ALU.
What? You might want to look at AMD's diagrams again; RWT has a great article that goes over BD, and in no way, shape, or form is a core "basically an ALU". The FPU, branch predictor, pipeline layout, and caches are all detailed, as well as the integer scheduler. Hell, they even had some nifty pics to compare it to the current AMD core and Intel's Westmere.

[attached images: diagrams comparing the Bulldozer module with the current AMD core and Intel's Westmere]

A module is 2 cores, not ALUs; it's just doing some really cool resource sharing.

In the case of Bulldozer, one core actually has less ALUs and AGUs than K10 (aka Istanbul), so integer throughput per Bulldozer core is less than K10.
Theoretical integer throughput, you mean, right? x86 is inherently ILP limited, so doubling the ALUs/AGUs doesn't get you anywhere near double the performance, not even halfway there. IIRC it was maybe 5-10% more performance. For instance, I think the original Athlon had something like 1-1.5 effective IPC depending on what code you were running, yet it had 3 integer pipelines. The L2 and L3 cache latency/associativity will have a bigger impact on performance than the number of ALU/AGU pipelines, and JF has already stated it will be faster per clock than Phenom II. So yeah, you're at best being waaay too simplistic in your "analysis".

If it's an FP op, the instruction is "outsourced" - i.e. the core does not handle the FP op, but instead gives it to a black box to handle.
Dude, you're making stuff up. There is no "outsourcing" going on. BD has a 256-bit FPU pipeline that is shared across a single module; it can do 1x 256-bit op, 2x 128-bit ops, or 4x 64-bit ops per clock.

This facilitates Fusion, i.e. the replacement of the traditional cluster of 128bit FP execution units with a GPU shader array.
More making stuff up. This has nothing to do with Fusion, which is something entirely different. A Fusion CPU, or "APU" as AMD likes to call it, will still have "traditional" x86 integer and FPU pipes, but will also have an integrated GPGPU for the stupidly parallel problems as well as graphics work.

That is it.

AMD knows that they will probably never end up being able to replace the x86 FPU, or SSE2/4/etc., with a GPGPU, nor will they probably ever try, since it would be stupid to. GPGPUs are good at stupidly parallel stuff and that is it. While SIMD and classic x87 FPUs are slower, they're much more flexible and will be necessary no matter what, both for backwards compatibility and because some things just run like crap on a GPGPU even if you hand code it.
 
Is the Intel 8P platform even ready?
The 8 core Nehalem-EX came out a year ago, and there have been plenty of 8-way glueless servers on the market since its introduction. That's not Intel's first 8-way platform either. The 10 core Westmere-EX should drop into the same LGA 1567 socket that the 8 core Nehalem-EX uses.

I know it spoils the fantasy now that benchmarks are out, but BD looks like it's not the miracle some were expecting, especially in integer performance. It got a respectable boost, and the extra cores are nice if you have a use for them on the desktop. Otherwise, it's looking bleh in the face of modern competition.
 
1. This post is NOT to add speculation.
2. This post does not give an answer.
3. This post gives an example scenario to suggest various understandings.
4. This post is based on published results on a website, though I am 100% not qualified for any technical discussion; I am just focusing on one aspect of the discussion.

5. Core i5-750 (4C/4T) C-RAY benchmark. No Hyperthreading on the i5-750, so you need not worry about any HT effect.
5a. Based on the result, a 32-thread C-RAY benchmark run on the i5-750 implies it runs on all 4 cores. The result score is 90.
5b. Based on "simple, maybe questionable math", if we scale back to 1 thread, meaning we force it to run on only 1 CPU core, the likely score is around 90x4 = 360. Same workload, no parallel processing.
5c. Luckily the same user also posted a 1-thread C-RAY benchmark run under similar circumstances. The 1-thread score is 305, so a difference of 360-305 = 55.
5d. So the delta of 55 is the overhead of parallel processing, probably a combination of many factors, including OS, libraries, compiler switches, memory, speed, and misc. (see the quick sketch below)

6. What I am trying to say is

6a. When you have many processor cores for this particular benchmark, there are probably many factors involved.

6b. Using simple reverse logic requires careful assessment. It is not wrong for the stated published 32-core score; however, it may not be 100% correct when reversed into a single-core deduction, due to the many factors involved.

6c. I agree it does give a hint. The accuracy depends on knowing all the involved hardware and software factors, which are not published in this case for the supposed unofficial BD sample results.

6d. I suppose we still need to wait until the actual product is available to have a real performance understanding.
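A quick sketch of the arithmetic in point 5, assuming the scores are run times (lower is better) and taking the quoted figures at face value; as someone notes below, turbo boost on the single-thread run can also account for part of the delta.

```python
# Naive reverse estimate from the i5-750 C-Ray numbers quoted above.
score_32_threads = 90    # 32-thread run spread over all 4 cores
score_1_thread = 305     # 1-thread run from the same user

naive_1_core_estimate = score_32_threads * 4                 # 360, assumes perfect scaling
delta = naive_1_core_estimate - score_1_thread               # 55
scaling_efficiency = score_1_thread / naive_1_core_estimate  # ~0.85

print(naive_1_core_estimate, delta, round(scaling_efficiency, 2))
```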

Cheers
 
1. This post is NOT to add speculation.
2. This post does not give an answer.
3. This post gives an example scenario to suggest various understandings.
4. This post is based on published results on a website, though I am 100% not qualified for any technical discussion; I am just focusing on one aspect of the discussion.

5. Core i5-750 (4C/4T) C-RAY benchmark. No Hyperthreading on the i5-750, so you need not worry about any HT effect.
5a. Based on the result, a 32-thread C-RAY benchmark run on the i5-750 implies it runs on all 4 cores. The result score is 90.
5b. Based on "simple, maybe questionable math", if we scale back to 1 thread, meaning we force it to run on only 1 CPU core, the likely score is around 90x4 = 360. Same workload, no parallel processing.
5c. Luckily the same user also posted a 1-thread C-RAY benchmark run under similar circumstances. The 1-thread score is 305, so a difference of 360-305 = 55.
5d. So the delta of 55 is the overhead of parallel processing, probably a combination of many factors, including OS, libraries, compiler switches, memory, speed, and misc.

6. What I am trying to say is

6a. When you have many processor cores for this particular benchmark, there are probably many factors involved.

6b. Using simple reverse logic requires careful assessment. It is not wrong for the stated published 32-core score; however, it may not be 100% correct when reversed into a single-core deduction, due to the many factors involved.

6c. I agree it does give a hint. The accuracy depends on knowing all the involved hardware and software factors, which are not published in this case for the supposed unofficial BD sample results.

6d. I suppose we still need to wait until the actual product is available to have a real performance understanding.

Cheers

Doesn't Lynnfield have a decent turbo boost, so the single-threaded score would be higher as a result?
 
Hi, I will try to give another comparison to illustrate the reasoning, based on the same website's results.

These two processors are close enough on architecture
1. Q6600 4C/2.4GHz/2x4MB-cache 64-thread score-161 | simple-calc 1-core estimate 644
2. C2D 2C/2.4GHz/3MB-cache 32-thread score-234 | simple-calc 1-core estimate 468

Even though we know the estimate is unrealistic, the result is still unexpected in this case: with more processor cache we would always expect the Q6600 to have the better simple-calc score, but it is the reverse, and the difference is large: 644-468 = 176 delta.
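The same naive reverse calculation as a minimal sketch, again assuming the scores are run times and that the per-core estimate is simply the score multiplied by the core count:

```python
# Naive per-core estimates from the Q6600 and C2D scores quoted above.
q6600_estimate = 161 * 4   # 64-thread score x 4 cores = 644
c2d_estimate = 234 * 2     # 32-thread score x 2 cores = 468

print(q6600_estimate, c2d_estimate, q6600_estimate - c2d_estimate)  # delta of 176
```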

They have very different hardware/software combination, especially software.

I am just saying there are many hardware and critically software factors still not known, and they may impact score.
 
That's due to the FSB bottleneck of Kentsfield: two C2D dies on the same package fighting over the shared comm line with the chipset. Though normally it doesn't show up in anything but memory-intensive applications.
 
That's due to the FSB bottleneck of Kentsfield: two C2D dies on the same package fighting over the shared comm line with the chipset. Though normally it doesn't show up in anything but memory-intensive applications.

Perfect explanation, and that's a possibility for such a situation on a multi-socket system with the wrong combination, at a macro scale. When you move back to a more refined system with less contention, the calculated performance may appear much better.

With so many factors unknown, is there a reason why there is a rush to a performance conclusion?
 
Probably because it's still so far away from launch. Some of us are just itching to know what the next best thing from AMD is :) Then we'll move on to arguing over LGA2011 leaked benchmarks :p
 
With so many factors unknown, is there a reason why there is a rush to a performance conclusion?
yarly. Just wait until reviews start coming out before making any solid declarations -_- (this goes in either direction)
 
The 8 core Nehalem-EX came out a year ago, and there have been plenty of 8-way glueless servers on the market since its introduction. That's not Intel's first 8-way platform either. The 10 core Westmere-EX should drop into the same LGA 1567 socket that the 8 core Nehalem-EX uses.

I know it spoils the fantasy now that benchmarks are out, but BD looks like it's not the miracle some were expecting, especially in integer performance. It got a respectable boost, and the extra cores are nice if you have a use for them on the desktop. Otherwise, it's looking bleh in the face of modern competition.

Well in that case, 8P vs 8P :p 80 cores/160 threads vs 128 cores/128 threads. Should be an interesting showdown; I guess Anandtech will do something like that.
 
That's due to the FSB bottleneck of Kentsfield: two C2D dies on the same package fighting over the shared comm line with the chipset. Though normally it doesn't show up in anything but memory-intensive applications.

Except that the C-Ray benchmark doesn't depend on the FSB much, since the entire workload fits in CPU cache. This is according to the C-Ray author himself.
 
Well in that case, 8P vs 8P :p 80 cores/160 threads vs 128 cores/128 threads. Should be an interesting showdown; I guess Anandtech will do something like that.
Um, show me an 8 way MC/Interlagos system. I'll help because there are none that I can find on spec.org. This "8S" Opteron server is the closest thing: the Dell PowerEdge C6145. Why the quotes? Because it has 2 4-way boards in one server chassis. :p http://www.youtube.com/watch?v=sCtxfk2Vk6s&feature=player_embedded Cool HPC node, but it's not an 8-way server. If you want to count clusters as a single n-way server, that's avoiding the problem that AMD doesn't support 8-way systems.

However you can buy a 16 socket Westmere-EX server if you really want. http://www.app3.unisys.com/7600r/features.html That system isn't glueless, but it is a single server with all CPUs available locally to one another. Intel licenses glue logic to 3rd party manufacturers for custom chipsets. There are truly massive custom systems that are not clusters on the Intel side. Term of the day: RAS. The S stands for scalability. :p

Sorry, an AMD node is only going to have 4S and 64 cores when Interlagos is released.
 
Um, show me an 8 way MC/Interlagos system. I'll help because there are none that I can find on spec.org. This "8S" Opteron server is the closest thing: the Dell PowerEdge C6145.

The key thing is that there aren't any 8S Magny-Cours servers.

The reason is that a 4S MC system actually has 8 processors in 4x MCM packages.

There are several 8S Istanbul systems, but they are all older DDR2 designs as MC has made 8S AMD systems obsolete for the time being.
 
Um, show me an 8 way MC/Interlagos system. I'll help because there are none that I can find on spec.org. This "8S" Opteron server is the closest thing: the Dell PowerEdge C6145. Why the quotes? Because it has 2 4-way boards in one server chassis. :p http://www.youtube.com/watch?v=sCtxfk2Vk6s&feature=player_embedded Cool HPC node, but it's not an 8-way server. If you want to count clusters as a single n-way server, that's avoiding the problem that AMD doesn't support 8-way systems.

However you can buy a 16 socket Westmere-EX server if you really want. http://www.app3.unisys.com/7600r/features.html That system isn't glueless, but it is a single server with all CPUs available locally to one another. Intel licenses glue logic to 3rd party manufacturers for custom chipsets. There are truly massive custom systems that are not clusters on the Intel side. Term of the day: RAS. The S stands for scalability. :p

Sorry, an AMD node is only going to have 4S and 64 cores when Interlagos is released.
http://www.amd.com/us/products/server/processors/opteron/Pages/3rd-gen-server-model-numbers.aspx

? Not MC/Interlagos, but the existing 8P platform :p I just assumed this would continue to carry over into future Opteron platforms. Maybe JF-AMD can comment on that one.

I'm not hyping up BD, I just want to see which approach ends up working better :p I'm still on an i7 920, and rendering video for me is still a big drag :p ME WANTS MORE POWAR!
 
4 modules, 8 cores. That would be my assumption. The Phoronix tests were with the MCM dual Interlagos (32 actual cores, 16 BD modules).


LOL. I really overestimated then. Bulldozer is definitely hurting when compared core to core against Intel. We will see, though; that is only one test, and there could be tests out there where HT gives a gain closer to 30%. Either way it isn't pretty.

New estimate:
4 core Bulldozer 3.6GHz vs. 4 core 2600K 3.4GHz vs. i7 950 3.06GHz, both Intels with HT turned off. Assume HT adds a 5% performance gain. This also assumes perfect multi-threaded scaling, as before.
2600K stock = 59 seconds
4 core BD at 3.6GHz = 100 seconds
i7 950 stock = 75 seconds

That means the Sandy Bridge i7s finish in about 41% less time than BD and the older i7s in about 25% less time, roughly clock for clock. Of course this all assumed perfect scaling, and this is only one test (C-Ray) so far, so these guesses are still nothing more than guesses.
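To make the percentages explicit, here is a minimal sketch using the estimated times above; note that these times are the post's own rough estimates, not measurements, and the same numbers can be quoted either as time saved or as a throughput ratio.

```python
# Relative comparison of the estimated C-Ray times above (lower is better).
times = {"2600K @ 3.4GHz": 59, "4-core BD @ 3.6GHz": 100, "i7 950 @ 3.06GHz": 75}
bd_time = times["4-core BD @ 3.6GHz"]

for name, t in times.items():
    time_saved = 1 - t / bd_time   # fraction of BD's run time saved
    speedup = bd_time / t          # throughput relative to BD
    print(f"{name}: {time_saved:.0%} less time, {speedup:.2f}x the throughput")
```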

Nice forum title :rolleyes:
 
http://www.amd.com/us/products/server/processors/opteron/Pages/3rd-gen-server-model-numbers.aspx

? Not MC/Interlagos, but the existing 8P platform :p I just assumed this would continue to carry over into future Opteron platforms. Maybe JF-AMD can comment on that one.

I'm not hyping up BD, I just want to see which approach ends up working better :p I'm still on an i7 920, and rendering video for me is still a big drag :p ME WANTS MORE POWAR!

Yeah, as mentioned earlier, Magny-Cours is actually an MCM package with two Istanbul chips, with internal HT links between the chips, so it is indeed an "8 CPU" setup, but not in the traditional sense.
 
Um, show me an 8 way MC/Interlagos system. I'll help because there are none that I can find on spec.org. This "8S" Opteron server is the closest thing: the Dell PowerEdge C6145. Why the quotes? Because it has 2 4-way boards in one server chassis. :p http://www.youtube.com/watch?v=sCtxfk2Vk6s&feature=player_embedded Cool HPC node, but it's not an 8-way server. If you want to count clusters as a single n-way server, that's avoiding the problem that AMD doesn't support 8-way systems.

However you can buy a 16 socket Westmere-EX server if you really want. http://www.app3.unisys.com/7600r/features.html That system isn't glueless, but it is a single server with all CPUs available locally to one another. Intel licenses glue logic to 3rd party manufacturers for custom chipsets. There are truly massive custom systems that are not clusters on the Intel side. Term of the day: RAS. The S stands for scalability. :p

Sorry, an AMD node is only going to have 4S and 64 cores when Interlagos is released.

Actually the upcoming Supermicro H8QGL-6F+ has an HTX 3.0 slot which will allow either a 2 socket or 4 socket daughterboard. It is not on their site, but Anand got an early look at it: http://www.anandtech.com/show/4208/cebit-2011-some-quick-server-related-impressions
 
What? You might want to look at AMD's diagrams again; RWT has a great article that goes over BD, and in no way, shape, or form is a core "basically an ALU". The FPU, branch predictor, pipeline layout, and caches are all detailed, as well as the integer scheduler. Hell, they even had some nifty pics to compare it to the current AMD core and Intel's Westmere.

A module is 2 cores, not ALUs; it's just doing some really cool resource sharing.
It isn't, but did you really expect me to type out "2 ALUs behind a decoded macro-op buffer which leads to an integer and fp scheduler" every time I describe a single core?

Theoretical integer throughput, you mean, right? x86 is inherently ILP limited, so doubling the ALUs/AGUs doesn't get you anywhere near double the performance, not even halfway there. IIRC it was maybe 5-10% more performance. For instance, I think the original Athlon had something like 1-1.5 effective IPC depending on what code you were running, yet it had 3 integer pipelines. The L2 and L3 cache latency/associativity will have a bigger impact on performance than the number of ALU/AGU pipelines, and JF has already stated it will be faster per clock than Phenom II. So yeah, you're at best being waaay too simplistic in your "analysis".
JF can say whatever he wants, but then look at all the "bad math" examples in this thread that compare a BD core to other traditional cores. I really doubt single-threaded IPC would be faster, but then I may be wrong. I've only looked at C-Ray results after all.

Dude, you're making stuff up. There is no "outsourcing" going on. BD has a 256-bit FPU pipeline that is shared across a single module; it can do 1x 256-bit op, 2x 128-bit ops, or 4x 64-bit ops per clock.


More making stuff up. This has nothing to do with Fusion, which is something entirely different. A Fusion CPU, or "APU" as AMD likes to call it, will still have "traditional" x86 integer and FPU pipes, but will also have an integrated GPGPU for the stupidly parallel problems as well as graphics work.

That is it.

AMD knows that they will probably never end up being able to replace the x86 FPU, or SSE2/4/etc., with a GPGPU, nor will they probably ever try, since it would be stupid to. GPGPUs are good at stupidly parallel stuff and that is it. While SIMD and classic x87 FPUs are slower, they're much more flexible and will be necessary no matter what, both for backwards compatibility and because some things just run like crap on a GPGPU even if you hand code it.
That's what you see now. But then explain the increased level of abstraction around the FPU. If they are going to provide GPGPU capabilities through a totally separate part, why is there such a shift? The most telling part of it is "Once macro-ops have been placed into the scheduler, there is no longer a distinction between macro-ops from the two cores."

It is a logical progression, given that many FP-intensive applications also happen to be embarrassingly parallel, and anybody who writes FP-intensive code these days doesn't use x87 (unless you're nVidia). Kanter thinks so too: "One advantage of this more formalized separation is that the floating point cluster might eventually be replaced or supplemented by a GPU shader array, an evolution of Bulldozer to fit the ‘Fusion’ mold."
 
It isn't, but did you really expect me to type out "2 ALUs behind a decoded macro-op buffer which leads to an integer and fp scheduler" every time I describe a single core?
Well if you wanted to get into the details every time it'd be more like 4x integer pipelines behind various caches, decoders, and branch predictors that can act as 1 monolithic core or 2 leaner cores depending on workload, but you're getting closer.

JF can say whatever he wants, but then look at all the "bad math" examples in this thread that compare a BD core to other traditional cores. I really doubt single-threaded IPC would be faster, but then I may be wrong. I've only looked at C-Ray results after all.
JF doesn't have to do math, just look at internal benchmark results to form an opinion. Until he is shown to be lying, it's probably best to take him at his word.

But then explain the increased level of abstraction around the FPU.
What abstraction? It's just one big, wide FPU pipe instead of multiple smaller ones. AMD has already said something to the effect that they did that to give what they felt was the best balance of performance for existing and future code, with emphasis on existing code, since AVX will probably take years to really take off.

The most telling part of it is "Once macro-ops have been placed into the scheduler, there is no longer a distinction between macro-ops from the two cores."
That detail of the decoder has nothing to do with offloading work to a GPGPU or any other piece of hardware outside the CPU itself so I don't know what you're trying to point out.

It is a logical progression, given that many FP-intensive applications also happen to be embarrassingly parallel.
There is lots of stuff that is and isn't embarrassingly parallel. A one-size-fits-all approach won't fly. You also have to consider legacy code, which will probably not run well, or even at all, on a GPGPU without a recompile or, more likely, a rewrite. That is not gonna happen.

"One advantage of this more formalized separation is that the floating point cluster might eventually be replaced or supplemented by a GPU shader array, an evolution of Bulldozer to fit the ‘Fusion’ mold."
He also mentions "might" and "supplement" in that same sentence; are you sure he isn't just tossing out future possibilities to help fill out the article?
 
What abstraction? It's just one big, wide FPU pipe instead of multiple smaller ones. AMD has already said something to the effect that they did that to give what they felt was the best balance of performance for existing and future code, with emphasis on existing code, since AVX will probably take years to really take off.


That detail of the decoder has nothing to do with offloading work to a GPGPU or any other piece of hardware outside the CPU itself so I don't know what you're trying to point out.
It is not a detail of the decoder. Actually read the article, or use Ctrl-F to find out which page it's on.

There is lots of stuff that is and isn't embarrassingly parallel. A one-size-fits-all approach won't fly. You also have to consider legacy code, which will probably not run well, or even at all, on a GPGPU without a recompile or, more likely, a rewrite. That is not gonna happen.
That's right, one size fits all doesn't usually work. In the same line of thought, you cannot optimize for every type of code. What are your customers doing with your FPU? Running x87 code on it? At some point a line has to be drawn on optimizing performance for legacy code, and more silicon must be dedicated to speeding up what your customers are more likely to be running these days.

An x86 decoder splits an x86 op into macro ops. The front end abstracts the back end. Why do you assume that it is impossible to create such a front end for the FPU? What if it is possible to achieve even finer grained parallelism by further splitting x87 ops?

EDIT: There are several incredibly good reasons why AMD would want to do such a thing, despite the complexity of such a solution. That'll be your job to figure out. The first answer is: people with embarrassingly parallel workloads tend to buy more hardware.
 
Except that the C-Ray benchmark doesn't depend on the FSB much, since the entire workload fits in CPU cache. This is according to the C-Ray author himself.

That would make the Stream results irrelevant. Do you have a link for that claim?
 
It is not a detail of the decoder. Actually read the article, or use Ctrl-F to find out which page it's on.
I read the article; you're not making any sense. You know the macro ops are internal to the CPU only, right? You also know that even if you could dump them to cache to read out to main memory or to an on-die GPU, the latency hit would probably kill performance as well? We're talking about dozens or even hundreds of cycles of extra latency here. And that there are all sorts of ugly and unsolvable cache coherency issues that will further impact performance on top of all of that?

In the same line of thought, you cannot optimize for every type of code.
True, but you can apparently run most everything with OK performance, which is what current FPUs can do just fine. And there is lots and lots of stuff where that is all that is needed, and trying to get that sort of thing working on a GPGPU would be a waste of time and money in comparison.

Why do you assume that it is impossible to create such a front end for the FPU? What if it is possible to achieve even finer grained parallelism by further splitting x87 ops?
It's not that it's impossible, it's that it's slow, power hungry, and very difficult to do at all. AMD and Intel already have a heck of a time getting the integer x86 decoder to work fast enough, and you want to double their workload?

EDIT: There are several incredibly good reasons why AMD would want to do such a thing, despite the complexity of such a solution. That'll be your job to figure out.
Hahahaha, my job? Buddy, you made the claim; now you have to back it up.

The first answer is: people with embarrassingly parallel workloads tend to buy more hardware.
People with embarrassingly parallel workloads already have a perfectly viable solution that works much, much faster than any integrated GPGPU could: discrete PCIe GPGPUs like AMD's FireStream and nV's Quadro products.

I mean, I like the idea of GPGPUs and all, but the ugly truth is there is next to no need for that sort of thing outside of the HPC/scientific community. One of the few things where it would work well for the average Joe Six Pack, video encoding, Intel already does faster than any GPGPU could. And for very little cost in terms of die space and heat, too.

Until that killer app pops up, GPGPUs are the next PPU.
 
Um, show me an 8 way MC/Interlagos system. I'll help because there are none that I can find on spec.org. This "8S" Opteron server is the closest thing: the Dell PowerEdge C6145. Why the quotes? Because it has 2 4-way boards in one server chassis. :p http://www.youtube.com/watch?v=sCtxfk2Vk6s&feature=player_embedded Cool HPC node, but it's not an 8-way server. If you want to count clusters as a single n-way server, that's avoiding the problem that AMD doesn't support 8-way systems.

However you can buy a 16 socket Westmere-EX server if you really want. http://www.app3.unisys.com/7600r/features.html That system isn't glueless, but it is a single server with all CPUs available locally to one another. Intel licenses glue logic to 3rd party manufacturers for custom chipsets. There are truly massive custom systems that are not clusters on the Intel side. Term of the day: RAS. The S stands for scalability. :p

Sorry, an AMD node is only going to have 4S and 64 cores when Interlagos is released.

The 4P market today is ~4% of the overall x86 server volume. The 8P market is ~0.1% of the overall x86 server market. And despite Intel throwing lots of platforms at it in the past year, it is not really growing.

8P is for a handful of applications and is Intel's only hope at getting people off of Itanium and staying on their platform (but there are sw issues that might make RISC a better bet).

As for the rest of this thread, I don't know anything about those benchmarks, but to date every Bulldozer benchmark that someone has pointed to online is not representative of the actual performance. Then, in addition, all of the speculation and extrapolation of results clouds the picture even more.
 