45nm with Hyper-Threading

VR Zone has an article worth reading.

If the article is accurate, what do you think? Will it help?

*edit* Sorry guys, they added another article above the one I had wanted to link, here is the correct link... VR Zone */edit*
 
Hyperthreading would be cool.

If I had to choose between a dual core with HT (4 threads) that retailed in the 100's, or a quad core with HT (8 threads) in the upper 200's or 300's, I'd go for the dual with HT.

So here's hoping that the "budget" solution for 45nm is a dual with HT :) because that's all I want, and I'm sure it would be an awesomely cool-running chip.
 
I don't see how they can add HT to a Core 2 architecture based chip. As I understand it, Hyperthreading requires a deeper pipeline in order to be effective. The pipeline in the Core 2 is shallow compared to Prescott or Northwood based Pentium 4 CPUs.
 
I don't see how they can add HT to a Core 2 architecture based chip. As I understand it, Hyperthreading requires a deeper pipeline in order to be effective. The pipeline in the Core 2 is shallow compared to Prescott or Northwood based Pentium 4 CPUs.

Umh, you got it wrong.

You can do 2 things with SMT: make better use of available resources and mask latency.

P4 had a long, narrow pipeline and few execution units, while Core has a shallow pipeline, is very wide and has a lot of execution units.

Ironically, Core is better suited to implement HT because it has so many resources that it is unlikely a single thread will ever use them all.
POWER5 also has a short pipeline and yet it saw bigger gains from SMT than the P4 did from HT.
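
A toy way to picture the resource-sharing idea (the numbers below are completely made up, and this is nothing like a real scheduler, it just shows why a wide core has slots to spare):

```python
import random

random.seed(1)

WIDTH = 4          # issue slots per cycle on a wide core (assumed, for the sketch)
CYCLES = 10000

def ready_uops():
    # how many independent uops one thread happens to have ready this cycle;
    # a single thread rarely fills a 4-wide machine (stalls, dependencies)
    return random.choice([0, 1, 2, 2, 3])

def throughput(threads):
    issued = 0
    for _ in range(CYCLES):
        slots = WIDTH
        for _ in range(threads):
            take = min(slots, ready_uops())
            issued += take
            slots -= take
    return issued / CYCLES

print("1 thread :", throughput(1), "uops/cycle")   # leaves slots empty
print("2 threads:", throughput(2), "uops/cycle")   # the second thread fills some of them
```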
 
HT needs unused execution units to be effective, not a deep pipeline. :p

Intel had mentioned before that HT would likely return to NGMA.
 
Umh, you got it wrong.

You can do 2 things with SMT: make better use of available resources and mask latency.

P4 had a long, narrow pipeline and few execution units, while Core has a shallow pipeline, is very wide and has a lot of execution units.

Ironically, Core is better suited to implement HT because it has so many resources that it is unlikely a single thread will ever use them all.
POWER5 also has a short pipeline and yet it saw bigger gains from SMT than the P4 did from HT.

HT needs unused execution units to be effective, not a deep pipeline. :p

Intel had mentioned before that HT would likely return to NGMA.

Ok. I stand corrected. I seem to recall in earlier discussions that the longer pipeline was required. The discussion stemmed from someone asking about HT in the Pentium M processors.
 
This IMHO is pretty dumb. You already have two execution cores, or even 4. What's the point?
 
Simulated cores on the cheap? What's not to like... and it shouldn't cost much since it's not exactly new tech.
 
Simulated cores on the cheap? What's not to like... and it shouldn't cost much since it's not exactly new tech.

Eh. Must be the stigma from the P4. HT for that core was like rearranging deck chairs on the Titanic. Maybe it will actually provide a good boost for Core 2.
 
Ok. I stand corrected. I seem to recall in earlier discussions that the longer pipeline was required. The discussion stemmed from someone asking about HT in the Pentium M processors.

You're probably thinking of the fact that HT requires the trace cache idea from the P4. Most CPU architectures decode x86 instructions "on the fly" and feed them directly into the pipeline... in order to implement HT, you need to decode them independently and store the decoded instructions somewhere while both threads are scheduled.
 
You're probably thinking of the fact that HT requires the trace cache idea from the P4. Most CPU architectures decode x86 instructions "on the fly" and feed them directly into the pipeline... in order to implement HT, you need to decode them independently and store the decoded instructions somewhere while both threads are scheduled.

That's it. I remembered something about needing all the data to be stored and scheduled.
 
This IMHO is pretty dumb. You already have two execution cores, or even 4. What's the point?

Seriously? Did you just ask "what's the point?"

What's the point of making a faster processor? What's the point of dual core? What's the point of quad core? What's the point of ever implementing HT on the P4? Why don't we all just go back to our 286 PCs and have a ball?

At least come up with a reasonable response as to why you think it's "pretty dumb". Do you not live in the land of America? The land of "super size me!"? If I can double the number of my theoretical processors for another 20 bucks, then by all means, super size my Intel order please, Newegg.
 
Seriously? Did you just ask "what's the point?"

What's the point of making a faster processor? What's the point of dual core? What's the point of quad core? What's the point of ever implementing HT on the P4? Why don't we all just go back to our 286 PCs and have a ball?

At least come up with a reasonable response as to why you think it's "pretty dumb". Do you not live in the land of America? The land of "super size me!"? If I can double the number of my theoretical processors for another 20 bucks, then by all means, super size my Intel order please, Newegg.

Well, frankly, HT sucked wang on the P4, and I can't imagine it not sucking wang on a Core2 due to that stigma. As I alluded to in my last post, the Core2 may indeed be something different, though, and maybe it will be a pleasant surprise. Until then I'm not holding my breath, and I'm more than satisfied with the blazing speed of my dual core chips.

Yeah, supersize all ya want, you'll be dead by 40 from heart disease :p
 
Don't forget that we're entering an age where efficiency matters. The new name of the game in the computer industry is "power efficiency," and hyperthreading is ideal for that purpose. Hyperthreading makes each individual core more efficient (5-20% depending on the app, in the P4's case), which is definitely a plus. Adding hyperthreading to Intel's existing architecture hardly requires any changes, so why not do it in the interest of efficiency? When Intel added the additional scheduler and associated infrastructure for HT to the P4, it only added about 5% additional die space, so why not?
 
Ok. I stand corrected. I seem to recall in earlier discussions that the longer pipeline was required. The discussion stemmed from someone asking about HT in the Pentium M processors.

The longer pipeline isn't required as such... But because the pipeline is longer, there's more chance of units being idle.
The Core2's pipeline is 'wider' (more units in parallel), which is also a reason why there's more chance of units being idle.
Either way, HT will work fine.
 
You're probably thinking of the fact that HT requires the trace cache idea from the P4. Most CPU architectures decode x86 instructions "on the fly" and feed them directly into the pipeline... in order to implement HT, you need to decode them independently and store the decoded instructions somewhere while both threads are scheduled.

This isn't true.
All x86 processors since PPro have out-of-order execution which consists of a system that first decodes x86 instructions into uOps, then stores them in a buffer, and lets the out-of-order logic pick instructions when they can be executed by the backend. Then the executed instruction is stored in the reorder-buffer, and finally the reordering station will write back the results in the proper order to ensure the correct in-order behaviour.

Trace cache is only different in that the code cache never actually stores the x86 instructions, but rather the decoded uOps.
This is not strictly required for HT, because HT only affects the out-of-order logic, really. It just has to keep track of 2 core states rather than 1.
Trace cache may improve performance if your x86-decoder itself isn't fast enough to decode the x86-code for 2 threads at the same time... but I'm not so sure if that is the case. After all, the efficiency of HT isn't *that* high, so there won't be all that many extra instructions that have to be decoded for the second thread.
Alternatively, you could just use a second x86-decoder instead of trace cache.

So it's not required (and I don't think IBM uses it in its version of HT/SMT). Nevertheless, tracecache was a nice idea, and can even benefit single-threaded systems. I've said it before, tracecache is one of the technologies of P4 that I think is most likely to be re-used in future processors.
Namely, there are all sorts of penalties when decoding x86-code. If you are running a loop (and most time of most programs is generally spent in loops), a regular CPU will get these penalties at every iteration, because it just redecodes the same instructions in the same way. With trace-cache only the first iteration will be slower, but after that, you feed the uOps directly, which have no decoding penalties.
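
If you want to see that loop argument in code form, here's a toy model (the decode and replay "costs" are invented, and real decode/uop handling is obviously far messier):

```python
# Toy model of the trace-cache idea: pay the x86-decode cost only the
# first time an address is seen; later iterations of a loop replay the
# already-decoded uops straight from the trace cache.

DECODE_COST = 3   # made-up "cycles" to chew through prefixes, lengths, etc.
REPLAY_COST = 1   # made-up "cycles" to fetch ready-made uops from the trace cache

def run_loop(iterations, instructions_per_iter, trace_cache):
    cache = set()
    cycles = 0
    for _ in range(iterations):
        for addr in range(instructions_per_iter):
            if trace_cache and addr in cache:
                cycles += REPLAY_COST          # uops come straight from the cache
            else:
                cycles += DECODE_COST          # full x86 decode
                cache.add(addr)
    return cycles

print("no trace cache  :", run_loop(1000, 20, False), "decode cycles")
print("with trace cache:", run_loop(1000, 20, True), "decode cycles")
```

Only the first iteration pays the decode penalty; every pass after that is cheap, which is exactly the loop case described above.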
 
This isn't true.
All x86 processors since PPro have out-of-order execution which consists of a system that first decodes x86 instructions into uOps, then stores them in a buffer, and lets the out-of-order logic pick instructions when they can be executed by the backend. Then the executed instruction is stored in the reorder-buffer, and finally the reordering station will write back the results in the proper order to ensure the correct in-order behaviour.

Trace cache is only different in that the code cache never actually stores the x86 instructions, but rather the decoded uOps.
This is not strictly required for HT, because HT only affects the out-of-order logic, really. It just has to keep track of 2 core states rather than 1.
Trace cache may improve performance if your x86-decoder itself isn't fast enough to decode the x86-code for 2 threads at the same time... but I'm not so sure if that is the case. After all, the efficiency of HT isn't *that* high, so there won't be all that many extra instructions that have to be decoded for the second thread.
Alternatively, you could just use a second x86-decoder instead of trace cache.

So it's not required (and I don't think IBM uses it in its version of HT/SMT). Nevertheless, tracecache was a nice idea, and can even benefit single-threaded systems. I've said it before, tracecache is one of the technologies of P4 that I think is most likely to be re-used in future processors.
Namely, there are all sorts of penalties when decoding x86-code. If you are running a loop (and most time of most programs is generally spent in loops), a regular CPU will get these penalties at every iteration, because it just redecodes the same instructions in the same way. With trace-cache only the first iteration will be slower, but after that, you feed the uOps directly, which have no decoding penalties.

thanks for clearing that up...at least I was semi-correct. I guess this explains why the p4's decoder can decode more instructions per clock than the scheduler can issue (IIRC it can decode 4 per clock and the scheduler can issue 6 per 2 clocks).

Anyway, I definitely agree that despite how often people want to bash the netburst architecture, it had some brilliant ideas--many of which were crucial to the current core2 arch.
 
thanks for clearing that up...at least I was semi-correct. I guess this explains why the p4's decoder can decode more instructions per clock than the scheduler can issue (IIRC it can decode 4 per clock and the scheduler can issue 6 per 2 clocks).

Well, not entirely... x86 decoding is quite hairy... The specs probably say it can decode *up to* 4 instructions per clock. That would be the best-case scenario. If I'm not mistaken, the decoder looks 16 bytes ahead when decoding. Some x86 instructions are so long that you can't even fit 4 of them in 16 bytes. And some instructions are so complex (e.g. all sorts of prefix bytes, one after another) that they simply take multiple cycles to decode.

So they're talking about peak performance... The average performance will be lower, and they need that extra margin to try and stay ahead of the rest of the pipeline at all times (which still doesn't always work).
It's quite common for x86 processors to have decoders with higher peak performance than the rest of the pipeline can handle.
It's also common to have more execution units than the number of instructions that can be retired... because you can't keep all units busy at all times anyway.
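
To put some rough numbers on the 16-byte window thing (the byte counts below are typical-ish example encodings I picked; exact lengths depend on prefixes, addressing modes and immediates):

```python
# A 16-byte fetch/decode window vs. variable-length x86 instructions.
window = 16
insns = [
    ("add eax, ebx",                  2),
    ("mov eax, 1",                    5),   # opcode + 32-bit immediate
    ("movss xmm0, [esp+0x10]",        6),   # prefix + opcode + modrm + sib + disp8
    ("mov dword [ebp-0x40], 0x1234",  7),   # opcode + modrm + disp8 + imm32
]

used, decoded = 0, 0
for name, size in insns:
    if used + size > window:
        break
    used += size
    decoded += 1

print(f"{decoded} of {len(insns)} instructions fit in the {window}-byte window ({used} bytes used)")
```

With a few longer encodings in a row, only 3 of those 4 fit, so the decoder can't hit its 4-per-clock peak no matter what.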

Anyway, I definitely agree that despite how often people want to bash the netburst architecture, it had some brilliant ideas--many of which were crucial to the current core2 arch.

I'd even go as far as to say that we may see a Netburst-like architecture again in a few years, if and when silicon manufacturing has matured enough to make 5+ GHz speeds possible... which is what Netburst was originally designed for.
A Core2 with tracecache and HT would already be quite similar... all it would need is a longer pipeline... and I do think that the pipeline will get longer in the future.
The key is finding the sweet-spot between IPC and clockspeed, and the right pipeline-length is crucial to that.
Going from P3 to P4 was a huge leap, and Intel overshot the sweet-spot by miles... Currently they seem to play it safe and take it one step at a time... The current Core2 will probably get us past 4 GHz... and then Intel will probably want to see if 5+ GHz is possible, with a longer pipeline.
 
...


I'd even go as far as to say that we may see a Netburst-like architecture again in a few years, if and when silicon manufacturing has matured enough to make 5+ GHz speeds possible... which is what Netburst was originally designed for.
A Core2 with tracecache and HT would already be quite similar... all it would need is a longer pipeline... and I do think that the pipeline will get longer in the future.
The key is finding the sweet-spot between IPC and clockspeed, and the right pipeline-length is crucial to that.
Going from P3 to P4 was a huge leap, and Intel overshot the sweet-spot by miles... Currently they seem to play it safe and take it one step at a time... The current Core2 will probably get us past 4 GHz... and then Intel will probably want to see if 5+ GHz is possible, with a longer pipeline.

I don't think getting more IPC than Core/K8L is worthwhile with x86. The complexity of the decoder and prefetch logic will become the true bottleneck sooner or later.

Netburst was brilliant in guessing the future: with the limited IPC that you can extract, better to have a simpler, narrower core that can run at huge frequencies. Too bad manufacturing couldn't keep up...
 
Don't forget that we're entering an age where efficiency matters. The new name of the game in the computer industry is "power efficiency," and hyperthreading is ideal for that purpose. Hyperthreading makes each individual core more efficient (5-20% depending on the app, in the P4's case), which is definitely a plus. Adding hyperthreading to Intel's existing architecture hardly requires any changes, so why not do it in the interest of efficiency? When Intel added the additional scheduler and associated infrastructure for HT to the P4, it only added about 5% additional die space, so why not?

:p 5% more die space, and it accounted for 95% of the validation time. SMT is a bitch to implement and a true whore to debug. Few dared to tackle it; most simply ignored it.
 
I don't think getting more IPC than Core/K8L is worthwhile with x86. The complexity of the decoder and prefetch logic will become the true bottleneck sooner or later.

Netburst was brilliant in guessing the future: with the limited IPC that you can extract, better to have a simpler, narrower core that can run at huge frequencies. Too bad manufacturing couldn't keep up...

Yea, that trend already started with Pentium Pro, really...
A friend of mine was quite shocked when he ran his Pentium MMX-code on a PII for the first time, and found it to be much slower.
 
More threads in the pipe = better.

I don't believe hyper-threading needs a longer pipeline to be effective... why would it?

I for one would like to see this as well... That way, when we have a native 4-core CPU (and 8-core will not be out for a while yet), we can still have 8 threads... sweet.
 
There is no point. There is never any point to having faster equipment for less money.

Nothing uses more than four cores anyway; no multitasker can load 8. What % does HT help anyway? I remember it being -5 to 10%.
 
Nothing uses more than four cores anyway; no multitasker can load 8. What % does HT help anyway? I remember it being -5 to 10%.

What do you mean 'nothing'?
Some applications can use as many cores as you want... like, for example, 3D rendering in 3dsmax, or video encoders.
Loading 8 cores is not a problem at all.
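
That kind of workload is trivially easy to spread out too. Roughly something like this (a bare-bones sketch; the per-tile function is just a stand-in for whatever real work a renderer or encoder does):

```python
# Minimal sketch: an embarrassingly parallel job (think tiles of a 3D render
# or chunks of a video encode) spread across however many cores you have.
from multiprocessing import Pool, cpu_count

def render_tile(tile_id):
    # stand-in for real per-tile work; any independent CPU-heavy task fits here
    acc = 0
    for i in range(2_000_000):
        acc += (tile_id * 31 + i) % 97
    return tile_id, acc

if __name__ == "__main__":
    tiles = range(64)                            # far more tasks than cores
    with Pool(processes=cpu_count()) as pool:    # 8 cores -> 8 busy workers
        results = pool.map(render_tile, tiles)
    print(f"rendered {len(results)} tiles on {cpu_count()} workers")
```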
 
What kind of performance will we see on a new C2D with HT, compared with the old one with no HT? (Both with two cores.)
 
This IMHO is pretty dumb. You already have two execution cores, or even 4. What's the point?

Well, keep in mind what HT actually does: it basically just impacts the Windows scheduler so that the user gets a smoother experience for what he wants to do immediately. It in no way made the processor do more work. Now let's think about having two cores. With these first units, we have seen the same smooth user experience while the system is getting a lot more work done. Now let's take a look at quad core: the user experience stays smooth and the applications get a lot more work done.

NOW... let's start throwing in the fact that software devs are producing desktop applications that can take advantage of two or four cores fully. What if you are using two or three of these and producing a workload for the processors in each? We saw this with our 3DStudio Max tests in the 4x4 article we did. Having HT would likely have made our user experience smoother when wanting to use the box for something else while it was rendering a scene. This might seem like an uber high-end example now, but as HD video editing and other rendering tools become more of a normal desktop thing, I can surely see where HT could make a comeback and be useful.

I hope that made some sense as it was just stream of consciousness ...
 
Well, HT can make the processor do more work as well... just not as much as an entire second core.
But still, I've seen up to about 20% gain on some of my multithreading code on HT machines... code that was actually meant for dual cores.

No, HT does not make a processor do more work; it can simply make more efficient use of the cycles at given times. So if you want to split hairs, yes, you can see areas where there are MINUTELY small workload advantages because the OS scheduler is simply more efficient.

As for your 20% comments, I would like to see the data to back that up.
 
Run the program in this thread on a P4 with HT enabled and disabled... I don't guarantee that you'll get 20% gain, but I'm quite sure it will run noticeably faster with HT enabled, when you run it with 2 threads.
On a P4 3.0 HT I got 135 fps with the singlethreaded version, and 145 fps with HT enabled and 2 threads. So quite some improvement...
http://www.hardforum.com/showthread.php?t=1149750

So 7% is 20%? Wow. Great math there. :) Instead of making outlandish claims, you should focus on running the test first and then posting what the actual results are.
 
So 7% is 20%? Wow. Great math there. :) Instead of making outlandish claims, you should focus on running the test first and then posting what the actual results are.

I never said the 20% was with THIS program, or on THIS system.
But I don't currently have any other programs to test with, or any other results at hand.
So stop harassing me.
 
Well, frankly, HT sucked wang on the P4, and I can't imagine it not sucking wang on a Core2 due to that stigma. As I alluded to in my last post, the Core2 may indeed be something different, though, and maybe it will be a pleasant surprise. Until then I'm not holding my breath, and I'm more than satisfied with the blazing speed of my dual core chips.

Yeah, supersize all ya want, you'll be dead by 40 from heart disease :p

The same could be said for MMX and SSE by some folks too. The implementation of them is what sucked (IMO). They all hold/held great potential, but with faster silicon always on the horizon, just coding for raw muscle seems to be the method of choice. Timing (getting to market) and other factors contribute just as much as anything does (maybe more).

It is good to hear that 45nm and the new process using hafnium etc. is gonna be available on Socket 775 :cool:
 
The same could be said for MMX and SSE by some folks too. The implementation of them is what sucked (IMO). They all hold/held great potential, but with faster silicon always on the horizon, just coding for raw muscle seems to be the method of choice. Timing (getting to market) and other factors contribute just as much as anything does (maybe more).

It is good to hear that 45nm and the new process using hafnium etc. is gonna be available on Socket 775 :cool:

I like SSE! It makes my Folding go faster! :D (Yeah, I know *nothing* about SSE, just that it accelerates Gromacs units on F@H.)
 
I'm all for the new processors; I hope they're more than just clock increases with extra cache. I do however have a thought about them I don't like. I want to see native dual core AND four core processors from Intel. I also want memory controllers on them as well. I really don't care for the "bolt two cores to a chip and call it dual core" approach. I really thought after the Pentium D that Core 2 Duos would be native, and I really thought that the new 45nm would be, but alas they aren't. I'm sure they will be fun, but they could be much better. I'm looking forward to AMD's native 4 core processors due out soon. Word is they are 40% stronger than current Intel 4 core processors. I'll sell my E6600 in a second if it holds true.
 
I really don't care for the "bolt two cores to a chip and call it dual core" approach.
I think you haven't been keeping up for more than a year. Core and Core 2 CPUs are native dual core. Even better, they share the L2 cache, and Core 2 cores can snoop on one another's L1 cache. That's a higher level of integration than AMD has in their "native" dual core CPUs.

There are reasons Intel is staying multi-chip module with the quad core chips (lower defect rate and fewer manufacturing problems). CSI and ODMC are coming to Intel chips next year, but the FSB isn't going away. Even with ODMC and HT (HyperTransport), AMD is getting soundly beaten by current Intel CPUs.
 
I'm all for the new processors; I hope they're more than just clock increases with extra cache. I do however have a thought about them I don't like. I want to see native dual core AND four core processors from Intel. I also want memory controllers on them as well. I really don't care for the "bolt two cores to a chip and call it dual core" approach. I really thought after the Pentium D that Core 2 Duos would be native, and I really thought that the new 45nm would be, but alas they aren't. I'm sure they will be fun, but they could be much better. I'm looking forward to AMD's native 4 core processors due out soon. Word is they are 40% stronger than current Intel 4 core processors. I'll sell my E6600 in a second if it holds true.

As far as I know, 45 nm will bring a native quadcore... it might also bring a 'bolt-on' octacore.
But what if they don't, and the K8L turns out to be slower still... then you still won't buy it?
 
I'm all for the new processors; I hope they're more than just clock increases with extra cache. I do however have a thought about them I don't like. I want to see native dual core AND four core processors from Intel. I also want memory controllers on them as well. I really don't care for the "bolt two cores to a chip and call it dual core" approach. I really thought after the Pentium D that Core 2 Duos would be native, and I really thought that the new 45nm would be, but alas they aren't. I'm sure they will be fun, but they could be much better. I'm looking forward to AMD's native 4 core processors due out soon. Word is they are 40% stronger than current Intel 4 core processors. I'll sell my E6600 in a second if it holds true.

WTF do you care if it's native or not? Does it stop you from sleeping at night or what?

You should care about 3 things only:
1. Performance
2. Price
3. Power consumption

Even if there are midgets with handhelds inside, you shouldn't give a rat's bottom about that... :rolleyes:

As for the 40% faster claim, rest assured, you'll eat your words and keep your 6600.
 
Yea, the claim is that it's 40% faster than current CPUs in SpecFP (and that's only a performance estimate, not an actual benchmark result!)...
Which means we immediately think 2 things:
1) This CPU will be released in 6 months' time, by which time Intel might already have closed much of the 40% gap by simply increasing clockspeed.
2) SpecFP only measures the theoretical speed of floating point operations. There are several different categories, one of which is peak performance... AMD didn't specify which, but it's probably peak, which means they are allowed to use any kind of optimizations with their compiler. This may not be representative of real-world applications.
Secondly, it only measures FP performance, which is not a direct indicator of real-world performance. In many real-world applications, only part (or even none) of the code uses floating point operations. So a 40% improvement would be smaller in practice, if the rest of the CPU isn't 40% faster as well.
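
A quick back-of-the-envelope example of that last point (the FP-time fractions here are just assumptions to show the effect):

```python
# If only part of the runtime is floating-point work, a 40% FP speedup
# shrinks considerably at the application level (Amdahl's law).
fp_speedup = 1.40

for fp_fraction in (1.0, 0.5, 0.25):     # assumed share of runtime spent in FP code
    new_time = (1 - fp_fraction) + fp_fraction / fp_speedup
    overall = 1 / new_time
    print(f"FP share {fp_fraction:>4.0%}: overall speedup ~{(overall - 1):.0%}")
```

So with half the time in FP code, the headline 40% shrinks to roughly 17% overall, and with a quarter it's under 10%.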

In other words: take the 40% with a healthy helping of salt.
I'd even go so far as to interpret it in a negative fashion... About 6 months before Core2 was released, there was also talk of 20-40% faster CPUs... the difference was that they were talking about real-world applications, and backed it up with actual real-world benchmarks, rather than purely synthetic tests.
If we're not going to see any real-world benchmark figures from AMD, chances are there's not much to tell about them.
 