AMD HyperThreading?

Phantom

I am wondering why AMD does not use HyperThreading, or their own version of it, on their processors. Wouldn't implementing some form of this help in multitasking and other applications that the P4 takes advantage of? Would the shorter pipelines make the AMD perform slower, or is it just that Intel has it so they can't use it?
 
HyperThreading is probably a patented Intel (*cough* marketing term *cough*) technology.
 
Just so nobody gets confused by the first reply... HyperTransport != HyperThreading.
 
Intel needs HyperThreading on their P4s because their pipeline sucks. AMD could put it on its A64s if they wanted to, but it is not needed.
 
Why bother wasting R&D on HT-like technology when you can get near-double performance with dual core? :)
 
HyperTransport has nothing to do with HyperThreading. Sí, señor! HyperThreading is of course dealing with asynchronous multitasking or some highbrow Intel BS. HyperTransport deals with the A64 architecture, reducing the number of buses and allowing higher speeds and lower latencies.

Edit: Me agrees with da FLECOM! DC all the way!
 
HyperThreading can give some extra performance. I don't see why they don't add it in, but I don't care either, since it doesn't help gaming performance.
 
Agreed. A simple 200 MHz core bump on a 2 GHz A64 rig would outclass a 2 GHz rig with HyperThreading.
 
Ideally, yes, you'd like to see SMT on Opteron, and possibly A64. In situations where you have multithreading, it's free performance. Intel's specific implementation is a bit dated in terms of the features implemented, though. You want to look at the POWER5 to see what SMT should look like if it were implemented now: the ability to dynamically enable and disable SMT on the fly as the software load dictates, and the ability to restrict resources (including decoder time, and thus the number of instructions issued) based on software thread priorities, so a high-priority thread (say a game) doesn't get equal CPU resources with a low-priority thread (like SETI).
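For anyone curious what that priority idea looks like from the software side, here is a minimal sketch in C++. To be clear, this is only the OS-level analogue of what the POWER5 does per decode slot in hardware; it assumes a Linux/pthreads build (where std::thread::native_handle() returns a pthread_t), and the workload names are made up.

Code:
// Hedged sketch: OS-level thread priorities as a rough software analogue of
// POWER5-style SMT thread priorities. Assumes Linux + libstdc++; SCHED_IDLE
// is Linux-specific. Workload names are hypothetical.
#include <atomic>
#include <pthread.h>
#include <sched.h>
#include <thread>

std::atomic<bool> running{true};

void game_loop()   { while (running.load()) { /* render a frame */ } }
void seti_crunch() { while (running.load()) { /* grind a work unit */ } }

int main() {
    std::thread game(game_loop);
    std::thread seti(seti_crunch);

    // Push the background thread down to SCHED_IDLE so it only gets the
    // cycles the high-priority thread doesn't want.
    sched_param low{};
    low.sched_priority = 0;  // SCHED_IDLE requires a priority of 0
    pthread_setschedparam(seti.native_handle(), SCHED_IDLE, &low);

    // ...let them run for a while, then shut down.
    running.store(false);
    game.join();
    seti.join();
    return 0;
}

The POWER5 twist is that the hardware does this allocation every cycle, inside one core, instead of leaving it all to the OS scheduler.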

A dual-core chip obviously provides better performance than an SMT core when dealing with 1 or 2 high-demand threads. In servers, like web servers and database servers where you may have many, many threads, dual core with 2-thread SMT per core is certainly an advantage; any way to provide a higher degree of parallelism helps. Taken to the extreme you get Sun's Niagara chip: 8 independent, very basic cores, each supporting 4 threads via SMT, designed basically as a web server on a chip.

I really doubt we'll see SMT from AMD though. Dual-core chips extend basic workstations to 2 or 4P systems, and servers to 16P; it just doesn't make sense for AMD to go back and make significant architectural changes when they can already put that much processing power, that much parallelism, into a box. If we were going to see SMT, we would have seen it from the outset of K8, or we'll have to wait and see with K9.
 
Phantom said:
I am wondering why AMD does not use HyperThreading, or their own version of it, on their processors. Wouldn't implementing some form of this help in multitasking and other applications that the P4 takes advantage of? Would the shorter pipelines make the AMD perform slower, or is it just that Intel has it so they can't use it?


Because HyperThreading offers no real-world performance gain.
 
Mushroom Prince said:
Because HyperThreading offers no real-world performance gain.

Not *entirely* true. Most in-depth studies I've seen show a 3-5% increase, but like others said, it's basically a hack to help Intel's uber-long pipeline when it stalls (which is quite frequently).
 
Steel Chicken said:
Not *entirely* true. Most in-depth studies I've seen show a 3-5% increase, but like others said, it's basically a hack to help Intel's uber-long pipeline when it stalls (which is quite frequently).

stall as in branch misprediction?
 
Mushroom Prince said:
Because HyperThreading offers no real-world performance gain.

Depends a lot on the app and the usage. 10-15% is pretty doable with decent multithreading of workstation-type apps.

In a single-user / single-app environment, I really agree with you. But in multitasking, multiuser environments like 2P workstations and 2/4P servers, SMT can be a stunning improvement.



RancidWAnnaRIot said:
stall as in branch misprediction?

A stall means the next instruction(s) in the thread can't be executed. Flushing after a branch misprediction is one common cause of a stall; a cache miss is the other big one. A dependency on an in-flight instruction can also cause a stall, though modern CPUs are pretty good at looking ahead to find instructions that can be executed out of order.
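To make that concrete, here's a rough illustration (a sketch, not a benchmark) of the dependency-plus-cache-miss case next to code the out-of-order engine can actually hide latency in:

Code:
// Rough sketch only. The linked-list walk hits both stall causes mentioned
// above: each load depends on the previous one (a dependency on an in-flight
// instruction), and a scattered list misses cache constantly, so the core
// mostly waits. The array sum has independent, prefetch-friendly loads that
// out-of-order execution can overlap.
#include <cstddef>

struct Node { Node* next; int payload; };

int walk_list(const Node* head) {
    int sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->payload;          // the next load can't start until this one returns
    return sum;
}

int sum_array(const int* a, std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                // independent loads, easy to reorder and prefetch
    return sum;
}

While walk_list is sitting in one of those multi-hundred-cycle waits, an SMT core can issue instructions from a second thread instead of idling.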
 
RancidWAnnaRIot said:
stall as in branch misprediction?

Yes, I am no expert, but as far as I understand it, that's correct.
HyperThreading minimizes the time lost when you have empty spots in the pipeline from branch mispredictions.
 
It depends a lot on the chip architecture whether or not it benefits much from simultaneous multithreading. With the P4, if it experiences a pipeline stall, that means all its execution units are unused at the moment of the stall, which means it's wasting all its resources. So having two threads means that if thread no. 1 stalls, thread no. 2 can use the processor's resources while thread no. 1 is figuring out what the heck is going on. So SMT can increase efficiency, if you have a processor that suffers from a lack of it. Since AMD's chips are pretty good in that department, they probably didn't think it was worth the effort (and $$$) to R&D it, so they skipped to the next logical step after SMT, which is symmetric multiprocessing, or SMP, on a single chip.
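As a minimal sketch of the kind of thread pair SMT was built for (the workloads here are invented purely for illustration): one thread that stalls on memory constantly, and one that is pure ALU work and can fill the slots the first one leaves empty.

Code:
// Hedged sketch, not a benchmark. On an SMT core the ALU-bound thread can
// use the execution units while the memory-bound thread waits on cache
// misses; on a non-SMT core those issue slots simply go unused.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

void memory_bound(const std::vector<std::uint32_t>& idx,
                  const std::vector<std::uint32_t>& data,
                  std::uint64_t& out) {
    std::uint64_t sum = 0;
    for (std::uint32_t i : idx) sum += data[i];       // scattered reads -> frequent misses
    out = sum;
}

void compute_bound(std::uint64_t iters, std::uint64_t& out) {
    std::uint64_t x = 1;
    for (std::uint64_t i = 0; i < iters; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  // pure ALU churn
    out = x;
}

int main() {
    const std::size_t n = 1u << 24;                   // working set far bigger than any cache
    std::vector<std::uint32_t> data(n, 1), idx(n);
    for (std::size_t i = 0; i < n; ++i)
        idx[i] = static_cast<std::uint32_t>((i * 2654435761u) % n);  // scrambled access order

    std::uint64_t a = 0, b = 0;
    std::thread t1(memory_bound, std::cref(idx), std::cref(data), std::ref(a));
    std::thread t2(compute_bound, 100000000ULL, std::ref(b));
    t1.join();
    t2.join();
    std::printf("%llu\n", static_cast<unsigned long long>(a + b));
    return 0;
}

Pinning the process to the two logical CPUs of one physical core (e.g. with taskset -c) makes the effect, or the lack of it, easy to see in the combined run time.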
 
Jason711 said:
I just hope dual core starts to push SMP mainstream...
That's one thing I want to know.

Will these dual cores function the same as two individual chips?
 
In all honesty, look back at the P4 from start to current, throw in the roadmaps for dual core, I think HT was a good portion of the P4's initial design.

Taking notes from numerous reports, interviews, etc, Intel told the engineers MHz matters, so they designed for it. HT and a massive FPU were in the initial design. I think HT was the cheap way of testing the SMT/dual core market viability. It all looks like it was in there from the start (which has been said) and it all flows to a natural evolution of the Intel line.
 
Yes, that's the whole point: one chip acting as two. The real reason for this is that chip makers are starting to hit the ceiling on core speeds they can produce without cooling becoming a showstopper, so they are going to increase cores to compensate. No one is talking much about how we will handle the heat of two cores on a single chip, but I suspect slower speeds will have to happen...
 
Steel Chicken said:
Yes, I am no expert, but as far as I understand it, that's correct.
HyperThreading minimizes the time lost when you have empty spots in the pipeline from branch mispredictions.

Basically correct.

The P4 has a very high branch misprediction penalty and a very long pipeline. As we all know, the Prescott P4 has a 31-stage pipeline, which allows it to scale to higher clock speeds, but the cost of a branch misprediction goes up with it. This is why it also has an increased L2 cache size. The K8 core, on the other hand, only has a 12-stage pipeline and pays a much smaller misprediction penalty because of it.

HyperThreading works by using the idle execution resources in the P4, since they can sit there unused a lot of the time. With the K8, the execution units are generally kept busy all of the time, as the chip is a lot more efficient than the P4. So even if AMD did implement SMT, it would not do much, because AMD is already using their CPU about as efficiently as they can.
 
I use a 1700 Xeon processor every day. It is a dual box with one processor. The hyperthreaded Xeon does feel smoother than the XP 3200+ sitting next to it. For example, when I am performing an edit on a 200 MB Photoshop image or a 1 GB PostScript file, the XP 3200+ gets it finished much faster, but don't plan on doing anything else. On the Xeon box I can still work on publishing/web surfing/general farting around and forget that the processor is totally tapped.

I believe that in the real world, HyperThreading gives a smoothness to people who are using their computers to do extended calculations while handling lighter tasks in the foreground.
 
Hyperthreading will benefit any CPU, but the return on investment is low on short pipelines.
A long pipeline will benefit much more, so the P4 is a good candidate.
 
M1ster_R0gers said:
I use a 1700 Xeon processor every day. It is a dual box with one processor. The hyperthreaded Xeon does feel smoother than the XP 3200+ sitting next to it. For example, when I am performing an edit on a 200 MB Photoshop image or a 1 GB PostScript file, the XP 3200+ gets it finished much faster, but don't plan on doing anything else. On the Xeon box I can still work on publishing/web surfing/general farting around and forget that the processor is totally tapped.

I believe that in the real world, HyperThreading gives a smoothness to people who are using their computers to do extended calculations while handling lighter tasks in the foreground.

I'll have to agree there. A while ago I was playing some UT2k4 with my roommate and we took turns hosting the server. I have an Athlon XP at 2.6 and he has a 2.4C... he was able to host the server and play the game far better than I was (at the time we were both using 9700 Pros with the same settings). I actually had to lower some settings for smooth gameplay when I was the host. While benchmarks show relatively small gains from HyperThreading, the real difference seems to be in multitasked environments and how responsive the computer feels.
 
No matter how you look at it, having HT *available* is a good thing. AMD won't bother with it because they are going dual core soon enough and it would be a waste of development money. Dual core will make HT look like child's play, but AMD will more than likely not jump on the HT wagon. I do, however, see Intel pushing HT even on dual core; it should have the same effect of smoothing out general performance. Two physical cores, each presenting two logical CPUs.

As for the P4's branch prediction, I think Intel is somewhere around 98-99% accuracy with Prescott, which is pretty damn good.
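If anyone wants to check the "HT available" question from software rather than digging through the BIOS, here is a hedged sketch using GCC/Clang's <cpuid.h>. The HTT flag is CPUID leaf 1, EDX bit 28; note that the flag alone doesn't prove HT is actually switched on.

Code:
// Sketch only: x86 with GCC/Clang <cpuid.h>. Reads the HTT feature flag and
// the "logical processors per package" count from CPUID leaf 1. The HTT bit
// means the package can expose multiple logical CPUs; whether HT is actually
// enabled still depends on the BIOS and OS.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        std::printf("CPUID leaf 1 not supported\n");
        return 1;
    }
    const bool htt_flag = (edx >> 28) & 1u;        // EDX bit 28 = HTT
    const unsigned logical = (ebx >> 16) & 0xffu;  // EBX[23:16] = logical CPUs per package
    std::printf("HTT flag: %s, logical CPUs per package: %u\n",
                htt_flag ? "yes" : "no", logical);
    return 0;
}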
 
A really efficient, truly well-designed processor doesn't have the time to be two fractional virtual processors; it is busy enough using its time and internal resources just being an efficient CPU.

A half-assed, broken processor design will have the time and free internal hardware resources left over to be two fractional virtual processors.

The A64/Opterons have no need for HyperThreading support from programmers. All of that fragmented processing is handled transparently, without external support.
 
nam-ng said:
A really efficient, truly well-designed processor doesn't have the time to be two fractional virtual processors; it is busy enough using its time and internal resources just being an efficient CPU.

A half-assed, broken processor design will have the time and free internal hardware resources left over to be two fractional virtual processors.

The A64/Opterons have no need for HyperThreading support from programmers. All of that fragmented processing is handled transparently, without external support.

Now try and explain away why HT, or dual proc for that matter, makes a system much more responsive in a multitasking environment. A single P4 with HT will do better than an FX-55 in a multitasking environment with a heavy load. HT simply allows for a more responsive system because other processes can get CPU time.
 
Dew said:
Now try and explain away why HT, or dual proc for that matter, makes a system much more responsive in a multitasking environment.
More responsive? You've never seen a real dually stutter while a single-CPU, single-processor version didn't? If you haven't learned about that one yet, go learn about it first. Try hands-on learning, like designing duallys or debugging duallys; a lot of dumbshits on the net are too stupid to know that expertise can't be achieved just by reading. Users of duallys are not the same as designers of duallys.

I actually met a dumbshit who had read two chapters about monitors from an A+ book and considered himself a monitor expert.

HT isn't like a regular dually... it's more like a broken dually. It is Intel's implementation of non-inclusive, fractional virtual processors.
A single P4 with HT will do better than an FX-55 in a multitasking environment with a heavy load. HT simply allows for a more responsive system because other processes can get CPU time.
First, P4 systems with HT will have to wait for more programmers who can actually use a real dually efficiently in their programs before they learn to deal with broken duallys.

Currently most programmers can't even use a dually properly and efficiently, much less a broken dually.
 
I’m going to do a “mass answer”

Mushroom Prince said:
Because HyperThreading offers no real-world performance gain.
HyperThreading makes a lot of difference in the real world. I know this has been answered way back, but it splits the load. The Pentium 4 has massive execution power, and HT takes the idle execution units, which are around 65% of them at any given time, and puts them to good use.

0ldman said:
In all honesty, look back at the P4 from start to current, throw in the roadmaps for dual core, I think HT was a good portion of the P4's initial design.

Taking notes from numerous reports, interviews, etc, Intel told the engineers MHz matters, so they designed for it. HT and a massive FPU were in the initial design. I think HT was the cheap way of testing the SMT/dual core market viability. It all looks like it was in there from the start (which has been said) and it all flows to a natural evolution of the Intel line.
While it's been said that HT was partially present as far back as the Coppermine cores, I don't think that Intel designed the Pentium 4 to use HT specifically. The design of the initial Pentium 4 was a series of performance/cost tradeoffs. That's why the L1 cache was so small yet fast; Intel said that a larger cache would have cost a lot (thanks to the inclusive cache structure) and would have had higher latencies. I think the FPU design was just to increase performance. Intel optimized the Pentium 4 for SSE2 operations and double-clocked the ALUs. I think HT was somewhat of an added bonus.

CentronMe said:
Basically correct.

The P4 has a very high branch misprediction penalty and a very long pipeline. As we all know, the Prescott P4 has a 31-stage pipeline, which allows it to scale to higher clock speeds, but the cost of a branch misprediction goes up with it. This is why it also has an increased L2 cache size. The K8 core, on the other hand, only has a 12-stage pipeline and pays a much smaller misprediction penalty because of it.

HyperThreading works by using the idle execution resources in the P4, since they can sit there unused a lot of the time. With the K8, the execution units are generally kept busy all of the time, as the chip is a lot more efficient than the P4. So even if AMD did implement SMT, it would not do much, because AMD is already using their CPU about as efficiently as they can.

HT is better off with a longer pipeline, not so it can take advantage of the extra stages, but because it needs the pauses between each stage of the pipeline. The reason HT shouldn't do as much on an AMD is its shorter pipeline. Another thing that I think is often neglected is the fact that HT needs a lot of execution units to work properly; it takes a longer pipeline and powerful execution units to make it work. A processor with a pipeline length of around 14 stages can generally support HT, as IBM's processors have an HT-like feature on a 14-stage design. But here's where it starts getting fuzzy: the dual-core Itanium 2 has a form of multithreading, and the problem is that the Itanium 2 has a pipeline length of about 10 stages... so I think Intel modified their approach to work on processors with shorter pipelines. I think there was talk about tri-threading in Tejas, but I don't know if that's going to be used... imagine, a dual-core processor with three logical processors for every physical one... a nice six CPUs!

Sorry if this post is somewhat confusing, I haven’t read a processor architecture paper in about a month

And nam-ng, the reason why HT is good is because it's like free performance. Sure, it doesn't compare to a second processor, but it's taking unused power and redirecting it to do something productive. That's why it's good and it makes the processor more efficient. Having dual processors doesn't make a processor more efficient; having HT does.
 
nam-ng said:
A really efficient, truly well-designed processor doesn't have the time to be two fractional virtual processors; it is busy enough using its time and internal resources just being an efficient CPU.


You don't think POWER5 or UltraSPARC V or Niagara are well-designed CPUs?

A half-assed, broken processor design will have the time and free internal hardware resources left over to be two fractional virtual processors.

Please, any superscalar pipelined processor leaves a substantial portion of its available in-flight 'slots' unused. (True VLIW machines would be the exception, where scheduling is handled at compile time, not run time.)
 
FreiDOg said:
You don't think POWER5 or UltraSPARC V or Niagara are well-designed CPUs?



Please, any superscalar pipelined processor leaves a substantial portion of its available in-flight 'slots' unused. (True VLIW machines would be the exception, where scheduling is handled at compile time, not run time.)

I wouldn't include UltraSPARC V in that list - Sun recently cancelled it because it was just pointlessly out of hand. They're teaming up with Fujitsu to use their 64-bit SPARC implementation instead.

I definitely agree about pipelines and unused slots, though. Pipelining is tricky - it gets you a higher peak performance that is harder to reach. Look at how notorious the i860 was. A true VLIW machine would be great. If Intel hadn't run the Itanium idea through 20 marketing committees before handing it to the engineers, we'd be reminiscing about Pentiums and AMD now, rather than being impressed by them...
 
nam-ng said:
<some random tangent>
WTF are you talking about? You sound like you think I was talking about utilization of SMP; I'm not. I am not a dually designer by any means, but that is irrelevant to my argument. I am talking about multitasking as a user in a dually environment. I have owned both dually systems and HT systems. Why you even attacked me on the issue has me baffled. I did not once suggest that AMD even consider implementing HT. I personally let the operating system worry about processor affinity. My point is that a dually setup will be more responsive because even with one processor under full load, the OS will tell processes to execute on the other CPU. HT works well with the P4 because it allows otherwise wasted cycles to be used. I think you were under the impression that I run all SMP-aware software. Why you argue against HT I don't even know. If someone dislikes the marginal single-process speed loss from an HT-enabled CPU, they can disable HT in the BIOS.

The fact is this: HT and dually setups are more responsive because other applications can get CPU time even when a single process would otherwise bring the system to a standstill. I used to play Quake 3 and encode videos at the same time on my dually rig. Good luck doing that on a single-CPU system.
 
Dew said:
Now try and explain away why HT, or dual proc for that matter, makes a system much more responsive in a multitasking environment. A single P4 with HT will do better than an FX-55 in a multitasking environment with a heavy load. HT simply allows for a more responsive system because other processes can get CPU time.

In your specific case, P4 HT vs AMD, you notice a large difference mostly because Windows has truly awful multitasking.
 
Originally Posted by Duke3d87

And nam-ng, the reason why HT is good is because it's like free performance. Sure, it doesn't compare to a second processor, but it's taking unused power and redirecting it to do something productive. That's why it's good and it makes the processor more efficient. Having dual processors doesn't make a processor more efficient; having HT does.

****************************************
Originally Posted by FreiDOg

Please, any superscalar pipelined processor leaves a substantial portion of its available in-flight 'slots' unused. (True VLIW machines would be the exception, where scheduling is handled at compile time, not run time.)

If ATI can sell broken 2D-mapping anisotropic to fools as "smart and adaptive anisotropic", Intel certainly can continue to build single processors with the same identical wasted, unused hardware resources and the same unused in-flight "slots", selling them to fools who want two broken, fractional virtual processors.

Hey, if HT really catches on with a huge number of fools, Intel may even add extra broken, unused resources for even more fractional virtual processors.
 
Lack of HyperThreading, or anything that comes close, is why I've been buying Intel chips since HT came out. I multitask a lot (that includes long compiles in the background that take up a lot of processor). Having the machine responsive even when I'm doing that is an absolute necessity. When dual cores come out I might finally consider buying an AMD again. I used to buy them all the time, but the lack of HT has kept me out of the AMD market.
 
nam-ng said:
If ATI can sell broken 2D-mapping anisotropic to fools as "smart and adaptive anisotropic", Intel certainly can continue to build single processors with the same identical wasted, unused hardware resources and the same unused in-flight "slots", selling them to fools who want two broken, fractional virtual processors.

Hey, if HT really catches on with a huge number of fools, Intel may even add extra broken, unused resources for even more fractional virtual processors.

Hey, Niagara only issues one in-order instruction per clock cycle (per core), but apparently it's got plenty of 'broken unused resources', because each core runs up to 4 simultaneous threads.
I bet POWER5, despite being a good 10% faster clock-for-clock than the P4, has a bunch of those 'broken unused resources' as well; otherwise why would IBM bother to bring (a pretty modern and feature-rich) SMT to their already dual-core POWER line?



In any given CPU, what do you think costs more wasted cycles: a cache miss or a branch misprediction?
In most CPUs, what do you think happens more often: a cache miss or a branch misprediction?



A. L2 miss, by far. On a mispredict, a CPU has to flush between a handful and (theoretically, for the P4) 128 instructions from the pipeline. A dependency on an L2 miss (and it's a safe bet that in short order the thread will depend on that load; OOE is nice, but it can't get around the fact that most instructions depend on at least one previous result) costs a CPU anywhere from about 160 up to a 400-clock-cycle stall. Now, whatever could we do with a couple of hundred CPU cycles where most or all of a process is blocked? Oh, I know, why don't we continue simultaneous execution of a second thread.

A. Obviously it depends a lot on the code you're running, but modern branch prediction is extraordinarily accurate. The P4 tends to have accuracies in the upper 90s; the Athlon is perhaps a bit lower, but still well into the mid-to-upper 90s.
Cache is a bit more fickle because it requires good code, good compilers, a good cache architecture, and code capable of suiting it.
Great, tightly woven code with strong locality can have a >99% cache hit rate; of course, large amounts of data with very weak locality can easily have 20-30% hit rates.
While you rarely see very bad code (largely because modern compilers are very good at optimizing for cache usage), it's much rarer to see great code.
So, on average, again, the L2 answer is the correct one.

What was the point of this little quiz?
Just to point out that the perils of deep pipelining are only the second, or even third, most important reason to implement SMT.

SMT doesn't exist to hide bad engineering. (Though I suppose it could to an extent; two threads running poorly is still faster than one thread running poorly.)
SMT exists to hide inherent performance limitations in modern CPUs.
And while you're very fond of pointing out that the P4 has larger issues surrounding speculative execution than almost any other chip around, you seem to glaze over both the fact that SMT is being used in a number of other very well-designed CPUs, and that there are a multitude of very good reasons to implement SMT even in very well-engineered chips, even chips with very short pipelines.
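For anyone who wants to see the quiz answer on their own box, here's a back-of-the-envelope sketch. It is illustrative only, not a calibrated benchmark (a clever compiler can turn the branchy loop branchless, for one thing): it times a roughly 50/50 unpredictable branch against scattered loads over a working set much larger than L2.

Code:
// Rough sketch: an unpredictable branch vs. cache-hostile loads. On typical
// hardware the scattered-load loop loses far more time, which is the point
// of the quiz above. Timings are illustrative only.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t n = 1u << 23;                // ~32 MB of ints, bigger than L2
    std::mt19937 rng(42);
    std::vector<int> vals(n), idx(n);
    for (std::size_t i = 0; i < n; ++i) {
        vals[i] = static_cast<int>(rng() & 0xff);
        idx[i]  = static_cast<int>(rng() % n);
    }

    auto time_ms = [](auto&& work) {
        const auto t0 = std::chrono::steady_clock::now();
        volatile long long sink = work();          // keep the result so the loop isn't removed
        (void)sink;
        const auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    // ~50/50 data-dependent branch: the predictor is wrong about half the time.
    const double branchy = time_ms([&] {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (vals[i] < 128) s += vals[i];
        return s;
    });

    // Scattered loads over a working set much larger than L2: mostly misses.
    const double missy = time_ms([&] {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            s += vals[static_cast<std::size_t>(idx[i])];
        return s;
    });

    std::printf("branchy loop: %.1f ms, cache-missy loop: %.1f ms\n", branchy, missy);
    return 0;
}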
 
nam-ng said:
A half-assed, broken processor design will have the time and free internal hardware resources left over to be two fractional virtual processors.

Are you calling a Pentium 4 a half-assed broken processor design? :p :D ;)
MD_Willington said:
AMD has a HyperThreading patent: US patent #5944816.
So does IBM.

Link:

http://patft.uspto.gov/netacgi/nph-...5944816&FIELD1=&co1=AND&TERM2=&FIELD2=&d=ptxt

Enjoy

MD

Just because someone has a patent does not mean they are ever going to use it.
 