I7-3770k higher IBT Performance with HyperThreading Off

SonDa5

Got a new 3770k. It's my first one.
Batch 3226C840 I7-3770k

Did some testing before I delid this.

3770k vCore 1.15v
4.5GHz
Water cooled with UT-60 420mm RAD and DT Sniper water block.
TIM IC Diamond
Not Delidded
Ambient Temps 21C
http://valid.canardpc.com/2595435


HT on

3770kbasetest4p5HTonIBT.jpg


Package Min 26C Max 66C


HT off

HToff3770k4p5baseIBTtest.jpg


Package Min 23C Max 66C



Temps are almost the same, but the Gflop performance during the IBT stress test is much better with only the 4 cores running with HT off.

Just wanted to share my findings.
 
I have seen that as well on the i7 970 6 core / 12 threaded CPU. I believe you get a similar effect if you cut the threads down.
 
Wow, nearly 20%, that's significant. I'm going to try that, it's going to have to wait a few days though, my chip is doing tasks. :eek:
 

This.

It's just how IBT runs; set it to 4 threads instead of all. This has always been the case with HT-enabled chips. Your speed is not actually improving, you are just not forcing it to try and run 8 full threads on a quad core with HT.
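
A rough sketch of that rule of thumb: use one stress thread per physical core rather than one per logical (HT) core. The function name and both parameters are made up for illustration and are not anything IBT exposes; 2 threads per core is an assumption for an HT-enabled chip.

```python
def stress_threads(logical_cpus, threads_per_core=2):
    """Return one thread per physical core.

    threads_per_core=2 is an assumption for a Hyper-Threading CPU;
    pass 1 when HT is disabled or not present.
    """
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    return max(1, logical_cpus // threads_per_core)


# A 3770k with HT on reports 8 logical CPUs.
print(stress_threads(8))                      # 4
# The same chip with HT off reports 4 logical CPUs.
print(stress_threads(4, threads_per_core=1))  # 4
```

Either way the stress test ends up with 4 threads on a quad core, which is the setting suggested above.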
 
So running 4 threads with AVX enabled is more efficient than running 8 threads with AVX enabled?
 
It worked. I turned HT back on though, my Audio Apps get a 20-25%+/- boost with HT. :cool:
 
So HT is good for apps that use it. IBT doesn't really benefit from it.

It's just a stress test, why would it need to benefit from it? I've heard the explanation before that IBT treats each thread as if it were a physical core, so the HT chip basically has to double its work per core.

Running my 3570k and 3770k at the same clock speed, they are equal to each other with 4 threads selected, but at 8 threads the 3770k is slower by 10-15 gflops. Gets nice and hot though, good way to check for maximum temps. P95 doesn't even come within 10°C of IBT in my IB machines.
 
Yeah, if you are a gamer, HT does nothing for you, period.

I don't think I've ever used HT at all, even with my old i7 920/950. It just produces more heat and needs more volts for overclocking.

For me to hit 5GHz on my 2700k with HT on I need about .22 more volts... just not worth it.
 
If you're going for bench scores under stress, disable HT. Otherwise, unless you need it off for a game, keep it enabled. Much smoother overall.
 
This is pretty short-sighted. A ton of other apps, as well as background tasks etc., can take advantage of HT.

I'd rather have HT running when iTunes decides to update than not.
 
Explain to me how that graph shows you get better framerate?

Remember, the 2600k has more L3 cache and is clocked 100MHz higher than a 2500k. That graph shows that difference. It has nothing to do with HT being on or off.

Note that the i7 975, which is clocked 66MHz slower and also has ~15% lower IPC, has 22 more average FPS.

Also take a look at the 6-core AMD having 12 more average FPS than the similar quad core.

I just posted the image as an example. I don't need it myself, as I have an i7 and can clearly see the difference when I enable and disable hyper threading while playing BF3. I just wanted to post something more substantial than my own words.

BF3 can use all 8 threads effectively when there are all those calculations to work on in a full 64-player battle.
 
So do you know if this benchmark is MP or SP? Also, why are they running on Medium settings?

There is no way to say if HT is helping them, especially when they don't even state it.....

Need proof, not some German graph with no details.
 
http://www.tomshardware.com/reviews/battlefield-3-graphics-performance,3063-13.html

OK, so HT adds a whole 1fps in BF3. Man, that really makes the game smoother?

Here is a forum user with tests showing HT off giving him better performance in BF3.

http://www.overclock.net/t/1151970/my-own-bf3-benchmark-hyperthreading-on-vs-off

Another review showing HT doing absolutely nothing for BF3:

http://www.bit-tech.net/hardware/2011/11/10/battlefield-3-technical-analysis/7
 
Which is why the first thing I said was "if you are a gamer", meaning games don't benefit from HT.

I never said anything about apps or iTunes LOL
 
You don't seem to understand benchmarks. The game is run at medium to eliminate the GPU from the equation thus showing the effect of the CPU.

Looking at that bit-tech benchmark graph makes it clear to me that you don't really understand the CPU's job and the GPU's job in running a video game. When you increase the resolution and graphics settings you are adding a pretty insignificant amount of work to the CPU. You are really only giving the GPU more work. In fact, increasing resolution alone has no impact on the work the CPU needs to do.

In that bit-tech test they are only getting FPS in the ~30s because they are running the game at ultra on a 6950... The dual core CPU could be calculating and sending 35fps to the GPU, while the 4 core could be calculating and sending 50fps and the 4 core with hyper threading could be sending 65fps. But of course you don't see a difference, because the GPU can only process ~30 of those frames.

If instead they had a faster GPU like a 680 or 7970, or especially a dual card setup, then you would see the significant differences between 2 threads and 8 threads, because the GPU would be able to process more than, say, the 65fps it might be receiving from the fastest CPU.

You can also bring this difference to light by lowering the amount of work the GPU needs to do, by lowering the resolution and graphics details.
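
To put the bottleneck argument in numbers, here is a minimal sketch assuming the displayed frame rate is simply the slower of what the CPU can prepare and what the GPU can render. The CPU figures are the hypothetical ones from above; the two GPU caps (32 and 100) are assumed values picked for illustration, not benchmark results.

```python
def displayed_fps(cpu_fps, gpu_fps):
    """The frame rate you see is capped by whichever side is slower."""
    return min(cpu_fps, gpu_fps)


# Hypothetical CPU-side frame rates from the example above.
cpu_rates = {"2 cores": 35, "4 cores": 50, "4 cores + HT": 65}

# 32 stands in for a GPU-bound setup in the "30s"; 100 is an assumed faster GPU.
for gpu_cap in (32, 100):
    results = {name: displayed_fps(fps, gpu_cap) for name, fps in cpu_rates.items()}
    print(gpu_cap, results)
```

With the slow GPU every CPU shows the same number; with the fast one the CPU differences come back.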

Also, only multiplayer shows any difference, as it is significantly more CPU intensive. There is a ton of work in looping through the 64 player struct and computing what all the players are doing at any given moment. That processing is nonexistent in single player.

In multiplayer, where it matters, the difference is large. On my desktop with my i7, at maximum settings my FPS dips down into the lower 50s during fights where most of the 64 players are involved. When I enable HT, though, I never drop below the mid 60s and I see the majority of the 8 threads being utilized.

I'm not sure how you want me to prove this to you, but I can assure you it is a real and significant difference.

Without HT I'm getting subpar performance, and with it on it's much smoother. HT is the only variable I'm changing.


The point is, games do benefit from hyper threading if one thing is true: the game engine can actually make use of more than 4 threads. There are very few games that can right now. But my point was that they do exist, and BF3 in a full 64 player server is one such example. As a gamer I spend 90% of my time playing on 64 player BF3 servers, so hyper threading is useful to me as a gamer.

I also stream my games, and hyper threading is pretty much necessary there, as my computer needs enough CPU power to encode a 1080p h.264 video at 30fps while also running a game. But that whole situation is beside my point.
 
http://forums.anandtech.com/showthread.php?t=2274887

conclusion: "The benefit of HT is more pronounced on slower CPUs than on faster CPUs and on dual-cores than on quad-cores"
 
HT apps = on, HT games = off / on debatable. You have the option, just like a light switch, to turn it on or off. Experiment, stop bickering. LOL.
 
That is the way that I look at it.


A theory that I just came up with is that overclocking affects the stability of HT.

My 3770k is stable at 5GHz with HT on or off, but with HT off I can lower vcore by about .08v, so it runs cooler and feels faster to me.


IBT run at 5ghz with HT off.

HToff5GHZdelidi73770kIBT.jpg


http://valid.canardpc.com/2600539
 
HT does affect OC; you will usually get a higher clock / lower temp / possibly lower voltage without it.
 
This is because you are allowing HT's 2MB of L3 cache to be used for the physical cores when HT is off.

I would never use HT in any game if your CPU is already overclocked.

Remember, 3570 = 6MB cache, 3770 = 8MB cache (due to HT).
 
The real reason why you get a decrease in performance is CACHE THRASHING.

More threads = more data overwritten in cache = more memory requests. AVX is already pushing the processor to its limits as far as fetch bandwidth goes, so if you reduce the hit rate on cache it WILL affect your final results. And nowhere will you see the results of cache thrashing more than with large vector operations.

Given that this is a SIMPLE TEST CASE (no branches, nothing flashy) where all you're running are simple AVX operations, it's far easier for the on-chip scheduler to handle these requests (instead of falling back to HT to let a separate thread fill "unused" pipes). When the pipes are already full, HT does little good.

And yes, it's very possible. Sandy Bridge was already too bandwidth constrained to take full advantage of the dual AVX units it sports, and although Ivy Bridge alleviates this quite a bit, it's still not swimming in bandwidth to feed that monster dual vector engine.
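
For a sense of scale, here is a back-of-the-envelope sketch of the theoretical peak those dual AVX units give you. It assumes one 256-bit add and one 256-bit multiply per cycle per core (4 doubles each, so 8 double-precision FLOPs per cycle), which is the usual figure quoted for Sandy/Ivy Bridge. The `measured` value is a made-up placeholder, not a result from this thread.

```python
def peak_dp_gflops(cores, ghz, flops_per_cycle=8):
    """Theoretical double-precision peak in GFLOPS.

    flops_per_cycle=8 assumes one 256-bit AVX add plus one 256-bit AVX
    multiply per cycle per core (4 doubles each).
    """
    return cores * ghz * flops_per_cycle


peak = peak_dp_gflops(cores=4, ghz=4.5)
print(peak)  # 144.0

measured = 100.0  # hypothetical IBT reading, for illustration only
print(f"{measured / peak:.0%} of peak")
```

Note the peak depends only on physical cores and clock. HT adds no execution units, so it cannot raise this ceiling; it can only help fill pipes that would otherwise sit idle.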

Don't get me wrong: hyper-threading can add value in many real-world places (for example, it improves FAH performance significantly, and it makes any heavily-loaded system smoother). But a simple stress test is not one of them :D
 
SB has L3 cache that runs at full speed, that lowers (substantially!) the penalty for HT induced cache thrashing.
It also makes HT more effective in general, due to how HT operates.

Makes sense.
 
You make good points, but see above. The SB/IB implementation with L3 has pretty much alleviated any cache thrashing. The days of Nehalem and dogged performance with HT are long past.
 
Sandy retains the exact same 64KB L1 and 256KB L2 cache as Nehalem (*slower* L2 actually, 12 cycles vs 10), and although it adds a trace-cache-like uop cache, in reality the big load on the bandwidth is DATA, so that won't help at all. L2 is MUCH faster than L3 (but much smaller), so that is where I'm concentrating my analysis. The L2 cache is also exclusive to a single CPU core (or two threads).

The larger the vector operation, the more data has to be loaded into registers from RAM (and ultimately saved). More threads accessing different sets of data means they could stomp on each other's cache entries, in which case it would be faster for a single thread to complete the set of operations. Despite all the cache on the processor, the optimal load will always be from the (tiny) 32KB L1 data cache, and every time you get a cache miss it will go out to L2 or L3 or memory, and overwrite the corresponding L1 cache line that maps to that memory location. Since you can have several hundred memory locations mapped to a single L1 cache line, you can understand the impact of two separate threads working on data from different portions of a data set.

Sandy Bridge L3 is much faster than Nehalem's, but it's not the limiting factor here.
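
A minimal sketch of how addresses map onto that L1 data cache, assuming 64-byte lines and 8-way associativity (the commonly cited layout for a 32KB L1D on these chips; treat both as assumptions). The names are made up for illustration.

```python
LINE_BYTES = 64          # assumed cache line size
L1D_BYTES = 32 * 1024    # 32KB L1 data cache
WAYS = 8                 # assumed associativity
SETS = L1D_BYTES // (LINE_BYTES * WAYS)  # 64 sets


def l1_set(addr):
    """Set index that a byte address maps to."""
    return (addr // LINE_BYTES) % SETS


# Addresses this many bytes apart always share a set.
stride = LINE_BYTES * SETS  # 4096

# Nine lines at a 4KB stride: one more than the 8 ways a set can hold,
# so touching all of them in a loop keeps evicting lines from that set.
addrs = [i * stride for i in range(WAYS + 1)]
print(SETS, stride, {l1_set(a) for a in addrs})
```

With HT on, two threads share those same 8 ways per set, so each thread can hit this limit with roughly half as many conflicting lines of its own.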
 
Again, just showing what I posted. I understand what you're saying, but you keep implying that there is cache thrashing on Sandy and Ivy, and that's mostly not the case. Cache thrashing was, by definition, what caused the slower performance with HT enabled in the old days, and clearly today we get good performance and even a boost in some apps.
 
I'm simply pointing out an edge case, because that was the question at the top of the thread. Just because the cache is not thrashed in regular "average" usage does not mean it's impossible.

In this heavily-loaded scenario, with just the right mix of independent threads accessing different data sets, it's certainly possible to have lots of conflict misses generated by the second thread. If you don't understand how this works I would be happy to explain it to you :D

If the program simply did not scale beyond 4 cores (a theory voiced in this thread), the performance would be nearly the same with and without HT enabled (perhaps varying by a few percent), because the Windows kernel knows the difference between a real core and two virtual cores. That is unlikely to account for a 20% performance improvement with HT disabled, so it is most likely a cache issue. Intel Burn Test is designed to stress memory as well as the processor core, so hitting a fetch bandwidth wall is not unheard of.
 
I understand what you're saying now that I've reread this thread with more time to relax and focus, but it's pretty much AVX and bandwidth issues.

The cache explanation wasn't necessary for me, but it's appreciated to help others in this thread. In the Nehalem days that theory would have had more merit; on Sandy/Ivy, however, it's more complex and more related to saturated bandwidth and AVX.

I reject your theory that cache misses are causing this. In my testing over the past hour, tit for tat at 4.5GHz on my Sandy with the latest IBT 2.54 (AVX enabled), HT on does decrease gflops by about 10-11 points compared to HT off. Right, you say, you've been explaining this all along. But I also found that changing the amount of memory to test has the biggest impact: each increase brings the gflops back up toward where they were with HT disabled. Testing the maximum amount of memory available under IBT puts you, give or take a point or two, right back where you would be with HT disabled.

It's pretty much a memory/AVX issue with this program. HT in this program is implemented in a way to be tested, not to increase performance. And if increasing the memory tested under IBT with Hyper-Threading enabled puts its performance back in line with where it should be, I can conclude it has nothing to do with CPU cache thrashing.
 
Good read.
 
Note how SonDa5 is testing only 1024 MB of memory per test. I started out with the following:


Sandy 2600K clocked at 4.5 GHz, HT ON / AVX enabled / Windows 7 SP1 / IBT 2.54

Sandy 2600K clocked at 4.5 GHz, HT OFF / AVX enabled / Windows 7 SP1 / IBT 2.54

The initial test was done much like SonDa5's, with 1024 MB of memory tested. Note that I have 12 gigs of DDR3 in this machine. The first results put HT on about 10-11 gflops behind HT off. Cache thrash, the guy above would say. However, I dug a little deeper and started fiddling with the amount of memory to test. With each increase in memory tested, the gflops rise. Eventually, testing all available memory or close to it brings the performance back up to even, give or take a point or two, with HT disabled. So if the amount of memory tested changes this, it is most certainly not a cache thrash issue and more an AVX implementation / memory throughput issue.
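
One possible reason the memory setting matters so much, sketched under assumptions: IBT is a front end for Linpack, which solves an N x N system of doubles, so the memory setting roughly fixes N. The standard Linpack operation count is 2/3·N³ + 2·N², while the matrix itself takes about 8·N² bytes, so the amount of math per byte of data grows with N. The function names are made up, and the real program also needs some extra workspace, so treat the N values as approximate.

```python
import math


def linpack_size(mem_mb):
    """Largest N such that an N x N matrix of 8-byte doubles fits in mem_mb."""
    return math.isqrt(mem_mb * 1024 * 1024 // 8)


def linpack_flops(n):
    """Standard Linpack operation count for solving an n x n system."""
    return (2 / 3) * n**3 + 2 * n**2


for mem_mb in (1024, 4096, 8192):
    n = linpack_size(mem_mb)
    flops_per_byte = linpack_flops(n) / (8 * n * n)
    print(mem_mb, n, round(flops_per_byte))
```

Going from 1024 MB to 8192 MB nearly triples the work done per byte of matrix data, which would leave the run less sensitive to memory traffic. That fits the observation above, but it is only a guess at the mechanism.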
 
Okay, I'll buy that. Given the limited sample size, it had only one likely cause in my head :D
 