Dual Xeons: HT on or off?

Martin Cracauer · Jul 12, 2005

mikeblas said:
Synchronization bugs already happen. Consumer software that abuses threads (Pinnacle Studio 7 and Studio 8 are great examples) are abundant.

There are also driver-level problems; fortunately, dual-proc systems have already tought some companies lessons about building and testing multi-threaded software.

Funny thing is, synchronization isn't a new thing. It's just that PC-class developers think they don't need to know about it.

If you start actually looking at it, I think you'll be surprised at how many applications you use that are already multi-threaded.

Well, so far many high-performance applications use event loops now, most notably games. They wil have to change to a threaded model and they have neither experience doing do nor will they just re-write 3D engines. There will be a good mess with more or less unrelated threads tagged onto a main thread running the event loop that the previous incarnation of the engine had.

That is a receipe for a lot of hangs on the user's part. I am generally amazed how reliably many games run right now, I guess that party will be on pause for 5 years or so.

Many heavy-I/O applications use dispatching file descriptor events and fork entire processes or use asynchronous I/O, both of them are quite safe against deadlocks. Here, too, will we see unhappy campers.

rolo · Jul 12, 2005

mikeblas said:
This isn't exactly correct.

The problem with processors today is that they're substantially faster than the memory they access. A HT processor works by keeping two execution states ready to go in hardware. When one execution state stalls during a fetch to memory, the other execution state is expected to be runnable and steps in to start executing the state it has.

Partially correct.

The P4 has something like 7 execution units, each one specialized for certain instruction types (I think there are 3 ALU's, 2 floating point units, add/store, etc. whatever, numbers may be wrong

)

HyperThreading simply adds another dispatcher to the execution pipeline and execution units. One thread's execution doesn't have to wait for the other to stall -- they are both always feeding instructions to the execution pipelines.

It doesn't increase the amount of work that the execution pipelines can do, but it does give finer-grained concurrency between the two threads. Now instead of having two threads that switch back and forth every 10ms using 100% CPU time, you have two threads that are working simultaneously at 50% of the CPU's full performance potential.

Think of it as adding a second person to the start of a manufacturing assembly line. The two people take turns adding work items to the conveyor belt, and as such are each working 1/2 as fast as they could be alone. But if one of them needs to sneeze, or yawn, or scratch their nose, or go to the bathroom, the other worker can jump in and keep feeding the assembly line at 100%.

It really helps in areas that are latency-bound for performance. Like clicking on things. Why do you think everyone says that their P4 w/ HT systems feel "snappier" than their Athlons? It's not actually working faster, it's just working sooner.

mikeblas · Jul 12, 2005

Martin Cracauer said:
Many heavy-I/O applications use dispatching file descriptor events and fork entire processes or use asynchronous I/O, both of them are quite safe against deadlocks. Here, too, will we see unhappy campers.

Hardly!

Asynchronous I/O requires that you wait for an event to know that the I/O has completed. If you completely ignore error handling, then you're maybe you can call it safe from deadlocks, but any application that isn't a toy needs to know when it's I/O has successfully finished; either to report errors to the user or to get on with more work now that the deata has been read.

Spawning another process doesn't excuse you from synchronizing threads; you're just synchronizing the main thread of one process with the main thread of another process. If you don't have synchronization, how would the other thread know when something is available to be written, or when the buffer is safe to stuff more data into?

I can't understand why someone would want to spawn another process for I/O, anyway. You'd do IPC on your way to doing I/O? That doesn't make any sense to me.

mikeblas · Jul 12, 2005

rolo said:
One thread's execution doesn't have to wait for the other to stall -- they are both always feeding instructions to the execution pipelines.

Indeed, the two states might be executing concurrently if none of the resources they need for that execution overlap.

But when one of the execution units is stalled because it is reading from memory, the other unit gets complete use of the processor, at least until the memory read is satisfied. It's not fetching anything anymore, unless it is in cache because it knows the memory bus is busy.

If that code was running on a processor which didn't have HT, the processor would be completely idle while it waited for the cache miss to be satisfied. It might be idle for 300 or 400 cycles! It's bocked on retreiving that data from memory; hopefully, some previous instruction is wrapping-up execution so it isn't literally doing nothing. But since no instruction takes 300 cycles, the processor will be completely idle for a long part of that time.

By comparison, a HT-enabled processor hopefully (and probably) has other work to do in its other execution context during that time. It has hopefully queued instructions that don't require memory hits and can use otherwise idle resources on the chip.

See Chapter 9 of Programming with Hyper-Threading Technology; it pretty clearly explains how cache misses and pipeline stalls influence scheduling on the processor.

rolo said:
It really helps in areas that are latency-bound for performance. Like clicking on things.

You've totally lost me here. Why do you describe "clicking on things" as "latency-bound for performance"? The latency of what response or action limits the performance of a mouse event? The latency of the hardware interrupt that initiates the user-mode response to the click?

rolo · Jul 12, 2005

mikeblas said:
Indeed, the two states might be executing concurrently if none of the resources they need for that execution overlap.

But when one of the execution units is stalled because it is reading from memory, the other unit gets complete use of the processor, at least until the memory read is satisfied. It's not fetching anything anymore, unless it is in cache because it knows the memory bus is busy.

If that code was running on a processor which didn't have HT, the processor would be completely idle while it waited for the cache miss to be satisfied. It might be idle for 300 or 400 cycles! It's bocked on retreiving that data from memory; hopefully, some previous instruction is wrapping-up execution so it isn't literally doing nothing. But since no instruction takes 300 cycles, the processor will be completely idle for a long part of that time.

By comparison, a HT-enabled processor hopefully (and probably) has other work to do in its other execution context during that time. It has hopefully queued instructions that don't require memory hits and can use otherwise idle resources on the chip.

Yes, correct.

mikeblas said:
You've totally lost me here. Why do you describe "clicking on things" as "latency-bound for performance"? The latency of what response or action limits the performance of a mouse event? The latency of the hardware interrupt that initiates the user-mode response to the click?

Here's a scenario for you. Let's say some thread in the system is busy doing something -- anything -- and the user clicks on something in the UI. On a single core, non-HT system, that click is not processed at least until the active thread is pre-empted, which may be what ... 10ms? 20ms? I think I read that Windows' timeslice is 7ms, but I could easily be wrong. On a dual core, dual proc, or HT system, the UI thread may start executing immediately.

Single core, not-HT system: User clicks. Currently active thread is pre-empted 20ms later. Click is processed over the course of 10ms.

HT system: User clicks. Both threads are now executing, albeit at half-performance. Click is processed over the course of 20ms. <-- 33% sooner, even though it is half as "fast".

Anyway "latency-bound performance" is basically stating "performance is based on perception." The CPU isn't processing things faster, it's just processing them sooner.

mikeblas · Jul 12, 2005

rolo said:
I think I read that Windows' timeslice is 7ms, but I could easily be wrong. On a dual core, dual proc, or HT system, the UI thread may start executing immediately.

Windows doesn't have a fixed timeslice. It varies for foreground applications, which may get a boost so that they appear more responsive. Windows 2000, NT, Server 2003, and XP all have different quantum lengths and slightly different scheduling rules. And the user can even change them; by hacking the registry, or by touching the settings in the "Performance" tab under the system settings.

rolo said:
Here's a scenario for you. Let's say some thread in the system is busy doing something -- anything -- and the user clicks on something in the UI. On a single core, non-HT system, that click is not processed at least until the active thread is pre-empted, which may be what ... 10ms? 20ms?

In Windows, the application that's waiting for the click has a thread that's in a message loop. It's called PeekMessage() or GetMessage(), and inside that call, Windows is blocking on a message becoming available. Maybe some other application sends the message; maybe the system sends it (like in the case of WM_PAINT), or maybe a device driver sends it (in the case of WM_LBUTTONDOWN, your mouse click).

With the click message in the queue, the waiting thread has become runnable. If it has the same priority as the other thread that you're suggesting, then the system will mark it runnable. It may be scheduled much sooner than the active thread becoming pre-empted; if that thread does any blocking call into the system, then it's possible that the clicked thread will be run before the second thread is preempted. This is essentially a voluntary switch; the OS realizes the thread is doing something it needs to wait for (such as I/O, or handling a page fault) and schedules something else in the meantime.

If the window being clicked was in the background and the click brings the application into the foreground, it may receive a priority boost or a quantum increase. That will also have the window receiving the click sooner than the expiration of the quantum. There are other rules which may allow that to happen, too: if the running thread is in the foreground process but doesn't own any windows, for example, it will probably be preempted sooner.

Even with all this in mind, your scenario is quite contrived: I've never seen a system with only one runnable thread.

To convince yourself that what you're suggesting isn't the way things really work, read Chapter 6 of Windows Internals. Or, watch the "Context Switches" counter in perfmon. On my system, I'm doing 3200 context switches per second at this instant. If every thread was only being pre-empted after a 15 mS quantum, how could I ever manage more than 67 context switches per second?

rolo · Jul 12, 2005

mikeblas said:
...

Yes, all true of course. The gist of what I'm saying is that the UI thread with the click message may run sooner with less book keeping and thread scheduling "getting in its way" (with >1 logical processor). The situation is not contrived, just generalized and admittedly abstracted and simplified; I wasn't trying to say that the system only had 1 runnable thread, more that it had "a running thread."

The context switches counter is a valid point -- I completely forgot about that thing, as I haven't investigated performance in that area for some time now. Most of the performance work I've been doing has been related to cycle trimming in egregious areas, reducing memory usage, and getting rid of as much application startup disk I/O as possible (which works wonders). One could hypothesize that a system with >1 logical processor has less context switching to do, and that gives a performance boost (1-2% ? just a guess mind you).

However, I was referring to the quantum when two threads are actively vying for CPU time. If you have two threads actively doing computation, not blocking on any I/O or anything (Prime95 for instance), Windows will let each one run for X ms before pre-empting it assuming a higher priority thread does not become runnable. Most threads in the system are blocked or waiting most of the time, and as such are not runnable, and when they do any work they don't use the full quantum.

Martin Cracauer · Jul 12, 2005

mikeblas said:
Asynchronous I/O requires that you wait for an event to know that the I/O has completed. If you completely ignore error handling, then you're maybe you can call it safe from deadlocks, but any application that isn't a toy needs to know when it's I/O has successfully finished; either to report errors to the user or to get on with more work now that the deata has been read.

I don't know about Windoze but under Unix you either request that a signal is delivered everytime an outstanding asynchronous I/O system call completed, or you can poll for them later.

Of course there is some opportunity of deadlock if you errnonously wait for an outstanding request that you never actually scheduled. But that is very minor compared to the risks of true threading and locked data structures.

Spawning another process doesn't excuse you from synchronizing threads; you're just synchronizing the main thread of one process with the main thread of another process. If you don't have synchronization, how would the other thread know when something is available to be written, or when the buffer is safe to stuff more data into?

The point is that you don't have threads, you have processes, which do not share memory, hence they can do whatever the heck they want without clobbering each other's memory.

They communicate via some other form of IPC, usually pipes.

Doing a select or poll on such pipes is the form of synchronization here.

I can't understand why someone would want to spawn another process for I/O, anyway. You'd do IPC on your way to doing I/O? That doesn't make any sense to me.

Well, apache-1.x does that, for example. There is a pool for fork()ed processes that get delivered work to do from a master process which is the one actually listening on the TCP port(s).

I don't advocate any of the above over doing threads now that scalability to multiple processors is imperative. I just explain that the previous mechanisms were less likely to cause deadlock than full threading.

mikeblas · Jul 12, 2005

Martin Cracauer said:
Of course there is some opportunity of deadlock if you errnonously wait for an outstanding request that you never actually scheduled. But that is very minor compared to the risks of true threading and locked data structures.

Nor is it "safe against deadlocks". My point is that hyperthreading and multi-core processors don't make threading bugs any more likely; the bugs were always there.

rolo · Jul 12, 2005

Martin Cracauer said:
I don't know about Windoze but under Unix you either request that a signal is delivered everytime an outstanding asynchronous I/O system call completed, or you can poll for them later.

In "Windoze," you have at least the same ability. You either provide an event to be signaled upon I/O completion, or you can poll using HasOverlappedIoCompleted(), or do a wait using an interruptable SleepEx() call. There are also other ways to do async I/O, including I/O completion ports that seem to have the best performance (from what I've read).

Martin Cracauer said:
Of course there is some opportunity of deadlock if you errnonously wait for an outstanding request that you never actually scheduled. But that is very minor compared to the risks of true threading and locked data structures.

I'm not sure I follow you here. "An outstanding request you never scheduled?" Do you mean a garbage request ID of some sort? If so, I imagine the OS would return an error instead of deadlocking.

I3roknI3ottle · Jul 13, 2005

I seen an ad saying that SW: Revenge of the sith was made using A64 cpus, not sure if it was a marketting gimmick though :\

Martin Cracauer · Jul 13, 2005

rolo said:
In "Windoze," you have at least the same ability. You either provide an event to be signaled upon I/O completion, or you can poll using HasOverlappedIoCompleted(), or do a wait using an interruptable SleepEx() call. There are also other ways to do async I/O, including I/O completion ports that seem to have the best performance (from what I've read).

That is what I was assuming. Asynchronous I/O is usable and safe on all major platforms these days.

I'm not sure I follow you here. "An outstanding request you never scheduled?" Do you mean a garbage request ID of some sort? If so, I imagine the OS would return an error instead of deadlocking.

I meant that even if you don't do threads and instead rely on asynchronous I/O there is still opportunity to deadlock. One example of such a programming error would be to errnonously assume that there is an outstanding I/O request, while you never actually started it, or it has already been completed and the result absorbed but you forgot to account for it. Then you could wait for something that will never happen.

Again, my point is that the probability of an error like this when using async I/O is much lower than a deadlock or memory corruption when using threads.

While I don't advocate not using threads at this time it is also clear that we will see a decrease in program reliability as people move to threads and/or threaded programs move to SMP platforms.

feigned · Jul 13, 2005

Sir-Fragalot said:
Not true, virtual processors aren't counted as physical in Windows XP Home or Pro. Windows 2000 I am not sure about. Although I think 4 show up even then.

You're right about XP showing all four cores, but no more.

2k Pro will only show the first two physical processors.

See more here: http://www.microsoft.com/windows2000/docs/hyperthreading.doc

deathstar550 · Jul 22, 2005

HT on/off really depends on the programs that you're running.

For example, with 3D rendering:
http://www.gamepc.com/labs/view_content.asp?id=pee840&page=9

3D Studio Max runs faster with HT on, while Maya runs faster with HT off.

If you're using Photoshop, you'll go noticeably faster with HT on:
http://www.gamepc.com/labs/view_content.asp?id=pee840&page=10

No difference with media encoding:
http://www.gamepc.com/labs/view_content.asp?id=pee840&page=11

With gaming:
http://www.gamepc.com/labs/view_content.asp?id=pee840&page=7
The charts show a very slight performance decrease with HT on (south of 1 fps), but it's nothing to worry about. HT on/off here doesn't really matter either which way.

HT will help with multitasking, so I usually leave it on. For most practical purposes, HT on is the way to go. It gives sizeable performance increases with Photoshop and helps with multitasking in general, in return for a barely perceptible difference in gaming.

Unless you're going to be frequently using a program that shows a performance boost with HT off (if you're, say, a professional artist at Lucas using Maya all day), then it'd probably be best just to leave it on.

Dual Xeons: HT on or off?

Martin Cracauer

Guest

rolo

Gawd

mikeblas

[H]ard|DCer of the Month - May 2006

mikeblas

[H]ard|DCer of the Month - May 2006

rolo

Gawd

mikeblas

[H]ard|DCer of the Month - May 2006

rolo

Gawd

Martin Cracauer

Guest

mikeblas

[H]ard|DCer of the Month - May 2006

rolo

Gawd

I3roknI3ottle

Gawd

Martin Cracauer

Guest

feigned

[H]ard|Gawd

deathstar550

Limp Gawd