
How many threads should my program support?

cyclone3d

How many threads does a program need to support before it makes you giddy?

Currently the program in question theoretically supports 4,294,967,295 threads.

I am contemplating making it support an unlimited number and thinking about it makes me wonder how long it will be before a computer will even be able to make use of 4,294,967,295 threads.
 
What's keeping the machine from making use of all those threads (ignoring the overhead)? An OS will schedule threads to available execution resources as it sees fit, even if it takes a long time with context switches and all.
 
Memory, mainly. Each thread requires some stack space. To support 10,000 threads, each having a stack of 128K, you'd need 1,280 megs of memory -- just for the stack. Even if you don't use that much stack, you'll still reserve it, so you'll run out of address space pretty quickly.

I get giddy when software is designed properly, not when it reaches for arbitrary and counter-productive quantitative claims.
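
The stack arithmetic above is easy to check. This is only a rough Python sketch: the function name `stack_reservation_megs` is made up for illustration, and it uses the same round units as the post (1 meg = 1,000K). Actual default stack sizes depend on the OS and the toolchain.

```python
def stack_reservation_megs(thread_count: int, stack_kb: int) -> float:
    """Address space reserved for thread stacks, in megs (1 meg = 1,000K here)."""
    return thread_count * stack_kb / 1000


# The example from the post: 10,000 threads with 128K of stack each.
print(stack_reservation_megs(10_000, 128))  # 1280.0

# The theoretical maximum mentioned at the top of the thread.
print(stack_reservation_megs(4_294_967_295, 128))
```

Under the same assumption of 128K per stack, the second call comes out to roughly 550 million megs, which is far beyond the address space of any 32-bit process.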
 
What's keeping the machine from making use of all those threads (ignoring the overhead)? An OS will schedule threads to available execution resources as it sees fit, even if it takes a long time with context switches and all.

Well, if you have a processor with 4 cpu cores (no hyperthreading), then it is kind of pointless to use more than 4 threads at once especially if each thread of the program will load up one core 100%.
 
Well, if you have a processor with 4 cpu cores (no hyperthreading), then it is kind of pointless to use more than 4 threads at once especially if each thread of the program will load up one core 100%.

I think it's actually going to depend on what the software does. It doesn't sound right to use only X threads because that's the number of CPUs (or cores) present. What happens when a few threads depend on the results of one that's still executing on a single core? Can some other work be done in that time with the available resources, rather than leaving 75% of the processor idle?
 
Memory, mainly. Each thread requires some stack space. To support 10,000 threads, each having a stack of 128K, you'd need 1,280 megs of memory -- just for the stack. Even if you don't use that much stack, you'll still reserve it, so you'll run out of address space pretty quickly.

I get giddy when software is designed properly, not when it reaches for arbitrary and counter-productive quantitative claims.

Well, it wouldn't run with a huge amount of threads by default. The thread count is input by the user at runtime.

And, when a thread completes its work, it frees up the memory it was using.

I just did a test between running 1,000 threads and 10,000 threads. The RAM usage difference was not that much.

Threads   Max RAM usage
10,000    526,248k
1,000     408,308k
 
I think it's actually going to depend on what the software does. It doesn't sound right to use only X threads because that's the number of CPUs (or cores) present. What happens when a few threads depend on the results of one that's still executing on a single core? Can some other work be done in that time with the available resources, rather than leaving 75% of the processor idle?

All threads work independently of the others.

The only thing the program has to wait for is when it is outputting the actual results. It has to wait for the last worker thread to end.
 
Well, it wouldn't run with a huge amount of threads by default. The thread count is input by the user at runtime.

And, when a thread completes its work, it frees up the memory it was using.

I just did a test between running 1,000 threads and 10,000 threads. The RAM usage difference was not that much.

Threads   Max RAM usage
10,000    526,248k
1,000     408,308k
Then you're either not measuring memory usage correctly (I explicitly mentioned differences between usage, commit, and address space), or you're not actually running the threads concurrently. A combination of both, probably.
 
Then you're either not measuring memory usage correctly (I explicitly mentioned differences between usage, commit, and address space), or you're not actually running the threads concurrently. A combination of both, probably.

The threads are running concurrently.. testing out the other RAM stuff.

I just upped the data set to 10x its original size to make sure all the threads would have a chance to run at the same time. I am using Task Manager to get these numbers.

Threads   Peak Working Set   Commit Size
10,000    4,067,364k         4,182,???k
1,000     3,934,180k         3,944,044k

Is there a program that can show the reserved address space a program is using?
 
All threads work independently of the others.

The only thing the program has to wait for is when it is outputting the actual results. It has to wait for the last worker thread to end.

In your program or in general?
 
In my program.
What do you mean by "use more than 4 threads at once?"
Well, if you have a processor with 4 cpu cores (no hyperthreading), then it is kind of pointless to use more than 4 threads at once especially if each thread of the program will load up one core 100%.
 
The threads are running concurrently.. testing out the other RAM stuff.
I remain skeptical, since the evidence you've given us doesn't support your assertion.

Is there a program that can show the reserved address space a program is using?
Yes. Windbg can, as can Process Explorer. Or process monitor, or whatever it's called these days.
 
It should support whatever is available. And if that is 1 or 12, or in the future, millions, it should be scalable. If you think about it, more threads does mean more memory, but multicore/multithreaded processors scale up the memory as well. 12-threaded CPUs are usually coupled with 12 GB of RAM (or 6). I'm sure when 512-threaded CPUs are released, we'll be in the 64-128 GB RAM era.
 
It should support whatever is available. And if that is 1 or 12, or in the future, millions, it should be scalable. If you think about it, more threads does mean more memory, but multicore/multithreaded processors scale up the memory as well. 12-threaded CPUs are usually coupled with 12 GB of RAM (or 6). I'm sure when 512-threaded CPUs are released, we'll be in the 64-128 GB RAM era.

Well, I guess we might survive making software for .001 % of systems out there...
 
It should support whatever is available. And if that is 1 or 12, or in the future, millions, it should be scalable. If you think about it, more threads does mean more memory, but multicore/multithreaded processors scale up the memory as well. 12-threaded CPUs are usually coupled with 12 GB of RAM (or 6). I'm sure when 512-threaded CPUs are released, we'll be in the 64-128 GB RAM era.

More threads means more memory because of stack reservation. Most of the memory an application uses is taken up by its data, since most people don't create threads indiscriminately. cyclone3d is, curiously, making himself an exception to accepted practices. My own book outlines some suggestions for counting threads; Jeffrey Richter's book has many examples and discusses architectures which let you efficiently use threads.

12-threaded CPUs don't have a "normal" amount of memory. You might build such a machine with three gigs, you might build such a machine with 192 gigs. We're already in the era of 64 to 128 gig machines; you can order such a machine pre-configured for less than $25,000 today. The configuration you choose depends on the application, not any popularity contest.

A book like Windows Internals will help you understand why too many threads are bad. A simple explanation involves pointing out context switching, which is what the OS does to move execution context from one runnable thread that's already used its whole quantum to another runnable thread. If you have too many runnable threads, you're going to end up with far more threads runnable than you have available cores, and you'll end up spending all your horsepower context switching instead of getting work done.
 
As an Amazon Associate, HardForum may earn from qualifying purchases.
I remain skeptical, since the evidence you've given us doesn't support your assertion.

Yes. Windbg can, as can Process Explorer. Or process monitor, or whatever it's called these days.

O.k., I downloaded Process Explorer and watched it while I was running the program with 10k threads. Is "virtual size" the column I should be looking at to see what address space is reserved? If so, it gets up over 12GB when it has about 8k threads running at once.

With it using only 1k threads, the virtual size maxes out at about 4.8GB.

It is going to be very difficult to get all 10k threads running at once since the threads that start first finish a lot quicker than the later ones because the calculations don't take as long.
 
Well, if you have a processor with 4 cpu cores (no hyperthreading), then it is kind of pointless to use more than 4 threads at once especially if each thread of the program will load up one core 100%.

If you're working on purely CPU-bound tasks & the sole purpose of threads is to parallelize computation, you might be right but there are other reasons to use multiple threads. The obvious case is that, if you have anything that might block on I/O (network, disk, user), you can easily have more threads than cores. If your problem is parallelizable N ways (for N > cores) but you want to consolidate their output so you can provide incremental output, you might want more threads than cores. If you're writing an application that, once you've decided to split into multiple threads, splits into some logical number of pieces, you don't really care how many cores are available.
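
The blocking case can be shown with a small sketch. This is an illustration only, not anything from the original program: the function name `run_blocked_threads` is made up, and a `threading.Barrier` stands in for a thread parked on network, disk, or user input. Every thread has to be alive and waiting at the same time before any of them can continue, no matter how many cores the machine has.

```python
import os
import threading


def run_blocked_threads(thread_count: int) -> list:
    """Start thread_count threads that all block at a barrier, then release together."""
    barrier = threading.Barrier(thread_count)
    arrivals = []
    lock = threading.Lock()

    def worker() -> None:
        # wait() blocks until every thread has arrived, much like a thread
        # that is parked waiting on I/O and uses no CPU while it waits.
        index = barrier.wait(timeout=10)
        with lock:
            arrivals.append(index)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(arrivals)


cores = os.cpu_count() or 1
count = cores * 4  # deliberately more threads than cores
print(cores, len(run_blocked_threads(count)))
```

The barrier only opens once all `count` threads are blocked on it at once, so the run completing at all shows that blocked threads can comfortably outnumber cores.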
 
More threads means more memory because of stack reservation. Most of the memory an application uses is taken up by its data, since most people don't create threads indiscriminately. cyclone3d is, curiously, making himself an exception to accepted practices. My own book outlines some suggestions for counting threads; Jeffrey Richter's book has many examples and discusses architectures which let you efficiently use threads.

12-threaded CPUs don't have a "normal" amount of memory. You might build such a machine with three gigs, you might build such a machine with 192 gigs. We're already in the era of 64 to 128 gig machines; you can order such a machine pre-configured for less than $25,000 today. The configuration you choose depends on the application, not any popularity contest.

A book like Windows Internals will help you understand why too many threads are bad. A simple explanation involves pointing out context switching, which is what the OS does to move execution context from one runnable thread that's already used its whole quantum to another runnable thread. If you have too many runnable threads, you're going to end up with far more threads runnable than you have available cores, and you'll end up spending all your horsepower context switching instead of getting work done.


I am not advocating that a user actually use a huge number of threads, especially since the efficiency goes down when doing so.
I am planning in a future version to have it auto-detect and suggest a number of threads to the user, but they will still be able to manually enter how many threads they want to use if that is what they want to do.
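
A rough idea of what that auto-detect-with-override prompt could look like, as a Python sketch. The names `suggested_thread_count` and `choose_thread_count` are hypothetical, and suggesting one thread per logical CPU is just an assumption that fits purely CPU-bound work.

```python
import os


def suggested_thread_count() -> int:
    """Suggest one thread per logical CPU (assumes purely CPU-bound work)."""
    return os.cpu_count() or 1  # cpu_count() can return None


def choose_thread_count(user_input: str = "") -> int:
    """Use the user's number if they typed one; otherwise fall back to the suggestion."""
    text = user_input.strip()
    if not text:
        return suggested_thread_count()
    count = int(text)  # raises ValueError on non-numeric input
    if count < 1:
        raise ValueError("thread count must be at least 1")
    return count


print(choose_thread_count(""))    # the auto-detected suggestion
print(choose_thread_count("16"))  # the manual override
```

An empty answer at the prompt takes the suggestion, and anything the user types wins over it, so the manual option stays available.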
 
Who's the intended audience? Do you expect them to understand the ramifications of their choice(s)?
 
If you're working on purely CPU-bound tasks & the sole purpose of threads is to parallelize computation, you might be right but there are other reasons to use multiple threads. The obvious case is that, if you have anything that might block on I/O (network, disk, user), you can easily have more threads than cores. If your problem is parallelizable N ways (for N > cores) but you want to consolidate their output so you can provide incremental output, you might want more threads than cores. If you're writing an application that, once you've decided to split into multiple threads, splits into some logical number of pieces, you don't really care how many cores are available.

True.. and I understand that.
 
Who's the intended audience? Do you expect them to understand the ramifications of their choice(s)?

Hrmmm... The intended audience is anybody geeky or nerdy enough to even want to use my program. And yes, I would expect some of them to understand the ramifications of their choices but I have also added a bit of documentation as well as information during the user input prompts for those who might need a little bit of help.
 
O.k., I downloaded Process Explorer and watched it while I was running the program with 10k threads. Is "virtual size" the column I should be looking at to see what address space is reserved?
Yes. So, it does turn out that you were looking at the wrong measurement.

If so, it gets up over 12GB when it has about 8k threads running at once.
With it using only 1k threads, the virtual size maxes out at about 4.8GB.

It is going to be very difficult to get all 10k threads running at once since the threads that start first finish a lot quicker than the later ones because the calculations don't take as long.

You've told us nothing about your app, so we're all in the dark. Assuming that you use the same amount of memory per thread no matter how many threads you're using, your numbers are very telling. With 8K threads, you're using 1.5 megs per thread. With 1K threads, you're using 4.8 megs per thread. Why is that?

One way to explain it would be that you've got more threads concurrently running when you have more threads. And that only makes sense, as having too many threads slows everything down, including the thread that's doing the work of creating the other threads.
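
For anyone checking the per-thread figures, this is the division behind them, as a small Python sketch. The function name `megs_per_thread` is made up, and the inputs use the round numbers from the posts (12GB taken as 12,000 megs at about 8,000 threads, 4.8GB as 4,800 megs at 1,000 threads).

```python
def megs_per_thread(virtual_size_megs: int, thread_count: int) -> float:
    """Average virtual size per thread, assuming memory is spread evenly."""
    return virtual_size_megs / thread_count


print(megs_per_thread(12_000, 8_000))  # 1.5
print(megs_per_thread(4_800, 1_000))   # 4.8
```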
 
Yes. So, it does turn out that you were looking at the wrong measurement.



You've told us nothing about your app, so we're all in the dark. Assuming that you use the same amount of memory per thread no matter how many threads you're using, your numbers are very telling. With 8K threads, you're using 1.5 megs per thread. With 1K threads, you're using 4.8 megs per thread. Why is that?

One way to explain it would be that you've got more threads concurrently running when you have more threads. And that only makes sense, as having too many threads slows everything down, including the thread that's doing the work of creating the other threads.

O.k.. It is a prime sieve program. All the threads run concurrently. Each thread also has one or more work units.

When I was using 10k threads, I was using 10k work units, and when I was using 1k threads I was using 1k work units. This is why I am using more RAM per thread when only using 1k threads.
 
What is a "10K work unit"?

10 thousand work units.

One big binary array broken down into 10 thousand pieces, or rather, each thread only works on specific parts of the array. No thread ever overlaps the work of any other thread.
 
I can imagine that "K" means thousand without much help. I can't guess what "work unit" means to you, or how I can relate "work unit" to thread-local memory consumption or thread-local CPU demand.
 
I can imagine that "K" means thousand without much help. I can't guess what "work unit" means to you, or how I can relate "work unit" to thread-local memory consumption or thread-local CPU demand.

I figured I wasn't very clear on stating that it was 10 thousand work units and not work units 10k in size.

I am working with a multi-dimensional array of 32-bit ints. It is being used as a binary array to mark non-primes.

So each "work unit" is an array position.

Given that I am working on the same size array, or same amount of numbers, the more "work units", the smaller each "work unit" will be.
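
To make the "work unit" idea concrete, here is a hedged Python sketch of a sieve split into disjoint ranges. It is not the poster's program: every name in it (`base_primes`, `sieve_unit`, `count_primes`) is hypothetical, it uses a flat array with one byte per number instead of a bit-packed multi-dimensional one, and it lets the number of work units differ from the number of threads. On standard CPython the threads will not speed up this CPU-bound loop, so treat it as a picture of the partitioning only.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def base_primes(limit: int) -> list:
    """Plain sieve of Eratosthenes for the small primes up to limit."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for n in range(2, math.isqrt(limit) + 1):
        if flags[n]:
            flags[n * n :: n] = bytes(len(range(n * n, limit + 1, n)))
    return [n for n, is_prime in enumerate(flags) if is_prime]


def sieve_unit(composite: bytearray, start: int, stop: int, primes: list) -> None:
    """Mark composites in [start, stop); touches no index outside that range."""
    for p in primes:
        if p * p >= stop:
            break
        first = max(p * p, ((start + p - 1) // p) * p)
        for multiple in range(first, stop, p):
            composite[multiple] = 1


def count_primes(limit: int, work_units: int, thread_count: int) -> int:
    """Count primes below limit using disjoint work units on thread_count threads."""
    if limit <= 2:
        return 0
    composite = bytearray(limit)  # one byte per number, 0 = not marked
    primes = base_primes(math.isqrt(limit))
    size = -(-limit // work_units)  # ceiling division
    ranges = [(lo, min(lo + size, limit)) for lo in range(2, limit, size)]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        futures = [pool.submit(sieve_unit, composite, lo, hi, primes) for lo, hi in ranges]
        for future in futures:
            future.result()  # re-raises any exception from a work unit
    return sum(1 for n in range(2, limit) if not composite[n])


print(count_primes(100, 10, 4))  # 25
```

With the same array size, asking for more work units just makes each range smaller, which matches the description above. Keeping the thread count separate means 10 thousand work units do not require 10 thousand threads.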
 
Each array position in a binary array is a bit. How does that change depending on the number of threads you have? Why do you need a multidimensional array to implement a prime sieve?
 
Each array position in a binary array is a bit. How does that change depending on the number of threads you have? Why do you need a multidimensional array to implement a prime sieve?

Hmmm. Good point.

I set it up originally with a multidimensional array because it made sense at the time. If I remember correctly, it had to do with making sure I wouldn't be overlapping anything.

So just to clarify, I would be able to write to the same basic array, but in different positions with different threads at the same time without having to use locks?

For example if I had an array of 32-bit ints with 32 positions and I wanted to use 2 threads, I could write to positions 0-15 with the first thread and write to positions 16-31 with the second thread at the same time?

I thought I read somewhere that this wasn't possible without using locks.
 
It's possible. You just have to be perfectly sure you're writing to different locations from each thread.
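
A minimal Python sketch of the two-thread example from the question, with the helper name `fill` made up for illustration. Each thread writes only to its own half of a 32-position array and no lock is taken.

```python
import threading


def fill(values: list, start: int, stop: int, marker: int) -> None:
    """Write marker into positions [start, stop) only."""
    for i in range(start, stop):
        values[i] = marker


values = [0] * 32
first = threading.Thread(target=fill, args=(values, 0, 16, 1))
second = threading.Thread(target=fill, args=(values, 16, 32, 2))
first.start()
second.start()
first.join()
second.join()
print(values == [1] * 16 + [2] * 16)  # True
```

One caveat for a bit-packed array in C or C++: separate array elements are separate memory locations, but two threads setting different bits inside the same 32-bit int are doing a read-modify-write on one location, and that does need an atomic operation or a lock. Splitting the work so that each thread owns whole ints avoids the problem.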
 
12-threaded CPUs don't have a "normal" amount of memory. You might build such a machine with three gigs, you might build such a machine with 192 gigs. We're already in the era of 64 to 128 gig machines; you can order such a machine pre-configured for less than $25,000 today. The configuration you choose depends on the application, not any popularity contest.


Ok, I mis-worded that. I meant the "average" amount of memory. In the early 90's it was 2-4 MB, mid 90's it was 8-32 MB, and up from there. 486's were generally 4-16 MB of RAM... Not so much a popularity contest, more of an average amount for that era.

Sure, you can get a 192 GB RAM machine with quad Xeons right now, but it's not the average, or normal. Average is ~4 GB or so.

My point was that you should have scalability and plan for the future. If you make something that can use 500,000 threads but uses 128 GB of RAM (totally just pulled that out of thin air), you'd think that if a CPU existed that had that horsepower, it'd have the memory footprint to go with it.

You wouldn't run a 6-core i7 (12 thread) with 256 MB of RAM these days, would you?


Edit: I have very little depth of programming experience. This is just going from what I know and understand from the thread so far. You are obviously a much more proficient programmer than I am (no sarcasm or anything negative intended), I am just stating my opinion.
 
Sure, you can get a 192 GB RAM machine with quad Xeons right now, but it's not the average, or normal. Average is ~4 GB or so.
It's not abnormal for servers. 4 gigs isn't the average, according to our survey.

My point was that you should have scalability and plan for the future. If you make something that can use 500,000 threads but uses 128 GB of RAM (totally just pulled that out of thin air), you'd think that if a CPU existed that had that horsepower, it'd have the memory footprint to go with it.
A program doesn't use a given amount of memory per thread; it uses as much memory as it needs for the data it processes in its working set.
You wouldn't run a 6-core i7 (12 thread) with 256 MB of RAM these days, would you?
Probably not, because i7 machines use three-channel memory and 256 megs isn't a number likely to fit that architecture. But the other reason you wouldn't, the point you're struggling to make, is that the computing power generally outweighs the available memory. There are applications where this is a requirement, too. Look at embedded systems, which typically have very little memory but a lot of computing power relative to that amount of memory. The computer that controls the active handling system in my car, for example, is about as fast as a 68030, but has about 32 kilobytes of memory; about one four-hundredth of the memory shipped on personal computers which used that processor.
 
Well, so far changing it from a multi-dimensional array didn't help the speed at all, but it did bring the RAM usage down a bit.

I still have a few things to fix that I messed up when changing it, so it may speed up a bit in that part once I get it fixed, but I am not really expecting that much.
 
On huge work sets (1 - 32,000,000,000), it now has a "virtual size" of 3,921,892k instead of 4,045,540k and is about 2% faster.

So yeah.. a bit of improvement, but not much. But I will take what I can get.
 
If you are interested in improving runtime performance, you should try running your program with a profiler.
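
As one example of what running under a profiler looks like, here is a sketch using Python's built-in `cProfile` and `pstats` modules. The workload `mark_composites` is a made-up toy sieve, not the program from this thread; for a native Windows program you would reach for a native profiler instead.

```python
import cProfile
import io
import pstats


def mark_composites(limit: int) -> int:
    """Toy workload: count primes below limit with a simple sieve."""
    composite = bytearray(limit)
    for n in range(2, limit):
        if not composite[n]:
            for multiple in range(n * n, limit, n):
                composite[multiple] = 1
    return sum(1 for n in range(2, limit) if not composite[n])


profiler = cProfile.Profile()
profiler.enable()
prime_count = mark_composites(200_000)
profiler.disable()

# Collect the report in memory, sorted by cumulative time, top 5 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists where the time actually went, which is usually a better guide than guessing which change might buy another 2%.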
 
You can use a 64-bit int or something similar to get over the limit. Depending on your program, I think the most it should need to support in any case (unless it's meant to be run on some uber cloud computer with 40 CPU sockets) is 8 * 2 * 2 = 32 threads, assuming two 8-core CPUs with Hyperthreading (which is a bit hypothetical at the moment, I think). At the moment I don't see a reason for more.
 