How many threads should my program support?

the maximum (theoretical) number he said is an 32Bit int.
you can use a 64Bit int, but right now i don't see a reason to him having 4bilion something Threads anyway.
 
You can't use a 64-bit int because the OS doesn't use a 64-bit thread ID.
 
..
For example if I had an array of 32-bit ints with 32 positions and I wanted to use 2 threads, I could write to positions 0-15 with the first thread and write to positions 16-31 with the second thread at the same time?

I thought I read somewhere that this wasn't possible without using locks.


Locks are only needed where there's a chance of resource colision - let's say that both threads want to flip a bit in same int. You could have thread1 read, modify write & then thread2 read, modify & write, and you don't have a problem. The danger comes in when you have T1 read, T2 read, T1 modify+write, T2 modify & write - T2 will overwrite changes made by T1.

If it's possible to apply a bitmask in an atomic fashion, you're set. If you can change your fundamental data type to something where you don't have to worry about conflicts (IE - if you're using ints as bools, it doesn't much matter - all writes are going to push them to 0).
 
You can't use a 64-bit int because the OS doesn't use a 64-bit thread ID.

So no way around that currently? Even in 64-bit Windows?

Not planning on doing it, but was just wondering.

What if I am using a 3rd party threading class?
 
Locks are only needed where there's a chance of resource colision - let's say that both threads want to flip a bit in same int. You could have thread1 read, modify write & then thread2 read, modify & write, and you don't have a problem. The danger comes in when you have T1 read, T2 read, T1 modify+write, T2 modify & write - T2 will overwrite changes made by T1.

If it's possible to apply a bitmask in an atomic fashion, you're set. If you can change your fundamental data type to something where you don't have to worry about conflicts (IE - if you're using ints as bools, it doesn't much matter - all writes are going to push them to 0).

O.k... yeah, no chance of resource collision.

Also, I start out with all set to 0 and the writes puch bits to 1s.
 
Setting a bit to 1 isn't atomic; it's read-modify-write. Only if each thread has its own memory will you avoid contention.
 
not sure what the point of this thread is. your maximum speed to be gained is by using the least amount of threads possible to occupy all cores and all hyperthreading. any more than that and you are doing a lot of context switching and other stuff with memory. this is not "work" but "overhead" which takes away from the available processing time
 
Setting a bit to 1 isn't atomic; it's read-modify-write. Only if each thread has its own memory will you avoid contention.

The only problem is, if I made each thread have it's own memory, it would up the RAM usage quite a bit.

The way I have the threads set up, they keep track of what areas of the array they are assigned to work on. No two threads ever work on the same part of the array.

Also, in looking up atomic bit functions, it looks like they automatically use locks which I definitely don't want.. or am I missing something here.
 
Look for environment variable NUMBER_OF_PROCESSORS.

That will only work in Windows. I am going for *nix compatibility as well when looking for the available cpu cores.

I'm not too concerned about that at this point though as there is a bit more that I want to add to the program before I worry about automatic detection of cpu cores.
 
not sure what the point of this thread is. your maximum speed to be gained is by using the least amount of threads possible to occupy all cores and all hyperthreading. any more than that and you are doing a lot of context switching and other stuff with memory. this is not "work" but "overhead" which takes away from the available processing time

I know this. It is more of a "it is there if needed at some point in the future" thing.
 
not sure what the point of this thread is.

It's to help cyclone3d learn how to program with threads, I think. That is, if it is okay with you!

Look for environment variable NUMBER_OF_PROCESSORS.
This environment variable is unreliable and incomplete.

You should call GetSystemInfo, then look at the dwActiveProcessorMask and dwNumberOfProcessors fields of the returned SYSTEM_INFO to see how many processors there are, and which ones are available. You can then call GetProcessAffinityMask to see what processors the user has configured for your run of the application.



The way I have the threads set up, they keep track of what areas of the array they are assigned to work on. No two threads ever work on the same part of the array.
If your code is such that more than one thread never touches the same part of the array at the same time, then you have no contention.

Also, in looking up atomic bit functions, it looks like they automatically use locks which I definitely don't want.. or am I missing something here.
To which specific functions are you referring?
 
It's to help cyclone3d learn how to program with threads, I think. That is, if it is okay with you!


This environment variable is unreliable and incomplete.

You should call GetSystemInfo, then look at the dwActiveProcessorMask and dwNumberOfProcessors fields of the returned SYSTEM_INFO to see how many processors there are, and which ones are available. You can then call GetProcessAffinityMask to see what processors the user has configured for your run of the application.



If your code is such that more than one thread never touches the same part of the array at the same time, then you have no contention.

To which specific functions are you referring?

The ones in atomic.h ... but after looking over them again, it looks like those are just made "atomic" by use of locks so they really aren't atomic in the first place.

Is there any way to set a single bit in a variable to 0 without doing a read-compare-write?

Right now I am figuring out what bit in an unisgned int needs to be written, and then using a bitwise OR to change that bit to a 1 to mark it as a non-prime. Whatever bits are still 0 are primes.
 
The ones in atomic.h ... but after looking over them again, it looks like those are just made "atomic" by use of locks so they really aren't atomic in the first place.
The locks make them atomic.

Is there any way to set a single bit in a variable to 0 without doing a read-compare-write?
No. This should be obvious; think of what you're asking. You want to read a word (or byte, or something) from memory. Then, you want to change one bit. Then, you want to write it back. If the value you're reading with to start is in the processor's cache, then it has to be sure that no other processor (or core) has it cached, as well. Maybe it turns out what's in cache is dirty, and that test isn't atomic. If the value you're reading to start with isn't in cache, the processor has to go to memory to get it. That costs more than 100 clock cycles, usually. Locking a cache line of memory for that long would significantly impact the performance and scalability of multi-core and multi-processor systems.
 
The locks make them atomic.

o.k.

No. This should be obvious; think of what you're asking. You want to read a word (or byte, or something) from memory. Then, you want to change one bit. Then, you want to write it back. If the value you're reading with to start is in the processor's cache, then it has to be sure that no other processor (or core) has it cached, as well. Maybe it turns out what's in cache is dirty, and that test isn't atomic. If the value you're reading to start with isn't in cache, the processor has to go to memory to get it. That costs more than 100 clock cycles, usually. Locking a cache line of memory for that long would significantly impact the performance and scalability of multi-core and multi-processor systems.

That's what I thought.. just wanted to make sure I wasn't missing something.

And it also makes sense why using more(smaller) work units speeds up the program significantly. At the optimal size (still need to add an auto detect calculation to my program), the sieve finishes over twice as fast then when using less(larger) work units.

I knew it had to do with it fitting in cache but really hadn't thought about the number of clock cycles it took when it wasn't in cache.

I can go from right at 1 minute down to 27 seconds or less when calculating primes in between 1 and ~32,000,000,000(due to stuff having to be evenly divisible it isn't exactly 32b) when using 8 threads.
 
Back
Top