(C++) - Any faster way to keep a float between zero and one?

Blazestorm

Supreme [H]ardness
Joined
Jan 17, 2007
Messages
6,940
I'm working on an assignment for a 3D graphics class... it's pretty much done, and I was just seeing if there was any way to improve this.

Part of it is a triangle rasterizer which takes three 3D points and draws the triangle that fits between all 3.

One of the things is color interpolation (or interpolation of anything, really). Right before the triangle rasterizer sets a pixel, it converts the color (which is stored as three floats: red, green, blue) into unsigned chars by multiplying by 255 and casting to an unsigned char. The problem is that if the value is greater than 255 it rolls over and wraps back to zero, so a solid red suddenly becomes black. This causes artifacts, usually on the edges of triangles. It really only happens when I interpolate between, say, solid red and solid green, because those values start at 1.0f; with in-between colors it never overflows the range of an unsigned char.

When I have these 6 lines (well, 12 if they were "properly" formatted) they take up 5-6% of my program's processing time according to a profiler. With 300,000 pixels, that's a lot of checks to do.

So I'm just wondering if there's a faster way to do it... some little trick I don't know about...

Code:
  if(color.r > 1.0f) color.r = 1.0f;
  if(color.g > 1.0f) color.g = 1.0f;
  if(color.b > 1.0f) color.b = 1.0f;
  if(color.r < 0.0f) color.r = 0.0f;
  if(color.g < 0.0f) color.g = 0.0f;
  if(color.b < 0.0f) color.b = 0.0f;

or this... both have roughly the same performance.

Code:
  color.r = (color.r > 1.0f) ? 1.0f : color.r;
  color.g = (color.g > 1.0f) ? 1.0f : color.g;
  color.b = (color.b > 1.0f) ? 1.0f : color.b;
  color.r = (color.r < 0.0f) ? 0.0f : color.r;
  color.g = (color.g < 0.0f) ? 0.0f : color.g;
  color.b = (color.b < 0.0f) ? 0.0f : color.b;
 
I just looked at it very briefly, and it may be compiler dependent, but I found the method below to be about 2.5 times faster than what you posted.
Code:
short a = color.r*255;
final_unsigned_char = a & 0x8000 ? 0 : (a & 0x7F00 ? 255 : a);
 
I just looked at it very briefly, and it may be compiler dependent, but I found the method below to be about 2.5 times faster than what you posted.

Awesome... pretty creative way to do it, haha. Bit-wise operations on shorts/chars are definitely faster than float ops.

Optimized - 4.11%
If-Checks - 17.20%
Conditional - 15.50%
No Checks - 10.28%

So that's why I was saying it was taking 5-6% of my program: I was counting no-checks as the baseline. But apparently casting from float to char is more expensive than casting float to short and then short to char.

Just curious, where did you find that method? I was searching Google, Stack Overflow, etc., and couldn't find anything, really.

Here's messing around with 12288 cubes all with color interpolation... video shows the artifacts I was talking about before it goes into spazz mode with random colors every frame.

Thanks... 4x faster... well, percentage-wise; the program still runs at about the same frame-rate, haha.

Code:
  //color.r = (color.r > 1.0f) ? 1.0f : color.r;
  //color.g = (color.g > 1.0f) ? 1.0f : color.g;
  //color.b = (color.b > 1.0f) ? 1.0f : color.b;
  //color.r = (color.r < 0.0f) ? 0.0f : color.r;
  //color.g = (color.g < 0.0f) ? 0.0f : color.g;
  //color.b = (color.b < 0.0f) ? 0.0f : color.b;

  //FrameBuffer::SetPixel(x, y, color.r*255.0f, color.g*255.0f, color.b*255.0f);

  short a = color.r * 255;
  unsigned char redChar = a & 0x8000 ? 0 : (a & 0x7F00 ? 255 : a);
  a = color.g * 255;
  unsigned char greChar = a & 0x8000 ? 0 : (a & 0x7F00 ? 255 : a);
  a = color.b * 255;
  unsigned char bluChar = a & 0x8000 ? 0 : (a & 0x7F00 ? 255 : a);

  FrameBuffer::SetPixel(x, y, redChar, greChar, bluChar);
 
Awesome... pretty creative way to do it, haha. Bit-wise operations on shorts/chars are definitely faster than float ops.

Optimized - 4.11%
If-Checks - 17.20%
Conditional - 15.50%
No Checks - 10.28%

So that's why I was saying it was taking 5-6% of my program: I was counting no-checks as the baseline. But apparently casting from float to char is more expensive than casting float to short and then short to char.

That's a bit odd. The cast from short to char should be cheap enough (a single AND while all the variables are in registers). Possibly you're confusing the OOO reorderer and saturating the FPU with the FMULs so close together, and it works better with some integer code between them (although three FMULs are needed either way), or possibly your compiler has some cute trick that applies when multiplying by the literal 255 but not by 255.0f (although mine hasn't), but I really have no idea.

Out of curiosity, how does your original code profile if you skip the *255.0f entirely, or replace it with *255 ?

Just curious, where did you find that method? I was searching Google, Stack Overflow, etc., and couldn't find anything, really.
I found it in the back of my head :)

http://lousyhero.com/code/cs250disco2.wmv

Thanks... 4x faster... well, percentage-wise; the program still runs at about the same frame-rate, haha.
You're welcome.
 
Why use floats at all? Wouldn't it be faster all-around to use integer math for your colors? That would have the benefit of avoiding the rounding errors it seems like you're dealing with.

Of course, I could be totally bonkers here. I've worked on too many systems without an FPU, so "avoid float like the plague" is pretty much burned into my head. :)
 
That's a bit odd. The cast from short to char should be cheap enough (a single AND while all the variables are in registers). Possibly you're confusing the OOO reorderer and saturating the FPU with the FMULs so close together, and it works better with some integer code between them (although three FMULs are needed either way), or possibly your compiler has some cute trick that applies when multiplying by the literal 255 but not by 255.0f (although mine hasn't), but I really have no idea.

Out of curiosity, how does your original code profile if you skip the *255.0f entirely, or replace it with *255 ?

Ahh, I was actually reading the profiler wrong... with yours, because the floats were being assigned to a short, it considered that code to be the float-to-integer call, but when I cast as I passed the value into the function, it counted it as "inside" that function. So in reality that 4% should be ~10%, with most of it going to _ftol2_pentium4 calls.

So yours is roughly the same performance as no checks at all, but has the added benefit of keeping it within the range. Casting from float to an integer is the main reason this is slow. But there's nothing that can be done about that anyways :eek:

Why use floats at all? Wouldn't it be faster all-around to use integer math for your colors? That would have the benefit of avoiding the rounding errors it seems like you're dealing with.

Of course, I could be totally bonkers here. I've worked on too many systems without an FPU, so "avoid float like the plague" is pretty much burned into my head. :)

Most of the interpolation stuff relies on floats, and the entire triangle rasterizer uses floats too. Well, not entirely... it uses integers for the looping and std::ceil to get the bounds from the floats. There might be better ways, but this is just how we were taught to do it...

The mid-point line algorithm is using all integers though.
 
When I have these 6 lines (well, 12 if they were "properly" formatted) they take up 5-6% of my program's processing time according to a profiler. With 300,000 pixels, that's a lot of checks to do.

So I'm just wondering if there's a faster way to do it... some little trick I don't know about...
This is what the SIMD instructions are for. Your compiler should support intrinsics that let you use the SSE instructions; I don't think you need anything past SSE to do this, so you needn't be concerned about whether your code will run on the processors out there--anything from the Pentium III onward would do.

If you haven't used the SSE instructions before, the idea is that they give you 128-bit registers that you can use as if each were a small packed array of one of a couple of different data types. In your case, you can pretend it's a packed array of four floats. Which makes sense, right? A float is 32 bits, and 4 times 32 bits is 128 bits.

Here's code that will get you started, and replaces the code snippet you've posted previously:

Code:
#include <emmintrin.h>

int main(int argc, char* argv[])
{
	float fColorR = 3.5f;
	float fColorG = -3.3f;
	float fColorB = 0.5f;

	// load the three floats into an m128 register
	// accepts four values but we only need three, so pass 0.0
	__m128 sseFloats = _mm_set_ps( 0.0f, fColorR, fColorG, fColorB );

	// mm_min_ps and mm_max_ps will do 4 floats at a time. We only use
	// three in this example, but provide the fourth value anyway
	// because we have to.

	// clamp to the minimum value
	// could use mm_setzero_ps here, since they're all zero
	sseFloats = _mm_max_ps( sseFloats, _mm_set_ps( 0.0f, 0.0f, 0.0f, 0.0f ) );

	// clamp to the maximum value
	sseFloats = _mm_min_ps( sseFloats, _mm_set_ps( 1.0f, 1.0f, 1.0f, 1.0f ) );

	// getting the values back out requires an output array;
	// we can't pass individual addresses since the operation happens in one instruction.
	// Make a temp array for that. Note the use of _mm_storeu_ps instead of _mm_store_ps.
	float fOutput[4];
	_mm_storeu_ps( fOutput, sseFloats );

	// move the values from the array into the scalar values we had before
	fColorR = fOutput[2];
	fColorG = fOutput[1];
	fColorB = fOutput[0];


	return 0;
}

I've commented the code for you. To help you see what's going on, I start with the three float variables you have and stuff in some test data.

The _mm_set_ps() intrinsic will take four individual float values, glue them together, and stick them into a 128-bit XMM register in the processor. That register looks like a variable, and it sort of is--but the compiler will try to keep the value in a register as aggressively as it possibly can. In this trivial example it is successful at keeping the value in a register the whole time.

The _mm_max_ps() intrinsic computes the max of the corresponding four values in each operand and returns a new 128-bit pack with the results. We store that back into the sseFloats variable -- but again, the variable isn't really a variable, it's an XMM register.

Same thing the other way with the _mm_min_ps() intrinsic; it just computes the min.

Note that these are intrinsics. They look like functions, but they cause the compiler to emit a single instruction (or only a couple). If you dump the code in the debugger or from the compiler, you'll see it is very small. Most notably, there are no branches even though the operations are conditional. This is where some of your speed comes from; there are no pipeline stalls or branch mispredictions. You just execute and the ALU does it all.

The rest of your speed comes from hitting memory less often. Your code will end up touching memory once for a single, aligned 128-bit read instead of pumping here and there for three or four 32-bit reads. Computing as much as you can with one read helps write cache-friendly code, and that helps avoid cache misses, and so on.

Alignment is important. If you write a 128-bit value from a register to memory, it must be aligned to a 16-byte boundary; otherwise you'll take an alignment fault and crash. You can tell the processor that you want to do an unaligned store, but it is not nearly as efficient as an aligned one.

Given that background, let's write the code to be more efficient and actually take a look at what the generated code does. This is what you'll want to aim for in your actual project code. Since you're writing graphics code, you're probably passing RGB values around all over the place; maybe even RGB plus an alpha or gamma channel. Four floats in a package that's easily manipulated, then, is really handy! Instead of thinking about "float fColorR, fColorG, fColorB", you'll want to think about how you can decompose your problem into using a single __m128 with all three (or four) values in one shot, using these SSE instructions.

Let's try it. I'll also use setzero and keep a couple locals.

Code:
__m128 BetterClamp( __m128 sseInput )
{
	static const float fOne = 1.0f;
	__m128 sseFloats = _mm_max_ps( sseInput, _mm_setzero_ps( ) );
	__m128 sseOnes = _mm_load1_ps( &fOne );
	sseFloats = _mm_min_ps( sseFloats, sseOnes );
	return sseFloats;
}

I also use _mm_load1_ps() to load four of the same 1.0f values into each of the slots in the MMX register. This is a really great savings, too, along with setzero. The most important note here, though, is that I rearrange the code just a little so that the compiler has the best shot at re-using the registers and not pumping memory for the values. If we neglect the function preamble, the generated code is very small:

Code:
	static const float fOne = 1.0f;
	__m128 sseFloats = _mm_max_ps( sseInput, _mm_setzero_ps( ) );
0032162E  xorps       xmm0,xmm0 
00321631  movaps      xmmword ptr [ebp-1A0h],xmm0 
00321638  movaps      xmm0,xmmword ptr [ebp-1A0h] 
0032163F  movaps      xmm1,xmmword ptr [ebp-20h] 
00321643  maxps       xmm1,xmm0 
00321646  movaps      xmmword ptr [ebp-180h],xmm1 
0032164D  movaps      xmm0,xmmword ptr [ebp-180h] 
00321654  movaps      xmmword ptr [ebp-40h],xmm0 
	__m128 sseOnes = _mm_load1_ps( &fOne );
00321658  movss       xmm0,dword ptr [fOne (3257C0h)] 
00321660  shufps      xmm0,xmm0,0 
00321664  movaps      xmmword ptr [ebp-160h],xmm0 
0032166B  movaps      xmm0,xmmword ptr [ebp-160h] 
00321672  movaps      xmmword ptr [ebp-60h],xmm0 
	sseFloats = _mm_min_ps( sseFloats, sseOnes );
00321676  movaps      xmm0,xmmword ptr [ebp-60h] 
0032167A  movaps      xmm1,xmmword ptr [ebp-40h] 
0032167E  minps       xmm1,xmm0 
00321681  movaps      xmmword ptr [ebp-140h],xmm1 
00321688  movaps      xmm0,xmmword ptr [ebp-140h] 
0032168F  movaps      xmmword ptr [ebp-40h],xmm0 
	return sseFloats;
00321693  movaps      xmm0,xmmword ptr [ebp-40h] 
}

I think you'll find that this is the fastest approach, and it's particularly valuable if you can re-wire your application to use the XMM registers throughout instead of moving three separate values around, doing conditionals on them, and so on.

Let me know if you've got any questions!
 
Casting from float to an integer is the main reason this is slow. But there's nothing that can be done about that anyways.

Actually, there is. I don't have your exact code, but the below should give you an idea. From this:
Code:
void v5(float* p, unsigned char* s, int isize)
{
    for(int i = 0; i < isize; i++)
    {
        short sa = p[i]*255;
        s[i] = sa & 0x8000 ? 0 : (sa & 0x7F00 ? 255 : sa);
    }
}

To this:
Code:
void v6(float* p, unsigned char* s, int isize)
{
    float f255 = 255.0f;
    int sh;
    __asm
    {
        fld        [f255]
        mov        ecx, [isize]
        mov        esi, dword ptr [p]
        mov        edi, dword ptr [s];
        
loop_label:
                
        fld        [esi]
        fmul    st(0), st(1)
        fistp    [sh]
        mov        eax, [sh]

        mov        ebx, eax
        and        eax, 0x80000000
        cmp        eax, 0
        jz        not_zero
        mov        eax, 0x0
        jmp        store
not_zero:
        mov        eax, ebx
        and        eax, 0x7FFFFF00
        cmp        eax, 0
        jz        not_gt_one
        mov        eax, 0xFF
        jmp        store
not_gt_one:
        mov        eax, ebx
store:
        movzx    eax, al
        mov        byte ptr [edi], al

        add        esi, 4
        inc        edi
        dec        ecx
        jnz        loop_label
    }
}

... I get about another 2x speed improvement.
 
It'll be faster, but it might be difficult to integrate into the application well enough that the speed differential is worthwhile.
 
Just a little tip, it's not necessary to do a "cmp eax,0" after the "and eax, X" - the Zero Flag will be set appropriately by the AND operation.
 
For modern x86 desktop applications, is it common practice to use SIMD instructions? Are alternative execution paths usually provided for processors that might not support them? Is there typically any concern about compatibility?
 
Haha, yeah, I figured there would be some assembly that would be faster; I just wasn't sure if I wanted to get into that. And I've heard about SSE but haven't had a chance to use it. There's a class later on dubbed "Low Level Programming" which introduces a lot of the SSE stuff and other instruction sets, but I won't be taking that until the summer.

The point of this 3D graphics class is more to understand the 3D pipeline that OpenGL / DirectX use and how it's implemented, not necessarily for speed. But it was running really really slow, and I thought on a modern system there's no way simple software rendering at a low resolution should be that slow. So that's when I tried the profiler to find ways to improve it.

Right now the slowest things are operator new and free, which I think is related to some temporary std::vectors I'm using when doing clipping and a few other operations. I might write a small class to replace std::vector to speed that up, since I don't really need all its features. The other slowest thing is my mtx44 operator* with vec3 (two simple structs I wrote), which is called quite often because I have to transform all the vertices. And I'm pretty sure when I sat in on that low-level class they mentioned that processors have instruction sets for matrix math now.

I'll be reading up on the SSE stuff now and see what I make of it. It seems somewhat similar to writing shaders in DirectX, which also use 128-bit registers with 4 x 32-bit floats. Now I remember reading about all that clamping stuff in shaders! That's what I was trying to do here, but I didn't remember what to call it.

Thanks for all the info, though. I'll test it out and see what happens. It would be a fun project / learning experience to optimize this software renderer as best I can.
 
This is what the SIMD instructions are for. Your compiler should support intrinsics that let you use the SSE instructions; I don't think you need anything past SSE to do this, so you needn't be concerned about whether your code will run on the processors out there--anything from the Pentium III onward would do.

Great reply. I was just going to say "look into MMX/SSE" but you nailed this one out of the park.
 
Just a little tip, it's not necessary to do a "cmp eax,0" after the "and eax, X" - the Zero Flag will be set appropriately by the AND operation.
Thanks, I'd forgotten about that. Assembler comes up so rarely that I never really learned it properly.
 
This is what the SIMD instructions are for. Your compiler should support intrinsics that let you use the SSE instructions; I don't think you need anything past SSE to do this, so you needn't be concerned about whether your code will run on the processors out there--anything from the Pentium III onward would do.
Seriously, answer of the decade. I wish I could upvote this. I dabbled with some SSE stuff for matrix multiplication in a parallel/distributed systems course I took, but surprisingly the GCC-optimized version compiled with -O3 -m64 was actually faster than my use of intrinsics, so I must've been doing something inefficient.
 