Stilt's patch for Bulldozer, improving superpi's score by 18%-30%

quickz

http://www.xtremesystems.org/forums...Bulldozer-Revelations-Episode-2-(SuperPI-x87)
http://forum.hwbot.org/showthread.php?t=78490

A few days ago, while doing some low-level testing for other purposes, I found something that didn't make any sense to me.
Now I roughly know what it is and what it does, but some questions remain: why does this "feature" exist in the first place, and why is it activated on all family 15h parts? I would normally assume it is a workaround for some erratum, but no bulletin exists for this one. The feature also does not appear in any documentation, or if it does, only AMD has access to the required level. I find it hard to believe that it is a design issue, as the affected instructions work fine (but slowly), and it has existed since early Zambezi revisions and is still present in Richland and probably beyond (within family 15h)...
Effect: A massive performance hit in applications that heavily use x87 instructions

After the fix has been applied, SuperPI shows an 18-30% improvement in performance. The bigger the calculation, the bigger the improvement. Since this kind of fix is quite unheard of, I knew that I would be crucified if I made such claims without providing any evidence.
SuperPI 1M: > 1 second improvement
SuperPI 8M: > 10 second improvement
SuperPI 32M: > 35 second improvement
I'm wondering if FAH would benefit from this magic patch. :)
 
Where's the actual description of what is being changed, or the source code?

Am I missing something in those threads?
 
You're not missing anything. Guy's being extremely enigmatic :)

Document he references doesn't seem to be useful (in context of operations he's
performing) either. AMD errata list may be a better source of information --
smells like he's altering some errata-related registers.

To test it (on Linux) one would need to understand what the code does.

It's currently difficult for several reasons:
- author didn't reveal (note the irony given XS topic) the nature of the changes he's performing
- author didn't release the source code
- author UPXd the binary (which raises the bar needed to reverse engineer the code)

Reverse engineering would require tracing (modding or going via debugger, I presume)
operations made on WinRing0 DLL level -- it's not super difficult but quite time consuming...

An alternative is writing simple code (using WinRing0 as well) dumping all CPU MSR
and PCI regs. Then comparing state of registers before and after running his utility.
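
A rough sketch of the comparison step, assuming the two dumps have already been saved as text with one `<register> <value>` hex pair per line. That format, the function names and the sample values are all made up for illustration; how the registers actually get read (WinRing0 or otherwise) is left out entirely.

```python
def parse_dump(text):
    """Parse lines of the form '<register> <value>' (both hex) into a dict."""
    regs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        reg, value = line.split()
        regs[int(reg, 16)] = int(value, 16)
    return regs


def diff_dumps(before, after):
    """Return (register, old, new, changed_bits) for registers whose value differs.

    Only registers present in both dumps are compared.
    """
    changes = []
    for reg in sorted(before.keys() & after.keys()):
        old, new = before[reg], after[reg]
        if old != new:
            changes.append((reg, old, new, old ^ new))
    return changes


# Made-up sample dumps; real ones would come from whatever tool reads the registers.
BEFORE = """
0x1000 0x0000000000000001
0x1001 0x00000000000000FF
0x1002 0x0000000012345678
"""
AFTER = """
0x1000 0x0000000000000001
0x1001 0x00000000000001FF
0x1002 0x0000000012345678
"""

changes = diff_dumps(parse_dump(BEFORE), parse_dump(AFTER))
for reg, old, new, flipped in changes:
    print(f"{reg:#x}: {old:#018x} -> {new:#018x} (changed bits {flipped:#x})")
```

The XOR column is the useful part: it narrows a changed register down to the individual bits that were flipped, which is what one would then look up in the errata or register documentation.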


Any takers?

EDIT: on second thought, one could first try to evaluate his work (FAH-wise) on Windows
-- just to see if it's worth it.
 


Reverse engineering would require tracing (modding or going via debugger, I presume)
operations made on WinRing0 DLL level -- it's not super difficult but quite time consuming...

Our definition of "difficult" seems to be light years apart...

:D
 
Good point on testing if it has any impact on FAH.

Based on the description that it primarily affects x87 code, I'd guess that it won't have a significant impact on FAH (if at all)

Wonder what he's hoping to gain by not revealing wtf is being changed...

If it does speed up SuperPi and is related to x87, there's likely some way of simplifying the fetch/decode to a large extent in order to speed up the execute/retire.

I wouldn't be nearly as interested if he hadn't tried so hard to hide it.

Looks like I'll be browsing the old hacking sites to find a memdump program again. (Assuming that is still the easiest way to dump packed code)

EDIT: Well this looks promising: http://securityxploded.com/unpackingupx.php
 
LOL, I just read his EULA.

Code:
3. You may not rent, lease, sell, modify, decompile, disassemble, or reverse
engineer this program or any subset of this program. Any such unauthorized
use shall result in immediate and automatic termination of this license and
may result in criminal and/or civil prosecution.

Nice attitude towards community collaboration.

My take -- he's after Andy Warhol's "15 minutes".
 
He is probably trying to protect himself by locking the program. I mean, let's face it: say even 50 percent of the sites that might mirror the program will compromise it with a virus nowadays. Too many people out there are trying to find any way possible to get your information so they can rob you. That is the reason I refuse to download any program except right from the creator's website. And even then you have to be careful, because a lot of websites have fake downloads that take you to shady sites.
 
Your argument is universally valid.
One can modify and distribute malicious source just like one can modify and distribute
malicious binary. UPXd or not. It doesn't really matter.

Security "concerns" aside, there's no reason to conceal the principle of the mod which
surely can be described verbally, in English.
 
From prehistoric memories: x87 instructions are co-processed, and the thread waits for the answer, but it can continue to step while waiting.
You must have the compiler switch on to use the x87 instructions (and the chip back then).

If the switch is not on when you compile, it does the instruction manually in a bunch of steps. It's not really emulation, it's just a bunch more instructions.

Sidebar - Got a $550 penalty on our power bill for going slightly over last year's summer use. Wifey made me plug until I can figure something else out. It's not that the folding cost $550 for a month, it cost whatever 20 amps comes to, plus another $550 penalty. $1800/m bill, up from $1100 last year. Used $150 more power, paid $700 for it.
 
From prehistoric memories: x87 instructions are co-processed, and the thread waits for the answer, but it can continue to step while waiting.
You must have the compiler switch on to use the x87 instructions (and the chip back then).

If the switch is not on when you compile, it does the instruction manually in a bunch of steps. It's not really emulation, it's just a bunch more instructions.

Are x87 instructions really out of date in modern programs?
According to gcc's man page, 387 code is still the default choice for an i386 compiler.
See below for details, quoted from the man page of gcc 3.4.6:

Code:
       -mfpmath=unit
           Generate floating point arithmetics for selected unit unit.  The choices for unit are:

           387 Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise.
               Code compiled with this option will run almost everywhere.  The temporary results are computed in
               80bit precision instead of precision specified by the type resulting in slightly different results
               compared to most of other chips. See -ffloat-store for more detailed description.

               This is the default choice for i386 compiler.

           sse Use scalar floating point instructions present in the SSE instruction set.  This instruction set is
               supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips.
               The earlier version of SSE instruction set supports only single precision arithmetics, thus the double
               and extended precision arithmetics is still done using 387.  Later version, present only in Pentium4
               and the future AMD x86-64 chips supports double precision arithmetics too.

               For i387 you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make
               this option effective.  For x86-64 compiler, these extensions are enabled by default.

               The resulting code should be considerably faster in the majority of cases and avoid the numerical
               instability problems of 387 code, but may break some existing code that expects temporaries to be
               80bit.

               This is the default choice for the x86-64 compiler.
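
To illustrate the man page's remark that 387 temporaries are kept in 80-bit precision, here is a small Python sketch that models the two cases with exact fractions: rounding after every addition to a 53-bit significand (double, as with SSE2) versus a 64-bit significand (x87 extended) followed by one final rounding to double on store. The helper names are hypothetical, and the model ignores exponent range, denormals and the x87 precision-control setting, so treat it as a toy rather than a description of what any particular compiler emits.

```python
from fractions import Fraction


def round_to_bits(x, bits):
    """Round a Fraction to `bits` significant binary digits, ties to even."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Rough binary exponent from the bit lengths of numerator and denominator.
    k = x.numerator.bit_length() - x.denominator.bit_length()
    shift = bits - 1 - k
    scaled = x * Fraction(2) ** shift
    if scaled < 2 ** (bits - 1):
        # The estimate was one too high; normalise into [2**(bits-1), 2**bits).
        scaled *= 2
        shift += 1
    # round() on a Fraction rounds half to even and returns an int.
    return sign * Fraction(round(scaled)) / Fraction(2) ** shift


def sum_with_precision(values, bits):
    """Add values left to right, rounding every partial sum to `bits` bits."""
    acc = Fraction(0)
    for v in values:
        acc = round_to_bits(acc + Fraction(v), bits)
    return acc


values = [1e16, 1.0, 1.0]

# Double temporaries: every partial sum is rounded to 53 bits.
double_sum = sum_with_precision(values, 53)

# Extended temporaries: 64-bit partial sums, rounded to 53 bits only at the end.
stored = round_to_bits(sum_with_precision(values, 64), 53)

print(float(double_sum), float(stored))
print("results differ:", double_sum != stored)
```

In this model each `+ 1.0` is lost when every partial sum is rounded to double, while the extended-precision path keeps both and ends up 2 higher. That is the kind of "slightly different results" the man page is warning about.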
 