Let's Benchmark Our SMP Systems

Poncho said:
Here's a new one:

Code:
                                 this      other              relative
                             computer   computer           performance
  
 Dhrystone 2.1                  23692      11682  kDhryst. 202 percent
 Whetstone                       5089       2522  MFLOPS   201 percent
 Eight queens problem           27179      17665  pps      153 percent
 Matrix operations            1101023     148309  k ops    742 percent
 Number crunch                1219195     370078  k ops    329 percent
 Floating point                 84950      39423  k ops    215 percent
 Memory throughput            1265348     603339  kB/sec   209 percent
                                                                                                           
                                            Total CPU      327 percent
                                            Total FPU      208 percent
                                              Average      293 percent
                                                                      
                              Application Performance      318 percent

This one is a similar platform, but on the server board (the previous benchmark was on the workstation board), with 2 Woodcrest procs (Conroe-based Xeon, 2.33 GHz @ 1333 FSB) and only 1 GB of FB-DDR2 @ 533.
oh my!
That thing's Dhrystone is incredible!
 
pxc said:
As you can see there are some very nice improvements with the 64-bit exe.
Hey, that was a good idea to port it to x64.

Are you comparing the x64 version to the Win32 version running under WOW64 on 64-bit Windows?
 
pxc said:
That might be part of the boost. It would be interesting to see the Win32 numbers for the code running native on the exact same box. There's debate about how much slower WOW64 is, and so on ... and not having a 64-bit rig (one that's available to me for happy playtime, anyway) I'm in the dark about it.
 
Someone else can test it. :p I only have x64 installed on that system.

There are more variables to it anyways. The 32-bit executable is built with the old MS C++ 6.0 (v12.00.xxxx) compiler from 1998. The 64-bit version uses the newer MS C++ 7.1 (v14.00.xxxxx) compiler from 2005. There would likely be a boost in the 32-bit version by recompiling it with the C++ 7.1 compiler.
 
pxc said:
Someone else can test it. :p I only have x64 installed on that system.
Oh, well; thanks anyway.

pxc said:
There are more variables to it anyways. The 32-bit executable is built with the old MS C++ 6.0 (v12.00.xxxx) compiler from 1998. The 64-bit version uses the newer MS C++ 7.1 (v14.00.xxxxx) compiler from 2005. There would likely be a boost in the 32-bit version by recompiling it with the C++ 7.1 compiler.

VC++ 7.1 was released in 2003 (most people call it VC++ 2003, in fact). 8.0 came out in 2005. That's a substantial difference, as the compiler and libraries teams did nontrivial work for those releases.
 
mikeblas said:
VC++ 7.1 was released in 2003 (most people call it VC++ 2003, in fact).
The Win2K3 PSDK (2005 release) comes with a 64-bit 7.1 compiler (or pre-release 8?... it came out 7 months before VS8 was released) and it was updated in 2005. :p I'll happily accept any free compiler I can get from MS.

Here's the difference between the 6.0 and 7.1 on regular 32-bit WinXP with a Dothan 2GHz:

original 32-bit .exe (VC++ 6.0)
Code:
                                this      other              relative
                            computer   computer           performance
 
Dhrystone 2.1                   4466        n/a  kDhryst.   0 percent
Whetstone                       1065        n/a  MFLOPS     0 percent
Eight queens problem            6092        n/a  pps        0 percent
Matrix operations             247588        n/a  k ops      0 percent
Number crunch                 199660        n/a  k ops      0 percent
Floating point                 16429        n/a  k ops      0 percent
Memory throughput             722128        n/a  kB/sec     0 percent

compiled with VC++ 7.1
Code:
                                this      other              relative
                            computer   computer           performance
 
Dhrystone 2.1                   6222        n/a  kDhryst.   0 percent
Whetstone                        992        n/a  MFLOPS     0 percent
Eight queens problem            6193        n/a  pps        0 percent
Matrix operations             275607        n/a  k ops      0 percent
Number crunch                 362095        n/a  k ops      0 percent
Floating point                 16584        n/a  k ops      0 percent
Memory throughput             721552        n/a  kB/sec     0 percent

(edit) compiled with VC++ 8.0
Code:
                                this      other              relative
                            computer   computer           performance
 
Dhrystone 2.1                   6869        n/a  kDhryst.   0 percent
Whetstone                       1258        n/a  MFLOPS     0 percent
Eight queens problem            6188        n/a  pps        0 percent
Matrix operations             247724        n/a  k ops      0 percent
Number crunch                 426497        n/a  k ops      0 percent
Floating point                 27052        n/a  k ops      0 percent
Memory throughput             723351        n/a  kB/sec     0 percent
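
Since the "other computer" column is n/a in these three runs, the relative performance column just reads 0. Here's a minimal sketch of how that column could be worked out by hand, assuming it is simply this/other as a whole-number percentage. `relative_percent` is a made-up name, not something from CLiBench, and the truncating division is an assumption too, although it agrees with the tables posted earlier (23692 vs. 11682 shows as 202 percent, not 203).

```c
/* Hypothetical helper, not taken from CLiBench.
   Returns this_score as a whole-number percentage of other_score.
   Integer division truncates, which is an assumption that happens to
   match the percentages posted earlier in the thread. */
long long relative_percent(long long this_score, long long other_score)
{
    if (other_score <= 0) {
        return 0;   /* no reference result: the tables print "0 percent" */
    }
    return this_score * 100 / other_score;
}
```

Feeding it the Dhrystone results above, `relative_percent(6869, 4466)` puts the VC++ 8.0 build at 153 percent of the VC++ 6.0 build.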
 
Here's mine, specs in sig:

                                this      other              relative
                            computer   computer           performance

Dhrystone 2.1                  14228      11682  kDhryst. 121 percent
Whetstone                       2835       2522  MFLOPS   112 percent
Eight queens problem           18204      17665  pps      103 percent
Matrix operations             294161     148309  k ops    198 percent
Number crunch                 412280     370078  k ops    111 percent
Floating point                 43369      39423  k ops    110 percent
Memory throughput            1574353     603339  kB/sec   260 percent

                                                Total CPU 158 percent
                                                Total FPU 111 percent
                                                  Average 145 percent

                                  Application Performance 155 percent
 
                                this      other              relative
                            computer   computer           performance

Dhrystone 2.1                  26861      11682  kDhryst. 229 percent
Whetstone                       5230       2522  MFLOPS   207 percent
Eight queens problem           34371      17665  pps      194 percent
Matrix operations             912243     148309  k ops    615 percent
Number crunch                 778145     370078  k ops    210 percent
Floating point                 82742      39423  k ops    209 percent
Memory throughput            2848766     603339  kB/sec   472 percent

                                                Total CPU 344 percent
                                                Total FPU 208 percent
                                                  Average 305 percent

                                  Application Performance 334 percent

Dual Opteron 275'[email protected], Supermicro H8DCE MB, 2gb DDR-400 ECC.

How do you get this in code view?
 
Has anyone else noticed that the "Matrix Operations" number from CLiBench doesn't scale to multiple processors? Is that just on hyperthreaded machines, or on multiprocessor machines too?

I think the matrix code in CLiBench is unrealistic because it's unoptimized; there's a little bit of loop folding and the compiler tries to enregister what it can, but the inner loop just does a single multiplication each pass. Any application where this code would exist would make use of at least some simple optimizations.
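
For anyone who hasn't looked at this kind of code, here's roughly the shape of loop I mean. This is a hypothetical stand-in written for illustration, not the actual CLiBench source; the name `matmul_naive`, the use of `double`, and the flat row-major layout are all assumptions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the benchmark kernel, NOT CLiBench's code.
   Computes c = a * b for n x n matrices stored row-major in flat arrays. */
void matmul_naive(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* One multiply-add per pass. b is walked down a column,
                   so consecutive k values are n doubles apart in memory. */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Every pass through the inner loop pays the loop overhead for a single multiplication, and the column walk through `b` touches a new cache line on almost every pass once n is large.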

Unrolling the loop gets roughly a 60% improvement, and the gain from a second thread goes from less than 1% to about 5%. I think that writing the loop to use SIMD/SSE instructions could make a much bigger gain, but it'll be a little while before I have the time to do that.
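
A sketch of the sort of unrolling I'm talking about, under the same assumptions as before (flat row-major `double` arrays). `matmul_unrolled`, the unroll factor of 4, and the four separate accumulators are illustrative choices, not the exact change I made to the benchmark.

```c
#include <stddef.h>

/* Hypothetical illustration of unrolling the k loop by 4.
   Computes c = a * b for n x n row-major matrices. */
void matmul_unrolled(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* Four independent partial sums, so the multiplies do not
               all wait on one running total. */
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                s0 += a[i * n + k]     * b[k * n + j];
                s1 += a[i * n + k + 1] * b[(k + 1) * n + j];
                s2 += a[i * n + k + 2] * b[(k + 2) * n + j];
                s3 += a[i * n + k + 3] * b[(k + 3) * n + j];
            }
            double sum = (s0 + s1) + (s2 + s3);
            /* Leftover iterations when n is not a multiple of 4. */
            for (; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Note that splitting the sum into four accumulators changes the order of the floating-point additions, so results can differ from the simple loop in the last few bits.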

Seems like the scaling problem is not so bad on dual processor machines; I think on HT machines, the problem ends up being that the cache is saturated... but on my dual processor machines, I still only get a modest improvement when adding another processor.

Representing tables is nearly hopeless, but here's what happens on my 3.4 GHz Pentium 4 with HT, comparing 1 thread to 2 threads as well as the original code to the code with my modifications:

Code:
              1 thread   2 threads   Scale
Original        209223      210948    1.01
Modified        322042      338513    1.05
Improvement       1.54        1.60

And the same table for my dual proc Opteron 248 rig:

Code:
              1 thread   2 threads   Scale
Original        179604      254933    1.42
Modified        362912      413546    1.14
Improvement       2.02        1.62
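
In case it isn't obvious how the derived numbers in these tables are obtained: Scale is the 2-thread score divided by the 1-thread score, and Improvement is the modified score divided by the original score. A tiny sketch; the function names are made up for this post.

```c
/* Hypothetical helpers for the two derived figures in the tables. */

/* Scale: how much faster the run with two threads is than with one. */
double scaling(double one_thread, double two_threads)
{
    return two_threads / one_thread;
}

/* Improvement: how much faster the modified code is than the original. */
double improvement(double original, double modified)
{
    return modified / original;
}
```

For the Opteron rig, `scaling(179604, 254933)` comes out at about 1.42 and `improvement(179604, 362912)` at about 2.02, matching the table.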

It's a bit surprising that it only took me 15 minutes to double the performance and improve the scaling of software that's supposed to be running as a performance benchmark!

LATER:

After playing a bit more, I can get a 6x improvement. The second array can be laid out so that it is read in row order instead of column order, and that makes much better use of the cache. Even without doing that, unrolling the inner loop to do a few j iterations in addition to a few k iterations helps a lot, since the j's that are prefetched column-wise can hang out in the cache a little longer.
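
One way to get the row-order access, sketched under the same assumptions as the earlier snippets (flat row-major `double` arrays). Copying the second matrix into a transposed scratch buffer is my illustrative choice here, and `matmul_transposed` is a made-up name; the benchmark itself could just as well build the array in that layout from the start.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration: multiply using a transposed copy of b so
   that both operands are read sequentially in the inner loop.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int matmul_transposed(const double *a, const double *b, double *c, size_t n)
{
    if (n == 0) {
        return 0;   /* nothing to do */
    }
    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL) {
        return -1;
    }
    /* bt[j][k] = b[k][j]: a column of b becomes a row of bt. */
    for (size_t k = 0; k < n; k++) {
        for (size_t j = 0; j < n; j++) {
            bt[j * n + k] = b[k * n + j];
        }
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* Both a and bt advance by one double per pass. */
                sum += a[i * n + k] * bt[j * n + k];
            }
            c[i * n + j] = sum;
        }
    }
    free(bt);
    return 0;
}
```

The transpose costs n*n copies up front, but the multiply does n*n*n multiply-adds, so for anything but tiny matrices the sequential reads should more than pay for it.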

Note that 6x improvement still isn't using any MMX or SSE instructions.

But I guess this is where I start having philosophical questions: should a benchmark run craptacular code (which is the status quo for this program), or carefully optimized code (which is what I'm working towards), or something in between? Should it use none of the processor's special features (that's the CliBench.EXE status quo), or should it use the advanced features of the processor (SSE/MMX) if they're available?
 