IvyBridge Xeons: Folding performance

AndyE

I thought people might be interested in collecting folding performance data for the new Intel Ivy Bridge Xeons. Since there are quite a few CPU models, let's collect the numbers in this thread.

I can share some E5-2697v2 numbers with an 8104 work unit.

Sometimes there are quite large deviations between different runs of the same project. To make the numbers more comparable, I started the WU on a SandyBridge Xeon machine, let it run for a while, stopped the machine, and restarted the same OS image on the second machine equipped with IvyBridge Xeons. The numbers are the average of 3 time steps on each machine.

Configuration:
System 1: 2x E5-2687W, 3.1 GHz, 8core, 4% overclocked, 1600 MHz DDR3
System 2: 2x E5-2697v2, 2.7 GHz, 12core, stock frequency, 1600 MHz DDR3 ECC

System 1: average tpf of 9m27s
System 2: average tpf of 6m46s (40% better)

System 1: 395.000 ppd
System 2: 652.000 ppd (65% better)
(If I am able to OC the IB Xeons on the other Asus board, this number could go up to 692.000 ppd)
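
For anyone who wants to redo the percentages with their own numbers, here is a minimal Python sketch. The helper names `tpf_seconds` and `percent_better` are made up for this example and are not part of any FAH tool; the inputs are the averages quoted above.

```python
import re


def tpf_seconds(tpf: str) -> int:
    """Convert a time-per-frame string such as '9m27s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", tpf)
    if match is None:
        raise ValueError(f"unrecognised TPF format: {tpf!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def percent_better(baseline: float, improved: float) -> float:
    """Throughput gain in percent when a frame takes `improved` instead of `baseline` seconds."""
    return (baseline / improved - 1.0) * 100.0


sandy = tpf_seconds("9m27s")  # System 1, 2x E5-2687W
ivy = tpf_seconds("6m46s")    # System 2, 2x E5-2697v2

print(f"TPF gain: {percent_better(sandy, ivy):.1f}%")
print(f"PPD gain: {(652_000 / 395_000 - 1.0) * 100.0:.1f}%")
```

Both results round to the 40% and 65% quoted above.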

Power consumption is compared on the same system. Just the CPUs are replaced. This system is not energy optimized, so the numbers should only be used for their watt differences, not the absolute values.

E5-2687W: idle desktop: 128 watt, folding: 460 watt
E5-2697v2: idle desktop: 140 watt, folding: 400 watt

I need to check this later, but I think I can shave off another 40 watt of power by a more careful optimization of components (PSU, mem, fans, ...)

With the current results, energy efficiency is:
E5-2687W: 860 ppd / watt
E5-2697v2: 1630 ppd / watt ( +90% improvement)
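
The efficiency figures are simply PPD divided by wall power while folding. A small sketch of that arithmetic, using only the numbers from this post (the function name is an assumption for illustration):

```python
def ppd_per_watt(ppd: float, watts: float) -> float:
    """Energy efficiency as points per day per watt of wall power."""
    return ppd / watts


sb = ppd_per_watt(395_000, 460)  # E5-2687W while folding
ib = ppd_per_watt(652_000, 400)  # E5-2697v2 while folding

print(f"E5-2687W:  {sb:.0f} ppd/watt")
print(f"E5-2697v2: {ib:.0f} ppd/watt")
print(f"improvement: {(ib / sb - 1.0) * 100.0:.0f}%")
```

Because the wattage is measured on a system that is not energy optimized, the ratio between the two CPUs is more meaningful than either absolute value.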

for comparison:
Currently my best system is an i4p system with 1480 ppd/watt.
An optimized dual GTX Titan system is about 860 ppd / watt

cheers,
Andy

Update:
please see the new numbers posted in post #11.
 
That's far better than I would have predicted. Most impressive.
 
Winner winner chicken dinner! It looks like IVB Xeon's are going to make people think about a 2P system vs a 4P one, then again maybe not lol. Until you get them on the used market it still seems like more of a niche thing. Who knows. So much excitement! :D

Thanks for sharing!
 
Very impressive. Let's do some simple math. The 2697v2's all-core turbo is 3.0GHz, while the 2687W's is 3.4GHz, which becomes 3.536GHz with the 4% overclock.
So:
Code:
(3.0*12*2)/(3.536*8*2)=1.2726
(9+27/60)/(6+46/60)=1.3966
1.3966/1.2726-1=0.097=9.7%
That is to say, there is a ~10% performance improvement at the same frequency. This is a really good result. However, I think this value might be a bit too high. Have you installed thekraken-0.7? And did you see DLB get enabled in both tests?
 
Do you have many accessories in that system, or is the idle high just due to power settings (you mentioned not energy optimized)? My 2P 2660 with the same board does 90W/270W, so I imagine (as you said) with a little tweaking you can get some truly amazing PPD/W numbers!
 
Let me try to answer some of your questions. BTW, the system is now at 80% completion and the tpf hasn't changed much. It hovers around the initially observed 6m46s for P8104.

@Kai,
FAH has very good scalability properties (= it scales well with the number of cores). See the math that quickz put together, which shows the throughput increase between the 2 architectures. Combined with the QRB system, which "accelerates" faster systems a bit more, this result is not sooo unexpected.
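
To illustrate the "accelerate" part, here is a rough sketch. It assumes the bonus multiplier grows with the square root of how quickly a WU is returned, so that credit per WU goes like 1/sqrt(TPF) and WUs per day go like 1/TPF, which gives PPD proportional to TPF^-1.5. The exponent and the function name are assumptions for this example, not taken from FAH itself.

```python
def ppd_ratio_from_tpf(tpf_old_s: float, tpf_new_s: float, exponent: float = 1.5) -> float:
    """Estimated PPD ratio for a TPF change, assuming PPD ~ TPF ** -exponent."""
    return (tpf_old_s / tpf_new_s) ** exponent


# TPF values from the first post: 9m27s = 567 s, 6m46s = 406 s
ratio = ppd_ratio_from_tpf(567, 406)
estimated_ppd = 395_000 * ratio

print(f"estimated PPD ratio: {ratio:.3f}")
print(f"estimated PPD for System 2: {estimated_ppd:,.0f}")
```

Under that assumption a 40% TPF advantage turns into roughly a 65% PPD advantage, which is close to the 652.000 ppd reported above.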

@sc0tty8
Hi sc0tty8, nothing special, I would just turn up the BCLK frequency on the Asus Z9PE-D8 WS workstation board. Due to BIOS issues with this board, I currently have the new CPUs in the server board (Asus Z9PE-D16), which does not allow changing BCLK. That's why I have to wait. Movieman over at xtremesystems.org had his hands on other IB Xeons, used them in the Hyperspeed-enabled Supermicro board, and confirmed that IB Xeons can be "OCed" with BCLK like the SB Xeons. So it is currently an assumption, but I am positive that it will be possible.
BTW, I can crank up the memory speed with the DIMMs on the workstation board, which might give another little boost. IB Xeons can use memory at up to 1866 MHz, vs. the 1600 MHz official max for SB Xeons. The current results are with 1600 MHz RAM.

@Liger88
It will be interesting when the first high-end IB Xeons appear as ES versions. I guess the performance numbers listed above might trigger some price premiums that people are willing to pay to get this performance still much cheaper than I did.

@quickz,
good calculation. I did something similar a while ago, which was the basis for getting my hands on the top-end model. Like you, I am still surprised by the 40% increase, which is higher than what most reviewers measured for the E5-2697v2 against the E5-2680 and E5-2690. Both of those SB Xeons are 10-15% slower than the E5-2687W I used here, the top of the SB Xeon line.

I didn't look into this deeper, but here are some ideas why the IB Xeon is 40% faster than the slightly overclocked E5-2687W:
The FAHCore instruction mix might rely heavily on some instructions which are implemented better than in the SB Xeons (microarchitecture).
The memory access pattern of the FAHCore might allow super-linear scaling in certain scenarios. Usually, a CPU with n times more cores has a performance improvement which is lower than "n". Super-linear scaling sometimes kicks in when the memory footprint of the working set fits to a larger extent into the next higher cache level. In these cases, you get speedup ratios slightly above "n", i.e. above the calculation you and I did. FAH has a relatively small memory footprint, so the probability that this can happen is higher than for an application that requires a working set of hundreds of GB, like Linpack.

@402blownstroker
I can submit some entries when the system stabilizes and some WUs are finished end-to-end without reconfigurations on the fly.

@EXT64
My E5-2650 system was in the same ballpark as your wattage indicates. No1 issue in my system is the PSU (SIlvertake 1200w), which is one of the surplus PSUs I have available. I originally needed it because it is the only PSU with 40A on the 5V rail (for my SSD project). On the downside, this PSU has very bad efficiency at low and very low load.
No2 issue is the CPU coolers and the number of fans. Due to availability, the system has 2x H100 watercoolers with 4 fans spinning at 100% (the fan control is broken on my H100). The CPUs don't need that much cooling. Besides the noise issue, the H90i are much more efficient and less noisy. I might change these. There are still 4 fans in the case, which are not needed for this application. One should be ok (as the CPU heat is taken care of by the watercoolers).
No3 optimization is the BIOS settings. I need to play around with them to find which ones save energy without impacting performance. Gee, you remind me that I didn't turn off NUMA memory on this machine. Need to do that.
No4: The system currently has 256GB RAM. Not needed for folding.

Andy
 
Performance update:

@EXT64's question reminded me of the BIOS NUMA setting. Initially, I forgot to turn it off. Done. Here are the new numbers:

TPF of the very same P8104 unit improved from 6m46s to 6m30s.
With this new number as base:
System 1: average tpf of 9m27s
System 2: average tpf of 6m30s (45% better)

System 1: 395.000 ppd
System 2: 692.000 ppd (75% better)

With the current results, energy efficiency is:
E5-2687W: 860 ppd / watt
E5-2697v2: 1730 ppd / watt ( +101% improvement)

Given the increasingly better results, I am inclined to jump on a Supermicro Hyperspeed motherboard. This motherboard might be able to run the IB Xeon with 6% higher BCLK and memory can be cranked up to 1977 MHz (1866 * 1.06). I do have the 2133 MHz Corsair DIMMs, so the mb is the only missing component.

Under the assumption that the TPF times improve by 6% - down to 6m8s - the IB Xeons might hit 756.000 ppd with this particular 8104 unit.

PPD-wise, this is 91% better than the E5-2687W system with 4% overclock and deep in 4P territory. Combined with a target power consumption of below 400 watt, this is very tempting.
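
The projection can be sketched as below. It makes two idealised assumptions that are not measured here: TPF shrinks in direct proportion to the clock increase, and PPD follows TPF^-1.5 (square-root bonus on top of more WUs per day). The function name is invented for the example.

```python
def apply_overclock(tpf_s: float, oc_fraction: float) -> float:
    """Idealised TPF after an overclock, assuming perfect scaling with clock speed."""
    return tpf_s / (1.0 + oc_fraction)


tpf_now_s = 6 * 60 + 30                        # 6m30s measured with NUMA off
tpf_oc_s = apply_overclock(tpf_now_s, 0.06)    # hypothetical 6% BCLK overclock
minutes, seconds = divmod(round(tpf_oc_s), 60)

# Assumed bonus behaviour: PPD ~ TPF ** -1.5
ppd_oc = 692_000 * (tpf_now_s / tpf_oc_s) ** 1.5

print(f"projected TPF: {minutes}m{seconds}s")
print(f"projected PPD: {ppd_oc:,.0f}")
```

Real scaling is usually a little below linear, so treat the result as an upper bound rather than a prediction.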

Andy
 
I did not expect to get an 8101 WU so soon, but I did. The second WU is a dreaded 8101 :(

avg. tpf of 5 frames: 11m8s
About 467.000 ppd.

Usually my E5-2687W system had 300k for 8101. This is a 56% ppd improvement.
 
Turning off the NUMA option would improve the performance by ~4%? This is a bit strange, I thought turning on this option should be faster.

 
Turning off the NUMA option would improve the performance by ~4%? I think this is a bit strange....
Sure.
It is (relatively) well known that turning this option off improves folding performance. I just forgot it yesterday.
 
I didn't know that either - interesting. I'm interested to check what mine is set to, though it has been running 11 8102 in a row (and they are above average PPD 8102) so I don't dare interrupt it.
 
Shens.
Kraken + NUMA enabled beats any NUMA-off config.
 
Hmm...

Was the same WU used each time?

Was it launched from scratch? [no checkpoint involved]
Resuming from checkpoints causes fluctuations in performance; hell, even when starting from scratch
fluctuations do happen (ain't that right, Grandpa?)

To have better view of performance, I would rather fold 0 through X % several times in each
configuration and then look at the data. While it sounds laborious, I'm sure you'll find results interesting.

Anyway, we've been through "flipping NUMA back and forth" period in 2011, there's no need to go
through it again; at least not on non-IB-E systems :)
 
It was the first WU I did with the new IB Xeons, so yes, it was the same WU.
0% - 60% of the unit was done with NUMA on, 60 - 100% of the unit was with NUMA off.

The rig is now in its second WU (P8101).
I started this WU with NUMA=off and reported the results.
I then checked this WU with NUMA=on. The average over 10 frames is 4% lower than before (11m35s).
The system is now back on NUMA=off and the tpf is back to 11m8s.

Andy
 
Iiinteresting...

Given that it's [H] appliance, can you run: fahdiag | pastebinit
and post the result? Just to satisfy the doubt in me please :)
 
I noticed that you got pretty good performance on 8104; however, the performance you got on 8101 is not as good as that on 8104. Even 11m8s is not an ideal result.
There are some PPD results for a 2p IV-B E5 (24C, 2.4G all-core, 1333 ecc ram) in http://hardforum.com/showpost.php?p=1040076950&postcount=9
That rig's TPF for 8101 is 13m40s. With linear scaling we could expect its TPF @3.0G to be about 10m56s, which is significantly faster than 11m8s.
Any thoughts?
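
The linear-scaling estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic and assumes TPF is inversely proportional to the all-core clock, which ignores memory speed and other differences between the rigs; the function name is made up.

```python
def scale_tpf_by_clock(tpf_s: float, clock_from_ghz: float, clock_to_ghz: float) -> float:
    """Estimate TPF at a different clock, assuming TPF is inversely proportional to frequency."""
    return tpf_s * clock_from_ghz / clock_to_ghz


reference_tpf_s = 13 * 60 + 40      # 13m40s on the 2.4G rig
measured_tpf_s = 11 * 60 + 8        # 11m8s on the E5-2697v2 rig

expected_tpf_s = scale_tpf_by_clock(reference_tpf_s, 2.4, 3.0)
gap_s = measured_tpf_s - expected_tpf_s

print(f"expected TPF @3.0G: {expected_tpf_s:.0f} s")
print(f"gap to measured:    {gap_s:.0f} s ({gap_s / measured_tpf_s * 100.0:.1f}%)")
```

The gap works out to about 12 seconds, a bit under 2% of the measured TPF.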

 
12s out of 11m8s is significantly faster with the random fluctuations we see in daily TPFs of 8101???
I don't think so.
 
tear,
I'll get you the output as soon as the machine is back in Linux.

As these are new CPU types, my curiosity about their performance pattern is not only in folding but some other apps as well.

I am currently running y-cruncher, an application Alex Yee wrote for calculating mathematical constants like "pi" to a huge number of digits. This application is very well optimized and its computational density is higher than that of the FAHcores, as seen in the power consumption. It is not uncommon for the very same system to consume 10-15% more energy during y-cruncher runs vs. FAHcore runs.

FYI,
here are 2 screenshots taken from 2 consecutive runs I just did. The only difference between the 2 runs is the BIOS setting for NUMA. All other things in BIOS and OS are left unchanged. Please keep in mind that y-cruncher uses much more memory than the FAHcore, so the impact of the NUMA setting is higher than in folding.

I used y-cruncher version 0.6.2, Win8 and set the application to calculate 5bn digits of "pi". Memory consumption is 26.5 GB.

Run 1: Numa = Off; Total compute time = 455.8 sec, screenshot
Run 2: Numa = On, Total compute time = 544.0 sec, screenshot (= 19.4% longer)

You can check it on your own machines: just download the app and tinker with the NUMA settings. Y-cruncher's performance is highly repeatable and serves well for this kind of test.
BTW, the old best mark was 530 sec, with an overclocked SB Xeon system (dual E5-2687W, OC=6%). :)

Andy
 
Please post fahdiag's output whenever you happen to run FAH on it again, please :)
 
Only a few ideas, it is too early to claim experience with the system and I don't have access to the other CPUs.
What I currently can use for a guess is a finished P8104 and an ongoing P8101 unit which is interrupted (system is used for other stuff)

What I noticed though was that the level of speedup I saw between my E5-2687W and the new E5-2697v2 was different in the 2 WUs. There was a higher speedup (as seen in TPF) in the 8104 WU vs. the initial phase of 8101.

Why? I don't really know, it could be a different access pattern, a different mix of computational functions, etc ...

I think that over time, as more and more people provide data points for different CPU models and comparison data, we will have more clarity about some of those effects.
 
Please post fahdiag's output whenever you happen to run FAH on it again, please :)
Sure tear.
Question: I am not running the [H] appliance, but the "discrete" installation. Is fahdiag available there as well?
I'd like to finish those other non-folding runs today before I get the system back on folding for next week (I'll be in the US for a few days)
 
If you used "fahinstall" utility (from the guide), fahdiag should be available.

I'd like to finish those other non-folding runs today before I get the system back on folding for next week (I'll be in the US for a few days)
Sure thing, I'm putting a machine together to test a few things here as well (semi-related topic).

Where in the US are you going?
 
tear,
did restart the [H] OS to quickly grab the fahdiag for you.
BTW, I will be in Seattle.
rgds,
Andy
 
I hear a rumor that E5 v2 is better than E5 v1 at overclocking (by adjusting BCLK). It's said that 10% or even 12% overclocking is possible for E5 v2, while for E5 v1 it's only 8% at most.
Looking forward to your overclocking results.

 
Performance update: The system just started its 3rd WU (P8103).

TPF (avg of first 3 frames): 8m40s
According to the bonus calculator this equates to 680.000 ppd for a P8103 project.

(6% OC would get us to 8m10s and 744.000 ppd for P8103 :) ) Not bad for a 2P system ..
 
AndyE, as always, thanks for starting this interesting thread. I'm waiting for some ES-goodness to show up on crapBay. ;)
 
AndyE, as always, thanks for starting this interesting thread. I'm waiting for some ES-goodness to show up on crapBay. ;)
brilong, you are welcome.
Go and hurry up. I think Intel hasn't made enough samples to cover all requests :D
 
AndyE, as always, thanks for starting this interesting thread. I'm waiting for some ES-goodness to show up on crapBay. ;)

depending on how much you are willing to spend there are a few out there already
 
depending on how much you are willing to spend there are a few out there already

Nathan_P, I have a saved search setup with the following terms:

Code:
(QF93, QEEY, QDDT, QDT6, QD29, QE4Z, QDKA, QDUF, QDDK, QDUA, QD0Y, QDUB, QDUD)

I grabbed these from the E5 spreadsheet. I was aiming for certain higher-end models. Do you think I'm missing anything in my search? Thanks.
 
Probably, those Q codes come from ebay so not sure how accurate they are and there are some missing. I run several searches using xeon es, xeon e5 v2 and xeon e5 es.

What I don't yet know is which are the better cpu's to go for in terms of steppings
 
Hey AndyE Thanks for the info and putting in all the hard work! Good to know these will be worth the money for upgrading some of my 2p xeon systems once they're more available. Cheers!
 
Just a thought...
If your acpi tables are messed up (memory mapping incorrect, cpu1 ram linked to cpu2 and vice versa) numa could slow you down.

This would be the first test case we have witnessed it on and is thus of great interest.
Also, please check if numa is turned off at the OS level; having mismatched settings might also cause an issue.
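
One way to look at the OS-level view on Linux is to read the NUMA node directories under `/sys/devices/system/node`. The sketch below lists each node with its CPU list; the function name is made up for this example, and it only shows what the kernel exposes, it does not validate the ACPI tables themselves.

```python
from pathlib import Path


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Map each NUMA node name (e.g. 'node0') to its CPU list as exposed by Linux sysfs."""
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # not Linux, or sysfs not mounted
    for node_dir in sorted(root.glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[node_dir.name] = cpulist.read_text().strip()
    return nodes


if __name__ == "__main__":
    found = numa_nodes()
    if len(found) <= 1:
        print("OS sees at most one NUMA node:", found)
    else:
        for name, cpus in found.items():
            print(f"{name}: CPUs {cpus}")
```

On a 2P board, seeing two nodes with the cores split between them suggests the OS has NUMA information available, while a single node covering all cores suggests it does not.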

When testing, use the same unit, reset at 0.
Paused and restarted wu's have a history of variability in tpf and the numbers are not reliable.

Cheers :)
 
Patriot,
I observed this NUMA setting issue last year in a range of non-NUMA-aware apps: y-cruncher being one, folding another, plus some other HPC apps. If you take Intel's LinPack benchmark, it is as you described, but given the level of optimization Intel has put into this implementation, it is probably NUMA aware.

I am currently in the US and can't access the system from abroad.

Even worse, my production for this week stopped abruptly a few hours ago. Too bad; with 3 systems running, a 24-hour average rate of 2 million ppd was slowly building up. I need to restart the systems next weekend when I am back. I can take another look at this issue as you described, but I have had so many occurrences of performance differences that over time it became a repeating pattern. BTW, is there any tool available to check whether the ACPI table is correct or screwed up?
 
If the Software is non-NUMA aware I have seen worse performance with it enabled.
However...this is not the case with folding unless something has changed.
Incorrect mapping tables would do it...but I dunno... I will try to get onto an i2p v2 later this week and test it myself.
 
If the Software is non-NUMA aware I have seen worse performance with it enabled.
To address any potential misunderstanding of my points raised earlier:
The performance of non-NUMA-aware apps is (usually) lower with NUMA enabled, and higher with NUMA disabled.
 