Intro, 1 week folding, lots of lessons learned

AndyE (Limp Gawd), joined May 13, 2013, 276 messages
Hi there.

Joined [H] a few days ago. The reason for joining was the wealth of information available on all things hardware and the knowledge being collected and made available for folders. Thanks a lot, it is very much appreciated by a folding newbie.

A quick intro:
For 30 years I have been a software guy, and I recently got back into hardware. As part of a private "home lab", I built a system for high-speed I/O (> 20 GB/s) and got a few other components for explorations in the world of multiprocessing and GPUs. Last week, I finished one of my private projects and started to explore the world of folding a bit more, tinkering with system setups, cost vs. PPD, energy vs. PPD, system balance, stability, etc.
I reconfigured my components, systems and setups a few times to maintain enough flexibility between the different things I do and intend to do going forward.

A week ago, I went into this 7-day test period with the one million points I had collected over 2 weekends of initial trials in April. Had I applied from the start the knowledge I gained throughout this week, the final result would have been higher, but the lessons learned make up for it.

Some of the lessons:
  1. I deferred the decision of which team to join until the end. Going forward I'll switch from team 0 to 33. Given what I have seen from brilong, Grandpa and others, I will not be a top producer on this team, but I want to give back to this community by adding my production to your team result.
  2. System stability is more important than the last ounce of performance from hardware "optimization". System maintenance effort and the overall performance impact seem to grow with the square of the frequency gain from overclocking.
  3. It took me some time to understand the non-linear model of PPD forecasts, results, the impact of internet upload speed, etc. The model in itself is clear, but optimizing multiple available levers simultaneously was a bit more tricky, especially once acquisition cost, immediate energy cost or 3-year TCO is added to the consideration (a sketch of the bonus formula follows below).
  4. The relationship between 4P and GPU based systems. If I had joined a year ago (or earlier), the "business" case for a 4P rig would have been unquestioned for me. Today and going forward, I am not so sure anymore. Today the 4P systems have better PPD/watt (up to and over 1,000) vs. the GPUs (650 in my case). Given that we are now entering the new era of explicit solvent simulations made possible with FAHCore 17 on GPUs, one of the big gaps is closing. The long-term evolution of GPU performance might continue to outpace CPU development, which makes this decision no longer an obvious one.
  5. In addition to the points made in (4): if the unification of the GPU interfaces on OpenCL allows the PG to put more and more emphasis on this codebase to tackle potentially larger problem sizes, the intrinsic computational power of GPUs can "unfold" faster than in the CPU world. The performance of OpenCL drivers will evolve as well. Observing that only a fraction of the 3 or 6 GB of GPU memory my cards have is currently used (bigadv projects need comparatively few GB of memory), -bigadv on GPUs might not be too far away. As said, no implications for now, but while I tinkered with the idea of building one of those 4P systems, these kinds of thoughts came to mind.
  6. Expect the unexpected. Just in this test week, the first wide-area power outage in my district of the city in the last 30 years happened. I do have a gasoline emergency generator to feed the house, but not enough juice to run the systems under load. Due to frequent switches between the grid and the generator (grid recovery didn't go as planned), it severely impacted PPD scores and increased my admin effort (to not kill WUs). Daily production on -bigadv was hit harder than on the shorter core17 WUs: screwing up only one 2-hour run and then getting back to normal had less influence on the QRB overall.
  7. With FAHCore 17, the energy efficiency (PPD per watt) of a smaller GPU box is quite close to that of a 2P -bigadv box.
  8. And many more, potentially well-known points for the experienced folders, but new experiences for me.
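Regarding point 3: a minimal sketch of the commonly cited quick-return-bonus (QRB) credit formula, to show where the non-linearity comes from. The constants below (base credit, k-factor, deadline) are placeholders, not the values of any real project; the per-project values are published by the PG.

```python
import math

def wu_credit(base_credit: float, k_factor: float, deadline_days: float,
              elapsed_days: float) -> float:
    """Commonly cited QRB formula: credit = base * sqrt(k * deadline / elapsed),
    with the bonus multiplier never dropping below 1."""
    bonus = math.sqrt(k_factor * deadline_days / elapsed_days)
    return base_credit * max(1.0, bonus)

def ppd(base_credit: float, k_factor: float, deadline_days: float,
        elapsed_days: float) -> float:
    """Points per day: credit earned divided by the days spent on the WU."""
    return wu_credit(base_credit, k_factor, deadline_days, elapsed_days) / elapsed_days

if __name__ == "__main__":
    base, k, deadline = 8000.0, 2.0, 6.0   # placeholder project constants
    for days in (2.0, 1.0, 0.5):
        print(f"finish in {days:>3} days -> {ppd(base, k, deadline, days):>9.0f} PPD")
    # Halving the completion time raises PPD by ~2**1.5 = 2.83x, not 2x;
    # that is the non-linearity point 3 refers to.
```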

At least I managed to move from position 16,000 to 422 overall, exceeded 1 mio PPD for the first time and was among the top 3 contributors of team 0 :D

[screenshots: production stats]



Looking forward,
Andy
 
Welcome aboard. Interesting thoughts that you have posted there - it will be interesting to see how everything pans out for PPD/W once the Ivy Bridge Xeons are out.
 
[*]It took me some time to understand the non-linear model of PPD forecasts, results, the impact of internet upload speed, etc. The model in itself is clear, but optimizing multiple available levers simultaneously was a bit more tricky, especially once acquisition cost, immediate energy cost or 3-year TCO is added to the consideration.

Care to share your acquisition cost and TCO findings in more detail?

Thanks.
 
Thanks for the welcome.

Brilong,
sure, no problem. I'll come back later today with some assumptions, numbers and thoughts. Travelling right now.
Andy
 
Welcome to the fold.
As far as which client to go with... I like the "both/and" option.

Intel 4P and core17 folding both hold promise.
AMD 4P has excellent price/perf, but if I built now I would want to go AD (Abu Dhabi) due to newer instruction sets and lower power consumption.
 
welcome...

and holy shit...

my first week here I barely learned the right way to pick my nose...

Glad to have you! :D
 
Welcome!:D
So what you running?:cool:
Thank you.

This question isn't easy to answer as the system(s) are in a constant state of flux.

A quick comment on where I am coming from: fast I/O workloads obviously require a different setup than a system optimized for computational science. Over the last 9 months or so, I did some study work on contemporary configurations for different workloads. As important as the number of cores is to this community (I guess), in the I/O field it is the number of parallel I/O paths and other bottlenecks that are of interest.

Some issues are simple, others are rather fundamental design constraints which are hard to tackle. A quick anecdote on the simple issues, as an example of something I had never thought about: one SSD needs about 0.6-1 amp at 5 volt to write at full speed. Start putting 24 or more SSDs in one system and the probability is high that your PSU's protection trips. So I put in a bigger PSU, 850 watt: same situation, the fuse goes. 1200 watt, same story. 1500 watt, still the fuse goes off. A 1500 watt PSU can't drive the 240 watt of power that 48 SSDs need to write at full speed. The culprit is the voltage. Basically all PSUs are optimized to crank out huge amounts of power on their 12 volt rails for CPUs and GPUs, but not on 5 volt. Modern SSDs need only 5 volt and make zero use of 12 volt. I found only one PSU able to drive my mediocre 240 watt SSD power load - the 1200 watt Silverstone Strider, which is rated at 40 amp at 5 volt but has some headroom. Lesson learned.
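A quick back-of-the-envelope check of that 5 V rail budget. The per-SSD current draw follows the figures above; the rail ratings are typical spec-sheet values, so treat this as a sketch rather than a sizing guide.

```python
# Rough 5 V rail budget for an all-SSD write workload.
# Assumption from the post: each SSD draws ~0.6-1.0 A at 5 V while writing.

def rail_load_watts(n_ssds: int, amps_per_ssd: float = 1.0, volts: float = 5.0) -> float:
    """Worst-case 5 V rail load in watts for n SSDs writing at full speed."""
    return n_ssds * amps_per_ssd * volts

def rail_headroom_amps(psu_5v_rating_amps: float, n_ssds: int,
                       amps_per_ssd: float = 1.0) -> float:
    """Remaining 5 V rail current after the SSD load (negative = overload)."""
    return psu_5v_rating_amps - n_ssds * amps_per_ssd

if __name__ == "__main__":
    for n in (24, 48):
        print(f"{n} SSDs -> up to {rail_load_watts(n):.0f} W on the 5 V rail")
    # Many high-wattage PSUs are rated around 20-25 A on 5 V; the Silverstone
    # Strider mentioned above is rated at 40 A.
    for rating in (20, 25, 40):
        print(f"5 V rail rated {rating} A: headroom with 48 SSDs = "
              f"{rail_headroom_amps(rating, 48):+.0f} A")
```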

To accommodate the frequent reconfigurations, I built "RaidPacks" of 8 or 16 SSDs with super simple means.

24x Samsung 840 Pro (256GB, as the 128GB models have asymmetric read/write speeds)
[image: the 24 Samsung 840 Pro SSDs]


This is one of the portable "RaidPacks" to be moved from one machine to the next. One power plug to connect to the PSU and one raid controller. Quick and easy.
[image: one of the portable "RaidPacks"]


When NVidia announced and shipped the Titan, the architectural extensions of the GK110 GPU were the key reason for me to wait for availability of these cards. Things like SM 3.5, dynamic parallelism, the new bit instructions, etc. are not available on the GK104, GK106 and GK107 based cards. To be able to compare the architectural efficiency of the Titans to AMD cards, I later bought 4 Gigabyte 7970 GE cards as well. When I am not working on my own project, these cards currently represent my main folding "budget".
[image: the 4 Gigabyte 7970 GE cards]


Similar to the scaling efficiency challenge in the folding area (thread affinity, addressed e.g. with TheKraken), beyond a certain speed I/O runs into an interrupt affinity issue as well.

Anyway, to cut a long story short: I am curious, I love to look into things I have no clue about, and I am still looking for the better folding setup.

Andy
 
Welcome to the Fold, so to speak! Looks like you've got some interesting toys to play with, and a few stories to share as well... :). Glad you chose to hang out with us here on the [H].
 
Thanks Axdrenalin.

If folks don't object, I'll finish my off-topic "journey" into I/O land with this post.

It went unnoticed by many people, but 2012 marked a kind of inflection point for I/O, similar to the famous Intel decision in 2004 to move its development efforts to multicore.

What happened in 2012 that is relevant for mainstream I/O?
  1. The introduction of the Xeon E5-2600 series brought a few significant architectural enhancements, comparable to AMD's breakthrough enhancements introduced in April 2003 with the K8 based Opterons (Sledgehammer): moving the memory controller on chip, simplifying multi-socket systems with HyperTransport and bringing NUMA to the mainstream. The SB Xeons are the first mainstream server CPUs with the PCI Express root complex (16, 24 or 40 lanes of v3.0) on die; direct cache access for DMA deals with the long-standing problem of cache pollution from high-speed DMA; L3 cache bandwidth improved significantly (6-fold) over the predecessor; and more.
  2. In summer 2012 the first RAID controllers and HBAs (host bus adapters) appeared on the market with a) PCIe 3.0 support, lifting the bandwidth to 7 GB/s per x8 card, and b) a new generation of RoC (RAID on a chip) processors finally fast enough to cope with the speed of 8 fast SSDs. Until then, RAID controllers were either a bottleneck for all-SSD setups or hugely expensive.
  3. The availability of cost-effective consumer SSDs with fast and, more importantly, sustainable write performance. Many consumer SSDs were good at benchmarks but had lackluster and often unpredictable write performance due to design choices optimized for desktop usage. For instance, SSDs connected to RAID controllers don't get the TRIM command; SSDs that depend on TRIM to start garbage collection will experience severe write penalties in these setups. SSDs with proactive GC do much better.
  4. Support for these kinds of extensions in mainstream operating systems.

What core count and frequency are to the folder, bandwidth and latency are to the I/O guy.
The number one bottleneck for balanced system performance these days is memory bandwidth. One GTX Titan can saturate the main memory bandwidth of an LGA-1155 based system, leaving basically no remaining bandwidth for the CPU to do its job. This is one of the main reasons why gamers experience very bad scaling with triple and quad GPU setups on LGA-1155 systems (the second is the threading model in the drivers). A good rule of thumb for a balanced system config is to keep the I/O bandwidth below 20 or 25% of the available main memory bandwidth.
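To make that rule of thumb concrete, a small sketch using nominal spec-sheet numbers (PCIe 3.0 at roughly 985 MB/s per lane, DDR3-1600 at 12.8 GB/s per channel on both platforms); these are platform figures, not measurements of my systems.

```python
# How much of the main memory bandwidth a single x16 device can consume.

PCIE3_GBPS_PER_LANE = 0.985          # ~985 MB/s usable per PCIe 3.0 lane
DDR3_1600_GBPS_PER_CHANNEL = 12.8    # 64-bit channel at 1600 MT/s

def io_share(pcie_lanes: int, mem_channels: int) -> float:
    """Fraction of main memory bandwidth one PCIe 3.0 device can soak up."""
    io_bw = pcie_lanes * PCIE3_GBPS_PER_LANE
    mem_bw = mem_channels * DDR3_1600_GBPS_PER_CHANNEL
    return io_bw / mem_bw

if __name__ == "__main__":
    # x16 GPU on dual-channel LGA-1155: ~62% of memory bandwidth,
    # far above the 20-25% guideline.
    print(f"x16 GPU on LGA-1155: {io_share(16, 2):.0%} of memory bandwidth")
    # The same card on quad-channel LGA-2011: ~31%.
    print(f"x16 GPU on LGA-2011: {io_share(16, 4):.0%} of memory bandwidth")
```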

The second component is latency. The old adage "you can buy bandwidth, but you have to design for low latency" still holds true. Here the move of the PCI root complex from the external hub onto the die is the design improvement triggering astonishing levels of performance per unit cost. Marketing guys would call it unprecedented...

To maintain flexibility in my project, my main mainboards are chosen mostly for a good assortment of I/O slots:
Asus P8Z77WS (LGA-1155, 32 GB, 4 PCIe x16 slots)
Asus P9X79WS (LGA-2011, 64 GB, 4 PCIe x16 slots)
Asus Z9PE-D8 (dual LGA-2011, 128 GB, 7x PCIe x16 slots, able to host 4 GPUs)
Asus Z9PE-D16 (dual LGA-2011, 256 GB, 6x PCIe x16 slots, not able to host 4 GPUs)
plus a few simpler LGA-1155 mainboards

CPUs available:
2x Xeon E5-2687W
2x Xeon E5-2650
1x i7-3930K
1x i7-3770K
1x i5-3470
1x i3-3225
3x Celeron G1610

As written in my first post, I reconfigured quite a few times based on the ongoing folding experience, and the list above should provide a bit of context on what is available "for experimentation". To close off, the reason why I currently don't have any AMD system is the current AMD PCI Express architecture, which is still based on the SR5690 hub and limits the I/O bandwidth of the overall system. This consideration doesn't hold for compute jobs, as the 4 memory channels per socket, the HT interconnect and the massive number of cores are in good balance there. If I were to look into 4P systems, I'd rather go the Intel route for the reasons stated above.

BTW, I haven't looked too deeply into it, but based on initial observations, the folding cores seem to have a high level of memory access locality for both CPU and GPU cores. The performance of the GPU core17 is practically agnostic of the available bandwidth in the PCI slot, and PPD results depend more on frequency changes in the shaders than on the GPU memory speed.

CPU cores like A4 or A5 also depend much more on CPU frequency than on main memory speed. The relatively small footprint usually increases the share of hits in the data cache hierarchy (L1 and L2 scale with the frequency of the CPU core). One of the limiting factors when computing large data sets (>50 GB) is TLB misses. For those who aren't familiar, TLB = Translation Lookaside Buffer. Basically all CPUs today run with virtualized memory, so the address an application program sees is processed by the memory management unit and mapped to a physical address. This mapping is very fast when the translation is in the TLB cache. If the virtual address is not covered by the entries in the TLB, the memory management unit has to perform a couple of main memory accesses to find the matching physical address. FAH doesn't look like it is impacted by TLB misses; if it were, CPUs with larger TLB caches would perform better.
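A toy illustration of that effect, not tied to any real CPU: a tiny direct-mapped "TLB" (real ones are set-associative) fed with sequential vs. random accesses, using typical sizes of 64 entries and 4 KiB pages.

```python
import random

PAGE_SIZE = 4096     # bytes per page
TLB_ENTRIES = 64     # typical L1 DTLB size

def tlb_miss_rate(addresses) -> float:
    """Fraction of accesses whose page translation is not in the toy TLB."""
    tlb = [None] * TLB_ENTRIES
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        page = addr // PAGE_SIZE
        slot = page % TLB_ENTRIES
        if tlb[slot] != page:        # miss -> simulated page walk, then cache it
            misses += 1
            tlb[slot] = page
    return misses / total

if __name__ == "__main__":
    span = 256 * 1024 * 1024                      # touch 256 MiB of address space
    seq = range(0, span, 64)                      # sequential 64-byte strides
    rnd = (random.randrange(span) for _ in range(span // 64))
    print(f"sequential: {tlb_miss_rate(seq):.2%} TLB misses")   # ~1/64 of accesses
    print(f"random:     {tlb_miss_rate(rnd):.2%} TLB misses")   # nearly every access
```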

/end offtopic

Andy
 
Back to folding.

My very first attempt in April was with the following setup:
i7-3930K, 64 GB
4x Titan
[image: the i7-3930K system with 4x Titan]


BTW, I run the bigger systems with out-of-the-box water coolers. I started off with a Corsair H100 on the E5-2687W and later moved to H90s, which are smaller, easier to get into cases and good enough to keep the Xeons under 60C, even with the most intense apps (Linpack, y-cruncher). Current FAH cores run at 50-52C. This might change if more of the architectural extensions like SSE, AVX, etc. are brought into use and increase the computational density.

Core17 wasn't available (to me) then, and with core15 WUs the system performed in the 200-250K PPD range. While I like the i7-3930K for other activities, in hindsight it wasn't a good fit for this job. After cutting out 2 cores to feed the GPUs, the remaining 10 logical cores were too few to run larger jobs, and the jobs available for the CPU slot didn't have QRB, so the better throughput didn't benefit from the QRB lever. For pure GPU feeding, it was overkill.

When the Asus Z9PE-D8 arrived, I moved the 4 Titans to this board. Core17 had arrived by then, and due to the NVidia OpenCL driver's need to assign a full CPU core to each GPU, I used 28 cores in a VM to run -bigadv jobs there. Vs. a native installation, performance in the VM was approx. 75%.
[image: the Asus Z9PE-D8 with the 4 Titans]


Core17 was available, and 2 of my Titans showed 130-140k PPD while the 2 others were in the 110k range. My initial thought was that Titans have very high sample variation. I was wrong; the culprit was me. I had forgotten to set all 4 Titans to the same system settings, and the 2 lower-performing cards still had the "double precision" option turned on. Keeping it on hurts single precision performance because the turbo mode doesn't kick in.

Sidenote:
There are quite a lot of discussion-filled threads on the internet about the utility of a GTX Titan vs. alternatives like the GTX 680 or GTX 690. "Overpriced", "marginal impact", "cheaper alternatives available" were often-read arguments. The Titan's flexibility for different workloads can be seen in this one example, which depends on a different ratio of single vs. double precision floating point. (To my current knowledge, FAH cores use single precision.)

I'd like to share a graphic from the Supercomputer Center in San Diego, which shows the performance of CPUs and GPUs for Amber - also one of the broadly used applications in the field of bioinformatics.

The slide is from a recent presentation (PDF file) at GTC 2013 in March.

A few comments:
  • For performance reasons, Amber leverages mixed precision calculation (i.e. single precision for the individual multiplications of vector elements and double precision for their summation)
  • A single GTX Titan seems to be 37% faster than an 8-node dual Xeon E5-2670 cluster (the CPUs alone are 25,600 US$ at newegg.com)
  • One Titan is slightly faster than 4 GTX 680s in one compute node
  • It is 22% faster than 2x K10 cards (the "pro" version of the GTX 690, roughly equivalent to the GTX 690)
  • Due to its higher frequency, it is 28% faster than its professional K20X brethren; due to scaling issues between 2 GPUs, it is even faster than 2 of them. (The K20X is the more expensive version of the K20, which is currently listed at newegg.com for 3,500 US$ each)
  • The performance metric measured is nanoseconds of folding simulation per day of compute time. The time steps are usually 2 femtoseconds, so one nanosecond requires 500,000 iterations of the force calculations between the atoms under investigation.

[image: Amber performance comparison slide from the GTC 2013 presentation]



/end sidenote

Anyway, after I learned the initial steps of K-factor, QRB and the like, I wanted to offload the GPUs from the dual Xeon. The NVidia drivers load 4 CPU cores with "useless" polling activity - valuable CPU power which is especially important at "the edge" of the best possible TPF times for QRB WUs.

If I understood it correctly, the impact is non-linear (for illustration purposes I used project 8105). Please note that the relationship between TPF and PPD is more than proportional: a 10% better TPF time gives approx. 15.5% higher PPD in this case. (Hope I got this right.)
[image: PPD vs. TPF curve for project 8105]
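A quick sanity check of that sensitivity, assuming the same QRB shape as in my first post (credit grows with the square root of the speed-up, so PPD goes roughly with TPF^-1.5 while the bonus is above its floor); the exact figure depends on the project's k-factor, deadline and base credit.

```python
# Sensitivity of PPD to TPF under the assumed shape PPD ~ TPF**-1.5.

def ppd_gain(tpf_improvement: float) -> float:
    """Relative PPD gain for a given fractional TPF improvement."""
    return (1.0 - tpf_improvement) ** -1.5 - 1.0

if __name__ == "__main__":
    for imp in (0.05, 0.10, 0.20):
        print(f"{imp:.0%} better TPF -> ~{ppd_gain(imp):.1%} more PPD")
    # 10% better TPF -> ~17% under this simplified model, in the same
    # ballpark as the ~15.5% read off the project 8105 curve above.
```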


So basically, since my fastest system is most likely on the right side of this graph, increasing its performance by 10% (measured in TPF) has the highest impact on the total PPD of all my systems combined.

With this graph in mind, I removed all the GPUs from the dual Xeon system and assigned all resources to one -bigadv slot. What I haven't checked yet is the impact of NUMA and the choice of 1 or 2 NUMA-aware slots. Most non-NUMA-aware applications lose a good deal of efficiency when they move from a single-socket platform to a dual-socket NUMA platform. My understanding is that TheKraken covers much of the thread affinity issue. I don't know about other systems, but my A4 and A5 cores still show 6-12% CPU load in system state (kernel mode). Question: would a dual-slot approach, where each FAH slot is exactly aligned to one socket, perform better than a single FAH slot spanning both NUMA nodes on the same machine? Is there some experience in the team? Please share any pointers you have.

The Titans had to move a 3rd time. As they don't produce a lot of PCI traffic, I moved them to the only 4-PCI-slot system with an LGA-1155 socket in my repository - the Asus P8Z77WS board. Equipped with my smallest 4-core Ivy Bridge (the i5-3470), assembly is simple. Power consumption under load is at 800 watt, but unfortunately the NVidia drivers are unstable, killing the whole system with kernel errors. I tried both 314.22 and 320.14 beta. No difference.

It would be great to see two issues solved:
1) Future versions of the NVidia OpenCL driver should not use the CPU for polling. This keeps the CPU in a high power state and adds 50-70 watts that do not contribute to compute results.
2) A fix for the driver crashes, so the 4 Titans can stay in this "smaller" system.

Currently, only 2 Titans are running in this system. They are stable and produce about 300-310k PPD (WU 7663: TPF 1m5s). If I could add the 2 remaining GPUs, the expected energy efficiency would be in the 750 PPD/watt range.


Andy
 
To close the system description, some points on the AMD 7970 setup.

I was pleasantly surprised by the cooling solution of the Titans. Even in quad packs, they are very effective at getting the heat out of the system case.

The only AMD 7970 GE cards in town were Gigabyte 7970 cards. With its 3 fans, the noise level of a single card is very low, and temperatures are well controlled even with stress tests like FurMark. Things start to change when a second card is added. If the 2 cards sit side by side, the one whose air intakes are "covered" by the second card can't maintain reasonable temperatures at all; 100C is often the result. The second difference to the Titans: the hot exhaust air is dumped into the case, where a second set of fans is required to cool the case down.

I thought about different solutions, but ended up with the following.

If I need all 4 cards in one system for a relatively short time, I plug 2 cards directly into PCI slots and connect the other 2, intersecting cards via 6-inch PCI extender cables, lifting them above the first 2. Pointing a good fan at this setup helps as well.

For keeping a system folding over a longer duration, I looked for another solution.

1) The AMD OpenCL driver does not need much CPU attention - so I went for a Celeron G1610.

2) I looked for a cheaper motherboard with 2x PCIe 3.0 x16 slots (physical slots) that split electrically into 2x PCIe 3.0 x8 connections. The IB and SB desktop CPUs provide 16 PCIe 3.0 lanes plus the DMI interface to the PCH chip. There are many motherboards out there which have 2 physical x16 slots, but where the second slot is connected via this DMI link - a very unbalanced system design for this purpose. A better solution is a motherboard which routes all 16 lanes to PCI slot 1 if this is the only one occupied, and splits them into 2x8 when both slots are occupied, leaving all the DMI capacity for the rest of the I/O (SSD, LAN, USB, ...); see the bandwidth sketch after this list.

The second thing I wanted was for the two physical PCIe x16 slots to be not 2 but 3 slots apart. This leaves more room for fresh air for the second card.

3) PSU. Any 600W+ PSU is OK, as long as it has 2x PCIe 12V connectors. Molex adapters (3$) are available for those whose PSU has only a single PCIe 12V connector.
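The bandwidth sketch referred to in point 2, using nominal link rates (PCIe 3.0 at ~985 MB/s per lane, PCIe 2.0 at ~500 MB/s per lane, DMI 2.0 being electrically a PCIe 2.0 x4 link); these are spec values, not measurements.

```python
# Why a CPU-attached x8/x8 split beats hanging the second GPU off the PCH/DMI.

PCIE3_GBPS_PER_LANE = 0.985   # ~985 MB/s usable per PCIe 3.0 lane
PCIE2_GBPS_PER_LANE = 0.5     # ~500 MB/s usable per PCIe 2.0 lane
DMI2_LANES = 4                # DMI 2.0 is roughly a PCIe 2.0 x4 link

def link_gbps(lanes: int, per_lane: float) -> float:
    return lanes * per_lane

if __name__ == "__main__":
    print(f"CPU x16 slot:         {link_gbps(16, PCIE3_GBPS_PER_LANE):5.1f} GB/s")
    print(f"CPU x8/x8 split:      {link_gbps(8, PCIE3_GBPS_PER_LANE):5.1f} GB/s per card")
    print(f"slot behind DMI 2.0:  {link_gbps(DMI2_LANES, PCIE2_GBPS_PER_LANE):5.1f} GB/s,"
          " shared with SATA, USB and LAN")
```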

For folding, the AMD cards are located in 2 identical systems.
[image: one of the two identical 2x 7970 systems]


With stock frequency, PPD hovers between 195 and 210k per system (with 7663 WU)
[image: PPD screenshot]


The Celeron handles the load of driving the 2 cards just fine, barely leaving power save mode (please NVidia, take note).
[image: CPU load screenshot]


Power consumption is 380 watt / system (measured at wall outlet)

Once a day, one of the GPUs in each system picks up an old core16 WU (i.e. 11293). With a total forecasted PPD of 3,700, these WUs basically cut the production of the system in half. From an energy efficiency perspective, I shouldn't run these WUs. I keep them enabled because the projects behind these units need the results as well. Core16 units are just no joy to run...

One additional observation:
The AMD driver does not object to the RDP protocol of Remote Desktop; it can be used. (The NVidia driver allows access via RDP as well, but does not pick up the next GPU work package unless restarted.)

Andy
 
Welcome to the fold.
As far as which client to go with... I like the "both/and" option.

Intel 4P and core17 folding both hold promise.
AMD 4P has excellent price/perf, but if I built now I would want to go AD (Abu Dhabi) due to newer instruction sets and lower power consumption.

Thanks Patriot :)

With regards to your AMD recommendation, I'd like to ask a question for my better understanding. The Abu Dhabi series (63xx) is not referenced in the SMP/bigadv PPD spreadsheet, not even once. Is there a special reason why the most modern Opteron architecture is not represented in the folding ranks? Given that many of the 4P pros here use AMD based systems, their focus seems to be on the 62xx and 61xx platforms, but not on the 63xx series.

I'd welcome some insights to better understand why this is the case (or I might have overlooked some available info sources...)

Thanks,
Andy
 
price/perf is worse on the current gen.
However... there are instruction sets that should be taken advantage of in the future, which 6200 and 6300 have whilst 6100 is left out.

It takes 16 6200 cores to match 12 6100 cores at the same clock rate.
6300 has about a 5% clock-for-clock advantage over 6200 for folding.
Both 6200 and 6300 are nicer on power usage.

So 6100, clock for clock, is better... currently.
However, as you can see... there is still no comparison to Intel clock for clock.
http://hwbot.org/benchmark/cinebench_r11.5/
 
Congrats, you are the 48p record holder :D

So the architectural change in BD and PD (one shared FP unit per 2 integer cores) doesn't work as effectively for folding code as the 61xx series?

Just checking if I understood this correctly:
However... there are instruction sets that should be taken advantage of in the future that 6200 and 6300 do not have.
Are you referring to 6100 and 6200, or do you really mean 6200 and 6300?
(I would have assumed that the latest architecture has the newest instruction set)

BTW, here is a good description of the Haswell microarchitecture:
http://www.realworldtech.com/haswell-cpu/
 
I built a similar system to your HD 7970 rigs. I have a Gigabyte UD7 motherboard I found used to allow me to run three Gigabyte cards with that spacing. With a very high workload (LTC mining) the fans on these coolers barely run fast enough at max. I have 4x 2000rpm 120mm fans on the side panel and back of the hard drive cage blowing air at the GPUs and 2x 3000rpm 120mm fans on the end of the GPUs (covering the video outs) exhausting and the cards were still hitting 86 C. The fans top out at about 3300rpm whereas the stock 7970 cooler will get much louder and spin up to 5200rpm at 100%. Folding@home power usage is significantly lower and the Gigabyte fans are powerful enough to cool the 7970s for folding.

Nice setups. Your high performance I/O stuff is very off topic but really interesting, so I like that you're posting it here.
 
Edited derp...
AVX added in 6200 series...
6300 series has a touch better ipc and clocking.
 
Welcome to [H], sounds like you'll fit in nicely around here.
 
Thanks for the nice welcome folks, much appreciated.

Some new experiences (at least for me) since last weekend, when I switched teams.

1) The switchover with the FAHControl clients (v7.3.6) was seamless; I just changed the team ID to 33. With the SMP clients (std [H] setup), changing the team ID in the client.cfg file seems to be a transient, not a permanent, change. All the bigadv WUs are still on team 0. Bummer. I travelled to China on short notice, and potentially due to the Great Chinese Firewall, I currently can't remote into my network to check the cause.

2) Running 7663 WUs, the dual AMD 7970 systems deliver a solid 200k PPD (no OC) and the dual Titan system is in the 300-310k range - seemingly 50% more. The actual difference in sustainable daily production is higher. The 7970 GPUs quite often pick up core16 workloads, which have lackluster PPD (4-5k), while the Titans basically run only 7663 units. So the daily production of the 4 AMD and 2 Titan cards is between 560-650k points (out of a theoretical max of 700k).

3) The 2 bigadv systems get either 8101 or 8105 WUs; the 8105 units have about 30% higher PPD on the dual Xeon systems. On average, the 2 bigadv systems (2x E5-2687W and 2x E5-2650) currently deliver a bit more per day than the 6 GPUs. If the AMD cards only got 7663-like WUs, it would be a level playing field.

4) Adding my two "productions" (team 0 and team 33) together, it looks like a 1000-1300k PPD production rate is possible. I'll be back next weekend and will finally bring all systems over to team 33.

5) One question on the EOC stats page. The 24-hr avg rate seems to take a longer time period into account than just the expected "I joined team 33 today, let's start over". Since I joined team 33, my "last 24hrs" production rate has always been above 500k, but the "avg 24hrs" rate slowly climbed from 100k to 250k. Any pointer to the way it is calculated is welcome.

Brilong,
I haven't forgotten the promised TCO data; I will try to share it later this week. Some sites aren't accessible here - for instance, the photo sharing site I normally use (www.pbase.com) isn't reachable here in Beijing.

cheers,
Andy
 
EOC ppd average is based on the last 7 days of production.
So you have a bunch of 0s being averaged in currently.
 
Nice setups. Your high performance I/O stuff is very off topic but really interesting, so I like that you're posting it here.

There is a distributed computing element in this play.

With "control intensive" parallel applications (FAH is a compute intensive parallel application) the scaling is much harder to achieve. My goal with the I/O system is to achieve with a single system the performance of a 400 node cluster in a particular workload. I am not there yet, but the current 70% level achieved look like I can possible find the remaining 30% over time. Anyway, its fun.

Andy
 
I looked for a cheaper motherboard with 2x PCIe 3.0 x16 slots (physical slots) that split electrically into 2x PCIe 3.0 x8 connections. The IB and SB desktop CPUs provide 16 PCIe 3.0 lanes plus the DMI interface to the PCH chip. There are many motherboards out there which have 2 physical x16 slots, but where the second slot is connected via this DMI link - a very unbalanced system design for this purpose. A better solution is a motherboard which routes all 16 lanes to PCI slot 1 if this is the only one occupied, and splits them into 2x8 when both slots are occupied, leaving all the DMI capacity for the rest of the I/O (SSD, LAN, USB, ...)


One additional observation:
The AMD driver does not object to the RDP protocol of Remote Desktop; it can be used. (The NVidia driver allows access via RDP as well, but does not pick up the next GPU work package unless restarted.)

Andy

These are very interesting points. I especially like the AMD driver/RDP tidbit, as I used to RDP into my server but now have a wireless mouse and an old LCD hooked up in case I need access. I wonder if this also holds true for other users without admin privileges playing games, 'cause this has the same effect on nVidia folding as well. I just so happen to have picked up an HD 7770 the other day. Time for some tests.

Thanks for pitching in and welcome to team 33.
 
I'd welcome some insights to better understand why this is the case (or I might have overlooked some available info sources...)

To me, the obvious reason is price. I paid $400 shipped for 4x 6166HE's (1.7GHz Dodecas). To get a 16 core 62xx chip, I'd be looking at significantly more cash for similar PPD and higher power usage.

To close the system description, some points on the AMD 7970 setup.

I was pleasantly surprised by the cooling solution of the Titans. Even in quad packs, they are very effective at getting the heat out of the system case.

The only AMD 7970 GE cards in town were Gigabyte 7970 cards. With its 3 fans, the noise level of a single card is very low, and temperatures are well controlled even with stress tests like FurMark. Things start to change when a second card is added. If the 2 cards sit side by side, the one whose air intakes are "covered" by the second card can't maintain reasonable temperatures at all; 100C is often the result. The second difference to the Titans: the hot exhaust air is dumped into the case, where a second set of fans is required to cool the case down.

I have the Gigabyte card as well, but the non-GE version clocked at 1GHz, as the inner card in my main rig. I fixed the temperature issue from it being sandwiched by upping the case exhaust fan speed from low to medium. The higher airflow dropped it from 90C+ at 100% GPU fan down into the upper 70C range with less fan usage.
 
Andy, thanks for joining AND posting. I'm enjoying your observations and I'm learning!
 
Welcome!

Nice to have you with us!

Very interesting posts!
 
looks like AndyE switched over to team 33, already on my threat list...
 
wow, thanks for the data. We can go forward and learn from your workloads!

Nice stuff man!
 
A good opportunity for a quick recap - one week passed.

I am currently not able to access all log files, but some interesting patterns seem to emerge. Due to the above-mentioned issue that the 2 bigadv systems did not join team 33, a separation of the data is possible.

Out of the 9.8 mio maximum points per week my setup could potentially achieve with the best work units, 8.1 mio points were actually produced within 7 days (= 83% "efficiency"). System admin effort was zero (involuntarily, since no VPN connection from China back home was possible).

The 2 bigadv systems produced 4.3 mio points and the 3 smaller systems (with 2 GPUs each) totalled 3.8 mio points. If the early patterns observed continued, then the one system with the 2 Titans contributed 2 mio points and the 2 systems with 2x 7970 each delivered a combined 1.8 mio points (I'll check the final results when home). The major cause for the below-average utilization of the 7970s was the significant impact of the low-PPD core16 WUs, which also take very long to execute. When a 7970 picks up one of these WUs, its daily production is far off the 100k it is able to deliver with 7663 WUs. The Titans a) got fewer pre-core17 WUs, and b) when they got one, these WUs still had 60k PPD - consequently less impact on daily production.

On energy consumption:
1) Of the 2 groups, the 2 bigadv systems had the highest total PPD output over the week and the lowest power consumption. A clear victory for these systems on energy efficiency.
2) Within the GPU camp, the Titans beat the 7970s by a wide margin, much more than a comparison of their respective peak performance would suggest - primarily caused by the very low performing core16 WUs driving down the sustainable PPD/W of the 7970s.

A power budget of 750 watt achieved 4.3 mio points/week with the 2 bigadv systems, while the 2 systems with the 4x 7970 attained 1.8 mio points. In a "perfect" world of only 7663 WUs, the 4 AMD cards should be good for 2.8 mio points/week, so I "lost" 1 mio in this week of observation.

Caveat/disclaimer: core17 is still in beta, and all numbers are subject to change when the final version is released.

The system with the Titans was a consistent performer, but it is still not optimal. Because the driver keeps the CPU 100% busy with polling, the higher CPU load more than offsets the good energy efficiency of the Titan cards themselves. I will check whether the 2 cards can be fed by a low power Celeron with a 35W TDP as well, to save against the 77W TDP of the i5-3470 currently in the system.

Note2Self: I might replace the 4x 7970 cards with the 4x Titans for a week. If the power envelope stays at the 380W level, this setup might produce up to 4.2 mio points/week, which is very close to the 2 bigadv systems' 4.3 mio weekly rate at the same power level.

More opportunity for better insight when I am back home in a few days.
 
These are very interesting points. I especially like the AMD driver/RDP tidbit, as I used to RDP into my server but now have a wireless mouse and an old LCD hooked up in case I need access. I wonder if this also holds true for other users without admin privileges playing games, 'cause this has the same effect on nVidia folding as well. I just so happen to have picked up an HD 7770 the other day. Time for some tests.

Confirmed: standard users logging in to a PC running F@H v7 with my HD 7770 do NOT interrupt the folding on the GPU. Folding slows (of course) but does not stop after the WU is finished, and it rolls over to the next one without intervention. This is something which has always plagued me with NVidia GPUs when someone else without admin privileges logs on to the PC. I made so many attempts to work around this it's not funny, and I just accepted it as a way of folding life, but it's obviously a driver issue. Kudos to AMD. AMD GPUs are now far more compelling, especially for multi-user rigs in the farm, -er, garden :)
 
Confirmed: standard users logging in to a PC running F@H v7 with my HD 7770 do NOT interrupt the folding on the GPU. Folding slows (of course) but does not stop after the WU is finished, and it rolls over to the next one without intervention. This is something which has always plagued me with NVidia GPUs when someone else without admin privileges logs on to the PC. I made so many attempts to work around this it's not funny, and I just accepted it as a way of folding life, but it's obviously a driver issue. Kudos to AMD. AMD GPUs are now far more compelling, especially for multi-user rigs in the farm, -er, garden :)

Any idea what would happen if you had both an AMD and an Nvidia GPU in the same rig?

I'm curious whether there would be a way to circumvent the Nvidia RDP driver issue by having an AMD card in the system in some form or fashion.
 
Here is a quick back-of-the-envelope calculation of folding costs over 3 years for 3 systems with an identical base (case, CPU, MB, PSU, memory for 220 Euro) and 2 GPUs per system.

The PPD values are based on my (identical) 2 GPU setups:
1) Titan 165k ppd
2) GTX780 140k ppd
3) AMD 7970 GHz Edition: 100k ppd

Power consumption for these ppds (base system with 2 GPUs):
1) Titan: 440 watt
2) GTX780: 430 watt
3) 7970: 380 watt

Usually, only GPU prices are considered, forgetting about the remaining system cost; i.e. the newegg price per GPU does not include the cost of the remaining components. In my case the rest is 220 Euro (case, mem, CPU, PSU, MB), adding 110 Euro per GPU. This impacts the cheapest GPU more than the more expensive ones.

Prices in Euro (bare GPU / fully loaded)
Titan: 1000 / 1110
780: 650 / 760
7970: 400 / 510

Energy is 0.16 Euro / kWh

TCO: Total cost of ownership over 3 years (system with 2 GPUs and energy):
Titan: 4070 Euro
780: 3328 Euro
7970: 2618 Euro

Production (WU 7663 as base over 3 years)
Titan: 361mio points
780: 307mio points
7970: 219mio points


energy efficiency (points produced per Euro spent on energy)
Titan: 195k points / Euro
780: 170k points / Euro
7970: 137k points / Euro

cost efficiency (total points produced / TCO in Euro)
Titan: 88.8k points / Euro
780: 92.1k points / Euro
7970: 83.7k points / Euro

Disclaimer: many simplifying assumptions apply (utilization, energy cost stability, etc...)
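For anyone who wants to play with the assumptions, here is a small script that reproduces the numbers above from the stated inputs; the only interpretation added on my side is that the "energy efficiency" rows work out to points per Euro spent on energy.

```python
# Reproduces the 3-year TCO comparison above (2 GPUs per system, 24/7 folding).

HOURS_3Y = 3 * 365 * 24
EUR_PER_KWH = 0.16
BASE_SYSTEM_EUR = 220          # case, CPU, MB, PSU, memory
GPUS_PER_SYSTEM = 2

# name: (price per GPU in Euro, PPD per GPU, system wall power in watt)
SETUPS = {
    "Titan":   (1000, 165_000, 440),
    "GTX 780": ( 650, 140_000, 430),
    "HD 7970": ( 400, 100_000, 380),
}

for name, (gpu_eur, ppd_per_gpu, watts) in SETUPS.items():
    hardware_eur = BASE_SYSTEM_EUR + GPUS_PER_SYSTEM * gpu_eur
    energy_eur = watts / 1000 * HOURS_3Y * EUR_PER_KWH
    tco_eur = hardware_eur + energy_eur
    points_3y = GPUS_PER_SYSTEM * ppd_per_gpu * 3 * 365
    energy_eur_per_day = watts / 1000 * 24 * EUR_PER_KWH
    print(f"{name}: TCO {tco_eur:,.0f} EUR, production {points_3y / 1e6:.0f} mio points, "
          f"{GPUS_PER_SYSTEM * ppd_per_gpu / energy_eur_per_day:,.0f} points per energy-Euro, "
          f"{points_3y / tco_eur:,.0f} points per TCO-Euro")
```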
More to come

rgds,
Andy
 
[*]I deferred the decision of which team to join until the end. Going forward I'll switch from team 0 to 33. Given what I have seen from brilong, Grandpa and others, I will not be a top producer on this team, but I want to give back to this community by adding my production to your team result.

Andy


Wrong.

1.4 million PPD is a top producer!

Nice to have you with us!
 
Wrong.
1.4 million PPD is a top producer!
Nice to have you with us!
Thank you.

It just happened that I crossed the 20 mio bar for [H] today :D
[screenshot: EOC stats showing 20 mio points for [H]]


The interesting things for me in this week and last week:

1) The strong interaction between folding stability and overclocking
Especially with the GPU systems, cranking up the frequency seems to increase the effort of caring for the systems exponentially. As I need to travel quite a bit, my ability to reset systems and bring them back without killing WUs is limited. So I reduced the clocks on my GPUs by 7-8% below what could be seen as a stable frequency, to avoid any interruptions. I am not fully there yet, but in the last 5 days only one GPU had one hiccup, which I could fix easily (just restarting). The critical component with regards to stability of the folding systems seems to be the GPU driver. Seen from an operating system perspective, this software is almost a "black box": there is not much the OS can do about the end-to-end interaction between the base hardware of the system, the interface to the GPU and the application running on the GPU. It would be very nice if, in the upcoming world of heterogeneous computing (multiple CPU types in one system), the user had the ability to treat his system like he used to treat a simple computer. Still some miles to go.

Anyway, by reducing the GPU speed, my daily point production went up. Fewer outages, no QRB hits, just smooth folding.

Looking at the last few days, the average run rate has become flatter and flatter. The spike of 1.72m PPD is when the bigadv system delivers 2 WUs in a single day (due to the 22-hr time to completion).
[image: daily production chart]


2) Looking at the output, the 8900 WUs are a boon to the Titans and the GTX 780 GPUs. When the GPUs moved from 7663 to 8900, the impact was significant. It will be interesting to see whether the PG dev team can continue this kind of improvement with the next iterations of OpenCL based WUs.

3) The QRB system in core17 changes the economic considerations for GPUs.
The GTX 780 is only 14% slower than the Titan (based on TPF), yet the daily PPD of the Titan is 25% higher (thanks to the QRB). It is still not enough to justify an all-Titan setup, but the QRB bonus model has imho significant implications for mid- and lower-end cards: a card half the speed of a fast card will get only about 1/3 of the PPD. Traditionally, people looked at TPF values to compare GPU performance; with core17 WUs this might lead people to wrong conclusions about the PPD they will be able to deliver.
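A quick check of the "half the speed, a third of the points" rule, using the same assumed QRB shape (PPD scaling with speed to the power of 1.5) as in the earlier sketches.

```python
# Relative PPD for a card running at a fraction of a reference card's speed,
# assuming PPD ~ speed**1.5 while the QRB bonus is above its floor.
for relative_speed in (1.0, 0.86, 0.5):
    print(f"speed {relative_speed:.2f}x -> {relative_speed ** 1.5:.0%} of the PPD")
# 0.86x (roughly a GTX 780 vs. a Titan) -> ~80%, 0.5x -> ~35%, i.e. about one third.
```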

4) We are entering the era of big data (just kidding)
What I would like to do is collect each and every entry of all my log files for later analysis.
For instance, I can't (automatically) see what my *delivered* WUs actually produced, or run an analysis over 20,000 WUs (it will take me a long time to get there) to spot patterns that the log sequence of a single WU would not reveal.
Question: is there an option to collect all the log file entries of all my systems in a single place? I know I can copy my log files, but I'd welcome an automatic setup where I don't have to care about this (i.e. avoiding duplicate entries, ...); a rough sketch of the idea follows after this list.

To sum it up: I want to see patterns of behavior in my systems, like delays at certain moments which might happen only under certain conditions - not only transactional log data for single WUs, etc, etc...

5) Still no 4P system in place :-( (need to be fixed :) )
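The rough sketch mentioned under point 4: pull the client log from each machine and append only lines not archived before. The mount points and file locations below are placeholders for however the remote FAH directories are reachable, and deduplication by content means genuinely repeated identical lines from the same host are kept only once; a tool like HFM.NET or a proper log shipper would be the more complete answer.

```python
import hashlib
from pathlib import Path

# Hypothetical mount points of the remote FAH client directories.
SOURCES = {
    "bigadv-1":  Path("/mnt/bigadv1/fahclient/log.txt"),
    "titan-box": Path("/mnt/titanbox/fahclient/log.txt"),
}
ARCHIVE = Path("fah-logs-combined.txt")   # all lines, prefixed with the host name
SEEN_DB = Path("fah-logs-seen.txt")       # hashes of lines already archived

def collect() -> None:
    seen = set(SEEN_DB.read_text().splitlines()) if SEEN_DB.exists() else set()
    with ARCHIVE.open("a") as archive, SEEN_DB.open("a") as seen_db:
        for host, logfile in SOURCES.items():
            if not logfile.exists():
                continue                  # machine offline or share not mounted
            for line in logfile.read_text(errors="replace").splitlines():
                key = hashlib.sha1(f"{host}|{line}".encode()).hexdigest()
                if key in seen:
                    continue              # already archived on a previous run
                seen.add(key)
                archive.write(f"{host} {line}\n")
                seen_db.write(key + "\n")

if __name__ == "__main__":
    collect()   # run periodically, e.g. from cron or Task Scheduler
```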

cheers,
Andy
 
HFM.NET may or may not do the collection you're looking for... I use it on my boxen to keep track of things and it does collect logs, I just don't know how long it retains them for...
 