Major new folder around

His post on their forums:

So I am breaking in three racks; each rack has 4 chassis, each chassis has 14 blades. Each blade is a dual-processor machine with 6 cores per processor and 36 GB of RAM. I will probably stop running the folding first thing Monday, as I am just trying to provoke any remaining failures. About 26 out of the 168 have an issue of some sort, and the supplier will be in at some point to fix them. Each chassis is drawing about 3 kW right now.

That's some serious hardware.
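For scale, a quick back-of-envelope on those figures (a rough Python sketch; the totals are my arithmetic, not numbers he posted):

[CODE]
# Fleet totals from the quoted post: 3 racks x 4 chassis x 14 blades,
# dual 6-core CPUs and 36 GB RAM per blade, ~3 kW per chassis.
racks, chassis_per_rack, blades_per_chassis = 3, 4, 14
cores_per_blade = 2 * 6
ram_per_blade_gb = 36
power_per_chassis_kw = 3.0

chassis = racks * chassis_per_rack
blades = chassis * blades_per_chassis

print(f"blades: {blades}")                                                     # 168
print(f"cores:  {blades * cores_per_blade}")                                   # 2016
print(f"RAM:    {blades * ram_per_blade_gb} GB")                               # 6048 GB
print(f"power:  {chassis * power_per_chassis_kw:.0f} kW across all chassis")   # ~36 kW
[/CODE]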
 
So each blade is putting out around 38kppd. I want to see bigadv numbers with Linux and the wrapper. 10 million points a day?
 
So each blade is putting out around 38kppd. I want to see bigadv numbers with Linux and the wrapper. 10 million points a day?

He said each blade is a dual-hex, but that number sounds low for L5640s at stock clocks, and it's gonna be something pretty low-powered if it's stuffed in a blade... I wonder what they are...
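A quick sanity check on the 10-million-a-day question, taking the 38 kppd figure at face value (my arithmetic only, a sketch):

[CODE]
# Rough total PPD at regular SMP rates. The ~26 problem blades pull the
# realistic total down further; reaching 10M would need bigadv-level output.
blades = 3 * 4 * 14          # 168 blades total
ppd_per_blade = 38_000       # "around 38kppd" per blade

print(f"all blades:     ~{blades * ppd_per_blade / 1e6:.1f} M PPD")         # ~6.4
print(f"healthy blades: ~{(blades - 26) * ppd_per_blade / 1e6:.1f} M PPD")  # ~5.4
[/CODE]

So roughly 6.4M PPD at those rates; 10 million a day would indeed take bigadv-sized points per blade.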
 
Holy mother of god

[attached image: mog.jpg]
 
I don't believe I've ever seen anyone produce that much before. It must be some kind of record. He has literally doubled his team's output and CPC&BT is a team in the top 10, which is a very respectable standing in its own right. Too bad it's only short term, but wow... :eek:
 
Scary thing is, I don't think he's running bigadv. Too many WUs for that number of points.
 
It's a great way to stress-test hardware before it goes live, and in a massive blade form factor environment like that it's critical to get the failures known as early as possible; otherwise you end up chasing phantoms and ghosts for months before you learn the root cause.

15% failure in hardware isn't as bad as some may think.
 
15% failure in hardware isn't as bad as some may think.
From what I've heard it is quite the norm. I think the typically expected failure rate is just above 10% or thereabouts. CMIIR.
 
wtf, that's a lot of ppd right there. damn.
Imagine if the likes of Dell and HP did this for their burn-in testing (Patriot, you know what to do ;)): 48 hours of folding on every server shipped. :eek:

From what I've heard it is quite the norm. I think the typically expected failure rate is just above 10% or thereabouts. CMIIR.
Maybe over 3 years (with drives), but this is within a week, and isn't the idea of blades fewer moving parts (less to go wrong)? No PSUs, no fans, possibly no drives or interface cards <- all the main suspects.
 
Imagine if the likes of Dell and HP did this for their burn-in testing (Patriot, you know what to do ;)): 48 hours of folding on every server shipped. :eek:


Maybe over 3 years (with drives), but this is within a week, and isn't the idea of blades fewer moving parts (less to go wrong)? No PSUs, no fans, possibly no drives or interface cards <- all the main suspects.

um... if we ever peak past 5% we are gonna be pissed...
no, 2% is the 3-year acceptable failure rate... and that is on regular ProLiant... dunno about other manufacturers' blades... for the most part... ours have a pair of drives with a raid card.
if you do double-density blades... 32 blades per enclosure... then you need a 2U storage pool to draw drives from... but yes, you could have 32 dual-hex machines in a 10U space... might need to have all 8 PSUs installed and all 10 fan units... but that would be some horsepower.
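To put a number on that horsepower (a quick sketch; the per-U figure is my own calculation from the numbers above):

[CODE]
# Double-density config described above: 32 dual-hex blades in a 10U enclosure.
blades_per_enclosure = 32
cores_per_blade = 2 * 6
enclosure_height_u = 10

cores = blades_per_enclosure * cores_per_blade
print(f"{cores} cores per enclosure")                        # 384
print(f"{cores / enclosure_height_u:.1f} cores per rack U")  # 38.4
[/CODE]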
 
um... if we ever peak past 5% we are gonna be pissed...
no, 2% is the 3-year acceptable failure rate... and that is on regular ProLiant... dunno about other manufacturers' blades... for the most part... ours have a pair of drives with a raid card.
if you do double-density blades... 32 blades per enclosure... then you need a 2U storage pool to draw drives from... but yes, you could have 32 dual-hex machines in a 10U space... might need to have all 8 PSUs installed and all 10 fan units... but that would be some horsepower.

Normally I would agree that 2% should be the failure rate, but over the past year I've seen the failure rate of the HP blades go through the roof, while the IBM blades have been rock solid at a 0% failure rate.

Dunno what the difference in supply is, but HP seems to be cutting corners.
 
Normally I would agree that 2% should be the failure rate, but over the past year I've seen the failure rate of the HP blades go through the roof, while the IBM blades have been rock solid at a 0% failure rate.

Dunno what the difference in supply is, but HP seems to be cutting corners.

huh... not had a single production blade crap out on me since I got here 3.5 years ago... did have a x2 DOA: a stupid factory worker mushed the pins on the 771 socket...

were they 200 series by any chance?
'cause then I can see that... ask me on IRC lol...

good news is... we ditched Hurd... focus back on insourcing and R&D... better products take more resources...
 
We've seen the QC issue on both Intel- and AMD-based BL 400 series blades (BL460/BL490), but they had been really solid in reliability up until last May; right around then all the new blades started having a really high failure rate. QC just isn't there on the new ones: memory and motherboard issues on about a third of all the new blades we get in = maddening.
 
We've seen the QC issue on both Intel- and AMD-based BL 400 series blades (BL460/BL490), but they had been really solid in reliability up until last May; right around then all the new blades started having a really high failure rate. QC just isn't there on the new ones: memory and motherboard issues on about a third of all the new blades we get in = maddening.

ouch... yeah, that's pretty bad... the 490 shouldn't have been released as a 4xx series...
have never had a G6 460 go bad on me... well, at least the refocus should help change that...
 
And 168 SMP units have to time out... wish he could have -oneunit'd those clients and taken them offline cleanly instead of pulling the plug.
If he's a new folder he may not be aware of the ramifications. Many long-term folders don't even know why it's not good for the research to terminate WUs before completion. Another possibility is that his work may have required him to shut down these systems within a prescribed period. I don't know if that's the case, but if I had known beforehand that I only had a very limited amount of time to run the systems, I don't think I'd use them this way.
 
I would think he could submit a request to the F@H admins to abort his assigned WUs.

I really like that other DC projects, such as GIMPS and projects using BOINC, make it very easy to abort WUs so they can be immediately reassigned. For a project as popular as F@H, it has some of the worst client software and infrastructure of any DC project.
 
I really like that other DC projects, such as GIMPS and projects using BOINC, make it very easy to abort WUs so they can be immediately reassigned. For a project as popular as F@H, it has some of the worst client software and infrastructure of any DC project.
I totally agree, and to think this project has been in operation for a decade...

In any case, one possible reason F@H may not have implemented an abort feature is that it could be abused to dump WUs in hopes of being assigned more productive ones, but that opens up a whole other argument about the flawed points structure being at the crux of many controversies over the years, which is outside the scope of this topic.
 
Looks like they are at it again. I wonder what the deal is? Maybe someone is setting up a datacenter or a computing cluster. Interesting!

Edit: More info here.
 
And he may not be using all the available threads - if those chips do hyperthreading, he has 24 available threads per blade but is only using 12...

H.
 
If it's for a computing cluster, then HT is (usually) turned off for accurate reporting to the head node, which would account for the -smp 12. Perhaps he could turn on HT on all the blades via a script or something, while folding, and then turn it back off afterward.

Plus, while it's good for the project, I do hope he's not putting out those numbers long term haha. :)
 
Man, at 800K he's averaging more points per update than each of the next three teams bit-tech is trying to catch (#6, #5, #4). I'm guessing he won't be able to keep this experiment going much longer though.
 
800K per update! That's insane. Hell, his internet upstream bandwidth is probably bottlenecking his output!
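For what it's worth, the per-update numbers line up with the earlier per-blade estimate (a sketch; the 3-hour stats update cycle is my assumption):

[CODE]
# ~800K points per stats update; most stats sites update every 3 hours,
# i.e. 8 updates per day (assumption on my part).
points_per_update = 800_000
updates_per_day = 24 // 3

print(f"~{points_per_update * updates_per_day / 1e6:.1f} M PPD")  # ~6.4 M PPD
[/CODE]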
 
Plus, while it's good for the project, I do hope he's not putting out those numbers long term haha. :)
Those numbers are pretty scary indeed. It's a lot of science for sure, and we can use the competition from lower-ranked teams to keep us motivated. Although I've seen people hit above the 1M mark before, I don't recall anyone producing this much. :cool:
 