Probably useless syscalls used by the FAH client

quickz · Aug 12, 2012

Has anyone noticed that a lot of the CPU time used by FAH is spent in system (kernel) mode and is probably just being wasted?

Let's take a look at typical top output while FAH is running (on a 12C24T dual Xeon E5645):

Code:

top - 14:07:51 up 19 days, 23:49,  3 users,  load average: 22.45, 8.64, 3.37
Tasks: 170 total,   1 running, 169 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  6.7%sy, 93.3%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  49554076k total, 17380604k used, 32173472k free,   563156k buffers
Swap:        0k total,        0k used,        0k free, 14413228k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32242 root      39  19 3323m 1.8g 2988 S 2396.2  3.7  34:56.11 thekraken-FahCo
28625 root      20   0     0    0    0 S  0.3  0.0   1:28.69 kworker/0:2
    1 root      20   0 19164 1464 1216 S  0.0  0.0   0:09.42 init

Here the sy figure is 6.7%, and it generally varies between 5% and 10%.

The sy number is the percentage of CPU time spent in kernel mode, which here means executing system calls. In my opinion that is far too high for FAH: it is a program doing pure floating-point calculation, so I would expect something as low as 0.0%-0.1%.

So I tried to find out exactly which syscall the FAH client is making. Using strace, I found that the answer is sched_yield(), which every FAH thread calls several thousand times per second.

sched_yield() is a syscall that causes the calling thread to relinquish the CPU. I don't think it is necessary for FAH, since we usually run only one FAH thread per CPU core and there is no need to relinquish the CPU at all.

Since thekraken is already able to handle the syscalls and signals coming from the FAH client, would it be possible to develop something like thekraken that eliminates all these probably useless sched_yield() calls?
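To put a rough number on what each of those calls costs, here is a minimal sketch that times a tight loop of sched_yield() calls. The function name yield_cost_ns and the iteration count are made up for illustration; nothing here comes from the FAH client itself, and the result will vary with kernel, scheduler and hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <time.h>

/* Average wall-clock cost of one sched_yield() call, in nanoseconds.
   Returns a negative value if the arguments are invalid or the clock
   cannot be read. (Hypothetical helper, not part of any FAH tool.) */
static double yield_cost_ns(long iterations)
{
    struct timespec start, end;

    if (iterations <= 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    /* Each call enters the kernel even when there is nothing to yield to. */
    for (long i = 0; i < iterations; i++)
        sched_yield();

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    double elapsed = (double)(end.tv_sec - start.tv_sec) * 1e9
                   + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed / (double)iterations;
}
```

Calling yield_cost_ns(1000000) from a small main() and running the program under time should show most of the elapsed time accounted as sys, which is the same effect top reports for the FAH threads.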

fastgeek · Aug 12, 2012

Way over my head; but if some clever bugger figures out a way to make our TPF's go down slightly, then more power to them!

tear · Aug 12, 2012

Yeah, these are boatloads of sched_yield(); technically they are totally unnecessary.

Problem is that GROMACS implemented its own synchronization primitives and
implemented them pretty badly -- here's one example (the most frequently executed
sched_yield).

TMPI_YIELD_WAIT(ev) is a macro that gets (on Linux) expanded to sched_yield() -- FYI.

Code:

int tMPI_Event_wait(tMPI_Event *ev) 
{
    int ret;
    /* for most OSes yielding waits result in much better performance 
       (by an order of magnitude) than using the OS-provided wait functions 
       such as pthread_cond_wait(). That's why we do a busy-wait loop here.*/
    while( (tMPI_Atomic_get(&(ev->sync)) - (ev->last_sync)) <= 0 )
    { 
        TMPI_YIELD_WAIT(ev);
    }
    tMPI_Atomic_memory_barrier_acq(); 
    ret=tMPI_Atomic_get(&(ev->sync)) - (ev->last_sync); 
    return ret;
}

First off, I'm doubting the claim in the comment.

Second, using sched_yield() is generally a bad idea.
The programmer has no way of knowing what the scheduler is going to do, or when
(there's no guarantee that some _other_ thread will be scheduled, or of _when_ the
reschedule will occur).
Plus, what's more important, and what you've observed: on multi-processor machines
where each worker thread gets to "own" a processor, these sched_yield() invocations
offer even less benefit -- if you're the only running thread on a processor there's no
other thread to yield to [!] -- all worker threads are executing simultaneously.

Third, there are a few other issues with this code:
- ev->sync is externally modified (by the thread that "signals" us) ...
- ... but it's not marked as volatile, so the compiler could load it into a register
 just once (against the programmer's intent -- we'd never exit the loop); luckily for
 GROMACS we haven't seen it happen... yet
- also, I don't see how the memory barrier can help, as the loop's exit condition has
 already been met
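For comparison, here is a minimal sketch of the same yield-wait loop written with C11 atomics, assuming a compiler that provides <stdatomic.h>. The names event_t, event_init, event_signal and event_wait are hypothetical and are not the GROMACS thread_mpi API. The point is only that an atomic load has to be re-read on every iteration, so the compiler cannot hoist it out of the loop, and that acquire ordering on that load pairs with the release in the signalling thread, which orders the waiter's later reads of shared data after the signal it observed.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical event type: 'sync' is written by signalling threads,
   'last_sync' is private to the waiting thread. */
typedef struct {
    atomic_int sync;
    int last_sync;
} event_t;

static void event_init(event_t *ev)
{
    atomic_init(&ev->sync, 0);
    ev->last_sync = 0;
}

/* Called by the thread that "signals" the waiter. */
static void event_signal(event_t *ev)
{
    atomic_fetch_add_explicit(&ev->sync, 1, memory_order_release);
}

/* Yield-wait until at least one signal has arrived; returns the number
   of signals not yet accounted for in last_sync. */
static int event_wait(event_t *ev)
{
    int now;

    /* The atomic load is performed on every pass through the loop. */
    while ((now = atomic_load_explicit(&ev->sync, memory_order_acquire))
           - ev->last_sync <= 0) {
        sched_yield();
    }
    return now - ev->last_sync;
}
```

This keeps the busy-wait design, including the sched_yield() calls, and only changes how the shared counter is read.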

That said...

I did an experiment back in the day -- turned sched_yield() into a no-op (at kernel
level -- could probably dig out a patch for you, if you'd like) but it didn't noticeably
affect the performance.

I suspect this is because the "yield loop" is called when the calling thread is actually
(busy-loop-with-lotsa-sched_yields-)waiting for another thread...
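A user-space way to try a similar no-op experiment, without patching the kernel, is an LD_PRELOAD shim. This is a different technique from the kernel patch mentioned above, and it rests on an assumption this thread does not establish: that the target binary is dynamically linked and calls sched_yield() through libc. For a statically linked binary the shim would have no effect. The file name noyield.c is made up for illustration.

```c
#include <sched.h>

/* Replacement for libc's sched_yield(): reports success without
   entering the kernel. When preloaded, the dynamic linker resolves
   calls to sched_yield() to this definition instead of libc's. */
int sched_yield(void)
{
    return 0;
}
```

Built with gcc -shared -fPIC -o noyield.so noyield.c and loaded with LD_PRELOAD=./noyield.so in front of the command, this turns the yield loop into a pure spin loop. As with the kernel-level experiment, that would move the time from sy to user time rather than make the waiting itself go away.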

quickz · Sep 7, 2012

Thanks for your detailed analysis, tear.
Recently, I found by chance that with a 3.5.3 kernel and the BFS patch, the "sy" CPU percentage drops to zero completely.
I was very excited when I saw this. However, just a few minutes later I was disappointed to find that performance didn't increase at all.
So it looks like this is probably the wrong approach entirely.
