This thread has two purposes. First, I would like to know if anyone has a good method for stress-testing NFS servers. I'm trying to find a way to test a server that only shows problems under high load (a large number of connections and read/write operations), without putting it back into production.
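Roughly the direction I'm thinking so far: a bunch of worker processes hammering an NFS mount with writes, fsyncs, and reads, run from several client machines at the same time. The mount point, worker count, and file sizes below are just placeholders I'd tune to look like the real morning login load.

```python
#!/usr/bin/env python3
"""Rough NFS stress sketch: many workers doing concurrent writes, fsyncs,
and reads on an NFS mount. Run from several client machines at once.
MOUNT, WORKERS, FILE_SIZE, ROUNDS are placeholders -- tune to taste."""

import os
import random
from concurrent.futures import ProcessPoolExecutor

MOUNT = "/mnt/nfs-test"      # placeholder: wherever the test export is mounted
WORKERS = 64                 # worker processes per client machine
FILE_SIZE = 4 * 1024 * 1024  # 4 MiB per file
ROUNDS = 50                  # write/read cycles per worker

def worker(wid: int) -> None:
    path = os.path.join(MOUNT, f"stress_{os.getpid()}_{wid}.dat")
    for _ in range(ROUNDS):
        # write a file of random data and force it out to the server
        data = os.urandom(FILE_SIZE)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # read it back in small chunks to generate lots of NFS READ ops
        with open(path, "rb") as f:
            while f.read(random.choice((4096, 65536))):
                pass
    os.unlink(path)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(worker, range(WORKERS)))
```

I know that's crude next to dedicated tools like fio or iozone, but it's easy to copy onto a handful of the existing clients, so I'm open to better ideas.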
Second, I'm going to try to describe the problem below; if anyone wants to toss some ideas my way, that would be fantastic.
One of my projects just blew up in my face and I have no idea why yet. I do some work for a school that operates Linux terminal services with about 1500 clients. They have a single "home" server that handles NIS authentication and shares out the users' home directories (/home) via NFS. That server is about 15 years old and has a bad backplane, so I built them a new one with a modern OS (Ubuntu 14.04).
It worked fine until everyone started logging on in the morning; then it crashed and burned. The load average spiked to over 100 and users were reporting unbearably slow performance. The odd thing is that with a load average over 100, the CPU was 98% idle and iowait was very low, so the processes were not waiting on the CPU or on the drives. I was logged into the server via SSH and saw none of the sluggish behavior I would expect from a server under that much load. There were kworker threads hitting 100% CPU usage, according to top. Additionally, the loopback interface was intermittently seeing gigabits of data being "dumped" to it.
That's really all I have to go on; there were no problems reported in the logs at all.
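Next time it happens I want to capture the load average, the kworker CPU time, and the loopback traffic on one timeline instead of eyeballing top. Something along these lines is what I have in mind; it only reads the standard /proc files, and the interval and output format are arbitrary:

```python
#!/usr/bin/env python3
"""Sample load average, kworker CPU time, and loopback traffic from /proc
every few seconds, so the three symptoms can be lined up on one timeline."""

import os
import time

INTERVAL = 5  # seconds between samples (placeholder)

def loadavg_1min() -> float:
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def kworker_jiffies() -> int:
    """Total utime+stime (in clock ticks) of all kworker kernel threads."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                raw = f.read()
        except OSError:
            continue  # process exited between listdir and open
        comm = raw[raw.index("(") + 1:raw.rindex(")")]
        if not comm.startswith("kworker"):
            continue
        fields = raw[raw.rindex(")") + 2:].split()
        total += int(fields[11]) + int(fields[12])  # utime + stime
    return total

def lo_bytes() -> int:
    """Bytes received on the loopback interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith("lo:"):
                return int(line.split(":")[1].split()[0])
    return 0

if __name__ == "__main__":
    prev_kw, prev_lo = kworker_jiffies(), lo_bytes()
    while True:  # Ctrl+C to stop
        time.sleep(INTERVAL)
        kw, lo = kworker_jiffies(), lo_bytes()
        print(f"load1={loadavg_1min():6.2f}  "
              f"kworker_ticks/s={(kw - prev_kw) / INTERVAL:8.1f}  "
              f"lo_MB/s={(lo - prev_lo) / INTERVAL / 1e6:8.2f}")
        prev_kw, prev_lo = kw, lo
```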