NFS testing and performance

wizdum ([H]ard|Gawd, joined Sep 22, 2010, 1,943 messages)
This thread has two purposes. First, I would like to know if anyone has a good method for stress testing NFS servers. I'm trying to find a way to test a server that only shows problems during periods of high load (a large number of connections and read/write operations), without putting it into production again.

Secondly, I'm going to try to describe the problem below; if anyone wants to toss some ideas my way, that would be fantastic.

One of my projects just blew up in my face and I have no idea why yet. I do some work for a school that operates Linux terminal services with about 1500 clients. They have a single "home" server that handles NIS authentication and shares out the users' /home directories via NFS. That server is about 15 years old and has a bad backplane, so I built them a new one with a modern OS (Ubuntu 14.04).

It worked fine until everyone started logging on in the morning; then it crashed and burned. The load average spiked to over 100 and users were reporting unbearably slow performance. The odd thing is, with a load average over 100, the CPU was 98% idle and iowait was very low, so the processes were not waiting on the CPU or on the drives. I was logged into the server via SSH and saw none of the sluggish behavior I would expect from a server under that much load. There were kworker threads hitting 100% CPU usage according to top, and the loopback interface was intermittently seeing gigabits of data "dumped" to it.

That's really all I have to go on. There were no problems reported in the logs at all.
 
Best advice I can give:

hdparm will test local disks
iperf will test your network
I don't know of any application that benchmarks NFS directly, but most people will do a timed dd copy of a file of a given size and see how long it takes to various devices or with various configurations. You could probably script several hundred or thousand simultaneous dd writes to synthetically slam your system and see how the traffic is handled; see the rough sketch below.
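Something along these lines ought to do it. It's only a rough sketch; the mount point, client count, and file size are placeholders for whatever fits your setup:

#!/bin/bash
# Rough sketch: spawn many concurrent writers against an NFS mount to
# simulate a morning-login style load spike. Adjust values to taste.
MOUNT=/mnt/nfs_test   # placeholder: a client-side mount of the export
CLIENTS=200           # number of simultaneous writers
SIZE_MB=100           # size of each test file in MB

start=$(date +%s)
for i in $(seq 1 "$CLIENTS"); do
    # conv=fdatasync makes dd flush to the server before exiting, so the
    # timing reflects real NFS write-through rather than client caching.
    dd if=/dev/zero of="$MOUNT/stress_$i" bs=1M count="$SIZE_MB" conv=fdatasync 2>/dev/null &
done
wait
end=$(date +%s)
echo "Wrote $((CLIENTS * SIZE_MB)) MB in $((end - start)) seconds"
rm -f "$MOUNT"/stress_*

Run it from several client machines at once if you can; a single client's network link will bottleneck long before 1500 users' worth of traffic does.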

I haven't played much with NFS recently, but I remember it can be extremely picky about block transfer sizes, default file permissions, synchronization, and other network-specific parameters. I wouldn't be surprised if the high load average on your server was it just sitting around waiting on the NFS mount; on Linux, the load average counts not only runnable processes but also processes stuck in uninterruptible sleep waiting on a resource, such as data from a mounted device. Just an aside, but I've also seen full or insufficient swap space be a culprit for high load averages. And more than once I've seen a failing network cable or switch gum things up with random errors, which you wouldn't always think to check otherwise.
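If you want to check that theory the next time the load spikes, nothing NFS-specific is needed, just the standard tools:

# Processes in uninterruptible sleep ("D" state) push the load average up
# without using any CPU; they are usually stuck waiting on disk or NFS I/O.
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'

# And a quick look at whether swap is actually being touched:
free -m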

These threads may be mildly helpful:

http://ubuntuforums.org/archive/index.php/t-1047907.html
http://www.slashroot.in/how-do-linux-nfs-performance-tuning-and-optimization
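For reference, these are the kinds of parameters I mean. The values below are purely illustrative, not tuned recommendations:

# Server side, /etc/exports -- sync vs. async makes a big difference under
# heavy write load (async is faster but loses data if the server crashes):
/home  192.168.0.0/24(rw,sync,no_subtree_check)

# Client side -- block transfer sizes are normally negotiated, but can be
# pinned explicitly; 'hard' keeps clients retrying instead of erroring out:
mount -t nfs -o rsize=32768,wsize=32768,hard server:/home /home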
 
It has been a while since I messed with NFS much, but off the top of my head:
Check the settings in the above links on the old server and duplicate them on the new one as a starting point.
My guess is the biggest culprit is the nfsd thread count.
With 1500 users and only the default 8 threads it is going to be in bad shape, and that would fit the rest of what you said.
Try 8 or 16 times the number of cores across all CPUs as a starting point.
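On Ubuntu that's RPCNFSDCOUNT in /etc/default/nfs-kernel-server; roughly like this (128 is just an example figure):

# /etc/default/nfs-kernel-server
RPCNFSDCOUNT=128

# Apply it:
service nfs-kernel-server restart

# The "th" line shows the current thread count (first number); on older
# kernels the rest of the line is a histogram of how busy the threads were.
grep ^th /proc/net/rpc/nfsd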
 
Supposedly the new NFS server in FreeBSD (v3 and v4, although v4 is much slower than v3) is much faster than the old one and the one in Linux. You probably need to run -HEAD to get the recent fixes, but I've had good experience with -HEAD in general. Keep in mind NFS runs poorly off ZFS if the filesystem's "sync" feature is enabled, so UFS(2) is recommended unless that changed very recently. I have nowhere near that load, but my new server plays nice with Linux clients using NFSv3 as the protocol.
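If you end up on ZFS anyway, the setting I mean is the per-dataset sync property; for example (the pool/dataset name is just a placeholder, and disabling sync trades crash safety for speed, so treat it as a diagnostic only):

zfs get sync tank/home            # "standard" honours client sync requests
zfs set sync=disabled tank/home   # fast but unsafe; for testing only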
//Danne
 
I'd be curious about this as well. There are lots of tools, but without a baseline to know whether the numbers are good or bad, and a way to bring all that information together in real time and historically, it's hard to make anything of it. Something that runs in the background and produces graphs of latency and other important factors for each RAID array would be nice. Perhaps also a way to find out which files hammer the server the most.
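The closest I've found to a usable baseline is just logging the standard sysstat/nfs-utils counters and graphing them afterwards, for example:

# Per-device latency, utilisation and queue depth every 10 seconds:
iostat -x 10 >> /var/log/iostat.log

# Per-mount NFS read/write latency from the client side:
nfsiostat 10

# Server-side RPC counters (calls, badcalls, per-operation mix):
nfsstat -s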

I find my file server just grinds to a halt, and VMs and other data access start to time out, if there's too much going on at the same time, such as backups. Worse is when an mdadm check decides to start; it pretty much takes the whole server down. I've seen the load go as high as 20, even SSH gets slow, and VMs crash one by one due to poor I/O performance.
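For the mdadm check part specifically, you can at least throttle the resync/check speed so it doesn't starve everything else (md0 and the limit are just examples):

# System-wide ceiling for md resync/check speed, in KB/s:
echo 10000 > /proc/sys/dev/raid/speed_limit_max

# Or per array:
echo 10000 > /sys/block/md0/md/sync_speed_max

# Check whether a resync/check is running right now:
cat /proc/mdstat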

I'm also curious whether I'm better off with one large RAID 10 or multiple smaller RAID 10s, and I can't seem to find info on that anywhere.
 
Thanks for the help so far. I have already set the number of threads, as we had that problem with the old server; it's currently set to 128, with a CPU that has 8 cores. iperf tested about 900 Mbps in both directions, so I think the network is fine. I tested performance with dd last time; I will try hdparm and get back to you. Hopefully swap isn't being used, as the server has 24GB of RAM, six times more than the old server had.

I do have the ability to benchmark the old server during off hours.
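My plan is to run the same dd test against both servers, forcing the data to disk so client-side caching doesn't skew the numbers (the mount points below are just where I'd mount the two exports on a test client):

dd if=/dev/zero of=/mnt/old_home/testfile bs=1M count=1024 conv=fdatasync
dd if=/dev/zero of=/mnt/new_home/testfile bs=1M count=1024 conv=fdatasync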
 
hdparm looks fine; this is a RAID 10 array with twelve 3TB SATA 3 drives:

/dev/sda1:
Timing cached reads: 13274 MB in 2.00 seconds = 6642.22 MB/sec
Timing buffered disk reads: 1142 MB in 3.00 seconds = 380.28 MB/sec

iperf looks interesting when I try to read a large file, though:

Client connecting to , TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 4] local port 34266 connected with port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-23.7 sec 500 MBytes 177 Mbits/sec


------------------------------------------------------------
Client connecting to , TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local port 34267 connected with port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1000 MBytes 839 Mbits/sec

I'll test it again tonight when everyone is off the network.
 
Looks like the previous low transfer speed may have been related to network congestion:

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local port 5001 connected with port 53541
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 4.5 sec 500 MBytes 934 Mbits/sec


One other thing to mention: nfsstat on the old server shows only NFSv3 traffic, whereas the new server shows only NFSv4 traffic.
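I may try pinning a test client to NFSv3 to see if it behaves more like the old box; as far as I know that's just a mount option (the server name below is a placeholder):

# Show the version and options the current mounts are using:
nfsstat -m

# Force NFSv3 on a test client:
mount -t nfs -o vers=3 homeserver:/home /home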
 