Erratic read speeds - OpenIndiana

Nutnut

n00b
Joined
Jun 9, 2012
Messages
27
Hi all,

N00b here - been lurking for a good while, but unfortunately now I have an issue that I can't resolve using the search button. :(

I've just cobbled together an OpenIndiana 151a + Napp-IT box, but I'm having LAN read speed issues.
If the file data (a 30GB .MKV) is in the ARC, it will consistently pull 106MB/s, but if it has to pull it from disk, it bounces between 106MB/s and about 80MB/s like a sawtooth on the graph. Sometimes there's also a pause of about 5 seconds whilst Windows just says 'calculating' with no transfer at all.

According to the Napp-IT benches I'm getting nearly 900MB/s sequential reads, so there should be no problem maxing out the LAN. It's almost as if the reads are bursting rather than streaming.

I have an old Openfiler box that will sit at 119MB/s all day long, and it's on much lesser hardware.

OpenIndiana box:

Asus P6T6 WS Revolution (Loads of PCIe slots)
12GB ECC RAM
Xeon L5520
2x IBM M1015 (Flashed to IT mode) NO SAS Expanders
8x 3TB Seagate disks - RAIDZ2 (More to follow if I can get this working)
2x Intel 330 SSDs mirrored boot (will soon rebuild with these as mirrored ZIL)
1x OCZ Agility 3 240GB as L2ARC (makes no noticeable difference whether this is used or removed).

LAN: Gigabit throughout (24-port D-Link managed switches)

Client machine - Win7 x64 (P9X79 with i7 3930K and Intel 520 SSD)

Any thoughts appreciated.

Mark
 
Since you're running a managed switch, take a look at all your Ethernet stats for both the server port and the client port. The big issue I found was that Ethernet flow control was fouling things up; I disabled it both on the switch *and* on my clients and servers (using ethtool).
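If it helps, this is roughly what that looks like on a Linux box (the interface name is just an example); on OI/Solaris I believe the equivalent is the 'flowctrl' link property via dladm:

ethtool -a eth0                               # show current pause/flow-control settings
ethtool -A eth0 autoneg off rx off tx off     # disable flow control on eth0
dladm show-linkprop -p flowctrl e1000g0       # OI/Solaris: check the property
dladm set-linkprop -p flowctrl=no e1000g0     # OI/Solaris: turn it off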
 
Another thing that could cause network issues (especially if D-Link implemented it badly) is Spanning Tree Protocol being enabled on the ports you are using, especially if you are using VLANs. If it is an option, try disabling it and see if that makes a difference.
 
Your P9X79 has the Intel LAN chipset, which is fine. Unfortunately, the P6T6 has the Realtek LAN chipset, which leaves a bit to be desired with regard to speed.
 
Thanks for the replies, though I don't think it's a network issue. It seems to be the way it's read from the disks.

If I start off the copy of a fresh (uncached) 30GB test file, the speed will bounce. If I then cancel the copy after a minute or so (10GB into the copy, so that it's held in the ARC) and then kick off the copy again, it will max out at 106MB/s for the first 10GB, then bounce between 80 and 100MB/s for the last 20GB.

Basically, if it has to read the file from the spinner disks, the speed bounces. If it's from cache, it runs at a steady 106MB/s.

Write performance is fine though and will sit at 106MB/s, which I guess is all the Realtek NICs can achieve.

I do have a couple of Intel Pro Server cards here somewhere, so I'll give them a go in the morning (nearly 2am here) to see if it helps and pushes the speeds up to nearer the 120MB/s that I get with the OF box.

Will post back with results...

Mark
 
Ok, so I dug out a decent NIC (an HP NC360T, which is the Intel PRO/1000) and I have the same results, albeit edged up by ~10MB/s - bouncing between 116MB/s and 90MB/s.

Watching the lights on the HDD controller, it's clear that as soon as disk activity starts (cache miss) the transfer rate bounces. Whilst there's no disk activity and it's reading from RAM, it's maxing out at a steady 116MB/s (100% utilisation) until the cache is depleted.

Help :confused:
 
One other possibility, though it is not THAT likely, is that you are running into interrupt contention issues. Sometimes on consumer/prosumer boards (and I have seen it with ASUS WS boards before, like the P5DG2 and P5W64) they do a lot of interrupt sharing between PCIe slots and other connected devices. It may be a longshot, but it is something quick you can try. Disable everything you are not using: the Marvell SAS, SATA, onboard LAN, onboard audio, etc. Then power down, bring it back up and check again. Another thing: which two slots are you using for the M1015s? The x16 slots that are CrossFire/SLI enabled can also have an effect on speeds; they are video-focused. Try moving the cards to PCIe 2.0 x16_2 and PCIe 2.0_4, or PCIe_4 and PCIe_6.
 
Just tried - no difference. I even removed the second HBA, disabled the onboard NICs, etc. I can't disable the onboard SATA ports as they are running the boot drives, but I don't think it's a hardware issue as such anyway.

If I run a dd bench with a 100GB test file (32K blocks) I get 784.76MB/s write and 778.12MB/s read, so the hardware seems fine. I can see the disks going nuts and the controller flashing like crazy, so it's clearly using the disks.
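(For reference, the dd bench boils down to something like this; the pool name/mountpoint are placeholders and it's not the exact Napp-IT invocation. The 100GB file keeps the 12GB ARC from skewing the read pass.)

time dd if=/dev/zero of=/tank/dd.tst bs=32k count=3200000   # ~100GB sequential write
time dd if=/tank/dd.tst of=/dev/null bs=32k                 # sequential read back
# MB/s = 102400 / elapsed seconds reported by 'time'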

I think it's just a config issue - maybe buffers somewhere? It's like it reads a chunk of data into the buffers, then 'sleeps' until the buffers are empty and then starts over again in a sort of 'stop/start' fashion rather than a smoother, slower 'stream'... bit like wifey's driving technique! ;)
 
Something else I have just discovered - when it's bouncing all over the place, if I kick off a second file copy from the same pool, it reaches max speed.

I've also done a complete clean install and taken the opportunity to reconfigure the 2x Intel SSDs as a mirror. Works like a charm. So the stutter seems to be on the spinners?
 
That isn't unheard of. If a network device can interleave the data from multiple sources or streams it is moving over the wire, it will often fill out the connection more smoothly than a single thread would. Your array, for whatever reason, might just not be moving enough data on a single thread, but queue the streams up together and it utilises the bandwidth better (for whatever reason, e.g. some data coming from disk and some from memory/cache). Though there is a small additional overhead for the extra filesystem crap at either end, sometimes it is faster. You can often see it in Windows: if you start one copy to a server you might see, for example, 80MB/s, but if you open a second copy the first drops to 50 and the second continues at 40. It all depends on where the bottleneck is.
 
Thanks for all of your replies, but whilst I hear what you are saying, I don't think it's the issue here. As I said earlier, I'm on the right side of 700MB/sec on the benches, and copying a bunch of files from that pool to itself (forcing simultaneous read and write on the same pool) will sustain 300MB/sec (single read thread). I always make sure that there's enough unique data to fully deplete the cache so that the numbers are true.

In addition, when the second file copy is started, it's from the same set of disks and not cached.

Are we really saying that this box, with all of its hardware, can't saturate a single 1Gb link, whilst I have an old shonky Openfiler box running on an old HP ML110 G5 with 8x 1.5TB disks in RAID6 hanging off a 2-port StarTech eSATA 3Gb controller using port multipliers that can? No way. It must be something simple.

Looks like OpenIndiana just isn't for me, which is a shame because I need RAID6 (or equivalent) and persistent reservations, but Openfiler won't do it (gracefully).

Any other products I've missed?
 
Dear Mark,

If it is critical for you to get it up to 116MB/sec, maybe you can try two more tests:

use OI-Host <-----FTP-----> Windows 7 Client to transfer a 20GB ~ 30GB image file

1. This is to verify whether you get the same symptom when switching to the FTP protocol. If you get a consistent, say, 100MB/sec from start to finish using FTP, then we know at least the underlying storage and LAN hardware behaviour are both fine.

2. Can you attach an eSATA external mechanical disk (fresh format, empty NTFS) to your Windows 7 client host and write from OI-Host directly to this external disk? You can use a USB3.0 external mechanical disk as well, with pre-verified speed results. Please list the exact model of mechanical hard disk in use so we can relate it to HDTune test behaviour. (Obviously you need an eSATA or USB3.0 port on your client computer as well.)

Cheers
 
Hi Lightp2,

Thanks for the suggestions - indeed it is important that I max out the 1Gb NICs, as if this thing ever actually works as hoped, I will be moving to 10GbE. I move a lot of large files around (HD video & RAW images), so this is all a proof of concept.

I get the same results with FTP, iSCSI, etc.

I don't have any USB3/eSATA mechanical disks spare at the moment, and I'm not really sure what it would prove. Other storage arrays write happily to the client machine's SSD at 119MB/s, so there's no bottleneck on the LAN or client. Unfortunately, they have other issues, like no persistent reservations or no RAID6 support.



Mark
 
Dear Mark,

1. FYI, based on the xtremesystems SSD endurance thread ("specific test pattern and setup"):

1.1 After writing massively for a long time, quoting an example from the xtremesystems thread:

Intel 520 60GB - Day 105
Drive hours: 2,466
ASU GiB written: 783,153.21 GiB (764.80 TiB)
Avg MB/s: 93.62 MB/s
MD5: OK

For your reference.
 
That's the 60GB version, which is useless for incompressible writes to start with at 80MB/s, in comparison to the 235MB/s of the 240GB drive that I use. IOPS are also over 2x as high...

But the proof is in the pudding... I also have a direct-attached RAID5 set of 5x WD 2TB EARS drives on the internal Intel controller, and a quick test confirms that I can write a 35GB movie file (incompressible) to the SSD at over 200MB/s.
 
Then we will try some diagnostics.

1. Since you said you have a local RAID disk set, it is even better.
1.1 Try an SMB write of 30GB from OI-Host to the local RAID disk set. Please list the megabytes/sec.
1.2 Try booting a Fedora Live CD/USB or a CentOS 6.2 Live CD/USB (I forget which one has it, but on one of them you can use lftp). You may need to mount the RAID5 manually. (If you are not sure, then do not do this step, as an accident may corrupt your RAID5 data.) (Sometimes Linux Live can mount fakeraid as well.)
1.2.1 Usually the latest Linux Live distribution will boot (since you have good experience with Openfiler, I guess RedHat-based Fedora/CentOS should be fine).
1.2.2 FTP 30GB from the OI host to the Linux Live local mechanical RAID5 set. Please kindly list the shown megabytes/sec.
1.2.3 FTP 30GB from the OI host to the Linux Live local SSD 520. Please kindly list the shown megabytes/sec. (This will wear your SSD, unfortunately; since you have 12GB RAM, a smaller file may not help.)
1.2.4 As a free test, if you can mount NFS/CIFS, copy a 60GB image file from the network to local /dev/null. /dev/null is a black hole, so you can copy anything to it and it will be gone. I only know how to use it in a dd example: dd if=/mnt/nettest/bigfile110G.iso of=/dev/null bs=1M. You can put a time command in front so that it shows an estimated MByte/sec once finished. Used this way it never writes to a local disk, so it mostly measures the remote filesystem plus the NIC speed. From here we want to know whether the OI-Host to Linux Live link is achieving maximum network speed (see the sketch after this list).

2. There is NO bias or preference in any of these. Since Linux has a reputably well-performing networking stack and system behaviour, I use it here. I really prefer the CentOS 6.2 LiveCD/USB for this test. Using a LiveCD ensures that there will not be any local Windows stuff, so we can check carefully which is which.

2.1 Since your problem is network file transfer speed, having only scenarios without the network would not be complete. You need to include a network file transfer so others have a defined scenario.
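For the /dev/null test in 1.2.4, the whole sequence is roughly this (server name, share path and mount point are only placeholders):

mount -t nfs oi-host:/tank/media /mnt/nettest
time dd if=/mnt/nettest/bigfile110G.iso of=/dev/null bs=1M
# MB/s = file size in MB / the 'real' seconds reported by time; no local disk is touched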

Cheers.
 
I don't want to risk the RAID5 set, but we seem to be concentrating too much on the client. As I type this, I'm transferring 70GB of data from the shonky Openfiler box to this client at a constant 118MB/s. Surely this is enough to prove that the disks and network stack of the client, as well as the LAN itself, are fine. The Openfiler box is also using the same NC360T NIC that the OpenIndiana box uses.

In addition, I have a raft of other servers and client machines sitting around here that also pull stuff from the OF box, and between each other, at the same speeds. Until a couple of days ago, I even had a Windows Storage Server running on a DL360 G5 that was maxing the LAN consistently.

On top of this, I CAN max the network between OI and Win7 provided the file is pulled from ARC. If I copy a 10GB file from OI to Win7: 80~100MB/s. Once complete, if I copy it again: 116MB/s constant. If I then go to a 2008 Server box and copy THE SAME 10GB file (so it's still in cache): 116MB/s.

If I then copy a new 10GB file to the Win2008 box: 80~100MB/s again (because it's not in ARC). Go back to the Win7 box and copy the file that was just copied to the 2008 box (so it's in the OI cache but Win7 has never seen the file)... 116MB/s.

The clients are fine; it's the OI box when reading from disk AND sending over the LAN that's the issue - I'm just not sure exactly what the problem is.

Writing back to the OI box, on the other hand, is fine.

Thinking I'm gonna bin OI and the good dreams I had of ZFS. :(
 
What version of OI? Have you updated OI to the latest 151a4? There are bunches of driver updates and more. Please try that first before you give up!

And if not that...at least give Solaris 11, or NexentaCore, or Nexenta community edition a try before you give up on ZFS.
 
OI151a4 + Napp-IT.

I have considered Nexenta, but the community edition has a max 18TB capacity, Core is expensive, and Solaris 11... hmmm... Solaris 11... Downloading now. If nothing else, let's see if this issue goes away.

Mark
 
I think you meant 'Enterprise is expensive'? NexentaCore is free, AFAIK. It is also EOL anyway...
 
Also... if you are using any file copy manager, try without it. TeraCopy, for example, is known to cause issues when copying from Solaris machines. I know it doesn't explain why it would work when copying from ARC, but it's at least one more variable to eliminate if you haven't already.
 
danswartz - you're right, but as you say, it's EOL anyway.

Silenus - nope, nothing clever going on here...

Just tried Solaris 11 + Napp-IT. Same issue.

When you look at the 8x activity lights on the controller, you can see them pulse at a rate of about twice per second, whereas when you are copying files locally on the box, or benching or whatever, they go nuts and stay on almost solidly.

Still think it's sommat to do with the way it reads from disk and shoves it into the LAN buffers.
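A way to see the same pulsing in numbers rather than controller LEDs would be something like this on the OI box while a copy runs (the pool name 'tank' is just a placeholder):

zpool iostat -v tank 1     # per-vdev bandwidth/ops every second
iostat -xn 1               # per-device service times (asvc_t) and %busy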

Giving up, me thinks...
 
Two things you could try:

1 - Reduce the number of disks in the RAIDZ2 pool to 6 (or increase to 10).

2 - Check what "zdb | grep ashift" says (see the sketch below), though if it were a 4K alignment issue you might expect to see the problem on writes too (though who knows exactly how their SmartAlign feature really works).
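Roughly what to look for (output format can vary a bit between releases):

zdb | grep ashift
#  ashift: 9   -> vdev aligned for 512-byte sectors
#  ashift: 12  -> vdev aligned for 4K sectors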
 
Mark - one other question... Are you running just a single GigE link, or are any of these links LACP-teamed or round-robined?
 
Only one NIC configured (dual-port card).

I have tried every disk layout possible, including a basic stripe, a 6-disk RAIDZ2 and so on. Still the same. Unless the data is cached, I get dodgy read performance over the LAN.
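The test layouts were roughly along these lines (device names here are placeholders; creating or destroying a pool wipes those disks):

zpool create -f testpool raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0   # 6-disk RAIDZ2
zpool destroy testpool
zpool create -f testpool c3t0d0 c3t1d0 c3t2d0                               # basic stripe
zpool status testpool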

Installing Openfiler now to see if I have the same issues (to eliminate hardware problems). Will post the results.
 
Yep, I've also moved all of the other cards around too.

Given up now - loaded Openfiler... now waiting for a 24TB mdadm sync... :(
 
Well, what d'ya know. Openfiler worked like a charm. Ended up building a CentOS 6.2 box with XFS, as OF is dead as far as I'm concerned, and that too is working like a charm.

Defo sommat up with Solaris-based OSes. Question is, is it my hardware combo or an underlying issue in the OS?

Thanks for all of your replies. :)

Mark.
 
Nutnut,

I'm having the exact same issues (and more) with Solaris 11 and derivatives on my boxes as well. Pool throughput seems to lag, but ARC hits can flood the wire. I have more than enough read speed on the disks (900MB/s scrub speeds from 6 disks in triple mirror) and still can't get there. I'll be starting another thread for my other CIFS speed woes, but this same problem will certainly get a mention there as well.
 
I assume this was with the kernel CIFS server? Is it possible to try with NFS and/or Samba to see if it's specific to the CIFS server? Also, you might want to re-post your original post to the openindiana mailing list. Some pretty savvy folks hang out there who might have thoughts.
 
Glad it's not just me (in the nicest possible way, of course).

It's not just CIFS; it's the same with any transfer from disk - CIFS/SMB, NFS, iSCSI... anything that's not in ARC.
 
What is the sleep timer set to on your disks? I wonder if it is reading at 900MB/s, filling up the ARC, and then the drives are parking their heads. Then the ARC runs out and the drives go back to 900MB/s mode. That is the only thing I can think of.
 
One more thing: do you have atime on or off? atime can slow the file system down because it writes a timestamp every time a file is accessed.
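Quick way to check/change it, in case it helps (the pool name is just an example):

zfs get atime tank
zfs set atime=off tank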
 
I always set atime=off on my pools to eliminate nuisance writes (I don't have any DBs or anything that would care about atime anyway). It doesn't appear that ZFS is letting the heads park - I see a pretty even distribution of IOs in iostat, with svc_t averaging about 5ms per disk, slowly banging away constantly with reads.

Nutnut, I also see this behaviour with other protocols (even netatalk/AFP). On my box at home (built-in SATA, 3x WD30EZRX in triple mirror), AFP would absolutely destroy gigabit with ARC hits (117MB/s constantly, no dips). Pool reads, however, seemed to max out in the 90s MB/s and flex downward into the 50s and 60s. I noticed my asvc_t in iostat was a little bit long, so I decided to try setting zfs_vdev_max_pending=1, and now I'm getting a much shorter asvc_t (around 10ms), and my read throughput is much, much more constant (I'm at around 111MB/s, sometimes dipping slightly to 108MB/s). I need to do a bit more testing, but it seems to have made things much better.

However, the same cannot be said for the Poweredge 2950 I have here at work (hence my other thread about this). This setting doesn't seem to be so damning with full-fledged SAS controllers, and seemingly made no difference on this box.
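If anyone wants to try the same tunable without a reboot, this is roughly how it's done on a live OI/Solaris box (the usual caveats about poking the kernel with mdb apply; it reverts at next boot):

echo "zfs_vdev_max_pending/D" | mdb -k        # show the current value
echo "zfs_vdev_max_pending/W0t1" | mdb -kw    # set it to 1 (decimal) until reboot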
 
Sorry for lack of replies - been away...

Taishan - I had the same issue with all protocols too. I had a play with various array configurations using all 8 disks, including a RAIDZ1, basic stripe, mirror... same issue.

In playing around with other OSes (Windows, CentOS (mdadm), etc.) I'm easily maxing the NIC, and local reads/writes, albeit sequential, are around 800-900MB/sec. Random is around 300 IIRC.

This all proves that the hardware is fine, and I'm sure there's a simple setting somewhere that will resolve the issue, but the fact that I've tried various flavours of Solaris/ZFS and had the same problem means I had no choice but to drop it, which is a REAL shame. I loved Napp-IT etc., but ho-hum.

If you ever do conclusively find the solution, please do let me know.

Leftside - atime is off (no need for it) and the disks aren't sleeping. :(


Thanks all

Mark
 
Mark,

I think I may have an answer for you - and it stems from my research with the 3x WD30EZRX disks in my box at home (Xeon E3-1240, 16GB ECC DDR1333, Intel C204 chipset, SATA AHCI, 3x WD30EZRX in triple mirror, Intel Larsen Creek 20GB SLC boot drive).

I had the exact same issues that you were having. Basically, ARC with AFP flat-out kills the wire (117MB/s, 119MB/s with jumbo frames), but pool reads leave a lot to be desired (bouncing around between 50-90MB/s). Writes are flat-out max for 4 seconds, dip, then max again, etc.

From what I've gathered, it's ZFS's own I/O scheduler defeating itself. Once I set zfs_vdev_min_pending=1 and zfs_vdev_max_pending=1, my pool reads now essentially flood the wire (around 110MB/s, again using netatalk/AFP). Not too shabby for three slow disks that ZFS usually hates anyway (it's ashift=12, for the curious). Even my SMB throughput shot up under Windows 7, and seems more consistent. It's not quite as flat as ARC hits, but getting close.
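To make it stick across reboots, something like this in /etc/system should do it (standard Solaris/illumos tunable syntax; lines starting with * are comments):

* limit the per-vdev I/O queue depth to 1
set zfs:zfs_vdev_min_pending = 1
set zfs:zfs_vdev_max_pending = 1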

Now, I don't seem to have as much of a problem at work, where I have more (and faster) disks in the pool, even though it has much less processor power. However, although you have even more disks than I do at work, mine are all in triple mirror. Remember, in a mirror you have the write bandwidth and write IOPS of one disk in the vdev, and the read IOPS and read bandwidth of all of them combined. In a RAID-Z configuration, you have the read/write bandwidth of all of them combined, but only the read/write IOPS of one disk. This is an incredibly important distinction. I believe ZFS is cornering itself by queuing up tons of read IOs, and then when SMB wants some metadata that it doesn't have, it has to wait for that queue to clear before it can continue, and ends up getting stalled (the zig-zag). This is bad enough in a triple mirror with three slow disks, but you're even more limited in RAID-Z (you have the IOPS of only one disk), so you're going to be absolutely rammed. It would also explain why I don't see as many problems at work (I'm approaching enough disks to get sufficient read IOPS across vdevs not to see the impact). Remember, the lowliest ZFS appliance (the 7120) has 11 disks, using at minimum 9 of them in an active pool (triple mirror, 2 spare).

I'd really give this a go - it helped my box at home a lot, and I'll be doing more testing here at work on this Poweredge 2950.
 
What you say certainly makes some sense, but I have also tried a 2-disk mirror of Intel 330 60GB SSDs (slow, crappy ones) and a 3-disk mirror using 240GB Agility 3s (these badboys, as unreliable as they are, are fast).

You clearly have a deeper knowledge of ZFS and Solaris in general than I do, but I can't believe the read speeds can be that slow unless there is a general issue with the way Solaris does stuff. Remember, local reads are in the region of 800MB/sec. I was only playing with sequential data at the time of testing.

mdadm is rocking along with no issues at all. I think ZFS is great, but it just ain't working for me. :(
 
Okay: 1) you are running an Asus P6T6 WS Revolution, which has 2x Realtek 8111C 10/100/1000Mbps NICs, and 2) you also tried an HP NC360T (which is the Intel PRO/1000) NIC and had the same results.

Both these NICs max out at 1000Mbps (that is bits per second), so even without any overhead the "theoretical best" transfer rate would be 100MB/s (this is megabytes).

You are complaining about your experience of "pull 106MB/s but if it has to pull it from disk, it bounces between 106MB/s and about 80MB/s like a sawtooth" - it seems to me you are using 80% of your link and you want to fully saturate it. Then you see "bouncing between 116MB/s & 90MB/s" for the HP NC360T NIC (about 90% saturation).

These are not bad numbers, especially if your benchmark is via something like NFS. I wonder what would happen if you used a pipeline like 'netcat' - note that even napp-it relies on 'netcat' for async replication, see http://www.napp-it.org/extensions/replication.html You might also consider 'mbuffer'; take a look at its use with and without 'netcat' at https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/ZFSBenchmarkTest Warning: depending on your HW, 'netcat' might not even send data fast enough, in which case you might need 'socat'.

IMHO your bottleneck might reside in a) not using oi_151a4 (it is a simple upgrade), b) using a network file system (NFS/Samba) to benchmark, c) using 8x 3TB Seagate disks in RAIDZ2.

I addressed a) and b) above; let's talk about c). You have three drives in a RAIDZ2, which seems so wrong: you have one data disk and two parity disks. Do yourself a favour and just mirror, and keep one spare drive for repairs if a failure occurs. Now redo your initial tests on a mirrored pair and let me know the results. Then use a test pipeline:

-------------------------------------------------------
(localhost first; should hit about 200MB/sec)

Receiver end:
LD_LIBRARY_PATH=/usr/sfw/lib ./socat TCP4-LISTEN:5678 - | /root/mbuffer-20080507/mbuffer -m 2048M > /dev/null

Sender part:
dd if=/dev/zero bs=1024k count=10000 | /root/mbuffer-20080507/mbuffer -m 2048M | LD_LIBRARY_PATH=/usr/sfw/lib ./socat - TCP4:localhost:5678

-------------------------------------------------------
(client to server; should saturate your link)

Receiver end:
LD_LIBRARY_PATH=/usr/sfw/lib ./socat TCP4-LISTEN:5678 - | /root/mbuffer-20080507/mbuffer -m 2048M > /dev/null

Sender part:
dd if=/dev/zero bs=1024k count=10000 | /root/mbuffer-20080507/mbuffer -m 2048M | LD_LIBRARY_PATH=/usr/sfw/lib ./socat - TCP4:client:5678
 
Regarding the earlier statement that "even without any overhead the 'theoretical best' transfer rate would be 100MB/s (this is megabytes)":

GigE actually has a theoretical maximum of 125MB/s (1000Mbit/s divided by 8 bits per byte). Something like SATA, for example, uses 8bit/10bit encoding, so there you divide by 10.
 