CAD system for client - Intel 5520 chipset + 2x Quadro FX5800's = issues

AreEss · Jun 9, 2009

sabregen said:
You have confused me. The client that this is for has specifically stated that they wanted Quadros. FireGL cards are only an option if we can't get the Quadros to work. They're not really worried about IQ here. The RealiZMs don't appear to even be in production anymore, and are not readily available. The timeline for delivery is very short, so waiting weeks for anything was out of the question.

That's exactly why I don't allow clients to specify hardware; because the client doesn't know what is and isn't compatible, and what will and won't perform. They get bombarded with glossy print ads and benchmarketing. I work from a set of requirements - the system must do this, this, and this - and derive specs from there, for that very reason. And no, Creative raped 3DLabs, which is why I had to spend a lot on service parts. I think I'm the only source in the world for replacement parts now. Sigh.

And who does RAID-5 for performance? Whether you're using an onboard SAS controller, onboard SATA controller, or a dedicated RAID card, you still have to calculate parity somewhere. They are not (for this machine) concerned with data loss all that much (a reload at this point only requires that they drop in my slipstreamed disk, and reload AutoCAD), they are concerned with machine downtime (that's why RAID10...as long as two disks in different sets don't fail), and time incurred for parity calculations (hence the RAID10 > copy on write...no parity calcs). They don't care about the data on the machine, per se (well, only so long as it takes them to do the work on the machine...then it gets dumped off).

Long thread elsewhere about it, and I'm aware of their workflow scheme. People here, generally don't know jack about RAID. They can talk a good game, but it's just a game. I've been doing it for years upon years. Parity is irrelevant when done correctly. They come up with any number they can to justify their ignorance. The reason for RAID5 is purely performance - you will not saturate any card with RAID10. RAID5 is faster, cold hard fact. The specific reason I stated LSI 8480 is that I'm familiar with it, and I've deployed it. It can do >475MB/s long sequential read on unaligned partitions with 7200RPM disks. Performance with 4+1 10k SAS, aligned, full stripe writes, exceeds 500MB/s typical. I didn't bother with long sequentials; at that much bandwidth, you're close to exceeding bus capability. Obviously, it's not an option since they insist on rushing it - another thing I don't let customers do - but it's basically the optimal configuration.
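For readers following the parity argument, here is a minimal sketch of what "calculating parity" means for a 4+1 RAID-5 stripe. It is an illustration only, not the firmware logic of the LSI 8480 or any other controller; the function names and the 8-byte block size are assumptions made for the example. On a full stripe write the controller has all four data blocks in hand, so parity is a single XOR pass with no extra reads, which is the case the post calls "done correctly".

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks (hypothetical helper)."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks in a stripe must be the same size")
    # zip(*blocks) walks the stripe column by column; XOR each column together.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one lost block: XOR of the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])


# A 4+1 stripe: four data blocks plus one parity block (tiny blocks for clarity).
data = [bytes([value] * 8) for value in (0x11, 0x22, 0x44, 0x88)]
parity = xor_parity(data)

# Simulate losing the third disk and rebuilding its block from the rest.
lost = data[2]
recovered = rebuild_missing(data[:2] + data[3:], parity)
assert recovered == lost
```

A partial-stripe write is the expensive case: the old data and old parity have to be read back before the new parity can be computed, which is where the RAID-5 write penalty discussed above comes from.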

They do have a SAN, but the requirement for this project, and therefore, this machine is that it will be VLAN'd off, and cannot touch the SAN (but they do have one), or any other shared network devices. This machine has to be standalone, or have dedicated resources. $20k to these folks is a lot for a computer, so the idea that they would buy additional dual fabric config, RAID controller, and disk for this machine is out of the question.

Sigh. iSCSI is not a SAN, it's a joke. SAN is 4Gbit FC, 8Gbit FC. The point of a pre-processor in their scenario is really pretty simple. Rather than muss around with an extremely expensive and complex workstation, the majority of the heavy lifting is done by the pre-processor, which would typically be an xSeries or similar connected to FC disk, with a Tesla or Stream processor in it. This system performs the majority of the heavy lifting, allowing them to do less processing work on the workstation. In some situations, it can take a pair of $28k workstations down to a single $12k pre-processor, and two $5k workstations. But it's also generally more suited to a larger scale deployment, rather than a one-off.

The 30" screens that they went with were for the resolution, solely. They weren't at all interested in the overall IQ of the displays, response times, etc. Although they do intend to manipulate the objects in the environment on one screen, and have it rendered on the other, they don't do fluid animations, or renders that require fast response times, color gamut reproduction, etc. It's not that kind of application. They're doing Mechanical and Electrical work in AutoCAD, not video production, video editing, or animation stuff.

Uuuuuuugh, I think I know exactly what it is now. This really is a good illustration of why I work the way I do, and don't let clients specify components. You'll need to verify with nVidia that they don't still have the no dual-acceleration restrictions in place; used to be that if you wanted two accelerated displays from one card, tough, go buy another Quadro or a Quadro MVS (which was a glorified GeFarce2 MX.) The problem with MCAE is that despite application similarity, it's vastly different from CAD/CAE/CFD. Given the datasets you've mentioned, I don't know what performance will be; I haven't done any major MCAE since the REALiZM 800 was current. I can see why they would think dual Quadros; especially with nVidia's old single accelerated display requirement - but in hindsight.. I'm doubtful they would actually need them for any other reason.

Be interesting to see what nVidia and PNY say, or if they take the easy out and say that the 5520 isn't a supported chipset and you're stuck.

sabregen · Jun 9, 2009

AreEss said:
Sigh. iSCSI is not a SAN, it's a joke. SAN is 4Gbit FC, 8Gbit FC. The point of a pre-processor in their scenario is really pretty simple. Rather than muss around with an extremely expensive and complex workstation, the majority of the heavy lifting is done by the pre-processor, which would typically be an xSeries or similar connected to FC disk, with a Tesla or Stream processor in it. This system performs the majority of the heavy lifting, allowing them to do less processing work on the workstation. In some situations, it can take a pair of $28k workstations down to a single $12k pre-processor, and two $5k workstations. But it's also generally more suited to a larger scale deployment, rather than a one-off.

Uuuuuuugh, I think I know exactly what it is now. This really is a good illustration of why I work the way I do, and don't let clients specify components. You'll need to verify with nVidia that they don't still have the no dual-acceleration restrictions in place; used to be that if you wanted two accelerated displays from one card, tough, go buy another Quadro or a Quadro MVS (which was a glorified GeFarce2 MX.) The problem with MCAE is that despite application similarity, it's vastly different from CAD/CAE/CFD. Given the datasets you've mentioned, I don't know what performance will be; I haven't done any major MCAE since the REALiZM 800 was current. I can see why they would think dual Quadros; especially with nVidia's old single accelerated display requirement - but in hindsight.. I'm doubtful they would actually need them for any other reason.

Be interesting to see what nVidia and PNY say, or if they take the easy out and say that the 5520 isn't a supported chipset and you're stuck.

Perhaps I confused you with the mention that they had a SAN, and then also mentioning VLAN'ing, but the two are mutually exclusive to each other. Their SAN is a DS4800 + 4 shelves of 15k FC disk, dual fabrics on QLogic 5802v's w/8Gb SFPs. The VLAN requirement for this machine is on the 1GbE network for ethernet connectivity. Sorry if I mashed that up.

The pre-processing system would be awesome, and is something like what we would do for LANL or Sandia, but not these guys. Oh, and I hear you on the customer-specified hardware...and the compressed time line. I don't like it, either.

I am hopeful and apprehensive regarding the SM/PNY/Nvidia interactions that seem to be taking place over the issue...

AreEss · Jun 9, 2009

sabregen said:
Perhaps I confused you with the mention that they had a SAN, and then also mentioning VLAN'ing, but the two are mutually exclusive to each other. Their SAN is a DS4800 + 4 shelves of 15k FC disk, dual fabrics on QLogic 5802v's w/8Gb SFPs. The VLAN requirement for this machine is on the 1GbE network for ethernet connectivity. Sorry if I mashed that up.

That you did, sir. iSCSI gets thrown around as SAN so much, when it really isn't...

Three not-cheers for the DS4k's. 11.36 sold me on never selling them again. I got bit very, very badly by that not once, or twice, but four times. With IBM and LSI present for two of them. Probably, if it's being fed off the DS4k, I'd go ahead and LACP team 'em. They'll burn through the bandwidth anyways one way or another.
EDIT: Have they given consideration to NFS or CIFS mounting data?

The pre-processing system would be awesome, and is something like what we would do for LANL or Sandia, but not these guys. Oh, and I hear you on the customer-specified hardware...and the compressed time line. I don't like it, either.

Yeah. That's why I own/operate myself - people upstairs always agree to the worst ideas, and never protect you from the customer. Anything that goes wrong flies right by them agreeing to it over your protests, and falls on your head. And they'll never, ever say no. I've turned down as many as I've accepted, either because they needed it too fast, or insisted on the unreasonable. About half of them end up coming back a year later, complaining about what they bought instead. Sigh.

I am hopeful and apprehensive regarding the SM/PNY/Nvidia interactions that seem to be taking place over the issue...

Same.. PNY, and nVidia especially, have a vested interest in just saying "5520 isn't compatible, we didn't say it was, tough luck." Their customer attitude is basically hate; they don't care if they lose "a" customer, they assume they'll make it up in people who buy the sidegrade every release.

sabregen · Jun 10, 2009

SM sent me a 1.3 BIOS (current was 1.0a) that is not yet released, and said that they replicated the config on their end with 1.3, and could not reproduce the issue. I just tried flashing 1.3 on the board and was successful, but now I am getting a CMOS checksum error, despite having cleared the CMOS on the board, and reflashing 1.3 just to make sure it took.

Annoyed...

sabregen · Jun 10, 2009

Response from SuperMicro on the tested config:

I've tried one more time with RAID 0+1 and I don't see the issue. Below is my configuration:

Rev: 1.3
BIOS: 6/6/09
CPU: 1 x Intel W5590
MEM: 1 x 2GB Hynix DDR3-1333
HDD: 4 x Western Digital 300GB Velociraptors
VGA: 2 x Quadro FX5800
VGA Driver: 7.15.11.8265 (latest on nvidia website)

They sent me BIOS 1.0b, not 1.3. I've still got the CMOS checksum error on 1.0b that I can't get past, for some reason. I am calling SM now, and going to see if I can get this mythical 1.3BIOS.

crazjayz · Jun 10, 2009

Does it matter that SM isn't using a dual cpu configuration like you are? If they're trying to replicate the problem, at least they should try to use the same components.

sabregen · Jun 10, 2009

Apparently the BIOS dates for the 1.0b that I got and their "claimed" 1.3 are the same, so I do in fact have the same build that they verified on.

I would take your single vs dual CPU config into consideration...if I could get past the BIOS Checksum error and get into the OS, and test with dual cards, post-BIOS update. However, I haven't been able to get past the BIOS Checksum error, yet.

AreEss · Jun 10, 2009

sabregen said:
Apparently the BIOS dates for the 1.0b that I got and their "claimed" 1.3 are the same, so I do in fact have the same build that they verified on.

I would take your single vs dual CPU config into consideration...if I could get past the BIOS Checksum error and get into the OS, and test with dual cards, post-BIOS update. However, I haven't been able to get past the BIOS Checksum error, yet.

What's the EXACT error it's giving you on the BIOS screen, in full?

sabregen · Jun 10, 2009

At the end of the POST cycle, on initial boot:

CMOS Settings Wrong
CMOS Time/Date Set Wrong

F1 to SETUP
F2 to Load Defaults and Continue

setting date/time is the only thing that sticks. nothing else stays

subsequent boots:

CMOS Checksum Error

F1 to SETUP
F2 to load defaults and continue

nitrobass24 · Jun 10, 2009

What happens if you press f2 to continue?

sabregen · Jun 10, 2009

because the SATA ports were set up (when the OS was installed) as AHCI, amongst many other changes, the system BSODs during Windows Boot. I need to get in to change the BIOS options back to how they were when the OS was installed to get it to boot, but I can't get past this damned CMOS Checksum Error.

AreEss · Jun 10, 2009

Going to have to wait on Supermicro to fix this one.. sounds like the BIOS has an internal check error occurring.

sabregen · Jun 10, 2009

So close (possibly) yet so far away.

sabregen · Jun 11, 2009

I was able to get the 6/6/09 8DA36069 BIOS flashed and cleared the CMOS issue successfully. Turns out that SM shipped me the afudos.exe and the .rom file without the batch file that destroys the CMOS checksum as a post-flash process. I had to dig on my server at home for the afudos command line switches that I used when I built my KFN32D-SLI 2x quad opteron system last year, as it used the same BIOS....wonderful waste of a day to find that out. after the CMOS checksum was destroyed, the machine booted.

The first boot went directly to the login screen, and I was able to login, and get the desktop. Device manager showed both cards, only 1 display was active. As I went into the Display manager, and enabled the second display, the machine brought up the second screen. As soon as I hit OK on the Display manager window, the machine powered off.

The second time that the machine was booted, I was able to login, and get to the desktop. Both cards were present in Device Manager. This time I used the Nvidia control panel to activate the second display, and the machine stayed up after I closed the Nvidia control panel. I went back to my desk to download some tools that would put a load onto the cards, but by the time I went back to the machine (about 5 minutes), it was off.

Now, I get to the login screen, but as soon as I type the password, and hit enter, the machine powers off. I have asked SM to verify all driver levels, move their test HDD config to the LSI 1068e controller, as they have confirmed that they did their test on the ICH10R in RAID mode. I have also asked for Chipset driver revision, Audio driver revision, OS type and patch level, and for them to populate the second CPU socket, as well.

On my end, I have tested the power supplies extensively. I kept the 1200w Silverstone for just powering the system level components (everything but the cards), and powered each of the cards with their own power supplies, to no effect. I have unplugged all extraneous pieces of hardware from the system (all fans except for CPU, SATA BD-RE drives, USB front ports, etc), and had no luck there, as well. While I am waiting for SM, I will likely pull the board from the chassis, and test it all laid out on anti-static mats.

sabregen · Jun 11, 2009

Loaded optimized defaults, only changed a few settings to get the system to boot: ACPI version, boot device settings, AHCI mode for SATA.

boots are now crashing randomly with 2 cards:

boot #1: entered password, got to desktop. verified second card present, enabled second monitor on second card, machine powered off

boot #2: got to login screen, entered password, machine immediately powered off

boot #3: got to login screen, entered password, got to desktop, verified second card, enabled second monitor, started to copy benchmark tools from USB thumbdrive, machine powered off mid-copy

boot #4: machine powered off after entering password at login screen

boot #5: got to login screen, entered password, got to desktop, verified second card, enabled second screen, machine crashed after 46 seconds at desktop

boot #6: crashed after 1:46 at desktop

boot #7: crashed at login screen, after entering password

sabregen · Jun 11, 2009

card #1 only, running Prime95 x64, 16 threads (16 logical cores in the box) = 5 minutes and counting. I will be adding ViewPerf into the mix, as soon as it's done copying to my thumbdrive.

I will run for 30 minutes on each card, and report back.

nitrobass24 · Jun 11, 2009

Have SM send you their setup, and you can send them yours....cards + mobo.
I mean theirs works

sabregen · Jun 11, 2009

I wish!

sabregen · Jun 11, 2009

card #1 passed the Large FFT Prime95 x64 + ViewPerf 4 thread test @ 30 minutes. Running card #2 now.

nitrobass24 · Jun 11, 2009

I have an idea...
Send me one of those Quadros, and ill send you a brand new BFG GTX 275 OC.
Same SP count.
Then your system will work.

AreEss · Jun 11, 2009

Okay, here's what I need you to do.
I need you to put everything EXCEPT the hard drives onto the Silverstone, and switch the drives to one of the standalones. I think there's a power problem with the configuration you're running - actually, I know there is, because you have no load on the 3.3V or the 5V line. That really, really pisses off ATX. Doesn't matter how good the PSU is; they're designed to have those loads. And you don't have enough loading resistors laying around, I'd wager, to correct it.

Let's see how it goes with everything but the drives on the 1200W Silverstone. I got a funny feeling it'll start crashing instead of powering off.

sabregen · Jun 11, 2009

card #2 has passed the same set of tests that card #1 has passed, for over an hour.

sabregen · Jun 11, 2009

AreEss said:
Okay, here's what I need you to do.
I need you to put everything EXCEPT the hard drives onto the Silverstone, and switch the drives to one of the standalones. I think there's a power problem with the configuration you're running - actually, I know there is, because you have no load on the 3.3V or the 5V line. That really, really pisses off ATX. Doesn't matter how good the PSU is; they're designed to have those loads. And you don't have enough loading resistors laying around, I'd wager, to correct it.

Let's see how it goes with everything but the drives on the 1200W Silverstone. I got a funny feeling it'll start crashing instead of powering off.

So you want me to test with the drives (keeping in mind that they are powered through a backplane) on a separate PSU? fine by me.

sabregen · Jun 11, 2009

drive backplane (and therefore, drives) were connected to a separate PSU. System powered off before bouncing green load screen completed, tried 3x.

sabregen · Jun 11, 2009

also have tried taking the following out of the equation:

SAS drives and backplane (did single SATA drive install, same result)
Optical drives removed from power and data connections
FP headers (power button, reset button, etc) removed
disconnected FP USB
disconnected all chassis fans
disconnected chassis intrusion detection
removed motherboard from system, pulled all components from motherboard, checked standoff location, reseated all hardware
tried 3 different BIOS's
tried 7 different drivers
tried 5 OS's
tried cards individually
ran CPU/RAM stress tests without issue
ran video stress tests on individual cards
ran 3x PSUs = 1 for system, 1 for card #1, 1 for card #2

issue only appears when both cards are in the system, with intermittent uptime, but always the same result...power off.

I'm really beginning to think it's the board. Tomorrow a co-worker and I are each bringing in our GTX260's to test those cards in the system. I have notified SM that they have not responded to my requests for:

verification of OS type and architecture, and code level
verification of use of SAS backplane in their config
addition of second CPU and RAM to their test config
type of their PSU, and voltage ratings
if Optimized defaults are applied in BIOS, what additional settings they have changed to get their config working
verification of load testing using the same tools I am: Prime95 x64 16 threads, and ViewPerf 10, 4 threads

Barring this information, I have requested the form for an overnight board RMA, to be tested on the config in question, before being shipped to me.

sabregen · Jun 11, 2009

Advanced RMA request submitted for the motherboard. Everyone is suspecting a single board issue.

AreEss · Jun 11, 2009

Yeah, the system powering off with the SAS on the ATX is the clincher for me as well. There's a problem in the power regulation or PCIe lines, I'd say. Most likely it's in the power regulation, since it continued with a better load profile on the PSU and trips only when the second card starts drawing significantly from PCIe.

Supermicro should have you a new board very quickly.

sabregen · Jun 15, 2009

replacement board came in this morning. System has been tested, and is working now. All along, it was the damned board.

Can anyone tell me if there is a command line option that I could use to designate what screen a given program can run on, either by modification of the shortcut, or command line batch syntax?
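One scripted way to approach the question above, sketched in Python: work out where a window should sit on the chosen monitor, then ask Windows to move it there. Everything here is an assumption made for illustration, including the `Monitor` type, the helper names, and the two side-by-side 2560x1600 displays. `FindWindowW` and `SetWindowPos` are real user32 calls, but the `move_window` helper is untested on this particular system and only works on Windows.

```python
from typing import NamedTuple, Sequence, Tuple


class Monitor(NamedTuple):
    """Position and size of one display in virtual-desktop coordinates."""
    left: int
    top: int
    width: int
    height: int


def centered_position(monitors: Sequence[Monitor], index: int,
                      win_w: int, win_h: int) -> Tuple[int, int]:
    """Return the top-left (x, y) that centers a window on monitors[index]."""
    m = monitors[index]
    w = min(win_w, m.width)    # clamp so the window cannot overhang the display
    h = min(win_h, m.height)
    return m.left + (m.width - w) // 2, m.top + (m.height - h) // 2


def move_window(title: str, x: int, y: int) -> bool:
    """Windows only: move the top-level window with this exact title to (x, y)."""
    import ctypes
    from ctypes import wintypes

    user32 = ctypes.windll.user32
    user32.FindWindowW.restype = wintypes.HWND
    user32.FindWindowW.argtypes = [wintypes.LPCWSTR, wintypes.LPCWSTR]
    user32.SetWindowPos.argtypes = [wintypes.HWND, wintypes.HWND,
                                    ctypes.c_int, ctypes.c_int,
                                    ctypes.c_int, ctypes.c_int, wintypes.UINT]
    hwnd = user32.FindWindowW(None, title)
    if not hwnd:
        return False
    SWP_NOSIZE, SWP_NOZORDER = 0x0001, 0x0004   # keep current size and z-order
    return bool(user32.SetWindowPos(hwnd, None, x, y, 0, 0,
                                    SWP_NOSIZE | SWP_NOZORDER))


# Assumed layout: two 2560x1600 displays side by side, primary on the left.
monitors = [Monitor(0, 0, 2560, 1600), Monitor(2560, 0, 2560, 1600)]
x, y = centered_position(monitors, 1, 1280, 800)
```

From a batch file this could be wrapped as "start the program, wait for its window, then call `move_window` with its title". That is only one possible approach; a dedicated multi-monitor utility is likely the less fragile route.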

delvryboy · Jun 15, 2009

I was just wondering what the outcome was this morning.

Happy everything worked out for you.

nitrobass24 · Jun 15, 2009

sabregen said:
replacement board came in this morning. System has been tested, and is working now. All along, it was the damned board.

Can anyone tell me if there is a command line option that I could use to designate what screen a given program can run on, either by modification of the shortcut, or command line batch syntax?

This is what I use

http://www.realtimesoft.com/ultramon/

Definitely worth the money

axan · Jun 15, 2009

ya ultramon kicksass, make sure u get the beta if you're using it on vista

AreEss · Jun 15, 2009

nVidia has a built-in with the driver, from the right click menu on the title bar. It does not work reliably. Ultramon is okay.

sirmonkey1985 · Jun 15, 2009

ultramon is better than the built in one from nvidia.. also as long as you dont restart/shut down your system.. ultramon will memorize which screen a program is opened up on.. also if you open a shortcut on your right monitor it will open on the right monitor or if its on the left monitor it will open on that monitor.. instead of opening everything on the primary display only..
been using it for a few years now on XP and couldnt survive running dual monitors without it..

nitrobass24 · Jun 15, 2009

AreEss said:
nVidia has a built-in with the driver, from the right click menu on the title bar. It does not work reliably. Ultramon is okay.

Nvidia is unreliable at best
Multimon is okay if you only have two monitors.
Whats better than UltraMon? Ive yet to find anything.

sirmonkey1985 said:
ultramon is better than the built in one from nvidia.. also as long as you dont restart/shut down your system.. ultramon will memorize which screen a program is opened up on.. also if you open a shortcut on your right monitor it will open on the right monitor or if its on the left monitor it will open on that monitor.. instead of opening everything on the primary display only..
been using it for a few years now on XP and couldnt survive running dual monitors without it..

Yes....but

If you go to the properties of the shortcut there is now a new tab.
There you can tell it to always open on Screen #1,2,3 etc.
http://www.realtimesoft.com/ultramon/tour/shortcuts.asp

sirmonkey1985 · Jun 16, 2009

nitrobass24 said:
Nvidia is unreliable at best
Multimon is okay if you only have two monitors.
Whats better than UltraMon? Ive yet to find anything.

Yes....but

If you go to the properties of the shortcut there is now a new tab.
There you can tell it to always open on Screen #1,2,3 etc.
http://www.realtimesoft.com/ultramon/tour/shortcuts.asp

im still using v2.7.1.. because they remove the right click start menu item/move to other monitor option with v 3.0.4 or some crap.. and ive just never bothered to update it again..

sabregen · Jun 16, 2009

it's a non-issue, at this point, but thank you for the responses. the system was delivered 2 hours ago to a very happy client.

nitrobass24 · Jun 16, 2009

Awesome
I bet you're relieved.......for now

sabregen · Jun 16, 2009

oh shut it. actually, other than the stability issues, the second scariest moment was when I was using the drill to start the pilot hole into the Silverstone PSU's steel casing, in case I slipped, and went clean through it. After that, the hand-tapping of the 6-32 holes to do the custom mounting was kind of hair raising, as well.

I am happy to have it delivered, so we can get paid...that being said...I sure could use a sweet gaming/encoding rig like that!

crazjayz · Jun 17, 2009

Do you have any pics of the system?

And it's good to hear that this problem is now behind you. I wonder if the client knows the trouble you went through for this build, because that's some hardcore technical service. Kudos to you

sabregen · Jun 17, 2009

The client had to be made aware of the technical difficulties that we were having with the system. I never did get around to taking pictures of it, because I was spending every second (until delivery) troubleshooting, and then testing the damned thing.

I just called the client about 5 minutes ago, and they are going through the AutoCAD licensing stuff, because this is the second box that they are installing AutoCAD on (technically, although the previous system will no longer be in use). The guy said that they were going to have to have their HVAC guys come in and change the location of the air return, as the room is getting hot now (gee, who'd have thunk it). He did say that he is just amazed at how fast the system is. My response:

"That was the point, right?"
