Which components get hot on an LSI 9207-8i HBA?

Mastakill

Hi all,

I think my current HBA (Dell H310) might be broken, as my ZFS pool keeps reporting corrupted data. Although it is actively cooled (by a 1000rpm 120mm fan at 7 cm distance), I can't think of any reason other than heat that would have caused it to die in less than a year. As these HBAs are meant for servers with overkill airflow (which explains the tiny heatsinks), I want to use "overkill heatsinks" on my replacement HBA.

I already found an old CPU heatsink that should fit (after sawing the bottom off) on the HBA chip itself. But I was wondering if there are perhaps other components on the HBA that could use a little additional cooling?
If so, then I can attach some more memory heatsinks to those components as well...

I'm afraid I don't know enough about electronic components to tell which of these tiny parts generate a lot of heat. Hopefully someone can help.

Below are some pictures where I tried to number the likely candidates...

The front:
20201229_160903.jpg


The back:
20201229_174714.jpg


The CPU cooler I'm planning to use for the main chip
20201229_161412.jpg


Thanks!
 
Only the main chip gets hot enough to worry about. With that flower-style heatsink, the active airflow is enough to cool the other components on the board.
 
Here is the end result after the mod:

I've used a stock Intel CPU cooler like this
AJ7W_1_201904282057844501.jpg

I used a Dremel to saw off the bottom part of the fins (so that it doesn't touch the PCI-e slot)
I bent the fins a little so the bolt fits in between
I used some washers from an old watercooling set (AMD Athlon 64 era)
I used the springs from the original LSI heatsink

And here is the result
20201230_175846.jpg

20201230_175857.jpg

20201230_175909.jpg


It probably won't win any beauty contest, but my NAS isn't an aquarium (no windowed side panel), so I don't really care about that ;)

Finding suitable washers was hard though, as it is a VERY close fit with the resistors
20201230_175945.jpg
20201230_175958.jpg


After a bit of trying, I managed to point my 120mm fan straight at it
20201230_204726.jpg


It now properly cools my LSI HBA, the x470 chipset, the 10Gbit LAN and my Optane, so I'm quite happy with the result.

But... My data corruption issue still isn't solved :( It seems like the HBA isn't the problem... I've already tried replacing the HBA and the PSU. I've thoroughly tested all HDDs (extended self test) and my RAM (ECC reporting works when I overclock it, and even when underclocked the data corruption still occurs). I've tried removing the Optane and many more things, but still nothing... For more details on my data corruption issue and what exactly I've tried:
https://www.truenas.com/community/t...unhealthy-pool-on-scrub-day.89182/post-622319
 
Moving air around inside the case doesn't do much to improve cooling. The key to keeping things cool is good case airflow .. the intakes need to deliver cool air to the component coolers, and the heated air coming out of the coolers needs to flow on out of the case without mixing with and heating the cool intake air. I use a cheap indoor/outdoor digital thermometer with a wired remote sensor and place the probe in the airflow going into the cooler fan. If that air is more than 2-3c warmer than the room (5-6c max), the case airflow needs to be modified to supply cooler air.
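
To make that rule of thumb concrete, here is a minimal Python sketch. The function name and the example readings are made up for illustration; only the thresholds (2-3c is fine, 5-6c is the maximum) come from the rule above.

```python
def intake_delta_verdict(room_c, intake_c, good_max=3.0, hard_max=6.0):
    """Compare the air temp entering a cooler with room temp.

    good_max / hard_max mirror the 2-3c and 5-6c figures from the post;
    the verdict labels are hypothetical.
    """
    delta = intake_c - room_c
    if delta <= good_max:
        verdict = "ok"
    elif delta <= hard_max:
        verdict = "marginal"
    else:
        verdict = "rework airflow"
    return delta, verdict


# Hypothetical readings: main unit in front of the case, probe in front of the cooler fan.
delta, verdict = intake_delta_verdict(room_c=21.0, intake_c=24.5)
print(f"intake is {delta:.1f}c above room: {verdict}")
```

The two readings map directly onto the thermometer setup described above: the main unit gives the room temp and the wired probe gives the cooler intake temp.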

The link below covers the basics of how airflow works and how to optimize case airflow. It's written for tower cases, but the principles apply to all cases.
https://hardforum.com/threads/basic-guide-to-improving-case-airflow.1987938/
 
Thanks a lot for the very correct feedback!

However, in this case, I think, it may be a slightly different situation...
  • My case temp is 31°C (2x 140mm intake fans, very obstructed by HDDs, and 1x unobstructed 140mm exhaust fan), so the case temp is ok, but the airflow that reaches the motherboard is suboptimal (which is why I added the additional internal 120mm fan).
  • This HBA is meant for usage in servers with high airflow. It is similar to server CPUs of more than 100W being "passively" cooled in servers with a small 1U heatsink. This is only possible if you have insane airflow.
  • I'm using this HBA in a low noise desktop, so "crazy" workarounds, like above, are required.
 
I'm impressed by your work, but think your airflow can be improved.

How are you monitoring for that 31c case temp?

Looking at your pic, it looks like you have air moving in all directions, which mixes heated air from the components with cool air. Keep in mind that every degree warmer the air entering a cooler is translates to the component being the same number of degrees hotter (at the same fan speed and load). If you are like me and room ambient is 20-21c, your case temp is 10-11c higher than room ambient. Lower the air temp entering the component cooler to 23-25c and components will be roughly 5-7c cooler. If failure is the result of high temp, being 5-7c cooler may be enough to keep it running.

What case do you have? Looks to be Fractal Design, and their included fans have such low pressure ratings they are basically garbage as they cannot overcome the resistance of even grill and filter at anything but full speed .. and even then they move little air.

Have you tried running it with case side cover off?


What model fans are the 2x 140mm intakes in front of the HDDs? I ask because many computer fans have very low pressure ratings. Actually, all of them do: a static pressure rating of 1.365mmH2O is about the same as the difference in air pressure between your feet and your chest/neck, 1.5m / 5 feet higher .. and the static pressure rating is with the fan at full speed. Static pressure drops off dramatically as fan speed is lowered.
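
As a rough sanity check on that comparison, the weight of an air column can be converted to mmH2O. This is only a sketch: the air density of 1.2 kg/m³ (roughly 20°C at sea level) is an assumption, and `air_column_mmh2o` is a made-up helper name. Under those assumptions 1.5m of air works out to about 1.8mmH2O, which is in the same ballpark as the rating quoted above.

```python
G = 9.80665             # standard gravity, m/s^2
RHO_AIR = 1.2           # kg/m^3, ASSUMED (about 20C at sea level)
PA_PER_MMH2O = 9.80665  # 1 mmH2O expressed in pascals (conventional value)


def air_column_mmh2o(height_m, rho=RHO_AIR):
    """Hydrostatic pressure difference over a column of air, in mmH2O."""
    pascals = rho * G * height_m
    return pascals / PA_PER_MMH2O


print(f"1.5 m of air is about {air_column_mmh2o(1.5):.2f} mmH2O")
```

The point stands either way: even a "high pressure" case fan only produces about as much pressure as a person-sized column of air.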
 
Thanks again for your good feedback!

The case temp is monitored by a sensor on the motherboard. It's called "Card Side Temp" in the IPMI, so I suppose it is close to some PCI-e slot.

I suppose you are correct that my case airflow is not perfect. As I'm using low pressure / low noise / low rpm Fractal Design / Scythe fans in a Fractal Design R6 case, with lots of obstruction by the HDDs, there is no "real" "airflow direction" (the air doesn't get propelled far enough for that). There is only cold air being gently pushed into the case (by the 2x 140mm front fans) and warm air being gently expelled by the 1x 140mm rear exhaust fan. Without this additional fan blowing air at the HBA / motherboard, there is hardly any airflow at all in that area. That is why I think this additional fan doesn't really "hinder" the (hardly-existing) airflow direction and is only positive. Keep in mind that this is a "silent" NAS server with no overclock at all and a Ryzen 3600 CPU. So unless required, I wouldn't want to replace these fans with noisier ones.

Regarding "If failure is the result of high temp":
I suppose you mean the data corruption issue in my server? I already confirmed last month that this is not temperature related, by placing my "good-old-120mm-Data-Center-Delta-fan" in the case with the side panel off. My wife almost kicked me out of the house because of the hellish noise, but I'm quite sure ANY temperature problem would have been solved by that fan ;)

But I do have good news regarding this problem... I found the issue this weekend!!! It is my CPU... After replacing my server's Ryzen 3600 with my desktop's Ryzen 3900X, the problem is gone!! I've done 6 scrubs over the weekend with the HBA in various PCI-e slots, with no errors!
Next step is to put the Ryzen 3600 back in to see if it was a mounting issue or a broken CPU...
 
The best place to monitor air temp is 15-30mm in front of the center of the component fan/s.

The stock 140mm FD case fans are rated 0.61mmH2O @ 1000rpm. While I rarely if ever run case fans above 1100rpm, I like 1300-1500rpm fans with 1.3-1.6mmH2O at max speed. That is about 1.0-1.1mmH2O @ 1000rpm. I like having 200-400rpm of extra speed for the rare times I want to run some extreme CPU load (like rendering graphics) when ambient temps are very high (hot summer day) and the filters are dirty. I seem to run into this kind of thing every year or two.

I use the cheapest indoor/outdoor digital thermometer with a wired remote sensor I can find. I've found them on ebay, Amazon, in automotive parts store accessories, etc. for $4-8.00. I fabricate a probe holder as shown in the link in post #4, mount the probe in front of the cooler fan, set the unit in front of the case and monitor the air temps entering the case (main thermometer unit temp) and in front of the cooler fan. That makes it easy to know the temperature differential.
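
For estimating static pressure at a reduced speed, one common idealization is the fan affinity law, where pressure scales with the square of rpm. The sketch below assumes that law holds for these fans (real fans deviate, so the manufacturer's curve is the better source), and `scaled_static_pressure` is a made-up helper name. Under that assumption, a 1.3mmH2O @ 1300rpm or 1.6mmH2O @ 1500rpm fan lands nearer 0.7-0.8mmH2O at 1000rpm, somewhat lower than the 1.0-1.1 figure above, which matches linear scaling instead.

```python
def scaled_static_pressure(p_rated_mmh2o, rpm_rated, rpm_target):
    """Estimate static pressure at rpm_target via the affinity law (pressure ~ rpm^2)."""
    return p_rated_mmh2o * (rpm_target / rpm_rated) ** 2


# Rated figures taken from the post; the 1000rpm results are estimates only.
for p_rated, rpm_rated in [(1.3, 1300), (1.6, 1500)]:
    est = scaled_static_pressure(p_rated, rpm_rated, 1000)
    print(f"{p_rated} mmH2O @ {rpm_rated}rpm -> about {est:.2f} mmH2O @ 1000rpm")
```

Either way, such a fan still has more pressure at 1000rpm than the stock fan's rated 0.61mmH2O.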

LOL. Yeah, probably not a temp problem.

I have a bad feeling it's the 3600 itself, not its cooling. Hope I'm wrong.
 
fyi:
It seems like the root cause was a broken CPU. After replacing the CPU (first temporarily with my desktop's Ryzen, later by RMAing it for a new Ryzen 3600), the problems are gone.

So the problem was not caused by TrueNAS <-> AMD incompatibility. Although perhaps not perfect, my experience with TrueNAS <-> AMD compatibility has not been a bad one.

It is a bit concerning that the cause of the data corruption wasn't detected by anything. Only the corrupted data itself got detected, by the scrub. But as I wasn't able to trigger any PCIe AER errors in Linux or Windows either (I tried this using my Optane instead of the HBA), I am not sure exactly which part of the CPU was broken, or whether the missing detection is a TrueNAS issue, a platform (AMD) issue, or perhaps a combination...
 