CSI vs. Hypertransport: The Facts

ElMoIsEviL · Apr 11, 2009

I've seen previous posts by some members indicating that CSI is inferior to Hypertransport. This could not be further from the truth. CSI offers less bottlenecks and lower latencies with nearly identical bandwidth in it's current form.

CSI is a Point to Point interconnect and not a Point to Hub and Hub to Point as some have alluded. CSI does act as a Point to Hub interconnect (the same way HTT does) this HUB is between the CPU QPi link and the PCI Express, onboard sound and other addon peripherals (I/O Controller HUB such as the X58. Both the memory (RAM) and other CPUs have direct CSI point to point interconnects (even in crossbar formation) without going through the HUB. This is much the same way that AMD and nVidia chipsets act as hubs for AMD K8 and newer based HyperTRansport designs.

(As you can see here the point to point QPi Links are in dark solid lines while the cut up lines are the QPi links for I/O (to the I/O hub). See, no interaction with the HUB direct point to point between CPUs and memory. Each CPU (or node) connects directly via it's own QPi link to the I/O hub. There are no bottlenecks. QPI > Hypertransport due to the amount of HOPs which I'll explain later. Only interaction is with the HUB when it comes to I/O (PCI Express and other add-ons peripherals such as PCI slots etc).

Figure 6 2 and 4P CSI System Diagrams [2] [34]

Now compare that with AMD:

Looks nearly the same now doesn't it? Except that QPi can cross link.. therefore actually less of a bottleneck. Hypertransport cannot cross link therefore inferior

CSI is actually superior to HyperTransport. Hypertransport is a 3HOP protocol design while CSI is a 2HOP protocol design (significantly lowering latency).

Figure 5 Critical Path Latency for Two and Three Hop Protocols

In a three hop protocol, such as the one used by AMDs Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from peer nodes and the data in memory are sent to the requesting processor. This transaction involves three point to point messages: requestor to home, home to peer and peer to requestor, and a read from memory before the data can be consumed.

Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.

In the rare case of a conflict, the home node is notified and will step in and resolve transactions in the appropriate order to ensure correctness. This could force one or more processor in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used, nor is on any critical paths, so it can be tuned for low leakage power.

http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&mode=print

The conclusion is that QPi is superior to HyperTransport and Nehalem just clobbers Opteron Shanghai. There is no if and or buts. It is what it is.

I am now curious to see how Nehalem EX (6 core 32nm) will fare against AMD Instanbul (6 core 45nm Shanghai).

Mr. Bluntman · Apr 12, 2009

Great, informative post there. I'm sure with Intel's process and execution engine superiority it will be a slaughter against Istanbul.

AreEss · Apr 20, 2009

The Opteron diagrams are incorrect. Please refrain from using deliberately misleading diagrams.

Any Opteron 2000-series or 8000-series can establish a direct HT link to either I/O hub, providing the motherboard manufacturer isn't an idiot (see also Tyan.)

The SP5100 is not a common part, because it is not a good part. Good luck finding an SP5100 board; they don't exist. All production Opteron boards in the wild are based on the MCP55 and nForce3500-family chipset. There are exactly zero SP5100 based motherboards in production.

I'll deal with the rest of it later. I'm a rather large Sun customer, so frankly, I have a lot on my plate and not a lot of time to correct FUD.

bexamous · Apr 20, 2009

Uh... pretty sure that diagram is the exact same one AMD sends out. At least it looks like it from memory.

ElMoIsEviL · Apr 20, 2009

bexamous said:
Uh... pretty sure that diagram is the exact same one AMD sends out. At least it looks like it from memory.

LOLz yeah it is AMD's diagram:
http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_14112,00.html

It is.. it's from AMD..

ElMoIsEviL · Apr 20, 2009

AreEss said:
The Opteron diagrams are incorrect. Please refrain from using deliberately misleading diagrams.

Any Opteron 2000-series or 8000-series can establish a direct HT link to either I/O hub, providing the motherboard manufacturer isn't an idiot (see also Tyan.)

The SP5100 is not a common part, because it is not a good part. Good luck finding an SP5100 board; they don't exist. All production Opteron boards in the wild are based on the MCP55 and nForce3500-family chipset. There are exactly zero SP5100 based motherboards in production.

I'll deal with the rest of it later. I'm a rather large Sun customer, so frankly, I have a lot on my plate and not a lot of time to correct FUD.

Bug off.. it's from AMD..

You can't correct what isn't FUD. I corrected your post about QPi which was full of FUD. This is why I made this.

Bug AMD about the diagram. That is also how their technical documents explain HTT because that's how it works.

Block_Diagram_for_Socket_F1207White_Background_375W.jpg

Notice the 3 HT Links for the new 45nm Shanghai Opterons? It's taken from this page: http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8796_15223,00.html

Look at the diagram above. You would need a crossbar link for what you propose for the TOP Opterons which AMD currently do not allow.

Here's a diagram of the nForce you quoted (actually newer variant with HT3 support):

CPU0 needs to go through CPU1 in order to get to the I/O. No cross link as you see with QPi. Again all your posts on this topic (QPi) have been FUD and I am correcting you as some people actually believe you.

jimmyb · Apr 20, 2009

ElMoIsEviL said:
Looks nearly the same now doesn't it? Except that QPi can cross link.. therefore actually less of a bottleneck. Hypertransport cannot cross link therefore inferior

What makes you think you can't crosslink with Hypertransport? To my knowledge there's nothing in the spec which dictates the topology it can construct.

Edit: From the Hypertransport specification:

Hypertransport specification said:
HyperTransport I/O fabrics are implemented as one or more daisy chains of HyperTransport
technology-enabled devices, with a bridge to the host system at one end.

It's still possible to create a fully connected topology though. It requires multiple HT chains connected with switches, something explicitly documented in the spec, as well as in other places. I presume a fully connected network with QP requires a very similar implementation.

sphinx99 · Apr 21, 2009

The last link in the previous post (http://www.hypertransport.org/docs/HT_General_Overview_02-08.pdf) - page 40 suggests full cross-linking capability.

jimmyb · Apr 21, 2009

sphinx99 said:
The last link in the previous post (http://www.hypertransport.org/docs/HT_General_Overview_02-08.pdf) - page 40 suggests full cross-linking capability.

Indeed.

I think the OP is mistakenly confusing a HT-based cache coherency implementation (and IO connect), with HT itself.

AreEss · Apr 21, 2009

bexamous said:
Uh... pretty sure that diagram is the exact same one AMD sends out. At least it looks like it from memory.

You'll note a lack of me arguing otherwise.

AMD's marketing makes IBM's OS/2 marketing department look like absolute GENIUSES. "Here let's use a bad and incorrect diagram linked to a chipset we don't even sell that isn't available in production!" That doesn't make it any less incorrect, just oh so typical for AMD materials.

Also, HT is a self-routed packet-based system. Any HT link can link to any other HT link at any time in any place, provided wires are available. CPU0 can go to CPU4 and CPU4 can go to MCP55 at the same time on different links. CPU0 and CPU4 can both go to MCP55 at the same time. Etcetera. Also, HT does full crosslinking, multiple crosslink, and multilink.
You are correct that QPI does require a full routing implementation including wiring, like HT. However, QPI requires a greater wire count than HT and also has limitations on how far out you can go. QPI also has a significant implementation limitation in that you cannot do two key things; you cannot mix versions/types without translating down and up again, and you cannot implement differing clock or channels. So a system with 16-bit QPI at CPU and 20-bit QPI at the IOHub must have both implemented on the IOHub OR must have an interposer which up-rates the 16-bit to 20-bit. Partial channel implementations are reserved for resiliency functions only. HT can be implemented at any clock divisible by base and at any bit width divisible by 8.

EDIT: Also, that's a high-level, incomplete reference implementation diagram for a "low-cost" solution where CPU1 is required to enable additional HT links to IO. Tyan is the only manufacturer which implemented that design. CPU1 is still not a required component to access additional links, as it can be wire-bypassed. Literally, I think it's 4 wires to jumper the connection total? (The spec mentions minimum wire count.)

ElMoIsEviL · Apr 23, 2009

sphinx99 said:
The last link in the previous post (http://www.hypertransport.org/docs/HT_General_Overview_02-08.pdf) - page 40 suggests full cross-linking capability.

Opteron CPUs are limited to 3 Hyper transport connections currently per CPU. You cannot Crosslink in that regard. Currently the Specs don't allow for Hypertransport cross linking due to the amount of HT links present on current Opterons.

If you take the original diagram, CPU0 and CPU1 (the top ones) each have a link unused but CPU2 and CPU3 are fully saturated.

CSi does not have this limitation in it's current implementations in the 2P and up coming 4P Xeon based parts (as of now there are no 8P based parts in the works that I know of). The current Xeon Nehalem has two QPi links while the 4P is said to have four.

I made this post after Aress made some statements that were incorrect in that QPi had a bottleneck because it was entirely HUB based (meaning EACH node had to connect to a single I/O for coherency). He stated that QPi was not point to point but rather point to hub. Which is and was at the time an outright lie.

Any Point to Point Connection can crosslink, you just need the available links to do so.

jimmyb · Apr 23, 2009

ElMoIsEviL said:
Opteron CPUs are limited to 3 Hyper transport connections currently per CPU. You cannot Crosslink in that regard. Currently the Specs don't allow for Hypertransport cross linking due to the amount of HT links present on current Opterons.

I haven't confirmed this but I'll assume what you're saying is correct. It seems your criticisms are of the Opteron and possibly chipset, and not Hypertransport itself which indeed supports fairly arbitrary topologies.

Any Point to Point Connection can crosslink, you just need the available links to do so.

This is not the case. To support a fully connected network, or just crosslinking, the interface needs to have some notion of a device, and some support for getting data routed across the network. Hypertransport, QP, and PCIe, among others, support this.

As a quick counter example, simpler point to point connections, such as the FSL by Xilinx (I use this as an example since I've been working with Xilinx stuff recently), don't support any of this. It's essentially just a FIFO with some basic handshaking mechanism for transferring data.

Conceivably you can wrap a routing system on top of these interfaces, but I think that misses the point somewhat since many PtoP interfaces support diverse topologies natively.

bexamous · Apr 23, 2009

AMD actually has 4HT links... you just can't use em'. EVENTUALLY when AMD stops being god damn stupid and making junk we'll have fully connected 4way and single hop 8way-- which would be neat since the current 2hop 8way systems are embarrassingly bad.

Anyways whatever, this would be better as "CSI vs AMD's implementations of HT" since who really cares about some spec that doesn't exist.

jimmyb · Apr 24, 2009

bexamous said:
Anyways whatever, this would be better as "CSI vs AMD's implementations of HT" since who really cares about some spec that doesn't exist.

I suspect their HT implementations are fairly complete and done to spec.
The objections have very little to do with HT itself.

sphinx99 · Apr 25, 2009

As an aside this does not change my personal opinion of the sheer compute power of the new Xeon platform. We've just gotten in our first DL380s equipped as such and they are amazing under real world computing scenarios.

bexamous · Apr 27, 2009

jimmyb said:
I suspect their HT implementations are fairly complete and done to spec.
The objections have very little to do with HT itself.

Sorry what I wrote didn't exactly make sense... I only meant to say that the bigger problem(s) with AMD (eg 4S not fully connected & 8S including 2hops) is not a flaw with HT but rather how AMD is implemented it. And the thread title being 'CSI vs. HT' is not the same as 'Inte's implementation of CSI vs AMD's implementation of HT' which matters more (IMO).

nicfolder · Sep 13, 2011

ElMoIsEviL said:
I've seen previous posts by some members indicating that CSI is inferior to Hypertransport. This could not be further from the truth.

The conclusion is that QPi is superior to HyperTransport and Nehalem just clobbers Opteron Shanghai. There is no if and or buts. It is what it is.

I am now curious to see how Nehalem EX (6 core 32nm) will fare against AMD Instanbul (6 core 45nm Shanghai).

Patriot Begs to Differ...

Patriot said:
One of my 4p boxen is intel... e7-4870 @2.4 GHZ its slower than AMD 4p considerably... It's rather sad actually...

Sad ?...

CSI vs. Hypertransport: The Facts

ElMoIsEviL

Gawd

Mr. Bluntman

Supreme [H]ardness

AreEss

2[H]4U

bexamous

[H]ard|Gawd

ElMoIsEviL

Gawd

ElMoIsEviL

Gawd

jimmyb

2[H]4U

sphinx99

[H]ard|Gawd

jimmyb

2[H]4U

AreEss

2[H]4U

ElMoIsEviL

Gawd

jimmyb

2[H]4U

bexamous

[H]ard|Gawd

jimmyb

2[H]4U

sphinx99

[H]ard|Gawd

bexamous

[H]ard|Gawd

nicfolder

Weaksauce