ElMoIsEviL
Gawd
Joined: Feb 7, 2006
Messages: 792
I've seen previous posts by some members claiming that CSI (since renamed QPI, QuickPath Interconnect) is inferior to HyperTransport. This could not be further from the truth. In its current form, CSI has fewer bottlenecks and lower latency than HyperTransport, with nearly identical bandwidth.
CSI is a point-to-point interconnect, not a point-to-hub and hub-to-point design as some have claimed. CSI does act as a point-to-hub interconnect (the same way HT does), but only for I/O: the hub sits between the CPU's QPI link and PCI Express, onboard sound and other add-on peripherals (the I/O Hub, such as the X58). The other CPUs are reached over direct point-to-point QPI links (even in a fully cross-linked arrangement), and memory (RAM) hangs off each CPU's own integrated memory controller, so neither goes through the hub. This is much the same way that AMD and nVidia chipsets act as hubs for AMD K8 and newer HyperTransport-based designs.
As you can see, the point-to-point QPI links between the CPUs are the dark solid lines, while the dashed lines are the QPI links for I/O (to the I/O hub). CPU-to-CPU and CPU-to-memory traffic never touches the hub. Each CPU (or node) connects directly to the I/O hub via its own QPI link, so there are no bottlenecks there. QPI > HyperTransport because of the number of hops, which I'll explain later. The only interaction with the hub is for I/O (PCI Express and other add-on peripherals such as PCI slots).
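To make the "CPU traffic never touches the hub" point concrete, here is a minimal Python sketch that models the 2P diagram as a graph and finds shortest paths with breadth-first search. The node names (`CPU0`, `CPU1`, `IOH`) and the `LINKS_2P` table are my own simplified assumptions based on the diagram, not anything from Intel's documentation. Memory is left out because each CPU reaches its RAM through its own memory controller.

```python
from collections import deque

# Assumed, simplified model of the 2P CSI diagram: the two CPUs are linked to
# each other by QPI, and each CPU has its own QPI link to the I/O hub (IOH).
LINKS_2P = {
    "CPU0": ["CPU1", "IOH"],
    "CPU1": ["CPU0", "IOH"],
    "IOH": ["CPU0", "CPU1"],
}


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the nodes of one shortest path, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        return None
    # Walk back from dst to src through the recorded parents.
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


print(shortest_path(LINKS_2P, "CPU0", "CPU1"))  # ['CPU0', 'CPU1']
print(shortest_path(LINKS_2P, "CPU1", "IOH"))   # ['CPU1', 'IOH']
```

In this model the CPU-to-CPU path is a single link and never includes `IOH`; only I/O traffic uses the hub links.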
Figure 6: 2P and 4P CSI System Diagrams [2] [34]
Now compare that with AMD:
Looks nearly the same, doesn't it? Except that QPI can cross-link, so it's actually less of a bottleneck. HyperTransport cannot cross-link, therefore it's inferior.
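Here's a rough way to see why cross-linking matters: count the worst-case number of link hops between any two CPUs. This Python sketch assumes two simplified 4-socket layouts: `SQUARE`, where each CPU only links to its two neighbours (no diagonal links), and `FULLY_CONNECTED`, where every CPU links to every other one. Both are illustrative models I made up for the comparison, not exact copies of any shipping board.

```python
from collections import deque
from itertools import combinations


def hops(graph, src, dst):
    """Number of links on the shortest path from src to dst (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist[dst]


def worst_case_hops(graph):
    """Largest hop count between any pair of nodes (the graph diameter)."""
    return max(hops(graph, a, b) for a, b in combinations(graph, 2))


# Assumed 4P layouts: a square with no diagonals versus fully cross-linked.
SQUARE = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
FULLY_CONNECTED = {n: [m for m in range(4) if m != n] for n in range(4)}

print(worst_case_hops(SQUARE))           # 2
print(worst_case_hops(FULLY_CONNECTED))  # 1
```

In the square, CPUs on opposite corners are two links apart; with cross-links every CPU is one link from every other, so no request has to pass through a third socket.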
CSI is actually superior to HyperTransport. AMD's HyperTransport-based coherency protocol is a three-hop design, while CSI uses a two-hop protocol (significantly lowering latency).
Figure 5: Critical Path Latency for Two and Three Hop Protocols
In a three hop protocol, such as the one used by AMD's Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from peer nodes and the data in memory are sent to the requesting processor. This transaction involves three point to point messages: requestor to home, home to peer and peer to requestor, and a read from memory before the data can be consumed.
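The three-hop flow in the quote can be turned into a simple critical-path calculation. This is a toy Python model, not a simulation of real Opteron hardware: the `Costs` fields and the assumption that the home node starts its memory read at the same time as it snoops the peers are mine, and the numbers in the example are made-up arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    link: int    # one point-to-point message between two nodes
    snoop: int   # a peer node looking up the line in its cache
    memory: int  # the home node reading the line from DRAM


def three_hop_read(c: Costs) -> int:
    """Time until the requester can consume the data (three-hop protocol)."""
    t_home = c.link  # hop 1: requester -> home
    # Assumption: from here the home node snoops peers and reads memory in parallel.
    t_snoop_resp = t_home + c.link + c.snoop + c.link  # hop 2 home -> peer, hop 3 peer -> requester
    t_mem_data = t_home + c.memory + c.link            # memory read, then data home -> requester
    # Data is usable only after the memory data AND all snoop responses arrive.
    return max(t_snoop_resp, t_mem_data)


# Illustrative, made-up costs in arbitrary time units.
print(three_hop_read(Costs(link=1, snoop=1, memory=4)))  # 6
```

Note that the snoop path always costs three link traversals plus the lookup, no matter how fast memory is.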
Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.
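The same kind of toy model works for the two-hop flow, again in Python with hypothetical `Costs` in arbitrary units. It assumes that a peer holding the line sends the data back with its snoop response, and that otherwise the home node returns the data from memory. It is meant to show which messages sit on the critical path, not to reproduce Intel's actual timings.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    link: int    # one point-to-point message between two nodes
    snoop: int   # a peer node looking up the line in its cache
    memory: int  # the home node reading the line from DRAM


def two_hop_read(c: Costs, peer_has_line: bool) -> tuple[int, int]:
    """Return (time data is usable, time the transaction completes at home).

    The requester sends its snoops to every peer and to the home node at the
    same moment (hop 1), and the responses come straight back to it (hop 2).
    """
    t_snoop_resp = c.link + c.snoop + c.link  # requester -> peer -> requester
    if peer_has_line:
        # Assumption: the peer holding the line returns the data with its response.
        t_data = t_snoop_resp
    else:
        # The home node reads memory and sends the data straight to the requester.
        t_data = max(t_snoop_resp, c.link + c.memory + c.link)
    # Phase 3: the acknowledgement to home is sent after the data is usable,
    # so it only delays completion, not the consuming load.
    t_done = t_data + c.link
    return t_data, t_done


costs = Costs(link=1, snoop=1, memory=4)  # made-up, arbitrary time units
print(two_hop_read(costs, peer_has_line=True))   # (3, 4)
print(two_hop_read(costs, peer_has_line=False))  # (6, 7)
```

In this model the snoop path costs two link traversals instead of three. When the data has to come from DRAM, the memory read can still dominate the critical path.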
In the rare case of a conflict, the home node is notified and will step in and resolve transactions in the appropriate order to ensure correctness. This could force one or more processors in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used nor on any critical path, so it can be tuned for low leakage power.
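The quote doesn't spell out how the home node orders conflicting transactions, so this Python sketch just assumes a simple rule (the first request to reach the home node wins, the rest replay) to illustrate what "resolve transactions in the appropriate order" could look like. `Request` and `resolve_conflicts` are hypothetical names, and this is not the actual CSI conflict-resolution logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    node: str     # requesting processor
    address: int  # cache-line address
    arrival: int  # time the home node hears about the request


def resolve_conflicts(requests):
    """Toy home-node arbitration for requests that target the same line.

    Assumed rule (illustration only): the request that reached the home node
    first proceeds; every other request to that line must replay.
    Returns (winners, replays) as lists of requesting node names.
    """
    by_line = defaultdict(list)
    for req in requests:
        by_line[req.address].append(req)
    winners, replays = [], []
    for reqs in by_line.values():
        reqs.sort(key=lambda r: r.arrival)  # serialize in arrival order
        winners.append(reqs[0].node)
        replays.extend(r.node for r in reqs[1:])
    return winners, replays


reqs = [Request("CPU0", 0x40, 5), Request("CPU1", 0x40, 3), Request("CPU2", 0x80, 4)]
print(resolve_conflicts(reqs))  # (['CPU1', 'CPU2'], ['CPU0'])
```

Requests to different lines (`0x40` and `0x80` here) never conflict, which matches the quote's point that this path is rarely exercised.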
http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT082807020032&mode=print
The conclusion is that QPI is superior to HyperTransport and Nehalem just clobbers Opteron Shanghai. No ifs, ands or buts. It is what it is.
I am now curious to see how Nehalem EX (6 core 32nm) will fare against AMD Istanbul (6 core 45nm Shanghai).