AMD patents co-compute unit (CCU) linked to L3 cache to avoid cache-thrashing of L1 cache (for memory-intensive workloads such as RT)

Is this AMD's version of RT cores?

Co-compute unit in lower-level cache architecture

Abstract

A processor includes compute units each including a first-level cache and each communicatively coupled to a co-compute unit (CCU) within a lower-level cache. In response to a compute unit receiving instructions to perform operations for an application, the compute unit determines one or more parameters based on the received instructions. The compute unit then sends the parameters and instructions to perform one or more operations on behalf of the compute unit to a respective CCU. The CCU then performs the operations based on the parameters and using the lower-level cache. Once the CCU has performed the operations, the CCU then sends the results of the operations back to the compute unit.

https://patents.google.com/patent/US20240264942A1
 
As such, systems and techniques disclosed herein are directed to performing operations for memory-intensive applications without causing cache-thrashing. To this end, a processing system includes a processor including one or more compute units each including or otherwise connected to a first-level cache. Such first-level caches are part of a cache hierarchy arranged by size with the first-level caches being the smallest caches in the cache hierarchy and the lower-level caches being larger in size than the first-level caches. Each compute unit of the processor is communicatively coupled to a co-compute unit (CCU) located within or otherwise connected to a lower-level cache (e.g., a third-level cache) of the cache hierarchy. That is to say, each compute unit is communicatively coupled to a CCU within or otherwise connected to a cache (e.g., a lower-level cache in a cache hierarchy) that is larger than the first-level cache. The CCUs each include, for example, one or more SIMDs configured to perform one or more operations for an application (e.g., a memory-intensive application) on behalf of a respective compute unit. To have a CCU perform one or more operations for an application (e.g., a memory-intensive application) on behalf of a respective compute unit, each compute unit is first configured to receive one or more instructions indicating one or more operations from an application. In response to receiving the instructions, the compute unit determines one or more parameters based on the received instructions. For example, the compute unit performs one or more operations indicated in the instructions to determine the parameters, identifies one or more parameters from the instructions, or both.
Such parameters include data defining one or more values necessary for, aiding in, or helpful for performing one or more operations, for example, required register files for an operation, memory requirements for an operation (e.g., the size of the data needed to perform the operation), default values for variables, formats for values (e.g., floating point format, integral format, pointer format), scalar parameters, vector parameters, or any combination thereof. The compute unit then sends the parameters, instructions to perform one or more operations on behalf of the compute unit, or both to a respective CCU (e.g., the CCU communicatively coupled to the compute unit). In response to receiving the parameters, instructions, or both, the CCU performs one or more operations on behalf of the compute unit based on the parameters using the lower-level cache. For example, the CCU establishes vector registers, scalar registers, or both in the lower-level cache that each store data (e.g., register files, operands) used to perform the operations. As another example, the CCU uses the lower-level cache to store data (e.g., instructions, operation results, values, operands) necessary for, aiding in, or helpful for performing the one or more operations. After performing the operations, the CCU then sends the results (e.g., data resulting from the performance of the operations) back to the compute unit, makes the results available (e.g., in a data buffer) to the compute unit, or both. Because the CCUs use a lower-level cache (e.g., larger cache) to perform the operations of memory-intensive applications on behalf of a respective compute unit, the likelihood that cache-thrashing occurs is reduced as the lower-level cache is large enough to store the data necessary for, aiding in, or helpful for performing these operations.
As the likelihood for cache-thrashing is reduced, the likelihood of the CCUs stalling or failing to progress when performing the operations is reduced, increasing the processing speed and processing efficiency of the processing system.
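
To make the cache-thrashing argument concrete, here is a toy sketch (mine, not from the patent). It assumes a simple LRU cache and a loop that repeatedly walks a fixed working set; the function name `hit_rate` and the sizes are made up for illustration and say nothing about real AMD cache sizes or replacement policies.

```python
from collections import OrderedDict


def hit_rate(capacity, working_set, passes=4):
    """Hit rate of a toy LRU cache while looping over `working_set` distinct lines."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for line in range(working_set):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)  # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used line
    return hits / accesses


# Hypothetical sizes: the working set (100 lines) does not fit the small cache.
small = hit_rate(capacity=64, working_set=100)    # every access misses
large = hit_rate(capacity=1024, working_set=100)  # only the first pass misses
print(small, large)
```

With these made-up numbers the small cache never hits, because each line is evicted before it is reused, while the larger cache only misses on the first pass. That is the effect the patent is trying to avoid by running the work next to the bigger cache.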
 
Such co-compute units include, for example, one or more SIMDs, scalar registers, vector registers, or both configured to perform one or more instructions of one or more applications 110. In embodiments, each co-compute unit within or otherwise connected to a cache of lower-level caches 124 is associated with and communicatively coupled to a respective processor core 116 (e.g., compute unit) and is configured to perform at least a portion of one or more operations on behalf of (e.g., for) the respective processor core 116.

To perform one or more operations, each co-compute unit is configured to use at least a portion of one or more caches of lower-level caches 124 (e.g., different-level caches). For example, each co-compute unit is configured to use at least a portion of the cache that the co-compute unit is within or otherwise connected to.

In response to performing one or more operations, each co-compute unit is configured to provide one or more results of the operations (e.g., data resulting from the operations) to a respective processor core 116 (e.g., compute unit), make one or more results of the operations available (e.g., in a data buffer) to a respective processor core 116, or both.
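
The offload flow described above (compute unit derives parameters, hands them to its CCU, CCU works out of the lower-level cache and returns the result) could be modelled roughly like this. Everything here is hypothetical: the class names, the `sum`/`max` operations, and the dict standing in for the lower-level cache are my own illustration, not the patent's design.

```python
from dataclasses import dataclass, field


@dataclass
class CoComputeUnit:
    """Hypothetical CCU; `cache` stands in for the lower-level cache it sits in."""
    cache: dict = field(default_factory=dict)

    def run(self, op, params):
        # Operands live in the lower-level cache, not the compute unit's first-level cache.
        self.cache["operands"] = list(params["operands"])
        if op == "sum":
            result = sum(self.cache["operands"])
        elif op == "max":
            result = max(self.cache["operands"])
        else:
            raise ValueError(f"unsupported op: {op}")
        self.cache["result"] = result  # result is also left available in the cache
        return result


@dataclass
class ComputeUnit:
    """Hypothetical compute unit coupled to exactly one CCU."""
    ccu: CoComputeUnit

    def execute(self, instruction):
        op, operands = instruction
        # Step 1: determine parameters from the received instruction.
        params = {"operands": operands, "size": len(operands)}
        # Step 2: send the operation and parameters to the CCU.
        # Step 3: receive the result back from the CCU.
        return self.ccu.run(op, params)


cu = ComputeUnit(ccu=CoComputeUnit())
result = cu.execute(("sum", [1, 2, 3, 4]))
print(result)
```

The point of the sketch is only the division of labour: the compute unit never touches the operand data itself, it just forwards parameters and gets a result back.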
 
For performing these operations, each CCU 236 is configured to use data (e.g., instructions, values, operands) stored in third-level cache L2 234.

For example, to execute one or more operations, each CCU 236 is configured to establish one or more registers 242 within third-level cache L2 234. Such registers 242 include, for example, respective vector registers, respective scalar registers, or both configured to store data (e.g., operands, results) used by a CCU 236 to perform one or more operations.

Such registers 242, for example, have a fixed size (e.g., have a predetermined size), have a dynamic size, or both. According to embodiments, each CCU 236 is configured to establish a register 242 in third-level cache L2 234 representing both a vector register and scalar register for the CCU 236, also referred to herein as a uniform register. In embodiments, one or more CCUs 236 are configured to establish one or more registers 242 as local registers.

Such local registers, for example, are not flushed from third-level cache L2 234 to memory 106. For example, one or more vector registers, scalar registers, uniform registers, or any combination thereof established by a CCU 236 are local registers.


Additionally, one or more CCUs 236 are configured to establish one or more registers 242 as non-local registers. Such non-local registers, for example, are flushed from third-level cache L2 234 to memory 106.
For example, one or more scalar registers established by a CCU 236 are non-local registers. Though the example architecture 200 of FIG. 2 illustrates CCUs 236 establishing eight registers (242-1, 242-2, 242-3, 242-4, 242-5, 242-6, 242-7, 242-8) in third-level cache L2 234, in other embodiments CCUs 236 may establish any number of registers 242 in third-level cache L2 234.
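
A rough way to picture the local versus non-local distinction: on a flush, only the non-local registers get written back to memory. The sketch below is an assumption-heavy illustration; `Register`, `flush`, and the register names are invented here, and the patent does not describe the flush mechanism at this level.

```python
from dataclasses import dataclass


@dataclass
class Register:
    """Hypothetical register established by a CCU inside the cache."""
    name: str
    value: int
    local: bool  # local registers are never flushed to memory


def flush(registers, memory):
    """Write back only the non-local registers; local ones stay in the cache."""
    for reg in registers:
        if not reg.local:
            memory[reg.name] = reg.value
    return memory


registers = [
    Register("v0", 7, local=True),    # e.g. a vector register kept local
    Register("s0", 42, local=False),  # e.g. a scalar register marked non-local
]
memory = flush(registers, {})
print(memory)
```

Here only `s0` ends up in `memory`, while `v0` exists purely in the cache-resident register space.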
 
OH NO... Zen 6 AI Cache marketing incoming.

This is interesting, provided they solve any latency issues this may imply.

So Intel has a "thread director" and AMD has a "cache director".
 
This is for GPUs. Note the references to SIMD all over the place.
 
