MCP: Memory Centric Processing
Data movement between processing units and memory leads to significant latency and power overheads. As shown by [1], a memory access consumes roughly 1000X the energy of a complex addition, and [2] shows that up to 60% of a system's total power consumption is due to data movement between memory and processor. The technology trend is therefore toward processing near, or inside, the memory where the data resides. The key idea of Memory Centric Processing (MCP) is to bring computation close to the data, cutting these performance and energy overheads. MCP devices act as accelerators for data-intensive applications such as neural networks, analytical processing, and graph processing. This article introduces MCP and eases the understanding of academic research proposals in this area.
In general, we can categorize MCP approaches into two groups: (1) near-memory processing and (2) in-memory processing.
The near-memory processing approach incorporates memory and logic in an advanced IC (Integrated Circuit) package, typically as a 3D-stacked chip consisting of memory layers and a logic layer. Well-known examples of this branch are the HMC (Hybrid Memory Cube) and HBM (High Bandwidth Memory) [3, 4]. HMC was co-developed by Micron and Samsung in 2011, while HBM was developed by AMD and SK Hynix and is also manufactured by Samsung. In particular, an HMC chip consists of several DRAM layers stacked on a logic layer and connected through high-bandwidth vertical links called TSVs (Through-Silicon Vias). The following figure shows the architecture of an HMC. The execution model is task offloading: the host processor hands a task to the in-package logic through a memory-mapped accelerator interface.
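To make that offload model concrete, here is a minimal, purely illustrative sketch of how a host might drive a memory-mapped accelerator interface: the device's control registers are mapped into the host address space, a task descriptor is written, and completion is polled. The device path, register offsets, and register semantics are hypothetical and not taken from any actual HMC or HBM product.

```c
// Hypothetical host-side sketch of a memory-mapped near-memory accelerator.
// Error handling is omitted for brevity.
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define REG_SRC_ADDR 0x00  /* where the operands live in the stacked DRAM (hypothetical) */
#define REG_LEN      0x08  /* number of bytes to process */
#define REG_START    0x10  /* writing 1 kicks off the in-package logic layer */
#define REG_STATUS   0x18  /* polled until the logic layer reports completion */

int main(void) {
    int fd = open("/dev/nmp_accel", O_RDWR);      /* hypothetical device node */
    volatile uint64_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

    regs[REG_SRC_ADDR / 8] = 0x100000;            /* operands already sit in device memory */
    regs[REG_LEN / 8]      = 1 << 20;
    regs[REG_START / 8]    = 1;                   /* offload the task */
    while (regs[REG_STATUS / 8] == 0) { /* host is free to do other work meanwhile */ }

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```

The point of the sketch is only the execution model: the host never streams the operands through its own caches; it merely describes the task and lets the logic layer next to the DRAM stacks do the work.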
Compared to near-memory processing, in-memory processing brings the computation inside the memory itself by integrating logic circuits next to the memory cells. A well-known example of this category is UPMEM PIM, in which DDR4 chips embed in-order, multithreaded DPUs (DRAM Processing Units), yielding large speedups together with energy gains. The following figure shows the architecture of a single PIM chip.
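As an illustration of what runs inside the memory, below is a minimal sketch of a DPU-side kernel in the style of the UPMEM DPU runtime: each tasklet (hardware thread) streams a slice of a buffer from the DPU's MRAM bank into its WRAM scratchpad, processes it, and writes it back. The buffer name, sizes, and the increment operation are placeholders rather than code from any specific UPMEM example.

```c
// dpu_kernel.c - sketch of a DPU-side kernel (UPMEM DPU runtime style).
// Build with the UPMEM DPU toolchain; NR_TASKLETS is normally set via -DNR_TASKLETS=N.
#include <stdint.h>
#include <defs.h>   // me(): id of the current tasklet
#include <mram.h>   // mram_read / mram_write between MRAM and WRAM

#ifndef NR_TASKLETS
#define NR_TASKLETS 16
#endif
#define BUFFER_SIZE 2048   // bytes per DPU, placeholder size (must match the host)
#define CHUNK        256   // bytes moved per MRAM transfer (multiple of 8, <= 2048)

__mram_noinit uint8_t buffer[BUFFER_SIZE];   // symbol the host copies data into

int main(void) {
    __dma_aligned uint8_t cache[CHUNK];      // WRAM scratch buffer for this tasklet
    // Stride the tasklets across the MRAM buffer.
    for (unsigned int off = me() * CHUNK; off < BUFFER_SIZE; off += NR_TASKLETS * CHUNK) {
        mram_read(&buffer[off], cache, CHUNK);
        for (unsigned int i = 0; i < CHUNK; i++)
            cache[i] += 1;                   // the computation stays inside the memory chip
        mram_write(cache, &buffer[off], CHUNK);
    }
    return 0;
}
```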
MCP chips serve as accelerators for data-intensive applications, notably analytical processing and neural networks. UPMEM PIM programming resembles the GPU computing model, in which a computation kernel is offloaded to the device. The computation proceeds in three steps: first, the data moves from the host CPU to the accelerator; second, the accelerator performs the processing; finally, the processed data is transferred back to the CPU. For example, the training and testing phases of a CNN (Convolutional Neural Network) can be offloaded to a PIM, as proposed in [5]; the reported results are 303.2X faster and 48.3X more energy-efficient execution compared to a GPU.
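The three steps above map directly onto the UPMEM host API. The following sketch, based on the publicly documented UPMEM SDK host interface (dpu.h), offloads the hypothetical dpu_kernel from the previous sketch: copy in, launch, copy back. The binary path and symbol name are placeholders and must match the DPU program.

```c
// host.c - sketch of the three-step offload model using the UPMEM SDK host API.
#include <dpu.h>
#include <stdint.h>
#include <string.h>

#define BUFFER_SIZE 2048   // must match the DPU-side buffer symbol

int main(void) {
    struct dpu_set_t set;
    uint8_t in[BUFFER_SIZE], out[BUFFER_SIZE];
    memset(in, 1, sizeof(in));

    DPU_ASSERT(dpu_alloc(1, NULL, &set));               // reserve one DPU
    DPU_ASSERT(dpu_load(set, "./dpu_kernel", NULL));     // load the DPU binary

    // Step 1: move the data from the host CPU into the DPU's MRAM.
    DPU_ASSERT(dpu_copy_to(set, "buffer", 0, in, BUFFER_SIZE));
    // Step 2: run the kernel inside the memory chip.
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));
    // Step 3: move the processed data back to the host.
    DPU_ASSERT(dpu_copy_from(set, "buffer", 0, out, BUFFER_SIZE));

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

In a real deployment the host would allocate thousands of DPUs and partition the data across them, which is where the bandwidth and energy advantage of computing inside each memory chip comes from.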
In a nutshell, MCP is an approach to tackle the data movement bottleneck of conventional computing systems built on the Von Neumann architecture, a bottleneck that costs both performance and energy. Real-world examples of the near- and in-memory approaches are HMC, HBM, and UPMEM PIM, with many variants proposed in academic papers.
For Further Reading
- C. Zhang, T. Meng and G. Sun, “PM3: Power Modeling and Power Management for Processing-in-Memory,” 2018 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, 2018, pp. 558–570, DOI: 10.1109/HPCA.2018.00054.
References
[1] Dally, HiPEAC 2015.
[2] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks,” SIGPLAN Not. 53, 2, 2018, pp. 316–331, DOI: 10.1145/3296957.3173177.
[3] J. T. Pawlowski, “Hybrid memory cube (HMC),” 2011 IEEE Hot Chips 23 Symposium (HCS), Stanford, CA, USA, 2011, pp. 1–24, DOI: 10.1109/HOTCHIPS.2011.7477494.
[4] J. C. Lee et al., “High bandwidth memory (HBM) with TSV technique,” 2016 International SoC Design Conference (ISOCC), Jeju, Korea (South), 2016, pp. 181–182, DOI: 10.1109/ISOCC.2016.7799847.
[5] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision,” 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 2019, pp. 802–815.