PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing In-Memory Architecture

Ehsan Yousefzadeh-Asl-Miandoab
Mar 16, 2021

Processing In-Memory (PIM) architectures have been introduced as accelerators for data-intensive applications such as neural networks, because the memory wall has adversely impacted the performance and power consumption of traditional computing systems [1, 2, 3, 4]. However, integrating PIM architectures into existing systems is challenging because, at least for the near future, in-memory computation units demand unconventional programming models. The solution proposed by J. Ahn et al. [5] puts forward a new PIM architecture that keeps the existing sequential programming model intact and decides, based on data locality, whether each PIM operation should execute in memory or on the processor. New specialized instructions, called PIM-Enabled Instructions (PEIs), are added to the ISA to invoke in-memory computation while remaining interoperable with existing programming models, cache coherence, and virtual memory. To monitor the locality of the data accessed by PEIs, the authors propose a simple hardware structure that enables adaptive execution: an instruction runs on the host processor whenever it can benefit from the large on-chip caches. Their evaluations show significant performance improvements over both conventional systems and PIM-only systems.

The authors show the potential of PEIs by improving the performance of the PageRank algorithm. In its update phase, sending a word to the Hybrid Memory Cube (HMC), which serves as main memory, and performing the update there is beneficial because PageRank barely utilizes the caches. However, for cache-friendlier inputs such as p2p-Gnutella31 (a sequence of snapshots of the Gnutella peer-to-peer file-sharing network from August 2002), execution on the host processor can be far better than in-memory processing. When a PEI is issued, the proposed mechanism therefore dynamically chooses the better location, memory or host processor, on a per-operation basis.

The proposal limits the memory region accessible by a single PIM operation to a single cache block. This limitation provides three benefits. First, bounding each PIM operation to a single DRAM module ensures that PIM operations never require communication across modules and are serviced entirely through the internal TSVs. Second, PIM operations and normal accesses share the same memory access granularity, which keeps the hardware support for coherence and virtual memory simple. Third, profiling locality to determine the best execution location becomes simple, since a tag array mirroring the last-level cache suffices. Regarding the memory consistency model, this work guarantees atomicity between PEIs; however, to order PEIs with respect to normal instructions, programmers must insert fence (pfence) instructions in their code. Also, because the mechanism automatically decides where each PEI executes, it places no burden on compilers.
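To make this concrete, here is a minimal C sketch of the PageRank update phase expressed with PEIs. The __pim_add intrinsic and pfence() barrier are hypothetical stand-ins for the paper's pim.add instruction and PEI fence, and the data layout is illustrative:

```c
#include <stddef.h>

typedef struct Vertex {
    double rank, next_rank;
    struct Vertex **succ;   /* successor vertices */
    size_t nsucc;
} Vertex;

/* Hypothetical intrinsics standing in for the paper's pim.add and pfence. */
void __pim_add(double *addr, double value);  /* PEI: *addr += value, atomically */
void pfence(void);                           /* orders PEIs against normal accesses */

void pagerank_update(Vertex *v, size_t nv, double damping)
{
    for (size_t i = 0; i < nv; i++) {
        double contrib = damping * v[i].rank / (double)v[i].nsucc;
        for (size_t j = 0; j < v[i].nsucc; j++)
            /* Each PEI touches exactly one cache block; the hardware
             * decides per operation whether it runs in the HMC or on
             * a host-side PCU, based on the block's locality. */
            __pim_add(&v[i].succ[j]->next_rank, contrib);
    }
    pfence();  /* make all in-flight PEIs visible before the next phase */
}
```

Because each __pim_add targets a single cache block, every update can be shipped to the one HMC vault that owns the block, which is exactly what keeps coherence and virtual memory support cheap.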
The following figure shows the architecture proposed in [5]. Two new hardware components are added across the host processor and the HMC:

  • PCU: PEI Computation Unit, responsible for executing PEIs.
  • PMU: PEI Management Unit, coordinating all PCUs in the system.

It is crucial to note that, inside the HMC, one PCU is provided per vault.

PCUs are composed of computation logic and an operand buffer. The operand buffer, a small SRAM structure, exists to exploit memory-level parallelism during PEI execution: it stores information about in-flight PEIs. For each PEI, an operand buffer entry is allocated to keep its type, target cache block, and input/output operands. When the operand buffer is full, subsequent PEIs stall until an entry frees up.

The PMU performs three tasks: (1) atomicity management of PEIs, (2) cache coherence for PEIs, and (3) data locality profiling for locality-aware PEI execution. The PIM directory inside the PMU ensures the atomicity of PEIs and implements the pfence instruction. For coherence management, when the PMU receives a PEI, it knows which cache block the PEI will access; before sending the PIM operation to memory, it therefore simply requests a back-invalidation (for writer PEIs) or a back-writeback (for reader PEIs) of the target cache block from the last-level cache. The locality monitor inside the PMU is a tag array with the same number of sets and ways as the Last Level Cache (LLC); each entry contains a valid bit, a 10-bit partial tag, and replacement information bits. Each LLC access drives the corresponding hit promotion or block replacement in the locality monitor (a code sketch of this structure appears below).

As for virtual memory support, PEIs are part of the conventional ISA, so when a processor issues a PEI, it simply translates the virtual address of the target cache block through its own TLB. Consequently, all PCUs and the PMU handle PEIs using physical addresses only.
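As an illustration, the following C sketch models the locality monitor described above. The set/way counts and the timestamp-based LRU are illustrative assumptions rather than the paper's exact design; note that the 10-bit partial tag admits occasional aliasing, which can only misdirect where a PEI executes, never its correctness:

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 4096   /* illustrative; the real monitor mirrors the LLC geometry */
#define WAYS 16

typedef struct {
    bool     valid;
    uint16_t tag;    /* 10-bit partial tag */
    uint64_t last;   /* timestamp standing in for the replacement info bits */
} Way;

static Way      monitor[SETS][WAYS];
static uint64_t now;

/* Update the monitor for one cache-block address. Returns true on a hit,
 * i.e., the block likely lives in the LLC (prefer host-side execution). */
bool monitor_access(uint64_t block_addr)
{
    Way     *set = monitor[block_addr % SETS];
    uint16_t tag = (uint16_t)((block_addr / SETS) & 0x3FF);  /* keep 10 bits */
    int      victim = 0;

    now++;
    for (int w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last = now;            /* hit promotion: mark as MRU */
            return true;
        }
        if (!set[w].valid)
            victim = w;                   /* prefer filling an empty way */
        else if (set[victim].valid && set[w].last < set[victim].last)
            victim = w;                   /* otherwise pick the LRU way */
    }
    /* Miss: install the block, mimicking an LLC block replacement. */
    set[victim] = (Way){ .valid = true, .tag = tag, .last = now };
    return false;
}
```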
Additionally, the paper walks through two examples that illustrate the execution paths of a PEI:

  • Host-side execution of PEIs: the PEI is executed by a PCU at the host processor and accesses its data through the normal cache hierarchy, which pays off when the target block has high locality.
  • Memory-side execution of PEIs: the PMU performs the required coherence action on the LLC and forwards the operation to the PCU of the HMC vault that holds the target block.
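Putting the pieces together, here is a hypothetical sketch of the per-PEI decision the PMU makes; all helper functions are illustrative names for the mechanisms described above, not interfaces from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative hooks for the hardware mechanisms described above. */
bool monitor_access(uint64_t block_addr);      /* locality monitor (sketched earlier) */
void execute_on_host_pcu(uint64_t block_addr); /* run the PEI through the caches */
void llc_back_invalidate(uint64_t block_addr); /* drop cached copies before a write */
void llc_back_writeback(uint64_t block_addr);  /* flush dirty data a reader needs */
void send_to_vault_pcu(uint64_t block_addr);   /* ship the PEI to the owning vault */

typedef enum { PEI_READER, PEI_WRITER } PeiKind;

void dispatch_pei(uint64_t block_addr, PeiKind kind)
{
    if (monitor_access(block_addr)) {
        /* High locality: the block is likely on chip, so execute host-side. */
        execute_on_host_pcu(block_addr);
    } else {
        /* Low locality: make the memory copy coherent, then execute in the HMC. */
        if (kind == PEI_WRITER)
            llc_back_invalidate(block_addr);
        else
            llc_back_writeback(block_addr);
        send_to_vault_pcu(block_addr);  /* one cache block -> one vault, via TSVs */
    }
}
```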

To sum up, owing to the memory wall and the emergence of data-intensive applications, PIM architectures are rising in popularity as accelerators for those applications. The PIM-Enabled Instructions architecture of [5] proposes a practical model for processing in memory that is compatible with the existing cache hierarchy, coherence protocols, and virtual memory mechanisms. Moreover, the proposed architecture dynamically optimizes where each PEI executes according to the data locality of the application. This makes it considerably closer to practicality for near-future implementation than previous PIM work.

For Further Reading

  • M. Imani, S. Gupta, Y. Kim and T. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision,” 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 2019, pp. 802–815.

References

[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 105–117, doi: 10.1145/2749469.2750386.

[2] R. Balasubramonian et al., “Near-Data Processing: Insights from a MICRO-46 Workshop,” in IEEE Micro, vol. 34, no. 4, pp. 36–42, July-Aug. 2014, doi: 10.1109/MM.2014.55.

[3] G. H. Loh, N. Jayasena, M. H. Oskin, M. Nutter, D. Roberts, M. Meswani, D. P. Zhang, and M. Ignatowski, “A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM,” Workshop on Near-Data Processing (WoNDP), 2013.

[4] Wm. A. Wulf and Sally A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH Computer Architecture News, vol. 23, no. 1, pp. 20–24, March 1995, doi: 10.1145/216585.216588.

[5] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 336–348, doi: 10.1145/2749469.2750385.
