A Scalable Processing In-Memory Accelerator for Parallel Graph Processing (Tesseract)

Ehsan Yousefzadeh-Asl-Miandoab
5 min read · Mar 20, 2021

Introduction

Applications of the big data era demand high memory bandwidth, but traditional computing systems are not designed to provide it. Processing In-Memory (PIM) shows potential for efficient processing of such data-intensive applications. In particular, large-scale graph processing has attracted attention because of its broad applications, from social science to machine learning. However, graph processing suffers from poor data locality, and because there is little computation per vertex, memory access latency cannot be hidden behind computation [1, 2, 3]. Junwhan Ahn et al. [4] address this with Tesseract, a novel programmable PIM accelerator. Tesseract fully utilizes the available internal memory bandwidth and provides an efficient communication mechanism between memory partitions. It also offers a programming interface and two specialized prefetchers that match the memory access patterns of graph processing applications. The authors' comprehensive evaluation shows a 10x average performance improvement and an 87% average energy reduction.

Proposed Mechanism

To motivate moving computation into memory, Junwhan Ahn et al. show that if only the external bandwidth of Hybrid Memory Cubes (HMCs) is used, the host processors' bandwidth limits system performance: simply increasing the number of processing cores and HMCs leaves the much larger internal bandwidth of the cubes untapped. The authors then list the main challenges of incorporating the PIM paradigm into current computing systems: (1) fully utilizing the internal bandwidth in an energy-efficient way, (2) communicating between different memory partitions, and (3) providing a programming interface well aligned with the hardware design.

Overview of the Tesseract Architecture

The following figure shows the Tesseract architecture. To enable computation inside memory, a single-issue in-order core is placed in the logic layer of each vault; such a core is small enough to fit within a vault's area budget.

Tesseract acts as an accelerator attached to a host processor, which keeps its own main memory without PIM capability. Tesseract memory is mapped to a non-cacheable region of the host processor's address space, which removes the need for cache coherence between the host and Tesseract. Tesseract does not support virtual memory, because in-memory big data workloads usually do not need it and supporting it would incur non-trivial performance overhead.

The Tesseract Architecture
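
As a concrete illustration, the sketch below shows one simple way a host could partition a graph's vertices across Tesseract's memory partitions. The configuration evaluated in [4] has 16 HMCs with 32 vaults each, i.e., 512 partitions and cores; the block partitioning and the function names here are illustrative assumptions, not the actual Tesseract interface.

    // Minimal sketch (assumed, not from [4]): block-partitioning vertices
    // across Tesseract's memory partitions. 16 HMCs x 32 vaults = 512
    // partitions, each with one in-order core that owns a vertex range.
    #include <cstdint>
    #include <cstdio>

    constexpr uint32_t kCubes      = 16;                // HMCs in [4]'s setup
    constexpr uint32_t kVaults     = 32;                // vaults per HMC
    constexpr uint32_t kPartitions = kCubes * kVaults;  // 512 cores

    struct VaultAddr {
      uint32_t vault_id;     // which core/partition owns the vertex
      uint64_t local_index;  // index inside that partition's local DRAM
    };

    VaultAddr owner_of(uint64_t vertex_id, uint64_t num_vertices) {
      uint64_t per_vault = (num_vertices + kPartitions - 1) / kPartitions;
      return {static_cast<uint32_t>(vertex_id / per_vault),
              vertex_id % per_vault};
    }

    int main() {
      VaultAddr a = owner_of(1000000, 4000000);  // a 4M-vertex graph
      std::printf("vertex 1000000 -> vault %u, local index %llu\n",
                  a.vault_id, (unsigned long long)a.local_index);
      return 0;
    }

Block partitioning keeps the mapping trivial; a smarter graph partitioning that co-locates connected vertices would reduce the number of remote messages between vaults.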

Message Passing

The Tesseract cores have access only to their local DRAM partition; a low-cost message passing mechanism handles communication between cores. Message passing avoids cache coherence issues among the cores' L1 data caches, eliminates the need for locks to ensure atomic updates, and makes it possible to hide remote access latency through asynchronous message handling.

Tesseract employs two kinds of remote function calls: blocking and non-blocking. In a blocking remote function call, a local core asks a remote core to execute a specific function and send back the result. The performance inefficiency of blocking calls (cores stall while waiting for remote responses, and frequent interrupts cause context switches) motivates non-blocking remote function calls, in which a local core continues its execution immediately after issuing the request. The results of non-blocking calls are guaranteed to be visible only after a barrier. Because their execution can be delayed, a message queue is added to each vault so that non-blocking calls can be executed in batches: the queued functions run once the queue is full or a barrier is reached. This batching avoids the context-switching overhead that frequent interrupts would otherwise induce. Note that the execution of a remote function is never preempted by other remote functions, so Tesseract does not need locks or other synchronization primitives.
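
To make the two call types concrete, here is a tiny single-threaded functional model. The names get, put, and barrier follow the primitives described in [4], but the signatures and the queue behavior below are simplifying assumptions, not Tesseract's real interface.

    // Toy single-threaded model of Tesseract-style remote function calls.
    // get(): blocking call, returns the remote result immediately.
    // put(): non-blocking call, queued; visible only after barrier().
    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    constexpr size_t kPartitions = 512;
    std::vector<std::queue<std::function<void()>>> msg_queue(kPartitions);

    // Blocking remote function call: the caller waits for the result.
    template <typename F>
    auto get(size_t vault_id, F func) {
      (void)vault_id;  // toy model: "remote" work runs locally
      return func();
    }

    // Non-blocking remote function call: enqueue on the target vault's
    // message queue and return immediately.
    template <typename F>
    void put(size_t vault_id, F func) {
      msg_queue[vault_id].push(std::function<void()>(func));
    }

    // Barrier: drain every queue; afterwards all put() effects are visible.
    void barrier() {
      for (auto& q : msg_queue)
        while (!q.empty()) { q.front()(); q.pop(); }
    }

    int main() {
      double rank[2] = {1.0, 1.0};
      put(0, [&] { rank[0] += 0.15; });  // queued, not yet applied
      put(1, [&] { rank[1] += 0.15; });
      double before = get(0, [&] { return rank[0]; });  // still 1.0
      barrier();  // batch execution happens here
      std::printf("before=%.2f after=%.2f %.2f\n", before, rank[0], rank[1]);
      return 0;
    }

In the real design, a queued function runs on the remote vault's core, and the queue is also drained when it fills up; this toy model only drains at the barrier.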

Prefetching

A single-issue in-order core cannot fully utilize the memory bandwidth on its own, because it stalls on L1 cache misses. To improve bandwidth utilization, two simple hardware prefetchers are added to each Tesseract core: a list prefetcher and a message-triggered prefetcher. The list prefetcher is a stride prefetcher based on [5], extended so that application software can supply the start address, element size, and stride of each list it traverses; these hints are kept in a four-entry list table. A stride prefetcher, however, cannot handle the random access patterns of graph processing. The message-triggered prefetcher therefore prefetches the data that a non-blocking remote function call will access before the call executes: a field is added to each non-blocking call packet indicating a memory address to prefetch. When a request carrying such a hint is inserted into the message queue, the message-triggered prefetcher issues the prefetch and marks the message as ready once the prefetch is serviced. When more than a predefined number of messages in the queue are ready, the queue interrupts the core to process them. As shown in the figure above, prefetched blocks are stored in a dedicated prefetch buffer rather than the L1 cache, so that other L1 accesses cannot evict them (under the cache's replacement policy) before they are used.
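
The sketch below illustrates how software could drive the two prefetchers. The names list_begin, list_end, and the prefetch-hint argument to put are assumptions modeled on the paper's description; the exact signatures in [4] may differ.

    // Hedged sketch of software-supplied prefetch hints (names assumed).
    #include <cstddef>
    #include <cstdint>

    struct Vertex { double pagerank; double next_pagerank; int out_degree; };

    // List prefetcher hints: declare a sequential traversal (start address,
    // element size, stride) so the hardware can stride-prefetch ahead; the
    // hints live in the core's four-entry list table.
    void list_begin(const void* start, size_t size, size_t stride) {}
    void list_end(const void* start, size_t size, size_t stride) {}

    // Non-blocking remote call carrying a prefetch hint: 'hint' is the
    // address the message-triggered prefetcher brings into the prefetch
    // buffer, so the data is ready when the queued function runs.
    void put(uint32_t vault_id, void (*func)(Vertex*), Vertex* arg, void* hint) {
      (void)vault_id; (void)hint;
      func(arg);  // stub: a real implementation queues this on the remote vault
    }

    void add_delta(Vertex* w) { w->next_pagerank += 0.15; }

    void scatter(Vertex* local, size_t n, uint32_t* nbr_vault, Vertex** nbr) {
      list_begin(local, sizeof(Vertex), sizeof(Vertex));  // linear scan hint
      for (size_t i = 0; i < n; ++i) {
        // local[i] is covered by the list prefetcher; the remote neighbor's
        // address doubles as the message-triggered prefetch hint.
        put(nbr_vault[i], add_delta, nbr[i], nbr[i]);
      }
      list_end(local, sizeof(Vertex), sizeof(Vertex));
    }

    int main() {
      Vertex local[4] = {};
      Vertex remote = {};
      uint32_t vaults[4] = {7, 7, 9, 9};
      Vertex* nbrs[4] = {&remote, &remote, &remote, &remote};
      scatter(local, 4, vaults, nbrs);
      return 0;
    }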

Programming Interface

To program Tesseract, [4] provides several primitives:

  • blocking and non-blocking remote function calls,
  • enabling and disabling interrupts from remote function calls,
  • copying data from local memory to a remote partition,
  • updating the list table (which holds the hints for list prefetching), and
  • a synchronization barrier across all Tesseract cores.
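
Putting the pieces together, here is a hedged sketch of one PageRank-style superstep built on such primitives. PageRank is the kind of workload [4] targets, but the put/barrier stubs and the lambda syntax below are illustrative stand-ins, not the verbatim Tesseract API.

    // Illustrative PageRank-style superstep over Tesseract-like primitives.
    // The stubs stand in for the hardware; only the usage pattern matters.
    #include <cstdio>
    #include <functional>
    #include <vector>

    struct Vertex { double pagerank, next_pagerank; int out_degree; };

    std::vector<std::function<void()>> pending;  // one queue, for brevity
    void put(int vault, std::function<void()> f) { (void)vault; pending.push_back(f); }
    void barrier() { for (auto& f : pending) f(); pending.clear(); }

    int main() {
      // Two vertices with one edge each way, pretending they live in
      // different vaults (0 and 1).
      std::vector<Vertex> v = {{1.0, 0.0, 1}, {1.0, 0.0, 1}};
      int dst[2] = {1, 0};  // v[0] -> v[1], v[1] -> v[0]

      for (int i = 0; i < 2; ++i) {
        double delta = 0.85 * v[i].pagerank / v[i].out_degree;
        int w = dst[i];
        // Scatter the contribution to the destination's owner vault; the
        // update becomes visible only after the barrier.
        put(w, [&v, w, delta] { v[w].next_pagerank += delta; });
      }
      barrier();  // all scattered contributions are now applied

      for (auto& u : v) std::printf("next_pagerank = %.3f\n", u.next_pagerank);
      return 0;
    }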

Conclusion

Due to the memory wall, PIM architectures have emerged as accelerators for data-intensive applications such as graph processing. Tesseract [4] is a programmable PIM accelerator for large-scale graph processing. This novel architecture features many in-order cores inside a memory chip, a message-passing mechanism for remote function calls, hardware prefetchers specialized for graph access patterns, and a programming interface that exploits the new hardware design.

For Future Reading

  • P. Chi et al., “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 27–39, DOI: 10.1109/ISCA.2016.13.

References

[1] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin, “PowerGraph: distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI ’12), USENIX Association, 2012, pp. 17–30.

[2] Harshvardhan, A. Fidel, N. M. Amato, and L. Rauchwerger, “KLA: A new algorithmic paradigm for parallel graph computations,” 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), Edmonton, AB, Canada, 2014, pp. 27–38, DOI: 10.1145/2628071.2628091.

[3] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10), Association for Computing Machinery, 2010, pp. 135–146, DOI: 10.1145/1807167.1807184.

[4] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 105–117, DOI: 10.1145/2749469.2750386.

[5] Tien-Fu Chen and Jean-Loup Baer, “Effective hardware-based data prefetching for high-performance processors,” in IEEE Transactions on Computers, vol. 44, no. 5, pp. 609–623, May 1995, DOI: 10.1109/12.381947.
