PRIME: A Novel Processing In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Introduction
The Processing In-Memory (PIM) paradigm has recently emerged as a promising solution to the “memory wall” challenge of conventional computing systems built on the Von Neumann architecture [1,2,3,4]. Neural Networks (NN) and Deep Learning (DL) deliver state-of-the-art results in applications such as image/speech recognition and natural language processing. These applications require a large memory capacity as the size of the NN grows, and accelerating them demands high memory bandwidth because the processing units must continually fetch weights. The resulting data movement, of inputs and outputs as well as weights, is the main barrier to further performance improvement and energy saving. To address this, Ping Chi et al. [5] propose PRIME, a novel PIM architecture for efficient NN computation built on Resistive Random Access Memory (ReRAM) crossbar arrays. PRIME executes NN applications in full-function subarrays to improve performance and energy efficiency. The authors’ experimental results show ~2360x performance improvement and ~895x energy reduction compared with a state-of-the-art neural processing unit design.
Background
Resistive RAM
Resistive Random Access Memory (ReRAM) is a non-volatile memory that stores information by changing cell resistance. The following figure shows the metal-insulator-metal structure of a ReRAM cell. By applying an external voltage across a ReRAM cell, it switches between a High Resistance State (HRS) and a Low Resistance State (LRS), which represent the logic “0” and “1”, respectively.
Switching a cell from HRS to LRS is called set, and switching from LRS to HRS is called reset. Setting or resetting the cell requires a positive or negative voltage, respectively, that can generate sufficient write current. The following figure shows the crossbar structure of ReRAM.
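To make the set/reset behavior concrete, here is a minimal sketch of a single binary ReRAM cell as a toy state model. The threshold voltages and class interface are illustrative assumptions for the example, not device parameters or an API from the paper.

```python
# Toy model of a binary ReRAM cell: a positive write voltage above an assumed
# set threshold switches the cell from HRS to LRS (logic 1); a sufficiently
# negative voltage resets it back to HRS (logic 0). Threshold values are
# illustrative placeholders.
class ReRAMCell:
    V_SET = 1.5      # assumed set threshold (volts)
    V_RESET = -1.5   # assumed reset threshold (volts)

    def __init__(self):
        self.state = "HRS"   # high-resistance state encodes logic 0

    def apply_voltage(self, v):
        """Apply a write voltage; switch state only if it exceeds a threshold."""
        if v >= self.V_SET:
            self.state = "LRS"    # set: HRS -> LRS
        elif v <= self.V_RESET:
            self.state = "HRS"    # reset: LRS -> HRS

    def read_bit(self):
        """Reads return logic 1 for LRS and logic 0 for HRS."""
        return 1 if self.state == "LRS" else 0

cell = ReRAMCell()
cell.apply_voltage(2.0)   # set
assert cell.read_bit() == 1
cell.apply_voltage(-2.0)  # reset
assert cell.read_bit() == 0
```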
The read latency of ReRAM is comparable to that of DRAM, while its write latency is significantly longer (~5x) than that of DRAM.
The Multi-Level Cell (MLC) structure is an efficient approach for improving the density and reducing the cost of ReRAM. In the MLC structure, a single ReRAM cell can store more than one bit of information by using multiple resistance levels. This MLC characteristic is realized by changing the resistance of the ReRAM cell gradually with finer write control.
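As a numerical illustration of multi-bit encoding, the sketch below maps an n-bit value onto one of 2^n evenly spaced conductance levels and reads it back by nearest-level matching. The 2-bit cell and the conductance range are illustrative assumptions, not device figures from the paper.

```python
import numpy as np

# Minimal sketch of MLC storage: an n-bit value is written as one of 2**n
# conductance levels between an assumed G_MIN and G_MAX, and read back by
# picking the nearest programmed level. Values below are illustrative.
G_MIN, G_MAX = 1e-6, 1e-3                      # assumed conductance range (siemens)
BITS = 2
LEVELS = np.linspace(G_MIN, G_MAX, 2 ** BITS)  # four distinguishable levels

def write_mlc(value):
    """Program a 2-bit value (0..3) as a target conductance level."""
    return LEVELS[value]

def read_mlc(conductance):
    """Recover the stored value by choosing the nearest programmed level."""
    return int(np.argmin(np.abs(LEVELS - conductance)))

stored = write_mlc(2)
assert read_mlc(stored * 1.02) == 2   # tolerates small write/read noise
```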
Neural Networks Acceleration in ReRAM
Because of the challenges faced by recent work on neuromorphic systems, ReRAM has become a promising candidate for building area-efficient synaptic arrays for NN computation [6,7], thanks to its natural crossbar structure. The following figure depicts a 3x3 ReRAM crossbar array executing the neural network shown on the left side of the figure. Note that the synaptic weights are programmed into the cell conductances of the crossbar array.
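The crossbar performs a matrix-vector multiplication in the analog domain: each bitline current is the sum of the input voltages weighted by the cell conductances in that column (I_j = sum_i V_i * G_ij). The sketch below is an idealized numerical illustration of that behavior; the 3x3 size matches the figure, while the voltage and conductance values are made up for the example.

```python
import numpy as np

# Idealized model of the 3x3 crossbar: input voltages drive the wordlines,
# synaptic weights are stored as cell conductances, and each bitline current is
# the analog dot product I_j = sum_i V_i * G[i, j] (Ohm's law plus current
# summation). All numbers are illustrative.
G = np.array([[0.2, 0.5, 0.1],    # conductances encoding the synaptic weights
              [0.7, 0.3, 0.9],
              [0.4, 0.8, 0.6]])
V = np.array([1.0, 0.5, 0.25])    # input voltages applied to the wordlines

I = V @ G                         # bitline currents = matrix-vector product
print(I)                          # one accumulated current per output neuron
```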
Implementing NNs with ReRAM crossbar arrays requires specialized peripheral circuit design. For example, digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are needed for analog computing. A sigmoid unit and a subtraction unit are also required, since matrices with positive and negative weights are implemented as two separate crossbar arrays.
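Below is a rough sketch of that peripheral flow under the assumption of ideal converters: the signed weight matrix is split into a positive and a negative crossbar, the two output currents are subtracted, and a sigmoid is applied. DAC/ADC quantization effects are omitted, and all values are illustrative.

```python
import numpy as np

# Sketch of the peripheral-circuit flow: a signed weight matrix is split into
# two non-negative crossbars (positive and negative parts), their outputs are
# subtracted, and a sigmoid is applied to the difference. Converters are
# treated as ideal here.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = np.array([[ 0.5, -0.3],
              [-0.8,  0.2],
              [ 0.1,  0.7]])        # signed synaptic weights (illustrative)
W_pos = np.clip(W, 0, None)         # programmed into the "positive" crossbar
W_neg = np.clip(-W, 0, None)        # programmed into the "negative" crossbar

x = np.array([1.0, 0.5, 0.25])      # input vector (after the DACs)
out = sigmoid(x @ W_pos - x @ W_neg)   # subtraction unit + sigmoid unit
print(out)
```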
Proposed Mechanism
The PRIME architecture is an NN accelerator that, instead of adding logic to memory, uses the memory arrays themselves for computation, so its area overhead is very small. Note that PRIME does not rely on 3D stacking technology. The following figure shows an overview of the PRIME architecture. While most previous work needs additional processing units, PRIME directly uses ReRAM cells to perform computation without extra PUs. PRIME partitions a ReRAM bank into three regions: memory (Mem) subarrays, which only store data; Full Function (FF) subarrays, which can both compute and store data; and buffer subarrays, which serve as data buffers for the FF subarrays. The PRIME controller manages the operation and reconfiguration of the FF subarrays. PRIME uses the memory subarrays closest to the FF subarrays as buffer subarrays; they are connected to the FF subarrays through private data ports, so buffer accesses do not consume the bandwidth of the Mem subarrays. When the buffer subarrays are not needed for buffering, they can be used as normal memory.
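As a rough illustration of this reconfigurability (not the controller's actual interface), the sketch below models a bank whose FF and buffer subarrays can be toggled by the controller between a plain-memory role and their NN roles, while Mem subarrays only store data. Class names, methods, and subarray counts are hypothetical.

```python
# Toy model of a PRIME bank's three regions and controller-driven
# reconfiguration of the FF subarrays. Names and counts are hypothetical
# illustrations, not the actual PRIME controller interface.
class Subarray:
    def __init__(self, kind):
        self.kind = kind          # "mem", "ff", or "buffer"
        self.mode = "memory"      # FF/buffer subarrays start as plain memory

class Bank:
    def __init__(self, n_mem, n_ff, n_buf):
        self.mem = [Subarray("mem") for _ in range(n_mem)]
        self.ff = [Subarray("ff") for _ in range(n_ff)]
        self.buf = [Subarray("buffer") for _ in range(n_buf)]

    def configure_for_nn(self):
        """Controller switches FF subarrays to compute mode, buffers to buffering."""
        for s in self.ff:
            s.mode = "compute"
        for s in self.buf:
            s.mode = "buffering"

    def release(self):
        """When no NN is running, everything falls back to ordinary memory."""
        for s in self.ff + self.buf:
            s.mode = "memory"

bank = Bank(n_mem=60, n_ff=2, n_buf=2)   # counts are illustrative
bank.configure_for_nn()
```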
The following figure shows the PRIME software stack that supports NN programming, allowing developers to easily configure the FF subarrays for NN applications. From software programming to hardware execution there are three stages (a compile-stage sketch follows the list):
- Programming (coding)
- Compiling (code optimization)
- Code Execution
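As a hedged illustration of what the compiling (code optimization) stage's mapping step might involve, the sketch below tiles a layer's weight matrix into crossbar-sized blocks so each block can be programmed into one ReRAM mat of an FF subarray. The 256x256 crossbar size and the helper name are assumptions made for this example only.

```python
import numpy as np

# Illustrative compile-stage task: split a layer's weight matrix into
# crossbar-sized tiles so each tile fits one ReRAM mat of an FF subarray.
# The 256x256 tile size is an assumption for the example.
CROSSBAR_ROWS = 256
CROSSBAR_COLS = 256

def tile_weights(W, rows=CROSSBAR_ROWS, cols=CROSSBAR_COLS):
    """Return a list of ((row_offset, col_offset), tile) pairs covering W."""
    tiles = []
    for r in range(0, W.shape[0], rows):
        for c in range(0, W.shape[1], cols):
            tiles.append(((r, c), W[r:r + rows, c:c + cols]))
    return tiles

W = np.random.randn(1000, 800)           # a hypothetical fully connected layer
mapping = tile_weights(W)
print(len(mapping), "tiles to program")  # 4 x 4 = 16 tiles for this shape
```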
Conclusion
PRIME proposes a new processing-in-memory design built into ReRAM-based main memory that significantly improves performance and energy efficiency for neural network applications, benefiting from both the PIM architecture and the efficiency of ReRAM-based NN computation. In PRIME, part of the ReRAM memory arrays is enabled with NN computation capability; these arrays can either perform computation to accelerate NN applications or serve as memory to provide a larger working memory space. The experimental results show that PRIME achieves large speedups and significant energy savings for various MLP and CNN workloads.
For Further Reading
- M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision,” 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 2019, pp. 802–815.
- S. W. Keckler, W. J. Dally, B. Khailany, M. Garland and D. Glasco, “GPUs and the Future of Parallel Computing,” in IEEE Micro, vol. 31, no. 5, pp. 7–17, Sept.-Oct. 2011, DOI: 10.1109/MM.2011.89.
- Prezioso, M., Merrikh-Bayat, F., Hoskins, B. et al. “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature 521, 61–64 (2015). https://doi.org/10.1038/nature14441
- Alibart, F., Zamanidoost, E. & Strukov, D., “Pattern classification by memristive crossbar circuits using ex situ and in situ training,” Nat Commun 4, 2072 (2013). https://doi.org/10.1038/ncomms3072
- B. Liu et al., “Reduction and IR-drop compensations techniques for reliable neuromorphic computing systems,” 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 2014, pp. 63–70, doi: 10.1109/ICCAD.2014.7001330.
References
[1] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland and D. Glasco, “GPUs and the Future of Parallel Computing,” in IEEE Micro, vol. 31, no. 5, pp. 7–17, Sept.-Oct. 2011, DOI: 10.1109/MM.2011.89.
[2] B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3D-stacked DRAM,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 131–143, DOI: 10.1145/2749469.2750397.
[3] S. H. Pugsley et al., “NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads,” 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014.
[4] J. T. Pawlowski, “Hybrid memory cube: breakthrough DRAM performance with a fundamentally re-architected DRAM subsystem,” In Proceedings of Hot Chips Symposium (HCS), 2011.
[5] P. Chi et al., “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 27–39, DOI: 10.1109/ISCA.2016.13.