FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision
Introduction
Recently, Processing In-Memory (PIM) architectures have attracted attention as accelerators for data-intensive applications such as neural networks and real-time Big Data workloads. PIM offers high bandwidth and less data movement than conventional von Neumann systems, which keep memory and compute in separate units. PIM architectures have shown great potential in accelerating Convolutional Neural Networks (CNNs), especially for inference. However, existing PIM architectures do not provide high-precision computation such as floating point, which is crucial for training accurate CNN models. In addition, most existing PIM approaches built on Non-Volatile Memories (NVMs) such as Resistive Random Access Memory (ReRAM) require mixed-signal circuits to convert data between the analog and digital domains [2,4], which limits the scalability of these architectures. FloatPIM, proposed by Mohsen Imani et al. [1], is a fully digital, scalable PIM architecture that accelerates CNNs in both the training and testing phases. FloatPIM supports floating-point representation, enabling accurate CNN training, and provides fast communication between adjacent memory blocks to reduce internal data movement. Evaluations show that FloatPIM provides 5.1% higher classification accuracy than existing PIM architectures with limited fixed-point precision. The experiments also show significant speedups and energy savings compared to a GTX 1080 GPU and to the ISAAC [2] and PipeLayer [3] PIM accelerators.
Background
Conventional memristor processing uses ADCs/DACs to convert data between the analog and digital domains [4]. Digital PIM, in contrast, performs computation directly on the values stored in memory, without reading them out through sense amplifiers. Digital PIM designs by S. Kvatinsky et al. [5] and J. Borghetti et al. [6], with a fabricated demonstration in [7], implement logic using memristor switching. The following figure shows how a NOR gate is implemented in a row of ReRAM cells by applying appropriate voltages. Since NOR is a universal logic gate, any logic expression can be built from it.
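To make the universality point concrete, here is a minimal Python sketch (a software illustration, not a device-level model): the `nor` function stands in for the in-row ReRAM NOR primitive, and every other gate is composed purely out of it.

```python
# A software illustration of NOR universality (not a device-level model):
# `nor` stands in for the single in-row ReRAM NOR primitive, and every other
# gate below is composed purely out of it.

def nor(*bits):
    """The only primitive operation: NOR of the selected input cells."""
    return 0 if any(bits) else 1

def not_(a):
    return nor(a)

def or_(a, b):
    return nor(nor(a, b))

def and_(a, b):
    return nor(nor(a), nor(b))

def xor_(a, b):
    # (a OR b) AND NOT(a AND b)
    return and_(or_(a, b), not_(and_(a, b)))

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", "AND", and_(a, b), "OR", or_(a, b), "XOR", xor_(a, b))
```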
Digital PIM achieves maximum performance when the operands reside in the same row, because every bit of one operand can then interact with every bit of the other, which increases the flexibility of implementing operations in memory. In-memory operations are generally slower than the corresponding CMOS-based implementations because memristor devices are slow to switch. However, PIM can still provide significant speedup by exploiting massive parallelism across rows.
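The row-parallelism argument can be sketched in a few lines of numpy; the crossbar dimensions below are arbitrary placeholders, and the point is only that one column-level operation touches every row at once.

```python
# A numpy sketch of the row-parallelism argument: one column-level NOR
# "cycle" processes the chosen bit columns of every row at once, so the cost
# of the (slow) memristor switching is amortized over all rows. The crossbar
# dimensions are arbitrary placeholders.
import numpy as np

rows, cols = 1024, 8
xbar = np.random.randint(0, 2, size=(rows, cols), dtype=np.uint8)

def row_parallel_nor(xbar, in_cols, out_col):
    """Write NOR of the cells in `in_cols` into `out_col`, for all rows at once."""
    xbar[:, out_col] = 1 - np.bitwise_or.reduce(xbar[:, in_cols], axis=1)

# One logical PIM cycle operates on all 1024 rows simultaneously.
row_parallel_nor(xbar, in_cols=[0, 1], out_col=2)
```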
Proposed Mechanism
FloatPIM Overview
The following figures show the overview of the FloatPIM architecture, which consists of multiple crossbar memory blocks. The layers of a CNN are mapped to FloatPIM memory blocks for feed-forward computation. Each memory block represents a layer and stores the data used in either testing (the weights) or training (the weights, the output of each neuron, and the derivative of the activation function). FloatPIM operates in two phases: a computing phase, in which all memory blocks work in parallel, and a data-transfer phase, in which the memory blocks transfer their outputs to the blocks corresponding to the subsequent layers. Each memory block supports the in-memory operations needed for CNN computation, including vector-matrix multiplication, convolution, and pooling. FloatPIM also supports in-memory activation functions such as ReLU and Sigmoid, and implements MAX/MIN pooling with in-memory search operations. In addition, each block includes shifter circuitry to accelerate the shift operations required by convolution.
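As a rough mental model (the class and field names below are ours, not the paper's), each block can be thought of as holding the following per-layer state:

```python
# A hypothetical data-structure view of one FloatPIM memory block (names are
# ours, not the paper's): one block per CNN layer, holding the weights plus
# the intermediate values kept for training.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class FloatPIMBlock:
    weights: np.ndarray                          # layer weights (used in testing and training)
    neuron_output: Optional[np.ndarray] = None   # per-neuron output saved during feed-forward (training)
    act_derivative: Optional[np.ndarray] = None  # derivative of the activation function (training)
    output: Optional[np.ndarray] = None          # result handed to the next block in the transfer phase

# One block per layer; the computing phase fills `output`, and the
# data-transfer phase copies it into the next block.
network = [FloatPIMBlock(weights=np.random.randn(64, 64)) for _ in range(5)]
```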
CNN computation in a FloatPIM memory block
The following figure shows a high-level illustration of training/testing a CNN layer on a FloatPIM memory block. In the feed-forward step, FloatPIM processes the input data in a pipelined fashion. For each data point, FloatPIM stores two intermediate values per neuron: (1) the output of the neuron after the activation function, and (2) the gradient of the activation function for the accumulated result.
In back-propagation, the loss is measured at the output layer, and the weights of each layer are then updated using the intermediate values stored during the feed-forward step.
In-memory operations on digital data can be performed in a row-parallel way by applying NOR-based operations to data located in different columns. The input-weight multiplication can therefore be processed as a row-parallel PIM operation. In contrast, the subsequent addition cannot be done row-parallel, because its operands sit in different rows, which hinders achieving the maximum parallelism that digital PIM offers. To address this, FloatPIM stores multiple copies of the input vector, one per row, together with the transposed weight matrix. It first multiplies each input column with the corresponding column of the weight matrix and writes the result into another column of the same memory block; it then accumulates the stored products column-wise into a result column with a series of PIM addition operations. Convolution, on the other hand, is an expensive operation consisting of many multiplications with a sliding kernel. FloatPIM replaces the explicit kernel sliding with lightweight interconnect logic: it writes all convolution weights into a single row and copies them to the other rows with a row-parallel write that takes only two cycles, so the input values can be multiplied with any convolution weight stored in another column. A configurable interconnect, implemented as a barrel shifter connecting two parts of the same memory block, virtually models the shift of the convolution kernel.
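The dataflow can be sketched at a high level in numpy; this only illustrates the layout described above (replicated input rows, column-by-column multiplication, column-wise accumulation, and a roll standing in for the barrel shifter), not the bit-level NOR arithmetic FloatPIM actually performs.

```python
# A dataflow-level numpy sketch of the block layout described above. The
# input vector is replicated into every row, each row holds the weights of
# one output neuron, multiplications run column by column (row-parallel), and
# the products are then accumulated column-wise into a result column.
import numpy as np

def pim_matvec(x, W):
    """Compute y = W @ x the way the block layout above organizes it."""
    n_out, n_in = W.shape
    input_rows = np.tile(x, (n_out, 1))       # copies of the input vector, one per row
    products = np.empty_like(W, dtype=float)
    for col in range(n_in):                   # each iteration is one row-parallel multiply
        products[:, col] = input_rows[:, col] * W[:, col]
    acc = np.zeros(n_out)                     # the result column
    for col in range(n_in):                   # column-wise in-memory additions
        acc += products[:, col]
    return acc

def pim_conv1d(x, k):
    """Valid 1-D convolution, with np.roll standing in for the barrel shifter."""
    n, m = len(x), len(k)
    rows = np.tile(x, (n - m + 1, 1))         # replicated rows (a row-parallel write in the architecture)
    for i in range(1, n - m + 1):
        rows[i] = np.roll(rows[i], -i)        # the shift the configurable interconnect provides
    return (rows[:, :m] * k).sum(axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0]])
print(pim_matvec(x, W))                       # [ 4. 12.]
print(pim_conv1d(x, np.array([1.0, -1.0])))   # [-1. -1. -1.]
```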
Activation Functions
FloatPIM implements the activation functions with sequences of in-memory NOR operations. For example, the ReLU function is implemented by subtracting the ReLU threshold value from all neurons' outputs. This subtraction happens in a row-parallel way, with the same threshold value written into a column of the memory block (in all rows). The sign bit of each subtraction result then determines the outcome: rows whose result has a positive sign bit keep the neuron's output, while for neuron outputs that already have a negative sign bit the subtraction can be skipped and a 0 value written to all such rows instead. FloatPIM also supports non-linear activation functions such as Sigmoid, using PIM-based multiplication and addition over a Taylor expansion. Moreover, FloatPIM does not use separate hardware modules for any layer type; all layers are implemented with basic memory operations. Hence, with no changes to the memory and minimal modifications to the architecture, FloatPIM can support the fusion of multiple layers.
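A rough sketch of both activation paths is shown below; the zero default ReLU threshold and the fifth-order Taylor expansion for Sigmoid are our choices, not taken from the paper.

```python
# A rough sketch of both activation paths: ReLU reduces to a row-parallel
# subtraction, a sign-bit check, and a row-parallel write of zeros; Sigmoid is
# approximated with only multiplications and additions via a Taylor expansion
# around 0 (the expansion order here is our choice, not taken from the paper).
import numpy as np

def pim_relu(outputs, threshold=0.0):
    diff = outputs - threshold        # row-parallel subtraction (same threshold in every row)
    negative = diff < 0               # inspect the sign bit of each row's result
    result = outputs.copy()
    result[negative] = 0.0            # row-parallel write of 0 to the flagged rows
    return result

def pim_sigmoid(x):
    # Taylor series of the sigmoid around 0: 1/2 + x/4 - x^3/48 + x^5/480 - ...
    coeffs = {0: 0.5, 1: 0.25, 3: -1.0 / 48, 5: 1.0 / 480}
    y = np.zeros_like(x, dtype=float)
    for power, c in coeffs.items():
        y += c * x**power             # only multiply/add primitives are needed
    return y

v = np.array([-1.5, -0.2, 0.3, 2.0])
print(pim_relu(v))                    # [0.  0.  0.3 2. ]
print(pim_sigmoid(v))                 # approximates 1 / (1 + exp(-v)), best near 0
```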
FloatPIM Architecture
The following figure shows the architecture of the proposed design while processing multiple CNN layers. It consists of 32 tiles, each containing 256 crossbar memory blocks with their own row and column drivers. Each memory block includes a barrel shifter to support the convolution kernels. Switches enable parallel data transfer between neighboring blocks, and a controller block computes the loss function and drives the row/column drivers and the data-transfer switches.
Fast data transfer between blocks is achieved with rotation and write operations: in the feed-forward step, the vertical output vector generated by a block must be rotated and copied into several rows of the next memory block. A detailed depiction of the circuits that provide this capability can be found in the paper.
The following figure shows FloatPIM memory blocks working as a pipeline. Each memory block models the computation of either a fully connected or a convolution layer. In the first cycle (T0), the switches are disconnected, and all memory blocks are in computing mode, working in parallel. FloatPIM then switches to data-transfer mode for two cycles: in the first transfer cycle (T1), all odd-indexed blocks send their output values to their even-indexed neighbors, and in the second transfer cycle (T2), the even-indexed blocks send their outputs to their adjacent odd-indexed blocks.
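This three-step rhythm can be captured with a toy scheduler; the block contents below are placeholder strings, and the odd/even convention follows the 1-indexed block numbering used above.

```python
# A toy scheduler for the T0/T1/T2 rhythm described above. Block contents are
# placeholder strings, and "odd"/"even" refer to 1-indexed block numbers as in
# the text (block 1 is the first layer).
def run_pipeline(num_blocks, num_rounds):
    inputs = ["batch0"] + [None] * (num_blocks - 1)
    outputs = [None] * num_blocks
    for r in range(num_rounds):
        # T0: switches open, every block with a valid input computes in parallel
        for b in range(num_blocks):
            if inputs[b] is not None:
                outputs[b] = f"out(block{b + 1}, round{r})"
        # T1: odd-numbered blocks (1, 3, ...) push their outputs to even neighbors
        for b in range(0, num_blocks - 1, 2):
            inputs[b + 1] = outputs[b]
        # T2: even-numbered blocks (2, 4, ...) push their outputs to odd neighbors
        for b in range(1, num_blocks - 1, 2):
            inputs[b + 1] = outputs[b]
        inputs[0] = f"batch{r + 1}"   # a new data point enters the pipeline
    return outputs

print(run_pipeline(num_blocks=4, num_rounds=3))
```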
Finally, FloatPIM supports floating-point computation by implementing the floating-point operations themselves as sequences of in-memory NOR operations.
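The paper's in-memory floating-point algorithms are not reproduced here, but the sketch below shows the kind of decomposition involved: a floating-point multiply reduces to a sign XOR, an exponent addition, and a fixed-point mantissa multiplication, each of which can in turn be built from the NOR-based primitives described earlier. The toy 8-bit mantissa format is ours, chosen only to keep the example short.

```python
# A toy decomposition of a floating-point multiply into the integer-style
# operations a NOR-based PIM can build: a sign XOR, an exponent addition, and
# a fixed-point mantissa multiplication. The 8-bit mantissa "format" is ours,
# chosen only to keep the sketch short; the paper implements standard
# floating-point formats directly in memory.
import math

MANT_BITS = 8

def decompose(x):
    """Split a float into (sign, exponent, fixed-point integer mantissa)."""
    if x == 0.0:
        return 0, 0, 0
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))                 # abs(x) = m * 2**e, with m in [0.5, 1)
    return sign, e - MANT_BITS, int(round(m * (1 << MANT_BITS)))

def compose(sign, exp, mant):
    return (-1.0 if sign else 1.0) * mant * 2.0**exp

def fp_multiply(a, b):
    sa, ea, ma = decompose(a)
    sb, eb, mb = decompose(b)
    return compose(sa ^ sb,                   # 1-bit XOR of the signs
                   ea + eb,                   # small integer addition of exponents
                   ma * mb)                   # fixed-point mantissa multiplication

print(fp_multiply(3.5, -2.0))                 # -7.0
print(fp_multiply(0.15, 40.0))                # ~6.0 (up to mantissa rounding)
```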
Conclusion
As the performance and energy-consumption limitations of conventional computing systems become more pressing, new computing paradigms such as processing in-memory are gaining popularity as accelerators for data-intensive applications like graph processing and neural networks. FloatPIM [1] proposes the first PIM-based DNN training architecture that exploits the properties of the memory devices without explicitly converting data into the analog domain. FloatPIM supports floating-point precision alongside fixed-point, and it addresses the internal data-movement problem of PIM by enabling parallel data transfer between neighboring blocks.
For Future Reading
- A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 14–26, DOI: 10.1109/ISCA.2016.12.
- L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning,” 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Austin, TX, USA, 2017, pp. 541–552, DOI: 10.1109/HPCA.2017.55.
References
[1] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision,” 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), Phoenix, AZ, USA, 2019, pp. 802–815.
[2] A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 14–26, DOI: 10.1109/ISCA.2016.12.
[3] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning,” 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Austin, TX, USA, 2017, pp. 541–552, DOI: 10.1109/HPCA.2017.55.
[4] P. Chi et al., “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 27–39, DOI: 10.1109/ISCA.2016.13.
[5] S. Kvatinsky et al., “MAGIC — Memristor-Aided Logic,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, pp. 895–899, 2014.
[6] J. Borghetti, G. Snider, P. Kuekes, et al., “Memristive switches enable stateful logic operations via material implication,” Nature, vol. 464, pp. 873–876, 2010, DOI: 10.1038/nature08940.
[7] B. C. Jang et al., “Memristive Logic-in-Memory Integrated Circuits for Energy-Efficient Flexible Electronics,” Advanced Functional Materials, vol. 28, 1704725, 2018.