PM3: Power Modeling and Power Management for Processing-in-Memory

Ehsan Yousefzadeh-Asl-Miandoab
Mar 8, 2021

Processing-in-Memory (PIM) architectures are accelerators for data-intensive applications such as real-time Big Data workloads and neural networks. They deliver their speedup by exploiting high internal memory bandwidth [1, 2, 3, 4, 5], which comes at the cost of high power consumption. A power model is therefore essential: neglecting power can cause power-supply failures and memory reliability problems when a PIM runs at peak performance, or leave the hardware underutilized when data processing and the corresponding memory accesses are not intensive. The PM3 paper [6] proposes a power model for PIM architectures and further puts forward three techniques to improve energy efficiency and performance. In the authors' experiments, an HMC-based (Hybrid Memory Cube) PIM sees its power reduced from 20 W to 15 W, and an RRAM-based PIM's speedup grows from 69x to 273x after its power budget is raised from 1 W to 10 W.

This work clarifies the difference between 3D-stacked and NVM-based PIM architectures: the former reuses the existing architecture of decoupled memory and logic dies stacked with TSVs to reduce manufacturing cost, while the latter relies on the memory cells themselves to carry out the computation [3]. The authors then propose their model, which estimates PIM power consumption under various bandwidths, capacities, and memory types. The following figure explains the model with all of its parameters.

Figure: the BP model and its parameters.

To set some of the parameters of the model, the authors run regressions on data collected from previously validated simulation tools such as NVSim and CACTI-3DD, as well as from the literature.
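As a rough illustration of that fitting step, the sketch below fits a linear bandwidth-to-power relation by ordinary least squares. All names and numbers here are invented for illustration; the paper fits its own model form to NVSim/CACTI-3DD data, not these values.

```python
# Hypothetical sketch: fit P = slope * BW + P_static by least squares.
# The calibration points are made up; PM3 fits its parameters to data
# from NVSim, CACTI-3DD, and published measurements.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Invented calibration points: (bandwidth in GB/s, total power in W)
bw = [20, 40, 60, 80]
power = [6.0, 9.0, 12.0, 15.0]

slope, static_power = fit_linear(bw, power)
print(slope, static_power)  # 0.15 W per GB/s of dynamic power, 3.0 W static
```

With a model like this, the arbitrator can translate a power budget into a bandwidth cap, which is exactly the lever the throttling technique below pulls on.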
Furthermore, three power management techniques are proposed to fully utilize the available power while still meeting the power constraint. The first, PAST (Power-Aware Subtask Throttling), guarantees that the power constraint is never violated (i.e., that a PIM task never exceeds the power supply limit) by throttling bandwidth usage. The following figure shows an architecture that adopts the PAST mechanism.

The purpose of the two-level arbitration is to reduce control intensity and to provide more flexible power management. Before any bank enters an execution phase, the subtask queue acquires power permission from the level-1 power arbitrator; only then does the queue issue a subtask to a memory bank. Otherwise, computation in the bank is paused, and the bank is not activated until the power budget suffices. The subtask queue, which holds all subtasks, decides whether enough power is available to issue one. A subtask can be issued when (1) all entries it depends on have completed (tracked with a dependency mask) and (2) its power requirement can be satisfied. To ensure fairness among subtasks, the power arbitrator records failed power acquisitions: whenever the arbitrator decides not to send a START signal to a PU (Processing Unit, a mix of memory cells and logic), the index of that unit is appended to the tail of a pending-subtask queue.
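The issue conditions above can be sketched in a few lines. This is a hedged simplification, not the paper's hardware logic: the class and method names are invented, and the dependency mask is modeled as a plain set.

```python
# Hedged sketch of PAST-style issue logic (names are invented).
# A subtask issues only when (1) all of its dependencies have completed
# and (2) the arbitrator has enough power budget left.

class PowerArbitrator:
    def __init__(self, budget_watts):
        self.budget = budget_watts
        self.in_use = 0.0
        self.pending = []  # PUs whose power request failed, kept for fairness

    def acquire(self, pu_index, watts):
        if self.in_use + watts <= self.budget:
            self.in_use += watts
            return True                    # send START to the PU
        self.pending.append(pu_index)      # record the failed acquisition
        return False                       # PU stays paused

    def release(self, watts):
        self.in_use -= watts


class SubtaskQueue:
    def __init__(self, arbitrator):
        self.arb = arbitrator
        self.done = set()                  # completed subtask ids

    def try_issue(self, pu_index, deps, watts):
        if not deps <= self.done:          # dependency mask not yet satisfied
            return False
        return self.arb.acquire(pu_index, watts)

    def complete(self, task_id, watts):
        self.done.add(task_id)
        self.arb.release(watts)


arb = PowerArbitrator(budget_watts=10.0)
q = SubtaskQueue(arb)
print(q.try_issue(0, deps=set(), watts=6.0))  # True: fits in the budget
print(q.try_issue(1, deps=set(), watts=6.0))  # False: would exceed 10 W
q.complete(0, watts=6.0)
print(q.try_issue(1, deps=set(), watts=6.0))  # True after power is released
```

The pending list is what gives the arbitrator the fairness information described above: a retried PU can be served before newly arriving requests.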

Furthermore, another technique, PUB (Power Unit Boost), addresses the underutilization problem, which arises because dependencies among subtasks in the queue limit the number of PUs that can work in parallel. PUB is a greedy algorithm run by the power arbitrators over the DAG (Directed Acyclic Graph) of subtasks. It operates as a three-state FSM (Finite State Machine) with Ready, Update, and Check states. At initialization the FSM is put into the Ready state. Whenever a subtask finishes, the Update state is triggered and the DAG and the counter of available power are updated. If anything changed, the FSM moves to the Check state, where an algorithm (detailed in [6]) determines the power mode of the subtasks to be issued, before returning to Ready. Note that the voltage of a PU in boost mode is 1.5x the active-mode voltage, its power consumption is roughly 2x that of active mode, and its latency is reduced by a factor of 1.5x.
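The Check-state decision can be sketched as follows. This is a hedged simplification under invented units: the paper's algorithm reasons over the whole DAG, while the greedy rule below just upgrades ready subtasks to boost mode while spare budget remains. Only the 2x-power / 1.5x-speedup ratios come from the paper.

```python
# Hedged sketch of the PUB three-state FSM (Ready / Update / Check).
# Boost-mode ratios follow the paper: ~2x active power, latency cut by 1.5x.
# The greedy mode choice is a simplification of the paper's DAG-based algorithm.

ACTIVE_POWER = 1.0   # per-PU power in active mode (arbitrary units)
BOOST_POWER = 2.0    # roughly 2x the active power
BOOST_SPEEDUP = 1.5  # latency reduced by a factor of 1.5

def pick_modes(ready_tasks, power_budget):
    """Greedy Check step: issue every ready task in active mode, then
    upgrade tasks to boost mode while spare budget remains."""
    modes = {t: "active" for t in ready_tasks}
    used = ACTIVE_POWER * len(ready_tasks)
    for t in ready_tasks:
        extra = BOOST_POWER - ACTIVE_POWER
        if used + extra <= power_budget:
            modes[t] = "boost"
            used += extra
    return modes, used

state = "Ready"
subtask_finished = True             # pretend a subtask just completed
if subtask_finished:
    state = "Update"                # refresh the DAG and free-power counter
    ready = ["t1", "t2", "t3"]      # tasks whose dependencies are now met
    state = "Check"                 # decide power modes for issuable subtasks
    modes, used = pick_modes(ready, power_budget=4.0)
    state = "Ready"

print(modes)  # a budget of 4.0 allows 3 active tasks plus 1 boost upgrade
print(used)   # 4.0
```

The point of the greedy upgrade is the one the paper makes: when dependencies leave power unused, spending it on boosting the few runnable subtasks shortens the critical path.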

Additionally, this work proposes a third technique, PS (Power Sprinting), to resolve the conflict between a fixed power supply and a varying power requirement and thereby improve energy efficiency. The basic idea is to supply overloaded power for a short time and then return to an underloaded state to recover (the overload is achieved by raising the power cap, i.e., supplying more current). When the sprinting period ends, the power arbitrator sends an extra PAUSE command to the queue and to the ongoing PUs and reduces power consumption back to the previous cap. After the recovery phase, the memory returns to the normal stage, ready for the next sprint. The key factor limiting power sprinting is the thermal capacitance of the package: previous work [7, 8] uses bulk metal or phase-change material to store heat and supercapacitors to store energy, with the heat finally dissipated by the heat sink.
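The thermal-capacitance limit can be made concrete with a toy lumped-capacitance estimate. Every number below is invented for illustration; only the idea (the package absorbs the excess heat during a sprint, bounding how long the sprint can last) follows the paper.

```python
# Hedged sketch of the power-sprinting budget. A toy lumped-capacitance
# model: during a sprint the package stores the power above the nominal
# cap as heat, which bounds the sprint duration. All values are invented.

NOMINAL_CAP = 10.0    # W, steady-state power cap
SPRINT_CAP = 20.0     # W, temporary cap during a sprint
HEAT_CAPACITY = 50.0  # J/K, package thermal capacitance (invented)
T_MAX = 5.0           # K of allowed temperature rise before sprinting must stop

def sprint_budget_seconds():
    """Seconds of sprinting before the package heats past T_MAX, assuming
    the heat sink continuously removes NOMINAL_CAP watts of the dissipation."""
    excess_watts = SPRINT_CAP - NOMINAL_CAP   # power stored as heat
    return HEAT_CAPACITY * T_MAX / excess_watts

print(sprint_budget_seconds())  # 25.0 seconds with these invented numbers
```

A larger thermal store (bulk metal or phase-change material, as in [7, 8]) raises `HEAT_CAPACITY` and so directly lengthens the permissible sprint.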

To sum up, Processing-in-Memory is an accelerator paradigm for data-intensive applications such as neural networks. These architectures achieve much higher performance than von Neumann designs by relying on their high internal bandwidth, at the cost of high power consumption that must be managed. PM3 [6] first proposes a power model and, building on it, three techniques that manage power consumption and achieve further speedups.


References

[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 105–117, DOI: 10.1145/2749469.2750386.

[2] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 2015, pp. 336–348, DOI: 10.1145/2749469.2750385.

[3] P. Chi et al., “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 27–39, DOI: 10.1109/ISCA.2016.13.

[4] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi and F. Franchetti, “Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware,” 2013 IEEE High-Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 2013, pp. 1–6, DOI: 10.1109/HPEC.2013.6670336.

[5] A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 14–26, DOI: 10.1109/ISCA.2016.12.

[6] C. Zhang, T. Meng and G. Sun, “PM3: Power Modeling and Power Management for Processing-in-Memory,” 2018 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, 2018, pp. 558–570, DOI: 10.1109/HPCA.2018.00054.

[7] L. Cao, J. P. Krusius, M. A. Korhonen, and T. S. Fisher, "Transient thermal management of portable electronics using heat storage and dynamic power dissipation control," IEEE Transactions on Components, Packaging, and Manufacturing Technology: Part A, vol. 21, no. 1, pp. 113–123, March 1998, DOI: 10.1109/95.679040.

[8] M. Hodes, R. D. Weinstein, S. J. Pence, J. M. Piccini, L. Manzione, and C. Chen, "Transient Thermal Management of a Handset Using Phase Change Material (PCM)," ASME Journal of Electronic Packaging, vol. 124, no. 4, pp. 419–426, December 2002, DOI: 10.1115/1.1523061.
