iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture

Introduction

Processing-in-memory (PIM) architectures have drawn the attention of computer architects as an alternative to conventional computation-centric systems, which suffer from the memory wall and high energy consumption. PIM architectures are well suited to data-intensive workloads such as big data analytics, image processing, and neural networks. Image processing is a crucial domain on workstations and in data centers, with applications in machine learning, biomedical engineering, and geographic information systems. Its workloads involve a large amount of data-intensive computation, which motivates the design of accelerators for high performance and energy efficiency. GPUs, the state-of-the-art accelerators for image processing, have been successful, but the memory wall impedes further performance gains because of both the characteristics of image processing workloads and the limitations of the computation-centric architecture.

3D-stacked processing in-memory (3D-PIM) offers a promising way to overcome these challenges. This architecture stacks multiple DRAM dies on top of a base logic die. The control logic accesses memory through TSVs (through-silicon vias), which are vertical interconnects shared among the 3D layers. Scaling the number of TSVs is difficult because of their large area overhead. To tackle this, researchers have explored near-bank designs that integrate simple compute units adjacent to each bank without changing the DRAM bank circuitry. However, because full control-core support is expensive, these designs lack programmability, which makes it hard to support heterogeneous image processing pipelines with diverse computation and memory access patterns.

Peng Gu et al. [1] tackle this programmability challenge with a lightweight decoupled control-execution architecture and a Single Instruction Multiple Bank (SIMB) ISA that supports a wide range of image processing pipelines. They also develop an end-to-end compilation flow with new Halide schedules. The flow extends the Halide frontend to support these schedules and includes a backend whose optimizations (register allocation, instruction reordering, and memory order enforcement) reduce resource conflicts, exploit instruction-level parallelism (ILP), and optimize DRAM row-buffer locality. The authors' evaluation shows that iPIM achieves roughly an 11x speedup and roughly 80% energy savings compared to an NVIDIA Tesla V100 GPU.

Background

This section reviews terminology used in the paper to make the proposed architecture easier to follow.

An image processing pipeline is the set of components used between an image source (such as a camera or scanner) and an image renderer (such as a display or printer), or for intermediate processing made up of two or more separate blocks. The pipeline may be implemented in software, on a digital signal processor (DSP), on a field-programmable gate array (FPGA), or as a fixed-function ASIC. Typical goals of an imaging pipeline include perceptually pleasing results, colorimetric precision, a high degree of flexibility, low cost, low CPU utilization, long battery life, and reduced bandwidth or file size [2].

Halide is a domain-specific language (DSL) embedded in C++ and used as a library. It helps programmers write high-performance array, image, and tensor processing kernels, and it separates the algorithm (what is computed) from its schedule (how the computation is ordered and optimized) [3].
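The algorithm/schedule split can be pictured with a small pure-Python sketch. This is not the Halide API: the function names (blur_at, schedule_row_major, schedule_tiled) and the 3-tap blur are assumptions chosen for illustration. The point is only that two different traversal orders of the same definition give the same output.

```python
# Pure-Python illustration (not the Halide API): one algorithm, two schedules.

def blur_at(img, x, y):
    """Algorithm: what each output pixel is (3-tap horizontal average)."""
    return (img[y][x - 1] + img[y][x] + img[y][x + 1]) // 3

def schedule_row_major(img):
    """Schedule A: visit output pixels row by row."""
    h, w = len(img), len(img[0])
    out = [[0] * (w - 2) for _ in range(h)]
    for y in range(h):
        for x in range(1, w - 1):
            out[y][x - 1] = blur_at(img, x, y)
    return out

def schedule_tiled(img, tile=2):
    """Schedule B: visit the same pixels tile by tile."""
    h, w = len(img), len(img[0])
    out = [[0] * (w - 2) for _ in range(h)]
    for ty in range(0, h, tile):
        for tx in range(1, w - 1, tile):
            for y in range(ty, min(ty + tile, h)):
                for x in range(tx, min(tx + tile, w - 1)):
                    out[y][x - 1] = blur_at(img, x, y)
    return out

# A small synthetic 6x4 image (values are arbitrary).
image = [[(3 * x + 5 * y) % 17 for x in range(6)] for y in range(4)]
same_result = schedule_row_major(image) == schedule_tiled(image)
```

Changing the schedule changes only the loop nest, never blur_at, which is the property that lets an algorithm be retargeted without being rewritten.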

Register allocation is the process, performed by the compiler, of assigning a large number of target program variables to a small number of CPU registers [4].
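As a rough illustration of the idea, the sketch below assigns registers to variables from their live ranges. It is a simplified greedy scan, not the allocator used in iPIM; the function name allocate_registers, the interval format, and the spill policy (spill the newcomer) are assumptions.

```python
def allocate_registers(intervals, num_regs):
    """intervals: {var: (start, end)} live ranges with inclusive ends.
    Returns {var: register index, or "spill" if no register is free}."""
    free = list(range(num_regs))
    active = []      # (end, var) pairs that currently hold a register
    assignment = {}
    # Visit variables in order of increasing start point.
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers of variables that died before this one starts.
        for old_end, old_var in list(active):
            if old_end < start:
                active.remove((old_end, old_var))
                free.append(assignment[old_var])
        if free:
            free.sort()
            assignment[var] = free.pop(0)
            active.append((end, var))
        else:
            # Simplification: spill the new variable rather than picking a victim.
            assignment[var] = "spill"
    return assignment

live = {"a": (0, 3), "b": (1, 2), "c": (2, 5), "d": (4, 6)}
result = allocate_registers(live, num_regs=2)
# result == {"a": 0, "b": 1, "c": "spill", "d": 0}
```

With only two registers, c overlaps both a and b and is spilled, while d reuses register 0 once a is dead.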

Instruction reordering is an optimization technique that increases ILP to achieve higher performance. It can be performed by the compiler or by the hardware.
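A minimal sketch of compiler-side instruction reordering follows. It assumes a single-issue in-order core that stalls until operands are ready, that every register is written exactly once, and made-up latencies (3 cycles for a load, 1 for an add). The names total_cycles and reorder are hypothetical, and the greedy policy is a generic list-scheduling heuristic rather than the paper's algorithm.

```python
def total_cycles(program):
    """Cycles needed when each instruction (dest, sources, latency)
    issues in order and stalls until all of its sources are ready."""
    ready_at, cycle = {}, 0
    for dest, srcs, latency in program:
        issue = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dest] = issue + latency
        cycle = issue + 1
    return max([cycle] + list(ready_at.values()))

def reorder(program):
    """Greedy list scheduling: repeatedly issue the pending instruction
    that can start earliest, never moving a use above its producer."""
    pending = list(program)
    ready_at, cycle, order = {}, 0, []
    while pending:
        unproduced = {dest for dest, _, _ in pending}

        def issue_time(instr):
            return max([cycle] + [ready_at.get(s, 0) for s in instr[1]])

        # Only instructions whose producers have already been issued.
        candidates = [p for p in pending if not unproduced.intersection(p[1])]
        chosen = min(candidates, key=issue_time)  # ties keep program order
        issue = issue_time(chosen)
        ready_at[chosen[0]] = issue + chosen[2]
        cycle = issue + 1
        order.append(chosen)
        pending.remove(chosen)
    return order

original = [
    ("r1", [], 3),       # load
    ("r2", ["r1"], 1),   # add that uses r1
    ("r3", [], 3),       # load
    ("r4", ["r3"], 1),   # add that uses r3
]
scheduled = reorder(original)
```

Under these assumptions the original order takes 8 cycles and the reordered one (both loads first, then both adds) takes 5, because the second load fills the stall left by the first.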

Memory order describes the order of access to computer memory by a CPU. The term can refer either to the memory ordering generated by the compiler during compile time or to the memory ordering generated by a CPU during runtime [5].
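In a DRAM setting, the order of accesses also determines how often an access hits the row that is already open in a bank, which is the row-buffer locality the iPIM backend optimizes. The sketch below counts row-buffer hits for one bank under an assumed open-page policy and an assumed row size of 64 addresses. The accesses are treated as independent reads, so reordering them is safe here. The name row_buffer_hits is hypothetical.

```python
def row_buffer_hits(addresses, row_size):
    """Count accesses that hit the currently open row of a single bank."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // row_size
        if row == open_row:
            hits += 1
        open_row = row  # a miss opens the new row
    return hits

accesses = [0, 64, 1, 65, 2, 66]                    # alternates between two rows
grouped = sorted(accesses, key=lambda a: a // 64)   # stable sort: group by row

hits_before = row_buffer_hits(accesses, row_size=64)
hits_after = row_buffer_hits(grouped, row_size=64)
```

Here the interleaved order never hits the open row, while the grouped order hits it four times out of six accesses. A real compiler pass would additionally have to respect data dependencies between reads and writes.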

Proposed Mechanism

The authors motivate their work with an extensive set of experiments on GPUs, which show that memory bandwidth is the performance bottleneck of these state-of-the-art image processing accelerators. They also observe that Halide compiler optimizations cannot change the memory-bound behavior of image processing applications on GPUs. Furthermore, the experiments show that index calculation, an important part of programmability support for flexible memory access patterns, accounts for a high share of the work, because image processing requires frequent translations from 2D image coordinates to 1D memory addresses. This last observation motivates architectural support for index calculation in the proposed design.
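The 2D-to-1D translation mentioned above can be made concrete with a short sketch. It assumes a row-major layout with one element per address; the function names and the 3x3 stencil are illustrative and not taken from the paper.

```python
def linear_index(x, y, width):
    """Row-major translation from 2D image coordinates to a 1D address."""
    return y * width + x

def stencil_addresses(x, y, width):
    """1D addresses touched by a 3x3 stencil centered at (x, y)."""
    return [linear_index(x + dx, y + dy, width)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

addrs = stencil_addresses(x=2, y=1, width=8)
# addrs == [1, 2, 3, 9, 10, 11, 17, 18, 19]
```

Each of the nine loads needs its own multiply-add before any pixel arithmetic happens, which gives a sense of why index calculation can take up a large share of the executed instructions.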

The proposed architecture uses the 3D-stacking near-bank architecture with a top-down hierarchy of cube, vault, process group, and process engine as illustrated in the following figure.

The iPIM architecture Overview

The base logic die of each vault contains an iPIM control core, which is responsible for executing an iPIM program. Each PIM die of each vault contains a process group (PG), which consists of many process engines (PEs) and a shared process group scratchpad memory (PGSM). Each PE employs a near-bank architecture, in which compute logic and lightweight buffers are integrated with a DRAM bank. Each PE also adds an address register file (ARF) and an integer ALU to support index calculation. The design principle of the control core is to keep the hardware simple and rely on compiler optimizations. Therefore, iPIM uses a pipelined, single-issue, in-order core in which data hazards are eliminated at instruction issue, so the hardware does not need complex forwarding logic. The paper [1] describes the PE microarchitecture in detail.

The SIMB ISA is proposed to exploit the data parallelism in image processing by exposing bank-level parallelism. It resembles a RISC-like SIMD ISA that enables bank-parallel computation as well as efficient memory access. To enable SIMB, each SIMB-capable instruction has a field holding a boolean vector that indicates whether each corresponding PE should execute the instruction. The SIMD vector length is chosen to be 4 to match the local bank's interface and the TSVs' data transfer width, so the internal bandwidth is fully utilized. Within each vault, control signals and data signals share the same physical TSVs through time multiplexing, so sending control signals to each PE incurs no additional TSV area cost.
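The per-PE boolean field can be sketched as a mask applied to a single broadcast instruction. This is a behavioral illustration only: the function simb_execute, the choice of four PEs, and the single register per PE are assumptions, not the paper's instruction encoding.

```python
def simb_execute(op, pe_mask, pe_registers, operand):
    """Apply one broadcast instruction on every PE whose mask bit is set."""
    for pe_id, enabled in enumerate(pe_mask):
        if enabled:
            pe_registers[pe_id] = [op(v, operand) for v in pe_registers[pe_id]]
    return pe_registers

# Four hypothetical PEs, each holding one 4-element vector register.
regs = [[1, 2, 3, 4] for _ in range(4)]
mask = [True, False, True, False]   # only PE 0 and PE 2 execute

simb_execute(lambda a, b: a + b, mask, regs, operand=10)
# regs == [[11, 12, 13, 14], [1, 2, 3, 4], [11, 12, 13, 14], [1, 2, 3, 4]]
```

One instruction thus drives several banks at once, while masked-off PEs keep their state unchanged.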

To support multiple image processing applications composed of heterogeneous pipelines on iPIM, the Halide programming language is used. The frontend support for Halide eases the burden on programmers in two ways. First, an image processing algorithm written in Halide does not need to be changed for iPIM, because Halide separates the algorithm from its schedule. Second, the authors develop two customized schedule primitives that exploit the hardware's characteristics and provide an easy-to-use, high-level abstraction for indicating workload partitioning and data sharing among PEs. As a result, the end-to-end compilation flow optimizes workload partitioning and data sharing automatically according to the high-level schedules, without further programmer involvement. The following figure shows the end-to-end compilation flow of the iPIM.

The end-to-end compilation flow of the iPIM
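The kind of workload partitioning that such schedules describe can be pictured as splitting the image into tiles and assigning tiles to PEs. The sketch below uses a round-robin assignment, which is an assumption made for illustration; the function partition_tiles and its parameters are hypothetical and do not reproduce the paper's schedule primitives.

```python
def partition_tiles(width, height, tile_w, tile_h, num_pes):
    """Assign image tiles to PEs round-robin.
    Returns {pe_id: [(x0, y0, x1, y1), ...]} with exclusive upper bounds."""
    assignment = {pe: [] for pe in range(num_pes)}
    tile_id = 0
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            # Clamp the tile to the image border.
            tile = (x0, y0, min(x0 + tile_w, width), min(y0 + tile_h, height))
            assignment[tile_id % num_pes].append(tile)
            tile_id += 1
    return assignment

tiles = partition_tiles(width=8, height=4, tile_w=4, tile_h=2, num_pes=4)
# tiles == {0: [(0, 0, 4, 2)], 1: [(4, 0, 8, 2)],
#           2: [(0, 2, 4, 4)], 3: [(4, 2, 8, 4)]}
```

Stencil operations read pixels across tile borders, which is one reason data sharing among PEs has to be expressed alongside the partition.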

Conclusion

The paper presents the first programmable in-memory image processing accelerator using a near-bank architecture. It employs a decoupled control-execution architecture to support lightweight programmability, and it proposes a novel SIMB ISA to enable the varied computation and memory access patterns of heterogeneous image processing pipelines. Moreover, the work develops an end-to-end compilation flow with new schedules for the proposed architecture. The compiler backend contains further optimizations for iPIM, including register allocation, instruction reordering, and memory order enforcement. The evaluation shows that programmability support adds only a small overhead, while iPIM delivers significant speedup and energy savings compared with a GPU.

Further Reading

  • E. Sadredini, R. Rahimi, M. Lenjani, M. Stan and K. Skadron, “Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching,” 2020 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2020, pp. 86–98, DOI: 10.1109/HPCA47549.2020.00017.

References

[1] P. Gu et al., “iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture,” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 804–817, DOI: 10.1109/ISCA45697.2020.00071.

[2] Wikipedia contributors. (2019, September 27). Color image pipeline. In Wikipedia, The Free Encyclopedia. Retrieved 17:57, April 26, 2021, from https://en.wikipedia.org/w/index.php?title=Color_image_pipeline&oldid=918190098

[3] Wikipedia contributors. (2021, February 7). Halide (programming language). In Wikipedia, The Free Encyclopedia. Retrieved 17:57, April 26, 2021, from https://en.wikipedia.org/w/index.php?title=Halide_(programming_language)&oldid=1005442012

[4] Wikipedia contributors. (2021, April 8). Register allocation. In Wikipedia, The Free Encyclopedia. Retrieved 17:58, April 26, 2021, from https://en.wikipedia.org/w/index.php?title=Register_allocation&oldid=1016693143

[5] Wikipedia contributors. (2021, April 7). Memory ordering. In Wikipedia, The Free Encyclopedia. Retrieved 17:58, April 26, 2021, from https://en.wikipedia.org/w/index.php?title=Memory_ordering&oldid=1016403590
