Processing In-Memory Enabled Graphics Processors for 3D Rendering
Today’s game developers use higher image resolutions and more color effects when rendering 3D frames to meet users' graphics and gaming demands. These demands require high throughput. Modern GPUs achieve this throughput by processing millions of pixels per second, which puts substantial pressure on the off-chip memory. As a result, memory bandwidth becomes a severe performance and energy bottleneck on GPUs. 3D-stacked memory systems such as the Hybrid Memory Cube (HMC) offer an opportunity to overcome the memory wall: they connect logic controllers directly to the DRAM dies (enabling Processing In-Memory) while offering high memory capacity and bandwidth to the host system. Processing In-Memory (PIM) reduces the communication overhead between the host and the memory. Chenhao Xie et al. observe that the memory bottleneck in 3D rendering is caused directly by texel fetching in the texture filtering process, and that anisotropic filtering is the most significant limiting factor for texture filtering performance. Based on these observations, and on the assumption that leveraging the HMC can reduce the data traffic caused by texel fetches, they propose two texture filtering in-memory designs. The first design, S-TFIM, moves all the texture units from the host GPU to the logic layer of the HMC. Because it has to transmit a considerable amount of live-texture information, its performance improvement is marginal. To overcome this challenge, the authors propose an advanced texture filtering in-memory design (A-TFIM) that splits the texture filtering process into two parts: bilinear and trilinear filtering are performed in the host GPU, while the most bandwidth-hungry step, anisotropic filtering, is processed in the logic layer of the HMC.
Furthermore, the researchers employ a camera-angle threshold to enhance data reuse in the GPU texture caches and to control the performance-quality tradeoff of 3D rendering. Evaluations that render five real-world games at different resolutions show a 43% improvement in overall rendering performance, a 28% reduction in total memory traffic, and a 22% reduction in energy consumption.
3D rendering on GPUs
3D rendering is a process that uses 3D vertex data to create 2D images with 3D effects on a computer system. (A vertex is a data structure that describes certain attributes, such as the position of a point in 2D or 3D space, or of multiple points on a surface.) Traditional GPUs were special-purpose processors for 3D rendering algorithms. The following figure shows the baseline GPU architecture for 3D rendering. It employs the unified shader (US) model architecture for vertex and fragment processing. (A fragment is the data necessary to generate a single pixel’s worth of a drawing primitive in the frame buffer.)
This 3D rendering process implemented in today’s GPUs consists of three main stages: geometry processing, rasterization, and fragment processing.
- Geometry Processing: In this stage, input vertices are fetched from memory by the vertex fetcher, and their attributes are computed in the unified shaders. The vertices are then transformed and assembled into triangles in the primitive assembly stage, and these triangles pass through a clipping stage that removes non-visible triangles or generates sub-triangles.
- Rasterization: The rasterizer processes the triangles and generates fragments, each of which is equivalent to a pixel in a 2D image. The fragments are grouped into fragment tiles which are the basic work units for the last stage of fragment processing.
- Fragment Processing: During this stage, fragment properties such as color and depth are computed for each fragment in the unified shader, and the frame buffer is updated with these properties. (A frame buffer, also written framebuffer and sometimes called a framestore, is a portion of random-access memory (RAM) containing a bitmap that drives a video display. It is a memory buffer holding data that represents all the pixels in a complete video frame.) Unified shaders can fetch extra data for better image fidelity by sending texture requests to the texture units. The texture unit attached to each unified shader cluster samples and filters the requested texture data for a whole fragment tile.
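The rasterization step described above can be illustrated with a minimal Python sketch. It is not the hardware algorithm from the paper: the edge-function coverage test, the 4x4 tile size, and the names `edge`, `rasterize`, and `group_into_tiles` are assumptions chosen for illustration.

```python
from collections import defaultdict

def edge(a, b, p):
    """Signed-area test: the sign tells on which side of edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(tri, width, height):
    """Return one fragment (x, y) per pixel whose center lies inside the triangle."""
    a, b, c = tri
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            inside_ccw = w0 >= 0 and w1 >= 0 and w2 >= 0
            inside_cw = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside_ccw or inside_cw:  # accept either winding order
                fragments.append((x, y))
    return fragments

def group_into_tiles(fragments, tile_size=4):
    """Group fragments into square tiles, the work units of fragment processing."""
    tiles = defaultdict(list)
    for x, y in fragments:
        tiles[(x // tile_size, y // tile_size)].append((x, y))
    return dict(tiles)

# A right triangle covering the lower-left half of an 8x8 screen.
triangle = [(0.0, 0.0), (8.0, 0.0), (0.0, 8.0)]
frags = rasterize(triangle, 8, 8)
tiles = group_into_tiles(frags)
print(len(frags), sorted(tiles))
```

Each tile would then be handed to a unified shader cluster, whose texture unit filters texture data for the whole tile.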
Texture Filtering in 3D Rendering
Texture filtering determines the color of 3D textures, and the process is deeply pipelined. After receiving a texture request, the address generator first calculates the memory address of each required texel (a pixel of the texture) using the triangle attributes. The texel fetch unit in the texture unit then fetches the texels. On a cache hit, the texture unit reads the texel data from the texture cache (L1 or L2). On a miss, it fetches the texel from the off-chip memory. Once all the texels of the requested texture are collected, the texture unit calculates the four-component (RGBA) color of the texture and outputs the filtered texture samples to the shader. The authors' experiments show that texel fetching in texture filtering accounts for an average of 60% of the total memory accesses in 3D rendering, which makes it a major contributor to the overall bandwidth usage of 3D rendering on GPUs. Therefore, optimizing the memory accesses of texture filtering, especially the fetch process, can significantly decrease the memory bandwidth requirement of 3D rendering.
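The fetch path above can be mimicked with a small Python sketch that counts how many texel requests are served by the cache and how many go to off-chip memory. Several things here are assumptions rather than details from the paper: the single-level LRU cache, the row-major address layout, and the plain average used as a stand-in for the filter. The names `TextureCache`, `texel_address`, and `filter_request` are hypothetical.

```python
from collections import OrderedDict

class TextureCache:
    """Single-level LRU texel cache that counts hits and misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, address, memory):
        if address in self.lines:          # cache hit: read from the cache
            self.lines.move_to_end(address)
            self.hits += 1
            return self.lines[address]
        self.misses += 1                   # cache miss: go to off-chip memory
        texel = memory[address]
        self.lines[address] = texel
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used texel
        return texel

def texel_address(u, v, width, base=0):
    """Address generator: row-major layout, one address per texel (assumed)."""
    return base + v * width + u

def filter_request(coords, cache, memory, width):
    """Fetch every required texel, then combine them into one RGBA sample."""
    texels = [cache.fetch(texel_address(u, v, width), memory) for u, v in coords]
    n = len(texels)
    return tuple(sum(t[i] for t in texels) / n for i in range(4))

# 4x4 texture whose texel at (u, v) has color (10u, 10v, 0, 255).
WIDTH = 4
memory = {texel_address(u, v, WIDTH): (10.0 * u, 10.0 * v, 0.0, 255.0)
          for v in range(4) for u in range(4)}
cache = TextureCache(capacity=8)

first = filter_request([(0, 0), (1, 0), (0, 1), (1, 1)], cache, memory, WIDTH)
second = filter_request([(1, 0), (2, 0), (1, 1), (2, 1)], cache, memory, WIDTH)
print(first, cache.hits, cache.misses)
```

The second request overlaps the first by two texels, so only two of its four fetches reach memory. This is the kind of texel locality the texture caches exploit.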
The texture filtering process on modern GPUs commonly comprises three steps: (1) bilinear filtering, (2) trilinear filtering, and (3) anisotropic filtering. Bilinear filtering is a texture filtering method used in computer graphics to smooth out textures when objects are displayed on the screen larger or smaller than they are in texture memory. Bilinear filtering has several weaknesses that make it an unattractive choice in many cases: using it on a full-detail texture when scaling to a very small size causes accuracy problems from missed texels, and compensating for this by using multiple mipmaps throughout the polygon leads to abrupt changes in blurriness, which are most pronounced in polygons that are steeply angled relative to the camera. Trilinear filtering extends bilinear filtering by also performing a linear interpolation between mipmaps. In 3D computer graphics, anisotropic filtering (abbreviated AF) is a method of enhancing the image quality of textures on surfaces that are at oblique viewing angles with respect to the camera, where the projection of the texture (not the polygon or other primitive on which it is rendered) appears to be non-orthogonal. The name comes from “an” for not, “iso” for same, and “tropic” from tropism, relating to direction: anisotropic filtering does not filter the same in every direction. Like bilinear and trilinear filtering, anisotropic filtering eliminates aliasing effects, but it improves on these techniques by reducing blur and preserving detail at extreme viewing angles. This quality comes at a cost: anisotropic filtering places significant demands on memory bandwidth during texture filtering.
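To make the three filters concrete, here is a simplified Python sketch on single-channel textures. It is an illustration under stated assumptions, not the GPU's or the paper's implementation: coordinates are clamped, half-texel offsets are ignored, the anisotropic filter averages evenly spaced trilinear taps along a given axis, and `texels_fetched` assumes each tap reads two mipmap levels of four texels each. All function names are hypothetical.

```python
import math

def bilinear(tex, u, v):
    """Interpolate the 2x2 texel neighborhood around (u, v) in tex[y][x]."""
    h, w = len(tex), len(tex[0])
    u = min(max(u, 0.0), w - 1)  # clamp to the texture edge
    v = min(max(v, 0.0), h - 1)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bottom = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def trilinear(mipmaps, u, v, lod):
    """Blend bilinear samples from the two mipmap levels around `lod`."""
    lo = min(int(math.floor(lod)), len(mipmaps) - 1)
    hi = min(lo + 1, len(mipmaps) - 1)
    f = lod - lo
    a = bilinear(mipmaps[lo], u / 2 ** lo, v / 2 ** lo)
    b = bilinear(mipmaps[hi], u / 2 ** hi, v / 2 ** hi)
    return a * (1 - f) + b * f

def anisotropic(mipmaps, u, v, lod, axis, taps):
    """Average `taps` trilinear samples spread along the axis of anisotropy."""
    du, dv = axis
    total = 0.0
    for i in range(taps):
        t = (i + 0.5) / taps - 0.5  # tap offsets centered on (u, v)
        total += trilinear(mipmaps, u + t * du, v + t * dv, lod)
    return total / taps

def texels_fetched(taps):
    """Texels read per sample: taps x 2 mipmap levels x 4 texels (assumed)."""
    return taps * 2 * 4

level0 = [[0.0, 10.0], [20.0, 30.0]]
level1 = [[15.0]]
mips = [level0, level1]
print(bilinear(level0, 0.5, 0.5))
print(anisotropic(mips, 0.5, 0.5, 0.0, (1.0, 0.0), 2))
print(texels_fetched(16))
```

Under these assumptions a 16-tap anisotropic sample touches 128 texels, compared with 8 for a single trilinear sample, which gives a sense of why this step dominates bandwidth.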
To show the advantage of HMC over GDDR5 memory, the authors first replace a GPU’s GDDR5 with an HMC. The results show speedups of up to 30% for 3D rendering and up to 70% for texture filtering. However, the bandwidth limitation of the off-chip links still prevents these applications from achieving further speedup: the external bandwidth of the HMC is much lower than its internal bandwidth. A possible way to maximize the performance of 3D rendering with HMC is therefore to move the communication from the external links to the inside of the HMC, thus minimizing expensive off-chip memory accesses.
Simple Texture Filtering In-Memory Design (S-TFIM)
As mentioned earlier, texel fetching in the texture filtering process incurs intensive memory accesses and is the major contributor to the overall memory bandwidth usage of 3D rendering on GPUs. The logic layer in the HMC can perform simple logic computation, and fortunately, texture filtering involves relatively light calculation. The S-TFIM design moves all the texture units from the host GPU to the HMC logic layer, where they are renamed Memory Texture Units (MTUs). The following figure shows the memory texture unit of the S-TFIM design.
MTUs communicate with the host GPU via the transmit (TX) and receive (RX) channels. Whenever the unified shader issues a texture filtering request, a package is sent from the host GPU to the MTU via the TX channel. This package includes the information needed for texture filtering, such as the texture coordinates, the texture request ID (which identifies the corresponding MTU), and the starting cycle. Once it arrives at the MTU, the request package is buffered in the texture request queue. In every cycle, a FIFO scheduler fetches one request into the MTU pipeline for texture filtering. When the filtering completes, the texture data is placed in a response package that is sent back to the host GPU via the RX channel. When the texture request queue is full, the MTU sends a “stall” signal to the corresponding shader, which suspends the request package until a “resume” signal arrives. The authors' evaluation shows that the performance improvement of S-TFIM is marginal. The reason is that the texture request and response packages contain a considerable amount of data and consume much more memory bandwidth than normal memory read/write operations. The memory bandwidth usage of S-TFIM is 5.37X that of the baseline GPU-HMC design.
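The queueing behavior of an MTU can be sketched in a few lines of Python. This is a toy model under assumptions: the package fields, the queue depth, the one-cycle filtering latency, and the class name `MemoryTextureUnit` are illustrative and not taken from the paper.

```python
from collections import deque

class MemoryTextureUnit:
    """Toy MTU: bounded request queue, FIFO scheduling, stall/resume signal."""
    def __init__(self, queue_depth):
        self.queue_depth = queue_depth
        self.request_queue = deque()
        self.stall_signal = False

    def receive(self, package):
        """TX channel: buffer a request, or raise 'stall' if the queue is full."""
        if len(self.request_queue) >= self.queue_depth:
            self.stall_signal = True
            return False
        self.request_queue.append(package)
        return True

    def cycle(self):
        """One cycle: the FIFO scheduler issues the oldest request, if any."""
        if not self.request_queue:
            return None
        package = self.request_queue.popleft()
        self.stall_signal = False  # a slot is free again, i.e. 'resume'
        # RX channel: placeholder payload standing in for the filtered texture.
        return {"request_id": package["request_id"],
                "texture_data": ("filtered", package["coords"])}

mtu = MemoryTextureUnit(queue_depth=2)
accepted = [mtu.receive({"request_id": i, "coords": (i, i), "start_cycle": 0})
            for i in range(3)]
stalled_after_burst = mtu.stall_signal
response = mtu.cycle()
retry_accepted = mtu.receive({"request_id": 2, "coords": (2, 2), "start_cycle": 1})
print(accepted, stalled_after_burst, response["request_id"], retry_accepted)
```

The third request of the burst is refused and the shader is stalled. After one request drains, the retried package is accepted.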
Advanced Texture Filtering In-Memory Design (A-TFIM)
The key idea of A-TFIM is to dramatically reduce the memory accesses caused by texel fetching, which neither the GPU-HMC design nor S-TFIM addresses. Texture units need to fetch all the required texels from memory before the filtering process. Anisotropic filtering, which occurs after bilinear and trilinear filtering to further enhance texture sharpness, demands a large number of texels, and this makes texture filtering extremely bandwidth-intensive. To tackle this challenge, the authors move only anisotropic filtering, the most bandwidth-hungry step of texture filtering, to the logic layer of the HMC. The decision is supported by their observation that the output of anisotropic filtering is highly reused by the other filters, that is, the bilinear and trilinear filters. In other words, the texture caches in the baseline GPU can capture this texture locality and benefit the other filtering phases in the same frame. The reason is that the additional sampling areas that anisotropic filtering introduces for each texel used by bilinear or trilinear filtering share the same set of texels as long as the camera angle remains constant. In contrast, the outputs of the bilinear and trilinear filters are intermediate sampling results rather than texels, and they are rarely reused. The paper’s experiments show that the reuse rate of bilinear and trilinear results is less than 0.1% over the entire texture filtering process. Thus, moving bilinear and trilinear filtering into the HMC would forfeit the benefit of the texture caches in capturing the high texel locality and might subsequently increase memory traffic. In A-TFIM, the authors first disable anisotropic filtering on the host GPU, since this functionality is implemented in the HMC. However, if the design still followed the original filtering order (bilinear -> trilinear -> anisotropic), the bilinear filter would still have to fetch a large number of texels as inputs to satisfy the demands of anisotropic filtering, which is suboptimal. The authors therefore reorder the texture filtering sequence so that anisotropic filtering is precalculated for each fetched texel in the logic layer of the HMC, before bilinear and trilinear filtering run on the host GPU.
In this way, the texture units on the host GPU can fetch a small number of texels from the stacked memory while the most expensive filtering is processed in memory. The following figure shows the basic filtering compared to the new filtering process. The authors prove the correctness of this new filtering in the paper.
The basic flow of texture filtering in A-TFIM is as follows. First, the texture units on the host GPU fetch the required texels (i.e., the number of texels that bilinear filtering requires) from the memory stack. The paper calls these texels, requested with anisotropic filtering disabled, parent texels. Once the logic layer in the HMC receives the parent texel information package offloaded by the host GPU, it generates a set of child texels based on the texture attributes of the requested parent texels and feeds them through the normal anisotropic filtering process in the HMC to approximate the requested parent texels. Finally, these approximated parent texels are sent back to the texture units and cached in the texture caches as conventional inputs for bilinear and trilinear filtering, so they can be reused by later filtering operations. In this way, A-TFIM not only speeds up anisotropic filtering but also reduces memory traffic significantly without sacrificing frame quality.
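The parent/child texel flow can be sketched in Python as follows. This is a rough model under loud assumptions, not the authors' design: child texels are generated as evenly spaced integer offsets along a given axis, the in-memory anisotropic filter is a plain average, and the host cache is keyed by the parent coordinate only, which is valid only while the camera angle (the axis) stays constant. The names `child_texel_coords`, `LogicLayer`, and `HostTextureUnit` are hypothetical.

```python
def child_texel_coords(parent, axis, taps, width, height):
    """Generate child texel coordinates around a parent along `axis` (assumed scheme)."""
    px, py = parent
    du, dv = axis
    coords = []
    for i in range(taps):
        t = i - (taps - 1) / 2  # offsets centered on the parent texel
        x = min(max(int(round(px + t * du)), 0), width - 1)
        y = min(max(int(round(py + t * dv)), 0), height - 1)
        coords.append((x, y))
    return coords

class LogicLayer:
    """HMC side: reads child texels internally and returns one approximated parent."""
    def __init__(self, texture):
        self.texture = texture
        self.internal_reads = 0

    def approximate_parent(self, parent, axis, taps):
        h, w = len(self.texture), len(self.texture[0])
        coords = child_texel_coords(parent, axis, taps, w, h)
        self.internal_reads += len(coords)  # served by internal bandwidth
        return sum(self.texture[y][x] for x, y in coords) / len(coords)

class HostTextureUnit:
    """GPU side: offloads parent texels and caches the approximated results."""
    def __init__(self, logic_layer):
        self.logic_layer = logic_layer
        self.cache = {}    # assumes a constant camera angle between requests
        self.offloads = 0  # packages sent over the external links

    def parent_texel(self, parent, axis, taps):
        if parent not in self.cache:
            self.offloads += 1
            self.cache[parent] = self.logic_layer.approximate_parent(parent, axis, taps)
        return self.cache[parent]

# 4x4 single-channel texture with value 10*x + y at texel (x, y).
texture = [[10.0 * x + y for x in range(4)] for y in range(4)]
hmc = LogicLayer(texture)
host = HostTextureUnit(hmc)

first = host.parent_texel((1, 1), (1.0, 0.0), 3)
again = host.parent_texel((1, 1), (1.0, 0.0), 3)  # served from the texture cache
print(first, host.offloads, hmc.internal_reads)
```

In this model the host transfers one approximated texel per parent while the three child reads stay inside the memory stack, and a repeated request for the same parent never leaves the GPU.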
This paper enables processing-in-memory-based GPUs for efficient 3D rendering. First, it implements a basic GPU-HMC mechanism that shows fair improvements in performance and energy consumption. Then, it designs a simple approach (S-TFIM) that moves all texture units of the host GPU into the logic layer of the HMC. This approach leverages the high internal bandwidth of the HMC for texture filtering, but it adds unnecessary memory traffic for data movement, which degrades performance and increases energy consumption compared to the basic GPU-HMC approach. To address the memory traffic issue, the paper proposes an advanced mechanism (A-TFIM) that reorders the texture filtering sequence and precalculates the anisotropic filtering for each fetched texel in the logic layer of the HMC. It also proposes an approximation scheme that accompanies the advanced architecture and controls the performance-quality tradeoff. Evaluations show, on average, a 3.97X performance improvement and a 43% decrease in energy consumption.
For Future Reading
- L. Ke et al., “RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing,” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020, pp. 790–803, DOI: 10.1109/ISCA45697.2020.00070.
- C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, “Processing-in-Memory Enabled Graphics Processors for 3D Rendering,” 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2017, pp. 637–648, DOI: 10.1109/HPCA.2017.37.