Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching
Some computer architecture researchers have recently focused their attention on Processing In-Memory (PIM) due to conventional architectures’ memory wall and energy consumption issues. These architectures provide efficient processing infrastructures for data-intensive applications like pattern-matching kernels in Big Data domains, Image Processing, Neural Networks, etc. Efficient pattern matching is crucial for many applications like network intrusion detection, spam filters, virus scanners, etc. Pattern matching requires high-throughput and real-time processing, which turns into a performance bottleneck in the aforementioned applications. The high demand for accelerated pattern matching has motivated several in-memory architectures for automata processing. E. Sadredini et al. observe that these architectures [2,3,4] are based on 8-bit symbol processing, which results in hardware resources underutilization. Also, multi-stride symbol processing that is a major source of throughput growth is not explored in the existing in-memory solutions . The authors proposed a denser mechanism, Impala, which puts forward multi-stride in-memory automata processing architecture. Their key insight is that transforming 8-bit processing to 4-bit processing exponentially reduces resource requirements for state-matching and improves resource utilization. Furthermore, the proposed mechanism introduces area, throughput, and energy benefits at the expense of increased offline compilation time. The authors’ evaluations show that Impala has on average 2.7X higher throughput per unit area and 1.22X lower power consumption than the state-of-the-art research work .
Non-deterministic Finite Automata (NFA)
An NFA is represented by the 5-tuple shown in the following figure. The transition function determines the next states using the currently active states and the input symbol.
In the Impala, a homogeneous form of NFAs is used. In a homogeneous automaton, all transitions entering a state must happen on the same input symbol. This provides a nice property that aligns well with the hardware implementation that finds matching states in one clock cycle and allows a label-independent interconnect. The following figure shows an example of a classic NFA and its equivalent homogeneous representation.
In-Memory Automata Processing
The following figure shows a two-level pipeline architecture for single symbol processing of common automata accelerators. Memory columns are configured based on the above homogeneous NFA example for STE0-STE3. Automata processing involves two steps for each input symbol, state matching, and state transition. In the state match phase, the input symbol is decoded, and the set of states whose rule or label matches that input symbol is detected through reading a row of memory (match vector). Then, the set of potentially matching states is ANDed with the active state vector, which indicates the set of states that is currently active and allowed to initiate state transitions. In the state-transition phase, the potential next-cycle active states are determined for the currently active states (active state vector) by propagating signals through the interconnect to update the active state vector for the next input symbol operation.
The authors’ experiments show that about 73% of the states only match against a single symbol, which implies that only a single cell in a memory column of 256 cells is set to 1 and the rest are 0. Also, they show that more than 86% of the states are matched against at most eight symbols meaning only 3% of the cells in memory cells are utilized. While a memory column with 256 cells is powerful enough to implement any Boolean function with eight inputs, existing automata computing architectures implement a relatively simple function, for example, 73% of the time, states are comparing against a single symbol, which is a boolean function with a single product term. In other words, a simpler low-cost matching architecture targeting the dominant case of a small number of matching symbols is more efficient than the existing costly matching architecture targeting infrequent complex matching conditions. To improve the utilization of state-matching resources, the authors compress the 8-bit symbols to 4-bit and re-shape the automaton accordingly for correct and lossless functionality. The corresponding hardware change is that state-matching columns now can be designed with shorter subarray, 16 memory cells instead of 256, which reduces the waste significantly. Furthermore, they mention that another work on in-SRAM computation for binary neural networks  concludes that shorter SRAM subarrays (shorter memory columns) provide a better classification accuracy due to a smaller quantization error when calculating the partial sum in convolution operation. However, a 4-bit automaton halves the processing rate. To increase the throughput, Impala proposes to process multiple 4-bit symbols/cycle.
Noting that squashing to 4-bit increases the number of states 2.52X on average. But, the memory column size is exponentially decreased from 2⁸ to 2⁴.
The Impala's compiler first reshapes a given NFA with 8-bit symbols to process 4-bit symbols. The compiler first generates single-bit-automata by replacing each edge in the 8-bit original automaton with a sequence of states of length 8 and then traverses the bit-automaton with paths of length 4 and concatenates edges to generate 4-bit symbols.
Vectorized Temporal Striding
As stated before the 4-bit automata processing halves the processing rate, to increase the throughput the reshaped NFA is reshaped again to find its equivalent automaton that processes multiple 4-bit symbols per cycle. The paper calls this transformation “Vectorized Temporal Striding” as the state’s matching rules are organized in vectors. Temporal striding [6,7] and its vectorized version are transformations that repeatedly square the input alphabet of an input automaton and adjust its matching symbols and transition graph accordingly. The transformed automaton is functionally equal to the original automaton, but it processes multiple symbols per cycle, thus increasing throughput. The following figure (a) shows a classical NFA, (b) shows a 4-bit classical NFA, (c) shows the 4-stride automaton of the one in (b), and (d) converts it to its homogeneous representation. In the notation STEx-y, x is the state index and y is the stride size. The resulting NFA is called vectorized 4-stride because each 4-bit symbol in STE4–0 represents one dimension in a four-dimensional vector, and each dimension will be mapped to a separate memory column in the Impala’s area-efficient matching architecture. The paper uses the vectorization concept for efficient hardware mapping. The temporal striding method is already discussed in .
While the exponential hardware cost of naively increasing the processing rate to achieve higher throughput in the classical 8-bit in-memory architectures discourages a naive memory-based solution for multi-symbol matching, the proposed mechanism redesigns the matching architecture to a set of short and parallel memory columns combined with an AND gate, where every 4-bit increase in processing rate only requires an additional parallel 16-bit memory column. Columns are placed in different subarrays, and each receives 4 bits of the input symbols and processes them independently. This is a linear increment of hardware cost compared to the exponential growth in the naive matching architectures. The Impala calls each of these combined memory columns a capsule. Capsules are suitable for states with a simple matching character set. But, for infrequent matching cases that a single capsule can not precisely implement, and this may lead to generating false-positive reports. For example, in the above figure, (e) shows how using only one capsule to implement STE4–0 generates wrong reports when the input is \xBDEB. For addressing this issue, authors use Espresso  to find the minimum symbol splits to avoid false positives. In the above figure, (f) shows splitting STE4–0 to three states, and (g) illustrates mapping them to three capsules. Noting that the algorithm in the above figure shows the offline pre-processing steps to map an automaton to Impala’s architecture. By this proposed mechanism, the authors mention that the compiling time per automaton increases, but it is much less compared to AP compiler and FPGA synthesis tools.
The following figure shows the Impala mechanism. The matching operation is performed by a group of memory subarrays to distribute the matching burden among them in parallel. Memory columns with the same index in subarrays are combined using an AND gate to form a capsule. Matching of each state will be assigned to one of these capsules. Each memory column in a capsule processes a portion, for example, 4 bits, of the input symbols by a single memory row access. The final single-bit matching result for each state is the output of the AND gate in each capsule. Any additional 4-bit input processing only adds one more column to each capsule. Thanks to this inherent parallelism in the Impala, the latency of the overall matching of every stride value is always equivalent to the read-cycle latency of a short bit line memory subarray plus the latency of the AND gate in a capsule.
The interconnect provides the functionality to move active states toward the next states. A state gets activated if (1) the current symbol matches the state and (2) any of its parents were activated in the previous cycle. The second condition implies that the interconnect should provide the OR-functionality. The Impala interconnect is based on CA’s  full-crossbar design. However, Sadredini et al. in their previous work  showed that the connectivity pattern in real-world automata are diagonal-shaped in a full-crossbar design mapping, and this insight is used to allow for reduced cross-bar switches. They take inspiration from their observation and present a placement solution that addresses the issues (low utilization and if an automaton is larger than 256 states and has long-distance loops, the CA’s placement algorithm cannot handle it) in the CA  placement for reduced crossbars using a genetic algorithm.
The Impala is an in-memory accelerator for multi-stride automata processing. It is co-designed with Vectorized Temporal Squashing and Striding algorithm (V-TeSS). Also, it leverages short and parallel memory columns to implement a dense, high-throughput, and low-power multi-symbol matching architecture. Overall, the benefits of Impala come from two observations: (1) smaller state-matching sub-arrays provide higher utilization of memory cells in memory columns, and this translates to higher density, and (2) V-TeSS transformation provides higher throughput with a linear increase in state-matching memory columns relative to the original 8-bit design. The evaluations show significant performance and throughput per area improvements and lower power consumption compared to the CA .
For Future Reading
- C. Xie, S. L. Song, J. Wang, W. Zhang, and X. Fu, “Processing-in-Memory Enabled Graphics Processors for 3D Rendering,” 2017 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2017, pp. 637–648, DOI: 10.1109/HPCA.2017.37.
- P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing,” in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 12, pp. 3088–3098, Dec. 2014, DOI: 10.1109/TPDS.2014.8.
- E. Sadredini, R. Rahimi, V. Verma, M. Stan, and K. Skadron, “A Scalable and Efficient In-Memory Interconnect Architecture for Automata Processing,” in IEEE Computer Architecture Letters (CAL), vol. 18, no. 2, pp. 87–90, 1 July-Dec. 2019, DOI: 10.1109/LCA.2019.2909870.
Michela Becchi and Patrick Crowle, “Efficient regular expression evaluation: theory to practice”. In Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS ‘08). Association for Computing Machinery, New York, NY, USA, 50–59. DOI:https://doi.org/10.1145/1477942.1477950
R. L. Rudell and A. Sangiovanni-Vincentelli, “Multiple-Valued Minimization for PLA Optimization,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 6, no. 5, pp. 727–750, September 1987, DOI: 10.1109/TCAD.1987.1270318.
- Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mircea Stan, and Kevin Skadron, “EAP: A Scalable and Efficient In-Memory Accelerator for Automata Processing,” In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ‘52). New York, NY, USA, 87–99. DOI:https://doi.org/10.1145/3352460.3358324
 E. Sadredini, R. Rahimi, M. Lenjani, M. Stan and K. Skadron, “Impala: Algorithm/Architecture Co-Design for In-Memory Multi-Stride Pattern Matching,” 2020 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2020, pp. 86–98, DOI: 10.1109/HPCA47549.2020.00017.
 P. Dlugosch, D. Brown, P. Glendenning, M. Leventhal, and H. Noyes, “An Efficient and Scalable Semiconductor Architecture for Parallel Automata Processing,” in IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 12, pp. 3088–3098, Dec. 2014, DOI: 10.1109/TPDS.2014.8.
 A. Subramaniyan, J. Wang, R. M. Ezhil Balasubramanian, D. Blaauw, D. Sylvester and R. Das, “Cache Automaton,” 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017, pp. 259–272.
 E. Sadredini, R. Rahimi, V. Verma, M. Stan, and K. Skadron, “A Scalable and Efficient In-Memory Interconnect Architecture for Automata Processing,” in IEEE Computer Architecture Letters (CAL), vol. 18, no. 2, pp. 87–90, 1 July-Dec. 2019, DOI: 10.1109/LCA.2019.2909870.
 R. Liu et al., “Parallelizing SRAM Arrays with Customized Bit-Cell for Binary Neural Networks,” 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018, pp. 1–6, DOI: 10.1109/DAC.2018.8465935.
 B. C. Brodie, D. E. Taylor, and R. K. Cytron, “A Scalable Architecture For High-Throughput Regular-Expression Pattern Matching,” 33rd International Symposium on Computer Architecture (ISCA’06), 2006, pp. 191–202, DOI: 10.1109/ISCA.2006.7.
 Michela Becchi and Patrick Crowle, “Efficient regular expression evaluation: theory to practice”. In Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS ‘08). Association for Computing Machinery, New York, NY, USA, 50–59. DOI:https://doi.org/10.1145/1477942.1477950
 R. L. Rudell and A. Sangiovanni-Vincentelli, “Multiple-Valued Minimization for PLA Optimization,” in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 6, no. 5, pp. 727–750, September 1987, DOI: 10.1109/TCAD.1987.1270318.
 Elaheh Sadredini, Reza Rahimi, Vaibhav Verma, Mircea Stan, and Kevin Skadron, “EAP: A Scalable and Efficient In-Memory Accelerator for Automata Processing,” In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ‘52). New York, NY, USA, 87–99. DOI:https://doi.org/10.1145/3352460.3358324