“The Overall View of the Microarchitecture of Superscalar Processors” Summary
Scalar processors process only one data item in a clock cycle, which are classified as Single Instruction Single Data (SISD) processors in Flynn’s taxonomy. On the other side, superscalar processors are capable of executing more than one instruction in a clock cycle by exploiting Instruction-Level Parallelism (ILP). Superscalar processors convert a seemingly sequential program into a more parallel one. A superscalar processor fetches and decodes the incoming instruction stream several instructions at a time. Pipelined processors' initiation rate is one instruction per cycle. it was perceived to be a serious practical bottleneck. But, superscalar processors by initiating more than one instruction at a time into multiple pipelines, broke the single instruction per cycle bottleneck. As part of the instruction fetching process, the outcomes of conditional branch instructions are usually predicted in advance to ensure an uninterrupted stream of instructions. The outcoming instruction stream is then analyzed for data dependencies, and instructions are distributed to functional units, often according to instruction type. Next, instructions are initiated for execution in parallel, based primarily on the availability of operand data, rather than their original program sequence, which is referred to as Dynamic Instruction Scheduling. Upon completion, instruction execution results are resequenced so that they can be used to update the process state in the original program order. This article summarizes the aforementioned phases of superscalar processing, which is extensively described in this paper .
Note that a computer designer is usually faced with maintaining binary compatibility, i.e., maintaining instruction set compatibility and a sequential execution model. However, superscalar processor implementations deviate radically from sequential execution. As a result, the program binary nowadays should be viewed as a specification of what has to be done, not how it is done in reality. A modern superscalar processor takes the sequential specification as embodied in the program binary and removes much of the nonessential sequentially to turn the program into a parallel, higher-performance version, yet the processor retains the outward appearance of sequential execution.
Program Representation, Dependencies, and Parallel Execution
A high-level program (e.g., written in C++) compiles into the static machine level program or the program binary. This static program consists of available instructions in the processor’s Instruction Set Architecture (ISA). This program follows an implicit sequence nature. In other words, the instructions are supposed to execute one after another. The following figure shows a high-level program and its corresponding assembly code.
The sequence of executed instructions of the static program forms a dynamic instruction stream. In other words, by increasing the program counter static instructions enter into a dynamic sequencing. when there is a conditional branch or jump, the program counter updates to a non-consecutive address. Control dependency of instruction is its dependence on the execution of its dynamically precedent instructions. The first step in increasing instruction-level parallelism (ILP) is to overcome control dependencies.
The sample program above can be viewed as a collection of basic blocks of instructions, with a single entry point and a single exit point. That example consists of three basic blocks. Once a basic block is entered by the instruction fetcher, all instructions inside it will be executed finally. Therefore, any sequence of instructions in a basic block can be initiated into a conceptual window of execution. Once instructions have been initiated into this execution window, they can execute in parallel, subject only to data dependence constraints.
Alongside the parallelism in basic blocks for more parallelism, control dependencies, due to conditional branches, have to be overcome. A solution is to predict the outcome of a conditional branch and speculatively fetch and execute instructions from the predicted path. Instructions from the predicted path enter into the execution window. If the prediction is later found to be correct, then the speculative status of the executed instructions is removed, and their effect on the state is like other instructions. However, if the prediction was incorrect, the recovery actions must be initiated to prevent incorrect modification of the architectural process state.
Instructions of basic blocks placed in the execution block may begin execution subject to data dependencies. These data dependencies among instructions happen because the instructions may access the same memory location or register. The data dependencies are referred to as data hazards. For more parallelism, these dependencies must be taken into account to prevent any incorrect operation. Data dependencies are categorized into three groups: (1) Read-After-Write (RAW) or true dependency, (2) Write-After-Read (WAR) dependency, and (3) Write-After-Write dependency. The WAR and WAW dependencies are called artificial data dependencies because they are usually caused by unoptimized code, limited register storage, desire to use less main memory, and loops where an instruction causes a hazard with itself. The following figure shows the data hazards in a basic block of the sample code.
After resolving control dependencies and artificial dependencies, instructions are issued and begin to execute in parallel. In essence, the hardware forms a parallel execution schedule. This schedule takes into account true dependencies and hardware resource constraints of functional units and data paths. A parallel execution schedule means that instructions complete execution in an order different from the dictated one by the sequential execution model.
The architectural storage (visible to the world outside microarchitecture) cannot be updated immediately when instructions complete execution. The results of an instruction are held in a temporary status until the architectural state can be updated. Meanwhile, for higher performance purposes, these results must be usable by dependent instructions. Eventually, when it is determined that the sequential model would have executed an instruction, its temporary results are made permanent by updating the architectural state. This process is called committing or retiring the instruction. The following figure illustrates the parallel execution method used in most superscalar processors. Noting that the only order that the parallel execution of instruction inside an execution window follows is the true dependence constraint (RAW data dependency).
The Microarchitecture of a Typical Superscalar Processor
The major parts of a superscalar processor’s microarchitecture are (1) instruction fetch and branch prediction, (2) decode and register dependence analysis, (3) issue and execution, (4) memory operation analysis and execution, (5) instruction reorder and commit. The following figure illustrates the microarchitecture of a typical superscalar processor.
Note that underlying this organization there is a pipelined implementation where specific pipeline stages may or may not be aligned with the major phases of superscalar execution. Those pipeline stages are considered a lower-level implementation design issue, which this article does not dive into it.
Instruction Fetching and Branch Prediction
This phase supplies instructions to the rest of the processing pipeline. An instruction cache is used to reduce the latency and increase the bandwidth of the instruction fetch process. As a superscalar processor processes several instructions per instruction, so the fetch phase must fetch multiple instructions per cycle. To support this high instruction fetch bandwidth, the instruction cache and data cache are separated. The instruction fetch bandwidth should match the peak instruction decode and execution rate and is usually higher. This allows for instruction cache misses and for situations (e.g., branch instruction) where fewer than the maximum number of instructions can be fetched. Some processors have taken special steps to allow wider fetches, for example by fetching from two adjacent cache lines simultaneously.
To help smooth out the instruction fetch irregularities caused by cache misses and branches, there is an instruction buffer (can be found in the figure above after instruction cache) to hold several fetched instructions. With the help of this buffer, the fetch mechanism can build up a stockpile to carry the processor through periods when instruction fetching is stalled or restricted.
The default instruction fetching is to increment the PC by the number of instructions fetched, and use the incremented PC to fetch the next block of instructions. But, in the case of branches, which redirect the flow, the fetch mechanism must be redirected to fetch from the branch target. The processing of conditional branches can be broken down into the following parts:
- Recognition of instruction as a conditional branch: Identifying instruction types can be sped up by holding some instruction decode information alongside instructions in the instruction cache. These extra bits allow very simple logic to identify the instruction types. There is a pre-decode logic before the instruction cache, which generates pre-decode bits and stores them in the instruction cache.
- Determining the branch outcome (taken/ not-taken): There is a dependence between the branch instruction and a preceding uncompleted instruction. Rather than waiting, the outcome of a conditional branch can be predicted using one of several types of branch prediction mechanisms. In general, the branch prediction methods can be broken into two kinds: (1) Static mechanisms, (2) Dynamic Mechanisms. In static methods, for example, some opcode types might more result in taken branches than others, or a backward branch prediction might be more often taken, or the compiler might be able to set a flag in an instruction to indicate the most likely branch direction based on the knowledge of the high-level language program. Information gathered during a previous run of the program by the compiler (profiling) can also be used to act as an aid for static branch prediction methods. On the other hand, dynamic branch prediction uses information that becomes available as the program executes. This information regards the history of branch outcomes. This history is saved in a branch history table (or branch prediction table) or may be appended to the instruction cache line that contains the branch. The table is typically organized in a cache-like manner and is accessed with the address of the branch instruction to be predicted. Within the table, the history is recorded by using multiple bits, typically implemented as small counters (so the results of several past branches can be summarized, and based on them the prediction is done). The counters are incremented on a taken branch (stopping at a max value) and are decremented on a not-taken branch (stopping at a min value). Consequently, a counter’s value summarizes the dominant outcome of recent branch executions. At some time after a prediction has been made, the actual branch outcome is evaluated. Dynamic branch history information (counters) can then be updated. If the prediction was incorrect, instruction fetching must be directed to the right path. Furthermore, if instructions were processed speculatively based on the prediction, they must be purged and their results must be nullified.
- Computing the branch target: Computing a branch target requires an integer addition. In most architecture, branch targets are relative to the PC and use an offset value held in the instruction, which eliminates the need for reading the registers (However, there may be some branches that do need register contents to compute branch targets). Finding the target address can be sped up by having a branch target buffer that holds the target address that was used the last time the branch was executed.
- Transferring control by redirecting instruction fetch in the case of a taken branch: On a taken branch, it takes at least one cycle delay in recognizing the branch, modifying the program counter, and fetching instructions from the target address. This delay results in pipeline bubbles. The most common solution is using an instruction buffer with its stockpiled instruction to mask the delay. Some of the earlier RISC instruction sets used delayed branches. But, more recent processors, delayed branches are considered to be a complication because of the assumptions they make about the pipeline structure.
Instruction Decoding, Renaming, and Dispatch
In this phase, instructions are removed from the instruction fetch buffers, examined, and control and data dependence linkages are set up for the remaining pipeline phases. Note that RAW hazards are detected in this phase and other hazards (WAW and WAR) are resolved. The WAR and WAW hazards are caused by register reuse. Also, this phase dispatches (or distributes) instructions to buffers associated with hardware functional units for later issuing and execution. The decoding phase specifies (1) an operation to be executed, (2) the identities of storage elements where the input operands reside or will eventually reside, (3) locations where the instruction’s result must be placed.
In the static program, the storage elements are the architectural storage elements, namely the architected, or logical, registers and main memory locations. For overcoming WAR and WAW hazards and increasing parallelism during dynamic execution, there are physical storage elements that differ from the logical storage elements. During parallel execution, there may be multiple data values stored in multiple physical storage elements, which all the values are associated with the same logical storage element. Noting that each of the values corresponds to different points in time during a purely sequential execution process. When an instruction creates a new value for a logical register, the physical location of the value is given a name, which is known by the hardware. Any subsequent instructions, which use the value as an input, are provided with the name of its physical storage location. It is accomplished in this phase (shown in the figure above) by replacing the logical register name in an instruction’s execution information (like register designator) with the new name of the physical storage location. This process is called “Register Renaming,” and it is accomplished for eliminating false dependencies. Generally, there are two register renaming methods as described below:
(1) A larger physical register file (RF) than the logical register file: In this method, there is a mapping table that associates a physical register with the logical register’s value. Instructions are decoded and register renaming is performed in sequential program order. When an instruction is decoded, its logical result register (destination) is assigned a physical register from a free list and the mapping table is adjusted to reflect the new logical to physical mapping. The free list keeps a record of the physical registers that do not have a logical value assigned to them. In the process, the physical register is removed from the free list. Also, as part of the rename operation, the instruction’s source register designators are used to look up their current physical register names in the mapping table. These are the locations from which the source operand values will be read. Instructions dispatch is temporarily halted if the free list is empty. The following figure shows a simple physical register renaming. Note that architected (logical) registers are shown with lower-case letters. Also, pay attention that any subsequent instruction that reads the value produced by the add instruction will have its source register mapped to R2 so that it gets the correct value.
The only matter is reclaiming the physical registers for the free list as a physical register has been used for the last time. It can be placed back into the free list for use by other instructions. One simple method is associating a counter with each physical register. The counter is incremented each time a source register is renamed to the physical register and is decremented each time an instruction issues and reads a value from the register. The physical register is free whenever the count is zero, provided the corresponding logical register has been subsequently renamed to another physical register. A simpler alternative method is to wait until the corresponding logical register has not only been renamed by a later instruction but the later instruction has received its value and has been committed.
(2) A physical register file in the size of the logical register file:
In this method, the customary one-to-one relationship is maintained. Also, there is a buffer with one entry per active instruction. An active instruction has been dispatched for execution but has not yet been committed. This buffer is typically referred to as a reorder buffer because it is also used to maintain proper instruction orderings for precise interrupts. The following figure illustrates a reorder buffer. It is a FIFO storage implemented in hardware as a circular buffer with head and tail pointers. As instructions are dispatched according to sequential program order, they are assigned entities at the tail of the reorder buffer. As instructions complete execution, their result values are inserted into their previously assigned entry, whenever it may happen to be in the reorder buffer.
At the time an instruction reaches the head of the reorder buffer, if it has completed execution, its entry is removed from the buffer and its result value is placed in the register file. An incomplete instruction blocks at the head of the reorder buffer until its value arrives. A logical register value may reside either in the physical register file or may be saved in the reorder buffer. When an instruction is decoded, its result value is first assigned a physical entry in the reorder buffer and a mapping table entry is set up accordingly. That is, the table indicates that a result value can be found in the specific reorder buffer entry corresponding to the instruction that produces it. An instruction’s source designators are used to access the mapping table. The table either indicates that the corresponding register has the required value, or that it may be found in the reorder buffer. The following figure shows the register renaming method with the help of reorder buffer.
Instruction Issuing and Parallel execution
As the instructions are decoded and the execution information is ready, the next step is to determine which instructions can be issued for execution. Instruction issue is defined as the run-time checking for the availability of data and resources. This is an area of the processing pipeline which is at the heart of many superscalar implementations, the area which contains the window of execution. An instruction is ready to execute as soon as its input operands are available. But, other constraints may be placed on the issue of instructions, most notably the availability of physical resources such as execution units, interconnect, and register file or reorder buffer ports.
The following figure is a possible execution schedule of the first code in this article. The assumption is that there are two integer units, one path to the memory, and one branch prediction unit. Note that on the figure only register renamings for “r3” are shown.
There are several different ways of organizing issue buffers (buffers storing instructions from dispatch).
(1) Single Queue Method: if there is a single queue, and no out-of-order issuing, then register renaming is not needed, and operand availability can be managed via simple reservation bits assigned to each register. A register is reserved when an instruction, which modifies the register, issues, and the reservation is cleared when the instruction completes. An instruction may issue (subject to physical resource availability) if there are no reservations on its operands.
(2) Multiple Queue Method: With multiple queues, instructions are issued from each queue in order, but the queues may issue out of order concerning one another. The individual queues are organized according to instruction type (integer, floating-point, load/store). In this way, register renaming may be used in a restricted way. For example, only the registers loaded from memory may be renamed. This allows the load/store queue to slip ahead of the instruction queues, fetching memory data in advance of when it is needed.
Reservations Stations: The idea of reservation stations was introduced in Tomasulo’s algorithm in which instruction may issue out of order. So, all of the reservation stations simultaneously monitor their source operands for data availability. The traditional way of doing this is to hold operand data in the reservation station. But, in newer reservation station implementations pointers to where the data can be found are held, e.g., in the register file or a reorder buffer entry. Traditionally, when an instruction is dispatched to the reservation station, all already available operand values are read from the register file and placed in the reservation station. Then reservation station logic compares the operand designators of unavailable data with the result designators of completing instructions. When there is a match, the result value is pulled into the matching reservation station. When all the operands are ready in the reservation station, the instruction may issue subject to hardware resource availability. The following figure shows an example of a reservation station.
Handling Memory Operations
The assumption is that our processors follow RISC architecture and only support load/store instructions. To reduce the latency of memory operations, memory hierarchies are employed. The expectation is that most data requests will be serviced by data cache memories residing at the lower levels of the hierarchy. Unlike ALU instructions, for which it is possible to identify during the decode phase the register operands that will be accessed, it is not possible to identify the memory locations until after the issue phase because the determination of the memory location that will be accessed requires an address calculation, usually an integer addition. After address calculation, an address translation may be required to generate a physical address. A cache of translation descriptors of recently accessed pages, the translation lookaside buffer (TLB), is used to speed up this address translation process. Once a valid memory address has been obtained, the load or store operation can be submitted to the memory. For more performance, usually address translation and memory access are overlapped. The initial cache access is done in parallel with the address translation and the translated address is then used to compare with cache tag(s) to determine if there is a hit.
Multi-porting the memory hierarchy and using non-blocking caches are solutions toward gaining higher performance. Multi-porting can be achieved by having multi-ported storage cells, having multiple banks of memory, or making multiple requests during the same cycle. To allow memory operations to overlap with other operations, the memory hierarchy must be non-blocking. That is, if a memory request misses in the data cache, other processing operations, including further memory requests, should be allowed to proceed.
Noting that the key to allowing memory operations to be overlapped (to proceed out of order) is to ensure that hazards are properly resolved, and the sequential semantics are preserved. Store address buffers are used to make sure that operations submitted to the memory hierarchy do not violate hazard conditions. A store buffer contains addresses of all pending store operations. Before an operation, which can be either a load or a store, is issued to the memory, the store buffer is checked to see if there is a pending store to the same address. The following figure shows a simple method for the implementation of hazard detection logic.
Once the operation has been submitted to the memory hierarchy, the operation may hit or miss in the data cache. In a missing case, the line containing the accessed data has to be fetched into the cache. Further accesses to the missing line must be made to wait, but other accesses may proceed. Miss Handling Status Registers (MHSRs) are used to track the status of outstanding cache misses and allow multiple requests to the memory hierarchy to be overlapped.
This is the last phase of an instruction’s lifetime. This phase is also called the “retire” phase. In this phase, the instructions are allowed to modify the logical (or architectural) process state. The purpose of this stage is to implement the appearance of a sequential execution model even though the actual execution is very non-sequential.
There are two main techniques to recover a precise state, both of which require the maintenance of two types of state: the state that is being updated as the operations execute and another state that is required for recovery.
In the first technique, the state of the machine at certain points is saved (or checkpointed) in either a history buffer or a checkpoint [2, 3]. Instructions update the state of the machine as they execute and when a precise state is needed, it is recovered from the history buffer. In this case, all that has to be done in the commit phase is to get rid of the history state that is no longer required.
The second mechanism is to separate the state of the machine into two: The physical state is updated immediately as the operations are complete. The architectural state of the machine is updated in sequential program order, as the speculative status of operations is cleared. With this technique, the speculative state is maintained in a reorder buffer; to commit an instruction, its result has to be moved from the reorder buffer into the architectural register file and the memory in the case of a store, and space is freed up in the reorder buffer. Also, during the commit phase of a store instruction, a signal is sent to the store buffer and the store data is written to memory.
In the reorder buffer shown earlier, also for each entry, there is a place for the data value after it has been computed, but before it is committed. And, there is typically a place to hold the PC value of the instruction and any interrupt conditions that might occur as a result of the instruction’s execution. These are used to inform the commit logic when an interrupt should be initiated and the program counter value that should be entered as part of the precise interrupt state.
Although the checkpoint/history buffer method has been used in superscalar processors, the reorder buffer technique is by far the more popular method because, in addition to providing a precise state, the reorder buffer method also helps implement the register renaming function.
The Role of Software
The software can assist superscalar processing by creating a binary so that the instruction fetching process and the instruction issuing and execution process can be made more efficient. This can be done in the following two ways:
- Increasing the likelihood that a group of instructions that are being considered for the issue can be issued simultaneously.
- Decreasing the likelihood that an instruction has to wait for a result of a previous instruction when it is being considered for execution.
Both these techniques can be achieved by scheduling instructions statically. By arranging the instructions so that a group of instructions in the static program matches the parallel execution constraints of the underlying superscalar processor, the first objective can be achieved. The parallel execution constraints that need to be considered include the dependence relationships between the instructions, and the resource available in the microarchitecture. To achieve the second goal, an instruction that produces a value for another instruction can be placed in the static code so that it is fetched and executed far in advance of the consumer instruction.
 J. E. Smith and G. S. Sohi, “The microarchitecture of superscalar processors,” in Proceedings of the IEEE, vol. 83, no. 12, pp. 1609–1624, Dec. 1995, DOI: 10.1109/5.476078.
 W. W. Hwu and Y. N. Patt, “Checkpoint Repair for High-Performance Out-of-Order Execution Machines,” in IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1496–1514, Dec. 1987, DOI: 10.1109/TC.1987.5009500.
 J. E. Smith and A. R. Pleszkun, “Implementing precise interrupts in pipelined processors,” in IEEE Transactions on Computers, vol. 37, no. 5, pp. 562–573, May 1988, DOI: 10.1109/12.4607.