Processors: CPU, GPU, FPGA, Accelerator
Note 1: This post is built around an introductory video and can be useful for pure software, algorithm, or data science people who have not taken any course on computer architecture and hardware and want to learn. It can also be informative for computer engineering and computer science undergraduate students.
Note 2: Xilinx, one of the largest FPGA vendors, has been part of AMD since Feb. 14, 2022 [check here]. This is worth noting because the video refers to Xilinx as an independent company.
Introduction
People in computer science have heard many times that GPUs were the primary enabler of the current deep learning surge. The following video reviews the processors mentioned in the title, goes a little deeper into how CPUs work, and explains what distinguishes them from one another.
CPU
CPU stands for Central Processing Unit. The “central” reflects the fact that these processors are the primary devices in charge of the whole system: the operating system runs on them, and they handle the data flow throughout the entire system. Other processors, such as GPUs or application-specific processors, step in to help when the CPU offloads some of its work to them.
A CPU is designed to be fast at executing sequential programs. It can execute a sequential program much faster than other processors because it runs at a high clock frequency and employs sophisticated hardware structures such as out-of-order execution, a cache hierarchy, pipelining, and a branch prediction unit.
Modern CPUs fetch several instructions of a program from memory and execute them in parallel implicitly. In other words, CPUs exploit Instruction-Level Parallelism (ILP), which the programmer knows nothing about. These days, however, CPUs also have multiple cores and can execute more than one thread at a time, provided the programmer thinks about and develops their program in a parallel way, as sketched below. Besides, CPUs offer fine-grained and efficient time-sharing compared to, for example, GPUs; thanks to this capability, a CPU can easily switch among threads and keep all of them satisfied with how fast it executes.
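To make the contrast concrete, here is a minimal sketch in C++ (the host language of CUDA, which all code in this post sticks to). The sequential sum relies entirely on the CPU's implicit ILP, while the threaded version expresses parallelism explicitly across cores. The array size and thread count are arbitrary illustrative choices, not anything from the video.

```
#include <cstddef>
#include <thread>
#include <vector>

// Sequential sum: the out-of-order core extracts ILP from this
// loop on its own; the programmer does nothing special.
long long sum_sequential(const std::vector<int>& data) {
    long long total = 0;
    for (int x : data) total += x;
    return total;
}

// Explicit parallelism: the programmer must split the work
// across threads to use more than one core.
long long sum_parallel(const std::vector<int>& data, unsigned num_threads) {
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&, begin, end, t] {
            for (std::size_t i = begin; i < end; ++i) partial[t] += data[i];
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}

int main() {
    std::vector<int> data(1 << 20, 1);   // 1M elements, all ones (illustrative)
    long long a = sum_sequential(data);  // implicit ILP only
    long long b = sum_parallel(data, 4); // 4 threads, assuming at least 4 cores
    return (a == b) ? 0 : 1;             // both should be 1048576
}
```

Only sum_parallel asks anything of the programmer; the ILP inside sum_sequential is found and exploited by the hardware alone.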
GPU
GPU stands for Graphics Processing Unit. GPUs emerged as co-processors for graphics applications, but with Nvidia's introduction of CUDA and the Tesla architecture in 2006, these devices entered general-purpose computing. They contain a large number of cores that are simpler than CPU cores, which enables them to launch a huge number of threads at a time; this is why they are good at executing parallel programs. They are parallel processors and run at lower clock frequencies than CPUs. Fine-grained time-sharing and many other mechanisms available in CPUs cannot be found in these parallel processors. The following figure shows CPU and GPU schematics that demonstrate their architectural differences.
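For comparison with the CPU sketch above, here is a minimal CUDA example of the same kind of work expressed the GPU way: instead of one thread looping over all the data, one lightweight thread is launched per element. The vector size and launch configuration are illustrative assumptions, not values from the video.

```
#include <cstdio>
#include <cuda_runtime.h>

// One GPU thread handles one element: parallelism is explicit
// and massive, unlike the CPU's implicit ILP.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;             // 1M elements (illustrative)
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);      // unified memory, for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                   // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);               // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note how the kernel launches a million threads where the CPU version used four; each GPU thread is far weaker, but their sheer number is what pays off on parallel workloads.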
FPGA
FPGA stands for Field-Programmable Gate Array. Compared to CPUs and GPUs, these devices offer flexibility in how their inner circuits are connected. For an FPGA, the designer describes the target circuit in a Hardware Description Language (HDL) such as Verilog or VHDL, specifying the inputs, the outputs, and the circuit that transforms the inputs into the outputs. It is better to think of CPUs and GPUs as instruction executors whose circuits are fixed during manufacturing, whereas an FPGA provides a grid of logic cells that the designer connects to build whatever they want. The following figure shows an FPGA's architecture. The flexibility offered by an FPGA means that, after building a circuit inside it, the programmer can erase that circuit and build another one.
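An FPGA's logic cells are, at their heart, small lookup tables (LUTs) whose stored truth tables determine which boolean function they compute. Staying in C++ rather than an HDL, the following conceptual sketch (my own illustration, not from the video) models a 2-input LUT: rewriting its four-entry table "erases" one circuit and "builds" another, which is, in miniature, what reconfiguring an FPGA does across its whole grid of cells.

```
#include <array>
#include <cstdio>

// A 2-input LUT modeled as a 4-entry truth table. In a real FPGA,
// the table contents are loaded at configuration time; here we
// simply overwrite the array to mimic reprogramming the fabric.
struct Lut2 {
    std::array<bool, 4> table;  // indexed by (b << 1) | a
    bool eval(bool a, bool b) const { return table[(b << 1) | a]; }
};

int main() {
    Lut2 cell{{false, false, false, true}};  // configured as an AND gate
    printf("AND(1,1) = %d\n", cell.eval(true, true));  // prints 1

    cell.table = {false, true, true, false}; // "erased" and rebuilt as XOR
    printf("XOR(1,1) = %d\n", cell.eval(true, true));  // prints 0
    return 0;
}
```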
Accelerator
These are application-specific devices that implement hardware circuits (fixed at manufacturing time, just as CPU and GPU circuits are) for solving a specific problem, e.g., deep learning inference. How these devices receive inputs and pass them through their circuits is determined by their designer. Google TPUs and Cerebras WSEs are examples of deep learning accelerators that aim to run deep learning workloads fast and efficiently.
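To show the kind of operation such a device hardwires, here is a conceptual C++ stand-in for the multiply-accumulate pattern at the heart of a TPU-style matrix unit. The int8 inputs and int32 accumulators loosely follow the first TPU's design, but the triple loop is only a software model; in the real chip this dataflow is a physical circuit (a systolic array), not instructions.

```
#include <cstdint>
#include <vector>

// Conceptual model of an accelerator's matrix unit: a fixed
// multiply-accumulate pattern over low-precision inputs.
std::vector<int32_t> matmul(const std::vector<int8_t>& a,  // m x k, row-major
                            const std::vector<int8_t>& b,  // k x n, row-major
                            int m, int k, int n) {
    std::vector<int32_t> c(m * n, 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                c[i * n + j] += int32_t(a[i * k + p]) * b[p * n + j];
    return c;
}

int main() {
    std::vector<int8_t> a(4 * 4, 1), b(4 * 4, 2);  // tiny 4x4 matrices
    std::vector<int32_t> c = matmul(a, b, 4, 4, 4);
    return (c[0] == 8) ? 0 : 1;  // each entry: 4 * (1 * 2) = 8
}
```

Because the circuit implements exactly this pattern and nothing else, an accelerator wins on performance and power, at the cost of the generality the earlier processors offer.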
Comparison
The following figure gives a representative comparison of the processors discussed above from the perspectives of flexibility, ease of use, performance, and power efficiency.
Flexibility means the range of different programs that a user can run on the processor without facing any harsh challenges.
Ease of use means the effort needed to develop a program.
CPUs offer the highest flexibility and ease of use because a user can pick a high-level language like Python and develop whatever they need. However, since they do extra work per computation to make that flexibility possible, they are the worst from a performance and power-efficiency perspective. On the other hand, ASICs or accelerators offer the highest performance and power efficiency because they implement the physical circuit for an application. But they are application-specific, and using them requires knowing exactly what they do.