What Is CPU Architecture?
CPU architecture is essentially the design of the Central Processing Unit, or CPU. Its design determines its capabilities and features. For the A+ exam, I doubt you will get a question about the specifics of anything in this video. This video is just to give you a basic understanding of the concepts and the different types of architecture that are available.
Shown here is the basic design of a CPU. Although the manufacturer is free to make the CPU using any design they like, modern CPUs follow this basic design. Before we look at how some of the components in the CPU work, I will first start with the basics of how a CPU is controlled.
CPU Instruction Set
For a CPU to execute software, it needs an instruction set. This is effectively a set of instructions the CPU understands. Although there have been many instruction sets over the years, the main ones used today are x86, x64, and ARM.
In order for the CPU to execute instructions, a program needs to be compiled into machine code. Shown here are some examples of assembly code, which can be compiled into machine code. It is not important for a technician to understand programming at such a low level; nowadays it is a very specialized skill. Most professional programmers will go through their whole career and never need to write assembly code.
I have shown some assembly code to demonstrate how different the instruction sets are from each other. Software compiled for one instruction set generally won’t run on a CPU with a different instruction set. However, x64 is backwards compatible, meaning newer x64 CPUs can run older x86 code. ARM devices typically run their own operating systems and software and are not compatible with x86 systems.
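As a quick aside, you can check which of these architectures your own machine reports using Python's standard platform module. This is a minimal sketch; the exact string returned varies by operating system, so the mapping below covers only the common cases:

```python
import platform

# platform.machine() reports the architecture the OS is running on.
# Typical values: "x86_64"/"AMD64" for x64, "i386"/"i686" for 32-bit x86,
# and "arm64"/"aarch64" for 64-bit ARM.
arch = platform.machine().lower()

if arch in ("x86_64", "amd64"):
    family = "x64"
elif arch in ("i386", "i486", "i586", "i686", "x86"):
    family = "x86 (32-bit)"
elif arch.startswith(("arm", "aarch")):
    family = "ARM"
else:
    family = "unknown"

print(f"Reported architecture: {arch} -> {family}")
```

Running this on a typical desktop will report x64, while a phone or Raspberry Pi will report ARM.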
x86 Instruction Set
To gain a deeper insight into the x86 instruction set, let's take a brief look at its history. x86 traces its roots back to 1978, with the introduction of the 8086 CPU.
In 1985, Intel released the 80386, the first CPU from Intel that was 32-bit. To support 32-bit operation while remaining compatible with the 8086, Intel extended the original instruction set. This meant it was compatible with the older CPUs while also allowing 32-bit instructions. These 32-bit instructions were often referred to as i386, after the CPU.
The evolution of the x86 instruction set saw significant milestones, culminating in 1993 with the debut of the Pentium processor. This marked the maturity of the core instruction set. After this, Intel and AMD generally added new capabilities as features rather than changing or extending the base instruction set, so the foundational x86 instruction set is now robust and comprehensive. The question then arises, how does this impact us in the present day?
Since 32-bit x86 was introduced over 30 years ago, nowadays everyone is using at least a 32-bit architecture. Thus, when you see x86 it is most likely referring to 32-bit, not anything earlier like 16-bit. Since the instruction set matured around 1993, any CPU from the last 20 years will run any code written for it. In the old days, a program may have been compiled for a particular CPU or with one in mind. This is not a concern anymore because the instruction set has been around for so long.
Due to the long history of the instruction set and how many instructions were added along the way, you may see it referred to by different names. The good news is, with any modern CPU it will run any code for the x86 instruction set, so you don’t need to worry about what it is called, just recognize that it is referring to the x86 instruction set.
The common names include x86-32, the 32 referring to it being 32-bit. IA-32 refers to Intel Architecture 32-bit. i386 refers to the 80386 CPU; even today you will sometimes see this used, as it was the first 32-bit CPU. If you see i486, i586 or i686, they are all referring to CPUs that came out after the 80386. Nothing to worry about; modern CPUs handle all the older code.
The main takeaway is that the x86 instruction set nowadays is a mature instruction set and, regardless of what it is called, it will run on modern CPUs.
32-bit served our computing needs for a long time, but started having problems as computers became more advanced. The main problem was that 32-bit computing was limited to only being able to access at most four gigabytes of memory. Let’s have a look at how the next instruction set solves that problem.
x64 Instruction Set
In 2003 AMD launched their first 64-bit CPU. 64-bit allowed the CPU to access more than four gigabytes of memory. To create it, AMD essentially took the x86 instruction set and extended it to support 64-bit. Intel went in a different direction, creating a new 64-bit instruction set; however, those CPUs were not that popular in the marketplace, and Intel ended up adopting AMD's instruction set in its own CPUs. Thus, the x64 instruction set is known by a number of different names. It is also called x86-64, and is also known as AMD64 and Intel 64. All these names refer to the same x64 instruction set, which is essentially an upgraded version of the x86 instruction set for 64-bit.
The base instruction set remains the same for future CPUs; however, new instructions are added as extended instruction sets. This is generally not a big concern, as the CPU can be queried to see which extended instructions it supports, and software will generally run the code paths that the CPU supports. In some cases, certain features may be required in order to run certain software. For example, virtualization may require the CPU to support certain features in order to operate.
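To give a feel for how querying works, here is a small Python sketch that checks a flags line, in the style of Linux's /proc/cpuinfo, for a few well-known extended instruction sets. The flags line here is illustrative sample data, not read from a real machine:

```python
# A sample "flags" line like those found in /proc/cpuinfo on Linux.
# (Illustrative sample data, not read from a real machine.)
sample_flags_line = "flags : fpu vme de pse tsc msr sse sse2 sse4_1 sse4_2 avx avx2 vmx"

# Everything after the colon is a space-separated list of supported features.
flags = set(sample_flags_line.split(":", 1)[1].split())

# Check for a few well-known extended instruction sets and features.
# "vmx" is Intel's hardware virtualization feature.
for feature in ("sse2", "avx2", "avx512f", "vmx"):
    status = "supported" if feature in flags else "not present"
    print(f"{feature:8s} {status}")
```

On this sample data, the AVX-512 check reports "not present", which is exactly the kind of result software uses to pick a compatible code path.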
The main takeaway from this is that, unless you are running specialized software like virtualization, you should not have a problem running 64-bit software. Newer CPUs will do things faster and better, but the underlying base instruction set remains the same.
Since the instruction set is essentially an upgraded version of x86, 64-bit CPUs are backward compatible with 32-bit software. You can run both on the same computer without any problems.
The x86 and the x64 instruction sets have been quite popular in computing but do have disadvantages. These mainly being in power use and heat issues. Let’s have a look at a different instruction set which addresses these problems.
Advanced RISC Machines
Advanced RISC Machines, or ARM, developed a CPU design which uses a very different instruction set from the x86-based ones. Later in the video I will go into more detail about how they are different.
ARM is different from other CPU manufacturers in that they don’t manufacture any CPUs themselves. ARM sells CPU designs to other companies. Those companies are then free to make changes, for example, by adding hardware to the design. They could add a video card, network card or a sound card. Essentially the manufacturer is free to add whatever hardware they like to the design of the CPU assuming (from an engineering point of view) it is possible.
When a manufacturer adds the majority of the components of a computer into the CPU this is referred to as a ‘System on Chip’. Using a System on Chip in a device means you only need a circuit board with minimal components to get it to work. This is ideal for devices like mobiles or credit-card sized computers where you have minimal space to add extra components.
The company then will need to have the CPU manufactured. Due to how difficult it is to make CPUs, there are not too many companies in the world that manufacture them. Even large companies like Apple don’t manufacture their own chips.
So now that we understand how these CPUs are made and what the designs are attempting to achieve, let’s have a closer look at how the internals of these CPUs work.
x86/x64 vs ARM
To understand how ARM works, I will compare it with x86 and x64 CPUs. Those CPUs have a large instruction set of approximately 1,000 instructions. In contrast, ARM CPUs have far fewer, approximately 100.
The designs of the CPUs are very different. CPUs with large instruction sets are like a Swiss army knife; with so many instructions, it is like having a box full of tools. Having so many instructions means programs require fewer of them, as a single complex instruction may do the equivalent of a few simple ones.
In contrast, ARM CPUs are more like a small set of specialized tools. There are not so many of them, and thus programs require more instructions. This means programs are going to be larger on ARM CPUs. Given the size of modern storage, this will not be that noticeable, but it can affect how long it takes to complete tasks.
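The trade-off above can be sketched with a made-up example. The MULADD instruction and the register names here are hypothetical, purely to show that one complex instruction can replace several simple ones:

```python
# Illustrative only: the task "a = a + (b * c)" expressed two ways.
# On a CISC-style CPU, one complex instruction might do the whole thing;
# on a RISC-style CPU, the same work takes several simple instructions.
cisc_program = [
    "MULADD a, b, c",      # one complex multiply-accumulate instruction
]
risc_program = [
    "LOAD  r1, b",         # fetch b into a register
    "LOAD  r2, c",         # fetch c into a register
    "MUL   r3, r1, r2",    # r3 = b * c
    "LOAD  r4, a",         # fetch a into a register
    "ADD   r4, r4, r3",    # r4 = a + (b * c)
    "STORE a, r4",         # write the result back
]
print(f"CISC instructions: {len(cisc_program)}")
print(f"RISC instructions: {len(risc_program)}")
```

The RISC version needs six instructions where the CISC version needs one, which is why programs compiled for ARM tend to be larger.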
Having more instructions makes the CPU more complex. This means the CPUs are harder to make and cost more. ARM CPUs are simpler in design, which makes them cheaper and more scalable. More scalable in that, from an engineering perspective, it is simpler to add additional cores.
Having more instructions means higher power use. Higher power use means more heat. Depending on the CPU, this may mean that you need to have cooling in the computer, for example, CPU fans. ARM CPUs use less power and that means less heat.
You can start to see why ARM CPUs get used in certain devices. For example, they get used in devices that run off batteries, such as mobiles. Having less heat means you don’t need to have specialized cooling in the device, allowing you to make it smaller. Using less power also means that the battery in your device is going to last longer.
CPU Pipeline
Modern CPUs execute instructions using a pipeline. A pipeline is a set of instructions in a queue waiting to be executed by the CPU. This pipeline is like a conveyor belt with a worker at the end of it.
The worker executes the instructions on the conveyor belt in order; however, sometimes the worker needs more data, for example, from the main memory of the computer. The CPU will then request this data, and while it is waiting, a CPU stall occurs. When this happens, the pipeline stops and the CPU is not doing any work. This is not ideal, as you want the CPU executing instructions the whole time. Let’s have a look at how modern CPUs get around this.
Instruction Pipelining
When a CPU stall occurs, this means the CPU is effectively not being utilized, which reduces its efficiency. Thus, CPU manufacturers want to design a CPU to reduce how often a stall occurs to best maximize the circuits in it for processing rather than being idle.
To understand how the pipeline works, I will break it down into components. The first part being the instructions waiting to be processed. The second part being the pipeline itself. The third part being the completed instructions.
In this example I will use a four-stage pipeline. Some pipelines have a different number of stages, but this is the one covered by CompTIA, although I doubt you will get a question directly on the pipeline in the exam. The first stage is Fetch: In this stage, an instruction is retrieved from memory.
The second stage is Decode. Decode takes the machine code instruction and prepares it for execution by the CPU. Modern CPUs have microcode, which is a layer of internal instructions that control the CPU’s operations at a very granular level. While instruction sets like x86 or ARM are used by programs, the microcode is the CPU’s own internal language that it uses to execute these instructions. During the decode stage, the instruction is translated into the specific microcode operations that the CPU will carry out.
The next stage is Execute. This stage is what runs the instructions. Once the result has been determined, the last step is write-back, which writes the result to memory.
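To make the four stages concrete, here is a toy Python simulation of instructions flowing through the pipeline, one stage per cycle. This is a simplified model for illustration, not how any real CPU is implemented, and it assumes no stalls occur:

```python
# A toy four-stage pipeline: each cycle, every instruction advances one
# stage and a new instruction enters Fetch if one is waiting.
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]
program = ["I1", "I2", "I3", "I4", "I5"]

pipeline = [None] * len(STAGES)   # pipeline[0] is Fetch, pipeline[-1] is Write-back
waiting = list(program)           # instructions not yet fetched
completed = []
cycle = 0

while waiting or any(stage is not None for stage in pipeline):
    cycle += 1
    # Every instruction advances one stage; a new one enters Fetch.
    pipeline = [waiting.pop(0) if waiting else None] + pipeline[:-1]
    print(f"Cycle {cycle}: " + ", ".join(
        f"{stage}={instr or '-'}" for stage, instr in zip(STAGES, pipeline)))
    # The instruction in Write-back completes at the end of the cycle.
    if pipeline[-1] is not None:
        completed.append(pipeline[-1])
        pipeline[-1] = None

print(f"Completed {len(completed)} instructions in {cycle} cycles")
```

Notice that once the pipeline is full, one instruction completes every cycle, so five instructions finish in eight cycles rather than twenty.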
Now that we understand the stages, let’s see how they work. CPUs work off a clock cycle. The computer generates a clock signal which consists of carefully timed consecutive pulses. This forms a CPU’s most basic unit of measurement of time called a cycle. All the components inside the CPU synchronize off of this pulse.
To understand how the pipeline works, let’s now consider that there are some instructions waiting to be executed. On the next cycle, the first instruction will be moved to stage one of the pipeline.
The instruction will be fetched from memory. Fetching is the process of getting the instruction from memory and placing it in the pipeline. Next, the instruction will be moved to the next stage, and a new instruction moved into stage 1. This is the decode stage. This stage works out what the instruction needs to do and takes appropriate action, for example, whether the instruction needs to be passed to the arithmetic logic unit for integer math or the floating-point unit for math on numbers with fractional parts.
As you would expect, on the next cycle, the instructions are moved downwards again, but this time there is a problem. Unfortunately, processing in the pipeline does not always go smoothly. In this example, the next step for the purple instruction is to decode it; however, decoding requires the result of the green instruction, which has not been executed yet.
It’s quite common for an instruction to depend on the result of one ahead of it in the pipeline. If that earlier instruction has not been executed yet, the CPU circuits for stage 2 become idle, which is referred to as a bubble. CompTIA does not cover the bubble in their official study guides, so you don’t need to know it; but knowing it exists does help us in the next topic that I will cover.
In the next cycle, the green instruction has completed, so the purple instruction can now proceed. You can see the bubble is still present. Thus, the stage 3 circuits are effectively not being used, which is the equivalent of a stall. The green instruction has now moved to stage 4, Write-back. In this stage the results are stored, either written to memory or to a CPU register.
In the next cycle, the instruction is moved down as you would expect. You will notice that the bubble has moved to stage 4. Thus, the bubble works like a stall, having a ripple effect through the pipeline. Dividing the pipeline up into stages does help reduce the effects, but it has not removed them completely.
With the next cycle, the CPU will continue to execute instructions. The bubble is no longer having an effect on the pipeline and the CPU will continue to keep executing instructions. You can see the effect a bubble has on the performance of a pipeline.
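The cost of a bubble can be sketched with some simple arithmetic, assuming an idealized pipeline where one instruction completes per cycle once the pipeline is full. The numbers are illustrative only:

```python
# Illustrative: total cycles for a pipelined run of N instructions.
def pipeline_cycles(instructions, stages, bubbles=0):
    # Filling the pipeline takes (stages - 1) extra cycles; each bubble
    # delays every instruction behind it by one more cycle.
    return instructions + stages - 1 + bubbles

print("No bubbles:", pipeline_cycles(5, 4))             # 8 cycles
print("One bubble:", pipeline_cycles(5, 4, bubbles=1))  # 9 cycles
```

A single bubble only adds one cycle here, but in real workloads bubbles can occur constantly, which is why CPU designers work so hard to avoid them.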
Although CompTIA does not cover how CPUs deal with a bubble, it does lead us to our next topic, and understanding what effect a bubble has helps us understand how CPUs overcome it.
Out-Of-Order Execution
One method used to help reduce stalls is out-of-order execution. In our earlier example, a stall was caused because one instruction required the result of another that had not been executed yet. This caused part of the CPU circuits to stall.
Out-of-order execution attempts to eliminate the stall by simply changing the order of execution. Think of it as the CPU is a worker on a conveyor belt. If the worker can’t execute the green instruction, the worker stops the conveyor belt. However, rather than doing nothing, the worker moves down the conveyor belt looking for another instruction it can work on while waiting.
The worker moves down the pipeline doing what can be done. At some stage, the green instruction can be executed. When this occurs, the worker returns and continues working where they left off.
In order to change the order of instructions, the CPU must guarantee that any reordering of instructions will not alter the result. This is why the worker needs to go back and continue from where they left off. By the time those instructions reach the worker again, some of the work has already been completed, allowing for faster execution.
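Here is a toy Python model of the idea, using hypothetical colored instructions like the ones in the example. It compares strict in-order issue against an out-of-order scheduler that picks any ready instruction instead of stalling. This is a sketch of the concept, not how real hardware schedules instructions:

```python
# Each entry is (name, instruction it depends on or None, cycles until
# its result is ready). Illustrative values only.
program = [
    ("green",  None,    3),   # slow instruction, perhaps waiting on memory
    ("purple", "green", 1),   # needs green's result before it can issue
    ("orange", None,    1),   # independent work
    ("blue",   None,    1),
]

def run(instrs, out_of_order):
    cycle, ready_at, pending, issued = 0, {}, list(instrs), []
    while pending:
        # Instructions whose dependency (if any) has produced its result.
        runnable = [i for i in pending
                    if i[1] is None or ready_at.get(i[1], float("inf")) <= cycle]
        if not out_of_order:
            # In-order: only the oldest pending instruction may issue.
            runnable = runnable[:1] if runnable and runnable[0] == pending[0] else []
        if runnable:
            name, dep, latency = runnable[0]
            pending.remove(runnable[0])
            issued.append(name)
            ready_at[name] = cycle + latency   # result available later
        # One issue slot per cycle; a cycle with no issue is a stall.
        cycle += 1
    return issued, cycle

print("in-order:     ", run(program, out_of_order=False))
print("out-of-order: ", run(program, out_of_order=True))
```

In-order execution stalls for two cycles waiting on the green instruction's result, while the out-of-order version fills those cycles with the independent orange and blue instructions and finishes sooner.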
This is a simple example. Modern CPUs use techniques like mini-queues and other sophisticated techniques to eliminate stalls thus increasing the efficiency of the CPU. You can start to understand why CPU microarchitecture changes so often. Manufacturers are always trying new techniques and completely changing the microarchitecture to get a better result.
The CPU requires some memory when it is performing calculations. Let’s have a look.
CPU Registers
The CPU has a limited number of CPU registers to store data temporarily. Shown here are some commonly used ones. The names differ for different types of instruction sets. The registers store data for a very short period of time.
You can see how registers are used to store data temporarily while the computer is performing calculations. They are the fastest memory inside the computer and I will next look at the second fastest.
Computer Cache
The next fastest memory is CPU cache. This is fast memory located inside the CPU designed to hold a copy of frequently used memory, so the CPU does not have to take the longer, slower path to main memory. I doubt you would need to know any more than that, but I have a bit of a deeper dive into cache, so when you are buying a CPU you know what you are getting.
To find out more information about the cache in your CPU, you can use free software called CPU-Z. At the bottom, you can see the information in relation to the cache of this CPU. CPU cache is small, fast memory built into the CPU to store copies of frequently used data and instructions from main memory. Modern CPUs, generally, come with three levels of cache called L1, L2 and L3. Higher levels are generally of increasing size and decreasing speed.
L1 cache is the smallest and fastest, located on the CPU core itself. L2 cache is larger, may be shared among cores, and is still much faster than main memory. L3 cache is the largest and slowest, shared by all cores, although still much faster than RAM. Cache hits (that is, data found in the cache) improve performance, while cache misses (data not found) require slower main memory access. Cache size and replacement algorithms are crucial for optimizing performance.
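The hit and miss behaviour can be illustrated with a toy software cache. Real CPU caches are implemented in hardware with sets and ways, but the principle of keeping recently used data close is the same. This sketch uses a least-recently-used eviction policy for simplicity:

```python
from collections import OrderedDict

# A tiny least-recently-used (LRU) cache model. Illustrative only.
class ToyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> data
        self.hits = self.misses = 0

    def read(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)      # mark as recently used
        else:
            self.misses += 1                     # would go to main memory
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[address] = f"data@{address}"
        return self.lines[address]

cache = ToyCache(capacity=2)
for addr in [100, 200, 100, 300, 100, 200]:
    cache.read(addr)
print(f"hits={cache.hits} misses={cache.misses}")
```

Notice that address 100 keeps hitting because it is used frequently, while address 200 is evicted and misses when it returns; this is why access patterns matter so much for performance.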
This CPU has ten cores. L1 cache is divided between the cores, so in this CPU there are ten separate L1 caches. Each is also subdivided into a data cache and an instruction cache. So, for this CPU, there are ten 32-kilobyte L1 data caches and ten 32-kilobyte L1 instruction caches, for a total of 640 kilobytes of L1 cache.
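The total works out as follows; a quick check of the arithmetic:

```python
# Checking the L1 cache total from the CPU-Z example above.
cores = 10
l1_data_kb = 32          # per-core L1 data cache
l1_instruction_kb = 32   # per-core L1 instruction cache

total_l1_kb = cores * (l1_data_kb + l1_instruction_kb)
print(f"Total L1 cache: {total_l1_kb} KB")   # 10 x (32 + 32) = 640 KB
```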
The rationale for having separate caches for data and instructions lies in their distinct nature: instructions, or machine code, are read-only, whereas data is read/write. At this fundamental level, the CPU benefits from having a cache designed exclusively for read-only access.
For this CPU, the L2 cache is divided up between each of the cores. Not all CPUs will divide L2 cache up like this. For example, it is not uncommon for low-power CPUs to have shared cache as it uses less power than dividing the cache up between the cores. However, having separate caches gives you better performance, so like a lot of things in computing, there is always a trade-off.
For this CPU, there is one shared L3 cache. Don’t get confused with the column on the right labeled “11-way” as this refers to the cache associativity. In simple terms, this means the cache is divided into 11 sections that can be searched simultaneously when the computer looks for data. The higher the associativity, the more sections can be searched at once, generally leading to faster data retrieval. Having high associativity does make the cache more complicated to design.
Memory Hierarchy
To help remember which memory is which, it may be helpful to consider the memory hierarchy. The memory hierarchy in computer systems has the fastest memory at the top. This is also the smallest amount of memory. As you go down the hierarchy, the memory gets larger and also less expensive.
There has been a lot covered in this video, so let’s do a quick summary of the major points.
Summary
The vast majority of CPUs on the market will use the x86, x64, or ARM instruction sets. Since x86 has been around for a long time, it may be referred to by different names. The naming will be based on when the code was first written and what revisions of x86 were available at the time. Since the instruction set matured in the early 90s, any modern CPU will run any version of x86. Since Intel has stopped selling 32-bit CPUs to the general public, you should be seeing less and less 32-bit code.
x64 is the 64-bit version of x86. It has been a mature instruction set since the early 2000s, so any modern CPU will run it. It is also known by a few different names and, like x86, you can use them all interchangeably, so it is just a matter of recognizing that the term is referring to x64. Additional instructions are added as extended instruction sets or features. With most software, if an instruction set or feature is not available, the software will emulate it; therefore, you won’t notice that the CPU does not have that instruction set or feature. In some rare cases, the software may require a particular feature in the CPU, for example, virtualization.
ARM’s instruction set is still under active development. All machine code for ARM, however, is just called ARM. There have been a number of different versions, currently 1 through 9. While newer ARM versions often maintain backward compatibility with older instruction sets, it is not guaranteed. Due to software and library compatibility differences, some software may not even work on the same hardware.
Online stores like Google Play Store will automatically generate the required package for your hardware so you don’t need to worry. If you are using a specialized board like a Raspberry Pi, you may need to download packages for the hardware and OS of your device. While certain software packages for the Raspberry Pi are universally compatible, some will only work with certain models and setups.
Since ARM supports 32 and 64-bit CPUs, the software may also be compiled for a particular architecture, although 64-bit was not added until version 8.
The CPU cache is fast memory inside the CPU designed to store frequently used data, enhancing the speed and performance of the processor. Using cached data is more efficient than if it had to retrieve it from the main memory modules. Most modern CPUs use three levels of cache. Level 3 (L3) being the largest and also shared between all the cores. Level 2 (L2) may be restricted to one core. Level 1 (L1) cache is generally restricted to one core and is the smallest, although often divided into two halves: One half being dedicated to data and the other to instructions.
In order to perform calculations and other processing, the CPU has a small set of CPU registers. These are very fast memory designed only to keep data while the CPU is actively working on it. Different CPUs will have a different number of registers available for processing.
The CPU uses a pipeline to process instructions. A pipeline is the set of instructions in the queue waiting to be executed. The pipeline has four stages to process instructions. CPU manufacturers are always looking at ways to improve the pipeline to make it more efficient. For example, eliminating stalls in which the CPU can’t operate because it is waiting for something. Thus, how they implement these stages can vary dramatically between different CPUs.
End Screen
That concludes this video from ITFreeTraining on CPU architecture. I hope you have found it informative. Until the next video from us, I would like to thank you for watching.
References
“The Official CompTIA A+ Core Study Guide (Exam 220-1101)” pages 72 to 73
“Mike Myers All in One A+ Certification Exam Guide 220-1101 & 220-1102” pages 91 to 96
“Intel® 64 and IA-32 Architectures Software Developer’s Manual” https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html
“Instruction pipelining” https://en.wikipedia.org/wiki/Instruction_pipelining
“Instruction Pipeline Architecture” https://www.youtube.com/watch?v=YhGv5AOcz1s
“Picture: 8086 CPU” https://upload.wikimedia.org/wikipedia/commons/e/e1/KL_Intel_D8086.jpg
“Picture: i386 chip” https://en.wikipedia.org/wiki/I386#/media/File:KL_Intel_i386DX.jpg
“Picture: AMD CPU” https://commons.wikimedia.org/wiki/File:AMD_Opteron_146_Venus,_2005.jpg
“Picture: CPU” https://pixabay.com/illustrations/circuit-board-computer-pattern-5936930/
“Picture: ARM CPU Die” https://en.wikipedia.org/wiki/File:STM32F103VGT6-HD.jpg
Credits
Trainer: Austin Mason http://ITFreeTraining.com
Voice Talent: HP Lewis http://hplewis.com
Quality Assurance: Brett Batson http://www.pbb-proofreading.uk