Pipeline hazards are situations that arise in instruction pipelines and can lead to incorrect or delayed instruction execution.
• These hazards can degrade the performance of the pipeline and must be managed or mitigated to ensure efficient operation.
Types of Pipeline Hazards:
‣ Data Hazards:
Data hazards occur when an instruction depends on the result of a previous instruction that has not yet been produced or written back.
There are three types of data hazards:
• Read-after-Write (RAW):
Also known as a true data hazard, it occurs when an instruction tries to read data before a previous instruction writes it.
• Write-after-Read (WAR):
Also known as an anti-dependency, it occurs when an instruction tries to write data before a previous instruction reads it.
• Write-after-Write (WAW):
Also known as an output dependency, it occurs when two instructions write to the same location and the later instruction's write could complete before the earlier one's, leaving a stale value as the final result.
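As a small illustration of these three cases, the following sketch (written in Python; the representation of an instruction as a (destination, sources) pair is purely illustrative) classifies the dependency between an earlier and a later instruction by comparing the registers they read and write:

    # Classify the data hazard between an earlier and a later instruction.
    # Each instruction is modelled as (dest_register, {source_registers}).
    def classify_hazard(earlier, later):
        e_dest, e_srcs = earlier
        l_dest, l_srcs = later
        if e_dest is not None and e_dest in l_srcs:
            return "RAW"   # later reads what earlier writes (true dependency)
        if l_dest is not None and l_dest in e_srcs:
            return "WAR"   # later writes what earlier reads (anti-dependency)
        if e_dest is not None and e_dest == l_dest:
            return "WAW"   # both write the same register (output dependency)
        return None

    # ADD R3 <- R1 + R2 followed by SUB R4 <- R3 - R1 is a RAW hazard on R3.
    print(classify_hazard(("R3", {"R1", "R2"}), ("R4", {"R3", "R1"})))  # RAW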
‣ Control Hazards:
Control hazards occur when the pipeline encounters a conditional branch instruction, and the next instruction’s address is not known until the branch instruction has been executed.
• This can lead to pipeline stalls or incorrect speculation.
‣ Structural Hazards:
Structural hazards occur when multiple instructions require access to the same hardware resource simultaneously.
• This can happen when multiple instructions need to access the same execution unit, memory port, or other shared resource.
Methods to Handle Pipeline Hazards:
Stall or Bubble:
Insert a no-operation (NOP) instruction or idle cycle into the pipeline to wait for the hazard to be resolved.
This introduces delays but ensures correct execution.
Forwarding (Data Bypassing):
Forward the result of a previous instruction directly to the input of the dependent instruction, bypassing the pipeline stages.
This eliminates the need to wait for the result to be written back to the register file.
Speculative Execution:
Predict the outcome of conditional branches and continue executing instructions based on the prediction.
If the prediction is correct, no delay occurs. If incorrect, the pipeline needs to be flushed, and execution restarts from the correct branch target.
Branch Prediction:
Predict the outcome of conditional branches based on historical data or heuristics.
This can reduce the impact of control hazards by fetching and executing instructions from the predicted branch path.
Instruction Reordering:
Reorder instructions dynamically to avoid hazards or optimize execution.
This technique is often used in out-of-order execution processors.
Data Dependency:
Data dependency refers to the relationship between instructions in a computer program where the outcome of one instruction depends on the data produced by another instruction.
• It is a fundamental concept in computer architecture and plays a crucial role in determining the execution order of instructions.
Here are some common methods used to handle data dependencies:
Hardware Interlocks (also known as pipeline interlocks, implemented as hardware-inserted pipeline stalls):
Hardware interlocks involve inserting pipeline stalls or bubbles into the instruction pipeline to ensure that instructions dependent on a previous instruction wait until the required data is available.
• This method can lead to decreased performance due to pipeline bubbles, as it stalls the pipeline, causing delays in instruction execution.
Operand Forwarding (also known as data forwarding or bypassing):
Operand forwarding involves routing the result of a computation directly from one functional unit to the input of another, without first writing it back to the register file.
• This technique allows dependent instructions to access the required data without waiting for it to be written back to the register file and read out again.
• Operand forwarding can improve performance by reducing the latency caused by data dependencies.
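As a rough comparison of the two approaches above, here is a toy sketch of a classic five-stage pipeline (IF, ID, EX, MEM, WB); the stall counts assume an idealized model in which the register file is written in the first half of a cycle and read in the second half, not any particular processor:

    # Toy model of a classic 5-stage pipeline (IF ID EX MEM WB).
    # 'distance' is how many instructions separate producer and consumer
    # (1 = consumer immediately follows producer).
    def stall_cycles(distance, forwarding, producer_is_load=False):
        if forwarding:
            # With forwarding, only a load followed immediately by a user
            # of its result needs one bubble (the load-use hazard).
            needed = 1 if producer_is_load else 0
        else:
            # With interlocks only, the consumer waits in ID until the
            # producer has written the register file in WB.
            needed = 2
        return max(0, needed - (distance - 1))

    # ADD R1,... followed immediately by SUB ...,R1:
    print(stall_cycles(1, forwarding=False))  # 2 bubbles without forwarding
    print(stall_cycles(1, forwarding=True))   # 0 bubbles with forwarding
    # LW R1,... followed immediately by an instruction that uses R1:
    print(stall_cycles(1, forwarding=True, producer_is_load=True))  # 1 bubble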
Delayed Load:
Delayed load is a technique used to mitigate data dependencies caused by load instructions.
• Instead of stalling the pipeline until the data is loaded from memory, the processor continues executing subsequent instructions that do not depend on the loaded value.
• However, the result of the load instruction is not available immediately, and dependent instructions may need to be delayed or reordered to ensure correct execution.
Handling Branch Instructions:
Handling branch instructions efficiently is crucial for optimizing the performance of modern processors.
Here’s how various techniques and structures are typically used to handle branch instructions:
Prefetch Target Instruction:
When a branch instruction is encountered, the processor tries to prefetch the target instruction from memory before it’s actually needed.
• This helps in reducing the penalty caused by fetching the instruction after the branch is resolved.
• Prefetching can be done using techniques like instruction prefetch buffers or branch prediction mechanisms.
Branch Target Buffer (BTB):
BTB is a cache-like structure that stores the target addresses of recently executed branches along with their corresponding predictions.
• When a branch instruction is encountered, the processor checks the BTB to see if the target address is already cached.
• If it is, the predicted target address can be fetched directly, reducing the branch resolution latency.
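A minimal sketch of the idea, using a Python dictionary as a stand-in for the BTB (the addresses and the simple taken/not-taken hint are illustrative; real BTBs are set-associative hardware structures):

    # Toy branch target buffer: maps a branch's address to its last target
    # and a simple taken/not-taken hint.
    btb = {}

    def btb_lookup(pc):
        """Return (predicted_taken, predicted_target) for the branch at pc."""
        if pc in btb:
            taken, target = btb[pc]
            return taken, target
        return False, pc + 4          # miss: assume not taken, fall through

    def btb_update(pc, taken, target):
        """Record the actual outcome once the branch resolves."""
        btb[pc] = (taken, target)

    btb_update(0x1000, True, 0x2000)  # branch at 0x1000 jumped to 0x2000
    print(btb_lookup(0x1000))         # predicts taken with target 0x2000 on the next fetch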
Loop Buffer:
A loop buffer is a specialized structure that caches instructions belonging to frequently executed loops.
• When the processor detects a loop, it stores the instructions in the loop buffer, allowing for faster execution of subsequent iterations.
• This helps in reducing instruction fetch latency and improving overall performance.
Branch Prediction:
Branch prediction is a technique used to predict the outcome of a branch instruction before it’s actually executed.
• There are various branch prediction algorithms, such as static prediction, dynamic prediction (using techniques like two-level adaptive predictors, perceptron, etc.), and hybrid prediction (combining multiple prediction strategies).
• Predicted outcomes are used to speculatively execute instructions, mitigating the performance impact of mispredicted branches.
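For example, a classic two-bit saturating-counter predictor, one of the simplest dynamic schemes, can be sketched as follows (the table size and starting state are arbitrary choices for illustration):

    # Two-bit saturating counter per branch: 0,1 = predict not taken; 2,3 = predict taken.
    class TwoBitPredictor:
        def __init__(self, entries=1024):
            self.table = [1] * entries        # start weakly not-taken
            self.mask = entries - 1

        def predict(self, pc):
            return self.table[pc & self.mask] >= 2

        def update(self, pc, taken):
            i = pc & self.mask
            if taken:
                self.table[i] = min(3, self.table[i] + 1)
            else:
                self.table[i] = max(0, self.table[i] - 1)

    p = TwoBitPredictor()
    for outcome in [True, True, False, True]:   # a mostly-taken branch
        print(p.predict(0x400), outcome)        # prediction vs. actual outcome
        p.update(0x400, outcome)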
Delayed Branch:
In some architectures, instructions following a branch instruction may be allowed to start execution before the branch is resolved.
• This technique is known as delayed branching.
• It aims to utilize otherwise idle execution resources and improve instruction-level parallelism.
• However, care must be taken to handle potential hazards and ensure correct program semantics.
RISC Pipeline:
A RISC (Reduced Instruction Set Computer) pipeline is a key architectural feature of modern RISC processors, which are designed to execute instructions in a highly efficient and streamlined manner.
• The pipeline concept allows multiple instructions to be processed simultaneously, with each stage of instruction execution occurring in a separate pipeline stage. This increases overall throughput and performance.
• One instruction per clock cycle is achieved by initiating a new instruction on every clock cycle and pipelining the processor so that, once the pipeline is full, an instruction completes on every cycle.
• The RISC compiler provides support for translating the high-level language program into a machine language program.
• Issues such as data conflicts and branch penalties are handled by the RISC processor in cooperation with the compiler, which is relied upon to identify these situations and schedule instructions so as to reduce the resulting delays.
Example: Three-Segment Instruction Pipeline:
A three-segment instruction pipeline is a simplified model of a CPU pipeline architecture where the execution of instructions is divided into three stages or segments: instruction fetch, ALU operations, and instruction execute.
Instruction Fetch (IF):
In this stage, the CPU fetches the instruction from memory. Typically, the program counter (PC) holds the memory address of the next instruction to be executed. The CPU reads the instruction from this memory location and places it into an instruction register.
ALU Operations:
In this stage, the CPU decodes the fetched instruction. It identifies the operation code (opcode) and any operands that may be part of the instruction. The CPU may also resolve memory addresses and register operands during this stage.
Instruction Execute (EX):
In this stage, the CPU executes the operation specified by the instruction. This could involve arithmetic or logical operations, data transfers between registers or memory, or control flow operations such as branches or jumps.
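To see how the three segments overlap, the small sketch below prints which segment (I = instruction fetch, A = ALU operations, E = execute) each of four instructions occupies in every clock cycle of an ideal, hazard-free three-segment pipeline:

    # Ideal three-segment pipeline: instruction k enters segment I in cycle k+1.
    SEGMENTS = ["I", "A", "E"]

    def schedule(num_instructions):
        for cycle in range(1, num_instructions + len(SEGMENTS)):
            row = []
            for instr in range(num_instructions):
                stage = cycle - 1 - instr
                row.append(SEGMENTS[stage] if 0 <= stage < len(SEGMENTS) else ".")
            print(f"cycle {cycle}: " + " ".join(row))

    schedule(4)
    # cycle 1: I . . .
    # cycle 2: A I . .
    # cycle 3: E A I .
    # ... one instruction completes every cycle once the pipeline is full.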
Delayed Load:
Consider now the operation of the following four instructions:
1. LOAD: R1 ← M[address 1]
2. LOAD: R2 ← M[address 2]
3. ADD: R3 ← R1 + R2
4. STORE: M[address 3] ← R3
• If the three-segment pipeline proceeds without interruption, there will be a data conflict in instruction 3, because the operand in R2 is not yet available when instruction 3 reaches the A segment.
• If the compiler cannot find a useful instruction to put after the load, it inserts a no-op (no-operation) instruction, thus wasting a clock cycle. This concept of delaying the use of data loaded from memory is referred to as delayed load.
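With the compiler-inserted no-op, the rescheduled sequence would look like this (the delay-slot no-op guarantees that R2 has been loaded before the ADD reaches the A segment):
1. LOAD: R1 ← M[address 1]
2. LOAD: R2 ← M[address 2]
3. NOP
4. ADD: R3 ← R1 + R2
5. STORE: M[address 3] ← R3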
Delayed Branch:
When branches are handled by a pipeline in a straightforward way, at least one cycle remains unutilized after each taken branch. This is a consequence of the assembly-line nature of pipelining. The instruction slots following branches are known as branch delay slots.
• Delay slots can also appear following load instructions; these are called load delay slots. Branch delay slots are wasted under conventional execution. However, when delayed branching is employed, these slots can be at least partly put to use.
Principle of Delayed Branching:
With delayed branching, the add instruction of our program segment that initially preceded the branch can be moved into the branch delay slot. The processor executes the add instruction first, and the branch takes effect only afterwards; thus, in this example, delayed branching preserves the original execution sequence.
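For illustration, consider a made-up two-instruction fragment. A compiler that uses delayed branching can transform
1. ADD: R1 ← R2 + R3
2. BRANCH to X
into
1. BRANCH to X
2. ADD: R1 ← R2 + R3 (executed in the branch delay slot)
Provided the branch condition does not depend on R1, the ADD still executes exactly once before control reaches X, so the delay slot performs useful work and the original program semantics are preserved.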
Super pipeline and Superscalar Processor:
Super Pipeline:
A super pipeline is a CPU architecture that divides the execution of instructions into a large number of stages, each performing a specific operation.
• These stages typically include instruction fetch, decode, execute, memory access, and write-back.
• By breaking down the instruction execution process into smaller, more manageable stages, the CPU can achieve higher clock frequencies and better throughput.
Key characteristics of a super pipeline include:
High Clock Frequencies:
• Super pipelines often operate at very high clock frequencies, allowing them to process instructions quickly.
Fine-Grained Pipelining:
• The pipeline stages are finely divided, with each stage handling a specific task. This fine-grained pipelining enables efficient instruction execution.
Increased Instruction Throughput:
• With more pipeline stages, the CPU can process multiple instructions simultaneously, increasing overall instruction throughput.
Potential Hazards:
• Super pipelines may encounter hazards such as data hazards, control hazards, and structural hazards, which can affect performance if not managed effectively.
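A back-of-the-envelope illustration (with made-up stage delays) of why finer pipelining permits a higher clock frequency: the cycle time is limited by the slowest stage, so splitting the work into more, shorter stages shortens it:

    # Hypothetical stage delays in nanoseconds.
    five_stage = [2.0, 2.0, 2.0, 2.0, 2.0]    # classic pipeline
    ten_stage  = [1.0] * 10                    # same work, finer stages

    def clock_ghz(stage_delays_ns):
        return 1.0 / max(stage_delays_ns)      # cycle time = slowest stage

    print(clock_ghz(five_stage))   # 0.5 GHz
    print(clock_ghz(ten_stage))    # 1.0 GHz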
Superscalar Processor:
A superscalar processor is a CPU architecture that can execute multiple instructions per clock cycle by having multiple execution units, each capable of executing a different instruction.
• Unlike traditional scalar processors, which execute one instruction at a time, superscalar processors exploit instruction-level parallelism to improve performance.
Key characteristics of a superscalar processor include:
Multiple Execution Units:
Superscalar processors have multiple functional units, such as arithmetic logic units (ALUs), floating-point units (FPUs), and load/store units. These units can execute different instructions concurrently.
Instruction Issue:
Instructions are decoded and dispatched to the appropriate execution units in parallel, taking advantage of available resources.
Dynamic Scheduling:
Superscalar processors often employ dynamic scheduling techniques to select and reorder instructions for execution based on data dependencies and available resources.
Out-of-Order Execution:
Instructions may be executed out of their original sequential order to maximize instruction-level parallelism and resource utilization.
Complex Control Logic:
Superscalar processors require sophisticated control logic to manage instruction dispatch, data dependencies, and execution unit utilization efficiently.
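As a greatly simplified illustration of the issue logic (a hypothetical two-wide, in-order check; real dispatch hardware is far more elaborate), two adjacent instructions can be issued in the same cycle only if the second does not depend on the first and enough execution units are free:

    # Each instruction: (dest, {sources}, unit) where unit is "ALU" or "MEM".
    def can_dual_issue(first, second, free_units={"ALU": 2, "MEM": 1}):
        d1, s1, u1 = first
        d2, s2, u2 = second
        if d1 is not None and d1 in s2:     # RAW: second needs first's result
            return False
        if d1 is not None and d1 == d2:     # WAW on the same destination
            return False
        needed = {u1: 1}
        needed[u2] = needed.get(u2, 0) + 1
        return all(free_units.get(u, 0) >= n for u, n in needed.items())

    # ADD R1<-R2+R3 and SUB R4<-R5-R6 are independent: issue together.
    print(can_dual_issue(("R1", {"R2", "R3"}, "ALU"), ("R4", {"R5", "R6"}, "ALU")))  # True
    # ADD R1<-R2+R3 and SUB R4<-R1-R6 have a RAW hazard on R1: issue separately.
    print(can_dual_issue(("R1", {"R2", "R3"}, "ALU"), ("R4", {"R1", "R6"}, "ALU")))  # False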
Vector Processing:
Vector processing is a CPU architecture paradigm designed to perform operations on arrays or vectors of data elements in parallel.
• In vector processing, a single instruction operates simultaneously on multiple data elements, typically arranged in vectors or arrays, rather than processing individual elements sequentially.
• This approach enables efficient execution of operations that exhibit data-level parallelism, such as mathematical computations and signal processing.
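The contrast with scalar execution can be sketched as follows, using Python lists as a stand-in for vector registers (a real vector unit applies the operation to all elements in hardware rather than in a software loop):

    # Scalar processing: one ADD per element, issued sequentially.
    def scalar_add(a, b):
        c = []
        for i in range(len(a)):
            c.append(a[i] + b[i])      # one instruction per element
        return c

    # Vector processing: a single vector ADD conceptually covers every element.
    def vector_add(a, b):
        return [x + y for x, y in zip(a, b)]   # one "instruction", many data

    print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))   # [11, 22, 33, 44]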
Functional Diagram of Vector Computer:
Applications of Vector Processing:
‣ Long-range weather forecasting
‣ Petroleum exploration
‣ Seismic data analysis
‣ Medical diagnosis
‣ Artificial intelligence and expert systems
‣ Mapping the human genome
‣ Image processing
Memory Interleaving:
Memory interleaving is a technique used in computer architecture to improve memory access performance by distributing memory addresses across multiple memory modules or banks.
• This technique aims to reduce memory access contention and exploit parallelism in memory systems.
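A minimal sketch of low-order interleaving, in which consecutive addresses map to consecutive banks so that sequential accesses can proceed in parallel (the bank count of 4 is illustrative):

    # Low-order interleaving across 4 memory banks.
    NUM_BANKS = 4

    def interleave(address):
        bank = address % NUM_BANKS        # low-order bits select the bank
        offset = address // NUM_BANKS     # remaining bits select the word within the bank
        return bank, offset

    for addr in range(8):
        bank, offset = interleave(addr)
        print(f"address {addr} -> bank {bank}, offset {offset}")
    # Consecutive addresses 0,1,2,3 fall in banks 0,1,2,3 and can be accessed in parallel.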
Array Processors:
A processor that performs computations on a vast array of data is known as an array processor.
• Array processors are also commonly referred to as vector processors.
• An array processor executes a single instruction at a time on an entire array of data, and is used to perform computations on very large data sets.
• Hence, array processors are used to enhance the computer’s performance.
Types of Array Processor:
1.) Attached Array Processor:
The attached array processor is an auxiliary processor connected to a general-purpose computer to improve the host machine’s performance in numerical computational tasks.
• It provides excellent performance by using numerous functional units in parallel processing.
• The attached array processor intends to improve the performance of the host computer in specific numeric computations.
2.) SIMD Array Processor:
SIMD (Single Instruction, Multiple Data) refers to the organization of a single computer containing multiple processors operating in parallel.
• The processing units are designed to work together under the supervision of a single control unit, resulting in a single instruction stream and multiple data streams.