Parallel processing and pipelining are two concepts used in computer architecture and design to improve the performance of processors.
Parallel Processing:
Parallel processing is a computing paradigm where multiple processing units or cores work together to solve a problem, execute a task, or process data simultaneously.
- The goal is to improve computational efficiency and overall system performance by dividing a larger problem into smaller subproblems that can be solved concurrently.
- Parallel processing can be implemented at different levels, ranging from hardware-level parallelism to software-level parallelism.
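As a minimal sketch of this idea, the Python example below splits one large problem (a sum of squares) into sub-problems that worker processes solve concurrently; the data size, chunking, and worker count are arbitrary choices for illustration.
```python
# Divide a big sum of squares into chunks and solve the chunks in parallel.
from multiprocessing import Pool

def sum_of_squares(chunk):
    # One sub-problem: sum the squares of a slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4                                   # illustrative worker count
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers - 1)]
    chunks.append(data[(n_workers - 1) * size:])    # last chunk takes the remainder

    with Pool(n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)   # sub-problems run concurrently

    # Combining the partial results gives the same answer as the sequential sum.
    assert sum(partial_sums) == sum(x * x for x in data)
```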
‣ Throughput:
Throughput in parallel processing refers to the rate at which a system can complete tasks or process data when multiple tasks are executed simultaneously or in parallel.
• In parallel processing, tasks are divided into smaller sub-tasks, which are then executed concurrently across multiple processing units, such as CPU cores, GPUs, or distributed computing nodes.
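One rough way to see throughput as a rate is to time the same batch of tasks sequentially and in parallel; in the sketch below a 0.1-second sleep stands in for any I/O-bound task, and the task count and pool size are made-up values.
```python
# Compare tasks-per-second for sequential vs. thread-pool execution of I/O-bound work.
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    time.sleep(0.1)          # simulated I/O wait

N_TASKS = 20

start = time.perf_counter()
for i in range(N_TASKS):
    io_task(i)
sequential_tput = N_TASKS / (time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(io_task, range(N_TASKS)))
parallel_tput = N_TASKS / (time.perf_counter() - start)

print(f"sequential: {sequential_tput:.1f} tasks/s, parallel: {parallel_tput:.1f} tasks/s")
```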
‣ Multiple Functional Units:
Multiple functional units are separate hardware components or modules within a processor (for example, adders, multipliers, and logic units) that can perform different operations simultaneously or in parallel.
• Having multiple functional units allows for concurrent execution of multiple tasks, which can lead to improved performance and throughput for tasks that can be parallelized.
• This is particularly important in tasks such as scientific computing, data analysis, and simulations, where processing large amounts of data or performing complex calculations can benefit from parallel execution.
Block Diagram of Processor with Multiple Functional Units:
Types of Parallel Processing:
‣ Bit-Level Parallelism:
- Processing multiple bits simultaneously.
- Commonly found in SIMD (Single Instruction, Multiple Data) architectures.
‣ Instruction-Level Parallelism:
- Concurrent execution of multiple instructions.
- Techniques include pipelining, superscalar architectures, and out-of-order execution.
‣ Task-Level Parallelism:
- Concurrent execution of independent tasks or processes.
- Implemented through multiple processors or cores.
Forms of Parallel Processing:
‣ Data Parallelism:
- Same operation performed on different data.
- Often used in SIMD architectures and parallel computing frameworks.
‣ Task Parallelism:
- Different tasks or processes executed concurrently.
- Common in multiprocessor systems and distributed computing.
‣ Pipeline Parallelism:
- Similar to instruction pipelining, where different stages of a task are executed concurrently.
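A small sketch of pipeline parallelism under these assumptions: the main thread feeds items to two stage threads connected by queues, so different items occupy different stages at the same time; the stage functions (square, then add one) are invented purely for illustration.
```python
# Two pipeline stages as threads, connected by queues; None is the end-of-stream sentinel.
import queue
import threading

def stage(inbox, outbox, fn):
    # Apply fn to every item from inbox; forward results, or print at the last stage.
    while True:
        item = inbox.get()
        if item is None:                 # sentinel: shut down and pass it on
            if outbox is not None:
                outbox.put(None)
            break
        result = fn(item)
        if outbox is not None:
            outbox.put(result)
        else:
            print(result)

q1, q2 = queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(q1, q2, lambda x: x * x))    # stage 1: square
t2 = threading.Thread(target=stage, args=(q2, None, lambda x: x + 1))  # stage 2: add one
t1.start()
t2.start()

for x in range(5):                       # the main thread acts as the producer
    q1.put(x)
q1.put(None)                             # signal end of the stream
t1.join()
t2.join()                                # prints 1, 2, 5, 10, 17
```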
Parallel Processing Architectures:
‣ Single Instruction, Multiple Data (SIMD):
- A single instruction is applied to multiple data elements simultaneously.
- Often used in vector processing.
‣ Multiple Instruction, Multiple Data (MIMD):
- Multiple instructions are executed on multiple data elements concurrently.
- Common in multiprocessor systems.
‣ Shared Memory and Distributed Memory Architectures:
- Shared Memory: Multiple processors share a common memory space.
- Distributed Memory: Each processor has its own local memory (a rough code analogy for both styles follows this list).
‣ GPU Parallelism:
- Graphics Processing Units (GPUs) are designed for parallel processing, particularly for graphics rendering and general-purpose computing (GPGPU).
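As a rough software analogy to the shared-memory style above, the sketch below has several worker processes updating one shared counter protected by a lock; in a distributed-memory style each worker would instead keep a private count and send it back as a message. The worker count and iteration count are arbitrary.
```python
# Shared-memory analogy: all workers update the same counter directly.
from multiprocessing import Process, Value

def worker(counter, n):
    for _ in range(n):
        with counter.get_lock():        # lock the shared location before updating it
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)             # one memory location visible to every worker
    procs = [Process(target=worker, args=(counter, 10_000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)                # 40000
```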
Loosely Coupled and Tightly Coupled Computers:
‣ Loosely Coupled Computer:
- In a loosely coupled computer, the components are relatively independent of each other and communicate through message passing or some form of inter-process communication (IPC).
- Each component in a loosely coupled system can operate independently and can be replaced or upgraded without significantly affecting the other components.
- Loosely coupled computers are often more flexible and scalable, as they can easily accommodate changes and additions without disrupting the entire system.
- Examples of loosely coupled systems include distributed systems, where different nodes communicate over a network but can function independently.
‣ Tightly Coupled Computer:
- In a tightly coupled computer, the components are highly dependent on each other, and they often share resources and data directly.
- Changes to one component in a tightly coupled system can have significant impacts on other components, and failures in one part of the system can propagate to other parts.
- Tightly coupled computers are often designed for high performance and efficiency, as they can take advantage of shared resources and direct communication.
- Examples of tightly coupled systems include multiprocessor systems, where multiple processors share the same memory and work closely together to execute tasks.
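As a toy illustration of loose coupling, the two processes below share no state and interact only by passing messages over a multiprocessing.Pipe; either side could in principle be replaced (for example, by a node reachable over a network) without changing the other. The message contents are made up.
```python
# Loosely coupled style: communication only through message passing, no shared memory.
from multiprocessing import Pipe, Process

def worker(conn):
    while True:
        msg = conn.recv()
        if msg == "stop":               # agreed-upon shutdown message
            break
        conn.send(msg.upper())          # reply with a transformed message

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send("hello")
    print(parent_end.recv())            # HELLO
    parent_end.send("stop")
    p.join()
```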
Schematic Diagram of an MIMD Processor:
Amdahl's Law:
Amdahl’s Law is a principle in computer architecture and parallel computing that predicts the theoretical speedup in execution time of a task as a function of the number of processors used.
• It was formulated by computer scientist Gene Amdahl in 1967.
• The law states that the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
The formula for Amdahl’s Law is:
Speedup = 1 / [(1 − P) + P/N]
Where:
- Speedup is the theoretical speedup of the execution time.
- P is the proportion of the program that can be parallelized.
- N is the number of processors.
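The formula translates directly into a few lines of Python; the function name amdahl_speedup is just an illustrative choice.
```python
def amdahl_speedup(P, N):
    # Theoretical speedup with parallel fraction P on N processors.
    return 1.0 / ((1.0 - P) + P / N)

print(round(amdahl_speedup(0.6, 10), 4))   # Example 1 below: ~2.1739
print(round(amdahl_speedup(0.9, 10), 4))   # Example 2 below: ~5.2632
```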
Example 1:
Suppose we have a program where only 60% of the code can be parallelized, and the remaining 40% is strictly sequential.
P=0.6 (60% can be parallelized)
1−P=0.4 (40% is sequential)
Let’s say we have 10 processors, so N=10.
Using Amdahl's Law, the speedup would be:
Speedup = 1 / [(1 − 0.6) + 0.6/10] = 1 / (0.4 + 0.06) = 1 / 0.46 ≈ 2.17
So, theoretically, using 10 processors would make the program approximately 2.17 times faster than using only one processor; the 40% sequential portion keeps the gain well below 10.
Example 2:
Let’s consider a scenario where 90% of the program can be parallelized and only 10% is sequential.
P=0.9 (90% can be parallelized)
1−P=0.1 (10% is sequential)
Again, let’s assume we have 10 processors.
Using Amdahl's Law:
Speedup = 1 / [(1 − 0.9) + 0.9/10] = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26
In this case, even though 90% of the program can be parallelized, the small sequential fraction still limits the gain: the theoretical speedup with 10 processors is only about 5.26 times, well short of the ideal factor of 10.
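As N grows without bound, the P/N term vanishes and the speedup is capped at 1 / (1 − P); the quick check below (reusing the same small helper) compares the 10-processor figures with that ceiling.
```python
def amdahl_speedup(P, N):
    return 1.0 / ((1.0 - P) + P / N)

for P in (0.6, 0.9):
    print(P, round(amdahl_speedup(P, 10), 2), round(1 / (1 - P), 2))
# P = 0.6: 2.17 with 10 processors, ceiling 2.5
# P = 0.9: 5.26 with 10 processors, ceiling 10.0
```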
Advantages of Parallel Processing:
‣ Improved Performance:
- Faster execution of tasks by distributing the workload among multiple processors.
‣ Scalability:
- Adding more processors can potentially scale performance linearly for certain types of problems.
‣ Efficient Resource Utilization:
- Utilizes available processing power efficiently by allowing concurrent execution.
‣ Fault Tolerance:
- Redundancy in tasks can enhance system reliability and fault tolerance.
Pipelining:
Pipelining is a technique used in computer architecture to improve the throughput and efficiency of instruction execution in a processor.
- It breaks down the execution of an instruction into several stages, and each stage is performed concurrently for different instructions.
- This overlapping of instruction execution stages enables the processor to handle multiple instructions simultaneously.
Example:
Let us consider an example of combined multiplication and addition operation to get a better understanding of the pipeline organization.
• The combined multiplication and addition operation is done with a stream of numbers such as:
Ai * Bi + Ci for i = 1, 2, 3, ..., 7
• The operation to be performed on the numbers is decomposed into sub-operations with each sub-operation to be implemented in a segment within a pipeline.
• The sub-operations performed in each segment of the pipeline are defined as:
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi)
Segment 2: R3 ← R1 * R2, R4 ← Ci (multiply and input Ci)
Segment 3: R5 ← R3 + R4 (add Ci to the product)
• The following block diagram represents the combined as well as the sub-operations performed in each segment of the pipeline.
• Registers R1, R2, R3, and R4 hold the data and the combinational circuits operate in a particular segment.
• The output generated by the combinational circuit in a given segment is applied to the input register of the next segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers for the combinational adder circuit.
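A minimal clock-by-clock simulation of this three-segment pipeline, using small made-up streams for Ai, Bi, and Ci; the register names follow the text, and the segments are updated from last to first within each iteration to mimic simultaneous register transfers.
```python
# Simulate the pipeline R1,R2 -> R3,R4 -> R5 that computes Ai*Bi + Ci for seven inputs.
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

R1 = R2 = R3 = R4 = R5 = None            # pipeline registers, empty at start
results = []

for clock in range(len(A) + 2):           # k + n - 1 = 3 + 7 - 1 = 9 clock cycles
    # Segment 3: R5 <- R3 + R4 (add Ci to the product)
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    # Segment 2: R3 <- R1 * R2, R4 <- Ci (multiply and input Ci)
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]    # Ci enters one clock after Ai and Bi
    else:
        R3 = R4 = None
    # Segment 1: R1 <- Ai, R2 <- Bi (input Ai and Bi)
    if clock < len(A):
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

assert results == [a * b + c for a, b, c in zip(A, B, C)]
print(results)                            # first result appears after 3 clocks, then one per clock
```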
Content of Registers in Pipeline Example:
General Considerations:
- The diagram shows six tasks T1 through T6 executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
- Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
- From then on, the pipe completes a task every clock cycle. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
- Now consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks.
- The first task T1 requires a time equal to ktp to complete its operation since there are k segments in the pipe. The remaining n – 1 tasks emerge from the pipe at the rate of one task per clock cycle and they will be completed after a time equal to (n – 1)tp.
- Therefore, to complete n tasks using a k-segment pipeline requires k + (n – 1) clock cycles. For example, the diagram of Fig. 4 shows four segments and six tasks. The time required to complete all the operations is 4 + (6 – 1) = 9 clock cycles, as indicated in the diagram.
- Next consider a nonpipeline unit that performs the same operation and takes a time equal to tn to complete each task. The total time required for n tasks is n·tn.
- The speedup of pipeline processing over an equivalent nonpipeline processing is defined by the ratio S = n·tn / [(k + n − 1)·tp].
- As the number of tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value of n. Under this condition, the speedup becomes S = tn / tp.
- If we assume that the time it takes to process a task is the same in the pipeline and nonpipeline circuits, we have tn = k·tp. With this assumption, the speedup reduces to S = k·tp / tp = k.
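The sketch below evaluates these formulas for the four-segment, six-task case from the text and shows the speedup approaching k as n grows, under the stated assumption tn = k·tp.
```python
def pipeline_cycles(k, n):
    # Clock cycles needed for n tasks on a k-segment pipeline.
    return k + n - 1

def pipeline_speedup(k, n, tn, tp):
    # S = n*tn / ((k + n - 1) * tp)
    return (n * tn) / (pipeline_cycles(k, n) * tp)

print(pipeline_cycles(4, 6))                            # 4 + (6 - 1) = 9 cycles

k, tp = 4, 1.0
for n in (6, 100, 10_000):
    print(n, round(pipeline_speedup(k, n, k * tp, tp), 3))
# 6 -> 2.667, 100 -> 3.883, 10000 -> 3.999  (approaches k = 4)
```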
Stages of a Pipelined Instruction Execution:
IF – Instruction Fetch:
- Fetch the instruction from memory using the program counter.
ID – Instruction Decode:
- Decode the instruction, determining the operation and operand addresses.
EX – Execute:
- Execute the operation, performing arithmetic or logical calculations.
MEM – Memory Access:
- Access memory if the instruction requires reading or writing data.
WB – Write Back:
- Write the result back to the register file.
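A toy sketch (not modeling any particular CPU) that prints which of the five stages each instruction occupies in every clock cycle of an ideal, hazard-free pipeline; the instruction count is arbitrary.
```python
# Print a cycle-by-cycle occupancy chart for an ideal 5-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    total_cycles = len(STAGES) + n_instructions - 1     # k + n - 1 cycles in total
    for i in range(n_instructions):
        row = []
        for cycle in range(total_cycles):
            stage = cycle - i                            # which stage instruction i is in
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else ".")
        print(f"I{i + 1}: " + " ".join(f"{s:>3}" for s in row))

pipeline_diagram(4)    # four instructions finish in 5 + 4 - 1 = 8 cycles
```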
Benefits of Pipelining:
Increased Throughput:
- Multiple instructions are in different stages of execution simultaneously, leading to improved overall throughput.
Better Resource Utilization:
- The processor’s functional units are better utilized as they can be active in different stages for different instructions.
Reduced Execution Time:
- Overlapping stages reduce the time taken to execute individual instructions.
Parallel Processing vs. Pipelining
| Feature | Parallel Processing | Pipelining |
|---|---|---|
| Nature | Multiple tasks are performed simultaneously | A single task is broken into stages for parallel execution of different instructions |
| Units of Work | Multiple independent tasks or processes | A single task divided into stages (fetch, decode, execute, memory access, write back) |
| Execution Model | Multiple processors or cores | A single processor with overlapping stages |
| Dependencies | Limited dependencies between independent tasks | Dependencies between pipeline stages must be managed |
| Communication | Requires communication between processors or cores | Internal communication within the pipeline stages |
| Throughput | Can achieve high throughput by processing multiple tasks concurrently | Improves throughput by overlapping stages of a single task |
| Resource Utilization | Efficient utilization of multiple processors or cores | Better utilization of functional units within a single processor |
| Execution Overlap | Simultaneous execution of independent tasks | Overlapping execution of stages for different instructions |
| Hazards | Dependencies between tasks can lead to synchronization and communication overhead | Data hazards, control hazards, and structural hazards must be managed |
| Scalability | Scales well with the addition of more processors or cores | Limited scalability due to dependencies and potential hazards |
| Examples | Distributed computing, multiprocessor systems | Instruction pipelining in CPUs, GPU pipelines for graphics rendering |