Parallel Processing And Pipelining

Parallel processing and pipelining are two concepts used in computer architecture and design to improve the performance of processors.

Parallel Processing:

Parallel processing is a computing paradigm where multiple processing units or cores work together to solve a problem, execute a task, or process data simultaneously.

The goal is to improve computational efficiency and overall system performance by dividing a larger problem into smaller subproblems that can be solved concurrently.
Parallel processing can be implemented at different levels, ranging from hardware-level parallelism to software-level parallelism.

‣ Throughput:

Throughput in parallel processing refers to the rate at which a system can complete tasks or process data when multiple tasks are executed simultaneously or in parallel.

• In parallel processing, tasks are divided into smaller sub-tasks, which are then executed concurrently across multiple processing units, such as CPU cores, GPUs, or distributed computing nodes.

‣ Multiple Functional Units:

A multiple-functional unit refers to a hardware component or module capable of performing various tasks or functions simultaneously or in parallel.

• Having multi-functional units allows for concurrent execution of multiple tasks, which can lead to improved performance and throughput for tasks that can be parallelized.

• This is particularly important in tasks such as scientific computing, data analysis, and simulations, where processing large amounts of data or performing complex calculations can benefit from parallel execution.

Block Diagram of Processor with Multiple Functional Units:

Types of Parallel processing:

‣ Bit- Level Parallelism:

Processing multiple bits simultaneously.
Commonly found in SIMD (Single Instruction, Multiple Data) architectures.

‣ Instruction – Level Parallelism:

Concurrent execution of multiple instructions.
Techniques include pipelining, superscalar architectures, and out-of-order execution.

‣ Task – Level parallelism:

Concurrent execution of independent tasks or processes.
Implemented through multiple processors or cores.

Form of Parallel Processing:

‣Data Parallelism:

Same operation performed on different data.
Often used in SIMD architectures and parallel computing frameworks.

‣ Task Parallelism:

Different tasks or processes executed concurrently.
Common in multiprocessor systems and distributed computing.

Pipeline parallelism:

Similar to instruction pipelining, where different stages of a task are executed concurrently.

Parallel Processing Architectures:

‣Single Instruction, Multiple Data (SIMD):

A single instruction is applied to multiple data elements simultaneously.
Often used in vector processing.

‣ Multiple Instruction, Multiple Data(MIMD):

Multiple instructions are executed on multiple data elements concurrently.
Common in multiprocessor systems.

‣ Shared Memory and Distributed Memory Architectures:

Shared Memory: Multiple processors share a common memory space.
Distributed Memory: Each processor has its own local memory.

‣ GPU Parallelism:

Graphics Processing Units (GPUs) are designed for parallel processing, particularly for graphics rendering and general-purpose computing (GPGPU).

Loosely Coupled and Tightly Coupled computers:

‣ Loosely Coupled Computer:

In a loosely coupled computer, the components are relatively independent of each other and communicate through message passing or some form of inter-process communication (IPC).

Each component in a loosely coupled system can operate independently and can be replaced or upgraded without significantly affecting the other components.
Loosely coupled computer are often more flexible and scalable, as they can easily accommodate changes and additions without disrupting the entire system.
Examples of loosely coupled systems include distributed systems, where different nodes communicate over a network but can function independently.

‣ Tightly Coupled Computer:

In a tightly coupled computer, the components are highly dependent on each other, and they often share resources and data directly.
Changes to one component in a tightly coupled system can have significant impacts on other components, and failures in one part of the system can propagate to other parts.
Tightly coupled computer are often designed for high performance and efficiency, as they can take advantage of shared resources and direct communication.
Examples of tightly coupled systems include multiprocessor systems, where multiple processors share the same memory and work closely together to execute tasks.

Schematic Diagram of an MIMD Processor:

Amdahl`s Law:

Amdahl’s Law is a principle in computer architecture and parallel computing that predicts the theoretical speedup in execution time of a task as a function of the number of processors used.

• It was formulated by computer scientist Gene Amdahl in 1967.

• The law states that the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.

The formula for Amdahl’s Law is:

Speedup= 1/[(1-P)+P/N)]

Where:

Speedup is the theoretical speedup of the execution time.
P is the proportion of the program that can be parallelized.
N is the number of processors.

Example 1:

Suppose we have a program where only 60% of the code can be parallelized, and the remaining 40% is strictly sequential.

P=0.6 (60% can be parallelized)

1−P=0.4 (40% is sequential)

Let’s say we have 10 processors, so N=10.

Using Amdahl’s Law, the speedup would be:

So, theoretically, using 10 processors would result in a speedup of approximately 1.5625 times faster compared to using only one processor.

Example 2:

Let’s consider a scenario where 90% of the program can be parallelized and only 10% is sequential.

P=0.9 (90% can be parallelized)

1−P=0.1 (10% is sequential)

Again, let’s assume we have 10 processors.

Using Amdahl’s Law:

In this case, even though 90% of the program can be parallelized, due to the small sequential fraction, the theoretical speedup is relatively modest. The speedup is approximately 1.0989 times faster when using 10 processors compared to using only one processor.

Advantages of Parallel Processing:

‣ Improved Performance:

Faster execution of tasks by distributing the workload among multiple processors.

‣ Scalability:

Adding more processors can potentially scale performance linearly for certain types of problems.

‣ Efficient Resource Utilization:

Utilizes available processing power efficiently by allowing concurrent execution.

‣ Fault Tolerance:

Redundancy in tasks can enhance system reliability and fault tolerance.

Pipelining:

Pipelining is a technique used in computer architecture to improve the throughput and efficiency of instruction execution in a processor.

It breaks down the execution of an instruction into several stages, and each stage is performed concurrently for different instructions.
This overlapping of instruction execution stages enables the processor to handle multiple instructions simultaneously.

Example:

Let us consider an example of combined multiplication and addition operation to get a better understanding of the pipeline organization.

• The combined multiplication and addition operation is done with a stream of numbers such as:

 A_i* B_i + C_i for i = 1, 2, 3, ......., 7

• The operation to be performed on the numbers is decomposed into sub-operations with each sub-operation to be implemented in a segment within a pipeline.

• The sub-operations performed in each segment of the pipeline are defined as:

R1  ← A_i,  R2 ← B_i		Input A_i, and B_i
R3 ← R1 * R2, R4 ← C_i	    Multiply, and input C_i
R5 ← R3 + R4			Add   C_i to product

• The following block diagram represents the combined as well as the sub-operations performed in each segment of the pipeline.

• Registers R1, R2, R3, and R4 hold the data and the combinational circuits operate in a particular segment.

• The output generated by the combinational circuit in a given segment is applied as an input register of the next segment. For instance, from the block diagram, we can see that the register R3 is used as one of the input registers for the combinational adder circuit.

Content of Registers in Pipeline Example:

General Consideration:

The diagram shows six tasks T1 through T6 executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
From then on, the pipe completes a task every clock cycle. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
Now consider the case where a k-segment pipeline with a clock cycle time t_p is used to execute n tasks.
The first task T1 requires a time equal to kt_p to complete its operation since there are k segments in the pipe. The remaining n – 1 tasks emerge from the pipe at the rate of one task per clock cycle and they will be completed after a time equal to (n – 1)t_p.
Therefore, to complete n tasks using a k-segment pipeline requires k + (n – 1) clock cycles. For example, the diagram of Fig. 4 shows four segments and six tasks. The time required to complete all the operations is 4 + (6 – 1) = 9 clock cycles, as indicated in the diagram.
Next consider a nonpipeline unit that performs the same operation and takes a time equal to t_n. to complete each task. The total time required for n tasks is nt_n.
The speedup of a pipeline processing over an equivalent non pipeline processing is defined by the ratio S=nt_n/(k+n−1)t_p
As the number of tasks increases, n becomes much larger than k – 1, and k + n – 1 approaches the value of n. Under this condition, the speedup becomes S=t_n/t_p
If we assume that the time it takes to process a task is the same in the pipeline and non pipeline circuits, we will have t_n = kt_p. Including this assumption, the speedup reduces to S=kt_p/t_p=k

Stages of a Pipelined Instruction Execution:

IF – Instruction Fetch:

Fetch the instruction from memory using the program counter.

ID – Instruction Decode:

Decode the instruction, determining the operation and operand addresses.

EX – Execute:

Execute the operation, performing arithmetic or logical calculations.

MEM – Memory Access:

Access memory if the instruction requires reading or writing data.

WB – write Back:

Write the result back to the register file.

Benefits of Pipelining:

Increased Throughput:

Multiple instructions are in different stages of execution simultaneously, leading to improved overall throughput.

Better Resource Utilization:

The processor’s functional units are better utilized as they can be active in different stages for different instructions.

Reduced Execution Time:

Overlapping stages reduce the time taken to execute individual instructions.

Parallel Processing VS. Pipelining

Feature	Parallel Processing	Pipelining
Nature	Multiple tasks are performed simultaneously	Single task is broken into stages for parallel execution of different instructions
Units of Work	Multiple independent tasks or processes	Single task divided into stages (fetch, decode, execute, memory access, write back)
Execution Model	Multiple processors or cores	Single processor with overlapping stages
Dependencies	Limited dependencies between independent tasks	Dependencies between pipeline stages must be managed
Communication	Requires communication between processors or cores	Internal communication within the pipeline stages
Throughput	Can achieve high throughput by processing multiple tasks concurrently	Improves throughput by overlapping stages of a single task
Resource Utilization	Efficient utilization of multiple processors or cores	Better utilization of functional units within a single processor
Execution Overlap	Simultaneous execution of independent tasks	Overlapping execution of stages for different instructions
Hazards	Dependencies between tasks can lead to synchronization and communication overhead	Hazards such as data hazards, control hazards, and structural hazards must be managed
Scalability	Scales well with the addition of more processors or cores	Limited scalability due to dependencies and potential hazards
Examples	Distributed computing, multiprocessor systems	Instruction pipelining in CPUs, GPU pipelines for graphics rendering

Differences Between Parallel Processing and Pipelining

Microprocessor and Computer Architecture