Microprocessor and Computer Architecture

Parallel Processing and Pipelining

Parallel processing and pipelining are two concepts used in computer architecture and design to improve the performance of processors.


Parallel processing is a computing paradigm where multiple processing units or cores work together to solve a problem, execute a task, or process data simultaneously.

  • The goal is to improve computational efficiency and overall system performance by dividing a larger problem into smaller subproblems that can be solved concurrently.
  • Parallel processing can be implemented at different levels, ranging from hardware-level parallelism to software-level parallelism.

Throughput in parallel processing refers to the rate at which a system can complete tasks or process data when multiple tasks are executed simultaneously or in parallel.

• In parallel processing, tasks are divided into smaller sub-tasks, which are then executed concurrently across multiple processing units, such as CPU cores, GPUs, or distributed computing nodes.
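As a sketch of this idea, the following Python snippet (the chunk count and worker names are illustrative) divides one summation task into four sub-tasks and executes them concurrently on a thread pool, then combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Sub-task: sum one slice of the data."""
    return sum(chunk)

data = list(range(1_000_000))

# Divide the task into four sub-tasks, one per worker.
chunks = [data[i::4] for i in range(4)]

# Execute the sub-tasks concurrently and combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(partial_sum, chunks))

total = sum(results)
print(total)  # same answer as sum(data), computed in parallel
```

Throughput improves when the sub-tasks genuinely run in parallel (e.g., on multiple cores or for I/O-bound work); the decomposition itself is what makes that possible.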

A multifunctional unit is a hardware component or module capable of performing various tasks or functions simultaneously or in parallel.

• Having multi-functional units allows for concurrent execution of multiple tasks, which can lead to improved performance and throughput for tasks that can be parallelized.

• This is particularly important in tasks such as scientific computing, data analysis, and simulations, where processing large amounts of data or performing complex calculations can benefit from parallel execution.

Parallel processing can be classified by the level at which it occurs, the form it takes, and the way instructions, data, and memory are organized:

  • Bit-level parallelism: processing multiple bits simultaneously; commonly found in SIMD (Single Instruction, Multiple Data) architectures.
  • Instruction-level parallelism: concurrent execution of multiple instructions; techniques include pipelining, superscalar architectures, and out-of-order execution.
  • Task-level parallelism: concurrent execution of independent tasks or processes, implemented through multiple processors or cores.
  • Data parallelism: the same operation performed on different data; often used in SIMD architectures and parallel computing frameworks.
  • Task parallelism: different tasks or processes executed concurrently; common in multiprocessor systems and distributed computing.
  • Pipeline parallelism: similar to instruction pipelining, where different stages of a task are executed concurrently.
  • SIMD (Single Instruction, Multiple Data): a single instruction is applied to multiple data elements simultaneously; often used in vector processing.
  • MIMD (Multiple Instruction, Multiple Data): multiple instructions are executed on multiple data elements concurrently; common in multiprocessor systems.
  • Shared memory: multiple processors share a common memory space.
  • Distributed memory: each processor has its own local memory.
  • Graphics Processing Units (GPUs) are designed for parallel processing, particularly for graphics rendering and general-purpose computing (GPGPU).

In a loosely coupled computer, the components are relatively independent of each other and communicate through message passing or some form of inter-process communication (IPC).

  • Each component in a loosely coupled system can operate independently and can be replaced or upgraded without significantly affecting the other components.
  • Loosely coupled computers are often more flexible and scalable, as they can easily accommodate changes and additions without disrupting the entire system.
  • Examples of loosely coupled systems include distributed systems, where different nodes communicate over a network but can function independently.

In a tightly coupled computer, the components are highly dependent on each other, and they often share resources and data directly.

  • Changes to one component in a tightly coupled system can have significant impacts on other components, and failures in one part of the system can propagate to other parts.
  • Tightly coupled computers are often designed for high performance and efficiency, as they can take advantage of shared resources and direct communication.
  • Examples of tightly coupled systems include multiprocessor systems, where multiple processors share the same memory and work closely together to execute tasks.
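The loosely coupled style can be sketched in Python: the two "nodes" below share no state and interact only through a message queue, a stand-in for the IPC or network messaging used in real loosely coupled systems (the names and messages are illustrative):

```python
import queue
import threading

# The mailbox is the only point of contact between the two nodes.
mailbox = queue.Queue()
received = []

def producer():
    """Node 1: sends messages; knows nothing about the consumer."""
    for i in range(3):
        mailbox.put(f"message {i}")
    mailbox.put(None)  # sentinel: no more messages

def consumer():
    """Node 2: processes messages; knows nothing about the producer."""
    while True:
        msg = mailbox.get()
        if msg is None:
            break
        received.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # ['message 0', 'message 1', 'message 2']
```

Because neither node touches the other's state, either one could be replaced (or moved to a different machine) without changing the other, which is exactly the flexibility loose coupling provides.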

      Amdahl’s Law is a principle in computer architecture and parallel computing that predicts the theoretical speedup in execution time of a task as a function of the number of processors used.

      • It was formulated by computer scientist Gene Amdahl in 1967.

      • The law states that the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.

      The formula for Amdahl’s Law is:

      Speedup = 1 / [(1 − P) + P / N]


      • Speedup is the theoretical speedup of the execution time.
      • P is the proportion of the program that can be parallelized.
      • N is the number of processors.

      Suppose we have a program where only 60% of the code can be parallelized, and the remaining 40% is strictly sequential.

      P=0.6 (60% can be parallelized)

      1−P=0.4 (40% is sequential)

      Let’s say we have 10 processors, so N=10.

      Using Amdahl’s Law, the speedup would be:

      Speedup = 1 / [(1 − 0.6) + 0.6/10] = 1 / (0.4 + 0.06) = 1 / 0.46 ≈ 2.17

      So, theoretically, using 10 processors results in a speedup of approximately 2.17 times compared to using only one processor, well short of the ideal 10×, because the 40% sequential fraction limits the gain.

      Let’s consider a scenario where 90% of the program can be parallelized and only 10% is sequential.

      P=0.9 (90% can be parallelized)

      1−P=0.1 (10% is sequential)

      Again, let’s assume we have 10 processors.

      Using Amdahl’s Law:

      Speedup = 1 / [(1 − 0.9) + 0.9/10] = 1 / (0.1 + 0.09) = 1 / 0.19 ≈ 5.26

      In this case, even though 90% of the program can be parallelized, the theoretical speedup with 10 processors is only about 5.26 times, roughly half the ideal 10×. No matter how many processors are added, the speedup can never exceed 1/(1 − P) = 10, because the 10% sequential fraction sets a hard upper bound.
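Both worked examples can be checked with a small Python helper (the function name is ours, chosen for illustration):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    where p is the parallelizable fraction and n the processor count."""
    return 1.0 / ((1.0 - p) + p / n)

# Example 1: 60% parallelizable, 10 processors.
print(round(amdahl_speedup(0.6, 10), 2))   # ≈ 2.17

# Example 2: 90% parallelizable, 10 processors.
print(round(amdahl_speedup(0.9, 10), 2))   # ≈ 5.26

# Even with a vast number of processors, the sequential
# fraction caps the speedup at 1 / (1 - p):
print(round(amdahl_speedup(0.9, 10**9), 2))  # ≈ 10.0
```

Note how doubling the parallel fraction from 60% to 90% more than doubles the achievable speedup; the sequential remainder, not the processor count, quickly becomes the bottleneck.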

        Advantages of parallel processing include:

        • Faster execution of tasks by distributing the workload among multiple processors.
        • Adding more processors can potentially scale performance linearly for certain types of problems.
        • Efficient use of available processing power by allowing concurrent execution.
        • Redundancy in tasks can enhance system reliability and fault tolerance.

          Pipelining is a technique used in computer architecture to improve the throughput and efficiency of instruction execution in a processor.

          • It breaks down the execution of an instruction into several stages, and each stage is performed concurrently for different instructions.
          • This overlapping of instruction execution stages enables the processor to handle multiple instructions simultaneously.

          Let us consider an example of a combined multiplication and addition operation to get a better understanding of pipeline organization.

          • The combined multiplication and addition operation is done with a stream of numbers such as:

           Ai * Bi + Ci   for i = 1, 2, 3, ..., 7

          • The operation to be performed on the numbers is decomposed into sub-operations with each sub-operation to be implemented in a segment within a pipeline.

          • The sub-operations performed in each segment of the pipeline are defined as:

          R1 ← Ai,      R2 ← Bi       Input Ai and Bi
          R3 ← R1 * R2, R4 ← Ci      Multiply, and input Ci
          R5 ← R3 + R4               Add Ci to product

          • The following block diagram represents the combined as well as the sub-operations performed in each segment of the pipeline.

          [Figure: block diagram of the three-segment pipeline for Ai * Bi + Ci]

          • Registers R1 through R5 hold the data, and the combinational circuits operate within a particular segment.

          • The output generated by the combinational circuit in a given segment is applied to the input register of the next segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers for the combinational adder circuit.
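As an illustrative simulation (the sample values of Ai, Bi, and Ci are arbitrary), the following Python sketch clocks the three segments, updating them in reverse order so that each register holds its value for exactly one cycle and three items can be in flight at once:

```python
# Simulate the three-segment pipeline for Ai * Bi + Ci.
# Registers R1..R5 are modeled as plain variables; on each clock tick
# every segment consumes its input registers and loads its outputs.

A = [1, 2, 3, 4, 5, 6, 7]   # sample operands (illustrative)
B = [7, 6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1, 1]

R1 = R2 = R3 = R4 = R5 = None
results = []

for clock in range(len(A) + 2):          # 2 extra ticks drain the pipe
    # Segment 3: R5 <- R3 + R4  (add Ci to the product)
    if R3 is not None:
        R5 = R3 + R4
        results.append(R5)
    # Segment 2: R3 <- R1 * R2, R4 <- Ci  (multiply, input Ci)
    if R1 is not None:
        R3, R4 = R1 * R2, C[clock - 1]
    else:
        R3 = R4 = None
    # Segment 1: R1 <- Ai, R2 <- Bi  (input Ai and Bi)
    if clock < len(A):
        R1, R2 = A[clock], B[clock]
    else:
        R1 = R2 = None

print(results)  # [8, 13, 16, 17, 16, 13, 8]
```

The first result emerges after three clocks (the pipeline fill time); after that, one result is produced every clock, which is the whole point of the segmentation.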

          [Figure: four-segment pipeline and its space-time diagram for tasks T1 through T6]
          • The diagram shows six tasks T1 through T6 executed in four segments. Initially, task T1 is handled by segment 1. After the first clock, segment 2 is busy with T1, while segment 1 is busy with task T2.
          • Continuing in this manner, the first task T1 is completed after the fourth clock cycle.
          • From then on, the pipe completes a task every clock cycle. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
          • Now consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks.
          • The first task T1 requires a time equal to ktp to complete its operation since there are k segments in the pipe. The remaining n – 1 tasks emerge from the pipe at the rate of one task per clock cycle and they will be completed after a time equal to (n – 1)tp.
          • Therefore, to complete n tasks using a k-segment pipeline requires k + (n – 1) clock cycles. For example, the diagram of Fig. 4 shows four segments and six tasks. The time required to complete all the operations is 4 + (6 – 1) = 9 clock cycles, as indicated in the diagram.
          • Next, consider a nonpipeline unit that performs the same operation and takes a time equal to tn to complete each task. The total time required for n tasks is n·tn.
          • The speedup of pipeline processing over equivalent nonpipeline processing is defined by the ratio S = n·tn / [(k + n − 1)·tp].
          • As the number of tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value of n. Under this condition, the speedup becomes S = tn / tp.
          • If we assume that the time it takes to process a task is the same in the pipeline and nonpipeline circuits, we have tn = k·tp. With this assumption, the speedup reduces to S = k·tp / tp = k.
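These timing formulas can be sketched in Python (the function names are ours, for illustration):

```python
def pipeline_cycles(k, n):
    """Clock cycles to finish n tasks on a k-segment pipeline: k + (n - 1)."""
    return k + (n - 1)

def speedup(k, n, tp, tn=None):
    """S = n*tn / ((k + n - 1) * tp); if tn is unspecified, assume tn = k*tp."""
    if tn is None:
        tn = k * tp
    return (n * tn) / (pipeline_cycles(k, n) * tp)

# The example from the text: k = 4 segments, n = 6 tasks.
print(pipeline_cycles(4, 6))           # 9 clock cycles
print(round(speedup(4, 6, tp=1), 2))   # 24 / 9 ≈ 2.67

# As n grows, the speedup approaches k:
print(round(speedup(4, 100000, tp=1), 2))  # ≈ 4.0
```

With only six tasks the speedup (≈ 2.67) is well below the k = 4 limit, because the pipeline fill time is a large fraction of the total; the limit is approached only for long task streams.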
          A typical instruction pipeline divides the execution of each instruction into five stages:
            • Fetch the instruction from memory using the program counter.
            • Decode the instruction, determining the operation and operand addresses.
            • Execute the operation, performing arithmetic or logical calculations.
            • Access memory if the instruction requires reading or writing data.
            • Write the result back to the register file.
              Benefits of instruction pipelining include:

              • Multiple instructions are in different stages of execution simultaneously, leading to improved overall throughput.
              • The processor’s functional units are better utilized, as they can be active in different stages for different instructions.
              • Overlapping stages reduce the average time per completed instruction, even though the latency of any single instruction is not shortened.
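The five-stage overlap can be sketched as follows; the helper below (names are illustrative) builds the classic space-time table for an ideal pipeline with no stalls:

```python
# Which stage each instruction occupies on every clock cycle,
# for the classic five-stage pipeline (IF, ID, EX, MEM, WB).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    """Return {cycle_index: [(instruction, stage), ...]} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            # Instruction i enters stage s at cycle i + s.
            table.setdefault(i + s, []).append((f"I{i + 1}", name))
    return table

for cycle, work in sorted(schedule(3).items()):
    print(cycle + 1, work)
# In cycle 2, I1 is in ID while I2 is already in IF: the overlap
# is what lets 3 instructions finish in 5 + (3 - 1) = 7 cycles.
```

The same k + (n − 1) formula from the earlier speedup derivation applies: five stages, three instructions, seven cycles, instead of fifteen cycles without pipelining.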
                Differences Between Parallel Processing and Pipelining
