Skip to main content

Computer Organization: Chapter 11 - Pipeline

Computer Organization

Chapter 11 - Pipeline

Chapter 11: Pipeline

In this chapter, we introduce microarchitecture, which is the connection between logic and architecture. Microarchitecture is the specific arrangement of registers, ALUs, finite state machines, memories, and other logic building blocks needed to implement an architecture. We also define instruction pipelining, hazards, pipelined datapath, and pipelined control.

Objectives

By the end of this chapter you should be able to:

  • Identify five stages in MIPS pipeline
  • Recognize structure hazards, data hazard, and control hazard
  • Demonstrate knowledge of pipelined datapath
  • Clarify pipeline usage in a single-clock cycle
  • Clarify pipeline operation in multi-cycle pipeline diagram

11.1 Instruction Pipelining

R-Type Instruction

The instruction is fetched from memory, and the PC is incremented by 4 in the instruction fetch (IF) stage, as shown in Fig. 11-1. The fetched instruction is used by other parts of the datapath. Program Counter (PC) always holds the next memory address to be fetched, where PC is a byte address, not bit address. PC value is updated by adding 4 to the previous PC value.

Fig. ‑. Instruction Fetch Stage of R-Type Instruction

Fig. 11-2 shows the instruction decode (ID) stage of R-Type Instruction. The two elements needed to implement R-format ALU operations are the register file and the ALU. The register file contains all the registers and has two read ports and one write port. The register file always outputs the contents of the registers corresponding to the Read register inputs on the outputs; no other control inputs are needed. The inputs (RS and RT) carrying the register number to the register file are all 5-bit wide, whereas the lines carrying data values are 32-bit wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which will be 4-bit wide.

Fig. ‑. Instruction Decode Stage of R-Type Instruction

The arithmetic operations are executed in the execute (EX) stage, as shown in Fig. 11-3. Two 32-bit wide inputs from register files are fed into ALU to execute logic operations.

Fig. ‑. Execute Stage of R-Type Instruction

There is nothing happening in memory access stage in R-type instruction.

In the write back (WB) stage of Fig. 11-4, the result from the ALU is written into the register file using bits 15:11 of the instruction to select the destination register.

Fig. ‑. Write Back Stage of R-Type Instruction

Load Instruction

In load instruction of Fig. 11-5, the instruction is fetched from memory, and PC value is increased by 4, which is the same as R-type instruction.

Fig. ‑. Instruction Fetch Stage of Load Instruction

The fetched instruction is used by other parts of the datapath. Program Counter (PC) always holds the next memory address to be fetched, where PC is a byte address, not bit address. PC value is updated by adding 4 to the previous PC value.

Fig. 11-6 shows the ID stage of Load instruction. In this stage the instruction field value [25 – 21] is fed into the register files and produces Read data 1 (32 bits), whereas the instruction field value [15 – 0] is fed into sign-extend function and produces a 32-bit constant/address value.

Fig. ‑. Instruction Decode Stage of Load Instruction

The memory address is calculated with two 32-bit values in the execute stage of Fig. 11-7.

Fig. ‑. Execute Stage of Load Instruction

Fig. ‑. Memory Access Stage of Load Instruction

Fig. 11-8 shows the memory access stage of Load instruction. In this stage, the control bit for MemWrite is set to 1. Data memory contents designated by the address input are replaced by the value on the Write data input.

As shown in the following figure, the control bit for MemtoReg is set to 1 in the write back stage. The value fed to the register Write data input comes from the data memory.

Fig. ‑. Write Back Stage of Load Instruction

Performance Issues

Historically early computers with very simple instruction sets did use this implementation technique. Pipelining improves efficiency by executing multiple instruction simultaneously.

The longest delay determines clock period in the pipeline. In the MIPS instruction sets, the load instruction is the critical path because it includes the following stage:

  • Instruction memory (IF) → register file (ID) → ALU (EX) → data memory (MEM) → register file (WB)

It is not feasible to vary period for different instructions, because that violates design principle, making the common case fast. We can improve performance by pipelining, meaning that each instruction is executed in a different stage simultaneously in the processor.

With pipeline, we can overlap the execution. It is the similar concept to improve the performance with parallelism. The laundry analogy exemplified this parallelism. Ann, Brian, Cathy, and Don each have dirty clothes to be washed, dried, folded, and put away. Assume there are total four laundries and four steps for each laundry, i.e. washer, dryer, folding clothes, and clothes closet. Each step needed 30 minutes to complete. A sequential laundry takes 8 hours for 4 loads of wash, i.e. 4 loads × 2 hours, whereas a pipelined laundry takes just 3.5 hours, i.e. 1.5 hours + 30 minutes × 4).

MIPS pipeline has five stages, one step per stage:

  • IF: Instruction fetch from memory
  • ID: Instruction decode & register read
  • EX: Execute operation or calculate address
  • MEM: Access memory operand
  • WB: Write result back to register

Let’s assume that the time for stages is as follows:

  • 100 ps for register read or write
  • 200 ps for other stages

In the following table, we can compare the total time of the pipelined datapath with a single-cycle datapath:

Table ‑. Pipelined DataPath

Instruction

Instruction fetch

Register

read

ALU op

Memory

access

Register

Write

Total

Time

lw

200 ps

100 ps

200 ps

200 ps

100 ps

800 ps

sw

200 ps

100 ps

200 ps

200 ps

700 ps

R-format

200 ps

100 ps

200 ps

100 ps

600 ps

beg

200 ps

100 ps

200 ps

500 ps

The load instruction includes all the pipeline stage so that the total time of the pipelined datapath is 800 ps, whereas the R-type instruction has a total time of 700 ps because it doesn’t include the memory access stage.

Fig. ‑. Nonpipelined Execution of Three Load Word Instruction

Fig. ‑. Pipelined Execution of Three Load Word Instruction

Figs. 11-10 and 11-11 compare nonpipelined and pipelined execution of three load word instructions. In the nonpipelined execution, a single-cycle is 800ps, thus the total time to execute three load instructions is 3 × 800 ps or 2400 ps in the nonpipelined design. On the other hand, in the pipelined execution, a clock cycle is 200 ps, and the pipelined execution clock cycle must have the worst-case clock cycle of 200 ps, even though some stages take only 100 ps. The total time to execute three load instructions is 200 ps × 5 + 200 ps × 2 or 1400 ps. Notice that the pipelined execution time (1400 ps) is faster than the nonpipelined execution time (2400 ls).

What would happen if we added 1,000,000 instructions in the pipelined and non-pipelined process in the above examples?

For the pipelined process, each instruction adds 200 ps to the total execution time. The total time will be as follows:

  • 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps

For the nonpipelined process, each instruction adds 800 ps to the total execution time. The total time will be as follows:

  • 1,000,000 × 800 ps + 2400 ps = 800,002,400 ps

The ratio of total execution times for real programs on nonpipelined to pipelined processors will be like

fraction numerator 800 , 002 , 400 blank p s over denominator 200 , 002 , 400 blank p s end fraction approximately equal to fraction numerator 800 blank p s over denominator 200 blank p s end fraction equals 4.00

If all stages are balanced, i.e., all stage take the same time, the total time of the pipelined process can be faster (× number of stages) than the total time of the nonpipelined process. If all stages are not balanced, speedup is less. Note that this speedup is due to the increased throughput. The time for each instruction (latency) doesn’t decrease.

Fig. ‑. MIPS Pipelined Datapath

As shown in the above figure, MIPS Pipelined Datapath has IF (Instruction fetch), ID (Instruction decode/register file read), EX (Execute/address calculation), MEM (Memory access), and WB (Write back). Each step of the instruction can be mapped onto the datapath from left to right.

The update of the PC and the write-back step sends either the ALU result or the data from memory to the left to be written into the register file.

11.2 Pipelined Datapath

The pipelined datapath needs registers between stages. The pipeline registers separate each pipeline stage, as shown in the following figure.

Fig. ‑. Pipeline Registers

The pipeline registers are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. The pipeline operates cycle-by-cycle flow of instructions through the pipelined datapath. The single-clock-cycle pipeline diagram shows pipelined usage in a single cycle and highlight resources used in the pipeline, whereas the multi-clock-cycle diagram shows the graph of operation over time.

Let’s look at “single-clock-cycle” diagrams for load and store instructions.

Single-clock-cycle Pipeline Diagram

Fig. 11-14 shows the instruction being read from memory using the address in the PC and then being placed in the IF/ID pipeline register.

Fig. ‑. Instruction Fetch Stage for Load and Store

The PC address is incremented by 4 and then written back into the PC to be ready for the next clock cycle. This incremented address is also saved in the IF/ID pipeline register in case it is needed later for an instruction, such as beq. The computer cannot know which type of instruction is being fetched, so it must prepare for any instruction, passing potentially needed information down the pipeline.

Fig. ‑. Instruction Decode Stage for Load and Store

Fig. 11-15 shows the instruction portion of the IF/ID pipeline register supplying the 16-bit immediate field, which is sign-extended to 32 bits, and the register numbers to read the two registers. All three values are stored in the ID/EX pipeline register, along with the incremented PC address. We again transfer everything that might be needed by any instruction during a later clock cycle.

Fig. ‑. Execute Stage for Load

Fig. 11-16 shows that the load instruction reads the contents of register 1 and the sign-extended immediate from the ID/EX pipeline register and adds them using the ALU. That sum is placed in the EX/MEM pipeline register.

Fig. 11-17 shows the load instruction reading the data memory using the address from the EX/MEM pipeline register and loading the data into the MEM/WB pipeline register.

Fig. ‑. Memory Access Stage for Load

Fig. ‑. Write Back Stage for Load

Fig. 11-18 shows the final step: reading the data from the MEM/WB pipeline register and writing it into the register file in the middle of the figure. When the processor executes WB stage of Load instruction, the write register number is not corresponding to the load instruction, because other instructions were executed for the ID stage.

Fig. ‑. Corrected Datapath for Load

Fig. 11-19 shows the corrected datapath for Load instruction. The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in Red color in the following figure:

Fig. ‑. Execute Stage for Store

Fig. 11-20 shows the execute stage of Store instruction. Unlike the third stage of the load instruction, the second register value is loaded into the EX/MEM pipeline register to be used in the next stage. Although it wouldn’t hurt to always write this second register into the EX/MEM pipeline register, we write the second register only on a store instruction to make the pipeline easier to understand.

Fig. ‑. Memory Access Stage for Store

Fig. 11-21 shows the memory access stage of Store instruction, where the data is written into data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline register.

Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in the last (WB) stage.

Multi-Cycle Pipeline Diagram

Fig. 11-22 shows the multiple-clock-cycle pipeline diagram for five instructions. Time advances from left to right across the page in these diagrams, and instructions advance from the top to the bottom.

A representation of the pipeline stages is placed in each portion along the instruction axis, occupying the proper clock cycles. These stylized datapaths represent the five stages of our pipeline graphically. In the figure, IM represents the instruction memory and PC in the instruction fetch stage and DM represents data memory.

Fig. ‑. Multi-Cycle Pipeline Resource Usage

Fig. 11-23 shows the more traditional version of the multiple-clock-cycle pipeline diagram. The previous figure shows the physical resources used at each stage, while This figure uses the name of each stage.

Fig. ‑. Multi-Cycle Pipeline Resource Usage

Exercises

Assume that individual stages of the datapath have the following latencies:

IF

ID

EX

MEM

WB

260 ps

360 ps

170 ps

310 ps

220 ps

  1. What is the clock cycle time in a pipelined and non-pipelined processor?
  • Pipelined processor: Clock cycle time = 360 ps
  • Non-pipelined processor: Clock cycle time = 260 + 360 + 170 + 310 + 220 = 1320 ps
  1. What is the total latency of seven LW instructions in a pipelined and non-pipelined processor (assume no stalls or hazards)
  • Pipelined processor: Total latency = 360 × 5 + 360 × 6 = 3960 ps
  • Non-pipelined processor: Total latency = 1320 ps × 7= 9240 ps

11.3 Pipelined Controls

In the pipeline, there are lots of control signals. Depend on the control signals enabled or disabled, the components of the pipeline are executed to complete each stage. The following figure shows that what control signals are used for each stage:

Fig. ‑. Simplified Pipelined Control

  • IF: If PCSrc set to 0, the PC value increased by 4; otherwise, a specific address forwarded from a branch instruction.
  • ID/register file read: the same thing happens at every clock cycle. No optional control lines.
  • Execution/address calculation: the signals, i.e. RegDst, ALUOp, and ALUSrc, are set. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register.
  • Memory access: the control lines, i.e. Branch, MemRead, and MemWrite are set.
  • Write Back: two control lines, MemtoReg and RegWrite.

The effect of each control signal is summarized in the following table:

Table ‑. Effect of Each Control Signal

Signal name

Effect when reasserted

Effect when asserted

RegDst

The register destination number for the Write register comes from the rt field (bits 20:16)

The register destination number for the Write register comes from the rd field (bits 15:11)

RegWrite

None

The register on the Write input is written with the value of the Write data input

ALUSrc

The second ALU operand comes from the second register file output (Read data 2)

The second ALU operand is the sign-extended, lower 16 bits of the instruction

PCSrc

The PC is replaced by the output of the adder that computes the value of PC + 4

The PC is replaced by the output of the adder that computes the branch target

MemRead

None

Data memory contents designated by the address input are put on the Read data output

MemWrite

None

Data memory contents designated by the address input are replaced by the value on the Write data input

MemtoReg

The value fed to the register Write data input comes from the ALU

The value fed to the register Write date input comes from the data memory

The control signals are derived from the instruction, as shown in the following figure:

Fig. ‑. Pipelined Control Signal

Since the control lines start with the EX stage, the control information, i.e. total nine control signals, can be created during ID stage. Four of the nine control lines are used in the EX stage, with the remaining five control lines passed on to the EX/MEM pipeline register extended to hold the control lines. Three are used during the MEM stage, and the last two are passed to MEM/WB pipeline register for use in the WB stage.

Example – Pipeline Control

Let’s look at some example what control signals are created in a given instruction and how these signals are used for each pipeline stage with the following instructions, where we assume that there are no hazard illustrations:

lw $10, 20($1)

sub $11, $2, $3

and $12, $4, $5

or $13, $6, $7

add $14, $8, $9

Fig. ‑. Pipeline Control – Click 1

Fig. 11-26 shows that the LW instruction is fetched in the instruction memory of IF stage. At the end of the clock cycle, the LW instruction is in the IF/ID pipeline registers. Note that since there is no control signal created in this stage, all the control signals are set to 0.

Fig. ‑. Pipeline Control – Click 2

Fig. 11-27 shows the second clock cycle, where the LW instruction moves to the ID stage, and sub instruction enters in the IF stage.

Note that the values of the instruction fields and the selected source registers are shown in the ID stage. Hence register $1 and the constant 20, the operands of LW, are written into the ID/EX pipeline register. The number 10, representing the destination register number of LW, is also placed in ID/EX. Bits 15–11 are 0, but we use the symbol X to show that a field plays no role in a given instruction.

Fig. ‑. Pipeline Control – Click 3

LW instruction enters the EX stage in the third clock cycle, adding $1 and 20 to form the address in the EX/MEM pipeline register.

At the same time, the SUB instruction (sub $11, $2, $3) enters ID stage, reading registers $2 and $3, and the AND instruction (and $12, $4, $5) starts IF stage.

Fig. ‑. Pipeline Control – Click 4

In the fourth clock cycle, LW instruction moves into MEM stage, reading memory using the value in EX/MEM as the address.

In the same clock cycle, the ALU subtracts $3 from $2 and places the difference into EX/MEM pipeline registers, reads registers $4 and $5 during ID stage, and the OR instruction (or $13, $6, $7) enters IF stage.

Fig. ‑. Pipeline Control – Click 5

The final instruction, an ADD instruction in this example, enters IF stage in the datapath. All instructions are engaged in the fifth clock cycle. By writing the data in MEM/WB into the write register 10, LW instruction completes and both the data and the register number are in MEM/WB.

In the same clock cycle, SUB instruction sends the difference in EX/MEM to MEM/WB, and the rest of the instructions move forward.

Fig. ‑. Pipeline Control – Click 6

In the sixth clock cycle, SUB instruction selects the value in MEM/WB to write to the write register number 11, again found in MEM/WB.

The remaining instructions play follow-the-leader: the ALU calculates the OR of $6 and $7 for the OR instruction in the EX stage, and registers $8 and $9 are read in the ID stage for the ADD instruction.

The instructions after ADD are shown as inactive just to emphasize what occurs for the five instructions in the example.

Fig. ‑. Pipeline Control – Click 7

In the seventh clock cycle, the ADD instruction brings up the rear, adding the values corresponding to registers $8 and $9 during the EX stage.

The result of the OR instruction is passed from EX/MEM to MEM/WB in the MEM stage, and the WB stage writes the result of the AND instruction in MEM/WB to the write register $12.

Note that the control signals are deasserted (set to 0) in the ID stage, since no instruction is being executed.

Fig. ‑. Pipeline Control – Click 8

In the eighth clock cycle, the WB stage writes the result to the write register $13, thereby completing OR instruction, and the MEM stage passes the sum of the ADD instruction from EX/MEM to MEM/WB.

The instructions after ADD instruction are shown as inactive for pedagogical reasons.

Fig. ‑. Pipeline Control – Click 9

The WB stage writes the sum in MEM/WB into the write register $14, completing all five-instruction sequences including ADD instruction. The instructions after ADD instruction are shown as inactive for pedagogical reasons.

Fig. ‑. Summary of Pipelined Control

Fig. 11-35 summarized the pipeline control. The control values for the last three stages are created during the instruction decode stage and then placed in the ID/EX pipeline register. All the control values are as follows for each stage:

  • EX stage: ALUSrc, ALUOp, and RegDst
  • MEM stage: Branch, MemWrite, PCSrc, and MemRead
  • WB stage: MEMtoReg and RegWrite

The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.

Next Chapter
Chapter 12 - Memory
PreviousNext
Powered by Manifold Scholarship. Learn more at manifoldapp.org