## CSE 560 Computer Systems Architecture

**Pipelining** 

₩ashington University in St.Louis

1

UI

## Performance Review

What metric would you use to compare the performance of computers

- 1. With different ISAs?
- Execution time
- 2. With the same ISA?
- MIPS
- 3. With the same ISA and clock speed?
- IPC

- A. MIPS
- B. Instructions/Program
- C. Execution time (for a program)
- D. IPC
- E. Clock speed

 $\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \ \ \text{x} \ \frac{\text{cycles}}{\text{instruction}} \ \ \text{x} \ \ \frac{\text{seconds}}{\text{cycle}}$ 

Washington

3

Ur

#### Performance Review

What metric would you use to compare the performance of computers

- 1. With different ISAs?
- 2. With the same ISA?
- 3. With the same ISA and clock speed?
  - A. MIPS
  - B. Instructions/Program
  - C. Execution time
  - D. IPC
  - E. Clock speed

**Washington** 



This Unit: (Scalar In-Order) Pipelining

App App App System software

• Principles of pipelining
• Effects of overhead and hazards
• Pipeline diagrams
• Data hazards
• Stalling and bypassing
• Control hazards (Next lecture)
• Branch prediction
• Predication

**Datapath Background** 

Washington





Multi-Cycle Datapath Register Multi-cycle datapath: attacks slow clock Fetch, decode, execute one complete insn over multiple cycles Micro-coded control: "stages" control signals Allows insns to take different number of cycles (main point) ± Opposite of single-cycle: short clock period, high CPI (think: CISC) Washington

## Single-cycle vs. Multi-cycle Performance

- Single-cycle
  - Clock period = 50ns, CPI = 1
  - Performance = 50ns/insn
- Multi-cycle has opposite performance split of single-cycle
  - + Shorter clock period
  - Higher CPI
- · Multi-cycle
  - Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  - Clock period = **11ns**, CPI = (20%x3)+(20%x5)+(60%x4) = 4 Why is clock period 11ns and not 10ns?
  - Performance = 44ns/insn
- Aside: CISC makes perfect sense in multi-cycle datapath

👺 Washington

9

**Pipelining Basics** 

Washington

10

# Latency versus Throughput

insn0.fetch, dec, exec Single-cycle insn1.fetch, dec, exec insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec Multi-cvcle

- · Can we have both low CPI and short clock period?
  - · Not if datapath executes only one insn at a time
- · Latency vs. Throughput
  - Latency: no good way to make a single insn go faster
  - + Throughput: luckily, single insn latency not so important
    - · Goal is to make programs, not individual insns, go faster
    - · Programs contain billions of insns
  - · Key: exploit inter-insn parallelism

Washington



- Important performance technique
  - · Improves insn throughput rather instruction latency
- Begin with multi-cycle design
  - One insn advances from stage 1 to 2, next insn enters stage 1
  - · Form of parallelism: "insn-stage parallelism"
  - · Maintains illusion of sequential fetch/execute loop
  - · Individual instruction takes the same number of stages
- + But instructions enter and leave at a much faster rate

Laundry analogy

Washington







More Terminology & Foreshadowing

- Scalar pipeline: one insn per stage per cycle
  - Alternative: "superscalar", e.g., 4-wide (later)
- · In-order pipeline: insns enter execute stage in order
  - Alternative: "out-of-order" (OoO) (later)
- Pipeline depth: number of pipeline stages
  - Nothing magical about five (Pentium 4 had 22 stages!)
  - Trend: deeper until Pentium 4, then pulled back a bit

Washington University in St. Louis

16

15











Pipeline Example: Cycle 6 F/D M/W sw \$6,4(7) lw ₩ashington



Pipeline diagram: shorthand for what we just saw • Convention: X means lw \$4,0 (\$5) finishes execute stage and writes into X/M latch at end of cycle 4

Pipeline Diagram

22

24

| add \$3<-\$2,\$1 F D X M W 1 1 \$4,0(\$5) F D X M W 1 5 Sw \$6,4(\$7) F D X M W 1 |
|-----------------------------------------------------------------------------------|
| add \$3<-\$2,\$1 F D X M W                                                        |
|                                                                                   |
| ∑ ↓ 1w \$4,0(\$5) F D X M W                                                       |
| sw \$6,4(\$7)   F D X M W                                                         |

### Example Pipeline Perf. Calculation

- - Clock period = 50ns, CPI = 1
- Performance = 50ns/insn
- Multi-cycle
  - Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
     Clock period = 11ns, CPI = (20%x3)+(20%x5)+(60%x4) = 4
- Performance = 44ns/insn
- 5-stage pipelined
  - Clock period = 12ns approx. (50ns / 5 stages) + overheads
  - + CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle) + Performance = 12ns/insn
  - Well actually ... CPI = 1 + some penalty for pipelining (next)
    - CPI = 1.5 (on average insn completes every 1.5 cycles)

    - Much higher performance than single-cycle or multi-cycle

🚟 Washington

25

## CPI Calculation: Accounting for Stalls

Why is Pipelined CPI > 1?

- CPI for scalar in-order pipeline is 1 + stall penalties
- Stalls used to resolve hazards
  - · Hazard: condition that jeopardizes sequential illusion
  - Stall: pipeline delay introduced to restore sequential illusion
- Calculating pipeline CPI
  - Frequency of stall x stall cycles
  - · Penalties add (stalls generally don't overlap in in-order pipelines)
  - 1 + stall-freq<sub>1</sub> x stall-cyc<sub>1</sub> + stall-freq<sub>2</sub> x stall-cyc<sub>2</sub> + ...
- Correctness/performance/make common case fast (MCCF)
  - Long penalties OK if rare, e.g., 1 + 0.01 x 10 = 1.1
  - · Stalls have implications for ideal number of pipeline stages

₩ashington

27

28

#### Dependences and Hazards

- Dependence: relationship between two insns
  - Data: two insns use same storage location
  - Control: 1 insn affects whether another executes at all
  - · Not a bad thing, programs would be boring otherwise
  - · Enforced by making older insn go before younger one
    - · Happens naturally in single-/multi-cycle designs
    - · But not in a pipeline
- · Hazard: dependence & possibility of wrong insn order
  - · Effects of wrong insn order cannot be externally visible
    - Stall: for order by keeping younger insn in same stage
  - Hazards are a bad thing: stalls reduce performance

Washington

### Clock Period of a Pipelined Processor

 $Delay_{dp}$  = time it takes to travel through original datapath  $N_{ns}$  = number of pipeline stages

#### Pipeline Clock Period > Delay<sub>dp</sub> / $N_{ps}$

- · Latches add delay
- · Extra "bypassing" logic adds delay
- · Pipeline stages have different delays, clock period is max delay
- · These factors have implications for ideal number pipeline stages
  - Diminishing clock frequency gains for longer (deeper) pipelines

Washington

26

## **Data Dependences, Pipeline** Hazards, and Bypassing

Washington



30

#### Structural Hazards

- · Structural hazards
  - · Two insns trying to use same circuit at same time
    - E.g., structural hazard on register file write port
- To fix structural hazards: proper ISA/pipeline design
  - Each insn uses every structure exactly once
  - · For at most one cycle
  - Always at same stage relative to F (fetch)
- · Tolerate structure hazards
  - · Add stall logic to stall pipeline when hazards occur

Washington University in St. Louis

31



33



## **Example Structural Hazard**

8 9 M X add r1<-r3,r4 D W sub r1<-r3,r5 D Μ W st r6,0(r1)

- Structural hazard: resource needed twice in one cycle
- Example: unified instruction & data memories (caches)
  - - · Separate instruction/data memories (caches)
    - Have cache allow 2 accesses per cycle (slow, expensive)
    - · Stall pipeline

Washington

32



- Are memory data hazards a problem for this pipeline? No
  - 1w following sw to same address in next cycle, gets right value

- Why? D\$ read/write always take place in same stage
  Data hazards through registers? Yes (previous slide)
   Occur because register write is three stages after register read
  - · Can only read a register value three cycles after writing it

Washington

34























Washington

48

#### Performance Impact of Load/Use Penalty

- Assume
  - Branch: 20%, load: 20%, store: 10%, other: 50%
  - 50% of loads are followed by dependent instruction
    - require 1 cycle stall (i.e., insertion of 1 nop)
- · Calculate CPI

47

• CPI =  $1 + (1 \times 20\% \times 50\%) = 1.1$ 

Washington

Reducing Load-Use Stall Frequency 2 3 4 5 6 7 8 9 add \$3<-\$2,\$1 D X M W F D X M W F d\* D X M W lw \$4,4(\$3) addi \$6<-\$4,1 sub \$8<-\$3,\$1 D X M W d\* = data hazard · Use compiler scheduling to reduce load-use stall frequency 1 2 3 4 5 6 7 8 9 add \$3<-\$2,\$1 D X M W lw \$4,4(\$3) D X M W F sub \$8<-\$3,\$1 F D X M W F D X M W addi \$6<-\$4,1



50

## Pipeline Diagram with Multiplier

|                  | 1 | 2 | 3  | 4  | 5  | 6  | 7  | 8 | 9 |
|------------------|---|---|----|----|----|----|----|---|---|
| mul \$4<-\$3,\$5 | F | D | P0 | P1 | P2 | P3 | W  |   |   |
| addi \$6<-\$4,1  |   | F | d* | d* | d* | D  | *X | М | W |

- · What about...
  - Two instructions trying to write regfile in same cycle?
  - Structural hazard!
- · Must prevent:

|                   | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|-------------------|---|---|----|----|----|----|---|---|---|
| mul \$4<-\$3,\$5  | F | D | P0 | P1 | P2 | Р3 | W |   |   |
| addi \$6<-\$1,1   |   | F | D  | Х  | М  | W  |   |   |   |
| add \$5<-\$6,\$10 |   |   | F  | D  | Х  | М  | W |   |   |

Washington University in St.Louis

51

## More Multiplier Nasties

- · What about...
  - · Mis-ordered register writes
  - SW thinks add gets \$4 from addi, actually gets it from mul

|                   | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|-------------------|---|---|----|----|----|----|---|---|---|
| mul \$4<-\$3,\$5  | F | D | P0 | P1 | P2 | Р3 | W |   |   |
| addi \$4<-\$1,1   |   | F | D  | Х  | М  | w  |   |   |   |
|                   |   |   |    |    |    |    |   |   |   |
|                   |   |   |    |    |    |    |   |   |   |
| add \$10<-\$4,\$6 |   |   |    |    | F  | D  | Х | М | W |

- Common? Not for a 4-cycle multiply with 5-stage pipeline
  - More common with deeper pipelines
  - · Frequency irrelevant: must be correct no matter how rare

₩ashington University in St. Louis

52

54

## Corrected Pipeline Diagram

- · With the correct stall logic
  - · Prevent mis-ordered writes to the same register
  - · Why two cycles of delay?

|                   | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|-------------------|---|---|----|----|----|----|---|---|---|
| mul \$4<-\$3,\$5  | F | D | P0 | P1 | P2 | Р3 | W |   |   |
| addi \$4<-\$1,1   |   | F | d* | d* | D  | Х  | М | w |   |
|                   |   |   |    |    |    |    |   |   |   |
|                   |   |   |    |    |    |    |   |   |   |
| add \$10<-\$4,\$6 |   |   |    |    | F  | D  | Х | М | w |

Multi-cycle operations complicate pipeline logic

Washington

53

#### **Pipelined Functional Units**

- Almost all multi-cycle functional units are pipelined
  - · Each operation takes N cycles
  - Can initiate a new (independent) operation every cycle
  - Requires internal latching and some hardware replication
  - + Cheaper than multiple (non-pipelined) units

mulf f0 f1,f2 F D E1 E2 E3 E4 W mulf f3 f4,f5 F D E1 E2 E3 E4 W

- **s\*** = structural hazard, two insns need same structure
  - ISAs and pipelines designed minimize these
- Canonical example: all insns go through M stage
   Washington