



## Code Scheduling

- Scheduling: act of finding independent instructions
  - **Static**: at compile time by the compiler (software)
  - Dynamic: at runtime by the processor (hardware)
- · Why schedule code?
  - · Scalar pipelines: fill load-to-use delays to improve CPI
  - Superscalar: place independent instructions together
    - As above, load-to-use delay slots
    - Allow multiple-issue decode logic to let them execute at the same time

5

## Scheduling Requirements Independent insns • no ILP → game over Large Scheduling Scope • Scope = code region we are scheduling

- The bigger the better (more independent insns to play with)
- Once scope is defined, schedule is pretty obvious
- Trick is creating a large scope (schedule across branches?)
- Enough registers
  - To hold additional "live" values
- Alias analysis
  - Whether load/store reference same memory locations
     Can they be legally rearranged?

6

## Scheduling Techniques

- Stall Removal
  - Separate load-use pairs
- Scope enlarging
  - For Loops: loop unrolling
  - For Non-loops:
  - Superblocks
  - Predication
- Exploit Data-Level Parallelism
  - Vectors

## New Metric: Utilization

Utilization: actual performance / peak performance

- Important metric for performance/cost
- Why pay for hardware you rarely use?
- + Adding hardware usually  $\boldsymbol{1}$  performance,  $\boldsymbol{1}$  utilization
  - New hardware cannot always be exploited
  - Diminishing marginal returns
- Compiler can help make better use of existing hardware
   Important for superscalar

















17





16



18

## Scheduling Techniques

- Stall Removal
  - Separate load-use pairs
- Scope enlarging
  - For Loops: loop unrolling
  - For Non-loops:
    - Superblocks (biased branches)
    - Predication (non-biased branches)
- Exploit Data-Level Parallelism
  - Vectors























29



28

| Cost/benefit analysis                                                              |
|------------------------------------------------------------------------------------|
|                                                                                    |
| Benefit: predication avoids branches                                               |
| <ul> <li>Thus avoiding mis-predictions</li> </ul>                                  |
| <ul> <li>Also reduces pressure on predictor table (few branches to trac</li> </ul> |
| <ul> <li>Cost: extra (annulled) instructions</li> </ul>                            |
| Since branch predictors are highly accurate                                        |
| Might not help:                                                                    |
| <ul> <li>5-stage pipeline, two instruction on each path of if-then-else</li> </ul> |
| <ul> <li>No performance gain, likely slower if branch predictable</li> </ul>       |
| Or even hurt!                                                                      |
| But can help:                                                                      |
| <ul> <li>Deeper pipelines, hard-to-predict branches, and few added ins</li> </ul>  |
|                                                                                    |
|                                                                                    |

30

# Aside: Profiling More do we know whether a branch is biased or not? Profile: statistical information about program tendencies ( Collect from previous program runs (different inputs) ( Works OK depending on information ( Which loads miss frequently independent of inputs? ( Which loads & stores communicate with each other? ( Stable across inputs ( Which loads & stores communicate with each other? ( Stable across inputs ( Which loads eaches usually taken/not-taken? ( Not so stable across inputs ( Popular research topic ( Marce Mar

## Scheduling Techniques

- Stall Removal
  - Separate load-use pairs
- Scope enlarging
  - For Loops: loop unrolling
  - For Non-loops:
    - Superblocks (biased branches)
    - Predication (non-biased branches)
- Exploit Data-Level Parallelism
  - Vectors

## Data-Level Parallelism

### Data-level parallelism (DLP)

- Single operation repeated on multiple data elements
   SIMD (Single-Instruction, Multiple-Data)
- Less general than ILP: parallel insns are all same operation
- Exploit with vectors

Old idea: Cray-1 supercomputer from late 1970s

- Eight 64-entry x 64-bit floating point "Vector registers"
   4096 bits (0.5KB) in each register! 4KB vector register file
- Special vector instructions to perform vector operations
  - Load vector, store vector (wide memory operation)
  - Vector+Vector addition, subtraction, multiply, etc.
  - Vector+Constant addition, subtraction, multiply, etc.
  - In Cray-1, each instruction specifies 64 operations!

33



35





34

## Why Vectorization is Awesome Have your cake and eat it, too All the benefits of a wider machine, without superscalar costs Single instruction fetch Wide reads & writes (without multiple \$ or regfile ports) Wider data to bypass ≠ N² bypass Execution width (implementation) vs vector width (ISA) Example: Pentium 4 and Core 1 execute vector ops at half width Core 2 executes them at full width Intel's Sandy Bridge brings 256-bit vectors to x86 Intel's Larrabee graphics chip brings 512-bit vectors to x86 Vector + superscalar? Sure! Multiple n-wide vector instructions per cycle

36