

## Remainder of CSE 560: Parallelism



- So far: pipelining
  - Execute one instruction in parallel with decode of next
- Next: instruction-level parallelism (ILP)
  - Execute multiple independent instructions fully in parallel
  - Today: multiple issue
  - In a few weeks: dynamic scheduling
    - Extract much more ILP via out-of-order processing
- Data-level parallelism (DLP)
  - Single-instruction, multiple data
  - Ex: one instruction, four 16-bit adds (using 64-bit registers)
- Thread-level parallelism (TLP)
  - Multiple software threads running on multiple cores
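
To make the DLP example above concrete, here is a minimal Python sketch of what a "four 16-bit adds in one 64-bit register" instruction computes. The helper names (`pack`, `unpack`, `simd_add16`) are hypothetical, and the loop only models the semantics; real hardware performs the four lane adds at once.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1  # 0xFFFF


def pack(values):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def simd_add16(a, b):
    """Lane-wise add with wraparound: a carry never crosses into the next lane."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result


x = pack([1, 2, 3, 4])
y = pack([10, 20, 30, 40])
print(unpack(simd_add16(x, y)))
```

The key difference from an ordinary 64-bit add is the per-lane mask: an overflow in one lane wraps around instead of spilling into its neighbor.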

## Superscalar Challenges - Back End

- Wide instruction execution
  - Replicate arithmetic units
  - Perhaps multiple cache ports
- Wide bypass paths
  - More possible sources for data values
  - Order (N<sup>2</sup> × P) for an *N*-wide machine with execute pipeline depth P
- Wide instruction register writeback
  - One write port per instruction that writes a register
  - Example: 4-wide superscalar → 4 write ports
- Fundamental challenge:
  - Amount of ILP (instruction-level parallelism) in the program
  - Compiler must schedule code and extract parallelism

## How Much ILP is There?

- The compiler tries to "schedule" code to avoid stalls
  - Hard for scalar machines (to fill load-use delay slot)
  - Even harder to schedule multiple-issue (superscalar)
- Even given unbounded ILP, superscalar has limits
  - IPC (or CPI) vs clock frequency trade-off
  - Given these challenges, what is reasonable N? 3 or 4 today
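
One rough way to put a number on "how much ILP" is to divide the instruction count by the length of the longest dependence chain. The sketch below does that under loudly simplifying assumptions that are not from the slides: unit latency for every instruction, only true (read-after-write) register dependences, and unlimited hardware. The function name `ilp_limit` and the `(dest, sources)` encoding are hypothetical.

```python
def ilp_limit(insns):
    """insns: list of (dest, [sources]) in program order.

    Returns (instruction count, critical path length, ILP), assuming
    unit latency, RAW dependences only, and no resource limits.
    """
    if not insns:
        return 0, 0, 0.0
    ready = {}  # register -> cycle after which its newest value is available
    depth = 0
    for dest, srcs in insns:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        finish = start + 1
        ready[dest] = finish
        depth = max(depth, finish)
    return len(insns), depth, len(insns) / depth


program = [
    ("r1", ["r0"]),        # r1 <- f(r0)
    ("r2", ["r1"]),        # depends on r1
    ("r3", ["r0"]),        # independent of the first chain
    ("r4", ["r3"]),        # depends on r3
    ("r5", ["r2", "r4"]),  # joins both chains
]
count, depth, ilp = ilp_limit(program)
print(count, depth, round(ilp, 2))  # 5 insns, critical path 3
```

Even with unbounded hardware this program cannot exceed 5/3 instructions per cycle, which is why the amount of ILP in the code, not the issue width, is often the binding limit.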

## Wide Decode

- What is involved in decoding multiple (N) insns per cycle?
- Actually doing the decoding?
  - Easy if fixed length (multiple decoders), doable if variable
- Reading input registers?
  - 2N register read ports (latency ∝ #ports)
  - Actually < 2N, most values come from bypasses (more later)
- What about the stall logic?
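
As one possible answer to the stall-logic question, the sketch below models an in-order N-wide issue check: each instruction's sources are compared against the destinations of every earlier instruction in the same group, and against results that are still in flight. This is an illustrative model, not the course's design; it only considers read-after-write hazards, and the names `issuable_prefix`, `pending` and `cross_check_comparisons` are hypothetical.

```python
def issuable_prefix(group, pending=()):
    """Return how many leading insns of `group` can issue together, in order.

    group:   list of (dest, [sources]) fetched this cycle
    pending: registers whose in-flight values cannot be bypassed yet
    """
    pending = set(pending)
    earlier_dests = []
    for i, (dest, srcs) in enumerate(group):
        # Stall on an older in-flight result (e.g. a load-use hazard).
        if any(s in pending for s in srcs):
            return i
        # Cross-check: every source against every earlier dest in the group.
        if any(s == d for s in srcs for d in earlier_dests):
            return i
        earlier_dests.append(dest)
    return len(group)


def cross_check_comparisons(n, srcs_per_insn=2):
    """Comparators needed for the intra-group check; grows as N^2."""
    return srcs_per_insn * n * (n - 1) // 2


group = [
    ("r1", ["r2", "r3"]),
    ("r4", ["r5", "r6"]),
    ("r7", ["r1", "r4"]),  # needs results produced earlier in this group
    ("r8", ["r9", "r9"]),
]
print(issuable_prefix(group))       # 2
print(cross_check_comparisons(4))   # 12
```

Because issue is in order, the independent fourth instruction is also held back once the third one stalls.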

## Avoid N<sup>2</sup> Bypass/RegFile: Clustering

[Figure: cluster 0 and cluster 1]

- Clustering: group ALUs into K clusters
  - Full bypassing within a cluster, limited bypassing between clusters
    - Values crossing clusters come from the regfile with a 1-2 cycle delay
  - (+) N/K non-regfile inputs at each mux, N<sup>2</sup>/K point-to-point paths
  - (-) Key to performance: steering dependent insns to same cluster
  - (-) Hurts IPC, helps clock frequency (or wider issue at same clock)
- Typically uses replicated register files (1 per cluster)
- Alpha 21264: 4-way superscalar, two clusters
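
The scaling quoted above can be tabulated with a few lines of Python. This is only the N/K and N<sup>2</sup>/K arithmetic from the slide, ignoring pipeline depth and constant factors; the function name `bypass_cost` is hypothetical, and K = 1 stands for an unclustered design.

```python
def bypass_cost(n, k=1):
    """Return (non-regfile inputs per mux, point-to-point bypass paths).

    Uses the N/K and N^2/K scaling for an N-wide machine with K clusters;
    K = 1 models a single, unclustered bypass network.
    """
    if k < 1 or n % k != 0:
        raise ValueError("n must be a positive multiple of k")
    return n // k, (n * n) // k


for n, k in [(4, 1), (4, 2), (8, 1), (8, 2), (8, 4)]:
    inputs, paths = bypass_cost(n, k)
    print(f"N={n} K={k}: {inputs} inputs per mux, {paths} paths")
```

For a 4-wide machine with two clusters (the Alpha 21264 configuration named above), this gives 2 inputs per mux and 8 paths instead of 4 and 16.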











- Statically-scheduled (in-order) superscalar
  - (+) Executes unmodified sequential programs
  - (-) Hardware must figure out what can be done in parallel
  - E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
- Very Long Instruction Word (VLIW)
  - (+) Hardware can be dumb and low power
  - (-) Compiler must group parallel insns, requires new binaries
  - E.g., Transmeta Crusoe (4-wide)
- Explicitly Parallel Instruction Computing (EPIC)
  - A compromise: compiler does some, hardware does the rest
  - E.g., Intel Itanium (6-wide)
- Dynamically-scheduled superscalar
  - Pentium Pro/II/III (3-wide), Alpha 21264 (4-wide)
- We've already talked about statically-scheduled superscalar
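
To illustrate what "compiler must group parallel insns" can mean for VLIW, here is a deliberately naive sketch of packing a sequential instruction stream into fixed-width bundles padded with nops. It is not how any real VLIW compiler works (those reorder code aggressively). The assumptions are: greedy in-order packing, register dependences only, and all reads in a bundle happening before its writes. The names `pack_bundles` and `NOP` are hypothetical.

```python
NOP = ("nop", None, [])


def pack_bundles(insns, width):
    """Greedily pack (op, dest, [sources]) insns into bundles of `width` slots.

    An insn starts a new bundle if the current one is full, or if it reads
    or rewrites a register already written in the current bundle.
    """
    bundles, current, written = [], [], set()
    for insn in insns:
        _, dest, srcs = insn
        conflict = dest in written or any(s in written for s in srcs)
        if len(current) == width or conflict:
            bundles.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append(insn)
        if dest is not None:
            written.add(dest)
    if current:
        bundles.append(current + [NOP] * (width - len(current)))
    return bundles


code = [
    ("add", "r1", ["r2", "r3"]),
    ("add", "r4", ["r5", "r6"]),
    ("add", "r7", ["r1", "r4"]),
    ("add", "r8", ["r9", "r9"]),
]
for width in (2, 4):
    print(width, [[op for op, _, _ in b] for b in pack_bundles(code, width)])
```

The same program yields different bundle layouts for width 2 and width 4, which is a small-scale view of why VLIW binaries are tied to one machine width.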


## History of VLIW

- Started with "horizontal microcode"
- Academic projects
  - Yale ELI-512 [Fisher, '85]
  - Illinois IMPACT [Hwu, '91]
- Commercial attempts
  - Multiflow [Colwell+Fisher, '85] → failed
  - Cydrome [Rau, '85] → failed
  - Motorola/TI embedded processors → successful
  - Intel Itanium [Fisher+Rau, '97] → ??
  - Transmeta Crusoe [Ditzel, '99] → mostly failed


## Trends in Single-Processor Multiple Issue

|       | 486  | Pentium | PentiumII | Pentium4 | Itanium | ItaniumII | Core2 |
|-------|------|---------|-----------|----------|---------|-----------|-------|
| Year  | 1989 | 1993    | 1998      | 2001     | 2002    | 2004      | 2006  |
| Width | 1    | 2       | 3         | 3        | 3       | 6         | 4     |

- Canceled Alpha 21464 was 8-way issue
- No justification for going wider
- HW or compiler "scheduling" needed to exploit 4-6 effectively
  - Out-of-order execution (or VLIW/EPIC)
- For high-performance *per watt* cores, issue width is ~2
  - Advanced scheduling techniques not needed
  - Multi-threading (a little later) helps cope with cache misses




## What Does VLIW Actually Buy You?

- (+) Simpler I\$/branch prediction
- (+) Simpler dependence check logic
- Doesn't help bypasses or regfile
  - Which are the much bigger problems!
  - Although clustering and replication can help VLIW, too
- (-) Not compatible across machines of different widths
  - Is non-compatibility worth all of this?
- How did Transmeta deal with compatibility problem?
  - Dynamically translates x86 to internal VLIW