# **CSE 560** Computer Systems Architecture

Performance Modeling

Computer Architecture Simulator Primer

Q: What is an architectural simulator?

A: A tool that reproduces the behavior of a computing device

(Figure: a device maps system inputs to system outputs; a simulator takes the same inputs and also reports system metrics.)

Why use a simulator?

- leverage faster, more flexible S/W development cycle
- permits more design space exploration
- facilitates validation before H/W becomes available
- level of abstraction can be throttled to design task
- can tell us quite a bit about performance


# Functional vs. Behavioral Simulators

### **Functional Simulators**

- Implement instruction set architecture (what programmers see)
  - Execute each instruction
  - Takes real inputs, creates real outputs

### **Behavioral Simulators** (also called *Performance Simulators*)

- Implement the microarchitecture (system internals)
  - 5 stage pipeline
  - Branch prediction
  - Caches
- Go through the internal motions to estimate time (usually)
- Might not actually execute the program



- Previous versions of this class have used:
  - SimpleScalar: optimizes performance and flexibility
  - VHDL: optimizes detail
- We will use gem5 in this class
  - Cycle accurate chip multiprocessor
  - Used lots of places!

## Simulation Loop

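```
sim_time ← initial time
while (not done) {
    // phase 1: compute every next value from the current register values
    for each register r {
        new_r ← new value of r based on current register values
    }
    // phase 2: commit all new values together (the clock edge)
    for each register r {
        r ← new_r
    }
    sim_time ← sim_time + 1 clock
}
```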
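The two-phase loop above can be sketched in Python. This is a minimal illustration, not gem5 code: the `simulate` helper, the `next_state` dictionary, and the two-register swap circuit are all hypothetical names chosen for the example. It shows why new values are computed from the current state before any register is overwritten.

```python
def simulate(next_state, state, cycles):
    """Advance a clocked system by `cycles` ticks using a two-phase update."""
    sim_time = 0
    for _ in range(cycles):
        # Phase 1: compute every new value from the *current* state only.
        new_state = {name: fn(state) for name, fn in next_state.items()}
        # Phase 2: commit all registers at once (the clock edge).
        state = new_state
        sim_time += 1
    return state, sim_time


def naive_update(state):
    """Buggy in-place update: register b sees the already-updated a."""
    state = dict(state)
    state["a"] = state["b"]
    state["b"] = state["a"]
    return state


# Hypothetical circuit: two registers that swap values every cycle.
next_state = {
    "a": lambda s: s["b"],
    "b": lambda s: s["a"],
}

final, t = simulate(next_state, {"a": 1, "b": 2}, cycles=3)
broken = naive_update({"a": 1, "b": 2})
print(final, t, broken)
```

With the in-place version both registers end up holding the same value, which is not what the hardware would do.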

Latency versus Throughput

Single-cycle: insn0.fetch, dec, exec | insn1.fetch, dec, exec

Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec

- Can we have both low CPI and short clock period?
  - Not if datapath executes only one insn at a time
- Latency vs. Throughput
  - Latency: no good way to make a single insn go faster
  - + Throughput: luckily, single insn latency not so important
    - Goal is to make programs, not individual insns, go faster
    - Programs contain billions of insns
  - Key: exploit inter-insn parallelism
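A small arithmetic sketch may make the latency/throughput distinction concrete. The numbers are assumptions for illustration only: three stages of 1000 ps each, one million instructions, and an idealized pipeline with no stalls and no latch overhead. `runtime_ps` is a hypothetical helper.

```python
def runtime_ps(n_insns, cpi, clock_ps, fill_cycles=0):
    """Total runtime in picoseconds: (cycles needed) x (clock period)."""
    return (n_insns * cpi + fill_cycles) * clock_ps


STAGES = 3          # fetch, dec, exec
STAGE_PS = 1000     # assumed delay of each stage
N = 1_000_000       # assumed dynamic instruction count

# Single-cycle: CPI = 1, but the clock must cover all three stages.
single = runtime_ps(N, cpi=1, clock_ps=STAGES * STAGE_PS)

# Multi-cycle: short clock, but every insn takes three cycles.
multi = runtime_ps(N, cpi=STAGES, clock_ps=STAGE_PS)

# Ideal pipeline: short clock and one insn completes per cycle,
# after STAGES - 1 cycles spent filling the pipeline.
pipelined = runtime_ps(N, cpi=1, clock_ps=STAGE_PS, fill_cycles=STAGES - 1)

# Latency of one insn is unchanged by pipelining.
insn_latency_ps = STAGES * STAGE_PS

print(single, multi, pipelined, insn_latency_ps)
```

Under these assumptions the single-cycle and multi-cycle designs take the same time, while the pipelined design approaches a 3x throughput gain even though each instruction still takes 3000 ps from start to finish.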





Pipeline Terminology

- Five stage: Fetch, Decode, eXecute, Memory, Writeback
- Latches (pipeline registers) named by stages they separate
  - PC, F/D, D/X, X/M, M/W
  - `d_x.a = reg_file[f_d.ir<25:21>]`
  - `d_x.b = reg_file[f_d.ir<20:16>]`
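The latch assignments for `d_x.a` and `d_x.b` read two register numbers out of bit fields of the fetched instruction word. A minimal Python sketch of that decode step follows; the `bits` helper, the dictionary-based latches, the register contents, and the MIPS-style R-type encoding of the `add` are assumptions made for the example.

```python
def bits(word, hi, lo):
    """Return the bit field word<hi:lo> (inclusive, bit 0 is the LSB)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


# Assumed MIPS-style R-type word for add $3<-$2,$1: rs=2, rt=1, rd=3, funct=0x20.
ir = (2 << 21) | (1 << 16) | (3 << 11) | 0x20

reg_file = [10 * i for i in range(32)]  # made-up register contents
f_d = {"ir": ir}                        # F/D latch holds the fetched word

# Decode stage: read both source registers into the D/X latch.
d_x = {
    "a": reg_file[bits(f_d["ir"], 25, 21)],
    "b": reg_file[bits(f_d["ir"], 20, 16)],
}
print(hex(ir), d_x)
```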



Pipeline Example: Cycle 2 (instructions in the pipeline: `lw $4,0($5)`, `add $3<-$2,$1`)

Pipeline Example: Cycle 3 (instructions in the pipeline: `sw $6,4($7)`, `lw $4,0($5)`, `add $3<-$2,$1`)








# Parallel vs. Sequential

Hardware is parallel, simulation software is (typically) not.

- 5 stage pipeline vs. our `simulate()` method
  - Can't execute 5 stages in parallel, so... traverse the pipeline backwards
- HW table = software array: `dm_cache[index].data`, `dm_cache[index].tag`
- Anything more complicated? Serial approximation of parallel structure
  - Accessing all 4 ways in a set at once? Nope.
  - CAM lookup (find all entries with value X)? Nope.
  - Flush entire instruction window? Nope.
- Simulator is slower b/c it's in software and it's serial
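Two of these ideas can be sketched in a few lines of Python. This is an illustrative sketch only: `simulate_cycle`, `Line`, and `lookup` are hypothetical names, and instructions are represented as plain strings. The first function walks the pipeline from Writeback back to Fetch, so each stage is vacated before the stage behind it moves in. The second replaces a parallel tag compare across four ways with a serial loop.

```python
from dataclasses import dataclass

STAGES = ["F", "D", "X", "M", "W"]


def simulate_cycle(pipeline, next_insn):
    """Advance the pipeline one cycle, visiting stages back to front.

    `pipeline` maps stage name -> instruction (None is a bubble).
    Going backwards means no second copy of the latches is needed.
    """
    retired = pipeline["W"]
    for i in range(len(STAGES) - 1, 0, -1):
        pipeline[STAGES[i]] = pipeline[STAGES[i - 1]]
    pipeline["F"] = next_insn
    return retired


@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: int = 0


def lookup(cache_set, tag):
    """Serial stand-in for comparing the tag against all ways at once."""
    for way, line in enumerate(cache_set):
        if line.valid and line.tag == tag:
            return way
    return None


pipe = dict.fromkeys(STAGES)
for insn in ["add", "lw", "sw"]:
    simulate_cycle(pipe, insn)

four_way_set = [Line() for _ in range(4)]
four_way_set[2] = Line(valid=True, tag=0x1A, data=99)
hit_way = lookup(four_way_set, 0x1A)
miss_way = lookup(four_way_set, 0x2B)
print(pipe, hit_way, miss_way)
```

After three calls the oldest instruction has reached eXecute, with the two younger ones behind it in Decode and Fetch.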

# Simulator Types

- Software Simulators
  - Processor Core Simulators
  - Cache Simulators
  - Full-system Simulators
- Hardware Simulators (VHDL, Verilog, etc.)
  - You instantiate every wire
    - 3 register read ports in SW vs. HW
  - Less flexible
  - More complex (and complete) model of real system
  - Slower to develop
  - Can use FPGAs for emulation (huge benefit for speed!)


# gem5 Simulator Heritage

Authored in C++ and Python

# **Simulator Options**

Execution vs. Trace-Driven Simulation

Trace-based Simulator (input = dynamic insns)

- Reads "trace" of insns captured from a previous execution
- Easiest to implement, no functional component needed

Execution-driven Simulator (input = static insns)

- Simulator "runs" the program, generating a trace on-the-fly
- More difficult to implement, but has many advantages
- Direct-execution: instrumented program runs on host
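The difference between the two input styles can be sketched with a toy timing model. Everything here is hypothetical: the four-operation loop body, the `execute` generator standing in for a functional component, and the per-operation cycle costs in `LATENCY`. The same timing model is fed once from an on-the-fly instruction stream and once from a recorded trace.

```python
LATENCY = {"load": 2}  # assumed cycle costs; every other op takes 1 cycle


def timing_model(dynamic_insns):
    """Consume a stream of dynamic instructions and estimate cycles."""
    insts = cycles = 0
    for op in dynamic_insns:
        insts += 1
        cycles += LATENCY.get(op, 1)
    return insts, cycles


def execute(program, n_iters):
    """Toy functional component: runs a static loop body and yields
    the dynamic instruction stream as it goes."""
    for _ in range(n_iters):
        for op in program:
            yield op


static_program = ["load", "add", "store", "branch"]

# Execution-driven: input is the static program; the trace is generated on the fly.
exec_result = timing_model(execute(static_program, n_iters=3))

# Trace-driven: input is a trace captured from a previous execution.
recorded_trace = list(execute(static_program, n_iters=3))
trace_result = timing_model(recorded_trace)

print(exec_result, trace_result)
```

In this sketch the two approaches agree because the timing model never changes which instructions execute. The trace-driven path needs no functional component once the trace has been recorded.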

What is the input to the simulator?

### **Configuration File:**

- Configure the system being modeled (e.g., ISA, size of cache line, in-order vs. out-of-order execution)
- Specify the binary executable to simulate
- Control the simulation (start, stop, etc.)
- Literally is a Python file
  - Anything available in Python is available here
  - Python interpreter included in simulator!


# Simulator Output

### Three output files:

- config.ini and config.json
  - Lists every SimObject created and its parameters
  - Indicates "what did I actually simulate?"
- Results of simulation in stats.txt file
  - Dump of pretty much everything collected during simulation
- Command line option:
  - `-d DIR`: specify directory for output files
  - Overwrites output files if present

Sample Output

## But where is CPI?

- CPI is not one of the statistics provided directly in the stats.txt file
- What if we want to know CPI?
- Definition of CPI is average cycles/instruction
  - Simulator tells us cycles: `sim_ticks` (almost! wrong units, however; also need clock period)
  - Simulator tells us dynamic instructions: `sim_insts` (don't confuse this with micro-operations, `sim_ops`)
  - In effect, we are using perf. eqn. to solve for CPI
- Simulation tick time is 1 picosecond
  - So cycles = `sim_ticks` × 1 ps / $t_{\text{CLK}}$, and CPI = cycles / `sim_insts`
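The CPI calculation can be sketched as a short Python script. This is a sketch under stated assumptions: the sample text and its numbers are made up for illustration, the line layout (`name value # comment`) is assumed, and `parse_stats` and `cpi` are hypothetical helpers. The 1000 ps clock period corresponds to an assumed 1 GHz clock.

```python
def parse_stats(text):
    """Collect 'name value' pairs from stats-style lines, ignoring '#' comments."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            try:
                stats[fields[0]] = float(fields[1])
            except ValueError:
                pass  # skip banner lines and other non-numeric entries
    return stats


def cpi(stats, clock_period_ps, tick_ps=1):
    """Convert ticks to cycles, then divide by dynamic instruction count."""
    cycles = stats["sim_ticks"] * tick_ps / clock_period_ps
    return cycles / stats["sim_insts"]


# Made-up sample in the assumed layout; real output has many more lines.
sample = """
---------- Begin Simulation Statistics ----------
sim_ticks      500000000   # Number of ticks simulated
sim_insts      250000      # Number of instructions simulated
sim_ops        400000      # Number of ops (including micro ops) simulated
"""

stats = parse_stats(sample)
result = cpi(stats, clock_period_ps=1000)
print(result)
```

Dividing by the operation count instead of the instruction count would give cycles per micro-operation, which is a different (and smaller) number.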

Simulation and Performance Equation

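Program runtime:

$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$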

- **Instructions per program:** simulator can tell us directly
  - Including fractions of instruction types (e.g., %loads)
- **Cycles per insn:** "CPI" also can come from simulation
  - Sometimes indirectly (e.g., output is CPI × $t_{\text{CLK}}$)
  - This is often a complex function of other things:
    - Branch predictor
    - Cache behavior
  - Simulator can tell us model inputs (e.g., % predicted right)
- **Seconds per cycle:** clock period $t_{\text{CLK}}$, a simulator input
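One common first-order way to see how CPI depends on the branch predictor and cache is to add average stall cycles per instruction to a base CPI. The sketch below uses that simple additive model as an assumption; it is not how a detailed simulator computes CPI. All rates, penalties, the instruction count, and the 1 ns clock are made-up inputs, and `cpi_estimate` and `runtime_seconds` are hypothetical helpers.

```python
def cpi_estimate(base_cpi, stall_sources):
    """Base CPI plus average stall cycles per instruction from each source.

    Each source is (events per instruction, penalty in cycles per event).
    """
    return base_cpi + sum(rate * penalty for rate, penalty in stall_sources)


def runtime_seconds(insts, cpi_value, t_clk_seconds):
    """Performance equation: insts x cycles/insn x seconds/cycle."""
    return insts * cpi_value * t_clk_seconds


# Assumed inputs: 20% branches with 10% mispredicted, 2-cycle penalty;
# 30% loads with a 5% miss rate, 20-cycle penalty.
branch_stalls = (0.20 * 0.10, 2)
cache_stalls = (0.30 * 0.05, 20)

est_cpi = cpi_estimate(1.0, [branch_stalls, cache_stalls])
seconds = runtime_seconds(1_000_000_000, est_cpi, 1e-9)
print(est_cpi, seconds)
```

Here the cache term contributes far more than the branch term, which is the kind of breakdown the model inputs reported by a simulator make possible.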


# How to learn more about gem5

- There is a great tutorial text:
  - https://www.gem5.org/documentation/ (follow the "Learning gem5" link)
- Tutorial talks are available on YouTube:
  - www.youtube.com/watch?v=5UT41VsGTsq

# Honesty is the Best Policy

- It is your job to design an honest simulator
  - `sim_cycle = sim_cycle/2` → 2x performance improvement! Woo hoo!
- Intel simulators have strict types
  - Latched structures "know" about cycles
  - Throw error if you read more than n times per cycle
- What about cycle time?
  - What can you accomplish in hardware?
  - What can you accomplish in a cycle?

## Sanity Checks

- You must convince yourself that your simulator is working
  - If you cannot, you will never convince anyone else!
- Branch predictor gets 50% performance improvement?
  - Initial stats showing the phenomenon you exploit
  - How many branches are there?
  - What does perfect branch prediction offer?
  - What does a stupid branch predictor offer?
  - Sensitivity studies showing how your idea changes across different values

If you don't back up your results with secondary data, people will just think you're lying.

What about power?

- Static power: "charge" each structure for length of run
  - Cache leaks a certain amount of power just sitting there
  - Run for 200 ms, charge for that much leakage
- Dynamic power: "charge" per use
  - Read the cache 10,000 times in a run, charge for that
  - In fact, $0 \rightarrow 1$ has a different cost than $1 \rightarrow 0$ (*yikes*)
- What do we "charge"? → really hard to get right
  - Most academic power numbers are basically worthless
  - Squint. Trust the trends, not the numbers.
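The static and dynamic "charges" can be sketched as simple bookkeeping. The 200 ms run length and 10,000 cache reads come from the slide; the leakage power and energy per read are made-up values, and `structure_energy_pj` is a hypothetical helper. The sketch tracks energy (power integrated over the run) in integer picojoules, and it ignores the fact that a 0→1 transition costs something different from a 1→0 transition.

```python
def structure_energy_pj(leakage_uw, run_us, accesses, energy_per_access_pj):
    """Return (static, dynamic, total) energy in picojoules for one structure.

    Static:  leakage power x length of run (1 uW x 1 us = 1 pJ).
    Dynamic: a fixed charge per access x number of accesses.
    """
    static_pj = leakage_uw * run_us
    dynamic_pj = accesses * energy_per_access_pj
    return static_pj, dynamic_pj, static_pj + dynamic_pj


# Assumed cache parameters: 10 mW leakage, 50 pJ per read.
static_pj, dynamic_pj, total_pj = structure_energy_pj(
    leakage_uw=10_000,        # 10 mW expressed in microwatts
    run_us=200_000,           # 200 ms run expressed in microseconds
    accesses=10_000,          # cache reads during the run
    energy_per_access_pj=50,
)
print(static_pj, dynamic_pj, total_pj)
```

Because the per-access and leakage values are guesses, the absolute totals deserve little trust; comparing two configurations under the same assumed costs is more defensible than quoting either number alone.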
