




# Directory Coherence Protocols
- Observe: address space statically partitioned
  - (+) Can easily determine which memory module holds a given line
    - That memory module sometimes called "home"
  - (-) Can't easily determine which processors have line in their caches
  - Bus-based protocol: broadcast events to all processors/caches
    - (±) Simple and fast, but non-scalable
- Directories: non-broadcast coherence protocol
  - Extend memory to track caching information
  - For each physical cache line whose home this is, track:
    - Owner: which processor has a dirty copy (*i.e.*, M state)
    - Sharers: which processors have clean copies (*i.e.*, S state)
  - Processor sends coherence event to home directory
    - Home directory only sends events to processors that care
  - For multicore w/ shared L3, put directory info in cache tags
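
The owner/sharers bookkeeping above can be illustrated with a small simulation. This is a minimal sketch, not a real protocol: the names `DirectoryEntry` and `HomeDirectory`, the message strings, and the assumption that every request is a miss are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectoryEntry:
    """Per-line state kept at the line's home (hypothetical layout)."""
    owner: Optional[int] = None                 # processor holding the line in M, if any
    sharers: set = field(default_factory=set)   # processors holding the line in S


class HomeDirectory:
    """Tracks caching information for the lines whose home this is."""

    def __init__(self):
        self.entries = {}  # line address -> DirectoryEntry

    def _entry(self, line):
        return self.entries.setdefault(line, DirectoryEntry())

    def read(self, requester, line):
        """Handle a read miss; return the (destination, message) pairs sent."""
        entry = self._entry(line)
        if entry.owner == requester:
            return []  # requester already holds the line in M
        messages = []
        if entry.owner is not None:
            # Only the owner has current data: ask it to downgrade M -> S.
            messages.append((entry.owner, "downgrade"))
            entry.sharers.add(entry.owner)
            entry.owner = None
        entry.sharers.add(requester)
        return messages

    def write(self, requester, line):
        """Handle a write miss; invalidate only the caches that hold the line."""
        entry = self._entry(line)
        messages = []
        if entry.owner is not None and entry.owner != requester:
            messages.append((entry.owner, "invalidate"))
        for processor in sorted(entry.sharers - {requester}):
            messages.append((processor, "invalidate"))
        entry.owner = requester
        entry.sharers = set()
        return messages


directory = HomeDirectory()
directory.read(0, 0x40)                    # P0 gets a clean copy
directory.read(1, 0x40)                    # P1 gets a clean copy
invalidations = directory.write(2, 0x40)   # P2 wants to write
```

In this run only P0 and P1 receive invalidations; a processor that never cached the line (say P3) sees no traffic, which is the contrast with bus broadcast.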








# Directory Flip Side: Complexity

- Latency is not the only issue for directories
  - Subtle correctness issues as well
  - Stem from unordered nature of underlying inter-connect
- Individual requests to single cache must be ordered
  - Bus-based snooping: all processors see all requests in same order
    - Ordering automatic
  - Point-to-point network: requests may arrive in different orders
    - Directory has to enforce ordering explicitly
    - Cannot initiate actions on request B...
    - ...until all relevant processors complete actions on request A
    - Requires directory to collect acks, queue requests, *etc.*
- Directory protocols
  - (+) Obvious in principle
  - (-) Complicated in practice
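
The "collect acks, queue requests" rule can be sketched as follows. This is an illustrative model only: `OrderingDirectory`, its method names, and the idea of labeling requests with strings are assumptions, and real protocols have many more states and races.

```python
from collections import deque


class OrderingDirectory:
    """Hypothetical per-line serialization: one request in flight at a time."""

    def __init__(self):
        self.in_flight = {}      # line -> name of the request being processed
        self.pending_acks = {}   # line -> processors that still owe an ack
        self.waiting = {}        # line -> deque of (name, targets) queued behind it
        self.log = []            # order in which requests were actually started

    def request(self, line, name, targets):
        """targets: processors that must act (and ack) before the request completes."""
        if line in self.in_flight:
            # Request A is still collecting acks, so B has to wait its turn.
            self.waiting.setdefault(line, deque()).append((name, set(targets)))
        else:
            self._start(line, name, set(targets))

    def _start(self, line, name, targets):
        self.log.append(name)
        if targets:
            self.in_flight[line] = name
            self.pending_acks[line] = targets
        else:
            self._finish(line)

    def ack(self, line, processor):
        owed = self.pending_acks.get(line)
        if owed is None:
            return  # stray ack: nothing in flight for this line
        owed.discard(processor)
        if not owed:
            self._finish(line)

    def _finish(self, line):
        self.in_flight.pop(line, None)
        self.pending_acks.pop(line, None)
        queue = self.waiting.get(line)
        if queue:
            name, targets = queue.popleft()
            self._start(line, name, targets)


d = OrderingDirectory()
d.request(0x40, "A", targets={0, 1})
d.request(0x40, "B", targets={2})
started_before_acks = list(d.log)
d.ack(0x40, 1)                      # acks may arrive in any order
started_after_one_ack = list(d.log)
d.ack(0x40, 0)
started_after_all_acks = list(d.log)
```

Request B is started only after both processors involved in request A have acknowledged it, regardless of the order in which the acks arrive.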






# Hiding Store Miss Latency

- Recall (back from caching unit)
  - Hiding store miss latency
  - How? Store buffer
- Said it would complicate multiprocessors
  - Yes, it does!

# Write Misses and Store Buffers

- Read miss?
  - Load can't go on without the data → must stall
- Write miss?
  - Technically, no one needs data → why stall?
- Store buffer: a small buffer
  - Stores put addr/value to store buffer, keep going
  - Store buffer writes stores to D\$ in the background
  - Loads must search store buffer (in addition to D\$)
  - (+) Eliminates stalls on write misses (mostly)
  - (-) Creates some problems
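
A toy model of the store buffer behavior described above, assuming a dictionary stands in for D\$ and that draining happens one store at a time. The class `StoreBuffer` and its method names are hypothetical.

```python
from collections import deque


class StoreBuffer:
    """Hypothetical store buffer sitting between the processor and D$."""

    def __init__(self, cache):
        self.cache = cache       # dict standing in for D$: address -> value
        self.pending = deque()   # (address, value) in program order, oldest first

    def store(self, address, value):
        """Buffer the store and keep going, whether or not it would miss."""
        self.pending.append((address, value))

    def load(self, address):
        """Search the buffer youngest-first, then fall back to the cache."""
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.cache.get(address, 0)

    def drain_one(self):
        """Background step: write the oldest buffered store into the cache."""
        if self.pending:
            address, value = self.pending.popleft()
            self.cache[address] = value


cache = {0x10: 7}
sb = StoreBuffer(cache)
sb.store(0x10, 8)
sb.store(0x10, 9)
forwarded = sb.load(0x10)        # own loads see the youngest buffered store
visible_in_cache = cache[0x10]   # the cache (and everyone else) still sees the old value
sb.drain_one()
sb.drain_one()
after_drain = cache[0x10]
```

The gap between `forwarded` and `visible_in_cache` is one source of the problems mentioned in the last bullet: until the buffer drains, this processor and the rest of the system disagree about the value.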

[Figure: Processor, SB, D\$, WBB, Next-level cache]










# Shared Memory Summary

- Synchronization: regulated access to shared data
  - Key feature: atomic lock acquisition operation (e.g., t&s)
  - Performance optimizations: test-and-test-and-set, queue locks
- Coherence: consistent view of individual cache lines
  - Absolute coherence not needed, relative coherence OK
  - VI and MSI protocols, cache-to-cache transfer optimization
  - Implementation? snooping, directories
- Consistency: consistent view of all memory locations
  - Programmers intuitively expect sequential consistency (SC)
    - Global interleaving of individual processor access streams
    - Not always naturally provided, may prevent optimizations
  - Weaker ordering: consistency only for synchronization points
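
To make "global interleaving of individual processor access streams" concrete, the sketch below enumerates every SC-legal interleaving of two short streams. The streams `P0` and `P1` and the register names are assumptions chosen for illustration (a store-buffering style litmus test), not something taken from the slides.

```python
from itertools import combinations

# Two hypothetical access streams; x and y both start at 0.
P0 = [("store", "x", 1), ("load", "y", "r0")]
P1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each stream's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        slots = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(schedule):
    """Execute one global order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, location, arg in schedule:
        if op == "store":
            memory[location] = arg
        else:
            registers[arg] = memory[location]
    return registers["r0"], registers["r1"]


# Every outcome (r0, r1) that sequential consistency allows.
sc_outcomes = {run(schedule) for schedule in interleavings(P0, P1)}

# If each load effectively completes before that processor's own earlier
# store becomes visible (as a store buffer can allow), the order looks like this:
relaxed_outcome = run([P0[1], P1[1], P0[0], P1[0]])
```

Under SC at least one load must see the other processor's store, so `(0, 0)` is absent from `sc_outcomes`. The reordered schedule produces `(0, 0)`, which is an example of an optimization that SC would have to forbid.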


# Summary

[Figure: system layer stack (App, App, App; System software)]

- Thread-level parallelism (TLP)
- Shared memory model
  - Multiplexed uniprocessor
  - Hardware multithreading
  - Multiprocessing
- Synchronization
  - Lock implementation
  - Locking gotchas
- Cache coherence
  - Bus-based protocols
  - Directory protocols
- Memory consistency models


# Flynn's Taxonomy

- Proposed by Michael Flynn in 1966
- SISD: single instruction, single data
  - Traditional uniprocessor
- SIMD: single instruction, multiple data
  - Execute the same instruction on many data elements
  - Vector machines, graphics engines
- MIMD: multiple instruction, multiple data
  - Each processor executes its own instructions
  - Multicores are all built this way
  - SPMD: single program, multiple data (extension proposed by Frederica Darema)
    - MIMD machine, each node is executing the same code
- MISD: multiple instruction, single data
  - Systolic array
