

#### Metastability (What?)

Tom Chaney, tchaney@blendics.com Dave Zar, dzar@blendics.com



© 2010 Blended Integrated Circuit Systems, LLC

# **Metastability Is**



- a fundamental property of all bi-stable circuits (flip-flops and arbiters)
- the cause of ambiguous output voltages and unpredictable behavior
- the reason for setup & hold-time constraints on flip-flops
  - When observed, they eliminate metastability
  - When violated, they may lead to circuit malfunction
  - Satisfying the constraints perfectly between multiple independent clock domains is not possible





#### **Results for a D-Latch**



- Latch output before final inverter (clock is also shown).
- Rightmost two traces bracket the unbounded metastable point.





#### **Prototypical Master-Slave DFF**





#### **Results for a Master-Slave**



- Clock is shown in yellow.
- Other traces are obtained by varying the data-clock separation and observing the output of the FF before the output inverter.



#### **Real plots!**



Times and voltages are far from normal experience, and history dependent, so data must be collected slowly.





# **A Synchronizer Failure**





#### **Probability of Synchronizer Failure** (Noise Free Case First)





#### **Circuit Model Analysis**



#### Use small signal analysis



For  $V_0$  small

Result



# **MTBF for Synchronizers**



The probability of failure is the probability that the synchronizer output is unresolved at the next clock edge. With a uniform distribution of data events in a clock period,

$$P_{unresolved} = \frac{\Delta t}{T_C} = f_C \Delta t$$

With data events arriving at rate $f_D$, the failure rate is

$$f_D P_{unresolved} = f_D f_C \Delta t = \frac{1}{MTBF}$$

From the definition of $G_{tv}$ and the circuit model,

$$\Delta t = \frac{\Delta v}{G_{tv}}; \qquad \Delta v = 2V_{e}e^{-\frac{T_C}{\tau}}$$

we see that

$$MTBF = \frac{G_{tv}e^{\frac{T_C}{\tau}}}{2V_{e}f_{D}f_{C}}$$

Figure: distribution of data events over the clock period $T_C$, with the setup and hold region $\Delta t$ marked on the data trace. The voltage axis is marked at $0.33V_{DD}$ and $0.67V_{DD}$, with the regions labeled Resolved, Not Resolved, Resolved; further labels are $\Delta V$ and $2V_e$.
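The expression for $\Delta v$ is what a first-order model gives. A minimal sketch, assuming the latch's internal voltage difference grows as $V(t) = V_0 e^{t/\tau}$ and counts as resolved once $|V|$ reaches $V_e$; that model, the function names and the numeric values are assumptions chosen for illustration, not values from the slides.

```python
import math

def resolve_time(v0, v_e, tau):
    """Time for |V| to grow from |v0| to v_e, assuming V(t) = v0 * exp(t / tau)."""
    return tau * math.log(v_e / abs(v0))

def unresolved_window(v_e, tau, t_c):
    """Width of the band of starting voltages still unresolved after t_c.

    Starting points with |v0| < v_e * exp(-t_c / tau) need longer than t_c,
    so the band is 2 * v_e * exp(-t_c / tau), matching the delta-v expression.
    """
    return 2.0 * v_e * math.exp(-t_c / tau)

# Illustrative values only (assumed, not taken from the slides).
TAU = 40e-12      # regeneration time constant, seconds
V_E = 0.3         # resolution threshold, volts
T_C = 2e-9        # time allowed for resolution, seconds

dv = unresolved_window(V_E, TAU, T_C)
edge_time = resolve_time(dv / 2.0, V_E, TAU)   # a start exactly on the band edge
print(f"delta-v = {dv:.3e} V, edge resolves in {edge_time:.3e} s")
```

A start on the edge of the band resolves in exactly the allowed time, and anything closer to the metastable point takes longer. Each extra $\tau$ of waiting shrinks the band by a factor of $e$, which is where the exponential in the MTBF comes from.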



#### MTBF Based on Aperture Time

The probability of failure is the probability that the synchronizer output is unresolved at the next clock edge:









# Synchronizer Failure Trend



- System failures due to synchronizer failures have been rare, but will be more likely in the future
  - Many more synchronizers in use (Moore's Law)
    - Systems with 100s of synchronizers, perhaps 1000s soon
    - Systems with synchronizers in million-fold production
  - Small changes in  $V_t$  cause large changes in  $\tau$ 
    - Growing parameter variability in nano-scale circuits
      - In an IBM 90 nm process  $V_t$  varies from 0.4 to 0.58 V
    - Transistor aging increases vulnerability
      - An ASU model shows  $V_t$  increasing by 5% over 5 years

- Clocks in different clock domains may be correlated



# **Is There A Perfect Solution?**



- Theoretical results show metastability is a fundamental problem of all bi-stable circuits
- Failures caused by metastability are always a possibility
  - between two independently clocked domains
  - between a clock domain and outside world
- One solution uses asynchronous circuits, but real-time applications may still be problematic
- Another solution uses synchronizer circuits and designers must hope failures are rare



#### **Completion Detection**



- It is not possible to bound the amount of time needed for a synchronizer to settle.
- It is, however, possible to detect when the synchronizer has settled!
- This is only useful if the downstream logic can use this asynchronous completion signal





# What Could Go Wrong?

- It's easy to get a synchronizer design wrong
- The three most common pitfalls are:
  - using a non-restoring (or slowly restoring) flip-flop
    - τ needs to be small
  - not isolating the flip-flop feedback loop
  - using two flip-flops in parallel
- The last pitfall is doing everything "right" but not understanding what influences MTBF!













## **Correlated Clocks**





Although Cores A and B may be clocked at different rates, these rates are based on the same oscillator and are thus correlated. This relationship between the synchronizer's clock and data inputs can be very malicious.



#### **Correlated Clocks & Noise**



- The effects of correlated clocks and the effects of noise can be approached similarly.
- As we will see, circuit noise may be treated as one case of correlated clocks.





#### **Region of Vulnerability:** $\Delta t$



#### **Malicious Data Events**





#### Malicious Data Events Even More Malicious







### **Effects of Thermal Noise**



Bottom Line: Thermal noise pushes as many events into the window of vulnerability as it pushes out.
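A small Monte Carlo run can illustrate this balance. This is a sketch under stated assumptions: events are uniformly distributed, the noise is Gaussian and independent of the event position, and the window half-width, noise level and sample count are arbitrary illustrative numbers rather than measured values.

```python
import random

def fraction_in_window(samples, half_width):
    """Fraction of samples that fall inside [-half_width, +half_width]."""
    hits = sum(1 for v in samples if abs(v) <= half_width)
    return hits / len(samples)

rng = random.Random(0)          # fixed seed so the run is repeatable
N = 200_000
HALF_WIDTH = 0.05               # half of the vulnerable window (arbitrary units)
SIGMA = 0.02                    # noise standard deviation (same units)

# Uniformly distributed events, i.e. clock and data unrelated.
clean = [rng.uniform(-1.0, 1.0) for _ in range(N)]
# The same events after adding independent Gaussian noise.
noisy = [v + rng.gauss(0.0, SIGMA) for v in clean]

frac_clean = fraction_in_window(clean, HALF_WIDTH)
frac_noisy = fraction_in_window(noisy, HALF_WIDTH)
print(f"in window without noise: {frac_clean:.4f}, with noise: {frac_noisy:.4f}")
```

The two fractions agree to within sampling error, because a uniform distribution smeared by noise is still uniform away from its ends. The balance depends on that uniformity; it no longer holds when events cluster at one point.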



# Upper Bound on $P_{unresolved}$





Bottom Line: Thermal noise establishes an upper bound on  $P_{unresolved}$  and a lower bound on MTBF



# **Calculating MTBF**



- Always a stochastic calculation
  - Assume clock and data unrelated

$$MTBF(FF \text{ unresolved at } t) = \frac{G_{tv}e^{t/\tau}}{2V_{e}f_{D}f_{C}}$$

- If related, thermal noise gives lower bound
  - E.g. clock and data from same source or clockless

$$MTBF(FF \ unresolved \ at \ t) \approx \frac{\sigma}{V_e f_D} e^{t/\tau}$$

- Thermal noise voltage standard deviation:  $\sigma = \sqrt{2 kT/C}$
- This lower bound is 2 to 3 orders of magnitude smaller than the MTBF when clock and data are unrelated
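Dividing the two expressions above makes the size of the gap explicit. This is a sketch that takes both formulas at face value, including the approximation in the second, and writes $MTBF_n$ for the noise-limited bound:

$$\frac{MTBF}{MTBF_n} \approx \frac{G_{tv}e^{t/\tau}}{2V_{e}f_{D}f_{C}} \cdot \frac{V_{e}f_{D}}{\sigma e^{t/\tau}} = \frac{G_{tv}}{2\sigma f_{C}}$$

The exponential and the $V_e f_D$ terms cancel, so the gap depends only on $G_{tv}$, $\sigma$ and $f_C$, and it narrows as the clock frequency rises. A gap of 2 to 3 orders of magnitude corresponds to $G_{tv}/(2\sigma f_C)$ between roughly $10^2$ and $10^3$.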



#### **MTBF Affects System Behavior**



- Assume:
  - Desired probability of system failure = 1 : 2,000,000
  - System lifetime is 30 years (~ 10<sup>9</sup> sec)
  - System has 50 processors with 10 synchronizers each
- Then:
  - Need MTBF of 30 billion years (3·10<sup>10</sup> years) per synchronizer
- But:
  - Corner cases can further reduce the MTBF actually achieved
  - If clock and data are related, must use lower bound set by thermal noise: MTBF<sub>n</sub>
- Unwise to use conventional MTBF formula without understanding its limitations
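The 30-billion-year figure can be checked with a few lines of arithmetic. This sketch assumes failures are rare enough that the probability of failing within the lifetime is approximately lifetime divided by MTBF, and that the synchronizers fail independently so their failure rates add; the variable names are illustrative.

```python
SYSTEM_LIFETIME_YEARS = 30
ALLOWED_FAILURE_PROBABILITY = 1 / 2_000_000
PROCESSORS = 50
SYNCHRONIZERS_PER_PROCESSOR = 10

n_sync = PROCESSORS * SYNCHRONIZERS_PER_PROCESSOR

# For rare failures, P(fail within lifetime) ~= lifetime / MTBF_system.
mtbf_system_years = SYSTEM_LIFETIME_YEARS / ALLOWED_FAILURE_PROBABILITY

# Independent synchronizers: failure rates add, so each one needs an MTBF
# n_sync times larger than the system as a whole.
mtbf_per_sync_years = mtbf_system_years * n_sync

print(f"system MTBF needed:    {mtbf_system_years:.1e} years")
print(f"per-synchronizer MTBF: {mtbf_per_sync_years:.1e} years")
```

The system as a whole needs an MTBF of 60 million years, and spreading that budget over 500 synchronizers gives 3·10<sup>10</sup> years each.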



#### Master-Slave DFF MTBF Examples



| Clock Frequency (MHz) | MTBF (yrs) | MTBF <sub>n</sub> (yrs) |
|-----------------------|------------|-------------------------|
| 200                   | 9.7E+37    | 2.1E+35                 |
| 300                   | 4.3E+19    | 1.4E+17                 |
| 500                   | 7.5E+04    | 4.1E+02                 |
| 750                   | 2.7E-03    | 2.2E-05                 |

90 nm process

 $\tau$ = 39.83 ps, G<sub>tv</sub> = 0.375 V/ns, f<sub>D</sub> = 133 MHz

125 ps setup time assumed

MTBF ranges from 1 day to 9.7·10<sup>37</sup> years

MTBF<sub>n</sub> ranges from 11.5 minutes to 2.1·10<sup>35</sup> years
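The frequency dependence in the table can be explored by evaluating the MTBF formula for unrelated clock and data directly. This is a sketch, not the authors' calculation. It assumes the resolution time is one clock period minus the setup time, and it assumes V<sub>e</sub> = 0.33 V, a value the slides do not state. The function name `mtbf_years` is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def mtbf_years(f_clock_hz, tau_s, g_tv_v_per_s, f_data_hz, v_e, t_setup_s):
    """MTBF = G_tv * exp(t / tau) / (2 * V_e * f_D * f_C), returned in years.

    ASSUMPTION: the resolution time t is one clock period minus the setup time.
    """
    t_resolve = 1.0 / f_clock_hz - t_setup_s
    mtbf_seconds = (g_tv_v_per_s * math.exp(t_resolve / tau_s)
                    / (2.0 * v_e * f_data_hz * f_clock_hz))
    return mtbf_seconds / SECONDS_PER_YEAR

# Parameters quoted with the table.
TAU = 39.83e-12        # seconds
G_TV = 0.375e9         # 0.375 V/ns expressed in V/s
F_DATA = 133e6         # Hz
T_SETUP = 125e-12      # seconds
# ASSUMPTION: V_e is not stated in the slides; 0.33 V is an illustrative guess.
V_E = 0.33

results = {mhz: mtbf_years(mhz * 1e6, TAU, G_TV, F_DATA, V_E, T_SETUP)
           for mhz in (200, 300, 500, 750)}
for mhz, years in results.items():
    print(f"{mhz} MHz: {years:.1e} years")
```

Under these assumptions the results fall close to the MTBF column above. The steep drop with clock frequency comes almost entirely from the shorter resolution time in the exponent, not from the $f_C$ term in the denominator.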



#### Parameter Variations in Master-Slave: Process-Voltage-Temperature, 200 MHz



| Corner             | τ (ps) | G<sub>tv</sub> (V/ns) | MTBF (yrs) | MTBF <sub>n</sub> (yrs) |
|--------------------|--------|------------|------------|-------------------------|
| -3 sigma           | 106.49 | 0.369      | 5.07E+04   | 1.12E+02                |
| -1 sigma           | 55.50  | 0.543      | 1.37E+23   | 2.06E+20                |
| Nominal 0 degrees  | 39.30  | 0.751      | 1.00E+39   | 1.04E+36                |
| Nominal 27 degrees | 39.83  | 0.375      | 9.79E+37   | 2.13E+35                |
| Nominal 70 degrees | 41.01  | 0.301      | 2.29E+36   | 6.65E+33                |
| 1 sigma            | 28.98  | 0.866      | 1.80E+58   | 1.70E+55                |
| 3 sigma            | 16.69  | 0.031      | 4.16E+110  | 1.09E+109               |

200 MHz Clock; 90 nm process, 125 ps setup time

MTBF ranges from  $5.07 \cdot 10^4$  years to  $4.16 \cdot 10^{110}$  years

MTBF<sub>n</sub> ranges from 112 years to  $1.09 \cdot 10^{109}$  years
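Most of the spread in the table comes from $\tau$ sitting in the exponent. The sketch below isolates that term and counts how many decades of MTBF are lost going from the nominal 27 degrees row to the -3 sigma row. It assumes the resolution time is the clock period minus the setup time, and the helper name `decades_of_mtbf` is hypothetical.

```python
import math

def decades_of_mtbf(t_resolve_s, tau_s):
    """Contribution of exp(t / tau) to log10(MTBF)."""
    return t_resolve_s / tau_s / math.log(10.0)

# 200 MHz clock with 125 ps setup time, as in the table.
T_RESOLVE = 1.0 / 200e6 - 125e-12     # 4.875 ns

tau_nominal = 39.83e-12               # nominal, 27 degrees row
tau_slow = 106.49e-12                 # -3 sigma row

lost = decades_of_mtbf(T_RESOLVE, tau_nominal) - decades_of_mtbf(T_RESOLVE, tau_slow)
print(f"exponential term shrinks by about {lost:.1f} decades")
```

The exponential term alone accounts for roughly 33 orders of magnitude, in line with the drop from 9.79E+37 to 5.07E+04 years. The change in G<sub>tv</sub> between the two rows (0.375 to 0.369 V/ns) is negligible by comparison.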




#### Latch Versus Master-Slave FF MTBF @200 MHz



| Circuit         | τ (ps) | G<sub>tv</sub> (V/ns) | MTBF (yrs) |
|-----------------|--------|------------|------------|
| Master-Slave FF | 39.83  | 0.375      | 9.8E+37    |
| Latch           | 40.54  | 4.729      | 1.4E+38    |

200 MHz Clock; 90 nm process, 125 ps setup time

