Synonyms: Indeterminate switching, indecisive switching, unresolved switching, synchronizer fault
Found by: Transition Fault
- Summary Diagnosis and Suggested Solution Path
- Related Pathologies
- Suggested Solutions
- Related Topics
- Author Information
- Author Commentary
Summary Diagnosis and Suggested Solution Path
If you were directed here by M1 Oscilloscope Tools' Hidden Anomaly Locator (HAL), then this section contains suggestions for what might be happening in your waveform, and possible steps you can take to understand it further or resolve the problem.
HAL has observed waveform behavior that points to the possible presence of metastability or indecisive switching. Suggested next steps:
If what you were measuring was the output of a state-device (flip-flop, latch, register, synchronizer, etc.), then:
For computational structures (i.e. not synchronizers): You might want to verify that the timing requirements (setup/hold) at the input of the state-device are being reliably satisfied.
- If these are "close," or if they exhibit dynamics that bring them close to (or occasionally into) a violation, that timing is probably what is causing this behavior.
If the input timing and waveform fidelity are correct,
- Verify that the electrical environment for the device is not excessively noisy.
For synchronizers: Your synchronizer is still presenting metastable behavior to the synchronous environment.
- If this is happening only extremely rarely (probably meaning only once, since MTBFs for volume synchronizers are measured in 10s of thousands of years), that might be acceptable and you should consult the design objectives for your synchronizer to determine that.
- If it is happening frequently, it might be that the clock feeding the synchronizer has issues, that the power environment is excessively noisy, that the input to the synchronizer is changing more than once per cycle, or that the design of the synchronizer is insufficient for your application and needs to be re-engineered (more delay, etc.).
- If what you were measuring was the output of something that is not state-oriented, then you are not directly observing metastability. Refer to the Related Pathologies section below.
Metastability is a general concept that applies not only to electronic systems but also to other kinds of system... mechanical, human, etc. There are a number of good ways to think about metastable behavior. Metastability can be considered to be the manifestation of indecision in the way a system (or observer) reacts to inputs that are undefined in some way. It can be a way of characterizing a state which might, upon casual observation, appear stable in that the state appears to persist without apparent change... but which is, in fact, occupying a position between two or more truly stable states, into one of which the system will resolve once some typically very small amount of energy enters the system to cause it to displace slightly out of the metastable state. One of the notable features about metastability is that it is a state which can theoretically persist indefinitely. A second notable feature would be the statistics around persistence time... the longer the persistence, the lower the probability.
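The persistence statistics can be made concrete with a small numerical sketch. It rests on an assumption this entry does not state: near the balance point, the offset from the metastable level is taken to grow exponentially, v(t) = v0 * exp(t / tau), which is a common first-order model of a bistable element. The names `TAU`, `V_LOGIC`, `resolution_time`, and `survival_fraction`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math
import random

def resolution_time(v0, tau, v_logic):
    """Time for the offset to grow from v0 to v_logic.

    Model assumption: v(t) = v0 * exp(t / tau), so
    t = tau * ln(v_logic / |v0|).
    """
    return tau * math.log(v_logic / abs(v0))

def survival_fraction(times, t):
    """Fraction of trials still unresolved after time t."""
    return sum(1 for x in times if x > t) / len(times)

random.seed(1)
TAU = 50e-12     # hypothetical regeneration time constant (50 ps)
V_LOGIC = 1.0    # hypothetical offset at which the output counts as resolved

# Starting offsets spread uniformly over (0, V_LOGIC]; the sign is ignored
# because it only selects which stable state wins, not how long it takes.
times = []
for _ in range(100_000):
    v0 = V_LOGIC * (1.0 - random.random())
    times.append(resolution_time(v0, TAU, V_LOGIC))

for k in range(6):
    t = k * TAU
    print(f"t = {k} tau: simulated {survival_fraction(times, t):.4f}, "
          f"exp(-t/tau) = {math.exp(-t / TAU):.4f}")
```

Under these assumptions the unresolved fraction tracks exp(-t / tau): a trajectory can linger arbitrarily long in principle, but each additional tau of lingering is roughly e times less likely.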
Metastability might best be illustrated by a non-electronic example. <EXAMPLE OF METASTABILITY AS A STATE>
Metastability at the electrical level almost always refers to effects seen at the outputs of state-devices. Both flip-flops (edge triggered) and latches (level triggered) can exhibit the effect. A state device that is metastable is exhibiting a tendency to linger at a mid-level in between the stable states of high and low. Metastability is a significant consideration in both compute structures (state-architectures... typically combinational logic paths with state-devices at each end) as well as synchronizers (state devices arranged into a shift register employing a common clock, and used to ensure that signals arriving from a relatively asynchronous timing domain present with a legal, though possibly incorrect, state). The synchronizer reduces the probability that the signal presented to the input of the scope is in an indeterminate state.
While flops (edge triggered) and latches (level triggered) enter metastable states in slightly different ways, their presentation of the symptom is identical. The classic progression is shown in the diagrams at the left. As you move from one diagram to the next, the setup/hold requirement of the state-device has been progressively violated. The nominal operation of the state-device with properly timed inputs is shown in the single reference trajectory to the left in each diagram as it attempts to transition from one stable level/state to another.
Once you start to "push" the device's input timing requirements, the first visible symptom to manifest is a degraded transition time. If you look closely, you can also see there is some spreading of the trajectory in the latter half of the transition. This is sometimes called "pre-metastable" behavior.
- In the next diagram, the timing has been further violated. The transition time has degraded slightly more and there is now clear evidence of the trajectories spreading during the latter half of the transition. Upon reaching mid-transition, the output appears to slow very slightly (but not yet pause) and then continue on to the correct state. This is "low-grade" metastability, which, along with pre-metastable behavior, is significant in that while the output did make it to the correct final state 100% of the time, it did so with a degraded transition time. In this case, the rise time is effectively longer than the design anticipated. Of the four stages of metastable behavior, pre-metastable and low-grade metastability are the two that create the headaches. That subtle delay can cause a wide variety of symptoms downstream, as described below.
- Pushing the input timing violation slightly further... Upon reaching mid-transition, the output may linger for some extended period of time that is significant with respect to the nominal transition time. When this happens, it is almost purely random which state it will ultimately resolve into. This "mushroom cloud state" of indeterminate switching is referred to as "full bloom" metastability. During full bloom, individual trajectories can, theoretically, persist indefinitely. In practice, a trajectory will persist only until the energy of the arriving signals for the next cycle disturbs the delicate balance of the circuit and kicks it into one or the other of the stable states. The existence of a large number of incorrect final states usually makes full-bloom metastability very simple to track down.
- Pushing the input timing further, the output may begin to transition and then prior to reaching mid-transition reverse and fall back into the original/incorrect state. This is called "late stage" metastability. This is just "full-on busted" and is, diagnostically speaking, the easiest to track down.
Remember that metastability is a characteristic of a state-device. If you see what appears to be metastability at the output of combinational logic, it is either "passing" metastable behavior through (uncommon in this contributor's experience) from some state output upstream, or more likely it is a different pathology with a similar presentation (see Related Pathologies, below).
A healthy device will manifest metastable behavior when the setup and hold requirements are subtly violated. The most typical set of conditions the author has encountered is some combination of "arrive late (data), sample early (clock)."
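As a rough illustration of checking for "arrive late, sample early," the sketch below computes setup and hold margins from lists of edge timestamps. It is not how HAL or any particular tool performs the check; the function names, the timestamps, and the `t_setup`/`t_hold` values are all hypothetical, and real requirements come from the device's datasheet.

```python
from bisect import bisect_left

def setup_hold_margins(clock_edges, data_edges, t_setup, t_hold):
    """Return (clock_time, setup_margin, hold_margin) for each sampling edge.

    Both edge lists must be sorted and share one time unit. A negative
    margin is a violation; None means no data edge exists on that side.
    A data edge coincident with the clock edge is counted on the hold side.
    """
    results = []
    for tc in clock_edges:
        i = bisect_left(data_edges, tc)  # data_edges[:i] are strictly before tc
        setup = (tc - data_edges[i - 1]) - t_setup if i > 0 else None
        hold = (data_edges[i] - tc) - t_hold if i < len(data_edges) else None
        results.append((tc, setup, hold))
    return results

def violations(margins):
    """Clock edge times at which either margin is negative."""
    return [tc for tc, s, h in margins
            if (s is not None and s < 0) or (h is not None and h < 0)]

# Hypothetical edge timestamps in nanoseconds.
clock = [10.0, 20.0, 30.0]
data = [4.0, 19.7, 30.05]
margins = setup_hold_margins(clock, data, t_setup=0.5, t_hold=0.2)
for tc, s, h in margins:
    print(f"clock edge at {tc} ns: setup margin {s:.2f} ns, hold margin {h:.2f} ns")
print("violations at:", violations(margins))
```

In this made-up data set, the data edge just before the 20 ns clock edge arrives too late to meet setup, and the data edge just after the 30 ns clock edge arrives too soon to meet hold. Margins that are positive but small, or that drift toward zero over many acquisitions, are the "subtle" cases worth watching.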
Hot spots - There are some "usual suspects" that have come up over a few dozen "rescue consults" over the years with high performance computer systems (multi-phase clocks, heavy pipelines, multiple clock-quality domains, etc.). When the following special cases are present, the author will tend to short-circuit the usual top-down methodical search and immediately look into the health and design correctness of certain places first.
- Boundaries between mutually asynchronous timing domains
In latch based systems,
- For single-phase clocking - check the design rules for poor enforcement of minimum path delays.
- For multi-phase latch based systems - check the design rules for poor enforcement of necessary restrictions on the "communication geometry" within the state architecture.
- Stall/unstall logic boundaries - check that the design review process correctly validated this
- Latch/flop boundaries (never mix these if you don't have to)
- High/low/nominal margin clock-stressor circuits
- Circuits that implement dynamic throttling of clock-speed (e.g. power mgt)
- Not that anyone builds these sorts of systems... check along pattern boundaries in systolic arrays.
If we're going to really get down to the TRUE root cause of the majority of the cases this particular contributor has seen, you have to address the belief system and the quality of the decisions of the designers.
- Incorrect or poorly articulated and/or poorly understood design rules for implementing any of the scenarios above.
- Seat-of-the-pants decisions and assumptions like "I'm sampling/clocking way too fast to ever have a problem with the data" or "we've done it that way 1000 times and it always worked."
Metastable behavior is both a cause and an effect. The cause is simple: it is located upstream of the symptom and is discussed above. The effect can be quite significant and is located downstream, quite possibly very far downstream, making it potentially extremely frustrating to locate. One way of looking at a state-device exhibiting metastable behavior is as either a parametrically degraded transition or, in the extreme, a parametrically degraded transition followed by resolution into the incorrect state. The latter is easy to locate... full-on failures are boring and take very little effort to track down. However, the scenario where only parametric degradation (i.e. transition slowing) has occurred can be fascinating in the complexity of the ways that it can impact system operation. This degraded transition means there is a delay in reaching its required final logic state. Some scenarios that play out from there can include:
- Whatever downstream logic is consuming that transition can switch later than it otherwise would, leading to a delay in the arrival time of the data at a downstream latch/flop, causing THAT one to go metastable. If that metastability is merely more parametric degradation, the effect, or the location of the observable symptom, can move downstream... and in fact, way downstream... leaving the root cause someplace far upstream in a place you never thought to look.
- If you add to the preceding the statistical nature of when an output goes metastable or switches properly.... and then further factor in the way this particular effect (parametric delay) can "fan out" as it progresses downstream and that the times at which any given combinational logic segment/path is sensitized/pertinent is itself statistical, you can start to appreciate the scope and timing of when/where the symptoms might actually be observable. This behavior might manifest every 100th, or 100,000th, or 100,000,000th, or... operation, and occur virtually throughout the entire system from a simple timing problem far upstream. Tracking this pathology down to a root cause and location can be a significant challenge. The author has seen this take up to months to track down.
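The way these probabilities stack up can be sketched with a toy Monte Carlo model. Everything in it is an assumption made for illustration: the upstream output is slowed on a fraction `p_meta` of cycles, the affected logic path is sensitized on a fraction `p_path` of cycles, and the extra delay is drawn from an exponential distribution with mean `tau`. None of these names or values comes from a real system.

```python
import math
import random

def downstream_violation_rate(p_meta, p_path, tau, slack, trials, rng):
    """Fraction of cycles in which upstream slowing causes a downstream setup miss.

    p_meta: chance per cycle that the upstream output is parametrically slowed
    p_path: chance per cycle that the affected logic path is sensitized
    tau:    mean of the (assumed exponential) extra delay
    slack:  timing slack available on the downstream path
    """
    misses = 0
    for _ in range(trials):
        if rng.random() < p_meta and rng.random() < p_path:
            if rng.expovariate(1.0 / tau) > slack:
                misses += 1
    return misses / trials

rng = random.Random(7)
P_META, P_PATH = 0.02, 0.5        # hypothetical per-cycle probabilities
TAU, SLACK = 100e-12, 200e-12     # hypothetical delay scale and path slack

simulated = downstream_violation_rate(P_META, P_PATH, TAU, SLACK, 500_000, rng)
expected = P_META * P_PATH * math.exp(-SLACK / TAU)
print(f"simulated rate: {simulated:.2e}, analytic rate: {expected:.2e}")
print(f"roughly one downstream miss per {1 / expected:.0f} cycles")
```

Each factor multiplies the rate down, so a modest increase in slack or a rarely exercised path quickly turns "every 100th operation" into "every 100,000,000th," while the root cause stays exactly where it was.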
Other anomalies with similar presentations.
There are other anomalies that can present similarly to metastability. These would include transition steps due to switching faults, and reflections. To differentiate one from another, you might first look at the prevalence of the phenomena in other cycles.
- If you have low-grade metastability (nearly all of the transitions are resolving into the correct state), you will see only a fraction of the transitions presenting with this behavior on a given test node. That fraction typically ranges from extremely rare (one in numerous acquisitions) to a few percent of the total. All low-grade metastable transitions will be monotonic.
- If you have moderate to late-stage metastability (at least some of the transitions resolve into the wrong state) to severe metastability (can look like a runt pulse at the output with total resolution into the faulty state), you wouldn't actually be reading this because that's just too easy to diagnose!
- If you have a reflection, it's probably there in all, or approaching all, of the transitions. The duration of the ledge will be related to the physical distances of the transmission lines connected to the node.
- If you have a switching fault where the device dwells at a mid-state for a bit, the dynamics on that can range from extremely infrequent to prevalent. Unlike low-grade metastability, the trajectory does not have to remain monotonic (i.e. there can be a "glitch" on the transitions), but it can. The differential is achieved by examining the relative input timings during a faulty edge to determine if the inputs were properly timed. Proper timing points to the device itself being faulty. Improper timing means, of course, it's the inputs. And if it's right on the fuzzy line between both, ask yourself if you're having fun yet... If you say "yes," you should be working in timing; an answer of "no" means you might want to get into management. But to close on that, fuzzy can point to something more "systemic" in nature, like an issue in the power environment (noise, etc.), the transmission lines feeding the device (noise), or inadequate design rules. Or.... (I'd need more data to suggest next steps at that point).
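The differential above can be written down as a small decision function. This is only a sketch of the reasoning in the list, not a validated diagnostic: the class name, the field names, and the `NEARLY_ALL` cutoff for "all, or approaching all" are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LedgeObservation:
    fraction_affected: float   # share of transitions showing the ledge, 0..1
    all_monotonic: bool        # True if no affected transition glitches
    wrong_final_states: bool   # True if any transition resolved incorrectly
    inputs_met_timing: Optional[bool] = None  # on faulty edges; None = not checked

NEARLY_ALL = 0.9  # hypothetical cutoff for "all, or approaching all"

def suggest_pathology(obs):
    """Map an observation onto the most likely pathology from the list above."""
    if obs.wrong_final_states:
        return "moderate to late-stage metastability"
    if obs.fraction_affected >= NEARLY_ALL:
        return "reflection"
    if not obs.all_monotonic:
        # Low-grade metastable transitions are always monotonic.
        return "switching fault"
    if obs.inputs_met_timing is None:
        return "low-grade metastability or switching fault: check input timing"
    if obs.inputs_met_timing:
        return "switching fault (device)"
    return "low-grade metastability (input timing)"

print(suggest_pathology(LedgeObservation(0.98, True, False)))
print(suggest_pathology(LedgeObservation(0.01, True, False, inputs_met_timing=False)))
print(suggest_pathology(LedgeObservation(0.01, False, False)))
```

The ordering is a design choice: wrong final states are checked first because they are the easiest evidence to collect, and input timing is checked last because it requires capturing the inputs on the faulty edges.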
In compute structures (as opposed to synchronizers):
- If you have observable metastable behavior, you need to identify where the timing fault is (setup/hold violation) and correct that. It may be that the inputs (clock and/or data) of the device exhibiting the behavior have a simple design flaw, such as routing. Or it may be that the current device is receiving its data late because of parametric degradation due to upstream metastability, or some other cause entirely, but still upstream.
- Advice: don't assume too quickly the effect is local. Spend some time analyzing the incoming data's timing and health before you close the book. The metastability Agent in M1's Hidden Anomaly Locator can monitor a node for thousands of acquisitions unattended and let you know if there's ever any metastability present, or any of the other relatively hidden waveform anomalies that might have caused the faulty output for the current device. You can also tell it to look for setup/hold violations.
- If the metastable behavior is observed in more than one place, or if the location of the symptom migrates, this author has had vastly better success in first interviewing the design team for latent flaws in either their belief system or decision making around system timing requirements, rather than stabbing the system with a phalanx of probes.
In synchronizers - While metastability in a synchronizer is identical to metastability in any other state-device, the impact and the solutions are a bit different.
- If this is happening only extremely rarely (probably meaning only once, since MTBFs for volume synchronizers are measured in 10s of thousands of years), that might be acceptable and you should consult the design objectives for your synchronizer to determine that. It might also be your day to buy a lottery ticket.
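To get a feel for what the design objectives of a synchronizer might look like in numbers, the sketch below evaluates the widely used first-order estimate MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for resolution, tau is the device's regeneration time constant, T_w is its metastability window, and f_clk and f_data are the clock and data-transition rates. The formula is a standard textbook model rather than something taken from this entry, and every parameter value below is hypothetical; real values of tau and T_w come from device characterization.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def synchronizer_mtbf(t_resolve, tau, t_window, f_clock, f_data):
    """First-order MTBF estimate in seconds.

    MTBF = exp(t_resolve / tau) / (t_window * f_clock * f_data)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clock * f_data)

# Hypothetical device and system parameters.
TAU = 50e-12        # regeneration time constant
T_WINDOW = 100e-12  # metastability window
F_CLOCK = 100e6     # synchronizer clock rate
F_DATA = 1e6        # average rate of asynchronous data transitions
PERIOD = 1 / F_CLOCK

for label, t_r in [("half a cycle to resolve", 0.5 * PERIOD),
                   ("one full cycle to resolve", PERIOD)]:
    years = synchronizer_mtbf(t_r, TAU, T_WINDOW, F_CLOCK, F_DATA) / SECONDS_PER_YEAR
    print(f"{label}: about {years:.3g} years")
```

Because the resolution time sits in the exponent, adding delay helps far more than any linear change in clock or data rate, and a degraded tau (for example from a noisy supply) hurts just as dramatically.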
Indecision is treated in "knife edge" problems in philosophy and decision theory. There are many non-digital examples, as "decision uncertainty" is an everyday problem. But the grandfather of them all is probably Buridan's "hungry dog paradox". In that scenario, he noted that a hungry dog located precisely midway between two plates of food could very well starve to death while trying to resolve which to approach first.
Mike Williams, president of ASA Corp. Spent >20 years working on clock/timing issues before he experienced the life-ruining misfortune of running out of ideas about how to not have to act like the president of ASA. Prior to his current pathetic existence, the author had significant experience with metastability and related effects in:
- The design of computational timing environments and state-architecture communications schemes (i.e. how you would distribute and receive the clock for max performance and reliability)
- Identifying, fixing and preventing failure modes and failure presentations in the above, including some rather now-esoteric high-performance compute structures.
- Device level phenomena, whether in a state architecture or synchronizers.
Contributor has had only limited contact with current synchronizer structures other than as an observer. We would especially like to hear from anyone who has practical experience to add from that point of view to this entry.
My academics focused on clocking in very high-performance VLSI compute structures and my professional work (back when I was a happy engineer) focused entirely on clock and timing engineering in large mainframe and supercomputer kinds of systems, though I did have to "slum it" in the workstation and pc arena when those became where the action was. Metastability was to both me and my clients, what dirt is to motocross... it was there if you looked for it. I'll try and seed this space with a few war stories and first principles toward making the subject seem a bit more real. If time permits, I'll get a video or two in here sometime showing the effect and the impacts live.
Author Commentary on Metastability
It appears that, as with many things in the high-speed digital landscape today, significant amounts of practical institutional knowledge have been lost to the ages. In some ways, the industry is restarting. In preparing this entry, I found a lot of information about metastability written by and from an academic POV, but which interestingly usually ignores the great first contributions and what led to the initial awareness of the phenomena in the first place. Those works from non-academics were often dry restatements of what the academics had put out without commenting from the skilled practitioner's POV. It does lead to the impression that metastability/etc. is, to many, some rare and elusive mythical effect.
When I started grad school for computer architecture in the late 70's, you'd hear about spontaneous, unpredictable synch failures in the ARPAnet that would often take an IMP (machine connected to the ARPAnet) with it. The hardware heavies and the literature often seemed to cite the belief that a system that large and distributed was vulnerable to "cosmic ray" failures. I recall it ultimately coming out that these failures were indeed not cosmic ray failures (thus freeing that excuse up for something else!) but were discovered to be due to indecisive switching. My impression now is that the discovery was made by Thomas Chaney and the guys around him (Rosenberger, Molnar, etc.) at the Computer Systems Laboratory at Washington University in St. Louis. This stream of work seemed to be regarded by the high-end hardware guys of the day.. actual practitioners.. as the seminal work in metastability and synchronization failure. For my own foundation, the Technical Memorandums from that lab were the most significant work I could find.
Looking at the "engineering bibliography" that I started in grad school and kept current until the late 90's or so, I can distinctly recall that Chaney's "Measured Flip Flop Response to Marginal Triggering" (CSL Technical Memorandum # 296) was the most enlightening work for me. Chaney sent me an impressive bibliography he had assembled on synchronization failures that extended back to the early 50's.
Readers of the formal literature around switching content probably remember numerous papers on "perfect" solutions to the synchronization problem. From the very early 80's up through the early 90's at least, there was a standard cycle of "guaranteed synchronizer" followed by someone who analyzed it differently and shot the guarantee out of the sky. Maybe this is still going on. The analytical techniques, as I recall, involved state-transition diagrams and tables, as well as timing analysis. One repeating issue I recall personally having when looking at some of these was how to ensure asynchronous arrivals were limited to only one per cycle/state-change.
There were some approaches to synchronizing that were not just simple synchronizers. One I specifically recall was presented in a paper from Stucki and Cox at Cal Tech about metastability detectors, which was an approach I initially attempted to employ in my grad work on large 2D and 3D highly-regular compute structures. It seemed like a great idea for uniprocessor systems, but it didn't pan out for the larger structures. There were also things called clock stretchers which would work with metastability detectors and delay the sample until the asynchronously arriving signal had settled down. There was a complementary problem to "clock stretching" as I recall... something about getting the clock to the sampling point fast enough in latch based systems (remember that latch based architectures require minimum logic delays between stages to prevent races).
As noted above, I've been on a ridiculous number of "rescue consults" for clock and timing problems. Across all of that work, I have three strong impressions about how these effects affect the "average hardware guy".
- Metastable behavior can play a massive role in obfuscating the root problem by moving the timing and location of the symptom (or symptom cluster sometimes) in the system. This was MUCH more visible in large systems with a physical hierarchy comprised of cabinets, boards, and VLSI than the physically smaller systems that came later.
- What the vast majority of engineers know about tolerance management, structuring the state architecture to be less sensitive to clock tolerancing, structuring the clock-distribution/reception architecture to tolerance less, and the very non-digital pathologies is insufficient to prevent the possibility of things like metastability from occurring in their systems.
- PLL-based clock parts, when viewed in-situ, have their own flavor of weird that seems to sometimes extend beyond the belief system of even their designers (we never want to believe our "kids" are bad). Often, these operating/failure modes have been the root cause of the metastability we were looking for.
One of the often-proposed solutions to synch errors is to employ "asynchronous design". I actually still have my original notes (early 80's I believe) from my time at DEC in the VAX cpu group on the proposal by Sutherland and Sproull on the benefits and methodologies around doing fully asynchronous design (Sutherland was a significant contributor to early commercial computing.. the reader may be familiar with the still valid and widely employed concept of the "Wheel of Reincarnation" that came from the classic work he did on display processors around the late 60's with Myer). The notes cover both the proposed approach and the analysis done to implement that. The prevailing attitude at that time in our group was that while ASD did eliminate entire classes of failure attributable to synchronization issues in synchronous design, the design rules needed to implement asynchronous were, at the time, too complex to be executed by human designers. Remember that back then, we designed by drawing every gate and wire. This may not be the case with more automated design systems today. I do recall somewhere during the 90's a comment that a complete asynch implementation of a commercial microprocessor had been done in academia that was a high performer.
When she was very young, my daughter used to fall asleep at night to a metastability display on an HP 54720 scope showing a Moto 10kh131 flop having its timing precisely insulted by an HP 8133 pulse generator near her crib. When I'd set the persistence to about 800 msec, the dynamics were very soothing to her, and the fan noise from the gear helped too. She liked when I set the traces to red. A $120k lava lamp, but the stuff was lying around our living room going unused anyway.