I am often amazed that any chips actually work. We all know that verification by simulation is based on a sampling of possible stimulus and we also know that the number of samples we provide is a tiny, tiny fraction of all of the possible stimulus patterns. Even so, verification is taking up a larger percentage of our time and increasing. We use things such as functional coverage to try and steer some of those samples towards particular areas of the design, but as system complexity grows and the amount of concurrency increases, the fraction of the total stimulus space that is actually covered is being reduced every day. Some people are also aware that most simulation never actually checks that things work correctly; they just look for obvious signs that nothing has gone wrong. As the sequential depth of designs increase, constrained random is also less likely to hit many interesting cases, so more companies are being forced to spend a greater percentage of their time directing testing. So why is it that any chips actually work?
So first of all, there are the cases where the chips doesn’t fully work, but it works well enough. This can be at two different levels: either the problems are fixable in software (problem avoided), or the functionality is reduced, but still good enough to ship the product (reduced functionality).
One of the big saving graces is that few SoCs use all of the available functionality. The system places constraints that prevent many capabilities from ever being activated, so this means that many problems can exist with no effect. Just think of an IP block – perhaps one that comes from a third party. It is possible that it has been used in many designs in the past, and yet as soon as you integrate it into your design, you find problems with it, and start to question the quality of the block. The same is true with software as well. I can remember back to the days when I actually developed software (yes a long time ago). Each and every time we got a new customer we learn to expect a rash of new bugs being filed and the customer getting upset about our perceived quality levels. Those bugs came because they were using the product in different ways than the previous customers, and as soon as that first batch of bugs were cleared, they were unlikely to find many other problems. Software often constrains the functionality that can be activated in the hardware in addition to the hardware constraints, although if software updates are possible, it can trigger problems being found in the hardware at a later date.
Let’s consider a specific example that I have heard other people mention. If we have a state machine, it is clear that we need to verify that we can reach all of the states, make the right transitions between the states, and that the right things happen in each state. If we add a second state machine, we also need to verify the same for that one. If each state machine executes concurrently, then do we need to verify all combinations of each test in each state machine? In other words do we have 2X the tests or is it a square of the number of tests. Reality is probably somewhere in between based on how one machine can influence the other and produce different outcomes. But how does that get formalized?
So, in the verification world it seems as if we have never been able to get a handle on the real space that needs to be verified. How much time do we spend performing verification at the block level that could never be activated once integrated? How many actual or implied constraints exist in the system that carve out great chunks of the possible verification space? For most chips to work, it would seem as if we must be doing some of this without really thinking about it, but one would think that after doing this stuff for 40 or more years now, we would have found a way to formalize this. How much time do we waste doing simulations that don’t give us any useful information? How much more efficient could verification be?
If anyone has a better handle on this problem, I would love to hear from you. If there is any promising research that can help guide us towards the important test cases to be run I am sure we would all be thankful. Until then it would appear that we enjoy gambling and we have had a pretty good run. When does our luck run out? Do you think it is luck or a sixth sense on the part of the verification engineers?
Brian Bailey – keeping you covered