Trouble Ahead For IC Verification

“Often, multiple factors build on top of each other. “One of the ways that we address safety is things like triple voting flops,” says Graham. “If you triple the number of flops, you are driving gate count higher, and that is part of complexity. It is also driving the need for things like 3D-IC and chiplets. These require a broad array of experts. It’s no longer just a digital designer. We need a digital and analog designer, a digital verification engineer. Now you need a safety verification engineer or a safety designer. And then the system-level expert. You need all of your teams to have some expertise in this, and you need these vertical experts, which is driving up the requirement for more talent. You’re either taking a good verification, or good design engineer, good place-and-route person, and saying you’re going to be our expert in this vertical. Now you need to replace that person. Or you need to bring that in, and suddenly your team is that much larger.”
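
The triple voting flops Graham mentions can be illustrated with a minimal sketch, assuming the usual scheme of three redundant copies of each stored bit feeding a 2-of-3 majority voter. The function name and the register value are hypothetical, chosen only for illustration; real designs do this in hardware, not software.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority across three redundant copies of a register."""
    return (a & b) | (b & c) | (a & c)


# Hypothetical 8-bit register value, stored three times.
stored = 0b1011_0010
copy_a, copy_b, copy_c = stored, stored, stored

# Simulate a single-bit upset in one of the three copies.
copy_b ^= 0b0000_1000

# The voter masks the upset because the other two copies still agree.
voted = majority_vote(copy_a, copy_b, copy_c)
print(f"corrupted copy: {copy_b:08b}, voted value: {voted:08b}")
```

The cost is visible in the sketch: every protected bit is stored three times and needs extra voting logic, which is where the higher gate count comes from.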

Functional verification has become totally wrapped into the entire development flow. “Power is a good example,” says Salz. “You need to be able to estimate power early on, and make sure, at the system level, that you’re staying within that envelope. Some things are going to get worse. Thermal is a big thing with multi-dies. You need to start thinking about all of these things. You need to think about how you are going to test it once you’ve integrated the chiplets. Are you going to test as you integrate, or are you going to just build the entire SoC and then test it?”

Some point tools may not yet be ideally integrated into flows. “Thermal is a new consideration, but many of the tools available use a brute-force dumb approach of doing synthesis to get everything to gates, just to figure out the power,” says Swinnen. “Synthesis takes a long time, a lot of resources, and is very capacity-limited. A better approach is to use heuristics that are based on many years of experience. These look at the circuit and divide it into different function types. This is random logic, this is memory access, this is clock logic, this is data path logic, and apply different heuristics to each of these regions. It builds up an estimate of the power in the system. It is often within 5% of the final gate level numbers. The point of this is optimization. ‘I’ve got this architecture. What if I change it? Would that increase the power or lower the power?’ You’re not worried about the last watt. You’re worried about if your optimization is increasing or decreasing the power. It is also the time in the process where power and thermal are being understood, and which cycles need to be identified for use in other parts of the flow.”
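
The heuristic approach Swinnen describes can be sketched roughly as follows. This is an illustration only: the per-type weights, cell counts, and activity factors are made-up placeholders in arbitrary units, not values from any real tool, and the function name is hypothetical. The point it shows is the relative comparison between two architectures, not an absolute power number.

```python
# Hypothetical power weight per cell for each function type (arbitrary units).
WEIGHTS = {
    "random_logic": 1.0,
    "memory_access": 4.0,
    "clock": 2.5,
    "datapath": 1.8,
}


def estimate_power(regions: dict[str, tuple[int, float]]) -> float:
    """Sum a per-type heuristic over regions.

    `regions` maps a function type to (cell count, activity factor in 0..1).
    """
    return sum(
        WEIGHTS[kind] * cells * activity
        for kind, (cells, activity) in regions.items()
    )


# Placeholder architectures: the variant trades memory accesses for datapath logic.
baseline = {
    "random_logic": (1000, 0.2),
    "memory_access": (200, 0.5),
    "clock": (100, 1.0),
    "datapath": (500, 0.3),
}
variant = dict(baseline, memory_access=(100, 0.5), datapath=(700, 0.3))

delta = estimate_power(variant) - estimate_power(baseline)
print("variant raises power" if delta > 0 else "variant lowers power")
```

Because only the sign and rough size of the difference matter for this kind of what-if question, a fast estimate can be more useful here than a slow gate-level one.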

Conclusion
There is an increasing amount of newness in many designs today. This is driven by technological advancements, domain requirements, and changing views of the system and the methodologies required to design them. It often requires the influx of new expertise that can be difficult to properly integrate. The need for larger, more diverse teams is hampered by an industry-wide lack of talent, and this is causing significant instability within companies. The fact that there is a dip in first-time success rates is perhaps unsurprising, but the real question is whether this is the beginning of a longer decline, or if this is a wakeup call for change.”

Source: https://semiengineering.com/trouble-ahead-for-ic-verification/
 
sleepeeg3 said:
Not really sure what this article is about (other than semi design), but thanks for keeping the news flowing, erek. ;)
It basically breaks down the growing issue that CPUs have grown disgustingly large. Not physically large, but the number of components crammed in there has increased complexity exponentially, and that complexity is a problem at datacenter scale. Even if there’s only a 1 in a trillion chance of a single transistor failing per CPU per year, that’s still going to happen to dozens of CPUs per year.
What do those failures look like? What do they do to the output? Do they corrupt it slightly, and if so, how does that corruption get detected and corrected? Or does the CPU fail completely, and what redundancies do you have in place?
Now that CPUs are adopting chiplets or tiles or silos or stacks, how are potential failures on a single one of those handled, and how does that chip handle it internally? How do adjacent chips verify and shape input? We’re approaching a point where a CPU or GPU package might be made up of 50 or more individual chips. Each of them needs specialized error detection and error correction, and that needs its own team specialized in that chip and its functions.
 
Lakados said:
It basically breaks down the growing issue that CPUs have grown disgustingly large. Not physically large, but the number of components crammed in there has increased complexity exponentially, and that complexity is a problem at datacenter scale.
Why only at the datacenter? What about us schlubs who use PCs at home?
Lakados said:
Even if there’s only a 1 in a trillion chance of a single transistor failing per CPU per year, that’s still going to happen to dozens of CPUs per year.
Hasn't that been true since the days of the 8080 and Z80?
 
philb2 said:
Why only at the datacenter? What about us schlubs who use PCs at home?
For us plebs, a few bits being weird won’t do much. But when you get to the scale of Google or Netflix, where you have parallel tasks that are super dependent on one another, it does weird things.
Google has some great papers on the topic and they work with Intel very closely to develop solutions and such to prevent it.
Excerpts of their more recent work are here.
https://semiengineering.com/better-screening-needed-for-data-center-errors/
philb2 said:
Hasn't that been true since the days of the 8080 and Z80?
Yes, but 1 in a trillion when you have a few million transistors on a single CPU means it could take a decade before anything goes south. When you have hundreds of billions on a single chip, it could be a matter of months. Studies are also showing that the more complicated chips require fewer failures before inconsistencies start occurring. And there are serious concerns that the newer process nodes themselves are less durable than their predecessors, so instead of 1 in a trillion maybe it’s 1 in 800 billion.
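
A back-of-the-envelope sketch of that scaling argument. It assumes the 1 in a trillion figure is a per-transistor, per-year failure probability and that failures are independent. The transistor counts and the fleet size are placeholder values, not measurements, and the absolute figures depend heavily on how you read that rate. The ratio between the old and new chips is the part the argument rests on.

```python
import math

P_FAIL = 1e-12                 # assumed: failure probability per transistor per year
OLD_CHIP = 5_000_000           # placeholder for "a few million" transistors
NEW_CHIP = 200_000_000_000     # placeholder for "hundreds of billions"
FLEET = 1_000_000              # hypothetical datacenter fleet size


def expected_failures_per_chip_year(transistors: int, p_fail: float) -> float:
    """Expected transistor failures per chip per year (linear in transistor count)."""
    return transistors * p_fail


def prob_at_least_one_failure(transistors: int, p_fail: float, years: float = 1.0) -> float:
    """P(at least one failure) over `years`, using a Poisson approximation."""
    return -math.expm1(-transistors * p_fail * years)


for name, count in (("old", OLD_CHIP), ("new", NEW_CHIP)):
    rate = expected_failures_per_chip_year(count, P_FAIL)
    p_one = prob_at_least_one_failure(count, P_FAIL)
    print(f"{name}: {rate:.2e} expected failures per chip-year, "
          f"P(>=1 in a year) = {p_one:.2e}, "
          f"about {rate * FLEET:.1f} failures per year across the fleet")
```

With the same per-transistor rate, the expected failure count grows in direct proportion to transistor count, and multiplying by fleet size is what turns a rare per-chip event into a routine datacenter one.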
 
