Nvidia delays Blackwell GPUs

"Product pushed back at least three months due to design problems"

"Google, Meta, and Microsoft are among those betting billions on Nvidia's GPUs amid an AI arms race. Google has ordered more than 400,000 GB200 chips, The Information reports, in a deal valued well north of $10 billion.

Meta also has a $10bn order, while Microsoft was expecting to have 55-65,000 GB200 GPUs ready for OpenAI by the first quarter. That now seems unlikely.

The production issue was discovered by manufacturer TSMC, and involves the processor die that connects two Blackwell GPUs on a GB200.

Nvidia is now adjusting the design and will need to run a new production test with TSMC before it can mass produce more.

It is also considering producing a version of the chip that only contains one Blackwell GPU to speed up delivery.

Due to the last-minute delay, TSMC will also have to leave production lines idle until the issue is fixed."

Source: https://www.datacenterdynamics.com/...pacting-hyperscaler-data-center-plans-report/
Design Issues May Postpone Launch of NVIDIA's Advanced Blackwell AI Chips

UPDATED by Nomad76 Today, 03:53 Updated: Today, 07:15
NVIDIA may face delays in releasing its newest artificial intelligence chips due to design issues, according to anonymous sources involved in chip and server hardware production cited by The Information. The delay could extend to three months or more, potentially affecting major customers such as Meta, Google, and Microsoft. An unnamed Microsoft employee and another source claim that NVIDIA has already informed Microsoft about delays affecting the most advanced models in the Blackwell AI chip series. As a result, significant shipments are not expected until the first quarter of 2025.
 
Wow this is pretty big news tbh. If AMD was smart (Which is hard for them sometimes) they would be shipping those MI300 cards as fast as they can.

This also means the 5000 series will now be pushed back even more....Plus it looks like the AI bubble is going bye bye.

Not good news at all.
 
Wow this is pretty big news tbh. If AMD was smart (Which is hard for them sometimes) they would be shipping those MI300 cards as fast as they can.

This also means the 5000 series will now be pushed back even more....Plus it looks like the AI bubble is going bye bye.

Not good news at all.
Bloomberg

1722697689068.png
 
This also means the 5000 series will now be pushed back even more....Plus it looks like the AI bubble is going bye bye.
I wonder if this might actually accelerate the release of the 5000 series. The cards running a single die wouldn't be impacted by this, so shipping those would provide some revenue while the multi-die versions get sorted out.
 
“Update 2:
SemiAnalysis's Dylan Patel reports in a message on Twitter (now known as X) that Blackwell supply will be considerably lower than anticipated in Q4 2024 and H1 2025. This shortage stems from TSMC's transition from CoWoS-S to CoWoS-L technology, required for NVIDIA's advanced Blackwell chips. Currently, TSMC's AP3 packaging facility is dedicated to CoWoS-S production, while initial CoWoS-L capacity is being installed in the AP6 facility.
Additionally, NVIDIA appears to be prioritizing production of GB200 NVL72 units over NVL36 units. The GB200 NVL36 configuration features 36 GPUs in a single rack with 18 individual GB200 compute nodes. In contrast, the NVL72 design incorporates 72 GPUs, either in a single rack with 18 double GB200 compute nodes or spread across two racks, each containing 18 single nodes.”

From TPU source linked above
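
For anyone trying to square those rack numbers, here is a minimal Python sketch of the arithmetic. The GPUs-per-node figures (2 for a "single" node, 4 for a "double" node) are an assumption inferred by dividing the quoted totals by 18 nodes; they do not come from an NVIDIA spec sheet, and `gpus_total` is just a hypothetical helper name.

```python
def gpus_total(racks: int, nodes_per_rack: int, gpus_per_node: int) -> int:
    """Total GPUs in a system built from identical racks of identical nodes."""
    return racks * nodes_per_rack * gpus_per_node


# Assumed GPUs per node, inferred from the quoted totals (36 / 18 and 72 / 18).
SINGLE_NODE_GPUS = 2
DOUBLE_NODE_GPUS = 4

configs = {
    "NVL36, one rack of 18 single nodes": gpus_total(1, 18, SINGLE_NODE_GPUS),
    "NVL72, one rack of 18 double nodes": gpus_total(1, 18, DOUBLE_NODE_GPUS),
    "NVL72, two racks of 18 single nodes": gpus_total(2, 18, SINGLE_NODE_GPUS),
}

for name, total in configs.items():
    print(f"{name}: {total} GPUs")
```

Under that assumption both NVL72 layouts come out to the same GPU count, so the difference between them is only how the nodes are spread across racks.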
 
The article says "design flaw" but that says TSMC ramp issues with their new advanced packaging facility. It was reported months back that TSMC was encountering unexpected setbacks and issues with their new CoWoS-L packaging process, and that is the process that makes the Nvidia interlink interposer that combines the two chips.
The CoWoS-L process allows TSMC to build interposers that are larger than the previous interposer reticle limit while packaging more communication lanes. Nvidia is 100% reliant on TSMC for these parts and is likely helping them save face. The design "flaw" is likely that the new -L process has a slightly different practical limit than their design software indicated, so they have to revise the design to match the specs the process is actually running at.
 
“As we've stated before, Hopper demand is very strong, broad Blackwell sampling has started, and production is on track to ramp in 2H. Beyond that, we don't comment on rumors”

NVIDIA Spokesperson
 
Does this just involve the data center high density AI chips, while consumer grade ones are unaffected or...? The summary kind of reads like that.
 
Wow this is pretty big news tbh. If AMD was smart (Which is hard for them sometimes) they would be shipping those MI300 cards as fast as they can.

This also means the 5000 series will now be pushed back even more....Plus it looks like the AI bubble is going bye bye.

Not good news at all.
Well not good news for Nvidia anyway.
 
Does this just involve the data center high density AI chips, while consumer grade ones are unaffected or...? The summary kind of reads like that.
Just the datacenter parts. None of the consumer parts are MCM, so they don’t need the communication components that are causing the problem.
 
The article says "design flaw" but that says TSMC ramp issues with their new advanced packaging facility. It was reported months back that TSMC was encountering unexpected setbacks and issues with their new CoWoS-L packaging process, and that is the process that makes the Nvidia interlink interposer that combines the two chips.
The CoWoS-L process allows TSMC to build interposers that are larger than the previous interposer reticle limit while packaging more communication lanes. Nvidia is 100% reliant on TSMC for these parts and is likely helping them save face. The design "flaw" is likely that the new -L process has a slightly different practical limit than their design software indicated, so they have to revise the design to match the specs the process is actually running at.
In the industry is there insurance for situations like this?

How do they effectively mitigate such a profound impact & loss?

I know there are rumors and whispers of other segments, like the energy industry, that will charge taxpayers for the company’s lost profits for other reasons (the customers saving money).

https://lailluminator.com/2024/07/2...rced-to-pay-utility-company-for-lost-profits/

https://www.businessreport.com/arti...-want-customers-to-pay-for-lost-profits?amp=1

Just curious, because this would seem to be a big deal and potentially billions in losses for Nvidia?

What’s the reality?
 
In the industry is there insurance for situations like this?

How do they effectively mitigate such a profound impact & loss?

I know there are rumors and whispers of other segments, like the energy industry, that will charge taxpayers for the company’s lost profits for other reasons (the customers saving money).

https://lailluminator.com/2024/07/2...rced-to-pay-utility-company-for-lost-profits/

https://www.businessreport.com/arti...-want-customers-to-pay-for-lost-profits?amp=1

Just curious, because this would seem to be a big deal and potentially billions in losses for Nvidia?

What’s the reality?
That Louisiana thing I thought was a joke, like an Onion article spread out of control….
Normally energy suppliers just raise rates to cover the gap. People use less, so the suppliers produce less; with supply lower, the price goes up and margins are maintained.

Nvidia might take a hit on their margin, but the orders aren’t going anywhere. An extra 3 months on a project that will take multiple months just to configure and get online isn’t overly significant in the grand scheme of things. Companies buying these don’t have an abundance of options anyway; it’s not like anybody else builds CUDA-compatible hardware.
 
“Update 2:
SemiAnalysis's Dylan Patel reports in a message on Twitter (now known as X) that Blackwell supply will be considerably lower than anticipated in Q4 2024 and H1 2025. This shortage stems from TSMC's transition from CoWoS-S to CoWoS-L technology, required for NVIDIA's advanced Blackwell chips. Currently, TSMC's AP3 packaging facility is dedicated to CoWoS-S production, while initial CoWoS-L capacity is being installed in the AP6 facility.
Additionally, NVIDIA appears to be prioritizing production of GB200 NVL72 units over NVL36 units. The GB200 NVL36 configuration features 36 GPUs in a single rack with 18 individual GB200 compute nodes. In contrast, the NVL72 design incorporates 72 GPUs, either in a single rack with 18 double GB200 compute nodes or spread across two racks, each containing 18 single nodes.”

From TPU source linked above
That might be the important quote right there: "This is partially offset by [redacted] which will have a decent revenue contribution and [redacted] is driving the increase in 3Q revenue"

I'm not holding my breath, but I am hoping that [redacted] is the gaming product. Of course, in order to have any meaningful impact on 3Q financials, that product would have to launch, like, today.
 
Video cards will be using single-die chips for the GPUs, so I doubt video card releases are affected.

The Bloomberg post even says "AI chip is delayed".
 
Video cards will be using single-die chips for the GPUs, so I doubt video card releases are affected.

The Bloomberg post even says "AI chip is delayed".
Well, AI is the new buzzword. It makes for a much better and more impactful headline than “gaming chip delayed”.
 
Unlike with the B100/B200, Nvidia did not even hint at a possible release date or take orders for the L40 replacement and the gaming cards, which makes it hard for a trade headline to call them delayed. Maybe they will be (maybe the issue is larger than what they say, or fab timing alone will push everything back, and we will never know). If they launch in, say, March 2025, were they delayed by this or not? Who would know?
 
Jensen needs his Ego taken down a notch. I'm okay with this.
 
Jensen needs his Ego taken down a notch. I'm okay with this.
Seems GPUs can now also get CTE problems (akin or analogous to chronic traumatic encephalopathy, I jest). Pretty interesting:

"The bridge die placement requires very high levels of accuracy, especially when it comes to the bridges between the two main compute dies as these are critical for supporting the 10 TB/s chip-to-chip interconnect. A major design issue rumored is related to the bridge dies. These bridges need to be redesigned. Also rumored is a redesign of the top few global routing metal layers and bump out of the Blackwell die. This is a primary cause of the multi-month delay.

There has also been the issue of TSMC not having enough CoWoS-L capacity in aggregate. TSMC built up a lot of CoWoS-S capacity over the last couple years with Nvidia taking the lion’s share. Now with Nvidia quickly moving their demand to CoWoS-L, TSMC is both building a new fab, AP6, for CoWoS-L and converting existing CoWoS-S capacity at AP3. TSMC needs to convert the old CoWoS-S capacity as otherwise it would be underutilized and the ramp of CoWoS-L would be even slower. This conversion process makes the ramp very lumpy in nature.

Combine these two issues and it’s clear that TSMC will not be able to supply enough Blackwell chips as Nvidia would like. Consequently, Nvidia is focusing what capacity they have almost entirely on GB200 NVL 36x2 and NVL72 rack scale systems. HGX form-factors with the B100 and B200 are effectively now being cancelled outside of some initial lower volumes."
 

NVIDIA's New B200A Targets OEM Customers; High-End GPU Shipments Expected to Grow 55% in 2025

PRESS RELEASE by TheLostSwede Today, 06:57 Discuss (0 Comments)
Despite recent rumors speculating on NVIDIA's supposed cancellation of the B100 in favor of the B200A, TrendForce reports that NVIDIA is still on track to launch both the B100 and B200 in the 2H24 as it aims to target CSP customers. Additionally, a scaled-down B200A is planned for other enterprise clients, focusing on edge AI applications.
 
It's a TSMC issue, not a NVIDIA issue.
With that being said, and with TSMC being a third-party vendor for Nvidia, you’d wonder whether they arranged liability insurance with TSMC or risk-accepted the gap.

Would be interesting to understand their third party engagement processes

 
The article says "design flaw" but that says TSMC ramp issues with their new advanced packaging facility. It was reported months back that TSMC was encountering unexpected setbacks and issues with their new CoWoS-L packaging process, and that is the process that makes the Nvidia interlink interposer that combines the two chips.
The CoWoS-L process allows TSMC to build interposers that are larger than the previous interposer reticle limit while packaging more communication lanes. Nvidia is 100% reliant on TSMC for these parts and is likely helping them save face. The design "flaw" is likely that the new -L process has a slightly different practical limit than their design software indicated, so they have to revise the design to match the specs the process is actually running at.
Nah, no company is taking a fall for the other. If they say it's an issue with Nvidia's design, then it most likely is, if Nvidia does not deny it. Neither one wants bad press.
 
Nah, no company is taking a fall for the other. If they say it's an issue with Nvidia's design, then it most likely is, if Nvidia does not deny it. Neither one wants bad press.
Well, it turns out the “design issue” is that the TSMC CoWoS-L interposer has a small organic component that heats and cools at a different rate than the rest of the interposer, which causes the interposer to warp over time. So they need to redesign their circuitry to minimize the warping. So TSMC issue…
 
Video cards will be using Single die chips for the GPU's, so video card releases I doubt are affected.

The bloomberg post even says "AI chip is delayed".
Considering how many financial media journalists still call the company “Nuh-vidia”, I wouldn’t rely on that as an accurate guideline regarding which specific products are impacted by the delay. Bloomberg calling a GPU an “AI chip” is basically like your mom calling every console you ever owned a “Nintendo” regardless of the brand.
 
It's a TSMC issue, not a NVIDIA issue.
Not all problems are production problems. From what I've read, production has been halted. However, production can be halted just as quickly because of a design flaw as because of a production flaw.
 
Well, it turns out the “design issue” is that the TSMC CoWoS-L interposer has a small organic component that heats and cools at a different rate than the rest of the interposer, which causes the interposer to warp over time. So they need to redesign their circuitry to minimize the warping. So TSMC issue…
Unless Nvidia specified that component. But generally, if another company is causing a delay in product they will call them out, to show investors why there is a delay.
 
Unless Nvidia specified that component. But generally, if another company is causing a delay in product they will call them out, to show investors why there is a delay.
It's the thing that makes the -L process different from the regular one.

CoWoS-L: This uses local silicon interconnect (LSI) along with an RDL interposer, together forming a reconstituted interposer (RI). In addition to the RDL interposer, it also preserves the attractive feature of CoWoS-S in the form of through silicon vias (TSVs). This also mitigates the yield issues arising due to the use of a large silicon interposer in CoWoS-S. In some implementations, it may also use through insulator vias (TIVs) instead of TSVs to minimize the insertion loss.

cowos-L-package.png
 
They do? I've never felt that way as a customer, and they must not be too bad to their partners with the scale they work with them on.
Talk to EVGA about how Nvidia treats partners.

Let's not bring up how they treated customers regarding the 970......Yeah I know we both got a check from Nvidia on that one.

But what I will say is that I do like Nvidia products, and I always pick a vendor based on their warranty.
 
They do? I've never felt that way as a customer, and they must not be too bad to their partners with the scale they work with them on.
Companies will do business with the winner. They want to make money. I'm not saying Nvidia doesn't make money or doesn't make excellent products. On engineering, etc., Nvidia is top notch. But their business practices are garbage, and have been garbage. Unless you have scale, you don't get to do business with Nvidia anymore. Look at all the vendors that were either pushed out of marketing Nvidia products or chose not to because of their predatory practices.
 
Talk to EVGA about how Nvidia treats partners.

Let's not bring up how they treated customers regarding the 970......Yeah I know we both got a check from Nvidia on that one.

But what I will say is that I do like Nvidia products, and I always pick a vendor based on their warranty.
EVGA went out of the videocard business due to their own issues and financials.

I got 20% back and $30 on each of my GTX 970 purchases. I was ultimately satisfied, but disappointed they lied about the specs upfront.

I too like Nvidia products, and have always picked a vendor based on warranty. EVGA had a great one that, alongside their step-up program, was ultimately too costly for them. :)
 
With that being said, and with TSMC being a third-party vendor for Nvidia, you’d wonder whether they arranged liability insurance with TSMC or risk-accepted the gap.

Would be interesting to understand their third party engagement processes

It’s a managed risk sort of thing, but there’s so much money and time involved that nobody wants to piss anybody off, so TSMC is generally good about trying to make things right.
 
