Google Cloud AI Chip - Workload Economics

DataStorage Editorial Team

In the News 6 min read  ·  May 2026
Google has looked at two fundamentally different problems, admitted that one chip cannot solve both well, and designed two separate pieces of silicon to prove it. What that decision means for the economics of running AI at scale is the real story.

For over a decade, Google's Tensor Processing Units have been one of the industry's best-kept competitive advantages. While the rest of the world scrambled for Nvidia H100s and debated GPU clusters, Google had been quietly running its own chip programme, building custom silicon that powered Gemini, Google Search, YouTube's recommendations, and almost everything else at scale. It worked well enough. But something shifted this year, and the change is more fundamental than a speed bump.

At Google Cloud Next 2026 in Las Vegas last month, the company announced that its eighth-generation TPU would not be one chip. It would be two. The TPU 8t for training AI models. The TPU 8i for running them. That distinction — training versus inference — has always existed on paper. Google just decided to stop pretending the same hardware handles both gracefully.


Why Now, and Why Does It Matter

The timing is not accidental. Inference workloads now account for more than 70 percent of AI accelerator cycles, and the economics of each query have become a business problem rather than just a technical one. When Anthropic reports that Claude handles more than 14 billion requests a day, the cost of answering each one starts to look a lot like a utility bill. At that scale, an 80 percent improvement in cost per query is not a talking point; it is the difference between a sustainable margin and a burning one.

The hyperscalers are building their own chips not because they think they can beat Nvidia on every metric but because they have concluded that purpose-built inference silicon, optimised for their specific workloads and deployed at their specific scale, produces better economics than buying Nvidia GPUs at Nvidia's margins.

  • 42.5 exaflops: Ironwood TPU pod (Gen 7). Source: Google Cloud, 2025
  • 121 exaflops: TPU 8t superpod (Gen 8). Source: Google Cloud Next 2026
  • 80%: better perf-per-dollar on TPU 8i vs Ironwood. Source: Google, April 2026
  • 2.8×: better training price-performance on TPU 8t. Source: Google, April 2026

Amin Vahdat, Google's SVP and chief technologist for AI infrastructure, made a pointed remark that captures the internal logic of this decision. Google designs every layer of its AI stack end-to-end, and that vertical integration is starting to show up in cost-per-token economics that Google says its rivals cannot match. The chip announcement is the most visible part of that stack, but the story underneath involves networking, cooling, data centre design, and software, all of it built and owned by one company.

"We realized two years ago that one chip a year wouldn't be enough. This is our first shot at actually going with two super high-powered specialized chips." - Amin Vahdat, SVP & Chief Technologist for AI & Infrastructure, Google

The Two Chips, and the Two Different Problems They Solve

Designing one chip that is optimal for both training and inference has always been a compromise. Google has decided to stop compromising. The split acknowledges a reality the industry has been approaching for years: the workloads are fundamentally different, and treating them the same is expensive.

TPU 8t - Training
  • Designed by Broadcom
  • 12.6 PetaFLOPS (FP4)
  • 216 GB HBM per chip
  • 6.5 TB/s HBM bandwidth
  • 128 MB on-chip SRAM
  • 9,600 chips per pod
  • 121 Exaflops total
  • +2.8× vs Ironwood

TPU 8i - Inference
  • Designed by MediaTek
  • 10.1 PetaFLOPS (FP4)
  • 288 GB HBM per chip
  • 8.6 TB/s HBM bandwidth
  • 384 MB on-chip SRAM (3×)
  • 1,152 chips per pod
  • 19.2 Tb/s ICI bandwidth
  • +80% perf-per-dollar vs Ironwood

TPU 8t vs TPU 8i - Architecture comparison, Google Cloud Next 2026

Training: TPU 8t

Training workloads demand maximum compute density and memory bandwidth to process trillions of parameters across weeks of continuous operation. The TPU 8t is built around that reality. A TPU 8t superpod scales to 9,600 liquid-cooled chips sharing roughly 2 petabytes of high-bandwidth memory, with double the interchip bandwidth of Ironwood. Each chip carries 216 GB of high-bandwidth memory at 6.5 TB per second of bandwidth, and up to 12.6 petaFLOPS of 4-bit floating point compute. The headline number is 121 exaflops per pod - nearly three times what Ironwood delivered.
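The pod-level figures follow from the per-chip figures by simple multiplication, which is worth checking before taking a headline number at face value. The sketch below uses only numbers quoted in this article; the function names are illustrative, and the comparison with Ironwood assumes both pod figures are quoted on a comparable basis, which the announcement figures cited here do not confirm.

```python
# Per-chip and per-pod figures as quoted in this article for TPU 8t.
CHIPS_PER_POD = 9_600
PFLOPS_PER_CHIP_FP4 = 12.6
HBM_GB_PER_CHIP = 216
IRONWOOD_POD_EXAFLOPS = 42.5  # Gen 7 pod figure quoted above


def pod_exaflops(chips: int, pflops_per_chip: float) -> float:
    """Aggregate peak compute of a pod, in exaflops (1 EF = 1,000 PF)."""
    return chips * pflops_per_chip / 1_000


def pod_hbm_petabytes(chips: int, gb_per_chip: float) -> float:
    """Aggregate HBM capacity of a pod, in decimal petabytes."""
    return chips * gb_per_chip / 1_000_000


pod_ef = pod_exaflops(CHIPS_PER_POD, PFLOPS_PER_CHIP_FP4)
pod_pb = pod_hbm_petabytes(CHIPS_PER_POD, HBM_GB_PER_CHIP)
ratio_vs_ironwood = pod_ef / IRONWOOD_POD_EXAFLOPS

print(f"Pod compute: {pod_ef:.2f} EF")
print(f"Pod HBM: {pod_pb:.2f} PB")
print(f"Ratio vs Ironwood pod: {ratio_vs_ironwood:.2f}x")
```

The per-chip numbers reproduce the quoted totals: about 121 exaflops and about 2 petabytes of HBM per pod, and a pod-to-pod ratio a little under three.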

But raw numbers are only part of what matters in training. The other part is goodput - how much of that compute is actually being used productively. Every hardware failure, network stall, or checkpoint restart is time the cluster is not training, and at frontier training scale, every percentage point can translate into days of active training time. Google's Virgo Network fabric and fourth-generation liquid cooling were designed with exactly this in mind.
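To make the goodput idea concrete, here is a minimal sketch that treats goodput as the share of wall-clock time spent making forward training progress. That definition is a common simplification, not necessarily the one Google uses internally, and the run length and lost-hour figures are hypothetical.

```python
def goodput(total_hours: float, lost_hours: dict[str, float]) -> float:
    """Fraction of wall-clock time spent on productive training."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return (total_hours - sum(lost_hours.values())) / total_hours


# Hypothetical 90-day training run; none of these figures come from Google.
run_hours = 90 * 24
lost = {
    "hardware_failures": 60.0,
    "network_stalls": 25.0,
    "checkpoint_restarts": 45.0,
}

run_goodput = goodput(run_hours, lost)

# Wall-clock days represented by one percentage point of goodput on this run.
days_per_point = 0.01 * run_hours / 24

print(f"Goodput: {run_goodput:.1%}")
print(f"One percentage point equals {days_per_point:.1f} days of cluster time")
```

On a run of this length, one percentage point is close to a full day of cluster time, which is why reliability features are sold on the same footing as peak FLOPs.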

Inference: TPU 8i

Inference is a different animal. For each token generated, the entire model's active weights need to be streamed through memory. While compute is still important, the main bottleneck tends to be memory bandwidth. The TPU 8i trades some raw floating point horsepower for a much larger on-chip SRAM cache and a faster, higher-capacity memory pool.

The TPU 8i features 10.1 petaFLOPS of FP4 compute fed by 384 MB of on-chip SRAM, and 288 GB of HBM good for 8.6 TB per second of bandwidth. That tripling of on-chip SRAM is not an accident - it is designed to hold agent working sets without trips to off-chip memory, which is where latency gets killed in production workloads. Google says its collective communication latencies are reduced five-fold, which translates into better economics by allowing them to pack more users onto the same hardware.
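A rough roofline-style calculation shows why bandwidth rather than compute tends to set the ceiling during token generation. The sketch below uses the TPU 8i figures quoted above; the 100-billion-active-parameter model is a hypothetical example, and the estimate deliberately ignores batching, KV-cache traffic, on-chip SRAM reuse, and multi-chip sharding, all of which change the picture in production.

```python
# TPU 8i per-chip figures as quoted in this article.
HBM_BANDWIDTH_BYTES_PER_S = 8.6e12  # 8.6 TB/s
PEAK_FP4_FLOPS = 10.1e15            # 10.1 petaFLOPS


def decode_ceiling_tokens_per_s(active_params: float,
                                bytes_per_param: float,
                                bandwidth: float) -> float:
    """Upper bound on single-sequence decode speed when every active
    weight must be streamed from HBM once per generated token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth / bytes_per_token


def ridge_point_flops_per_byte(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth


# Hypothetical model: 100B active parameters stored at 4 bits (0.5 bytes) each.
ceiling = decode_ceiling_tokens_per_s(100e9, 0.5, HBM_BANDWIDTH_BYTES_PER_S)
ridge = ridge_point_flops_per_byte(PEAK_FP4_FLOPS, HBM_BANDWIDTH_BYTES_PER_S)

print(f"Bandwidth-bound decode ceiling: {ceiling:.0f} tokens/s per sequence")
print(f"Ridge point: {ridge:.0f} FLOPs per byte")
```

Single-sequence decoding performs only a few operations per weight byte it reads, far below the ridge point, so the chip waits on memory. Larger batches and bigger on-chip caches are the usual ways to close that gap.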


What This Actually Costs You - The TCO Story

Benchmarks are self-reported and should always be read with appropriate scepticism. But the total cost of ownership conversation around Google's TPUs has been building credibility for a while. Independent benchmarks published by SemiAnalysis put Ironwood's TCO at roughly 44 percent lower than a comparable GB200 server configuration, even accounting for a small shortfall in peak FLOP numbers. The same analysis put Ironwood's cost at $0.18 per million tokens for Gemini inference, versus $0.31 per million tokens on comparable B200 configurations.

Relative TCO per inference workload (lower is better):
  • Nvidia GB200: 1.00× (baseline)
  • Ironwood (Gen 7): ~0.58× TCO
  • TPU 8i (Gen 8, proj.): ~0.32× TCO
TPU 8i figure extrapolated from SemiAnalysis Ironwood data plus Google's claimed 80% improvement. Independent benchmarks pending GA.

44%: lower all-in TCO per Ironwood chip versus a comparable GB200 server, according to SemiAnalysis benchmarks published February 2026, even with a ~10% shortfall on peak FLOPs.

If you are spending a hundred thousand dollars a month on AI inference today, the arithmetic is worth doing. At a 44 percent lower TCO, a $100,000 monthly bill could potentially fall to $56,000 at the same performance level on Ironwood alone, before the additional 80 percent improvement claimed for TPU 8i is even factored in.
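The arithmetic can be sketched in a few lines. This assumes the 44 percent TCO difference passes straight through to a customer's bill and that Google's claimed 80 percent perf-per-dollar gain is fully realised, neither of which is guaranteed; the function names are illustrative. An 80 percent gain in performance per dollar is not an 80 percent price cut: the same work costs 1/1.8 of what it did, a reduction of about 44 percent.

```python
def cost_after_tco_reduction(monthly_cost: float, reduction: float) -> float:
    """Apply a fractional TCO reduction (0.44 means 44 percent lower)."""
    return monthly_cost * (1 - reduction)


def cost_after_perf_per_dollar_gain(monthly_cost: float, gain: float) -> float:
    """Cost of the same work after a perf-per-dollar gain.

    A gain of 0.80 means each dollar buys 1.8x the work, so the
    same workload costs 1 / 1.8 of the previous bill.
    """
    return monthly_cost / (1 + gain)


baseline_gpu_bill = 100_000.0  # the article's example monthly spend
ironwood_bill = cost_after_tco_reduction(baseline_gpu_bill, 0.44)
tpu8i_bill = cost_after_perf_per_dollar_gain(ironwood_bill, 0.80)

print(f"Ironwood: ${ironwood_bill:,.0f} per month")
print(f"TPU 8i (projected): ${tpu8i_bill:,.0f} per month")
```

Under those assumptions the projected TPU 8i bill lands at a little over $31,000 a month, which is in line with extrapolations that put TPU 8i at roughly a third of the GPU baseline.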

There is a caveat worth naming here. Google's performance claims are credible for a specific reason: the company designed the chip, the network connecting the chips, the servers hosting them, the cooling systems sustaining them, and the data centers housing all of it. No third-party chip vendor can make that statement. But that tight vertical integration is also the catch: those performance advantages do not travel. They are a function of a system you do not own, and cannot negotiate around if you want to move providers in three years.


The Ironwood Foundation - What Came Before and What It Proved

It is easy to read the TPU 8 announcement as a break from the past, but it is better understood as validation of what Ironwood already proved. Ironwood powers every major Google service in production today: Gemini 3.5, Search AI Overviews, YouTube's recommendation stack, Gmail's smart features, and Google Photos' on-device models. TPU utilization exceeded 91 percent network-wide in March 2026, a number that would be commercially implausible if the chip did not deliver on its performance-per-dollar claims.

Google TPU generation timeline, 2015 to 2026
  • 2015: TPU v1 deployed internally at Google
  • 2017: TPU v2 adds training; first external cloud access
  • 2021–23: TPU v4 & v5 - Google uses TPUs as cloud differentiator
  • Apr 2025: Ironwood (TPU v7) - 42.5 Exaflops, built for inference era
  • Late 2025: Ironwood GA; Anthropic commits to TPU for Claude
  • Apr 2026: TPU 8t & 8i unveiled - first training/inference split in programme history

The Supply Chain Underneath the Chips

Broadcom handles the high-performance training silicon under a relationship that has been described as a $46 billion AI contract. MediaTek handles cost-optimised inference, having already proved its ability to deliver I/O modules for Ironwood at 20 to 30 percent lower cost. This is not just an engineering decision; it is a deliberate multi-supplier strategy that reduces Google's dependence on any single partner while keeping cost pressure on each of them.

Google projects 4.3 million TPU shipments in 2026, rising to 10 million in 2027 and more than 35 million in 2028. The capital expenditure to support this is enormous. Google has committed $175 billion to $185 billion in infrastructure spending for 2026, nearly doubling the $91.4 billion it spent in 2025.

TPU 8 Projected Shipment Scale
  • 2026: 4.3M units
  • 2027: 10M units
  • 2028: 35M+ units
Google projected TPU shipment volumes. Source: The Next Web / Google projections

What to Watch
  • Intel, Marvell, and TSMC are all part of the supply chain supporting the TPU 8 programme.
  • The chips reportedly target TSMC's 2nm process node for the full generation rollout in late 2027.
  • Independent benchmarks from early cloud customers will be the real test of whether Google's claimed economics hold outside of vendor-controlled conditions. General availability is expected later in 2026.

What It Means for Enterprises Evaluating Cloud AI

For enterprises evaluating AI infrastructure, this changes the math on which cloud platform to standardize on. If your workload is predominantly inference at scale (think customer-facing agents, recommendation systems, and summarisation tools), the TPU 8i represents a genuinely different cost structure from what general-purpose GPUs can offer.

But the switching-cost conversation deserves honesty. Enterprises that have standardized on Nvidia hardware for inference face real costs in moving onto TPUs, not because the software migration is complicated, but because the performance advantages of Google's stack are a function of the full infrastructure underneath it. The TPU 8i is not a drop-in replacement; it is the centrepiece of a system that Google has designed from network to cooling to orchestration.

Enterprise Decision Framework
  • Are you primarily running inference workloads at scale (agents, search, recommendations)? Yes → Evaluate TPU 8i on Vertex AI
  • Are you training large proprietary models? Yes → Evaluate TPU 8t availability & goodput SLAs
  • Are you locked into AWS or Azure through enterprise agreements? Yes → Migration cost may outweigh savings; model first
  • Do you consume Gemini through Gemini Enterprise? Yes → You inherit the TPU 8i lift automatically
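
The advice to model migration cost before moving can start as a simple payback calculation. The sketch below reuses the article's $100,000 to $56,000 monthly example for the savings; the $500,000 one-time migration cost is purely hypothetical, and a real model would also account for contract penalties, re-engineering time, and the risk that pricing changes.

```python
from typing import Optional


def payback_months(one_time_migration_cost: float,
                   monthly_savings: float) -> Optional[float]:
    """Months until cumulative savings cover the migration cost.

    Returns None when there are no savings, since the move never pays back.
    """
    if monthly_savings <= 0:
        return None
    return one_time_migration_cost / monthly_savings


monthly_savings = 100_000 - 56_000  # the article's example bill before and after
migration_cost = 500_000            # hypothetical one-time cost
months = payback_months(migration_cost, monthly_savings)

print(f"Payback period: {months:.1f} months")
```

If the payback period runs longer than the remaining term of an existing enterprise agreement, or longer than you expect the price gap to last, the savings are largely theoretical.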

Where This Leaves Nvidia

Nvidia's position is more nuanced than the chip war framing often suggests. Google still sells Nvidia GB300 systems inside Google Cloud; the competition is layered, not head-to-head. Jensen Huang has argued consistently that general-purpose GPUs optimise for the next workload, while ASICs optimise for today's. That is a credible point for organizations that expect their AI workloads to evolve quickly and unpredictably.

But Nvidia's data-centre gross margin, currently above 75 percent, faces meaningful compression as custom silicon captures a larger share of hyperscaler accelerator spend. Google is not the only company building this way. Amazon has Trainium and Inferentia, Microsoft has Maia chips, and Meta has its MTIA accelerators. The Nvidia moat is deep, particularly in software, but it is no longer the only viable option for organizations that have the scale and engineering resources to make custom silicon work.

Custom Silicon Share of Hyperscaler Accelerator Spend (Projected)
  • Nvidia GPUs now: ~85%
  • Custom silicon now: ~15%
  • Custom silicon 2028: 25–30%
Source: Forward Future AI / industry analyst projections. Figures illustrative.

The Bigger Picture

What Google has done with the TPU 8 split is not just an engineering decision. It is a statement about where AI infrastructure economics are heading. McKinsey's analysis predicts that inference spend will outpace training spend in enterprise budgets through 2027. If that forecast is accurate, then the economics of serving AI, not building it, become the defining competitive battleground. Google has positioned itself to compete on exactly that ground, with hardware designed specifically for the purpose.

Vahdat made a prediction worth noting: as general-purpose CPUs plateau, workloads that matter will demand purpose-built silicon. "Two chips might become more," he said, without specifying whether that means future TPU variants or other classes of specialized accelerators. The frontier compute race has changed character. It used to be about who could secure the most H100s. It is now about who controls the full stack, from chip design to data centre fabric to the software that makes all of it run efficiently.

Silicon is the new contract. Google is not selling you a chip; it is selling you a relationship with an entire infrastructure stack, priced to make alternatives look expensive.
