This is not a budgeting failure. It is a visibility failure, and it is the most widespread pain point in cloud infrastructure right now.
GPU compute on AWS, Google Cloud, and Azure is fundamentally different from the EC2 instances and Cloud Run jobs that FinOps teams got good at managing over the last decade. The billing is abstracted behind tokens, model units, and throughput reservations. The cost drivers are buried inside prompt structures, context window sizes, and agent orchestration chains. And unlike traditional compute, a single misconfigured job can burn through a weekly budget overnight without a single alert firing. This connects directly to a broader pattern covered in Hidden Costs in Cloud Billing: What Your Provider Isn't Telling You.
This guide goes deep on all three major managed AI platforms: AWS Bedrock, Google Vertex AI, and Azure AI. Not a feature comparison. A cost management playbook.
Before you can control it, you need to understand what makes GPU and inference spend so difficult to manage.
Regular cloud compute scales linearly. You add servers, costs go up predictably. AI workloads do not behave this way.
Figure 1: The AI cost iceberg. What shows up on model billing is just the tip. The real spend lives below the waterline in services that never get tagged as AI.
The visible spend on a platform like Bedrock or Vertex AI is the surface. The real cost sits below the waterline in shared compute, storage, and networking that never gets tagged as AI. A team running a RAG pipeline on Bedrock is not just paying for model inference. They are paying for OpenSearch Serverless, S3 storage, CloudWatch logging, Lambda invocations, and data transfer. None of those line items say "AI" on the bill.
GPU compute is an order of magnitude more expensive than general-purpose cloud compute, and the billing model used by managed AI platforms abstracts the underlying GPU economics from the teams generating the cost. When you see a token price, you are looking at a derivative of GPU time, not GPU time itself. That abstraction is convenient when you are building. It becomes a liability when you are trying to attribute spend or forecast costs. For a deeper look at how this plays out across compute types, see GPU vs CPU: Choosing the Right Compute for AI Workloads.
This is worth stating plainly because teams waste months optimizing the wrong number. The only metric that actually matters for AI workloads is cost per outcome: cost per training run, cost per million inference tokens, or cost per AI-powered feature. Hourly pricing tells you almost nothing about what your workload will actually cost to run, or what value it will produce.
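To make the gap between hourly pricing and cost per outcome concrete, here is a minimal sketch that converts an hourly GPU price into cost per million tokens. The hourly rates and throughput figures are illustrative assumptions, not quoted prices for any provider.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price into cost per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# Hypothetical instances: the cheaper hourly rate loses once throughput is counted.
budget_gpu = cost_per_million_tokens(hourly_rate_usd=2.00, tokens_per_second=400)
premium_gpu = cost_per_million_tokens(hourly_rate_usd=6.00, tokens_per_second=2000)

print(f"budget:  ${budget_gpu:.2f} per 1M tokens")
print(f"premium: ${premium_gpu:.2f} per 1M tokens")
```

With these made-up numbers, the instance that costs three times as much per hour is the cheaper one per million tokens, which is the whole argument for tracking unit cost instead of hourly rate.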
Bedrock is where most enterprise AI spend is concentrated right now. It offers the broadest model catalog, the deepest AWS integration, and some of the most mature cost optimization levers available on any managed AI platform. For context on what AWS announced around AI tooling, see What AWS re:Invent 2025 Announcements Mean for Enterprise Teams.
Bedrock can include separate costs for model inference, guardrails, knowledge bases, logging, and related AWS services. Teams that only look at model inference costs are systematically underreporting their Bedrock spend.
The cost everyone gets surprised by inside Bedrock Knowledge Bases is Amazon OpenSearch Serverless, the default vector store. It has a minimum baseline of 2 OpenSearch Compute Units at $0.24 per OCU per hour, which works out to roughly $345 per month even with zero query traffic. Amazon S3 Vectors, launched in December 2025, is up to 90% cheaper than OpenSearch Serverless. For new Knowledge Bases, S3 Vectors should be your default unless you have a specific OpenSearch dependency.
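The arithmetic behind that baseline is simple enough to check yourself. This sketch uses the OCU rate and minimum quoted above and assumes a 30-day (720-hour) month.

```python
OCU_HOURLY_USD = 0.24   # per-OCU hourly rate quoted above
MIN_OCUS = 2            # minimum baseline quoted above
HOURS_PER_MONTH = 720   # assumption: 30-day month


def idle_vector_store_cost(ocus: int = MIN_OCUS, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly floor for the vector store, even with zero query traffic."""
    return ocus * OCU_HOURLY_USD * hours


print(f"${idle_vector_store_cost():.2f} per month")  # $345.60 per month
```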
Nova Micro at $0.035 per million input tokens is roughly 143 times cheaper than Claude Opus 4.7 at $5.00 per million input tokens. For simple classification, structured extraction, or routing decisions, defaulting to a frontier model is one of the most expensive habits in production AI.
Bedrock's Intelligent Prompt Routing makes this easier than it sounds. For $1.00 per 1,000 requests, it automatically selects between models in the same family based on prompt complexity. Simple queries go to the cheaper model, complex ones go to the more capable one. If routing saves 30 percent on a $5,000/month inference bill at 300,000 requests per month, that is $1,500 saved for $300 spent, a net saving of $1,200.
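Whether routing pays for itself depends on request volume as much as on the savings rate. The sketch below works through the example above, assuming 300,000 requests per month (the volume implied by $300 in routing fees). The function names are illustrative, not part of any AWS SDK.

```python
ROUTING_FEE_PER_1K_REQUESTS = 1.00  # USD, the routing fee quoted above


def routing_fee(monthly_requests: int) -> float:
    """Monthly routing charge for a given request volume."""
    return monthly_requests / 1000 * ROUTING_FEE_PER_1K_REQUESTS


def net_savings(monthly_bill_usd: float, monthly_requests: int, savings_rate: float) -> float:
    """Gross inference savings minus the routing fee."""
    return monthly_bill_usd * savings_rate - routing_fee(monthly_requests)


def break_even_savings_rate(monthly_bill_usd: float, monthly_requests: int) -> float:
    """Minimum fraction of the bill routing must save to cover its own fee."""
    return routing_fee(monthly_requests) / monthly_bill_usd


print(net_savings(5000, 300_000, 0.30))        # $1,500 gross minus $300 in fees
print(break_even_savings_rate(5000, 300_000))  # fraction of the bill needed to break even
```

At this volume routing only needs to save about 6 percent of the bill to break even. A workload with many small, cheap requests has a much higher break-even rate, so run the numbers before enabling it everywhere.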
These two levers are underused by most teams and the savings are not marginal. Prompt caching can reduce input token costs by up to 90 percent. Batch inference for non-real-time workloads saves 50 percent versus on-demand. Good candidates for batch: nightly document processing, content generation pipelines, embedding generation, and bulk analytics.
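How much caching actually saves depends on what fraction of your input tokens are cacheable. Here is a rough sketch using the 90 percent and 50 percent discounts mentioned above. The token volume and the $3.00 per million price are hypothetical, and cache-write surcharges are ignored for simplicity.

```python
def input_cost_with_caching(
    input_tokens: int,
    price_per_million_usd: float,
    cached_fraction: float,
    cache_discount: float = 0.90,  # "up to 90 percent" from the text; real hit rates vary
) -> float:
    """Input-token cost when a fraction of tokens is served from the prompt cache."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable = uncached + cached * (1 - cache_discount)
    return billable * price_per_million_usd / 1_000_000


def batch_cost(on_demand_cost_usd: float, batch_discount: float = 0.50) -> float:
    """Cost of the same workload run through batch inference."""
    return on_demand_cost_usd * (1 - batch_discount)


# Hypothetical workload: 1B input tokens per month at $3.00 per million.
baseline = input_cost_with_caching(1_000_000_000, 3.00, cached_fraction=0.0)
cached = input_cost_with_caching(1_000_000_000, 3.00, cached_fraction=0.6)
print(f"no caching: ${baseline:,.0f}, 60% cacheable: ${cached:,.0f}")
print(f"nightly job moved to batch: ${batch_cost(800.0):,.0f} instead of $800")
```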
A single user query can trigger multiple internal model calls through thinking, searching, tool calling, and summarizing, and you pay for all of them. For agentic workloads, multiply your base cost estimate by 5 to 8 to account for token amplification. A request that looks like a 2,000-token interaction can actually consume 10,000 to 16,000 tokens once the agent loop runs.
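A quick way to bake token amplification into a forecast is to carry the multiplier range through the estimate rather than a single number. This is a minimal sketch using the 5x to 8x range above. The multipliers are a rule of thumb, so measure your own agent loops before trusting them.

```python
def amplified_tokens(base_tokens: int, low: int = 5, high: int = 8) -> tuple[int, int]:
    """Range of tokens actually consumed once the agent loop runs."""
    return base_tokens * low, base_tokens * high


def amplified_monthly_cost(
    base_tokens: int,
    requests_per_month: int,
    price_per_million_usd: float,  # hypothetical blended token price
    low: int = 5,
    high: int = 8,
) -> tuple[float, float]:
    """Low and high monthly cost estimates for an agentic workload."""
    lo_tokens, hi_tokens = amplified_tokens(base_tokens, low, high)
    scale = requests_per_month * price_per_million_usd / 1_000_000
    return lo_tokens * scale, hi_tokens * scale


print(amplified_tokens(2_000))  # (10000, 16000)
print(amplified_monthly_cost(2_000, requests_per_month=100_000, price_per_million_usd=3.00))
```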
| Lever | Mechanism | Saving |
|---|---|---|
| Prompt caching | Cache repeating system prompts and prefixes | up to 90% |
| Batch inference | Non-real-time workloads processed in bulk | 50% |
| Intelligent prompt routing | Auto-route to cheapest model by complexity | 25–40% |
| Provisioned throughput | Reserved capacity for sustained traffic | 20–50% |
| S3 Vectors vs OpenSearch | Cheaper vector store for Knowledge Bases | up to 90% |
| Flex pricing | Discounted tier via standard APIs, with slightly higher latency | 50% |
Vertex AI has the most complex billing surface of the three platforms. It covers 15 or more separately billed services, each with its own meter. A Gemini call may look cheap in isolation, but the moment you add retrieval, grounding, or agent orchestration, you are stacking charges. For a deeper look at how Google approaches AI chip economics, see Google Cloud AI Chip — Workload Economics.
One of the most overlooked aspects of Vertex AI is that it charges differently depending on the deployment method. If you deploy a model to a dedicated endpoint, you are billed by the hour even if the endpoint is idle. Configure autoscaling with minimum replicas at zero for dev and staging. For production, match minimums to baseline traffic, not peak.
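To see what an always-on dev endpoint costs, estimate how many of its billed hours are actually used. The sketch below assumes a 30-day month and a hypothetical $1.50 per replica-hour rate. Substitute the rate for your own machine type.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month
HOURS_PER_WEEK = 168


def idle_waste(hourly_rate_usd: float, min_replicas: int, busy_hours_per_week: float) -> float:
    """Monthly spend on minimum replicas during hours when nobody is using them."""
    busy_fraction = min(busy_hours_per_week / HOURS_PER_WEEK, 1.0)
    always_on_cost = hourly_rate_usd * min_replicas * HOURS_PER_MONTH
    return always_on_cost * (1 - busy_fraction)


# Dev endpoint used about 40 hours a week, kept at one replica around the clock.
print(f"${idle_waste(1.50, min_replicas=1, busy_hours_per_week=40):,.2f} wasted per month")
# Same endpoint with minimum replicas at zero.
print(f"${idle_waste(1.50, min_replicas=0, busy_hours_per_week=40):,.2f} wasted per month")
```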
Cache aggressively. Context caching saves up to 90 percent on repeated input. Reads cost roughly 10 percent of base input price, with storage charges per million tokens per hour. If your app reuses system prompts or document preambles across requests, caching is the single highest-ROI optimization on the platform.
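Because caching trades a per-read discount against an hourly storage charge, it helps to model both sides. This sketch assumes reads at 10 percent of the base input price as described above. The base price, the storage rate, and the assumption that the initial cache write is billed at the base price are all illustrative.

```python
def cached_vs_uncached(
    prefix_tokens: int,
    requests: int,
    hours_cached: float,
    base_price_per_m_usd: float,        # hypothetical base input price
    storage_price_per_m_hour_usd: float,  # hypothetical storage rate
    read_discount: float = 0.90,        # reads cost ~10% of base, per the text
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for a prefix reused across requests."""
    millions = prefix_tokens / 1_000_000
    uncached = requests * millions * base_price_per_m_usd
    write = millions * base_price_per_m_usd  # assumption: one write at base price
    reads = requests * millions * base_price_per_m_usd * (1 - read_discount)
    storage = millions * storage_price_per_m_hour_usd * hours_cached
    return uncached, write + reads + storage


uncached, cached = cached_vs_uncached(
    prefix_tokens=50_000, requests=200, hours_cached=1,
    base_price_per_m_usd=1.25, storage_price_per_m_hour_usd=1.00,
)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
print(f"saving: {1 - cached / uncached:.0%}")
```

The saving approaches 90 percent only when the prefix is reused often enough for read discounts to dwarf the storage and write costs. A cache that is read a handful of times per hour can cost more than it saves.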
CUDs are Vertex AI's equivalent of Reserved Instances on AWS and they are underutilized by teams that think of Vertex as inherently pay-as-you-go. One-year CUDs save approximately 30 percent, and three-year CUDs save approximately 50 percent. Breakeven is roughly four months of consistent usage. For a full breakdown of how commitment models compare across providers, see Reserved vs On-Demand vs Spot Instances: A Cost Breakdown.
Spot VMs paired with autoscaling for non-time-sensitive batch training jobs can translate into savings of up to 80 percent relative to on-demand instances. If you are running evaluation jobs, fine-tuning, or batch embeddings generation, Spot VMs should be the default compute choice.
Gemini's extended thinking capability is powerful but expensive on default settings. Default to thinking level "medium." Google's documentation confirms medium delivers comparable reasoning depth to high, at meaningfully lower cost per request. For most production use cases, high-level thinking is only justified for genuinely complex reasoning tasks.
| Lever | Mechanism | Saving |
|---|---|---|
| Context caching | Reuse system prompts & document preambles | up to 90% |
| Scale to zero (idle endpoints) | Autoscale dev/staging to zero replicas | Eliminates idle waste |
| Spot VMs (training/batch) | Preemptible compute for fault-tolerant jobs | up to 80% |
| 1-year CUD | Committed use discount on steady workloads | ~30% |
| 3-year CUD | Longer commitment for stable production | ~50% |
| Thinking level "medium" | Comparable quality, lower token cost | Varies by use case |
Azure OpenAI is the natural choice for organizations already operating inside the Microsoft ecosystem. The billing mechanics are different from Bedrock and Vertex, and the hidden cost structure catches teams off guard.
Azure OpenAI pricing looks simple at first: pay per token, pick your model, done. But actual Azure OpenAI cost can run 15 to 40 percent higher than the advertised token prices. Token prices on Azure are identical to OpenAI's direct API. Azure adds the overhead through support plans, data egress, fine-tuning hosting, Private Link, and Log Analytics. Teams choose Azure for compliance and procurement, not price. Understanding this overhead is not a reason to avoid Azure. It is a reason to forecast accurately.
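For forecasting, the simplest fix is to carry the overhead range through your estimate instead of quoting the token bill alone. A minimal sketch, using the 15 to 40 percent range above (your own overhead depends on which of those services you actually run):

```python
def forecast_range(
    token_cost_usd: float,
    low_overhead: float = 0.15,
    high_overhead: float = 0.40,
) -> tuple[float, float]:
    """Low and high all-in estimates for a given token-only bill."""
    return token_cost_usd * (1 + low_overhead), token_cost_usd * (1 + high_overhead)


low, high = forecast_range(10_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $11,500 to $14,000
```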
PTUs are units of reserved Azure OpenAI model capacity that guarantee throughput and low latency. Azure bills PTUs based on provisioned capacity, not actual usage. Unlike pay-as-you-go tokens, PTU capacity you do not use is lost, and you still pay for it.
PTU is cheaper than Pay-As-You-Go only when sustained utilization exceeds 50 percent and monthly token volume is above 150 to 200 million tokens on GPT-4o. Below that line, pay-as-you-go wins on flexibility and total cost. Annual commitments save another approximately 35 percent versus monthly PTU pricing.
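The break-even point is easy to estimate once you know two numbers: your monthly PTU commitment and your blended pay-as-you-go price. Both values in this sketch are hypothetical placeholders chosen for illustration, not Azure list prices, and the sketch assumes the PTU reservation has enough capacity to serve the volume.

```python
def break_even_tokens_millions(ptu_monthly_usd: float, payg_price_per_m_usd: float) -> float:
    """Monthly volume (millions of tokens) at which PTU and pay-as-you-go cost the same."""
    return ptu_monthly_usd / payg_price_per_m_usd


def cheaper_option(tokens_millions: float, ptu_monthly_usd: float, payg_price_per_m_usd: float) -> str:
    """Compare a fixed PTU bill against usage-based billing for a given volume."""
    payg_cost = tokens_millions * payg_price_per_m_usd
    return "PTU" if payg_cost > ptu_monthly_usd else "pay-as-you-go"


# Hypothetical: $900/month PTU commitment, $5.00 blended price per million tokens.
print(break_even_tokens_millions(900, 5.00))            # 180.0
print(cheaper_option(120, 900, 5.00))                   # pay-as-you-go
print(cheaper_option(250, 900, 5.00))                   # PTU
```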
GPT-5-nano is the cheapest Azure OpenAI model at $0.05 input and $0.40 output per million tokens, followed by GPT-4.1-nano at $0.10 and $0.40. Both are designed for high-volume, simple tasks. Running GPT-4o for tasks that GPT-4.1-nano handles equally well is a 25 to 50 times cost multiplier. Audit your model-to-task assignments quarterly.
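A quarterly audit can start with a simple comparison of monthly cost per model for the same traffic. The GPT-4.1-nano prices below come from the figures above. The larger-model prices ($2.50 input, $10.00 output per million tokens) are an assumption for illustration, so check them against the current price list.

```python
def monthly_cost(
    input_tokens_m: float,
    output_tokens_m: float,
    price_in_per_m_usd: float,
    price_out_per_m_usd: float,
) -> float:
    """Monthly token cost for a workload, with volumes in millions of tokens."""
    return input_tokens_m * price_in_per_m_usd + output_tokens_m * price_out_per_m_usd


# Same hypothetical workload on both models: 500M input, 100M output tokens per month.
nano = monthly_cost(500, 100, price_in_per_m_usd=0.10, price_out_per_m_usd=0.40)
large = monthly_cost(500, 100, price_in_per_m_usd=2.50, price_out_per_m_usd=10.00)  # assumed prices

print(f"nano: ${nano:,.0f}, larger model: ${large:,.0f}, multiplier: {large / nano:.0f}x")
```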
| Lever | Mechanism | Saving |
|---|---|---|
| Prompt caching | Auto-fires on prefix match, no code changes | 50–90% |
| Batch API | Non-real-time workloads, 24-hr SLA | 50% |
| Model right-sizing | GPT-4.1-nano vs GPT-4o where quality allows | 25–50× |
| PTU (reserved) | Fixed capacity for sustained >150M tokens/mo | up to 70% |
| Annual PTU commitment | 12-month lock-in on top of PTU rate | ~35% vs monthly PTU |
| Global deployment | Highest throughput, lowest rate, no residency | Varies by region |
Here is how the maximum savings from each optimization lever compare across all three platforms. Use our Cloud Cost Calculator to model these numbers against your own usage patterns.
Figure 2: Maximum savings potential by optimization lever across Bedrock (orange), Vertex AI (green), and Azure OpenAI (blue). Prompt caching and Spot VMs represent the highest-ceiling opportunities.
Most enterprise AI deployments do not live on a single cloud. They span all three platforms, sometimes for the same end product.
Enterprises cannot govern AI GPU economics with cloud dashboards alone, because those dashboards speak three different cost languages and stop at their own cloud boundary. Bedrock bills in per-request tokens. Vertex AI bills in compute hours, token units, and service-specific meters. Azure bills in tokens and PTU-hours. Comparing spend across these platforms requires normalization, and native dashboards do not do it.
| Capability | AWS Bedrock | Vertex AI | Azure OpenAI |
|---|---|---|---|
| Attribution mechanism | Application Inference Profiles (AIPs) — best-in-class | Project labels — excellent with strong project hygiene | Tag inheritance at scale — easiest for large orgs |
| Cost data source | CUR + CloudWatch (token-level) | Monitoring/Logging + BigQuery | Azure Cost Management + token metrics per deployment |
| Budget alerts | AWS Budgets (per service) | GCP Budget Alerts + Optimization Hub | Azure Cost Management (50/80/100% thresholds) |
| Commitment model | Provisioned Throughput (model units) | Committed Use Discounts (CUDs) | Provisioned Throughput Units (PTUs) |
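One way to approach the normalization problem is to reduce every platform's bill to a single unit, such as cost per million tokens, by amortizing hourly and PTU charges over the tokens they served. The sketch below assumes you have already exported cost and token counts from each provider. The record shape and the sample figures are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    platform: str     # e.g. "bedrock", "vertex", "azure"
    cost_usd: float   # token charges, compute hours, or PTU-hours, all in USD
    tokens: int       # tokens served by this line item


def cost_per_million_by_platform(records: list[UsageRecord]) -> dict[str, float]:
    """Collapse mixed billing units into one comparable number per platform."""
    cost: dict[str, float] = defaultdict(float)
    tokens: dict[str, int] = defaultdict(int)
    for record in records:
        cost[record.platform] += record.cost_usd
        tokens[record.platform] += record.tokens
    # Skip platforms with no token volume to avoid dividing by zero.
    return {p: cost[p] / tokens[p] * 1_000_000 for p in cost if tokens[p] > 0}


records = [
    UsageRecord("bedrock", 1200.0, 400_000_000),  # per-request tokens
    UsageRecord("vertex", 900.0, 250_000_000),    # endpoint hours plus token units
    UsageRecord("azure", 500.0, 100_000_000),     # pay-as-you-go tokens
    UsageRecord("azure", 700.0, 200_000_000),     # PTU-hours amortized over tokens served
]
unit_costs = cost_per_million_by_platform(records)
print(unit_costs)
```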
The minimum viable tagging strategy across all three platforms: every deployment tagged with team, product, environment (production, staging, dev), and use case. Without these four tags, cost attribution is guesswork. Bedrock's Application Inference Profiles are the standout feature here. When teams attach profiles at call time, profile names show up in cost data, making showback by team or feature straightforward.
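The four-tag minimum is easy to enforce in CI or a deployment hook. Here is a minimal sketch of such a check. The tag key names (for example `use_case`) are an assumed naming convention, and how you read tags from each cloud is left out.

```python
REQUIRED_TAGS = ("team", "product", "environment", "use_case")  # assumed key names
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return a list of tagging problems; an empty list means the deployment passes."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {environment}")
    return problems


good = {"team": "search", "product": "assistant", "environment": "staging", "use_case": "rag"}
bad = {"team": "search", "environment": "prod"}

print(tag_problems(good))  # []
print(tag_problems(bad))
```

Failing the deployment when the list is non-empty is what turns tagging from a convention into a guarantee.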
If you are starting from scratch on AI cost management, here is the sequence that delivers the most savings fastest, ordered by effort-to-impact ratio. Teams that have also tackled auto-scaling strategies tend to see the fastest compound returns.
Figure 3: Optimization priority roadmap. Tier 1 actions deliver immediate savings with zero architectural changes. Each tier builds on the one before it.
The managed AI cost landscape is moving fast. A few developments worth tracking closely:
Vertex AI is transitioning to the Gemini Enterprise Agent Platform, a rebrand consolidating Vertex AI and Agentspace into one product. The billing mechanics have not changed, but the product surface is expanding. New services mean new billing meters.
Google's Committed Use Discounts now support multiple prices for the same SKU, which creates more flexibility for enterprise negotiation. Google also introduced Optimization Hub, a central place for cost-saving recommendations, and a new Cost Explorer tool that makes it easier for developers to visualize spending.
On the Bedrock side, the model catalog continues expanding rapidly. Six new models were added in February 2026 alone. Newer models frequently outperform older ones at lower cost, so a quarterly model review is not optional. It is table stakes for cost management.
The teams controlling AI spend in 2026 are not the ones with the strictest budgets. They are the ones who treated cost visibility as an engineering problem, not a finance problem. They built unit economics into their architecture from the beginning, tagged every deployment, and treated prompt design as a cost lever.
Without proactive cost management, the pay-as-you-go flexibility of managed AI platforms quickly becomes a liability. Idle endpoints, inefficient model selection, and unoptimized token consumption can turn a $5,000 budget into a $20,000 surprise bill.
The patterns described in this article are not theoretical. They are what separates teams with explainable, governable AI spend from teams that are perpetually chasing their own bills.