The AI Infrastructure Stack · Part 3 of 3
In Part 1 of this series, we examined why AI agents are dismantling the per-seat SaaS business model. In Part 2, we looked at why that shift puts enormous new pressure on the storage layer underneath all of it. Both articles were about forces already in motion. This one is about the missing context layer teams need to make all of that work in production.
Building AI models in production is harder than the headlines suggest. The demos look clean. The GitHub repos look tidy. What actually happens when an engineering team sits down to build, run, and operate ML workflows on real infrastructure is messier, slower, and more expensive than most organizations expect before they try it.
This is true whether the team is building a large language model for a customer-facing application or a classical ML model: a regression pipeline forecasting demand, a classification system routing support tickets, a recommendation engine deciding what a user sees next. The operational problems are the same across both worlds. The context gap is just as wide.
Russ Artzt has been watching this problem develop. The co-founder of CA Technologies and former executive chairman and head of R&D at RingLead (since acquired by ZoomInfo), Artzt is now an advisor to SkyPortal, a company targeting what he describes as one of the most painful and least glamorous problems in the current AI stack: the fact that when a model misbehaves, most teams have no single place to see the environment, code, run history, and infrastructure signals together.
“There's a lot of pain in building these models. It sometimes takes months because people can't figure out how to do it. They can't use the tools right, they don't get enough information, they don't have a way of debugging it.”
— Russ Artzt
Anyone who has run a serious AI development environment knows the feeling. You're not working with one tool. You're working with a constellation of them, each with its own interface, its own conventions, and its own AI assistant with its slice of context.
The standard ML stack might include MLflow or Weights & Biases for experiment tracking and run analysis, AWS SageMaker or Google Vertex AI for managed ML platforms, Kubernetes for orchestration, a vector database for RAG (retrieval-augmented generation) pipelines, and a GPU cluster from a neocloud provider or hyperscaler for compute. Each of these systems has to be integrated and monitored in production, and even then teams still have to connect the context across them. The MLOps market has grown from $1.58 billion in 2024 to a projected $19.55 billion by 2032, reflecting just how much organizational energy is going into managing this complexity.
“You have a lot of different tools and solutions that have to be configured in the right way,” Artzt explained. “And they're very complex. You've got to know what you're doing.”
A further complication: most of the AI-powered assistants built into these tools are excellent within their own domain. SageMaker's observability features understand SageMaker. Weights & Biases' AI features understand experiment runs. But none of them can see across compute, models, and applications simultaneously. Teams dealing with a production failure that spans all three layers are, in practice, on their own.
The problem isn't that the individual tools are bad. MLflow is genuinely powerful for experiment tracking and model management. Weights & Biases, now part of CoreWeave, is widely used for experiment tracking, visualization, and collaboration across model development teams. But each tool sees only part of the workflow. Teams still have to connect those signals to the rest of the pipeline, and when that context is fragmented, the errors aren't always loud.
Here's the insidious part of broken context in AI pipelines: the failure mode often looks identical to a model that just doesn't work.
In an LLM-based system, you get hallucinations. You get outputs that don't match expectations. You spend weeks trying to improve the model itself before realizing the problem was never the model at all. It was a misconfigured retrieval pipeline starving the model of the context it needed to answer correctly.
In classical ML, the failure is quieter but just as costly. A demand forecasting model starts returning degraded results after a data schema change upstream. A classification pipeline drifts slowly as the distribution of incoming data shifts, and nobody notices until business outcomes deteriorate. In both cases, the symptom looks like a model problem. The cause is a context problem: nobody had a unified view of what changed, when it changed, and what effect it had.
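The two classical-ML failures described above, an upstream schema change and a slow shift in the input distribution, can both be checked mechanically. The following is a minimal sketch, not a description of any particular product: the function names, the example column names, and the sample data are all hypothetical, and the population stability index (PSI) is swapped in here as one common drift measure among several.

```python
import math
from collections import Counter


def schema_diff(expected: dict, observed: dict) -> dict:
    """Compare column-name -> type-name mappings and report what changed."""
    shared = set(expected) & set(observed)
    return {
        "missing": sorted(set(expected) - set(observed)),
        "added": sorted(set(observed) - set(expected)),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


def population_stability_index(baseline, current, bins=10):
    """PSI between two numeric samples, using equal-width bins over the baseline range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bucket_shares(values):
        # Values outside the baseline range are clipped into the first or last bin.
        counts = Counter(
            min(max(int((v - lo) / width), 0), bins - 1) for v in values
        )
        # A small floor avoids log(0) for empty buckets.
        return [max(counts.get(b, 0) / len(values), 1e-6) for b in range(bins)]

    base, cur = bucket_shares(baseline), bucket_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical demand-forecasting inputs before and after an upstream change.
expected = {"store_id": "int", "units_sold": "float", "date": "str"}
observed = {"store_id": "str", "units_sold": "float", "sale_date": "str"}
print(schema_diff(expected, observed))

baseline = list(range(100))             # stand-in for training-time feature values
shifted = [v + 50 for v in baseline]    # stand-in for drifted production values
print(population_stability_index(baseline, shifted))
```

A PSI above roughly 0.2 is often treated as a sign of meaningful shift, but that threshold is a rule of thumb rather than a standard, and it should be tuned per feature.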
“You know it's not working because you're hallucinating, you're getting bad results. And how do you fix it? You need a debugger.”
— Russ Artzt
That framing is important. In traditional software development, a debugger is a tool that lets engineers pause a running program, inspect its internal state, and trace exactly what happened leading up to a failure: the equivalent of being able to freeze a machine mid-operation and read every dial at once. Nearly every serious language and runtime either ships with a debugger or supports one. In AI model development, the missing layer is broader: teams need context across environments, code, runs, and infrastructure, and the tooling for that is still nascent.
“How do you have a model that doesn't work? How do you help the user debug their ML models and their AI models? How do you help resolve these problems?” Artzt asked. “You'd be able to breakpoint it, stop it at a certain point, and see where you're at. Hard to do that.”
The consequences of this gap are measurable. Studies show 70% of production ML issues are organizational rather than technical, meaning teams spend most of their debugging time navigating unclear ownership, inconsistent environments, and opaque tooling instead of working from one shared operational picture. A 2024 Deloitte survey found that 38% of business executives reported making incorrect decisions based on hallucinated AI outputs. That figure sits downstream of the same issue: models reaching production without adequate context to catch workflow and infrastructure problems before they surface as wrong answers.
ML models don't exist in isolation. They are components of larger systems, and they pass through organizations before they reach production. A model is typically built by a data scientist or researcher. It gets handed to an ML engineer who integrates it into a pipeline. That pipeline gets handed to a software engineering team or MLOps team that deploys and maintains it. In larger organizations, each of those handoffs crosses a team boundary. In some cases, it crosses an organizational boundary entirely.
At each handoff, context gets lost. The data scientist who understood the model's assumptions, its training data quirks, and its known failure modes is no longer in the loop by the time the model is in production. The ML engineer who built the pipeline may not be reachable when the software team encounters a degraded output six months later. The product manager trying to coordinate a fix doesn't know who owns what, because ownership is distributed across functions that don't share a system of record.
The result is a structural accountability problem that no individual tool solves. Data scientists are still responsible for the results their models produce in production, but they have no visibility into what's happening. Software engineers are operating pipelines they didn't build and don't fully understand. PMs are coordinating across teams without a shared picture of the system's state. Teams aren't failing because the math is wrong. They're failing because the context that would let them diagnose and fix problems dissolves at every boundary the model crosses on its way to production.
SkyPortal's argument is that a unified context layer — one that spans compute utilization, model run history, and application behavior simultaneously — is the structural fix for this problem. Not because it removes the organizational complexity, but because it gives every team in the chain a shared operational picture to reason from.
SkyPortal's thesis is that the AI model development stack needs something analogous to what integrated development environments brought to software: a shared context layer that can see the environment, code, run history, and monitoring together across compute, models, and applications, explain why a model is behaving the way it is, and propose the next step.
| Domain | What it covers | Why it matters |
|---|---|---|
| Compute context | GPU/CPU utilization, cluster configuration, infrastructure performance | Correlates infrastructure behavior with model outcomes |
| Model context | Experiment history, parameter changes, training runs, lineage | Explains what changed and when |
| Application context | Downstream consumption, failure surfaces, end-to-end behavior | Connects “bad outputs” to operational causes |
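One way to picture the three domains in the table is as a single record per model run. The sketch below is purely illustrative: every class, field, and value is a hypothetical stand-in, and nothing here reflects how SkyPortal actually models this data. It only shows why holding the three contexts together makes "what changed and when" a simple query.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ComputeContext:
    gpu_utilization: float                       # mean utilization for the run, 0.0 to 1.0
    cluster_config: dict = field(default_factory=dict)


@dataclass
class ModelContext:
    run_id: str
    params: dict = field(default_factory=dict)   # hyperparameters logged for the run
    parent_run_id: Optional[str] = None          # lineage: the run this one derives from


@dataclass
class ApplicationContext:
    consumer: str                                # downstream service using the model
    error_rate: float                            # share of failed requests, 0.0 to 1.0


@dataclass
class RunSnapshot:
    """One record joining the three domains for a single model run."""
    compute: ComputeContext
    model: ModelContext
    application: ApplicationContext


def changed_params(previous: RunSnapshot, current: RunSnapshot) -> dict:
    """Return {name: (old, new)} for every parameter that differs between runs."""
    old, new = previous.model.params, current.model.params
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(set(old) | set(new))
        if old.get(k) != new.get(k)
    }


# Hypothetical example: a batch-size change coincides with lower GPU
# utilization and a higher downstream error rate.
before = RunSnapshot(
    ComputeContext(0.82, {"gpus": 8}),
    ModelContext("run-001", {"lr": 0.001, "batch_size": 64}),
    ApplicationContext("ticket-router", 0.01),
)
after = RunSnapshot(
    ComputeContext(0.41, {"gpus": 8}),
    ModelContext("run-002", {"lr": 0.001, "batch_size": 512}, parent_run_id="run-001"),
    ApplicationContext("ticket-router", 0.07),
)
print(changed_params(before, after))  # {'batch_size': (64, 512)}
```

Because the compute and application fields travel with the model record, the parameter diff can be read next to the utilization drop and the error-rate rise rather than in three separate tools.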
“SkyPortal tends to be helpful with that,” Artzt explained. “SkyPortal will be able to understand the correct configuration for a number of these tools, and tell you when you've configured it wrong. Tell you when you run your model why it's not running well. Tell you when you're doing different things.”
The practical scope is broader than just flagging bad configuration. Artzt described a system that can identify where workflow context breaks down, across tooling, runtime, or infrastructure, and prescribe specific fixes: “You didn't set up MLflow right. You didn't change this configuration. Do this and do that and try it again. That's why you're getting bad results.” The system also monitors performance across the GPU and CPU layers, giving visibility into whether infrastructure problems are contributing to model behavior issues.
Critically, the system is designed to be prescriptive, not just diagnostic. It can recommend the missing change, explain it in the context of a broader workflow, and in some cases propose or automate part of the fix. “Sometimes it will say, you know what, you need a different tool. We're going to help you install it. You should use MLflow and we'll install it for you.”
That last detail matters. A system that tells you what changed, why it matters, and what to do next is far more useful than a dashboard that only throws alerts. The value proposition of SkyPortal, as Artzt describes it, is closing the loop between detection and resolution, collapsing what currently takes weeks of expert-level troubleshooting into a workflow that a team without deep MLOps experience can navigate.
The market context for SkyPortal is inseparable from the talent situation described in Part 1 of this series. AI engineers are scarce and expensive. The best ones command seven-figure compensation. Most organizations building AI applications today are doing so with teams that have varying levels of MLOps maturity, some deep, many not.
“How many AI people do you think really know how to build a data center? How do you build a data center? How do we place these GPUs? What's the best way? What kind of clusters do you have to set up? What kind of networking do you need? What kind of storage do you need? It's complicated,” Artzt said.
That question applies equally to model development environments. McKinsey's 2024 AI survey found that 72% of organizations have adopted AI, up from the roughly 50% level where adoption had hovered for the prior six years. But adoption is not the same as operational maturity. Many of those organizations are running models, both classical ML pipelines and LLM-based applications, in environments that were configured by people learning as they went, using tools they'd never encountered before, without a shared context layer to catch problems before they compound.
The result is what Artzt described as months-long stalls. Teams that should be iterating quickly on model improvements are instead spending their time trying to diagnose whether the problem is the model, the data, the runtime, or the infrastructure underneath it. SkyPortal's bet is that the cost of that diagnostic work is high enough, and widespread enough, that there's a real market for tooling that reduces it.
Beyond model-level debugging, SkyPortal also addresses something that sounds unglamorous but matters enormously in practice: knowing whether your GPU infrastructure is actually doing what you think it's doing, and seeing that in the same operating picture as the model itself.
“It also does monitoring of your GPUs, tells you what the performance is going on in your CPUs and your GPUs,” Artzt said. “So it gives you some performance monitoring, but it also understands when you configure something wrong.”
This is more significant than it sounds. GPU clusters are expensive. An enterprise team running training workloads on CoreWeave or a hyperscaler is paying for that compute by the hour. A misconfigured pipeline that idles GPUs waiting on data, or a training job running inefficiently because of a suboptimal batch size or parallelization setup, is burning money at a rate that most teams don't have good visibility into.
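The cost of idle GPUs is easy to put a rough number on. The sketch below assumes a flat hourly rate per GPU and treats average utilization as a proxy for useful work, which is a simplification; the function name, the rate, and the cluster size are hypothetical and not taken from any provider's price list.

```python
def idle_gpu_cost(gpu_count: int, hourly_rate_usd: float,
                  hours: float, utilization: float) -> float:
    """Estimate spend on GPU time that did no useful work.

    utilization is the average fraction of time the GPUs were busy (0.0 to 1.0).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return gpu_count * hourly_rate_usd * hours * (1.0 - utilization)


# Hypothetical: 8 GPUs at $2.50 per GPU-hour, one day, 50% average utilization.
print(idle_gpu_cost(8, 2.50, 24, 0.5))  # 240.0
```

Under those assumed numbers, half of a $480 daily bill pays for GPUs that are waiting on data rather than training.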
Connecting infrastructure monitoring to model behavior monitoring is the architectural bet SkyPortal is making. The two problems are related. A model that's performing poorly might be doing so because of a tracking or runtime issue, a data pipeline bottleneck that's starving the GPU, or a hardware issue in the cluster itself. A system that can see across all three layers and correlate them is more useful than three separate tools that each see one.
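The three candidate causes named above can be expressed as a small triage over signals from each layer. This is a hedged sketch of the idea of correlating layers, not SkyPortal's method: the signal names, the 0.3 threshold, and the wording of each diagnosis are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerSignals:
    run_logged_params: bool    # tracking/runtime layer: did the run record its parameters?
    data_wait_fraction: float  # share of step time spent waiting on input data, 0.0 to 1.0
    gpu_error_count: int       # hardware errors reported by the cluster during the run


def likely_causes(signals: LayerSignals, data_wait_threshold: float = 0.3) -> list:
    """List infrastructure-side explanations worth ruling out before retraining."""
    causes = []
    if not signals.run_logged_params:
        causes.append("tracking or runtime issue")
    if signals.data_wait_fraction > data_wait_threshold:
        causes.append("data pipeline bottleneck")
    if signals.gpu_error_count > 0:
        causes.append("cluster hardware issue")
    return causes or ["no infrastructure cause found; inspect the model itself"]


# Hypothetical run: tracking is fine, but the GPUs spend 45% of each step waiting on data.
print(likely_causes(LayerSignals(True, 0.45, 0)))  # ['data pipeline bottleneck']
```

The point of the example is the shape of the check: one function receives signals from all three layers at once, which is exactly what three separate single-layer tools cannot do.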
The three articles in this series tell a connected story. Agentic software is reshaping the application layer (Part 1). That reshaping creates massive new demand on the storage layer beneath it (Part 2). And building the models that power that agentic software, whether those are large language models or classical ML pipelines that have been running in production for years, requires a level of end-to-end operational context that the industry is still catching up to (Part 3).
SkyPortal sits at the intersection of the second and third problems. It's a unified ML operations context layer for the model development environment, which lives on top of the GPU compute and storage layers examined in Part 2. As those layers scale, the complexity of managing the environments that run on top of them scales with them.
Artzt framed it as an infrastructure play with a specific and urgent pain point. “A lot of people are building models today and having trouble. That's the pain point they're trying to solve.”
The interesting strategic question is how SkyPortal's category evolves as the tooling matures. The MLOps market is consolidating quickly, with major platforms like Databricks' managed MLflow offering increasingly integrated capabilities for experiment tracking, governance, and observability. But managed platforms assume a level of standardization that many real-world AI deployments don't yet have. The gap between the clean managed-service experience and the messy reality of multi-tool, multi-cloud, heterogeneous AI stacks, running both modern LLM applications and legacy classical ML pipelines side by side, is exactly where a context layer like SkyPortal is designed to operate.
One of the recurring themes across this entire series is the gap between the AI infrastructure narrative and the operational reality. The narrative is about compute power, model capability, and the speed of innovation. The reality is about expertise shortages, fragmented tooling, missing context, ownership gaps, egress bills, storage bottlenecks, and debugging cycles that stretch for months.
Artzt's view is bracingly practical about all of it. He's not pessimistic about AI. He's one of its clearest advocates. But his perspective comes from decades of watching transformative technologies encounter their operational ceilings: the point where the potential is obvious but the infrastructure to realize it is still being built.
The mainframe-to-client-server transition took years. The client-server-to-cloud transition took years. “It's going to take 10 years, maybe more,” he said of the current AI data center buildout. The ML operations context problem is part of the same long arc, and the companies that build durable solutions to the operational friction in AI pipelines, not just the headline problems, are the ones that tend to win in those transitions.
“You better configure it right. If you don't, you're going to have trouble.”
— Russ Artzt
That's not a warning about a niche edge case. In 2026, with AI adoption at 72% and MLOps expertise still scarce, it's a description of where most organizations actually are: trying to build AI across both modern LLM applications and the classical ML systems they've depended on for years, without a unified view of what changed, why it changed, and what to do next.
This is the final installment in DataStorage.com's three-part series on the AI infrastructure stack.
Read Part 1: “The SaaS Reckoning” and Part 2: “Storage Is the Anchor.”
Russ Artzt is co-founder of CA Technologies and former executive chairman and head of R&D at RingLead, acquired by ZoomInfo. He serves as an advisor to SkyPortal and speaks with DataStorage.com regularly on AI infrastructure, enterprise software strategy, and the evolving data stack.