A model that passes every offline test can still take down a service the day it ships. Not because the model is wrong, but because the thing you deployed is not a stateless web handler: it is a stateful, hardware-bound, version-fragile artifact that your existing CI/CD pipeline was never designed to carry. The gap between “the model works” and “the model runs in production at an acceptable cost” is where most cloud-and-DevOps trouble for AI actually lives.

The intuition most teams arrive with is that an AI model is just another service: containerize it, push it through the same pipeline, scale it behind a load balancer. At the surface that is mostly true. Underneath, almost none of the assumptions that make a stateless microservice cheap and predictable hold for an inference workload. That mismatch is the subject of this article.

Why “Just Deploy It Like Any Other Service” Fails

Standard web services have a comfortable shape. They are stateless, CPU-bound, start in milliseconds, scale horizontally on commodity nodes, and their cost scales roughly linearly with traffic. A DevOps platform built around Kubernetes, a container registry, and an autoscaler is tuned for exactly that shape.

An inference service violates almost every one of those assumptions. It is bound to a specific accelerator, often holds gigabytes of weights resident in GPU memory, can take tens of seconds to load a model before it serves a single request, and its cost per request is dominated by hardware that sits idle between bursts. Treating it as a stateless service does not produce a slow service; it produces an expensive, brittle one that fails in ways your dashboards were not built to show.

We see this pattern regularly: a team ships an inference endpoint, the autoscaler reacts to a traffic spike by launching new pods, and each new pod spends 40 seconds pulling a multi-gigabyte image and loading weights into VRAM before it accepts traffic. By the time the new replicas are warm, the spike is over.
The autoscaler then scales them back down, and the next spike repeats the whole expensive dance. The platform behaved exactly as designed. The design just assumed millisecond cold starts.

The correct framing is that an AI deployment is a stateful, hardware-coupled, version-fragile artifact, and the cloud and DevOps practice around it has to absorb those three properties explicitly rather than pretend they are not there.

The Three Properties That Break the Pipeline

Hardware coupling

A trained model is not portable in the way a JAR or a static binary is. It is coupled to a runtime (PyTorch, TensorRT, ONNX Runtime), which is coupled to a driver and CUDA version, which is coupled to a specific class of GPU. Change any layer and behaviour can shift, anywhere from a silent accuracy regression to an outright load failure. This is why the four-axis CUDA compatibility problem shows up the moment you try to run the same container on a different node pool than the one you tested on.

For DevOps this means the build artifact is no longer “the container.” It is the container plus an implicit contract with the node’s driver and accelerator. A pipeline that does not pin and verify that contract ships a time bomb that detonates on the first heterogeneous cluster.

Statefulness and warm-up

Weights have to be loaded into accelerator memory before the service is useful, and that load is slow relative to a normal pod start. The practical consequence is that scale-to-zero, the cloud-native reflex for cost control, is frequently the wrong move for GPU-backed inference, because the cold-start penalty lands on the user-facing request path. The cost lever is real, but it interacts with the warm-up cost in ways that have to be measured, not assumed.
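
To make the warm-up arithmetic concrete, here is a minimal sketch of the timing mismatch between a traffic spike and a slow-starting replica. The function names, the 15-second autoscaler reaction time, the 45-second spike, and the request rates are illustrative assumptions, not measurements; only the 40-second load time comes from the example in this article.

```python
import math


def useful_fraction(spike_seconds: float, reaction_seconds: float,
                    warmup_seconds: float) -> float:
    """Fraction of a spike during which newly launched replicas are serving.

    reaction_seconds: assumed delay before the autoscaler launches pods.
    warmup_seconds:   image pull plus weight load before a pod accepts traffic.
    """
    if spike_seconds <= 0:
        raise ValueError("spike_seconds must be positive")
    ready_at = reaction_seconds + warmup_seconds
    # Replicas that become ready after the spike ends contribute nothing.
    return max(0.0, spike_seconds - ready_at) / spike_seconds


def warm_pool_size(peak_rps: float, per_replica_rps: float) -> int:
    """Replicas that must already be warm to absorb the peak on their own."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return math.ceil(peak_rps / per_replica_rps)


if __name__ == "__main__":
    # GPU-backed service: 40 s warm-up against an assumed 45 s spike.
    print("inference:", useful_fraction(45, 15, 40))
    # Stateless service: assumed sub-second start against the same spike.
    print("stateless:", useful_fraction(45, 15, 0.5))
    # Assumed peak of 120 req/s and 25 req/s per warm replica.
    print("warm pool:", warm_pool_size(120, 25))
```

Under these assumed numbers the new GPU replicas contribute nothing to the spike, while the stateless service gets most of the spike covered. That gap is the argument for measuring the weight-load time and sizing a pre-warmed pool for the burst rather than relying on reactive scaling alone.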

Version fragility

In a normal service, “the code is the truth.” In an AI service, the deployed behaviour depends on the model weights, the preprocessing code, the runtime, and the data distribution the model now sees in production, and the last of those changes without anyone deploying anything. A model whose accuracy quietly decays as live data drifts away from its training distribution is a deployment problem that no Git diff will ever surface. Distinguishing that from a hardware-side regression is its own discipline; we treat model drift and hardware drift as separate failure classes because the remediation is completely different.

A Decision Table: Stateless Service vs. Inference Service

This is the contrast worth keeping in front of any team about to put a model behind a pipeline. Each row is a place where the cloud-native default needs a deliberate override.

| Dimension | Stateless web service | AI inference service | What changes in your pipeline |
| --- | --- | --- | --- |
| Cold start | Milliseconds | Seconds to tens of seconds (weight load) | Pre-warm pools; avoid naive scale-to-zero on the request path |
| Cost driver | Request volume × CPU | Accelerator-seconds (idle or busy) | Optimize for utilization, not just replica count |
| Build artifact | Container image | Image + driver/CUDA/runtime contract | Pin and verify the hardware contract in CI |
| Scaling signal | CPU / request rate | Queue depth, batch fill, GPU memory | Custom autoscaling metrics, not CPU% |
| Correctness over time | Stable until code changes | Degrades as data drifts | Continuous monitoring of output quality, not just uptime |
| Resource granularity | Fractional CPU per pod | Whole or sliced GPU per pod | Bin-packing and MIG/MPS decisions enter the platform layer |

The table is not exhaustive, but if a team can answer the right-hand column for each row, most of the expensive surprises have already been designed out.

What MLOps Adds on Top of DevOps

It helps to be precise about the boundary. DevOps owns the path from committed code to a running, observable service.
MLOps does not replace that; it extends it with three concerns that classical DevOps has no native vocabulary for: the model as a versioned artifact, the data as an input that drifts, and evaluation as a release gate.

In practice this looks like model registries (MLflow and similar) sitting beside the container registry, data and feature versioning sitting beside source control, and an offline-evaluation step gating promotion the way a test suite gates a merge. The pipeline is still a pipeline. It simply has more kinds of artifact flowing through it, and more kinds of regression it has to catch.

The mistake we see most often is treating MLOps as a separate platform owned by a separate team, bolted on next to the “real” CI/CD. That split is where models go to rot: the DevOps pipeline ships the container, the data-science team owns the model, and nobody owns the contract between them. The whole point of folding AI into cloud and DevOps practice is that the contract becomes explicit and enforced, not tribal knowledge.

How Do You Know If Your Pipeline Is AI-Ready?

A short diagnostic. If you answer “no” or “we don’t measure that” to more than two of these, the pipeline is carrying risk it cannot see.

1. Hardware contract: Does CI verify the CUDA/driver/runtime versions on the target node pool, not just the build node?
2. Cold-start budget: Do you know your model’s weight-load time, and is your autoscaling policy aware of it?
3. Scaling signal: Does your autoscaler react to a GPU-relevant metric (queue depth, batch utilization) rather than CPU%?
4. Output monitoring: Are you monitoring prediction quality or distribution shift, not just latency and error rate?
5. Rollback unit: Can you roll back the model version independently of the application code?
6. Cost visibility: Can you attribute accelerator-seconds to a service and see idle time, not just an aggregate cloud bill?
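
The hardware-contract check in the diagnostic can be sketched as a small CI gate that compares a pinned contract against what the target node pool reports. This is an illustration under stated assumptions, not a definitive implementation: the `HardwareContract` fields, the keys of the `observed` dictionary, and every version string and GPU name below are hypothetical, and collecting the observed values from a real node (for example from `nvidia-smi` output) is left out.

```python
from dataclasses import dataclass, field


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '535.104.05' into (535, 104, 5)."""
    return tuple(int(part) for part in text.strip().split("."))


@dataclass
class HardwareContract:
    """What the image was built and tested against (hypothetical fields)."""
    min_driver: str                              # lowest acceptable driver
    allowed_gpus: frozenset                      # GPU models that were tested
    pinned: dict = field(default_factory=dict)   # exact-match versions


def contract_violations(contract: HardwareContract, observed: dict) -> list[str]:
    """Return human-readable violations; an empty list means the node passes."""
    problems = []
    if parse_version(observed["driver"]) < parse_version(contract.min_driver):
        problems.append(
            f"driver {observed['driver']} is older than {contract.min_driver}")
    if observed["gpu"] not in contract.allowed_gpus:
        problems.append(f"gpu {observed['gpu']} was not tested")
    for name, wanted in contract.pinned.items():
        found = observed.get(name)
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    # Illustrative values only; substitute the versions you actually tested.
    contract = HardwareContract(
        min_driver="550.54.14",
        allowed_gpus=frozenset({"A100", "H100"}),
        pinned={"cuda_runtime": "12.4"},
    )
    node = {"driver": "535.104.05", "gpu": "T4", "cuda_runtime": "12.2"}
    for problem in contract_violations(contract, node):
        print("VIOLATION:", problem)
```

Running a check like this against the target node pool, and failing the pipeline on any violation, is one way to turn the implicit contract into something CI enforces.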

These map directly to the three properties above: items 1 and 6 are hardware coupling, 2 and 3 are statefulness, and 4 and 5 are version fragility.

The Cost Conversation Is a Measurement Conversation

The headline reason organizations move AI into disciplined cloud and DevOps practice is cost. GPU-accelerated inference is expensive, and the expense is dominated by accelerators that bill whether they are busy or idle. The lever is utilization, not replica count, and utilization is something you have to measure under realistic load rather than infer from a spec sheet.

This is where it matters to remember that the GPU is not the system. The accelerator is one component in a pipeline that includes the CPU, the network, the storage path, and the host memory, and a bottleneck anywhere upstream starves the expensive part you are trying to keep busy. We have repeatedly seen “GPU too slow” diagnoses that turned out to be a data-loading or PCIe-topology problem. That is an observed pattern across our engagements rather than a benchmarked figure, but it is a consistent one.

The honest version of the cost claim is that sustained throughput under realistic load, not peak burst, is the operationally relevant number. It is the same distinction that separates peak performance from steady-state performance in AI systems. Sizing your cluster off a peak figure you will rarely hit is how you end up paying for idle accelerators; sizing it off a number you measured under production-like conditions is how you stop. Getting that sizing right is its own exercise in capacity planning for production inference.

FAQ

Is deploying an AI model different from deploying a normal microservice?

At the surface, no: you still containerize it and push it through a pipeline. Underneath, yes, substantially. An inference service is hardware-coupled, stateful (weights must load before it serves), and version-fragile (its behaviour drifts with live data).
Those three properties break cloud-native defaults like scale-to-zero and CPU-based autoscaling, so the pipeline needs deliberate overrides for each.

What is the difference between DevOps and MLOps?

DevOps owns the path from committed code to a running, observable service. MLOps extends that path with three concerns DevOps has no native vocabulary for: the model as a versioned artifact, the data as a drifting input, and evaluation as a release gate. MLOps does not replace your CI/CD; it adds model registries, data versioning, and quality gates alongside it.

Why is GPU inference so expensive in the cloud?

Because cost is dominated by accelerator-seconds, and the accelerator bills whether it is busy or idle. The lever is utilization, not replica count, and utilization has to be measured under realistic sustained load rather than inferred from peak specs. A bottleneck elsewhere in the pipeline (data loading, PCIe topology, host memory) can leave the expensive GPU idle while you pay for it.

Should I use scale-to-zero for an AI inference service?

Usually not on the user-facing request path. Scale-to-zero is the cloud-native reflex for cost control, but for GPU-backed inference the cold-start penalty (pulling a multi-gigabyte image and loading weights into VRAM) lands directly on the first request. The cost saving is real but interacts with warm-up cost in ways you have to measure before relying on it.

How do I know if my CI/CD pipeline is ready for AI workloads?

Check six things: does CI verify the CUDA/driver/runtime contract on the target node pool; do you know your model’s weight-load time; does autoscaling react to a GPU-relevant metric; do you monitor output quality and drift; can you roll back the model version independently of code; and can you attribute accelerator-seconds with idle time visible. More than two gaps means the pipeline is carrying risk it cannot see.
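
The cost-visibility check reduces to simple arithmetic over accelerator-seconds. A minimal sketch follows, assuming you can already obtain busy and billed seconds per service from your own metrics; the function name, the hourly rate, and the request count are hypothetical placeholders rather than real prices or measurements.

```python
def accelerator_cost_report(busy_seconds: float, billed_seconds: float,
                            hourly_rate: float, requests: int) -> dict:
    """Split a service's accelerator bill into busy and idle portions.

    busy_seconds:   time the accelerator spent doing inference work.
    billed_seconds: time the accelerator was allocated and billed.
    hourly_rate:    assumed price per accelerator-hour (hypothetical).
    """
    if billed_seconds <= 0 or requests <= 0:
        raise ValueError("billed_seconds and requests must be positive")
    if not 0 <= busy_seconds <= billed_seconds:
        raise ValueError("busy_seconds must be between 0 and billed_seconds")
    utilization = busy_seconds / billed_seconds
    total_cost = billed_seconds / 3600 * hourly_rate
    return {
        "utilization": utilization,
        "total_cost": total_cost,
        # Money spent on accelerator time that served no request.
        "idle_cost": total_cost * (1 - utilization),
        "cost_per_request": total_cost / requests,
    }


if __name__ == "__main__":
    # Illustrative: two billed accelerator-hours, half an hour of real work.
    report = accelerator_cost_report(
        busy_seconds=1800, billed_seconds=7200,
        hourly_rate=4.0, requests=10_000)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this made-up example three quarters of the spend is idle time, which is the figure an aggregate cloud bill hides and the one the utilization lever acts on.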

The deeper question is not whether your tooling can run a model; almost any modern platform can. It is whether your pipeline can see the three things that make an AI artifact different from the code it lives next to: the hardware it is bound to, the state it carries, and the way its correctness erodes while nothing in your repository changes. Build the pipeline that watches those, and most of the expensive surprises stop being surprises.