ML pipelines are not the same as software CI/CD

Software CI/CD pipelines are deterministic: the same code produces the same binary. ML training pipelines are not. The same code plus the same data may not produce the same model, because random initialisation, hardware differences, and non-deterministic GPU kernels in CUDA and cuDNN all push the result around. That single difference reshapes how an MLOps pipeline is designed, validated, and monitored, and it is the reason a CI/CD culture imported wholesale from software engineering tends to misfire on its first ML project.

A useful MLOps pipeline covers six stages: ingestion, validation and preprocessing, training, evaluation, deployment, and monitoring. Each stage has its own failure surface. We walk through them in the order data actually moves, then contrast the whole assembly with a software CI/CD pipeline.

What stages does an MLOps pipeline actually contain?

1. Data ingestion

The ingestion stage pulls raw data from source systems (relational databases, data lakes on object stores such as S3 or GCS, streaming sources like Kafka) into the pipeline.

Requirements: schema validation, data freshness checks, and volume checks. A pipeline that runs successfully but operates on yesterday’s data (because an upstream job failed silently) has trained on stale data without anyone noticing.

Common failure: source system schema changes that break ingestion silently, such as a column rename, a type change from int to string, or a previously non-null field becoming nullable.

2. Data validation and preprocessing

This stage validates data quality, applies feature engineering, and splits data into training, validation, and test sets.

Requirements: statistical validation (expected distributions, null-rate thresholds, cardinality bounds) and, critically, feature-computation consistency between training and serving. Training-serving skew is one of the dominant failure modes for first-time MLOps deployments.
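As a minimal sketch of the statistical checks named above, the function below tests a batch against null-rate thresholds and cardinality bounds. The function name `validate_batch`, the column names, and the threshold values are hypothetical and chosen only for illustration. A real pipeline would more likely use a dedicated validation library and would also compare value distributions, which this sketch omits.

```python
from __future__ import annotations

from typing import Any


def validate_batch(
    rows: list[dict[str, Any]],
    max_null_rate: dict[str, float],
    max_cardinality: dict[str, int],
) -> list[str]:
    """Return human-readable violations; an empty list means the batch passed."""
    if not rows:
        return ["batch is empty"]

    violations: list[str] = []

    # Null-rate check: share of rows where the column is missing or None.
    for column, limit in max_null_rate.items():
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows)
        if rate > limit:
            violations.append(f"{column}: null rate {rate:.2f} exceeds {limit:.2f}")

    # Cardinality check: number of distinct non-null values in the column.
    for column, limit in max_cardinality.items():
        distinct = {row.get(column) for row in rows if row.get(column) is not None}
        if len(distinct) > limit:
            violations.append(f"{column}: {len(distinct)} distinct values exceeds {limit}")

    return violations


# Hypothetical batch and thresholds, for illustration only.
batch = [
    {"country": "DE", "age": 31},
    {"country": "FR", "age": None},
    {"country": "US", "age": None},
    {"country": "DE", "age": 40},
]
problems = validate_batch(batch, {"age": 0.25}, {"country": 2})
```

Returning a list of violations instead of raising on the first one lets the pipeline log every problem in a batch before deciding whether to halt.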
Common failure: preprocessing logic in training diverges from preprocessing in serving. A model trained on normalised features but served with raw input will perform arbitrarily badly, and the failure mode is silent: predictions return on time; they are just wrong.

3. Model training

The training stage runs the computation that produces model weights. It may involve hyperparameter sweeps, distributed training across GPUs with NCCL, or fine-tuning a foundation model with PyTorch or a library such as Hugging Face Transformers.

Requirements: experiment tracking (MLflow, Weights & Biases, or an equivalent, with every hyperparameter and metric logged), environment pinning via containerised training (Docker images with explicit CUDA and library versions), and seed logging for whatever reproducibility is achievable.

Common failure: untracked dependencies (a library version drift, an unpinned CUDA minor version, a base image rebuilt upstream) that make runs non-reproducible weeks later when an audit asks why two training jobs produced different models.

4. Model evaluation

Evaluation compares the new model against the current production model on a held-out set with a fixed metric.

Requirements: a fixed, versioned evaluation set; evaluation metrics that reflect business outcomes rather than only statistical fit; and an automatic pass-fail threshold encoded as a quality gate.

Common failure: evaluation-set leakage. As training data grows over time, the boundary between “what we train on” and “what we evaluate on” erodes unless the evaluation set is strictly versioned and protected. Once leakage happens, evaluation metrics drift upward for the wrong reason and the team loses its single trustworthy signal.

5. Deployment

Deployment registers the new model in a model registry, deploys to staging, runs integration tests against representative traffic, and promotes to production.
Requirements: canary deployment or shadow mode so production behaviour can be observed on real traffic before full cutover, plus a rollback mechanism that can return to the previous model version in minutes rather than hours.

Common failure: a rollback path that exists on paper but has never been exercised. The first time a team tries to roll back a model under production pressure should not be the first time they try to roll back a model.

6. Monitoring

Monitoring tracks how the model behaves once it is serving real requests.

Requirements: input distribution monitoring to detect data drift, output distribution monitoring to catch shifts in predictions that can signal concept drift, and downstream business-metric tracking to confirm the model is still doing useful work. Alerts fire when drift exceeds thresholds rather than on a fixed schedule.

ML vs software CI/CD: where the contract differs

The table below is the comparison most teams need when they first explain MLOps to a platform organisation that already runs a mature software CI/CD pipeline. The categories are the same; the contents are not.

| Aspect | Software CI/CD | ML pipeline |
|---|---|---|
| Determinism | Fully deterministic | Non-deterministic (init seeds, GPU kernels) |
| “Build” artifact | Binary or container image | Trained model weights plus metadata |
| Testing | Unit and integration tests | Statistical evaluation against a baseline model |
| Rollback trigger | Test failure, error rate spike | Model degradation, data drift, concept drift |
| Frequency | Every commit | Data-triggered, scheduled, or on-demand |
| Cache validity | Invalidated by code change | Invalidated by code or data change |

This is the structural reason that a software platform team cannot simply hand a Kubernetes cluster to a data science team and call it MLOps. The contract between pipeline stages is different.

For the broader adoption path (how a team gets from “model in a notebook” to “model serving production traffic”), see MLOps for organisations that have never operationalised a model.
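The cache-validity contrast above (invalidated by code change versus invalidated by code or data change) can be illustrated with a small sketch. It assumes each pipeline stage can be given a code version string, such as a commit hash, and a fingerprint of its input data, such as a checksum. The function name `stage_cache_key` and its parameters are hypothetical and not taken from any particular orchestrator.

```python
from __future__ import annotations

import hashlib


def stage_cache_key(
    code_version: str,
    data_fingerprint: str,
    params: dict[str, str],
) -> str:
    """Derive a cache key that changes when code, data, or parameters change."""
    digest = hashlib.sha256()
    for part in (code_version, data_fingerprint):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    # Sort parameter names so dictionary ordering does not affect the key.
    for name in sorted(params):
        digest.update(f"{name}={params[name]}".encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Hypothetical values: same code, different data, so the cached output is stale.
key_before = stage_cache_key("abc123", "data-v1", {"lr": "0.01"})
key_after = stage_cache_key("abc123", "data-v2", {"lr": "0.01"})
cache_is_valid = key_before == key_after  # False: the data changed
```

Under these assumptions, including the data fingerprint in the key is what separates this from a typical software build cache, which keys on code alone.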
What are the most common pipeline failure modes?

MLOps pipeline failures cluster into three categories: data failures, infrastructure failures, and model failures. Each category needs its own detection signal and its own remediation playbook.

Data failures are the most frequent. They include upstream schema changes (a column is renamed, a type changes, a field becomes nullable), data quality degradation (distribution drift, shifting missing-value patterns, duplicate records appearing where there were none), and data availability issues (source system downtime, API rate limiting, network partitions). We detect data failures using schema validation at ingestion points, statistical distribution checks on incoming batches, and freshness monitoring that alerts when expected data does not arrive within its SLA.

Infrastructure failures include compute exhaustion (GPU out-of-memory during training, disk full during a feature-engineering job), dependency failures (a package version conflict introduced by an unpinned pip install, a Docker image failing to pull because the registry is rate-limited), and orchestration failures (a DAG step timing out in Airflow or Argo Workflows, a retry policy exhausting its attempts on a transient error that should have succeeded on retry two). The mitigations are not exotic: resource monitoring with proactive alerts, fully pinned dependencies, and idempotent pipeline steps that can be retried safely without producing duplicate side effects.

Model failures happen when a retrained model fails its quality gate: accuracy drops below threshold, prediction distribution diverges from the expected range, or the model produces outputs that violate business rules (a pricing model producing negative prices is the canonical example). Quality gates are the last line of defence between a bad model and production traffic.
They have to be comprehensive enough to catch meaningful degradation, but not so sensitive that statistical noise blocks legitimate deployments and trains the team to override the gate.

The design principle that governs our pipeline architecture: every failure should be detectable, diagnosable, and recoverable without human intervention during business hours. Manual intervention should be reserved for failures the automated system cannot categorise, and in a well-designed pipeline those should be rare events rather than weekly occurrences. This is an observed pattern across MLOps engagements we have run, not a benchmarked rate, and the achievable cadence depends heavily on how well-instrumented the underlying data platform already is.

Three signal types for pipeline monitoring

For pipeline-level monitoring (distinct from model-level monitoring) we implement three signal types:

- Heartbeat signals: is the pipeline running at all? Did the scheduled job start, and is it making progress?
- Quality signals: are the outputs correct? Did validation pass, did evaluation clear the quality gate, did the deployed model register successfully?
- Performance signals: is the pipeline running within its SLA? A pipeline that produces correct outputs but takes six hours instead of the expected two has a performance problem that, if left undetected, will eventually become a quality problem when downstream consumers time out waiting.

Observability across the pipeline requires correlation IDs that trace a single data sample from ingestion through feature computation, training batch inclusion, and the resulting model version in production. When a model produces an unexpected prediction in production, the correlation ID lets an engineer trace backwards to identify which training data, which feature values, and which pipeline version contributed to that prediction.
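The backward trace described above can be sketched with an in-memory lineage log. Everything in this example is hypothetical: the `LineageLog` class, the stage names, and the detail fields are illustrative stand-ins, and a production system would write these events to a durable store, not a Python dictionary.

```python
from __future__ import annotations

import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """In-memory stand-in for a lineage store keyed by correlation ID."""

    events: dict[str, list[tuple[str, dict[str, str]]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def new_correlation_id(self) -> str:
        # A random hex ID assigned once, at ingestion, and passed downstream.
        return uuid.uuid4().hex

    def record(self, correlation_id: str, stage: str, **details: str) -> None:
        # Append one event per pipeline stage the sample passes through.
        self.events[correlation_id].append((stage, details))

    def trace(self, correlation_id: str) -> list[tuple[str, dict[str, str]]]:
        # Return events newest-first, i.e. walking backwards from production.
        return list(reversed(self.events.get(correlation_id, [])))


# Hypothetical stages and values, for illustration only.
log = LineageLog()
cid = log.new_correlation_id()
log.record(cid, "ingestion", source="orders_db")
log.record(cid, "features", feature_set="v3")
log.record(cid, "training", model_version="v7")
history = log.trace(cid)  # training first, then features, then ingestion
```

The design choice worth noting is that the ID is minted once at ingestion and only ever passed along, so every later stage appends to the same history without generating its own ID.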
End-to-end traceability transforms incident investigation from guesswork into systematic root-cause analysis, and it is the single highest-leverage investment a first MLOps implementation can make.

FAQ