## ML pipelines are not the same as software CI/CD

Software CI/CD pipelines are deterministic: the same code produces the same binary. ML training pipelines are not: the same code plus the same data may not produce the same model, due to random initialization, hardware differences, and non-deterministic GPU operations. This fundamental difference shapes how ML pipelines are designed, validated, and monitored. A production ML pipeline typically moves through six stages, each with its own requirements and characteristic failures.

### 1. Data ingestion

Pulls raw data from source systems (databases, data lakes, streaming sources) into the pipeline.

**Requirements:** Schema validation, data freshness checks, volume checks. A pipeline that runs successfully but on yesterday's data (because an upstream job failed) has silently trained on stale data.

**Common failure:** Source system schema changes that break ingestion silently.

### 2. Data validation and preprocessing

Validates data quality, applies feature engineering, and splits the data into train/val/test sets.

**Requirements:** Statistical validation (expected distributions, null-rate thresholds) and feature computation consistency between training and serving (training-serving skew is a major failure mode).

**Common failure:** Preprocessing logic in training differs from serving preprocessing. A model trained on normalized features but served raw inputs performs arbitrarily badly.

### 3. Model training

Runs the training computation, which may involve hyperparameter sweeps, distributed training, or fine-tuning a foundation model.

**Requirements:** Experiment tracking (log all hyperparameters and metrics), environment pinning (containerized training), and seed logging for reproducibility attempts.

**Common failure:** Untracked dependencies (library version, CUDA version) that make runs non-reproducible.

### 4. Model evaluation

Evaluates the new model against the current production model on a held-out evaluation set.

**Requirements:** A fixed, versioned evaluation set; evaluation metrics that reflect business outcomes; an automatic pass/fail threshold.

**Common failure:** The evaluation set leaks into the training data over time (training data grows while the evaluation set is not strictly protected).

### 5. Deployment

Registers the new model, deploys it to staging, runs integration tests, and promotes it to production.

**Requirements:** Canary deployment or shadow mode to validate behavior before full traffic, plus a rollback mechanism.

### 6. Monitoring

Tracks model behavior in production.

**Requirements:** Input data distribution monitoring (to detect drift), output distribution monitoring, and downstream business metric tracking, with alerts when drift exceeds thresholds.

### ML vs software CI/CD comparison

| Aspect | Software CI/CD | ML pipeline |
| --- | --- | --- |
| Determinism | Fully deterministic | Non-deterministic |
| "Build" artifact | Binary/container | Trained model weights |
| Testing | Unit/integration tests | Statistical evaluation against a baseline |
| Rollback trigger | Test failure, error rate | Model degradation, data drift |
| Frequency | Every commit | Data-triggered, scheduled, or on-demand |

For an overview of MLOps practices in organisations starting from scratch, *MLOps for organisations that have never operationalised a model* covers the adoption path.

## What are the most common pipeline failure modes?

MLOps pipeline failures cluster into three categories: data failures, infrastructure failures, and model failures. Each requires different detection and remediation strategies.
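The ingestion-stage requirements above (schema validation, freshness, and volume checks) are the first line of defence against the data failures described below, and usually amount to a short block of defensive code run before anything else in the pipeline. A minimal sketch, assuming pandas; the expected schema, freshness SLA, and volume floor are purely illustrative:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype string.
EXPECTED_SCHEMA = {
    "user_id": "int64",
    "amount": "float64",
    "event_ts": "datetime64[ns, UTC]",
}
MAX_STALENESS = timedelta(hours=24)  # illustrative freshness SLA
MIN_ROWS = 10_000                    # illustrative volume floor


def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return human-readable validation failures; an empty list means the batch passes."""
    failures: list[str] = []

    # Schema check: a renamed column or changed dtype breaks downstream feature code.
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"dtype changed for {col}: {df[col].dtype} != {dtype}")

    # Freshness check: a run that succeeds on yesterday's data is a silent failure.
    if "event_ts" in df.columns and str(df["event_ts"].dtype) == EXPECTED_SCHEMA["event_ts"] and not df.empty:
        newest = df["event_ts"].max()
        if datetime.now(timezone.utc) - newest > MAX_STALENESS:
            failures.append(f"stale data: newest event is {newest}")

    # Volume check: a sudden drop in row count usually means a broken upstream job.
    if len(df) < MIN_ROWS:
        failures.append(f"too few rows: {len(df)} < {MIN_ROWS}")

    return failures
```

Failing the run loudly at this point turns a silent stale-data problem into a diagnosable incident.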
**Data failures** are the most frequent: upstream data schema changes (a column is renamed, a data type changes, or a field becomes nullable), data quality degradation (distribution drift, missing-value patterns, duplicate records), and data availability issues (source system downtime, API rate limiting, network partitions). We detect data failures using schema validation at pipeline ingestion points, statistical distribution checks on incoming data, and freshness monitoring (alerting when expected data does not arrive within its SLA).

**Infrastructure failures** include compute resource exhaustion (GPU OOM during training, disk full during data processing), dependency failures (a package version conflict after a pip install, a Docker image failing to pull), and orchestration failures (a DAG step timing out, a retry policy exhausting its attempts). We mitigate infrastructure failures through resource monitoring with proactive alerting, pinned dependency versions, and idempotent pipeline steps that can be safely retried.

**Model failures** occur when a retrained model fails quality gates: accuracy drops below the threshold, the prediction distribution diverges from the expected range, or the model produces outputs that violate business rules (e.g., a pricing model producing negative prices). Quality gates are the last line of defence: they must be comprehensive enough to catch meaningful degradation but not so sensitive that they block deployments due to statistical noise.

The design principle that governs our pipeline architecture: every failure should be detectable, diagnosable, and recoverable without human intervention during business hours. Manual intervention should be reserved for failures that the automated system cannot categorise, which in a well-designed pipeline should be fewer than one per month.

For pipeline monitoring, we implement three signal types: heartbeat signals (is the pipeline running?), quality signals (are the outputs correct?), and performance signals (is the pipeline running within SLA?). A pipeline that produces correct outputs but takes 6 hours instead of the expected 2 hours has a performance problem that, if undetected, will eventually become a quality problem when downstream consumers time out waiting for results.

Observability across the pipeline requires correlation IDs that trace a data sample from ingestion through feature computation, training batch inclusion, and model version production. When a model produces an unexpected prediction in production, the correlation ID allows tracing backwards to identify which training data, feature values, and pipeline version contributed to that prediction. This end-to-end traceability transforms incident investigation from guesswork into systematic root cause analysis, reducing mean time to resolution from hours to minutes for production ML issues.
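One lightweight way to get that traceability is to mint a correlation ID at ingestion and have every downstream stage log its work against it. The sketch below is illustrative rather than any particular tool's API; the record structure, field names, and values are assumptions:

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Metadata carried alongside a data batch through every pipeline step."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[dict] = field(default_factory=list)

    def record(self, step: str, **details) -> None:
        # Every stage appends what it did under the same run_id, so a production
        # prediction can be traced back to the ingestion batch, feature version,
        # and training run that produced the model serving it.
        self.steps.append({"run_id": self.run_id, "step": step, **details})


def fingerprint(payload: dict) -> str:
    """Stable short hash of a config or data snapshot for later comparison."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]


# Usage: one correlation ID minted at ingestion, reused by every stage (values illustrative).
lineage = LineageRecord()
lineage.record("ingest", source="events_table", rows=125_000)
lineage.record("features", feature_set_version="v14")
lineage.record("train", model_version="2024-05-01-a", hyperparams=fingerprint({"lr": 1e-3}))
print(json.dumps(lineage.steps, indent=2))
```

In practice the same record would be written to the experiment tracker or metadata store alongside the model artifact, so the backwards trace can start from the deployed model version.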