The model works in a notebook — now what?
Your data science team trained a model. It performs well on the evaluation dataset. The stakeholders approved it. Now the question: how does this model go from a Jupyter notebook to a production system that runs reliably, day after day, without the data scientist who built it manually running the notebook every morning?
This is the MLOps question, and it is where, in our experience, most organisations encounter their first serious gap between AI capability and AI operations. The model works. The problem is everything around the model: serving it to production systems, monitoring its performance, detecting when it degrades, retraining it when the data changes, and versioning the model, the data, and the code so that any production issue can be traced back to a specific model version trained on a specific dataset.
The gap is well-documented:
- Gartner predicted in 2018 that through 2022, 85% of AI projects would deliver erroneous outcomes — a prediction that subsequent industry data has broadly confirmed. Production deployment remains the primary bottleneck, with inadequate MLOps infrastructure widely cited as a leading barrier.
- A 2023 Weights & Biases practitioner survey found that 62% of ML teams still deploy models manually, without automated pipelines.
- A 2024 O’Reilly survey reports that 47% of companies identify model deployment and monitoring as a bigger challenge than model development itself.
MLflow — the most widely adopted open-source experiment tracking tool, with over 20 million downloads (Databricks, 2024) and use across 10,000+ organisations — exists precisely because this gap demanded tooling.
What MLOps actually is
MLOps — machine learning operations — is the set of practices and infrastructure that manages the lifecycle of ML models in production. It is the ML equivalent of DevOps: just as DevOps provides the tooling and processes for reliable software deployment, MLOps provides the tooling and processes for reliable model deployment and operation.
The core MLOps capabilities:
Model serving. Making the model available to production systems — as an API endpoint, a batch processing pipeline, or an embedded component. The serving infrastructure must handle the production load (requests per second), meet the latency requirements, and scale with demand.
Model monitoring. Tracking the model’s production behaviour — prediction distributions, accuracy metrics (when ground truth is available), latency, error rates, and input data characteristics. Monitoring detects degradation before it impacts business outcomes.
Model retraining. Updating the model when performance degrades — typically because the production data has drifted from the training data. Retraining requires automated data pipelines, training infrastructure, and evaluation pipelines that validate the new model before it replaces the current one.
Model versioning. Tracking which model version is deployed, what data it was trained on, what code produced it, and what evaluation results it achieved. Versioning enables rollback (reverting to a previous model version when the current one fails), auditing (understanding why a specific prediction was made), and reproducibility (retraining the same model from the same data if needed).
Pipeline automation. Automating the end-to-end workflow — from data ingestion through training, evaluation, and deployment — so that model updates do not require manual intervention. The automation replaces the “data scientist runs the notebook” pattern with a reliable, repeatable, and auditable process.
Where to start: the minimum viable MLOps
The MLOps landscape is overwhelming. Platforms like MLflow, Kubeflow, Vertex AI, SageMaker, and Weights & Biases offer comprehensive capabilities — experiment tracking, model registries, feature stores, pipeline orchestration, serving infrastructure, and monitoring dashboards. Adopting a full MLOps platform as the first step is a recipe for a 6-month infrastructure project before any production model is deployed.
The pragmatic starting point is a minimum viable MLOps — the smallest set of practices and tools that enables reliable production model operation. This corresponds roughly to Level 0 of Google's MLOps maturity model: manual training, automated serving, and basic monitoring. Levels 1 and 2 add automated retraining pipelines and full CI/CD for ML — capabilities that are worth building once the organisation has enough production models to justify them, but premature for a team deploying its first model. In our MLOps engagements, this baseline typically takes 2–4 weeks to establish.
Start with monitoring. Before automating retraining, before building feature stores, before adopting a platform — instrument the production model to log predictions, input characteristics, and performance metrics. If you have no other MLOps capability, monitoring at least tells you when the model is failing. Without monitoring, failures are discovered through customer complaints or downstream system errors.
The monitoring implementation can be simple: log predictions and input features to a database, compute summary statistics daily, and alert when statistics deviate from the baseline established during deployment. This does not require an MLOps platform — it requires logging, a database, and a scheduled script.
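As a concrete illustration, a minimal version of that monitoring loop might look like the sketch below: predictions logged to a local SQLite table, plus a daily job that compares recent statistics against a baseline captured at deployment. The database path, schema, feature handling, and alert mechanism (a printed message standing in for a real pager or Slack hook) are illustrative assumptions, not a prescribed design.

```python
# Minimal prediction logging and drift check (illustrative; adapt paths,
# schema, and alerting to your own stack).
import json
import sqlite3
import statistics
from datetime import datetime, timedelta, timezone

DB_PATH = "monitoring.db"

def log_prediction(features: dict, prediction: float) -> None:
    """Called from the serving code for every request."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS predictions (ts TEXT, features TEXT, prediction REAL)"
        )
        conn.execute(
            "INSERT INTO predictions VALUES (?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), json.dumps(features), prediction),
        )

def daily_drift_check(baseline_mean: float, baseline_std: float, z_threshold: float = 3.0) -> None:
    """Scheduled once a day (cron, GitHub Actions, etc.). Compares the last
    24 hours of predictions against the baseline established at deployment."""
    since = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
    with sqlite3.connect(DB_PATH) as conn:
        rows = conn.execute(
            "SELECT prediction FROM predictions WHERE ts >= ?", (since,)
        ).fetchall()
    if not rows:
        print("ALERT: no predictions logged in the last 24 hours")
        return
    mean = statistics.fmean(r[0] for r in rows)
    z = abs(mean - baseline_mean) / baseline_std
    if z > z_threshold:
        print(f"ALERT: mean prediction {mean:.3f} deviates {z:.1f} sigma from baseline")
```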
Add versioning. Track which model is deployed, when it was trained, and what data it was trained on. At minimum: store each model artifact with a version identifier, the training data hash, the evaluation metrics, and the deployment date. MLflow provides a model registry that handles versioning; Git with DVC (Data Version Control) handles data and code versioning. The combination provides full traceability without a heavy platform.
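A minimal sketch of what that traceability looks like with MLflow's model registry, assuming a scikit-learn model and a single training-data file; the registered model name, data-hashing scheme, and metric are illustrative.

```python
# Sketch: record training-data hash, evaluation metric, and the model artifact
# under a registered model name (assumes MLflow 2.x and scikit-learn).
import hashlib

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

def file_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def train_and_register(train_path: str, X_train, y_train, X_test, y_test):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    with mlflow.start_run():
        mlflow.set_tag("training_data_sha256", file_hash(train_path))
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
        # Logging under a fixed registered name creates version 1, 2, 3, ... automatically.
        mlflow.sklearn.log_model(model, "model", registered_model_name="churn-model")
    return model
```

Each new run registered under the same name becomes a new registry version, which is what makes rollback and auditing a lookup rather than an archaeology exercise.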
Add a retraining pipeline. When monitoring detects degradation (or on a regular schedule — weekly or monthly, depending on the data change rate), a retraining pipeline: pulls the latest training data, trains a new model version, evaluates it against the test set, compares the evaluation results to the current production model, and promotes the new model to production if it passes the quality threshold. The pipeline can be implemented as a scheduled script, a GitHub Actions workflow, or a simple Airflow DAG — it does not require a dedicated ML pipeline platform.
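The promote-if-better logic at the heart of that pipeline is small. The sketch below assumes a scikit-learn classifier and uses a file on disk as a stand-in for a model registry; the data loading, model choice, and promotion threshold are all placeholders.

```python
# Sketch: retrain, evaluate, and promote only if the candidate beats the
# metric recorded when the current production model shipped.
import json
from pathlib import Path

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

PROD_MODEL = Path("models/production.joblib")
PROD_METRICS = Path("models/production_metrics.json")

def retrain_and_maybe_promote(X, y, min_improvement: float = 0.0) -> bool:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    candidate = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    candidate_acc = accuracy_score(y_te, candidate.predict(X_te))

    # Metric recorded at the last promotion; 0.0 if no model has shipped yet.
    prod_acc = json.loads(PROD_METRICS.read_text())["accuracy"] if PROD_METRICS.exists() else 0.0

    if candidate_acc > prod_acc + min_improvement:
        PROD_MODEL.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(candidate, PROD_MODEL)
        PROD_METRICS.write_text(json.dumps({"accuracy": candidate_acc}))
        return True   # promoted
    return False      # current production model stays
```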
Add serving infrastructure. Move from “the data scientist runs the notebook” to “the model is served as an API.” FastAPI with a model loading pattern (load the model at startup, serve predictions through an HTTP endpoint) is the simplest production-grade serving approach. For higher scale, BentoML or Triton Inference Server provide more sophisticated serving with batching, model versioning, and GPU support. GenAI workloads add further serving concerns — guardrails, cost-per-request monitoring, and evaluation pipelines — covered in moving a GenAI prototype into production.
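A sketch of that load-at-startup pattern with FastAPI follows; it assumes a joblib-serialised model at models/production.joblib and a flat list of numeric features, both of which are placeholders to adapt.

```python
# Sketch: load the model once at startup, serve predictions over HTTP,
# and expose a health check for load balancers and monitoring.
from contextlib import asynccontextmanager

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    features: list[float]

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    state["model"] = joblib.load("models/production.joblib")  # load once, not per request
    yield
    state.clear()

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": "model" in state}

@app.post("/predict")
def predict(req: PredictionRequest):
    prediction = state["model"].predict([req.features])[0]
    return {"prediction": float(prediction)}
```

Assuming the file is saved as serve.py, `uvicorn serve:app` starts it; the /health endpoint gives the monitoring layer something cheap to probe.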
The progression from minimum viable to mature
The minimum viable MLOps — monitoring, versioning, retraining pipeline, serving — is sufficient for 1–3 production models with moderate update frequency. As the number of production models grows, the operational burden of managing them individually grows proportionally, and the case for more sophisticated infrastructure strengthens:
Feature stores (Feast, Tecton) become valuable when multiple models share the same input features and the feature computation is expensive or latency-sensitive. The feature store computes features once and serves them to all models, ensuring consistency and reducing redundant computation.
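To make the "compute once, serve to all models" idea concrete, a minimal read path with Feast might look like the sketch below; the repository path, feature view ("customer_stats"), feature names, and entity key are illustrative and depend entirely on the definitions in your feature repository.

```python
# Sketch: fetch shared, precomputed features from a Feast online store at
# inference time so every model reads the same values.
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo/")

def features_for(customer_id: int) -> dict:
    response = store.get_online_features(
        features=[
            "customer_stats:avg_order_value",
            "customer_stats:orders_last_30d",
        ],
        entity_rows=[{"customer_id": customer_id}],
    )
    return response.to_dict()
```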
Pipeline orchestration (Airflow, Prefect, Kubeflow Pipelines) becomes valuable when retraining pipelines have complex dependencies — multiple data sources, multi-stage processing, parallel training of model variants, and conditional deployment based on evaluation results.
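As an illustration of what that orchestration looks like, the sketch below expresses the retraining workflow as an Airflow DAG using the TaskFlow API (assumes Airflow 2.x); the task bodies are stubs standing in for the real data-extraction, training, and evaluation steps.

```python
# Sketch: a weekly retraining DAG; each @task would call the same functions
# used in the scripted pipeline, with Airflow handling scheduling,
# dependencies, and retries.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def retraining_pipeline():

    @task
    def extract_data() -> str:
        return "s3://bucket/training/latest.parquet"  # placeholder path

    @task
    def train(data_path: str) -> str:
        return "models/candidate.joblib"  # placeholder artifact path

    @task
    def evaluate_and_deploy(model_path: str) -> bool:
        return True  # promote only if the candidate beats the production baseline

    evaluate_and_deploy(train(extract_data()))

retraining_pipeline()
```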
Experiment tracking (MLflow, Weights & Biases) becomes valuable when the team is running frequent experiments — trying different architectures, hyperparameters, or data configurations. The experiment tracker records each experiment’s configuration and results, enabling systematic comparison and preventing the “which notebook had the best results?” problem.
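A minimal sketch of that workflow with MLflow: each configuration becomes a tracked run whose parameters and validation metric can be compared in the MLflow UI rather than reconstructed from notebooks. The experiment name, model, and parameter grid are illustrative.

```python
# Sketch: log every hyperparameter configuration as a separate MLflow run.
import mlflow
from sklearn.ensemble import GradientBoostingClassifier

def run_experiments(X_train, y_train, X_val, y_val):
    mlflow.set_experiment("churn-model-tuning")
    for learning_rate in (0.01, 0.05, 0.1):
        for n_estimators in (100, 300):
            with mlflow.start_run():
                mlflow.log_params({"learning_rate": learning_rate, "n_estimators": n_estimators})
                model = GradientBoostingClassifier(
                    learning_rate=learning_rate, n_estimators=n_estimators
                ).fit(X_train, y_train)
                mlflow.log_metric("val_accuracy", model.score(X_val, y_val))
```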
The structured AI consulting engagement includes MLOps implementation as part of the production build phase, sized to the organisation’s current model portfolio and growth trajectory — not oversized for theoretical future scale.
First 90 days: MLOps implementation by team size
The table below maps the first 90 days of MLOps adoption to three team sizes — small, medium, and large. Each cell lists the specific capabilities to implement in that period, progressing from minimum viable MLOps toward the maturity level that matches the team’s operational capacity.
| Period | Small team (1–3 ML engineers) | Medium team (4–8 ML engineers) | Large team (9+ ML engineers) |
|---|---|---|---|
| Days 1–30: Foundation | Log predictions and input features to a database; compute daily summary statistics with alerting on drift from baseline | Deploy model monitoring with prediction logging and data-drift detection; set up MLflow model registry for artifact versioning | Instrument all production models with monitoring dashboards (prediction distributions, latency, error rates); implement model versioning with full lineage tracking (data hash, code commit, evaluation metrics) |
| Days 31–60: Automation | Add model versioning with DVC for data and Git for code; deploy the first model as a FastAPI endpoint with health checks | Build a scheduled retraining pipeline (Airflow DAG or GitHub Actions) with automated evaluation against production baseline; serve models via FastAPI or BentoML with load testing | Deploy pipeline orchestration (Airflow or Kubeflow Pipelines) for multi-model retraining with dependency management; stand up a feature store (Feast or Tecton) for shared feature computation across models |
| Days 61–90: Validation | Implement a scripted retraining pipeline (scheduled weekly) that retrains, evaluates, and promotes the model if it exceeds the quality threshold | Add experiment tracking (MLflow or Weights & Biases) for systematic comparison of model variants; implement rollback procedures and canary deployment for model updates | Integrate experiment tracking across all teams; implement automated canary deployment with traffic splitting and rollback triggers; establish a model governance process with evaluation gates before production promotion |
Use this as a sequencing guide, not a rigid schedule. The goal for days 1–30 is always monitoring — the single capability that provides visibility into production model behaviour before any automation is added.
Common mistakes in MLOps adoption
Over-engineering from the start. Adopting Kubeflow, building a feature store, and implementing a full CI/CD pipeline for ML when the organisation has one model in production. The infrastructure cost and complexity exceed the operational benefit. Start with the minimum viable set and add capability as the operational need grows.
Ignoring monitoring. Building automated retraining without monitoring is like building a fire suppression system without smoke detectors. Retraining addresses a specific problem (data drift causing degradation), but without monitoring, the team does not know whether retraining is needed, whether it worked, or whether the new model is better than the old one.
Manual processes disguised as MLOps. A data scientist who manually runs a training script, manually checks the evaluation metrics, and manually copies the model to the production server has a deployment process — but it is not MLOps: it is not automated, not reproducible, and not reliable. The process fails when the data scientist is on holiday, leaves the company, or forgets a step. Automation is the point of MLOps; manual processes with documentation are not a substitute.
Scoping MLOps to current deployment needs rather than theoretical future scale avoids the over-engineering trap described above — an AI Project Risk Assessment includes an MLOps readiness evaluation sized to the workloads actually going to production.