The model works in a notebook: now what?

Your data science team trained a model. It performs well on the evaluation dataset. The stakeholders approved it. Now the question: how does this model go from a Jupyter notebook to a production system that runs reliably, day after day, without the data scientist who built it manually running the notebook every morning?

This is the MLOps question, and in our experience it is where most organisations encounter their first serious gap between AI capability and AI operations. The model works. The problem is everything around the model: serving it to production systems, monitoring its performance, detecting when it degrades, retraining it when the data changes, and versioning the model, data, and code so that any production issue can be traced back to a specific model version trained on a specific dataset.

The gap is well documented. Gartner predicted in 2018 that through 2022, roughly 85% of AI projects would deliver erroneous outcomes (published-survey, Gartner 2018), a prediction that subsequent industry reporting has broadly tracked. A 2023 Weights & Biases practitioner survey reported that 62% of ML teams still deploy models manually, without automated pipelines (published-survey, W&B 2023). A 2024 O’Reilly survey found that 47% of companies identify model deployment and monitoring as a bigger challenge than model development itself (published-survey, O’Reilly 2024). MLflow (the most widely adopted open-source experiment tracking tool, with over 20 million reported downloads and use across more than 10,000 organisations per Databricks’ 2024 figures) exists precisely because this gap demanded tooling.

The point of this article is narrower than the literature. We are not surveying the MLOps platform landscape.
We are walking through what a first MLOps implementation looks like for a team that has models but no production pipeline: which tools earn their place on day one, which are overengineering, and how the work sequences across the first 90 days.

What MLOps actually is, in one paragraph

MLOps (machine learning operations) is the set of practices and infrastructure that manages the lifecycle of ML models in production. It is the ML equivalent of DevOps: just as DevOps provides the tooling and processes for reliable software deployment, MLOps provides the tooling and processes for reliable model deployment and operation. The capabilities that matter on the first project are concrete and small in number: model serving (an API endpoint or batch pipeline that handles production load and latency), model monitoring (logging predictions, inputs, and accuracy where ground truth is available), model retraining (a pipeline that pulls new data, retrains, evaluates, and promotes), model versioning (so rollback and audit are possible), and pipeline automation (so updates do not depend on one person remembering a sequence of steps).

The temptation, when reading any list like that, is to procure a platform that does all five at once. That is the first mistake, and it is the one this article is built to prevent.

How does MLOps differ from DevOps in practice?

DevOps and MLOps share the same instincts (version control, CI/CD, monitoring, rollback), but the moving parts are different in three dimensions that matter operationally.

First, the artifact is not just code. A model is a function of code and data and hyperparameters and the random seed used during training. Reproducing a production model means reproducing all four. This is why teams record the training data hash and evaluation metrics alongside the model binary in MLflow’s model registry, and why DVC exists to version the data that Git is not designed to hold.

Second, the failure mode is drift, not bugs.
A deployed web service that worked yesterday will, all else equal, work today. A deployed model that worked yesterday may quietly degrade today because the input distribution shifted: a new customer segment, a seasonal pattern, an upstream data pipeline change. The failure is silent and gradual, which is why monitoring is structurally more important in MLOps than in DevOps and why the first capability we build is always monitoring.

Third, rollback is heavier. Reverting a code deploy is a git revert and a rebuild. Reverting a model means re-pointing traffic to a previous model version and verifying that the previous version still behaves correctly on today’s data, which it may not, if the input distribution has shifted enough to require retraining rather than rollback. This is why versioning and monitoring have to land before automated retraining; without them, rollback is a guess.

These three dimensions are the reason most “we already have DevOps, we’re fine” assumptions fail in their first production incident.

Where to start: the minimum viable MLOps

The MLOps landscape is overwhelming. MLflow, Kubeflow, Vertex AI, SageMaker, Weights & Biases, Tecton, Feast, BentoML, Triton Inference Server: every one of these is a legitimate tool, and a comprehensive adoption of any of them is a six-month infrastructure project before any production model is deployed.

The pragmatic starting point is a minimum viable MLOps: the smallest set of practices and tools that enables reliable production model operation. Measured against Google’s MLOps maturity model, which runs from Level 0 (a fully manual process) through Level 1 (automated training pipelines) to Level 2 (full CI/CD for ML), this starting point sits just above Level 0: manual training, automated serving, and basic monitoring. Levels 1 and 2 are capabilities worth building once the organisation has enough production models to justify them, but premature for a team deploying its first. In our MLOps engagements, this starting point typically takes 2–4 weeks to establish (observed-pattern, not a benchmarked rate; varies with existing data infrastructure).
Start with monitoring. Before automating retraining, before building feature stores, before adopting a platform, instrument the production model to log predictions, input characteristics, and performance metrics. If you have no other MLOps capability, monitoring at least tells you when the model is failing. Without it, failures surface through customer complaints or downstream system errors, and by then the cost of the failure has already been paid. The implementation can be simple: log predictions and input features to a database, compute summary statistics daily, and alert when statistics deviate from the baseline established at deployment. This does not require an MLOps platform. It requires logging, a database, and a scheduled script.

Add versioning. Track which model is deployed, when it was trained, and what data it was trained on. At minimum: store each model artifact with a version identifier, the training data hash, the evaluation metrics, and the deployment date. MLflow provides a model registry that handles this; Git plus DVC handles data and code versioning. The combination provides full traceability without committing to a heavy platform.

Add a retraining pipeline. When monitoring detects degradation, or on a regular schedule (weekly or monthly depending on data change rate), a retraining pipeline pulls the latest training data, trains a new model version, evaluates it against the test set, compares the evaluation results to the current production model, and promotes the new model only if it passes the quality threshold. The pipeline can be a scheduled script, a GitHub Actions workflow, or a simple Airflow DAG. It does not require a dedicated ML pipeline platform.

Add serving infrastructure. Move from “the data scientist runs the notebook” to “the model is served as an API.” FastAPI with a model-loading pattern (load the model at startup, serve predictions through an HTTP endpoint) is the simplest production-grade approach.
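The load-at-startup pattern described above can be sketched without any dependencies. The example below is an illustration only: it uses Python's standard-library `http.server` as a stand-in for FastAPI so that it runs anywhere, and `ThresholdModel`, `load_model`, the `/health` and `/predict` paths, and the `demo-1` version tag are all hypothetical names invented for the sketch. A FastAPI service would have the same shape: one model object created when the process starts, and route handlers that only call `predict`.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ThresholdModel:
    """Hypothetical stand-in for a real model artifact."""

    version = "demo-1"

    def predict(self, features):
        # Toy rule: score is the mean of the inputs; label is 1 above 0.5.
        score = sum(features) / len(features)
        return {"score": score, "label": int(score > 0.5)}


def load_model():
    # A real service would deserialize the versioned artifact here, once.
    return ThresholdModel()


MODEL = load_model()  # loaded at startup, not on every request


class PredictionHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: also reports which model version is being served.
        if self.path == "/health":
            self._send_json(200, {"status": "ok", "model_version": MODEL.version})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))["features"]
            result = MODEL.predict([float(x) for x in features])
        except (KeyError, TypeError, ValueError, ZeroDivisionError):
            self._send_json(400, {"error": "body must be JSON with a non-empty 'features' list"})
            return
        self._send_json(200, {"model_version": MODEL.version, **result})

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_server(port=0):
    """Start serving on a background thread; port 0 asks the OS for a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), PredictionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(server, method, path, payload=None):
    """Small client helper that returns (HTTP status, decoded JSON body)."""
    conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    body = json.dumps(payload) if payload is not None else None
    conn.request(method, path, body=body, headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    data = json.loads(response.read())
    conn.close()
    return response.status, data


if __name__ == "__main__":
    server = start_server()
    print(call(server, "GET", "/health"))
    print(call(server, "POST", "/predict", {"features": [0.2, 0.9, 1.0]}))
    server.shutdown()
```

Returning the model version with every prediction is a deliberate choice in this sketch: it lets logged predictions be traced back to the artifact that produced them, which is what the versioning step is for.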
For higher scale, BentoML or Triton Inference Server provide more sophisticated serving with request batching, model versioning, and GPU support. GenAI workloads add further serving concerns (guardrails, cost-per-request monitoring, and evaluation pipelines), covered in moving a GenAI prototype into production.

First 90 days: MLOps implementation by team size

The table below maps the first 90 days of MLOps adoption to three team sizes: small, medium, and large. Each cell lists the specific capabilities to implement in that period, progressing from minimum viable MLOps toward the maturity level that matches the team’s operational capacity. Treat it as a sequencing guide, not a rigid schedule. The goal for days 1–30 is always monitoring, the single capability that provides visibility into production model behaviour before any automation is added.

| Period | Small team (1–3 ML engineers) | Medium team (4–8 ML engineers) | Large team (9+ ML engineers) |
| --- | --- | --- | --- |
| Days 1–30: Foundation | Log predictions and input features to a database; compute daily summary statistics with alerting on drift from baseline | Deploy model monitoring with prediction logging and data-drift detection; set up MLflow model registry for artifact versioning | Instrument all production models with monitoring dashboards (prediction distributions, latency, error rates); implement model versioning with full lineage tracking (data hash, code commit, evaluation metrics) |
| Days 31–60: Automation | Add model versioning with DVC for data and Git for code; deploy the first model as a FastAPI endpoint with health checks | Build a scheduled retraining pipeline (Airflow DAG or GitHub Actions) with automated evaluation against production baseline; serve models via FastAPI or BentoML with load testing | Deploy pipeline orchestration (Airflow or Kubeflow Pipelines) for multi-model retraining with dependency management; stand up a feature store (Feast or Tecton) for shared feature computation |
| Days 61–90: Validation | Implement a scripted retraining pipeline (scheduled weekly) that retrains, evaluates, and promotes the model if it exceeds the quality threshold | Add experiment tracking (MLflow or W&B) for systematic comparison of model variants; implement rollback procedures and canary deployment | Integrate experiment tracking across teams; implement automated canary deployment with traffic splitting and rollback triggers; establish a model governance process with evaluation gates |

The repeatable infrastructure built in this first 90-day window is the actual ROI of a first MLOps engagement. The first model deployment is expensive because the pipeline does not exist; the second deployment on the same infrastructure is dramatically cheaper because the pipeline is reused. In our experience the second model typically lands in a fraction of the elapsed time of the first (observed-pattern, varies sharply with how generic the first pipeline was made).

The progression from minimum viable to mature

The minimum viable MLOps (monitoring, versioning, a retraining pipeline, serving) is sufficient for one to three production models with moderate update frequency. As the number of production models grows, the operational burden of managing them individually grows proportionally, and the case for more sophisticated infrastructure strengthens.

Feature stores (Feast, Tecton) become valuable when multiple models share the same input features and the feature computation is expensive or latency-sensitive. The store computes features once and serves them to all models, ensuring consistency and reducing redundant computation. With fewer than two or three models sharing features, the operational overhead of running a feature store exceeds the saving.

Pipeline orchestration (Airflow, Prefect, Kubeflow Pipelines) becomes valuable when retraining pipelines have complex dependencies: multiple data sources, multi-stage processing, parallel training of variants, conditional deployment based on evaluation results.
A scheduled GitHub Actions workflow handles a single retraining pipeline well; it does not handle ten interdependent ones.

Experiment tracking (MLflow, Weights & Biases) becomes valuable when the team is running frequent experiments with different architectures, hyperparameters, or data configurations. The experiment tracker records each run’s configuration and results, enabling systematic comparison and preventing the “which notebook had the best results?” problem that recurs every time a senior engineer is asked to reproduce a six-month-old result.

The decision rule is the same in each case: adopt the heavier tool when the operational burden of not having it exceeds the cost of running it. The structured AI consulting engagement sizes MLOps implementation to the organisation’s current model portfolio and growth trajectory, not to theoretical future scale.

Common mistakes in MLOps adoption

Over-engineering from the start. Adopting Kubeflow, building a feature store, and implementing a full CI/CD pipeline for ML when the organisation has one model in production. The infrastructure cost and complexity exceed the operational benefit. Start with the minimum viable set and add capability as the operational need grows.

Ignoring monitoring. Building automated retraining without monitoring is like building a fire suppression system without smoke detectors. Retraining addresses a specific problem (data drift causing degradation), but without monitoring, the team does not know whether retraining is needed, whether it worked, or whether the new model is better than the old one.

Manual processes disguised as MLOps. A data scientist who manually runs a training script, manually checks evaluation metrics, and manually copies the model to the production server has an MLOps process, but it is not automated, not reproducible, and not reliable. The process fails when that person is on holiday, leaves the company, or forgets a step.
Automation is the point of MLOps; manual processes with documentation are not a substitute.

Scoping MLOps to current deployment needs rather than theoretical future scale is what keeps the first engagement honest. An AI Project Risk Assessment includes MLOps readiness evaluation sized to the workloads actually going to production, not to a vendor’s reference architecture for a hypothetical future state.

FAQ

What does MLOps actually mean for an organisation that has never operationalised a model?
It means moving from a model that runs in a notebook on a data scientist’s laptop to a model that is served as an API or batch pipeline, monitored for performance and drift, versioned alongside the data and code that produced it, and retrained on a schedule or trigger without manual intervention. The first MLOps implementation is not platform adoption; it is the smallest set of practices that makes the model reliable in production: monitoring, versioning, a retraining pipeline, and serving infrastructure.

Which MLOps capabilities does a first project genuinely need, and which are overengineering?
Genuinely needed on day one: monitoring (so failures are visible), model versioning (so rollback and audit are possible), serving infrastructure (so the model is reachable), and a basic retraining pipeline (so the model can be updated without manual steps). Overengineering for a first project: feature stores, full CI/CD for ML, Kubeflow pipelines, multi-model orchestration. Those become justified once the production model portfolio grows beyond two or three models or once multiple models share features.

Which MLOps tools and frameworks are realistic for a first deployment, and which assume mature data engineering already in place?
Realistic for a first deployment: MLflow (experiment tracking and model registry), DVC (data versioning on top of Git), FastAPI (serving), GitHub Actions or a simple Airflow DAG (scheduled retraining), and a logging pipeline writing to an existing database for monitoring. Tools that assume mature data engineering: Kubeflow (assumes Kubernetes operational maturity), Feast and Tecton (assume an existing feature pipeline and multiple consuming models), and SageMaker or Vertex AI when adopted as full platforms rather than as serving components.

What is the smallest viable MLOps stack that still produces a production-quality deployment?
A scheduled script logging predictions and inputs to a database, with a daily summary-statistics job alerting on drift from baseline; MLflow’s model registry for versioning the model artifact, training data hash, and evaluation metrics; FastAPI loading the model at startup and serving it as an HTTP endpoint with health checks; a GitHub Actions workflow that pulls fresh data on a schedule, retrains, evaluates against the current production model, and promotes only if it exceeds the quality threshold. That stack is production-quality for one to three models without requiring a dedicated MLOps platform.

How does MLOps differ from DevOps in the data-pipeline, drift, and rollback dimensions?
The artifact is code plus data plus hyperparameters, so reproducibility requires versioning data alongside code; Git alone is not sufficient, which is why DVC exists. The failure mode is silent drift rather than crashes, so monitoring is structurally more important than in DevOps. And rollback is heavier: reverting to a previous model version requires verifying that the old model still performs adequately on today’s input distribution, which it may not if drift was the original problem. These three differences are why an existing DevOps practice is a useful foundation for MLOps but not a substitute.
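The rollback check described above (verify the old version on today's data before re-pointing traffic) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a registry integration: `ModelVersion`, `plan_recovery`, the version tags, the toy data, and the 0.9 accuracy threshold are all hypothetical, and a real check would pull recent labelled examples from the prediction log.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, true label)


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical registry entry: a version tag plus its predict function."""

    version: str
    predict: Callable[[Sequence[float]], int]


def accuracy(model: ModelVersion, examples: List[Example]) -> float:
    """Fraction of recent labelled examples the model gets right."""
    if not examples:
        raise ValueError("need at least one labelled example")
    hits = sum(1 for features, label in examples if model.predict(features) == label)
    return hits / len(examples)


def plan_recovery(previous: ModelVersion, recent: List[Example], min_accuracy: float) -> dict:
    """Choose rollback only if the old version still meets the bar on today's data."""
    score = accuracy(previous, recent)
    action = "rollback" if score >= min_accuracy else "retrain"
    return {"action": action, "version": previous.version, "accuracy": score}


# Toy data: the true label boundary now sits at 0.8.
recent_examples = [([0.3], 0), ([0.6], 0), ([0.7], 0), ([0.9], 1), ([0.95], 1)]
v3 = ModelVersion("v3", lambda f: int(f[0] > 0.5))  # trained before the shift
v4 = ModelVersion("v4", lambda f: int(f[0] > 0.8))  # still matches today's data

print(plan_recovery(v3, recent_examples, min_accuracy=0.9))
print(plan_recovery(v4, recent_examples, min_accuracy=0.9))
```

In this toy setup `v3` scores below the threshold on recent data, so rolling back to it would only swap one failing model for another; the check steers that case toward retraining instead.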
Why do most ML models never reach production, and which MLOps gaps cause that?
The most consistently cited gaps in industry surveys are the absence of a deployment path beyond the notebook, the absence of monitoring (so the team cannot defend a production model when stakeholders ask how it is performing), the absence of versioning (so a failing model cannot be rolled back to a known-good state), and the operational burden of supporting a model without retraining automation (the “data scientist runs the notebook every morning” pattern that does not survive a holiday). Closing these four gaps is what the minimum viable MLOps implementation is designed to do.
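Of the four gaps, monitoring is the one this article recommends closing first, and the daily summary-statistics check it describes fits in a short script. The sketch below is one possible reading of that check, not a prescribed method: it compares each feature's mean today against the baseline captured at deployment and flags shifts larger than a chosen number of baseline standard deviations. The function names, the feature names, the sample values, and the 0.5 threshold are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Summary statistics for one feature; needs at least two values."""
    return {"mean": mean(values), "std": stdev(values)}


def drift_alerts(
    baseline: Dict[str, List[float]],
    today: Dict[str, List[float]],
    max_shift: float = 0.5,  # hypothetical threshold, in baseline standard deviations
) -> List[dict]:
    """Flag features whose mean today has moved too far from the deployment baseline."""
    alerts = []
    for feature, base_values in baseline.items():
        base = summarize(base_values)
        current_mean = mean(today[feature])
        # Guard against a zero-variance baseline, which would divide by zero.
        scale = base["std"] or 1.0
        shift = abs(current_mean - base["mean"]) / scale
        if shift > max_shift:
            alerts.append({"feature": feature, "shift_in_std": round(shift, 2)})
    return alerts


# Baseline captured at deployment versus values logged today (toy numbers).
baseline_log = {"age": [30, 32, 34, 36, 38], "income": [50, 52, 48, 51, 49]}
today_log = {"age": [31, 33, 35, 34, 37], "income": [70, 72, 69, 71, 68]}

for alert in drift_alerts(baseline_log, today_log):
    print("drift alert:", alert)
```

A mean-shift test is deliberately crude. It is enough to show the shape of a scheduled check, and teams that need more sensitivity can swap in a distribution-level comparison without changing the surrounding logging and alerting.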