The MLOps tool landscape is wide and over-tooled

The MLOps tool market has expanded significantly since 2020, producing dozens of options in each category. Teams new to MLOps often end up with a stack of six to eight tools when two or three would suffice. In our experience working with teams shipping their first production model, the failure mode is rarely a missing capability; it is the cost of operating tools no one fully owns. The discipline that matters is matching tool adoption to the pain points the team can actually name today, not to anticipated needs that may never arrive.

This spoke is part of the “MLOps for organisations that have never operationalised a model” hub. The hub covers the broader adoption path; this article zooms into the tool-selection decision that most first-time teams get wrong by over-buying.

Experiment tracking

Experiment tracking records training runs: hyperparameters, metrics, artifacts, code version. It is almost always the first MLOps tool a team needs, because without it the question “which model is in production and how was it trained?” has no reliable answer.

| Tool | Strengths | Weaknesses | Best for |
| --- | --- | --- | --- |
| MLflow | Open source, widely adopted, local or self-hosted | UI is dated; distributed setup requires work | Teams wanting self-hosted OSS |
| Weights & Biases | Strong UI, collaboration features, sweep automation | SaaS-only for full features; cost at scale | Teams with budget and a sharing need |
| Neptune | Good collaboration, flexible schemas | Smaller community | Mid-size teams |
| Plain file logging | Zero cost, no dependency | No comparison UI | Single researchers |

Model registry

A model registry is a versioned store for trained models with lineage, metadata, and deployment-stage tracking. MLflow includes one. Weights & Biases and Neptune both provide registry features.
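
As a rough illustration of what a registry tracks (versions, lineage, metadata, and deployment stage), the sketch below uses a small in-memory stand-in. `ToyRegistry`, `ModelVersion`, the stage names, and the example paths and run IDs are all hypothetical; this is not the API of MLflow or any other real registry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    """One registered model version plus the lineage needed to trace it."""
    version: int
    artifact_uri: str      # where the serialized model lives (hypothetical path)
    run_id: str            # experiment-tracking run that produced the model
    code_version: str      # for example a git commit hash
    metrics: Dict[str, float] = field(default_factory=dict)
    stage: str = "none"    # assumed stages: none, staging, production, archived


class ToyRegistry:
    """In-memory stand-in for a model registry (illustration only)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact_uri: str, run_id: str,
                 code_version: str, metrics: Dict[str, float]) -> ModelVersion:
        """Store a new version; version numbers start at 1 and only increase."""
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_uri, run_id,
                          code_version, dict(metrics))
        versions.append(mv)
        return mv

    def promote(self, name: str, version: int, stage: str) -> None:
        """Move one version to `stage`, archiving any version it replaces in production."""
        versions = self._versions[name]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        if stage == "production":
            for mv in versions:
                if mv.stage == "production":
                    mv.stage = "archived"
        versions[version - 1].stage = stage

    def production(self, name: str) -> Optional[ModelVersion]:
        """Answer 'which model is in production and how was it trained?'"""
        for mv in self._versions.get(name, []):
            if mv.stage == "production":
                return mv
        return None
```

A caller would register each trained model with the run ID and code version from the experiment tracker, then call `promote(name, version, "production")` at deployment time, so that `production(name)` always returns one version with its full lineage.
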
For cloud-native deployments, managed registries (Amazon SageMaker Model Registry, Google Vertex AI Model Registry, Azure ML) integrate tightly with the cloud’s serving infrastructure and IAM model, which is often the deciding factor over the registry’s own features.

Pipeline orchestration

Orchestration automates the training pipeline: data ingestion, preprocessing, training, evaluation, conditional deployment.

| Tool | Complexity | Best for |
| --- | --- | --- |
| Airflow | High setup overhead | Data engineering teams who already run it |
| Kubeflow Pipelines | Kubernetes-native | Teams already operating Kubernetes |
| Prefect | Lower overhead than Airflow | Python-native teams, modern workflow |
| Metaflow | Data-scientist friendly | Teams wanting reproducibility without DevOps depth |
| GitHub Actions or cron | Very low overhead | Simple scheduled retraining |

Model serving

Serving deploys models as inference endpoints. Choice here depends heavily on hardware target, throughput requirements, and whether the team already operates Kubernetes.

| Tool | Strengths | Complexity |
| --- | --- | --- |
| BentoML | Python-native, sensible abstractions | Medium |
| Seldon Core | Kubernetes-native, A/B testing built in | High |
| FastAPI (DIY) | Full control, low overhead | Low |
| Cloud-managed (SageMaker, Vertex) | Minimal infrastructure management | Medium |
| Triton Inference Server | High throughput, multi-framework, GPU-aware | High |

For GPU inference at non-trivial throughput, NVIDIA Triton Inference Server is the reference point: it handles dynamic batching, multi-framework execution (PyTorch, TensorFlow, ONNX, TensorRT engines), and concurrent model execution on a single GPU. For a CPU model serving a handful of requests per second, FastAPI behind a load balancer is honest engineering, not under-investment.

How do you avoid tool sprawl in the MLOps stack?

The MLOps tool market offers hundreds of options across experiment tracking, model registries, feature stores, orchestrators, serving platforms, and monitoring.
Without discipline, organisations accumulate tools that overlap in functionality, each requiring maintenance, integration, and team training.

We apply three filters when evaluating a new tool:

- Capability: does it solve a real problem we have today, not a hypothetical one?
- Integration: does it work with the existing stack without custom glue code?
- Operational cost: can the team maintain it without dedicated staff?

A tool that fails any filter is rejected regardless of its technical merit.

The minimum viable MLOps stack for an organisation deploying its first five to ten models looks like this:

- One experiment tracker: MLflow or Weights & Biases, not both.
- One model registry: MLflow’s built-in registry is sufficient at this scale.
- One orchestrator: Airflow or Prefect, whichever the data team already runs.
- One serving layer: FastAPI for simple CPU models, Triton for GPU models.
- Monitoring: Prometheus and Grafana for infrastructure, custom scripts for model-quality signals.

Each additional tool beyond this minimum needs a specific pain point to justify it. A feature store is justified when multiple models share features and keeping training and serving pipelines consistent becomes error-prone. A dedicated CI/CD system for ML is justified when the orchestrator’s scheduling can no longer express the pipeline. A vector database is justified when the retrieval pipeline handles millions of embeddings. None of these is justified on day one.

We review the tool stack annually and remove anything no longer actively used. Abandoned tools are technical debt: they consume maintenance effort, create security surface area, and confuse new team members who find them and assume they matter.

What is the smallest viable MLOps stack?

The selection process itself matters as much as the final choice.
For each category, we recommend a structured evaluation: identify requirements from the team’s actual pain points (not from vendor feature lists), shortlist two or three options, run a time-boxed evaluation of about one week per tool, and decide based on what the evaluation surfaces. The evaluation must implement a representative use case end to end, not just read documentation or watch demos. Implementation is what reveals integration friction, documentation quality, and operational overhead that marketing materials do not disclose. Two to three weeks of evaluation prevents months of regret with a poorly fitting tool.

Stack selection principles

Start minimal. Experiment tracking and a model registry are almost always the first investment. Everything else can wait until a real signal arrives.

Match existing infrastructure. If the team is already on AWS, SageMaker components are lower friction than introducing Kubernetes to run Kubeflow. If Airflow already runs the data pipelines, adding ML pipelines to Airflow is lower overhead than a parallel orchestrator. The right tool is often the one already operated competently next door.

Avoid premature optimisation. Teams building their first production model do not need A/B testing infrastructure or auto-scaling serving. Build for what the system needs today, plus one level ahead, not three.

A practical sequence we have seen work: ship the first model with experiment tracking, a registry, and a hand-rolled FastAPI service behind the existing load balancer. Add orchestration when the second model arrives and manual retraining becomes painful. Add a feature store only when feature consistency between training and serving has caused a real incident. The stack grows in response to evidence, not anticipation.

FAQ

What does MLOps actually mean for an organisation that has never operationalised a model?
MLOps is the operational discipline that takes a trained model out of a notebook and into a system that can serve predictions reliably, be retrained when data shifts, and be rolled back when it misbehaves. For a first-time team, it means owning four things: where the model lives (registry), how it was trained (experiment tracking), how it gets deployed (serving), and how you know it is still working (monitoring).

Which MLOps capabilities does a first project genuinely need, and which are overengineering?

A first project needs experiment tracking, a model registry, a deployment path, and basic monitoring. It does not need a feature store, A/B testing infrastructure, automated retraining, or a dedicated ML CI/CD platform. Those capabilities solve real problems, but problems that emerge with the second, fifth, or tenth model, not the first.

Which MLOps tools and frameworks are realistic for a first deployment?

MLflow for tracking and registry, Prefect or Airflow for orchestration if you need it, FastAPI or Triton Inference Server for serving, and Prometheus with Grafana for monitoring. Managed cloud equivalents (SageMaker, Vertex AI, Azure ML) are realistic when the team is already operating that cloud. Avoid Kubernetes-native stacks (Kubeflow, Seldon Core) unless Kubernetes is already in place; they assume mature platform engineering.

What is the smallest viable MLOps stack that still produces a production-quality deployment?

One experiment tracker, one registry (MLflow can cover both), one orchestrator, one serving layer, and infrastructure plus model-quality monitoring. Five components, each with a single owner. This stack ships production-grade deployments for the first five to ten models without accumulating maintenance debt.

How does MLOps differ from DevOps in the data-pipeline, drift, and rollback dimensions?

DevOps assumes the artifact under deployment is deterministic: the same code produces the same behaviour. MLOps cannot assume this.
The model’s behaviour depends on training data, which drifts; on feature pipelines, which can silently change; and on the production input distribution, which is not under the team’s control. Rollback in DevOps is reverting code. Rollback in MLOps is reverting to a previous model version while keeping data and feature pipelines consistent, which is why a registry with lineage is non-negotiable.

Why do most ML models never reach production, and which MLOps gaps cause that?

The most common gap is not a missing tool; it is a missing deployment path. Teams build models in notebooks against snapshotted data, with no plan for how the model will receive live inputs, where it will run, who owns it after handover, or how anyone will know it has degraded. Without a registry, a serving target, and a monitoring contract before training starts, the notebook is where the model dies.
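
To make the rollback point from the DevOps comparison concrete, here is a minimal sketch of choosing a rollback target from registry lineage. It assumes each registered version records which feature pipeline version it was trained against. The record fields, the version strings, and the function `pick_rollback_target` are hypothetical names for illustration, not part of any real registry API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegisteredModel:
    """Lineage fields a registry would need for a safe rollback (assumed schema)."""
    version: int
    feature_pipeline_version: str   # feature pipeline the model was trained against
    training_data_snapshot: str     # data snapshot used for training


def pick_rollback_target(
    history: List[RegisteredModel],
    current_version: int,
    live_feature_pipeline: str,
) -> Optional[RegisteredModel]:
    """Return the newest older version compatible with the live feature pipeline.

    Returns None when no earlier version matches, which signals that the
    feature pipeline would have to be reverted together with the model.
    """
    candidates = [
        m for m in history
        if m.version < current_version
        and m.feature_pipeline_version == live_feature_pipeline
    ]
    return max(candidates, key=lambda m: m.version, default=None)


# Example lineage: version 1 predates a feature pipeline change.
history = [
    RegisteredModel(1, "features-v1", "snapshot-jan"),
    RegisteredModel(2, "features-v2", "snapshot-feb"),
    RegisteredModel(3, "features-v2", "snapshot-mar"),
    RegisteredModel(4, "features-v2", "snapshot-apr"),
]
target = pick_rollback_target(history, current_version=4,
                              live_feature_pipeline="features-v2")
```

In this example the target is version 3. Asking for a rollback from version 2 with `features-v2` live returns `None`, because the only older model expects a different feature pipeline. Without lineage in the registry, that incompatibility would surface only after the rollback had been deployed.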