## Infrastructure complexity should follow deployment maturity

Teams building their first ML model often adopt enterprise-grade MLOps infrastructure before they have anything to maintain. The result is months of infrastructure work before any model value is delivered. The alternative, no infrastructure discipline at all, creates fragility that becomes expensive to unwind later. The right approach is incremental: adopt infrastructure components in response to demonstrated need.

### Compute infrastructure

**Training compute:** GPU/CPU instances for model training. Options range from local hardware to cloud instances (AWS EC2/SageMaker, GCP Vertex, Azure ML) to specialized cloud GPU providers (Lambda, CoreWeave, RunPod). What to consider: spot vs on-demand pricing (spot is 60–80% cheaper but interruptible; use it for batch training, not production serving), memory requirements per model type, and multi-GPU requirements for large models.

**Serving infrastructure:** Compute that runs inference. Requirements differ significantly from training: stricter latency requirements, consistent availability, and variable load (auto-scaling).

### Storage infrastructure

| Component | What it stores | Example tools |
| --- | --- | --- |
| Data lake / warehouse | Raw and processed training data | S3, GCS, Snowflake, BigQuery |
| Feature store | Computed, versioned feature values | Feast, Tecton, Vertex Feature Store |
| Artifact store | Model weights, evaluation metrics, plots | MLflow, S3, GCS |
| Model registry | Versioned models with metadata and deployment status | MLflow Registry, SageMaker Model Registry |

### Orchestration infrastructure

Runs and manages ML pipelines: training, evaluation, deployment. For most teams we work with, a simple workflow tool (Prefect, Airflow) or even a cloud-native option (SageMaker Pipelines, Vertex Pipelines) is sufficient. Kubernetes + Kubeflow Pipelines is appropriate when the team is already on Kubernetes and has dedicated platform engineering support.

### Monitoring infrastructure

Observability for both data and models in production.

**Data monitoring:** Detects when input distributions to the model shift away from the training distributions. Tools: Evidently AI, WhyLabs, Great Expectations, Grafana dashboards.

**Model monitoring:** Tracks prediction distributions, error rates, and latency. Standard observability tools (Prometheus/Grafana, Datadog) plus ML-specific metrics.

### Infrastructure adoption roadmap

| Stage | What to add | What to defer |
| --- | --- | --- |
| First POC | Local compute, S3 for artifacts, MLflow for experiment tracking | Everything else |
| First production model | Cloud serving instance, MLflow Registry, basic Grafana dashboard | Feature store, orchestration |
| 3–5 production models | Orchestration pipeline, data drift monitoring | Kubernetes, feature store |
| 10+ models | Feature store, dedicated platform team | Add incrementally |

## What are the common infrastructure anti-patterns?

**Over-investing in feature stores before having features:** Feature stores solve a real problem (training-serving skew) but are complex to operate. Teams with 1–2 models and simple features get no value from them.

**Kubernetes before scale:** Kubernetes solves a real problem at scale (hundreds of services, multiple teams). A team with 3 models and 5 engineers does not have this problem.

**Building instead of buying:** Cloud-managed MLOps services (SageMaker, Vertex AI) handle most infrastructure concerns for most teams. Custom-built infrastructure is appropriate when cloud costs become prohibitive at scale or when there are constraints the cloud service doesn’t support.
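To make the first two roadmap stages concrete, here is a minimal sketch of experiment tracking and model registration with MLflow. The tracking URI, experiment name, and model name are placeholders, and the toy dataset stands in for real training data; a database-backed tracking server (for example, one configured with an S3 artifact root) is assumed for the registration step.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder tracking server; a local `mlflow server` backed by S3 for
# artifacts is enough for the "First POC" stage.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("churn-poc")  # hypothetical experiment name

# Toy data standing in for real training data.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy)
    # Passing registered_model_name also creates a new version in the
    # model registry (hypothetical model name).
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn-classifier")
```

Data drift monitoring, likewise, does not have to start with a dedicated platform. The sketch below is a simplified, library-free stand-in for tools like Evidently, comparing a sample of production inputs against the training sample with a two-sample Kolmogorov–Smirnov test; the alert threshold and dataframe names are assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed alert threshold


def detect_drift(reference: pd.DataFrame, current: pd.DataFrame) -> dict[str, bool]:
    """Flag numeric columns whose production distribution differs from training."""
    drifted = {}
    for column in reference.select_dtypes("number").columns:
        _, p_value = ks_2samp(reference[column], current[column])
        drifted[column] = p_value < DRIFT_P_VALUE
    return drifted


# Hypothetical usage: `training_df` is the training snapshot, `production_df`
# the last day of scored requests. Any flagged column triggers an alert or a
# retraining review.
# drift = detect_drift(training_df, production_df)
```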
For the full MLOps context, *MLOps for organisations that have never operationalised a model* covers the organisational adoption journey from the beginning.

## What infrastructure mistakes do organisations make when starting MLOps?

In our experience, the most common infrastructure mistake is overbuilding: purchasing GPU clusters, deploying Kubernetes, and implementing a full MLOps platform before the organisation has a single model in production. The infrastructure sits idle while the team learns to use it, and by the time they are ready, the technology has evolved and the initial investment is partially obsolete.

The second mistake is underbuilding: running production models on a data scientist’s laptop or a single EC2 instance with no monitoring, no redundancy, and no deployment pipeline. This works until the model becomes business-critical, at which point the absence of infrastructure creates operational risk.

The correct approach is incremental: start with the minimum infrastructure required to deploy one model reliably (a serving endpoint, basic monitoring, and a manual deployment process), then add capabilities as the number of models and operational requirements grow.

For the first production model, the infrastructure requirement is modest: a compute instance (GPU or CPU depending on the model) with a serving framework (FastAPI, TorchServe, or Triton), a health check endpoint, request logging, and alerting when the service is down. Total infrastructure cost: $50–$500/month depending on compute requirements. Total setup time: 2–3 days for an experienced engineer.

For 5–10 production models, the infrastructure grows to include a container orchestrator (Kubernetes or ECS), a model registry (MLflow), an experiment tracker (MLflow or W&B), automated training pipelines (Airflow or Prefect), and a monitoring dashboard (Grafana). Total infrastructure cost: $1,000–$5,000/month. Total setup time: 2–4 weeks.

Beyond 10 models, the infrastructure may require a feature store, a dedicated ML platform (SageMaker, Vertex AI), advanced monitoring (data drift detection, model performance tracking), and A/B testing infrastructure. At this scale, the MLOps infrastructure becomes a platform that requires dedicated engineering support, typically 2–4 engineers maintaining the platform for 10–30 data scientists.
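As an illustration of that first-model footprint, here is a minimal serving sketch assuming FastAPI as the framework from the list above: a health check endpoint for uptime alerting, request logging, and a predict route. The model artifact path, feature schema, and service name are placeholders rather than a prescribed layout.

```python
import logging

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-service")  # hypothetical service name

app = FastAPI()
# Placeholder artifact, e.g. pulled from the model registry at image build time.
model = joblib.load("model.joblib")


class PredictionRequest(BaseModel):
    features: list[float]  # placeholder schema; a real service validates named features


@app.get("/health")
def health() -> dict:
    # Polled by the load balancer / uptime alerting to detect a down service.
    return {"status": "ok"}


@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    prediction = model.predict([request.features])[0]
    # Request logging: enough to reconstruct traffic and debug bad predictions.
    logger.info("n_features=%d prediction=%s", len(request.features), prediction)
    return {"prediction": float(prediction)}
```

Run with `uvicorn main:app --host 0.0.0.0 --port 8000` on the cloud serving instance from the roadmap; the "alerting when the service is down" requirement then reduces to a simple uptime check against `/health`.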