The retraining architecture determines freshness and cost

MLOps architecture is not primarily about tooling; it is about the policies that determine when and how models are updated. The choice between batch scheduled retraining, data-triggered retraining, and online (streaming) learning has direct implications for model freshness, infrastructure complexity, and operational cost. We see teams default to whichever pattern their tools support most easily, then discover months later that the pattern does not match their drift profile or their compute budget.

The decision sits upstream of tool selection. PyTorch, TensorFlow, scikit-learn, MLflow, Kubeflow, and Airflow can all support any of the three patterns, but each pattern asks different things of the surrounding infrastructure: how features are computed, how monitoring triggers fire, how rollback works when a deployment goes wrong.

Batch scheduled retraining

Models are retrained on a fixed schedule (weekly, monthly) regardless of detected drift.

Pros: predictable compute cost, simple to implement, easy to audit. A nightly or weekly Airflow DAG handles it cleanly, and the artifact trail through MLflow or a model registry is straightforward.
Cons: model may degrade between retraining cycles if data distribution shifts; model may retrain unnecessarily if distribution is stable.
Best for: use cases where distribution shifts are gradual and predictable (seasonal patterns, slow-changing user behaviour). Most enterprise ML workloads start here, and most should stay here unless a concrete reason pushes them elsewhere.

Data-triggered retraining

Retraining is triggered when a monitoring signal exceeds a threshold: data drift, model performance degradation, or accumulated new labelled data volume.

Pros: model stays fresh without unnecessary retraining; more responsive to actual drift than a fixed schedule.
Cons: requires robust drift monitoring; trigger calibration is non-trivial; unexpected triggers can cause compute spikes that surprise the finance side of the conversation.
Best for: use cases with irregular drift patterns (fraud detection, demand forecasting in volatile markets, ad-bidding models around campaign launches).

Online / streaming learning

Model weights are updated continuously as new data arrives, without full retraining.

Pros: maximum freshness; responds to distribution shifts in near real time.
Cons: complex to implement correctly; catastrophic forgetting (new updates can degrade performance on older patterns); difficult to audit and roll back; most deep learning frameworks are not designed for this out of the box.
Best for: a narrow band of use cases, namely recommendation systems with high-volume implicit feedback, fraud detection where labels arrive within seconds, and some ranking and bidding systems.

Pattern comparison

| Pattern | Freshness | Infrastructure complexity | Cost predictability | Rollback ease |
|---|---|---|---|---|
| Scheduled batch | Low–Medium | Low | High | Easy |
| Data-triggered | Medium–High | Medium | Medium | Medium |
| Online learning | High | Very high | Low | Hard |
| Shadow + gradual rollout | Medium | Medium | Medium | Easy |

The shadow pattern is worth calling out separately. A new model is trained on the same cadence as the production model, but its predictions are logged rather than served. Once it has demonstrated parity or improvement over a fixed window, traffic is shifted in stages. This is not really a retraining policy. It is a deployment policy that composes with any of the three above, and it is how we recommend most teams introduce a new architecture or feature set.

Feature store architecture

For organisations with multiple models sharing features, a feature store addresses training-serving skew by ensuring the same feature computation logic runs in both environments.
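
As a rough illustration of what "the same feature computation logic in both environments" means, here is a minimal sketch in plain Python. It is not a feature store and does not use any feature store API. The `Transaction` type, the function names, and the three features are all hypothetical, chosen only to show one function being called from both the training path and the serving path, with an `as_of` cutoff so training rows never see data from after the label time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    """Hypothetical raw event; real systems would have many more fields."""
    amount: float
    timestamp: datetime


def compute_features(history: List[Transaction], as_of: datetime) -> Dict[str, float]:
    """Single definition of the feature logic, shared by training and serving.

    Only events at or before `as_of` are visible, so a training row built for
    a past moment cannot leak information from the future.
    """
    visible = [t for t in history if t.timestamp <= as_of]
    count = len(visible)
    total = float(sum(t.amount for t in visible))
    mean = total / count if count else 0.0
    return {"txn_count": float(count), "txn_total": total, "txn_mean": mean}


def build_training_row(history: List[Transaction], label_time: datetime, label: int) -> Dict[str, float]:
    """Offline path: features as they would have looked at label_time, plus the label."""
    row = compute_features(history, as_of=label_time)
    row["label"] = label
    return row


def build_serving_features(history: List[Transaction], now: datetime) -> Dict[str, float]:
    """Online path: calls the same function rather than reimplementing it."""
    return compute_features(history, as_of=now)
```

Skew typically creeps in when the offline and online paths each carry their own copy of this logic and the copies diverge. Routing both through one function is the smallest version of the guarantee a feature store provides at scale.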
Feature stores (Feast, Tecton, Vertex Feature Store) are worthwhile when multiple models share features, when feature computation is expensive and should be cached, or when point-in-time correctness for training data is required. Feature stores add significant complexity and are not appropriate for organisations with a single model or simple feature engineering. We have seen teams adopt a feature store before they had a second model in production, and the result was a maintenance burden with no payoff. The pattern only earns its keep once feature reuse and point-in-time joins become real problems rather than anticipated ones.

Monitoring architecture

Production models require two types of monitoring:

Data drift monitoring: detects when input distributions change. Statistical measures in common use include PSI (Population Stability Index), the Kolmogorov–Smirnov test, and Jensen–Shannon divergence on feature distributions.
Model performance monitoring: tracks prediction distribution and, when labels are available, actual accuracy metrics on a rolling window.

Both should feed into alerting systems and, ideally, into automated retraining triggers for data-triggered architectures.

The harder design question is what to do when labels arrive late or never. For fraud and lending, labels may arrive months after the prediction; for some classification problems, no ground truth ever returns. In those cases, drift on input distributions and on prediction distributions becomes the only available proxy for staleness, and threshold calibration has to be done by hand against historical replay.

For how these architectural choices fit the broader MLOps picture, "MLOps for organisations that have never operationalised a model" covers the organisational adoption path from scratch.

How do you choose between batch and online retraining?
The choice between batch retraining (scheduled, using accumulated data) and online learning (continuous, using streaming data) depends on three factors: data velocity, concept drift speed, and error cost.

Data velocity: if new training data arrives in large batches (daily database dumps, weekly data exports), batch retraining aligns naturally. If data streams continuously (clickstream events, sensor readings), online learning avoids the waste of waiting for a batch to accumulate.
Concept drift speed: if the relationship between features and target changes slowly (annual seasonality, gradual market shifts), batch retraining at daily or weekly intervals is responsive enough. If the relationship changes rapidly (intraday market volatility, real-time user preference shifts), online learning adapts faster.
Error cost: if a stale model produces expensive errors (incorrect fraud decisions, missed anomalies in critical systems), faster retraining reduces error exposure. If model staleness has low cost (recommendation systems where a slightly outdated model still produces acceptable suggestions), slower retraining is acceptable.

In practice, most production ML systems use batch retraining because it is simpler to implement, debug, and validate. Online learning introduces additional complexity: streaming data validation, incremental model updates without catastrophic forgetting, and continuous quality monitoring without a clear validation checkpoint.

We recommend starting with triggered batch retraining: retrain when monitoring detects performance degradation rather than on a fixed schedule. This approach responds to concept drift without the operational complexity of online learning. The monitoring system tracks a performance metric (accuracy, precision, AUC) on a rolling window of production predictions with delayed ground truth labels. When performance drops below a threshold, the monitoring system triggers the retraining pipeline automatically.
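
The trigger described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a production monitor. The class name `RollingAccuracyTrigger`, the window size, the threshold, the `min_samples` guard, and the re-arm behaviour are all hypothetical choices. Accuracy stands in for whichever metric is tracked, and `on_trigger` is a placeholder for whatever launches the retraining pipeline.

```python
from collections import deque
from typing import Callable, Hashable


class RollingAccuracyTrigger:
    """Fire a callback once when rolling accuracy falls below a threshold."""

    def __init__(
        self,
        window_size: int,
        threshold: float,
        on_trigger: Callable[[float], None],
        min_samples: int = 50,
    ) -> None:
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold
        self.on_trigger = on_trigger
        self.min_samples = min_samples
        # Assumed design choice: after firing, stay quiet until accuracy
        # recovers, so one degradation episode launches one retraining run.
        self.armed = True

    def rolling_accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, predicted: Hashable, actual: Hashable) -> bool:
        """Record one prediction whose (possibly delayed) label has arrived.

        Returns True only on the call that fires the trigger.
        """
        self.outcomes.append(predicted == actual)
        if len(self.outcomes) < self.min_samples:
            return False  # too few labelled predictions to judge
        accuracy = self.rolling_accuracy()
        if accuracy >= self.threshold:
            self.armed = True
            return False
        if not self.armed:
            return False
        self.armed = False
        self.on_trigger(accuracy)
        return True


if __name__ == "__main__":
    trigger = RollingAccuracyTrigger(
        window_size=500,
        threshold=0.90,
        on_trigger=lambda acc: print(f"retraining requested, accuracy={acc:.3f}"),
    )
    trigger.record(predicted=1, actual=1)
```

The `min_samples` guard and the re-arm flag address the calibration problem in miniature: without them, a handful of early mistakes or a sustained dip would launch retraining runs repeatedly and produce the compute spikes that make triggered retraining hard to budget.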
For the rare workloads where online learning is genuinely necessary, we implement it with safeguards: a shadow model trained online runs alongside a stable batch-trained model. Production traffic is served by the batch model until the online model demonstrates superior performance for a sustained period (typically 48–72 hours), at which point the online model is promoted. The batch model stays warm as a rollback target, because the failure mode for online learning is rarely a clean break. It is a slow drift into worse predictions that the system cannot detect on its own.

FAQ