## The retraining architecture determines freshness and cost

MLOps architecture is not primarily about tooling; it is about the policies that determine when and how models are updated. The choice between batch scheduled retraining, data-triggered retraining, and online (streaming) learning has direct implications for model freshness, infrastructure complexity, and operational cost.

### Batch scheduled retraining

Models are retrained on a fixed schedule (weekly, monthly) regardless of detected drift.

- Pros: Predictable compute cost, simple to implement, easy to audit.
- Cons: The model may degrade between retraining cycles if the data distribution shifts, or retrain unnecessarily if the distribution is stable.
- Best for: Use cases where distribution shifts are gradual and predictable (seasonal patterns, slow-changing user behavior). Most enterprise ML workloads start here.

### Data-triggered retraining

Retraining is triggered when a monitoring signal exceeds a threshold: data drift, model performance degradation, or accumulated volume of new labeled data.

- Pros: The model stays fresh without unnecessary retraining; more responsive to actual drift.
- Cons: Requires robust drift monitoring; trigger calibration is non-trivial; unexpected triggers can cause compute spikes.
- Best for: Use cases with irregular drift patterns (fraud detection, demand forecasting in volatile markets).

### Online / streaming learning

Model weights are updated continuously as new data arrives, without full retraining.

- Pros: Maximum freshness; responds to distribution shifts in real time.
- Cons: Complex to implement correctly; catastrophic forgetting (new updates can degrade performance on older patterns); difficult to audit and roll back; most deep learning frameworks are not designed for this.
- Best for: Very narrow use cases: recommendation systems with high-volume user feedback, fraud detection with real-time label availability.

### Pattern comparison

| Pattern | Freshness | Infrastructure complexity | Cost predictability | Rollback ease |
| --- | --- | --- | --- | --- |
| Scheduled batch | Low–Medium | Low | High | Easy |
| Data-triggered | Medium–High | Medium | Medium | Medium |
| Online learning | High | Very high | Low | Hard |
| Shadow + gradual rollout | Medium | Medium | Medium | Easy |

### Feature store architecture

For organizations with multiple models sharing features, a feature store addresses training-serving skew by ensuring the same feature computation logic runs in both the training and serving environments. Feature stores (Feast, Tecton, Vertex AI Feature Store) are worthwhile when multiple models share features, when feature computation is expensive and should be cached, or when point-in-time correctness for training data is required. They add significant complexity and are not appropriate for organizations with a single model or simple feature engineering.

### Monitoring architecture

Production models require two types of monitoring:

- Data drift monitoring: detects when input distributions change (statistical tests such as PSI, the KS test, or Jensen-Shannon divergence on feature distributions).
- Model performance monitoring: tracks the prediction distribution and, when labels are available, actual accuracy metrics.

Both should feed into alerting systems and, ideally, into automated retraining triggers for drift-triggered architectures.

For how these architectural choices fit the broader MLOps picture, *MLOps for organisations that have never operationalised a model* covers the organizational adoption path from scratch.
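To make the data-drift check concrete, here is a minimal sketch of a PSI computation on a single numeric feature. The bin count, the 1e-6 floor for empty bins, the 0.25 alert threshold (a common rule of thumb, with 0.1–0.25 usually read as moderate shift), and the function names are illustrative assumptions rather than a prescribed implementation; a real deployment would run a check like this per feature and route the result into the alerting or retraining trigger described above.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training-time) sample and a recent
    production sample of one numeric feature."""
    # Bin edges come from the reference distribution so both samples
    # are compared on the same grid.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid division by zero and log(0).
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

def drift_alert(reference, production, threshold=0.25):
    # Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift.
    return population_stability_index(reference, production) > threshold
```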
## How do you choose between batch and online retraining?

The choice between batch retraining (scheduled, using accumulated data) and online learning (continuous, using streaming data) depends on three factors: data velocity, concept drift speed, and error cost.

- Data velocity: if new training data arrives in large batches (daily database dumps, weekly data exports), batch retraining aligns naturally. If data streams continuously (clickstream events, sensor readings), online learning avoids the waste of waiting for a batch to accumulate.
- Concept drift speed: if the relationship between features and target changes slowly (annual seasonality, gradual market shifts), batch retraining at daily or weekly intervals is responsive enough. If the relationship changes rapidly (intraday market volatility, real-time user preference shifts), online learning adapts faster.
- Error cost: if a stale model produces expensive errors (incorrect fraud detection, missed anomalies in critical systems), faster retraining reduces error exposure. If model staleness has low cost (recommendation systems where a slightly outdated model still produces acceptable recommendations), slower retraining is acceptable.

In practice, most production ML systems use batch retraining because it is simpler to implement, debug, and validate. Online learning introduces additional complexity: streaming data validation, incremental model updates without catastrophic forgetting, and continuous quality monitoring without a clear validation checkpoint.

We recommend starting with triggered batch retraining: retrain when monitoring detects performance degradation rather than on a fixed schedule. This approach responds to concept drift without the operational complexity of online learning. The monitoring system tracks a performance metric (accuracy, precision, AUC) on a rolling window of production predictions with delayed ground-truth labels. When performance drops below a threshold, the monitoring system triggers the retraining pipeline automatically; a sketch of this trigger logic follows below.

For the rare workloads where online learning is genuinely necessary, we implement it with safeguards: a shadow model trained online runs alongside a stable batch-trained model. Production traffic is served by the batch model until the online model demonstrates superior performance for a sustained period (typically 48–72 hours), at which point the online model is promoted.
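As a sketch of the triggered-retraining mechanism described above: labelled production predictions are scored over a rolling window, and a callback that launches the batch retraining pipeline fires when accuracy drops below a threshold. The class name, the 5,000-prediction window, the 0.90 threshold, and the `trigger_retraining` callback are hypothetical; the structure, not the numbers, is the point.

```python
from collections import deque

class RetrainingTrigger:
    """Tracks rolling-window accuracy on labelled production predictions and
    fires a retraining callback when accuracy falls below a threshold."""

    def __init__(self, trigger_retraining, window_size=5000,
                 threshold=0.90, min_samples=1000):
        self.window = deque(maxlen=window_size)   # most recent labelled outcomes
        self.threshold = threshold
        self.min_samples = min_samples             # don't judge on too few labels
        self.trigger_retraining = trigger_retraining  # e.g. kicks off the batch pipeline

    def record(self, prediction, label):
        # Called whenever a delayed ground-truth label arrives for a past prediction.
        self.window.append(prediction == label)
        if len(self.window) >= self.min_samples:
            accuracy = sum(self.window) / len(self.window)
            if accuracy < self.threshold:
                self.trigger_retraining(accuracy)
                self.window.clear()  # avoid re-firing on the same degraded window
```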
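And a sketch of the shadow-promotion safeguard for online learning: the online-trained shadow model is only promoted after it has outperformed the stable batch model continuously for the sustained period mentioned above. The 48-hour lead requirement (an assumption within the 48–72 hour range) and the single-metric comparison are illustrative; any dip by the shadow model resets the clock.

```python
from datetime import datetime, timedelta, timezone

class ShadowPromotionGate:
    """Decides when an online-trained shadow model may replace the serving
    batch-trained model: the shadow must hold a continuous performance lead
    for the required duration."""

    def __init__(self, required_lead=timedelta(hours=48)):
        self.required_lead = required_lead
        self.lead_since = None  # when the shadow model last pulled ahead

    def update(self, shadow_metric, batch_metric, now=None):
        """Call on each evaluation tick; returns True once promotion is allowed."""
        now = now or datetime.now(timezone.utc)
        if shadow_metric > batch_metric:
            self.lead_since = self.lead_since or now  # start or keep the clock
        else:
            self.lead_since = None                    # any regression resets it
        return (self.lead_since is not None
                and now - self.lead_since >= self.required_lead)
```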