## Two phenomena, one word

A deployed AI system performs worse than it did six months ago. Two different teams reach for the same word, “drift”, and mean two completely different things by it. The MLOps team means the model’s predictions have degraded against the same evaluation set; the platform team means the GPU’s tokens-per-second on the same model has shifted from its install-time benchmark. Both are real. Neither is the other. Conflating them produces root-cause analyses that look at the wrong layer of the stack.

Model drift and hardware-side performance change are independent axes of temporal change, with separate measurement methods, separate monitoring infrastructure, and separate remediation paths. The starting point for reasoning about either is to keep them apart.

## How does model drift differ from hardware-side performance change?

Model drift describes a degradation in model output quality over time as the input distribution shifts away from the distribution the model was trained on. The model itself does not change; its weights are static after training. What changes is the world the model is being applied to, and the model’s behavior on that shifting world deviates from its behavior on the data it was evaluated against at training time.

The drift literature distinguishes several mechanisms:

- **Data drift (covariate shift):** the distribution of input features changes. The relationship between inputs and the correct outputs may be unchanged, but the inputs the model sees in production no longer match the training distribution.
- **Concept drift:** the relationship between inputs and correct outputs changes. The same inputs would now warrant different outputs than they did at training time. This is the harder case, because retraining requires labelled data from the new regime.
- **Label drift:** the distribution of correct outputs changes, often as a downstream effect of one of the above.

The measurement is on the model’s accuracy, calibration, or downstream business metric, not on the accelerator’s throughput. The remediation is data-side: retraining, fine-tuning, or input preprocessing changes. No hardware action addresses model drift.

## What hardware-side performance change actually is

Hardware-side performance change is the temporal axis explored in *why AI performance changes over time*: warmup behavior, thermal equilibrium, scheduling drift, driver/runtime updates, and the slow shifts in the AI Executor’s effective throughput on the same model. The model is unchanged. The accelerator’s silicon is unchanged. What changes is some combination of:

- The thermal regime the device is operating in (sustained heat raises the throttle floor over a long workload).
- The driver and runtime versions deployed on the host.
- The framework version and the kernel libraries it dispatches to.
- Co-tenant workload pressure on the host (CPU, memory bandwidth, network).
- The cooling/power infrastructure of the data center.

The measurement is on the AI Executor’s throughput, latency distribution, or per-precision performance on a fixed workload. The remediation is platform-side: thermal investigation, driver/library version control, scheduling changes, or executor specification updates. No model action addresses hardware drift.
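The hardware-side signal can be made concrete with a small sketch: re-run a fixed reference workload and compare throughput and p95 latency against a baseline recorded at install time. This is a minimal illustration, not a prescribed implementation. `run_reference_workload` is a placeholder for whatever fixed inference workload the baseline was recorded on, and the 10% tolerance and the JSON baseline format are assumptions made for the example.

```python
# Minimal sketch of a reference-benchmark re-run (illustrative).
# Assumes a baseline file like: {"throughput_rps": 41.2, "p95_s": 0.031}
import json
import time


def run_reference_workload(n_requests: int = 200) -> list[float]:
    """Placeholder: each iteration should submit one identical request
    to the production AI Executor and time it. The sleep is a stand-in
    for a real inference call."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        time.sleep(0.005)  # replace with the fixed inference request
        latencies.append(time.perf_counter() - start)
    return latencies


def check_against_baseline(baseline_path: str, tolerance: float = 0.10) -> list[str]:
    """Re-run the fixed workload and report metrics that deviate from
    the stored baseline by more than `tolerance` (an assumed 10% here)."""
    with open(baseline_path) as f:
        baseline = json.load(f)

    latencies = sorted(run_reference_workload())
    current = {
        # Requests run sequentially, so throughput is n / total time.
        "throughput_rps": len(latencies) / sum(latencies),
        "p95_s": latencies[int(0.95 * len(latencies))],
    }
    regressions = []
    for metric, ref in baseline.items():
        delta = (current[metric] - ref) / ref
        if abs(delta) > tolerance:
            regressions.append(
                f"{metric}: {ref:.4g} -> {current[metric]:.4g} ({delta:+.1%})")
    return regressions
```

Note the structural blind spot the table below formalizes: the workload is held fixed, so this check can never see that the model’s outputs on production inputs have gone wrong.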
## The two are uncorrelated and require separate monitoring

| Property | Model drift | Hardware drift |
| --- | --- | --- |
| What changes | Input distribution or input→output relationship | Executor’s effective throughput / latency on a fixed workload |
| What stays constant | Model weights, accelerator hardware, runtime | Model weights, the workload definition |
| Detection signal | Accuracy / calibration / business-metric degradation on a held-out monitoring set | Throughput / p95 / p99 / energy-per-inference deviation from a reference benchmark |
| Required monitoring | Labelled (or proxied) production input + output distribution tracking | Periodic re-runs of a reference benchmark on the production executor |
| Remediation domain | Data and model lifecycle | Platform, driver, runtime, infrastructure |
| What it does NOT detect | Hardware drift: a throughput regression looks normal to a model-quality monitor | Model drift: the model could be returning gibberish at full throughput |

The columns share no detection apparatus. A model-quality monitor that watches accuracy on a labelled production sample cannot detect that the accelerator now produces those same predictions at 60% of its prior throughput. A hardware benchmark that re-runs a reference workload cannot detect that the model’s predictions on that fixed workload are now systematically wrong on the production input distribution. Both monitoring systems are required to understand the operational performance of a deployed AI system over time. A misattributed root cause (“the model is broken” when the throughput regressed, or “the GPU is slow” when the input distribution shifted) is the predictable failure mode when only one is in place.

## Why benchmarks scope only to one of the two

Benchmark protocols measure the executor on a fixed workload. They are designed for that scope. A reference benchmark re-run quarterly on the production AI Executor is the right tool for detecting hardware-side performance change: the workload is held constant, so any deviation in the result is attributable to the executor.

The same protocol cannot detect model drift. The benchmark workload’s input distribution does not change, by design; that is what makes the comparison valid across time. So the part of the system that drifts when the input distribution shifts (the model’s accuracy on production inputs) is precisely the part the benchmark holds constant. A benchmark that tried to detect model drift would have to vary its workload over time, which would also break its ability to detect hardware drift.

The methodological consequence is that benchmark methodology is the right tool for the hardware-drift question and the wrong tool for the model-drift question. Model drift requires production-monitoring instrumentation: held-out evaluation sets refreshed against production data, prediction-distribution tracking, and (where labels are available) accuracy regression alerts. This is different infrastructure from benchmark re-runs; a minimal sketch of one such check appears at the end of this section.

## The framing that helps

Model drift and hardware drift are independent temporal axes. They have separate causes, separate detection signals, separate remediation paths, and separate monitoring infrastructure. A deployed AI system needs both kinds of monitoring; conflating them produces misattributed root-cause analyses; and a benchmark methodology, by holding the workload constant, is structurally scoped to the hardware-side axis only.
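As the counterpart to the hardware-side sketch above, here is a minimal model-side check: a two-sample Kolmogorov-Smirnov test comparing each production input feature against a reference sample from training data, flagging covariate shift. It assumes `scipy` is available; the feature names, sample sizes, and significance threshold are illustrative assumptions, not a recommended protocol.

```python
# Minimal covariate-shift check (illustrative): compare production
# input features against a training-time reference sample.
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.01  # per-feature significance level (illustrative choice)


def detect_data_drift(reference: np.ndarray,
                      production: np.ndarray,
                      feature_names: list[str]) -> list[str]:
    """Return features whose production distribution deviates from the
    training-time reference (two-sample Kolmogorov-Smirnov test).
    Both arrays have shape (n_samples, n_features)."""
    drifted = []
    for i, name in enumerate(feature_names):
        _stat, p_value = ks_2samp(reference[:, i], production[:, i])
        if p_value < ALPHA:
            drifted.append(name)
    return drifted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical features: one shifted in production, one unchanged.
    ref = rng.normal(0.0, 1.0, size=(5000, 2))
    prod = np.column_stack([rng.normal(0.5, 1.0, 5000),
                            rng.normal(0.0, 1.0, 5000)])
    print(detect_data_drift(ref, prod, ["prompt_length", "term_novelty"]))
```

The blind spot is the mirror image of the benchmark’s: this monitor inspects input distributions only, and it stays silent if the executor’s throughput halves.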
LynxBench AI is a benchmark methodology for the hardware-side temporal axis: re-running a reference workload on the AI Executor to detect changes in throughput, latency distribution, and per-precision performance. It is intentionally scoped to that axis, because the model-side axis requires a different instrumentation approach that benchmark methodology does not — and should not — try to substitute for.