Why does a model that runs in 40ms on a test device run in 340ms on production devices?

A client-side ML model validates in the lab at 40ms inference time. The same model, deployed to a production user base, generates support tickets within days. In our experience across client-side ML engagements, inference takes 340ms on a significant fraction of devices (an observed pattern, not a guaranteed outcome), the user experience is broken, and the team is now debugging a latency problem under production pressure.

This failure pattern is not caused by a model architecture error or an implementation bug. It is caused by a missing step before architecture selection: the device capability baseline. The test device was a recent mid-to-high-end handset with a GPU that supports the inference operations efficiently. The production user base includes devices from three years prior with GPUs that process the same operations at 6–8× lower throughput (an observed range across our client-side ML engagements, not a benchmarked industry rate). The model was never evaluated against the device distribution of the actual user population, only against the device the development team had available.

The device capability gap in client-side inference

Client-side ML inference environments (mobile browsers with WebGL or WebGPU, native mobile runtimes such as CoreML and ONNX Runtime Mobile, and web applications) have a fundamental characteristic that server-side inference does not: the compute environment is heterogeneous and outside the deployment team’s control.

A server-side inference deployment runs on infrastructure with known specifications. You choose the hardware, you control the software stack, and inference latency is predictable. A client-side deployment runs on whatever device the user owns, with whatever GPU generation, browser version, and background process load they have at the time.
The device capability gap, meaning the difference in inference throughput between the fastest and slowest devices in a realistic user base, is typically 10–20× for mobile GPU operations (an observed pattern across our client-side ML engagements; the specific multiplier depends on the device cohort being targeted). This gap is not uniformly distributed: a small fraction of devices (recent flagship hardware) represents the best case; a large fraction of devices represents the median and below.

In a client-side ML inference WebSDK project we ran for telecom SIM registration, the latency target was under 200ms for the full registration inference pipeline (operational measurement from that project). The development and testing phase validated against a set of recent mid-range and high-end devices. When the device cohort was extended to include older handsets and budget devices representative of the actual user population in the target market, inference times on 30% of devices exceeded the 200ms target (project-specific cohort measurement from the deployment).

The solution required both a model architecture adjustment (reducing the inference graph depth for low-capability device paths) and device-gating logic (routing low-capability devices to a simplified pipeline or a server-side fallback path). Both changes were significant: they would have been less expensive to design into the system from the beginning than to retrofit after deployment.

The device capability baseline: what it requires and when to establish it

A device capability baseline is an empirical characterisation of the inference performance of the target runtime (WebGL, WebGPU, CoreML, ONNX Runtime Mobile) across the device distribution of the actual user population. It should be established before model architecture selection, not after.
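To make "across the device distribution" concrete, the sketch below weights each measured device by its share of the user population and reports what fraction of users would miss a latency budget. It is a minimal illustration, not the tooling used on any project: the `DeviceMeasurement` type, the device labels, the usage shares, and the latencies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceMeasurement:
    device: str             # hypothetical device label
    usage_share: float      # fraction of the user population on this device
    warm_latency_ms: float  # measured warm inference latency


def share_over_budget(rows: list[DeviceMeasurement], budget_ms: float) -> float:
    """Return the usage-weighted fraction of users whose device misses the budget."""
    total = sum(r.usage_share for r in rows)
    if total <= 0:
        raise ValueError("usage shares must sum to a positive value")
    failing = sum(r.usage_share for r in rows if r.warm_latency_ms > budget_ms)
    return failing / total


# Illustrative numbers only; these are not measurements from any deployment.
cohort = [
    DeviceMeasurement("flagship-recent", 0.25, 45.0),
    DeviceMeasurement("midrange-recent", 0.50, 120.0),
    DeviceMeasurement("midrange-older", 0.15, 230.0),
    DeviceMeasurement("budget-older", 0.10, 340.0),
]
over_budget_share = share_over_budget(cohort, budget_ms=200.0)
print(f"{over_budget_share:.0%} of users would exceed the 200ms budget")
```

Weighting by usage share rather than counting device models keeps a rare flagship and a widely used budget handset from contributing equally to the result.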
Baseline components:

| Component | What it measures | Why it matters |
| --- | --- | --- |
| GPU operation throughput by device cohort | Matrix multiplication, convolution, and activation throughput for representative devices | Determines which neural network architectures are feasible within the latency budget on each device tier |
| Runtime feature support matrix | Which WebGL extensions, WebGPU features, or CoreML operations are supported across the device distribution | Some model operations are emulated on unsupported hardware, with a 10–50× latency penalty |
| Memory pressure under production conditions | Available GPU memory under realistic background load (other apps, browser tabs) | Memory-intensive model layers may fail or fall back to CPU on devices with competing memory pressure |
| Thermal throttling behaviour | Inference latency on repeat requests under sustained load | Devices with aggressive thermal management reduce GPU clock speed after 30–60 seconds of sustained load |
| Network conditions for fallback paths | Available bandwidth if a server-side fallback is needed | Fallback path latency budget depends on round-trip time and transfer size |

The runtime feature support gap is the source of the largest individual latency penalties. Operations that are emulated rather than executed natively on the device’s GPU (for example, a WebGPU-targeted matrix operation falling back to a CPU implementation in a WebGL-only browser, or a CoreML operation falling back to its compatibility layer on an older iOS version) can run 10–50× slower than the native path (an observed range across our client-side ML engagements, not a per-operation guarantee). A model that compiles successfully and produces correct results on a device with feature emulation may still miss its latency target by an order of magnitude.

Once the baseline is established, the architecture decision is straightforward: model size and computational complexity are constrained by the latency budget across the device distribution, not by the best-case device performance.
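The runtime feature support matrix from the table above can be kept as plain data and checked against what a candidate model needs. The sketch below is one possible shape for that check, under stated assumptions: the feature labels, device names, and the `missing_features` helper are illustrative, and how the supported set is collected on each device is left out.

```python
def missing_features(
    required: set[str],
    support_matrix: dict[tuple[str, str], set[str]],
) -> dict[tuple[str, str], list[str]]:
    """Map each (device, runtime) pair to the required features it lacks.

    Pairs that support everything are omitted. A pair that appears in the
    result is a candidate for emulated (slow) execution and should be
    measured before the architecture is fixed.
    """
    gaps: dict[tuple[str, str], list[str]] = {}
    for pair, supported in support_matrix.items():
        missing = required - supported
        if missing:
            gaps[pair] = sorted(missing)
    return gaps


# Hypothetical requirements and support data, for illustration only.
required_features = {"webgl2", "EXT_color_buffer_float"}
support_matrix = {
    ("midrange-recent", "chromium"): {"webgl2", "EXT_color_buffer_float", "webgpu"},
    ("budget-older", "chromium"): {"webgl2"},
    ("legacy", "webkit"): set(),
}

gaps = missing_features(required_features, support_matrix)
for (device, runtime), missing in gaps.items():
    print(f"{device} on {runtime}: missing {', '.join(missing)}")
```

A non-empty result does not by itself mean the latency target is missed; it identifies the device-runtime pairs where the emulation penalty has to be measured rather than assumed.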
Deploying CV models to edge devices describes the broader deployment decision between edge and cloud inference; the device baseline is the input that makes that decision quantitative rather than qualitative.

The device baseline measurement protocol

The baseline is only useful if it is measured against the right devices, with the right workload, and on the right runtimes. The protocol below is the structure we use; the specific device list depends on the user population the deployment is targeting.

1. Device cohort selection. Build the cohort from telemetry of the existing user base, not from generic “popular device” lists. The cohort should cover 95% of the user population by usage share, which is typically 12–20 distinct device models distributed across recent flagship, recent mid-range, two-to-three-year-old mid-range, two-to-three-year-old budget, and four-plus-year-old devices that remain in use. If telemetry is not yet available (greenfield deployment), use the public regional device share statistics from the target market.

2. Runtime matrix. For each device, identify the runtime versions that will execute the inference: browser engine version (Chromium, WebKit, Gecko) and the WebGL/WebGPU support state on each; native runtime version (CoreML version on iOS, ONNX Runtime Mobile or NNAPI version on Android). The same device with two browser versions is two distinct measurement points.

3. Workload definition. The benchmark workload should be the actual model the deployment will use, not a generic benchmark suite. Generic benchmarks (MobileNet inference, ResNet inference) measure throughput characteristics that are useful as a sanity check but do not predict the latency of the specific operation graph in the deployed model.

4. Measurement method.
For each device-runtime-workload combination, measure: cold-start latency (first inference after page load or app launch, including model compilation), warm latency (median of 100 inferences after warm-up), p95 latency under sustained load (5 minutes of repeated inference to expose thermal throttling), and peak memory footprint during inference. Cold-start and sustained-load measurements are the two most often skipped and the two most often responsible for production-time surprises.

5. Tooling. For browser targets, an automated harness using Playwright or WebPageTest running against a BrowserStack or Sauce Labs device farm can produce the matrix without per-device manual setup. For native targets, Firebase Test Lab and AWS Device Farm provide hosted Android and iOS devices at scale, and TestFlight can distribute instrumented builds to physical iOS devices. Hand-instrumented runs on a small cohort of physically owned devices serve as a calibration against the cloud device farm results.

6. Result format. A device baseline report is a table indexed by device-runtime pair with columns for cold latency, warm latency, sustained-load p95, peak memory, and a pass/fail flag against the latency budget. The pass/fail column is the input to the architecture decision: if more than the acceptable fraction of the user population fails, the architecture must change before the model ships.

7. Refresh cadence. Re-run the baseline whenever the deployed model architecture changes meaningfully, when a major OS or browser version ships, and at minimum quarterly. Device baselines drift as the user population’s hardware turns over and as runtime versions evolve.

What happens without a baseline

Teams that skip the device capability baseline make architecture decisions implicitly: the model complexity, the runtime target, and the inference graph design are all calibrated to the development environment rather than the production environment.
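For contrast with that implicit approach, the per-device summary called for by steps 4 and 6 of the protocol is small. The sketch below times a caller-supplied `run_inference` function and produces one report row. It is a language-neutral illustration in Python, not a browser or on-device harness, and it rests on several assumptions: the first call is treated as the cold start (which only holds if model compilation happens lazily on that call), the pass/fail flag is taken against the sustained-load p95 (the protocol does not say which column the flag uses), and peak memory is omitted because measuring it is runtime-specific. All function names are hypothetical.

```python
import statistics
import time
from typing import Callable


def time_calls(run_inference: Callable[[], None], count: int) -> list[float]:
    """Time `count` sequential calls; return per-call latency in milliseconds."""
    samples = []
    for _ in range(count):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def time_for(run_inference: Callable[[], None], seconds: float) -> list[float]:
    """Call repeatedly until `seconds` have elapsed; return latencies in ms."""
    samples: list[float] = []
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        samples.extend(time_calls(run_inference, 1))
    return samples


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def baseline_row(
    run_inference: Callable[[], None],
    budget_ms: float,
    warm_up: int = 10,
    warm_runs: int = 100,
    sustained_seconds: float = 300.0,  # 5 minutes, as in step 4
) -> dict:
    cold_ms = time_calls(run_inference, 1)[0]  # assumed to include compilation
    time_calls(run_inference, warm_up)         # warm-up samples are discarded
    warm_ms = statistics.median(time_calls(run_inference, warm_runs))
    sustained_p95_ms = p95(time_for(run_inference, sustained_seconds))
    return {
        "cold_ms": cold_ms,
        "warm_ms": warm_ms,
        "sustained_p95_ms": sustained_p95_ms,
        "passes": sustained_p95_ms <= budget_ms,  # assumption: flag uses p95
    }


def stand_in_workload() -> None:
    """Placeholder for a real inference call."""
    sum(range(1000))


# Short sustained window so the example finishes quickly.
row = baseline_row(stand_in_workload, budget_ms=200.0, sustained_seconds=0.2)
print(row)
```

In a real run the sustained window would be the full five minutes on the physical device, since a shortened window will not expose thermal throttling.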
The architecture decisions that would have changed if the baseline were known (model depth, layer type selection, batch size) are locked in before the gap between development and production conditions is known.

The most common consequence is a post-deployment architectural rewrite. The options at that point are constrained:

- Model distillation to a smaller architecture that fits within the latency budget on low-capability devices. This requires retraining with a distillation procedure, which is a significant investment if the original model was not designed with distillation in mind.
- Quantisation to reduce inference compute. This reduces latency but introduces quality tradeoffs that may not be acceptable for the use case, and requires per-platform validation.
- Device-gating to route low-capability devices to a simplified model path or server-side fallback. This requires designing a detection mechanism for device capability, which is another audit step that should have happened at the start.

All three options are available before deployment as well. The difference is the cost: designing for a known device distribution is significantly less expensive than refactoring a deployed system under production pressure.

For teams approaching client-side ML deployment for the first time, a Production CV Readiness Assessment includes device capability baseline establishment as a pre-architecture step.
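The device-gating option above needs a routing rule once a device's capability is known. The sketch below shows one possible rule, assuming the expected latency of each model path is available per device, either looked up from the baseline report or taken from a short on-device probe. The `InferencePath` names, the `headroom` safety margin, and the choice to send unknown devices to the server are assumptions made for illustration, not a description of any deployed system.

```python
from enum import Enum
from typing import Optional


class InferencePath(Enum):
    FULL_MODEL = "full_model"
    SIMPLIFIED_MODEL = "simplified_model"
    SERVER_FALLBACK = "server_fallback"


def choose_path(
    full_ms: Optional[float],
    simplified_ms: Optional[float],
    budget_ms: float,
    headroom: float = 0.8,  # hypothetical margin: use at most 80% of the budget
) -> InferencePath:
    """Pick the most capable path whose expected latency fits the budget.

    `None` means no measurement exists for that path on this device; the
    sketch treats that conservatively and does not select the path.
    """
    limit = budget_ms * headroom
    if full_ms is not None and full_ms <= limit:
        return InferencePath.FULL_MODEL
    if simplified_ms is not None and simplified_ms <= limit:
        return InferencePath.SIMPLIFIED_MODEL
    return InferencePath.SERVER_FALLBACK


# Illustrative latencies only.
print(choose_path(full_ms=90.0, simplified_ms=40.0, budget_ms=200.0))
print(choose_path(full_ms=300.0, simplified_ms=120.0, budget_ms=200.0))
print(choose_path(full_ms=None, simplified_ms=None, budget_ms=200.0))
```

The server fallback has its own latency budget (round-trip time and transfer size, as noted in the baseline components), so a fuller version would check that path against the budget as well rather than treating it as always acceptable.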