Why Client-Side ML Projects Miss Latency Targets Before Deployment

Q: What does "client-side ML" actually pay for, and where does it fall apart on mobile and browser?

Client-side ML pays for low end-to-end latency, data locality, and offline capability. It falls apart when the device population is heterogeneous enough that a single model architecture cannot meet the latency budget across the distribution — typically a 10–20× spread between the fastest and slowest devices in a realistic mobile or browser user base.

Q: How do I model latency budgets across device, network, and cold-start before architecting a client-side ML feature?

Decompose the budget into cold-start, warm inference, sustained-load p95 under thermal throttling, and any network round-trip to a fallback endpoint. Measure each component on the actual device cohort and runtime matrix, not on the development device. The architecture is feasible only if every component fits the budget on the acceptable fraction of the cohort.
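A minimal sketch of that feasibility check, assuming one budget per component and a usage-weighted cohort. The component budgets, device labels, measurements, and the 10% acceptable failing share are illustrative assumptions, not figures from a real deployment.

```python
from dataclasses import dataclass


@dataclass
class DeviceMeasurement:
    device: str
    usage_share: float        # relative weight of this device in the user base
    cold_start_ms: float
    warm_ms: float
    sustained_p95_ms: float
    fallback_rtt_ms: float    # network round-trip to the fallback endpoint


# Hypothetical per-component budgets in milliseconds.
BUDGET_MS = {
    "cold_start_ms": 1500.0,
    "warm_ms": 200.0,
    "sustained_p95_ms": 250.0,
    "fallback_rtt_ms": 150.0,
}


def fits_budget(m: DeviceMeasurement) -> bool:
    """A device passes only if every component fits its own budget."""
    return all(getattr(m, name) <= limit for name, limit in BUDGET_MS.items())


def failing_share(cohort: list[DeviceMeasurement]) -> float:
    """Usage-weighted fraction of the cohort that misses at least one budget."""
    total = sum(m.usage_share for m in cohort)
    failed = sum(m.usage_share for m in cohort if not fits_budget(m))
    return failed / total


# Invented cohort: usage_share is given in percentage points.
cohort = [
    DeviceMeasurement("flagship", 20, 600.0, 40.0, 55.0, 80.0),
    DeviceMeasurement("mid_recent", 50, 900.0, 120.0, 170.0, 90.0),
    DeviceMeasurement("budget_old", 30, 2400.0, 340.0, 480.0, 140.0),
]

ACCEPTABLE_FAILING_SHARE = 0.10  # assumption: a product decision, not a standard
share = failing_share(cohort)
print(f"failing share: {share:.2f}, feasible: {share <= ACCEPTABLE_FAILING_SHARE}")
```

In this invented cohort only the older budget device misses its budgets, but it carries 30% of usage, so the architecture would not be feasible under the assumed 10% threshold.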

Why does a model that runs in 40ms on a test device run in 340ms on production devices?

A client-side ML model validates in the lab at 40ms inference time. The same model, deployed to a production user base, generates support tickets within days. In our experience across client-side ML engagements, inference takes 340ms on a significant fraction of devices (an observed pattern, not a guaranteed outcome). The user experience is broken, and the team is now debugging a latency problem under production pressure.

This failure pattern is not caused by a model architecture error or an implementation bug. It is caused by a missing step before architecture selection: the device capability baseline.

The test device was a recent mid-to-high-end handset with a GPU that supports the inference operations efficiently. The production user base includes devices from three years prior with GPUs that process the same operations at 6–8× lower throughput (an observed range across our client-side ML engagements, not a benchmarked industry rate). The model was never evaluated against the device distribution of the actual user population — only against the device the development team had available.

The device capability gap in client-side inference

Client-side ML inference environments — mobile browsers (WebGL, WebGPU), native mobile runtimes (CoreML, ONNX Runtime Mobile), and web applications — have a fundamental characteristic that server-side inference does not: the compute environment is heterogeneous and outside the deployment team’s control.

A server-side inference deployment runs on infrastructure with known specifications. You choose the hardware, you control the software stack, and inference latency is predictable. A client-side deployment runs on whatever device the user owns, with whatever GPU generation, browser version, and background process load they have at the time.

The device capability gap is the difference in inference throughput between the fastest and slowest devices in a realistic user base. It is typically 10–20× for mobile GPU operations (an observed pattern across our client-side ML engagements; the specific multiplier depends on the device cohort being targeted). This gap is not uniformly distributed: a small fraction of devices (recent flagship hardware) represents the best case, while a large fraction of devices represents the median and below.
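As a rough illustration of how the gap can be quantified once measurements exist, the sketch below computes the slowest-to-fastest spread and a usage-weighted median latency. The device labels, latencies, and usage shares are invented for the example.

```python
def capability_spread(latency_ms: dict[str, float]) -> float:
    """Ratio of the slowest to the fastest warm latency across the cohort."""
    return max(latency_ms.values()) / min(latency_ms.values())


def weighted_median(latency_ms: dict[str, float], share: dict[str, float]) -> float:
    """Latency at which cumulative usage share first reaches half the total."""
    total = sum(share.values())
    cumulative = 0.0
    for device in sorted(latency_ms, key=latency_ms.get):
        cumulative += share[device]
        if cumulative >= total / 2:
            return latency_ms[device]
    raise ValueError("empty cohort")


# Hypothetical warm latencies (ms) and usage shares, for illustration only.
warm = {"flagship": 30.0, "mid_recent": 60.0, "mid_old": 180.0, "budget_old": 420.0}
share = {"flagship": 0.10, "mid_recent": 0.30, "mid_old": 0.35, "budget_old": 0.25}

print(capability_spread(warm))       # 14.0
print(weighted_median(warm, share))  # 180.0
```

The weighted median is the more useful number for architecture decisions: in this invented cohort the typical user sees 180ms, six times the flagship figure.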

In a client-side ML inference WebSDK project we ran for telecom SIM registration, the latency target was under 200ms for the full registration inference pipeline (operational measurement from that project). The development and testing phase validated against a set of recent mid-range and high-end devices. When the device cohort was extended to include older handsets and budget devices representative of the actual user population in the target market, inference times on 30% of devices exceeded the 200ms target (project-specific cohort measurement from the deployment). The solution required both a model architecture adjustment (reducing the inference graph depth for low-capability device paths) and device-gating logic (routing low-capability devices to a simplified pipeline or a server-side fallback path). Both changes were significant: they would have been less expensive to design into the system from the beginning than to retrofit after deployment.
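The device-gating idea can be sketched as a small routing function. This is an illustrative reconstruction under stated assumptions, not the project's actual gating code: the `Path` names, the `choose_path` signature, and the idea of feeding it per-device-class latencies from the baseline are all hypothetical.

```python
from enum import Enum


class Path(Enum):
    FULL_MODEL = "full_model"
    SIMPLIFIED_MODEL = "simplified_model"
    SERVER_FALLBACK = "server_fallback"


def choose_path(full_ms: float, simplified_ms: float, budget_ms: float = 200.0) -> Path:
    """Pick the richest pipeline whose latency fits the budget.

    full_ms and simplified_ms are the latencies measured (or predicted from
    the baseline) for this device class; both are inputs the caller supplies.
    """
    if full_ms <= budget_ms:
        return Path.FULL_MODEL
    if simplified_ms <= budget_ms:
        return Path.SIMPLIFIED_MODEL
    return Path.SERVER_FALLBACK


# Invented device classes and latencies in milliseconds.
print(choose_path(full_ms=40.0, simplified_ms=25.0))    # Path.FULL_MODEL
print(choose_path(full_ms=340.0, simplified_ms=150.0))  # Path.SIMPLIFIED_MODEL
print(choose_path(full_ms=600.0, simplified_ms=260.0))  # Path.SERVER_FALLBACK
```

Keeping the decision in one pure function makes it cheap to test against the baseline table before any device sees it.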

The device capability baseline: what it requires and when to establish it

A device capability baseline is an empirical characterisation of the inference performance of the target runtime (WebGL, WebGPU, CoreML, ONNX Runtime Mobile) across the device distribution of the actual user population. It should be established before model architecture selection, not after.

Baseline components:

Component	What it measures	Why it matters
GPU operation throughput by device cohort	Matrix multiplication, convolution, and activation throughput for representative devices	Determines which neural network architectures are feasible within the latency budget on each device tier
Runtime feature support matrix	Which WebGL extensions, WebGPU features, or CoreML operations are supported across the device distribution	Some model operations are emulated on unsupported hardware, with 10–50× latency penalty
Memory pressure under production conditions	Available GPU memory under realistic background load (other apps, browser tabs)	Memory-intensive model layers may fail or fall back to CPU on devices with competing memory pressure
Thermal throttling behaviour	Inference latency on repeat requests under sustained load	Devices with aggressive thermal management reduce GPU clock speed after 30–60 seconds of sustained load
Network conditions for fallback paths	Available bandwidth if a server-side fallback is needed	Fallback path latency budget depends on round-trip time and transfer size

The runtime feature support gap is the source of the largest individual latency penalties. Operations that are emulated rather than executed natively on the device’s GPU — a WebGPU-targeted matrix operation falling back to a CPU implementation in a WebGL-only browser, a CoreML operation falling back to its compatibility layer on an older iOS version — can run 10–50× slower than the native path (an observed range across our client-side ML engagements, not a per-operation guarantee). A model that compiles successfully and produces correct results on a device with feature emulation may still miss its latency target by an order of magnitude.

Once the baseline is established, the architecture decision is straightforward: model size and computational complexity are constrained by the latency budget across the device distribution, not by the best-case device performance. “Deploying CV models to edge devices” describes the broader deployment decision between edge and cloud inference; the device baseline is the input that makes that decision quantitative rather than qualitative.

The device baseline measurement protocol

The baseline is only useful if it is measured against the right devices, with the right workload, and on the right runtimes. The protocol below is the structure we use; the specific device list depends on the user population the deployment is targeting.

1. Device cohort selection. Build the cohort from telemetry of the existing user base, not from generic “popular device” lists. The cohort should cover 95% of the user population by usage share. That is typically 12–20 distinct device models, distributed across recent flagship, recent mid-range, two-to-three-year-old mid-range, two-to-three-year-old budget, and four-plus-year-old devices that remain in use. If telemetry is not yet available (greenfield deployment), use the public regional device share statistics from the target market.

2. Runtime matrix. For each device, identify the runtime versions that will execute the inference: browser engine version (Chromium, WebKit, Gecko) and the WebGL/WebGPU support state on each; native runtime version (CoreML version on iOS, ONNX Runtime Mobile or NNAPI version on Android). The same device with two browser versions is two distinct measurement points.

3. Workload definition. The benchmark workload should be the actual model the deployment will use, not a generic benchmark suite. Generic benchmarks (MobileNet inference, ResNet inference) measure throughput characteristics that are useful as a sanity check but do not predict the latency of the specific operation graph in the deployed model.

4. Measurement method. For each device-runtime-workload combination, measure: cold-start latency (first inference after page load or app launch, including model compilation), warm latency (median of 100 inferences after warm-up), p95 latency under sustained load (5 minutes of repeated inference to expose thermal throttling), and peak memory footprint during inference. Cold-start and sustained-load measurements are the two most often skipped and the two most often responsible for production-time surprises.

5. Tooling. For browser targets, an automated harness using Playwright or WebPageTest running against a BrowserStack or Sauce Labs device farm produces the matrix without per-device manual setup. For native targets, Firebase Test Lab (Android) and AWS Device Farm (iOS) cover the device matrix at scale, and TestFlight can distribute builds to testers’ own iOS devices. Hand-instrumented runs on a small cohort of physically owned devices serve as a calibration against the cloud device farm results.

6. Result format. A device baseline report is a table indexed by device-runtime pair with columns for cold latency, warm latency, sustained-load p95, peak memory, and a pass/fail flag against the latency budget. The pass/fail column is the input to the architecture decision: if more than the acceptable fraction of the user population fails, the architecture must change before the model does.

7. Refresh cadence. Re-run the baseline whenever the deployed model architecture changes meaningfully, when a major OS or browser version ships, and at minimum quarterly. Device baselines drift as the user population’s hardware turns over and as runtime versions evolve.
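Steps 1 and 4 of the protocol can be sketched in a few lines. This is a minimal illustration, assuming telemetry is available as session counts per device model and that raw latency samples have already been collected; the function names, the nearest-rank percentile choice, and all sample values are assumptions of the sketch.

```python
import math
import statistics


def select_cohort(sessions_by_model: dict[str, int], coverage: float = 0.95) -> list[str]:
    """Step 1: most-used device models whose combined share reaches `coverage`."""
    total = sum(sessions_by_model.values())
    ranked = sorted(sessions_by_model.items(), key=lambda kv: kv[1], reverse=True)
    cohort: list[str] = []
    covered = 0
    for model, sessions in ranked:
        if covered >= coverage * total:
            break
        cohort.append(model)
        covered += sessions
    return cohort


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarise(cold_ms: float, warm_ms: list[float],
              sustained_ms: list[float], peak_memory_mb: float) -> dict[str, float]:
    """Step 4: reduce raw samples to the four numbers the report needs."""
    return {
        "cold_ms": cold_ms,
        "warm_median_ms": statistics.median(warm_ms),
        "sustained_p95_ms": percentile(sustained_ms, 95),
        "peak_memory_mb": peak_memory_mb,
    }


# Hypothetical telemetry: sessions per device model.
telemetry = {"model_a": 400, "model_b": 300, "model_c": 200, "model_d": 60, "model_e": 40}
print(select_cohort(telemetry))

# Hypothetical samples: 100 warm inferences and 100 sustained-load inferences.
row = summarise(
    cold_ms=1800.0,
    warm_ms=[40.0] * 90 + [55.0] * 10,
    sustained_ms=[45.0] * 90 + [120.0] * 10,
    peak_memory_mb=210.0,
)
print(row)
```

Selecting purely by usage share does not guarantee the spread across device tiers that step 1 asks for, so the resulting list still needs a manual check that older and budget devices are represented.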

When client-side ML should be replaced with an edge or cloud inference path

The baseline can produce an unambiguous answer: client-side inference is not feasible for the targeted user population within the latency budget. When more than a small minority of the device cohort fails the budget even after model adjustment, the decision is to move the inference path off the device rather than to keep distilling against a wall. The candidate destinations are a regional edge inference service (lowest latency, highest infrastructure cost) or a centralised cloud inference endpoint (higher latency, lower cost, simpler deployment).
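One way to make this decision explicit is a small function over the baseline result and the measured server-side latencies. The sketch below assumes that cloud is preferred over edge whenever it fits the budget, because of its lower cost; the function name, the return labels, and every number in the usage lines are hypothetical.

```python
def choose_inference_path(
    failing_share: float,             # usage-weighted share of the cohort failing on-device
    acceptable_failing_share: float,  # product decision, e.g. 0.10
    budget_ms: float,
    cloud_total_ms: float,            # network round-trip plus cloud inference
    edge_total_ms: float,             # network round-trip plus edge inference
) -> str:
    """Return "client", "cloud", "edge", or "none_fits"."""
    if failing_share <= acceptable_failing_share:
        return "client"
    # Assumption: cloud is cheaper and simpler, so take it when it fits.
    if cloud_total_ms <= budget_ms:
        return "cloud"
    if edge_total_ms <= budget_ms:
        return "edge"
    return "none_fits"


print(choose_inference_path(0.05, 0.10, 200.0, 180.0, 90.0))   # client
print(choose_inference_path(0.30, 0.10, 200.0, 180.0, 90.0))   # cloud
print(choose_inference_path(0.30, 0.10, 200.0, 260.0, 90.0))   # edge
print(choose_inference_path(0.30, 0.10, 200.0, 260.0, 230.0))  # none_fits
```

The last case matters: if neither server path fits either, the budget or the feature itself has to be revisited rather than the deployment target.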

The model-format choice interacts with this decision. A model authored for WebGPU with no ONNX or CoreML export path constrains the deployment to browsers that support WebGPU natively; the older Android Chromium builds and Safari versions that fall back to WebGL will execute a fraction of the operations on the CPU and miss the budget. A model with a clean ONNX export, on the other hand, can run on ONNX Runtime Mobile, on NNAPI, on CoreML via a converter, or on a server-side ONNX Runtime — meaning the same training pipeline supports both the on-device path and the cloud fallback without parallel model maintenance. Format flexibility is what makes a hybrid client-side / server-side deployment economically tractable.
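A hedged sketch of what format flexibility buys at routing time: given the runtimes a device supports and the baseline latency of each, pick the fastest one that fits the budget, otherwise fall back to the server. The runtime labels, the `pick_runtime` helper, and the latencies are illustrative assumptions, not a real runtime-detection API.

```python
def pick_runtime(supported: set[str], latency_ms: dict[str, float], budget_ms: float) -> str:
    """Fastest supported client runtime within budget, else the server fallback."""
    candidates = {
        runtime: ms
        for runtime, ms in latency_ms.items()
        if runtime in supported and ms <= budget_ms
    }
    if not candidates:
        return "server_fallback"
    return min(candidates, key=candidates.get)


# Hypothetical baseline latencies (ms) for one model on one device class.
baseline = {"webgpu": 60.0, "webgl": 900.0}

print(pick_runtime({"webgpu", "webgl"}, baseline, budget_ms=200.0))  # webgpu
print(pick_runtime({"webgl"}, baseline, budget_ms=200.0))            # server_fallback
```

The second call is the WebGL-only case described above: the model runs, but too slowly, so the only path that meets the budget is off the device.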

For instrumenting the deployed system once it ships, the same measurement framework that produced the baseline becomes a regression-detection telemetry stream. Client-side inference latency histograms, segmented by device class and runtime version, surface latency regressions from a new model version or a new browser release before users notice them. The instrumentation cost is small if the harness was built during the baseline phase; it is significant if it has to be retrofitted.
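A minimal sketch of the regression check, assuming telemetry arrives as (device class, runtime version, latency) events and that the baseline stored a p95 per segment. The 20% tolerance, the segment labels, and the sample values are assumptions chosen for the example.

```python
import math
from collections import defaultdict


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def find_regressions(events, baseline_p95_ms, tolerance: float = 1.2):
    """Return segments whose observed p95 exceeds baseline p95 * tolerance.

    events: iterable of (device_class, runtime_version, latency_ms) tuples.
    baseline_p95_ms: dict keyed by (device_class, runtime_version).
    """
    by_segment = defaultdict(list)
    for device_class, runtime_version, latency_ms in events:
        by_segment[(device_class, runtime_version)].append(latency_ms)

    regressions = {}
    for segment, samples in by_segment.items():
        baseline = baseline_p95_ms.get(segment)
        if baseline is None:
            continue  # new segment: nothing to compare against yet
        observed = p95(samples)
        if observed > baseline * tolerance:
            regressions[segment] = observed
    return regressions


# Hypothetical telemetry for two segments.
events = (
    [("flagship", "chromium-A", 30.0)] * 20
    + [("mid_old", "chromium-A", 100.0)] * 18
    + [("mid_old", "chromium-A", 400.0)] * 2
)
baseline = {("flagship", "chromium-A"): 35.0, ("mid_old", "chromium-A"): 150.0}

print(find_regressions(events, baseline))
```

Segmenting before aggregating is the point of the design: a regression confined to one device class and runtime version would be diluted in a single global histogram.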

What happens without a baseline

Teams that skip the device capability baseline make architecture decisions implicitly: the model complexity, the runtime target, and the inference graph design are all calibrated to the development environment rather than the production environment. The architecture decisions that would have changed if the baseline were known — model depth, layer type selection, batch size — are locked in before the gap between development and production conditions is known.

The most common consequence is a post-deployment architectural rewrite. The options at that point are constrained:

1. Model distillation to a smaller architecture that fits within the latency budget on low-capability devices. This requires retraining with a distillation procedure, which is a significant investment if the original model was not designed with distillation in mind.
2. Quantisation to reduce inference compute. This reduces latency but introduces quality tradeoffs that may not be acceptable for the use case, and requires per-platform validation.
3. Device-gating to route low-capability devices to a simplified model path or server-side fallback. This requires designing a detection mechanism for device capability, which is another audit step that should have happened at the start.

All three options are available before deployment as well. The difference is the cost: designing for a known device distribution is significantly less expensive than refactoring a deployed system under production pressure.

For teams approaching client-side ML deployment for the first time, a Production CV Readiness Assessment includes device capability baseline establishment as a pre-architecture step.

FAQ

Why do client-side ML projects miss latency targets before deployment?

Because the device capability baseline is established after architecture selection rather than before. The model and inference graph are tuned to the development team’s devices, then deployed to a user population whose median GPU is several generations older with 6–8× lower throughput on the relevant operations. The latency miss is a sequencing failure, not a modelling error.

What does “client-side ML” actually pay for, and where does it fall apart on mobile and browser?

Client-side ML pays for low end-to-end latency (no network round trip), data locality (sensitive input never leaves the device), and offline capability. It falls apart when the device population is heterogeneous enough that a single model architecture cannot meet the latency budget across the distribution — typically a 10–20× spread between the fastest and slowest devices in a realistic mobile or browser user base.

How do I model latency budgets across device, network, and cold-start before architecting a client-side ML feature?

Decompose the budget into cold-start (first-inference load and compilation), warm inference (median repeat latency), sustained-load p95 (under thermal throttling), and, if a fallback exists, network round-trip to the fallback endpoint. Measure each component on the actual device cohort and runtime matrix, not on the development device. The architecture is feasible only if every component fits the budget on the acceptable fraction of the cohort.

Which model-format choices (ONNX, CoreML, WebGL/WebGPU) most affect real-world client-side latency?

The format choice that locks the deployment to a single runtime path has the largest latency exposure: a WebGPU-only model on a device that falls back to WebGL emulates operations on the CPU at 10–50× the native cost. A model with a clean ONNX export and parallel CoreML / WebGPU / WebGL paths trades a small overhead in build complexity for the ability to route each device to its fastest supported runtime, and to a server-side fallback when no client path meets the budget.

When should client-side ML be replaced with a low-latency edge or cloud inference path?

When the baseline shows that more than a small minority of the targeted device cohort cannot meet the latency budget even after model distillation and quantisation. The choice between regional edge inference and centralised cloud inference is then a cost-versus-latency trade: edge is lowest end-to-end latency at the highest infrastructure cost; cloud is acceptable when the network round-trip plus server inference fits the budget the device path could not meet.

How do I instrument client-side ML to catch latency regressions before users do?

Reuse the baseline measurement harness as a telemetry stream. Emit per-inference latency segmented by device class, runtime version, and cold-versus-warm path; aggregate into histograms; alert on p95 regressions versus the baseline. The cost of building this in is small if the harness already exists from the baseline phase, and large if it has to be retrofitted after a regression has shipped.