Why Client-Side ML Projects Miss Latency Targets Before Deployment

Client-side ML misses latency targets when the device capability baseline is set after architecture selection rather than before. Sequence matters.

Written by TechnoLynx. Published on 29 Apr 2026.

Why does a model that runs in 40ms on a test device run in 340ms on production devices?

A client-side ML model validates in the lab at 40ms inference time. The same model, deployed to a production user base, generates support tickets within days. In our experience across client-side ML engagements, inference takes 340ms on a significant fraction of devices (an observed pattern, not a guaranteed outcome). The user experience is broken, and the team is now debugging a latency problem under production pressure.

This failure pattern is not caused by a model architecture error or an implementation bug. It is caused by a missing step before architecture selection: the device capability baseline.

The test device was a recent mid-to-high-end handset with a GPU that supports the inference operations efficiently. The production user base includes devices from three years prior with GPUs that process the same operations at 6–8× lower throughput (an observed range across our client-side ML engagements, not a benchmarked industry rate). The model was never evaluated against the device distribution of the actual user population — only against the device the development team had available.

The device capability gap in client-side inference

Client-side ML inference environments — mobile browsers (WebGL, WebGPU), native mobile runtimes (CoreML, ONNX Runtime Mobile), and web applications — have a fundamental characteristic that server-side inference does not: the compute environment is heterogeneous and outside the deployment team’s control.

A server-side inference deployment runs on infrastructure with known specifications. You choose the hardware, you control the software stack, and inference latency is predictable. A client-side deployment runs on whatever device the user owns, with whatever GPU generation, browser version, and background process load they have at the time.

The device capability gap — the difference in inference throughput between the fastest and slowest devices in a realistic user base — is typically 10–20× for mobile GPU operations (an observed pattern across our client-side ML engagements; the specific multiplier depends on the device cohort being targeted). This gap is not uniformly distributed: a small fraction of devices (recent flagship hardware) represents the best-case; a large fraction of devices represents the median and below.

In a client-side ML inference WebSDK project we ran for telecom SIM registration, the latency target was under 200ms for the full registration inference pipeline (operational measurement from that project). The development and testing phase validated against a set of recent mid-range and high-end devices. When the device cohort was extended to include older handsets and budget devices representative of the actual user population in the target market, inference times on 30% of devices exceeded the 200ms target (project-specific cohort measurement from the deployment). The solution required both a model architecture adjustment (reducing the inference graph depth for low-capability device paths) and device-gating logic (routing low-capability devices to a simplified pipeline or a server-side fallback path). Both changes were significant: they would have been less expensive to design into the system from the beginning than to retrofit after deployment.
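The device-gating logic described above can be sketched as a routing decision. This is a minimal illustration, not the implementation used in that project: the calibration probe, the safety margin, and every name (`DeviceProbe`, `choose_path`, `Path`) are assumptions introduced here. It is written in Python for readability; in a WebSDK the same decision would live in client-side code.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    FULL = "full_model"
    SIMPLIFIED = "simplified_model"
    SERVER = "server_fallback"


@dataclass(frozen=True)
class DeviceProbe:
    """Result of a short on-device calibration run (hypothetical)."""
    full_ms: float        # measured warm latency of the full pipeline
    simplified_ms: float  # measured warm latency of the simplified pipeline
    gpu_available: bool   # False if the runtime fell back to CPU or emulation


def choose_path(probe: DeviceProbe, budget_ms: float = 200.0,
                margin: float = 0.8) -> Path:
    """Pick the richest pipeline whose measured latency fits the budget.

    `margin` leaves headroom for thermal throttling and background load;
    0.8 is an illustrative value, not a recommendation.
    """
    limit = budget_ms * margin
    if probe.gpu_available and probe.full_ms <= limit:
        return Path.FULL
    if probe.simplified_ms <= limit:
        return Path.SIMPLIFIED
    return Path.SERVER
```

The point of the sketch is that gating needs a measured signal per device, which is why the detection mechanism is cheaper to design alongside the model than to add afterwards.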

The device capability baseline: what it requires and when to establish it

A device capability baseline is an empirical characterisation of the inference performance of the target runtime (WebGL, WebGPU, CoreML, ONNX Runtime Mobile) across the device distribution of the actual user population. It should be established before model architecture selection, not after.

Baseline components:

| Component | What it measures | Why it matters |
| --- | --- | --- |
| GPU operation throughput by device cohort | Matrix multiplication, convolution, and activation throughput for representative devices | Determines which neural network architectures are feasible within the latency budget on each device tier |
| Runtime feature support matrix | Which WebGL extensions, WebGPU features, or CoreML operations are supported across the device distribution | Some model operations are emulated on unsupported hardware, with 10–50× latency penalty |
| Memory pressure under production conditions | Available GPU memory under realistic background load (other apps, browser tabs) | Memory-intensive model layers may fail or fall back to CPU on devices with competing memory pressure |
| Thermal throttling behaviour | Inference latency on repeat requests under sustained load | Devices with aggressive thermal management reduce GPU clock speed after 30–60 seconds of sustained load |
| Network conditions for fallback paths | Available bandwidth if a server-side fallback is needed | Fallback path latency budget depends on round-trip time and transfer size |

The runtime feature support gap is the source of the largest individual latency penalties. Operations that are emulated rather than executed natively on the device’s GPU — a WebGPU-targeted matrix operation falling back to a CPU implementation in a WebGL-only browser, a CoreML operation falling back to its compatibility layer on an older iOS version — can run 10–50× slower than the native path (an observed range across our client-side ML engagements, not a per-operation guarantee). A model that compiles successfully and produces correct results on a device with feature emulation may still miss its latency target by an order of magnitude.
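A back-of-envelope estimate shows why even partial emulation is enough to miss a target. The sketch below assumes a simple additive model (the native part of the time plus the emulated part multiplied by a penalty). The function name and the 20% split are hypothetical, and real runtimes add transfer and synchronisation costs that this ignores.

```python
def latency_with_emulation(native_ms: float, emulated_share: float,
                           penalty: float) -> float:
    """Estimate latency when a share of the native-path time is emulated.

    native_ms: latency when every operation runs natively.
    emulated_share: fraction of that time spent in operations the target
        device has to emulate (0.0 to 1.0).
    penalty: slowdown factor applied to the emulated operations.
    """
    if not 0.0 <= emulated_share <= 1.0:
        raise ValueError("emulated_share must be between 0 and 1")
    native_part = native_ms * (1.0 - emulated_share)
    emulated_part = native_ms * emulated_share * penalty
    return native_part + emulated_part


# A 40 ms model where 20% of the native time sits in emulated operations:
low = latency_with_emulation(40.0, 0.20, penalty=10.0)   # 32 + 80 = 112 ms
high = latency_with_emulation(40.0, 0.20, penalty=50.0)  # 32 + 400 = 432 ms
```

With these illustrative inputs the estimate ranges from 112ms to 432ms, so a model that looks comfortable on a native path can land on either side of a 200ms budget depending on which operations the device emulates.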

Once the baseline is established, the architecture decision is straightforward: model size and computational complexity are constrained by the latency budget across the device distribution, not by the best-case device performance. Our guide to deploying CV models to edge devices describes the broader deployment decision between edge and cloud inference; the device baseline is the input that makes that decision quantitative rather than qualitative.
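One way to make that constraint concrete is to compute the share of the user population whose device tier meets the budget for each candidate model. The tier names, usage shares, and latencies below are invented for illustration; they are not measurements from the project described earlier.

```python
from typing import Mapping


def population_within_budget(usage_share: Mapping[str, float],
                             latency_ms: Mapping[str, float],
                             budget_ms: float) -> float:
    """Return the usage-weighted fraction of users whose tier meets the budget."""
    total = sum(usage_share.values())
    if total <= 0:
        raise ValueError("usage shares must sum to a positive value")
    passing = sum(share for tier, share in usage_share.items()
                  if latency_ms[tier] <= budget_ms)
    return passing / total


# Hypothetical tiers: share of sessions, and warm latency of two candidate models.
usage = {"flagship": 0.15, "recent_mid": 0.35, "old_mid": 0.30, "old_budget": 0.20}
candidate_a = {"flagship": 40, "recent_mid": 90, "old_mid": 180, "old_budget": 340}
candidate_b = {"flagship": 25, "recent_mid": 55, "old_mid": 110, "old_budget": 190}

coverage_a = population_within_budget(usage, candidate_a, budget_ms=200)
coverage_b = population_within_budget(usage, candidate_b, budget_ms=200)
```

In this made-up example candidate A looks fine on a flagship device yet leaves a fifth of the population over budget, while the smaller candidate B fits every tier. Comparing candidates on population coverage rather than best-case latency is the quantitative form of the decision.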

The device baseline measurement protocol

The baseline is only useful if it is measured against the right devices, with the right workload, and on the right runtimes. The protocol below is the structure we use; the specific device list depends on the user population the deployment is targeting.

1. Device cohort selection. Build the cohort from telemetry of the existing user base, not from generic “popular device” lists. The cohort should cover 95% of the user population by usage share, which typically means 12–20 distinct device models distributed across recent flagship, recent mid-range, two-to-three-year-old mid-range, two-to-three-year-old budget, and four-plus-year-old devices that remain in use. If telemetry is not yet available (greenfield deployment), use public regional device share statistics for the target market.

2. Runtime matrix. For each device, identify the runtime versions that will execute the inference: browser engine version (Chromium, WebKit, Gecko) and the WebGL/WebGPU support state on each; native runtime version (CoreML version on iOS, ONNX Runtime Mobile or NNAPI version on Android). The same device with two browser versions is two distinct measurement points.

3. Workload definition. The benchmark workload should be the actual model the deployment will use, not a generic benchmark suite. Generic benchmarks (MobileNet inference, ResNet inference) measure throughput characteristics that are useful as a sanity check but do not predict the latency of the specific operation graph in the deployed model.

4. Measurement method. For each device-runtime-workload combination, measure: cold-start latency (first inference after page load or app launch, including model compilation), warm latency (median of 100 inferences after warm-up), p95 latency under sustained load (5 minutes of repeated inference to expose thermal throttling), and peak memory footprint during inference. Cold-start and sustained-load measurements are the two most often skipped and the two most often responsible for production-time surprises.

5. Tooling. For browser targets, an automated harness using Playwright or WebPageTest running against a BrowserStack or Sauce Labs device farm produces the matrix without per-device manual setup. For native targets, Firebase Test Lab (Android) and AWS Device Farm (iOS) cover the device matrix at scale; TestFlight can distribute iOS builds to physically owned test devices but does not itself provide a device farm. Hand-instrumented runs on a small cohort of physically owned devices serve as a calibration against the cloud device farm results.

6. Result format. A device baseline report is a table indexed by device-runtime pair with columns for cold latency, warm latency, sustained-load p95, peak memory, and a pass/fail flag against the latency budget. The pass/fail column is the input to the architecture decision: if more than the acceptable fraction of the user population fails, the architecture must change before the model does.

7. Refresh cadence. Re-run the baseline whenever the deployed model architecture changes meaningfully, when a major OS or browser version ships, and at minimum quarterly. Device baselines drift as the user population’s hardware turns over and as runtime versions evolve.
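Steps 1, 4, and 6 of the protocol lend themselves to a small amount of analysis code. The sketch below is one possible shape, assuming telemetry is available as session counts per device model and that latency samples have already been collected by the harness. The function names, report keys, and the nearest-rank percentile method are choices made for this illustration.

```python
import math
import statistics
from typing import Dict, List, Sequence


def select_cohort(sessions: Dict[str, int], coverage: float = 0.95) -> List[str]:
    """Step 1: take device models by descending usage until `coverage` is reached."""
    total = sum(sessions.values())
    cohort: List[str] = []
    covered = 0
    for model, count in sorted(sessions.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        cohort.append(model)
        covered += count
    return cohort


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty sample."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarise(cold_ms: float, warm_ms: Sequence[float],
              sustained_ms: Sequence[float], budget_ms: float) -> dict:
    """Steps 4 and 6: one report row for a device-runtime pair."""
    warm_median = statistics.median(warm_ms)
    sustained_p95 = percentile(sustained_ms, 95)
    return {
        "cold_ms": cold_ms,  # reported, but not part of the pass rule here
        "warm_median_ms": warm_median,
        "sustained_p95_ms": sustained_p95,
        # Pass only if both the warm and the throttled figures fit the budget.
        "pass": warm_median <= budget_ms and sustained_p95 <= budget_ms,
    }
```

Whether cold-start latency should also gate the pass/fail flag depends on the product; the sketch reports it separately so that the choice stays explicit.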

What happens without a baseline

Teams that skip the device capability baseline make architecture decisions implicitly: the model complexity, the runtime target, and the inference graph design are all calibrated to the development environment rather than the production environment. The architecture decisions that would have changed if the baseline were known — model depth, layer type selection, batch size — are locked in before the gap between development and production conditions is known.

The most common consequence is a post-deployment architectural rewrite. The options at that point are constrained:

  • Model distillation to a smaller architecture that fits within the latency budget on low-capability devices. This requires retraining with a distillation procedure, which is a significant investment if the original model was not designed with distillation in mind.
  • Quantisation to reduce inference compute. This reduces latency but introduces quality tradeoffs that may not be acceptable for the use case, and requires per-platform validation.
  • Device-gating to route low-capability devices to a simplified model path or server-side fallback. This requires designing a detection mechanism for device capability — which is another audit step that should have happened at the start.

All three options are available before deployment as well. The difference is the cost: designing for a known device distribution is significantly less expensive than refactoring a deployed system under production pressure.

For teams approaching client-side ML deployment for the first time, a Production CV Readiness Assessment includes device capability baseline establishment as a pre-architecture step.
