Browser inference on low-end mobile devices is a model-strategy problem that happens to run in a browser, not a web-development problem. For an SME building SIM registration software, what arrived as a tuning brief turned into an architectural restart: the original multi-model design was not the right foundation for a ≤200ms latency budget. A single multiclassifier was.
The client was building SIM registration software that needed to assess photo quality in the browser before upload, detecting plain background, blur, eyes-closed, and occlusion conditions. The system had to run on low-end Android devices. It had to work without a server round-trip. And it had to return a result fast enough that a user waiting for a selfie capture did not notice the inference happening. This was not a web development problem. It was an AI model strategy problem that happened to run in a browser.
≤200ms on low-end mobile devices.
The latency budget was fixed by product requirements. Low-end Android devices have fragmented hardware support and wildly variable GPU compute availability: a model that hits the budget on a mid-range test device may exceed it significantly on a budget handset in a real deployment.
WebGPU availability is not guaranteed.
WebGPU was not universally available across the target device and browser matrix. The system needed a WebGL fallback that maintained acceptable performance, not a graceful degradation that effectively disabled quality checking on non-WebGPU devices.
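A minimal sketch of that selection logic, assuming TF.js with its optional WebGPU backend package; the function name is illustrative, not the client code:

```ts
import * as tf from '@tensorflow/tfjs';
// The WebGPU backend ships as a separate package and registers itself on import.
import '@tensorflow/tfjs-backend-webgpu';

// Prefer WebGPU where the browser exposes it; otherwise fall back to WebGL.
// tf.setBackend resolves to false if the backend fails to initialise.
async function selectBackend(): Promise<string> {
  if ('gpu' in navigator && (await tf.setBackend('webgpu'))) {
    return 'webgpu';
  }
  await tf.setBackend('webgl'); // the fallback path must still meet the latency budget
  await tf.ready();
  return tf.getBackend();
}
```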
JavaScript pixel capture is a bottleneck.
The naive approach, drawing the video frame to a canvas and reading it back with getImageData, is too slow on low-end devices to meet the latency target. The frame capture method is not a secondary concern: it is often the primary performance bottleneck before inference even starts.
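For contrast, a minimal sketch of that naive path (illustrative, not the client's code): every frame is copied out of the GPU into a JS-side pixel buffer before inference can even begin.

```ts
// The slow baseline: copy the video frame to a canvas, then read the pixels
// back into JS. getImageData forces a GPU-to-CPU readback that can dominate
// the frame budget on low-end devices.
function capturePixels(video: HTMLVideoElement): ImageData {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d')!;
  ctx.drawImage(video, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}
```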
The original model needed a restart, not a tune.
The existing multi-model architecture, running separate classifiers for each quality dimension, was not a good foundation to optimise. Our assessment: this was a major AI problem, closer to restart than fix. Multiple models mean multiple load times, multiple memory allocations, and multiple inference passes per frame, all compounding the latency problem.
Each of the three decisions that made the latency target reachable was a reversal of the original architectural assumption, not a refinement of it.
Profiling the existing multi-model system end-to-end (model load time, inference time, pixel capture overhead, memory pressure) made it clear the architecture was not recoverable by tuning: the per-model costs compounded on every frame. The replacement was a single multiclassifier predicting all quality dimensions at once, compressed via knowledge distillation to preserve the 90-95% accuracy band.
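A minimal sketch of the single-model shape, assuming a TF.js layers model; the backbone depth, head names, and input size are illustrative assumptions, not the client architecture.

```ts
import * as tf from '@tensorflow/tfjs';

// One shared backbone, one sigmoid head per quality dimension, so all four
// predictions come out of a single inference pass.
function buildMulticlassifier(): tf.LayersModel {
  const input = tf.input({ shape: [128, 128, 3] }); // illustrative input size
  let x = tf.layers
    .conv2d({ filters: 16, kernelSize: 3, activation: 'relu' })
    .apply(input) as tf.SymbolicTensor;
  x = tf.layers.maxPooling2d({ poolSize: 2 }).apply(x) as tf.SymbolicTensor;
  x = tf.layers
    .conv2d({ filters: 32, kernelSize: 3, activation: 'relu' })
    .apply(x) as tf.SymbolicTensor;
  x = tf.layers.globalAveragePooling2d({}).apply(x) as tf.SymbolicTensor;

  // Four heads: plain background, blur, eyes-closed, occlusion.
  const heads = ['background', 'blur', 'eyes_closed', 'occlusion'].map(
    (name) =>
      tf.layers
        .dense({ units: 1, activation: 'sigmoid', name })
        .apply(x) as tf.SymbolicTensor
  );
  return tf.model({ inputs: input, outputs: heads });
}
```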
Evaluating TensorFlow.js and ONNX.js against the operation set the architecture actually required surfaced layer and op compatibility issues in ONNX.js, issues that would not have been visible from documentation alone. The framework selection had to be reversed back to TF.js. The lesson is generalisable: real-world inference framework choice is decided by which runtime supports the layers and ops the target architecture actually needs, not by benchmark headlines.
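A sketch of the kind of probe that surfaces these gaps, using onnxruntime-web (the successor to ONNX.js) as an assumed stand-in; unsupported layers and ops typically only fail at session creation or on the first run, not in the docs.

```ts
import * as ort from 'onnxruntime-web';

// Op support is only proven by actually creating a session and running a
// frame-shaped input on the target backend.
async function probeRuntime(modelUrl: string): Promise<boolean> {
  try {
    const session = await ort.InferenceSession.create(modelUrl, {
      executionProviders: ['webgl'],
    });
    const input = new ort.Tensor(
      'float32',
      new Float32Array(1 * 3 * 128 * 128), // illustrative input shape
      [1, 3, 128, 128]
    );
    await session.run({ [session.inputNames[0]]: input });
    return true;
  } catch (err) {
    // Unsupported layers/ops surface here, not in the documentation.
    console.warn('Runtime rejected model:', err);
    return false;
  }
}
```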
On the target devices, the naive frame-capture path (reading pixel data from a video tag via JavaScript) was slower than the inference itself. Replacing it with WebGL texture binding read directly from the video tag eliminated a non-inference bottleneck that no amount of model compression would have closed. A device-capability gating component (loading a dummy model, measuring inference time, computing a performance ratio) then prevented hardware that still could not meet the latency budget from being served the classifier at all.
We redesigned the system from first principles for the browser inference constraint, not tuning the existing architecture but replacing the parts that made it fundamentally unfit for the latency target. The result: a system that runs correctly on the devices where it can, and fails explicitly rather than silently on those where it cannot.
One model predicts all face quality dimensions in a single inference pass. Multi-model designs compound load time, memory pressure, and inference passes, with each cost paid per frame. The continuous classifier runs on every frame; secondary classifiers (for conditions that only need to run on selected frames) trigger conditionally, avoiding compute on frames already filtered out. The pattern recurs across our computer vision work: when the latency budget sits below what the original architecture costs, replacement is faster than tuning.
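A sketch of that per-frame gating pattern; the model handles, threshold, and score layout are hypothetical.

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical per-frame loop: the continuous classifier runs on every
// frame; the secondary classifier only runs on frames that pass the gate.
async function assessFrame(
  continuous: tf.GraphModel,
  secondary: tf.GraphModel,
  frame: tf.Tensor4D
): Promise<number[] | null> {
  const scores = tf.tidy(() => continuous.predict(frame) as tf.Tensor);
  const [quality] = await scores.data();
  scores.dispose();

  if (quality < 0.5) return null; // frame filtered out: skip secondary compute

  const extra = tf.tidy(() => secondary.predict(frame) as tf.Tensor);
  const values = Array.from(await extra.data());
  extra.dispose();
  return values;
}
```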
On low-end devices the frame-capture overhead is often larger than the inference overhead, making capture the first thing to fix, not the last. Reading the video tag via WebGL texture binding bypasses the slow JS pixel-read path and integrates naturally with a WebGL inference backend, so the captured texture stays on the GPU rather than crossing the JS boundary twice.
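A minimal sketch of the GPU-side capture path, assuming the TF.js WebGL backend, where tf.browser.fromPixels uploads the frame as a texture instead of reading pixels back through JS; the preprocessing constants are illustrative.

```ts
import * as tf from '@tensorflow/tfjs';

// On the WebGL backend, fromPixels uploads the video frame directly as a
// GPU texture; the pixels never cross into JS, and the preprocessed tensor
// stays on the GPU for the inference pass.
function captureFrame(video: HTMLVideoElement): tf.Tensor4D {
  return tf.tidy(() =>
    tf.browser
      .fromPixels(video)
      .resizeBilinear([128, 128]) // illustrative model input size
      .toFloat()
      .div(255)
      .expandDims(0) as tf.Tensor4D
  );
}
```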
Static device-tier assumptions break as new hardware enters the field. A measured-performance gating component is more reliable: at initialisation it loads a lightweight dummy model, measures inference time, and computes a performance ratio. Devices below the threshold do not receive the classifier; they get an explicit fallback rather than a silent latency failure in production.
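A minimal sketch of that initialisation check; the dummy-model URL, headroom ratio, and input shape are illustrative assumptions, not the client's values.

```ts
import * as tf from '@tensorflow/tfjs';

const LATENCY_BUDGET_MS = 200;
const HEADROOM = 0.5; // assumed: the dummy pass should cost well under half the budget

// Load a lightweight dummy model, time a warm inference pass, and gate the
// real classifier on the measured ratio rather than a static device-tier list.
async function deviceQualifies(dummyModelUrl: string): Promise<boolean> {
  const model = await tf.loadGraphModel(dummyModelUrl);
  const input = tf.zeros([1, 128, 128, 3]);

  // Warm-up pass so shader compilation does not pollute the measurement.
  const warm = model.predict(input) as tf.Tensor;
  await warm.data();
  warm.dispose();

  const start = performance.now();
  const out = model.predict(input) as tf.Tensor;
  await out.data(); // awaiting the download forces GPU completion
  const elapsed = performance.now() - start;

  out.dispose();
  input.dispose();
  return elapsed / LATENCY_BUDGET_MS < HEADROOM;
}
```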
The architectural decisions are the headline; the numbers follow from them. Replacing the multi-model design with a single multiclassifier, replacing JS pixel capture with WebGL texture binding, and adding measured-performance device gating moved the latency budget from unattainable to operationally bounded. With those three reversals in place, the rebuilt system was validated against the ≤200ms end-to-end latency target on the qualifying device tier and against the 90-95% per-dimension accuracy target at the browser inference stage. The capability-gating component handled the rest, routing devices below the measured-performance threshold to the explicit fallback instead of letting them fail silently on latency in production.
One boundary worth naming: the framework reversal (TF.js → ONNX.js → TF.js) only became visible after the target architecture was actually built and tested on both runtimes. Layer and op compatibility constraints in browser inference frameworks are rarely fully described in their documentation, a recurring lesson in client-side ML deployment for telecommunications and identity-verification products built on commodity mobile hardware.
Validated against the ≤200ms end-to-end latency target on the qualifying low-end mobile device tier
Validated against the 90-95% browser-stage accuracy target across the four quality dimensions
Single multiclassifier replaced multi-model architecture, reducing load time, memory pressure, and inference passes in one change
WebGL texture binding replaced JS pixel capture, eliminating a capture bottleneck larger than inference on constrained hardware
Measured-performance device gating: more reliable than static tier assumptions across a fragmented device landscape
Browser inference on commodity mobile hardware is decided by architecture, not by tuning. The choices (how many models to load, how to capture pixels, which devices to serve at all) usually determine whether the latency budget is reachable, long before any individual model is optimised.