Browser inference on low-end mobile devices is a model-strategy problem that happens to run in a browser, not a web-development problem. For an SME building SIM registration software, what arrived as a tuning brief turned into an architectural restart: the original multi-model design was not the right foundation for a ≤200ms latency budget. A single multiclassifier was.
The client was building SIM registration software that needed to assess photo quality in the browser before upload, detecting plain background, blur, eyes-closed, and occlusion conditions. The system had to run on low-end Android devices. It had to work without a server round-trip. And it had to return a result fast enough that a user waiting for a selfie capture did not notice the inference happening. This was not a web development problem. It was an AI model strategy problem that happened to run in a browser.
≤200ms on low-end mobile devices.
The latency budget was fixed by product requirements. Low-end Android devices have fragmented hardware support and wildly variable GPU compute availability: a model that hits the budget on a mid-range test device may exceed it significantly on a budget handset in a real deployment.
WebGPU availability is not guaranteed.
WebGPU was not universally available across the target device and browser matrix. The system needed a WebGL fallback that maintained acceptable performance, not a graceful degradation that effectively disabled quality checking on non-WebGPU devices.
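A minimal sketch of that selection logic, assuming TF.js with its optional WebGPU backend package; the function name is illustrative, not the client code:

```ts
import * as tf from '@tensorflow/tfjs';
// The WebGPU backend ships as a separate package and registers itself on import.
import '@tensorflow/tfjs-backend-webgpu';

// Prefer WebGPU where the browser exposes it; otherwise fall back to WebGL.
// tf.setBackend resolves to false if the backend fails to initialise.
async function selectBackend(): Promise<string> {
  if ('gpu' in navigator && (await tf.setBackend('webgpu'))) {
    return 'webgpu';
  }
  await tf.setBackend('webgl'); // the fallback path must still meet the latency budget
  await tf.ready();
  return tf.getBackend();
}
```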
JavaScript pixel capture is a bottleneck.
The naive approach, drawing the video frame to a canvas and reading it back with getImageData, is too slow on low-end devices to meet the latency target. The frame capture method is not a secondary concern: it is often the primary performance bottleneck before inference even starts.
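For contrast, a minimal sketch of that naive path (illustrative, not the client's code): every frame is copied out of the GPU into a JS-side pixel buffer before inference can even begin.

```ts
// The slow baseline: copy the video frame to a canvas, then read the pixels
// back into JS. getImageData forces a GPU-to-CPU readback that can dominate
// the frame budget on low-end devices.
function capturePixels(video: HTMLVideoElement): ImageData {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d')!;
  ctx.drawImage(video, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}
```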
The original model needed a restart, not a tune.
The existing multi-model architecture, running separate classifiers for each quality dimension, was not a good foundation to optimise. Our assessment: this was a major AI problem, closer to restart than fix. Multiple models mean multiple load times, multiple memory allocations, and multiple inference passes per frame, all compounding the latency problem.
Each of the three decisions that made the latency target reachable was a reversal of the original architectural assumption, not a refinement of it.
Profiling the existing multi-model system end-to-end (model load time, inference time, pixel capture overhead, memory pressure) made it clear the architecture was not recoverable by tuning: the per-model costs compounded on every frame. The replacement was a single multiclassifier predicting all quality dimensions at once, compressed via knowledge distillation to preserve the 90-95% accuracy band.
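A minimal sketch of the single-model shape, assuming a TF.js layers model; the backbone depth, head names, and input size are illustrative assumptions, not the client architecture.

```ts
import * as tf from '@tensorflow/tfjs';

// One shared backbone, one sigmoid head per quality dimension, so all four
// predictions come out of a single inference pass.
function buildMulticlassifier(): tf.LayersModel {
  const input = tf.input({ shape: [128, 128, 3] }); // illustrative input size
  let x = tf.layers
    .conv2d({ filters: 16, kernelSize: 3, activation: 'relu' })
    .apply(input) as tf.SymbolicTensor;
  x = tf.layers.maxPooling2d({ poolSize: 2 }).apply(x) as tf.SymbolicTensor;
  x = tf.layers
    .conv2d({ filters: 32, kernelSize: 3, activation: 'relu' })
    .apply(x) as tf.SymbolicTensor;
  x = tf.layers.globalAveragePooling2d({}).apply(x) as tf.SymbolicTensor;

  // Four heads: plain background, blur, eyes-closed, occlusion.
  const heads = ['background', 'blur', 'eyes_closed', 'occlusion'].map(
    (name) =>
      tf.layers
        .dense({ units: 1, activation: 'sigmoid', name })
        .apply(x) as tf.SymbolicTensor
  );
  return tf.model({ inputs: input, outputs: heads });
}
```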
Evaluating TensorFlow.js and ONNX.js against the operation set the architecture actually required surfaced layer and op compatibility issues in ONNX.js, issues that would not have been visible from documentation alone. The framework selection had to be reversed back to TF.js. The lesson is generalisable: real-world inference framework choice is decided by which runtime supports the layers and ops the target architecture actually needs, not by benchmark headlines.
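A sketch of the kind of probe that surfaces these gaps, using onnxruntime-web (the successor to ONNX.js) as an assumed stand-in; unsupported layers and ops typically only fail at session creation or on the first run, not in the docs.

```ts
import * as ort from 'onnxruntime-web';

// Op support is only proven by actually creating a session and running a
// frame-shaped input on the target backend.
async function probeRuntime(modelUrl: string): Promise<boolean> {
  try {
    const session = await ort.InferenceSession.create(modelUrl, {
      executionProviders: ['webgl'],
    });
    const input = new ort.Tensor(
      'float32',
      new Float32Array(1 * 3 * 128 * 128), // illustrative input shape
      [1, 3, 128, 128]
    );
    await session.run({ [session.inputNames[0]]: input });
    return true;
  } catch (err) {
    // Unsupported layers/ops surface here, not in the documentation.
    console.warn('Runtime rejected model:', err);
    return false;
  }
}
```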
On the target devices, the naive frame-capture path (reading pixel data from a video tag via JavaScript) was slower than the inference itself. Replacing it with WebGL texture binding read directly from the video tag eliminated a non-inference bottleneck that no amount of model compression would have closed. A device-capability gating component (loading a dummy model, measuring inference time, computing a performance ratio) then prevented hardware that still could not meet the latency budget from being served the classifier at all.
We redesigned the system from first principles for the browser inference constraint, not tuning the existing architecture but replacing the parts that made it fundamentally unfit for the latency target. The result: a system that runs correctly on the devices where it can, and fails explicitly rather than silently on those where it cannot.
One model predicts all face quality dimensions in a single inference pass. Multi-model designs compound load time, memory pressure, and inference passes, with each cost paid per frame. The continuous classifier runs on every frame; secondary classifiers (for conditions that only need to run on selected frames) trigger conditionally, avoiding compute on frames already filtered out. The pattern recurs across our computer vision work: when the latency budget sits below what the original architecture costs, replacement is faster than tuning.
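A sketch of that per-frame gating pattern; the model handles, threshold, and score layout are hypothetical.

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical per-frame loop: the continuous classifier runs on every
// frame; the secondary classifier only runs on frames that pass the gate.
async function assessFrame(
  continuous: tf.GraphModel,
  secondary: tf.GraphModel,
  frame: tf.Tensor4D
): Promise<number[] | null> {
  const scores = tf.tidy(() => continuous.predict(frame) as tf.Tensor);
  const [quality] = await scores.data();
  scores.dispose();

  if (quality < 0.5) return null; // frame filtered out: skip secondary compute

  const extra = tf.tidy(() => secondary.predict(frame) as tf.Tensor);
  const values = Array.from(await extra.data());
  extra.dispose();
  return values;
}
```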
On low-end devices the frame-capture overhead is often larger than the inference overhead, making capture the first thing to fix, not the last. Reading the video tag via WebGL texture binding bypasses the slow JS pixel-read path and integrates naturally with a WebGL inference backend, so the captured texture stays on the GPU rather than crossing the JS boundary twice.
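A minimal sketch of the GPU-side capture path, assuming the TF.js WebGL backend, where tf.browser.fromPixels uploads the frame as a texture instead of reading pixels back through JS; the preprocessing constants are illustrative.

```ts
import * as tf from '@tensorflow/tfjs';

// On the WebGL backend, fromPixels uploads the video frame directly as a
// GPU texture; the pixels never cross into JS, and the preprocessed tensor
// stays on the GPU for the inference pass.
function captureFrame(video: HTMLVideoElement): tf.Tensor4D {
  return tf.tidy(() =>
    tf.browser
      .fromPixels(video)
      .resizeBilinear([128, 128]) // illustrative model input size
      .toFloat()
      .div(255)
      .expandDims(0) as tf.Tensor4D
  );
}
```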
Static device-tier assumptions break as new hardware enters the field. A measured-performance gating component is more reliable: at initialisation it loads a lightweight dummy model, measures inference time, and computes a performance ratio. Devices below the threshold do not receive the classifier; they get an explicit fallback rather than a silent latency failure in production.
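A minimal sketch of that initialisation check; the dummy-model URL, headroom ratio, and input shape are illustrative assumptions, not the client's values.

```ts
import * as tf from '@tensorflow/tfjs';

const LATENCY_BUDGET_MS = 200;
const HEADROOM = 0.5; // assumed: the dummy pass should cost well under half the budget

// Load a lightweight dummy model, time a warm inference pass, and gate the
// real classifier on the measured ratio rather than a static device-tier list.
async function deviceQualifies(dummyModelUrl: string): Promise<boolean> {
  const model = await tf.loadGraphModel(dummyModelUrl);
  const input = tf.zeros([1, 128, 128, 3]);

  // Warm-up pass so shader compilation does not pollute the measurement.
  const warm = model.predict(input) as tf.Tensor;
  await warm.data();
  warm.dispose();

  const start = performance.now();
  const out = model.predict(input) as tf.Tensor;
  await out.data(); // awaiting the download forces GPU completion
  const elapsed = performance.now() - start;

  out.dispose();
  input.dispose();
  return elapsed / LATENCY_BUDGET_MS < HEADROOM;
}
```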
The architectural decisions are the headline; the numbers follow from them. Replacing the multi-model design with a single multiclassifier, replacing JS pixel capture with WebGL texture binding, and adding measured-performance device gating moved the latency budget from unattainable to operationally bounded. With those three reversals in place, the rebuilt system was validated against the ≤200ms end-to-end latency target on the qualifying device tier and against the 90-95% per-dimension accuracy target at the browser inference stage. The capability-gating component handled the rest, routing devices below the measured-performance threshold to the explicit fallback instead of letting them fail silently on latency in production.
One boundary worth naming: the framework reversal (TF.js → ONNX.js → TF.js) only became visible after the target architecture was actually built and tested on both runtimes. Layer and op compatibility constraints in browser inference frameworks are rarely fully described in their documentation, a recurring lesson in client-side ML deployment for telecommunications and identity-verification products built on commodity mobile hardware.
Validated against the ≤200ms end-to-end latency target on the qualifying low-end mobile device tier
Validated against the 90-95% browser-stage accuracy target across the four quality dimensions
Single multiclassifier replaced multi-model architecture, reducing load time, memory pressure, and inference passes in one change
WebGL texture binding replaced JS pixel capture, eliminating a capture bottleneck larger than inference on constrained hardware
Measured-performance device gating: more reliable than static tier assumptions across a fragmented device landscape
Browser inference on commodity mobile hardware is decided by architecture, not by tuning. The choices (how many models to load, how to capture pixels, which devices to serve at all) usually determine whether the latency budget is reachable, long before any individual model is optimised.