How to Deploy Computer Vision Models on Edge Devices

Why the edge matters for computer vision

A cloud-based computer vision pipeline works like this: a camera captures an image, the image is transmitted over the network to a cloud server, the server runs inference, and the result is transmitted back. The round-trip latency — image transmission, queuing, inference, result transmission — is typically 100–500 milliseconds, sometimes more under network congestion. For many applications, that latency is acceptable. For others — industrial inspection at production line speed, autonomous navigation, real-time safety monitoring — it is not.

Edge deployment moves the inference step from the cloud to a device co-located with the camera: an NVIDIA Jetson module, a Google Coral accelerator, a Qualcomm AI-optimised SoC, or an Intel Neural Compute Stick attached to an embedded system. The image never leaves the device. Inference latency drops to 10–50 milliseconds (an observed pattern across our edge-CV engagements, not a benchmarked industry rate). Network bandwidth requirements drop to near zero — only results, not images, are transmitted. And the system continues operating when the network connection is unavailable, which in industrial environments happens more often than IT architecture diagrams suggest.

The trade-off: edge devices have constrained compute, memory, and power budgets compared to cloud servers. A model that runs comfortably on an NVIDIA A100 in the cloud may not run at all on a Jetson Nano, and the modifications required to fit the edge hardware’s constraints affect accuracy, throughput, or both. This is the trade-off envelope the rest of this article tries to make navigable.

How do you fit a production model onto an edge device?

As an illustrative example from our edge-deployment engagements (an observed pattern, not a benchmarked industry rate): a ResNet-152 trained for image classification has approximately 60 million parameters, requiring roughly 240 MB of memory for its 32-bit weights alone and significant compute for each inference pass. An edge device with 2–4 GB of shared RAM and a low-power GPU or NPU cannot run this model at production frame rates. The model has to be made smaller, faster, or both, without degrading accuracy below the application’s acceptance threshold.
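The 240 MB figure follows from parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic is below; it counts weights only (activations, runtime buffers, and framework overhead are extra and not modelled), and the function name is illustrative.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Activations and runtime overhead are deliberately ignored.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Return approximate weight storage in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


resnet152_params = 60_000_000  # approximate figure quoted in the text

for precision in ("fp32", "fp16", "int8"):
    size = weight_memory_mb(resnet152_params, precision)
    print(f"{precision}: {size:.0f} MB")
```

The same calculation shows why precision reduction is usually the first lever: the weight footprint scales linearly with bytes per parameter.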

Quantisation reduces model precision from 32-bit floating point to 16-bit floating point or 8-bit integer representation. Relative to 32-bit weights, INT8 quantisation typically reduces model size by 4× and improves inference speed by 2–4×, with accuracy degradation of 0.5–2 percentage points for well-quantised models (observed pattern across our deployments, not a published benchmark). Post-training quantisation, which applies quantisation to an already-trained model, is the simplest approach; quantisation-aware training, where the model trains with quantisation constraints in the loop, preserves accuracy better but requires access to the training pipeline. TensorRT, OpenVINO, and TFLite all support quantisation workflows for their respective hardware targets.
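To make the mechanics concrete, here is a minimal sketch of symmetric per-tensor INT8 quantisation in plain Python. It illustrates the arithmetic only and is an assumption about the simplest possible scheme; production toolchains typically add calibration data, per-channel scales, and activation quantisation, none of which is shown here.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float values from the integer representation."""
    return [v * scale for v in quantised]


weights = [0.42, -1.27, 0.003, 0.88, -0.5]  # toy values, not from a real model
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Note how the smallest weight collapses to zero: values well below one quantisation step are where accuracy loss tends to come from, which is the gap quantisation-aware training tries to close.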

Model architecture selection. Not all architectures quantise equally well, and not all are designed for edge deployment. MobileNet, EfficientNet-Lite, and YOLO-NAS are architectures explicitly designed for resource-constrained inference — in our experience across edge-CV engagements, they achieve competitive accuracy with 5–20× fewer parameters than their full-scale equivalents (an observed range, not a benchmarked industry rate). Choosing an edge-optimised architecture from the start avoids the lossy compression of shrinking a large model to fit a small device.

Knowledge distillation trains a small “student” model to reproduce the outputs of a large “teacher” model. The student inherits the teacher’s learned representations at a fraction of the parameter count and computational cost. This approach is particularly effective when the large model achieves the accuracy the edge application needs, but the architecture is too large for edge deployment — distillation transfers the accuracy into a deployable form factor.
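One common way to express "reproduce the outputs of the teacher" is a divergence between temperature-softened output distributions. The sketch below shows that loss in plain Python for a single example; the temperature value and the toy logits are assumptions, and in practice this term is usually combined with an ordinary hard-label loss inside a training framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [8.0, 2.0, -1.0]        # toy logits for a 3-class problem
close_student = [7.5, 2.2, -0.8]  # student that roughly agrees
far_student = [0.0, 5.0, 1.0]     # student that disagrees

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A higher temperature flattens both distributions, so the student is also trained on the teacher's relative ranking of the wrong classes rather than only its top prediction.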

We regularly apply these techniques in combination: selecting an edge-optimised architecture, training with quantisation awareness, and distilling from a larger model when the accuracy-size trade-off requires it. The specific combination depends on the edge hardware target and the application’s latency and accuracy requirements. There is no universal recipe — there is a recipe per (model, hardware, accuracy budget) triple.

Hardware selection: matching compute to workload

Edge AI hardware spans a wide range of capability, power consumption, and cost. The selection criteria are workload-specific, not vendor-specific.

NVIDIA Jetson family (Orin Nano, Orin NX, AGX Orin) provides CUDA-compatible GPU compute at the edge, supporting the full NVIDIA inference stack (TensorRT, DeepStream, cuDNN). The Jetson platform is the most capable edge AI hardware widely available, with the AGX Orin delivering up to 275 TOPS of AI compute per NVIDIA’s published specifications. The trade-off is power consumption (roughly 7–60 W depending on the module and power mode, with the 15–60 W range applying to the AGX Orin) and cost (roughly £200–£1,500 per module, observed market range). For applications that require high throughput (multiple camera streams, high-resolution processing, complex models), Jetson is typically the right choice.

Google Coral (USB Accelerator, Dev Board, M.2 module) provides a dedicated Edge TPU that accelerates TFLite models at very low power — on the order of 2–4 W per the Coral datasheet. The performance ceiling is lower than Jetson, and the Edge TPU supports a specific set of operations optimised for MobileNet-class models. The power and cost profile (roughly £50–£150 per unit, observed market range) makes Coral suitable for high-volume deployments where per-unit cost matters more than peak throughput.

Qualcomm and MediaTek AI SoCs integrate neural processing units into mobile and IoT system-on-chip designs. These are the foundation of AI capability in smartphones, smart cameras, and consumer IoT devices. The advantage is integration density and power efficiency; the constraint is software ecosystem maturity and model compatibility. Qualcomm’s SNPE and MediaTek’s NeuroPilot are improving, but the developer experience is still rougher than the NVIDIA stack.

The GPU performance considerations that apply to cloud inference also apply at the edge, with the additional constraint that edge devices do not have the thermal headroom or memory bandwidth of data centre hardware. Memory bandwidth, in particular, is often the binding constraint on edge devices: a model that is compute-bound on cloud hardware can become memory-bandwidth-bound on edge hardware, which calls for different inference latency optimisation strategies. Profiling on the actual target hardware, not extrapolating from cloud profiling, is the only reliable way to find out.
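A rough roofline-style estimate shows how the same model can change from compute-bound to memory-bandwidth-bound. The sketch below is a back-of-envelope aid, not a substitute for profiling; every hardware and model number in it is hypothetical and chosen only to illustrate the shift.

```python
def bound_type(flops_per_inference, bytes_moved_per_inference,
               peak_flops_per_s, peak_bytes_per_s):
    """Classify a workload by comparing ideal compute time and memory time."""
    compute_time = flops_per_inference / peak_flops_per_s
    memory_time = bytes_moved_per_inference / peak_bytes_per_s
    bound = "compute" if compute_time >= memory_time else "memory-bandwidth"
    # The slower of the two is a lower bound on inference time.
    return bound, max(compute_time, memory_time)


# Hypothetical model: 10 GFLOPs and 200 MB of memory traffic per inference.
MODEL_FLOPS = 10e9
MODEL_BYTES = 200e6

# Hypothetical hardware peaks (not vendor specifications).
cloud = bound_type(MODEL_FLOPS, MODEL_BYTES, peak_flops_per_s=20e12, peak_bytes_per_s=1.5e12)
edge = bound_type(MODEL_FLOPS, MODEL_BYTES, peak_flops_per_s=5e12, peak_bytes_per_s=50e9)

print("cloud:", cloud[0], f"{cloud[1] * 1e3:.2f} ms lower bound")
print("edge: ", edge[0], f"{edge[1] * 1e3:.2f} ms lower bound")
```

When the estimate says memory-bandwidth-bound, reducing bytes moved (lower precision, smaller activations, operator fusion) tends to help more than reducing arithmetic.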

Deployment pipeline considerations

Deploying a model to an edge device is not the same as deploying a model to a cloud server. The operational constraints are different, and the deployment pipeline must account for them.

Over-the-air model updates. Edge devices in the field need to receive model updates without physical access. This requires an update mechanism that downloads the new model, validates it (checksum, inference test on reference data), and swaps it atomically, so that a failed update does not leave the device without a functioning model. Update bandwidth is also constrained. As a planning heuristic from our edge engagements (not a benchmarked industry rate), a 50 MB quantised model over a cellular connection is feasible; a 500 MB full-precision model is not.
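The download, validate, swap sequence can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the file names are hypothetical, `smoke_test` stands in for an inference test on reference data, and the atomicity of `os.replace` holds only when both paths are on the same filesystem.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large models do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_model(downloaded_path, expected_sha256, live_path, smoke_test):
    """Validate a downloaded model, then swap it into place atomically.

    Returns True if the new model is live, False if it was rejected.
    The live model is never touched unless both checks pass.
    """
    if sha256_of(downloaded_path) != expected_sha256 or not smoke_test(downloaded_path):
        os.remove(downloaded_path)
        return False
    os.replace(downloaded_path, live_path)  # atomic on the same filesystem
    return True


# Demonstration with stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    live = os.path.join(workdir, "model.bin")
    candidate = os.path.join(workdir, "model.bin.new")
    with open(live, "wb") as f:
        f.write(b"old-model")
    with open(candidate, "wb") as f:
        f.write(b"new-model")
    good_hash = hashlib.sha256(b"new-model").hexdigest()

    accepted = install_model(candidate, good_hash, live, smoke_test=lambda p: True)

    # A corrupted download must leave the live model untouched.
    with open(candidate, "wb") as f:
        f.write(b"corrupted")
    rejected = install_model(candidate, good_hash, live, smoke_test=lambda p: True)
    with open(live, "rb") as f:
        live_after_bad = f.read()

print(accepted, rejected, live_after_bad)
```

Keeping the previous model file under a separate name before the swap would additionally allow rollback if the new model misbehaves after going live; that step is omitted here for brevity.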

Fallback and degradation handling. What happens when the model fails to load, the inference engine crashes, or the device runs out of memory? Cloud deployments handle this with redundancy — another instance picks up the load. Edge deployments must handle it locally: a fallback model (simpler, smaller, less accurate but always available), a degraded-mode protocol (pass through without inference, alert the monitoring system), or a restart-and-recover process that restores the device to a known-good state. Pick the failure mode you can tolerate before the device is in the field, not after.
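The local failure-handling options above can be expressed as an ordered chain: try the primary model, then the fallback model, then drop to pass-through. A minimal sketch follows; the loader functions are hypothetical stand-ins for a real inference runtime, and the set of caught exception types is an assumption to adapt to the runtime in use.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order and return the first success.

    Returns (mode, model, failures). If every loader fails, mode is
    "pass-through" and model is None, which the caller treats as degraded mode.
    """
    failures = []
    for name, loader in loaders:
        try:
            model = loader()
        except (OSError, MemoryError, RuntimeError) as exc:
            failures.append((name, type(exc).__name__))
            continue
        return name, model, failures
    return "pass-through", None, failures


# Hypothetical loaders standing in for a real inference runtime.
def load_primary():
    raise MemoryError("primary model does not fit in available RAM")


def load_fallback():
    return lambda frame: "fallback-prediction"


mode, model, failures = load_with_fallback(
    [("primary", load_primary), ("fallback", load_fallback)]
)
print(mode, failures)
```

The returned `failures` list is what the device would report to the monitoring system, so that running on the fallback model is visible rather than silent.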

Monitoring and telemetry. Edge devices produce monitoring data — inference latency, prediction distributions, error counts, device temperature, memory utilisation — that must be transmitted to a central monitoring system. The telemetry pipeline has to be lightweight (the device’s compute budget is consumed by inference, not monitoring), resilient to connectivity interruptions (buffer and forward when the connection is restored), and structured for anomaly detection so that a device behaving differently from its peers is flagged automatically.
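Two of the requirements above, buffer-and-forward and peer comparison, can be sketched briefly. The class and function names, the buffer size, the device names, and the outlier threshold are all illustrative assumptions; the outlier check uses median absolute deviation, which is one simple choice among many.

```python
import statistics
from collections import deque


class TelemetryBuffer:
    """Bounded store-and-forward buffer; the oldest records are dropped when full."""

    def __init__(self, max_records=1000):
        self._records = deque(maxlen=max_records)

    def record(self, sample):
        self._records.append(sample)

    def flush(self, send):
        """Send records in order; stop at the first failure and keep the rest."""
        sent = 0
        while self._records:
            if not send(self._records[0]):
                break
            self._records.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._records)


def flag_outliers(metric_by_device, threshold=3.0):
    """Flag devices whose metric is far from the fleet median (MAD-based)."""
    values = list(metric_by_device.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return [d for d, v in metric_by_device.items() if v != median]
    return [d for d, v in metric_by_device.items()
            if abs(v - median) / mad > threshold]


buffer = TelemetryBuffer(max_records=3)
for latency_ms in (21.0, 22.5, 23.1, 58.0):
    buffer.record({"latency_ms": latency_ms})  # the first sample is evicted

sent_while_offline = buffer.flush(send=lambda record: False)
delivered = []
sent_when_online = buffer.flush(send=lambda record: delivered.append(record) or True)

fleet_latency_ms = {"cam-01": 21.0, "cam-02": 23.0, "cam-03": 22.0,
                    "cam-04": 58.0, "cam-05": 20.0}
outliers = flag_outliers(fleet_latency_ms)
print(sent_while_offline, sent_when_online, outliers)
```

The bounded buffer is a deliberate trade: during a long outage the device loses its oldest telemetry rather than exhausting memory that inference needs.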

Edge CV deployment: pilot vs scale-out checklist

Dimension	Pilot (1–5 devices)	Scale-out (tens to hundreds of devices)
Hardware selection	Single platform — e.g. Jetson Orin NX for flexibility during model iteration	Mixed fleet matched to workload: Jetson for multi-stream sites, Coral or Qualcomm SoCs where per-unit cost and power (2–4 W) dominate
Model optimisation level	Post-training INT8 quantisation via TensorRT or TFLite; edge-optimised architecture (MobileNet, EfficientNet-Lite, YOLO-NAS)	Quantisation-aware training + knowledge distillation; per-hardware model variants compiled against each target’s inference engine
Monitoring infrastructure	Basic telemetry — inference latency, error counts, device temperature — forwarded to a central dashboard on reconnect	Full anomaly-detection pipeline: buffered telemetry with store-and-forward, peer-comparison alerts, prediction-distribution drift detection
Update mechanism	Manual model push or scripted SCP/SSH; validate with checksum and reference-data inference test	OTA update service with atomic model swap, automatic rollback on validation failure, bandwidth-aware scheduling (≤ 50 MB quantised payloads over cellular)
Redundancy	Restart-and-recover process that restores a known-good model state; no hardware redundancy	Fallback model per device (simpler, always loadable), degraded-mode protocol (pass-through + alert), spare-device pool for field swap
Validation scope	Accuracy check on a reference dataset before deployment; manual review of edge-case predictions	Full accuracy regression against production distribution, automated A/B comparison between old and new model, per-site acceptance gates before fleet-wide rollout

When edge deployment is the right architecture

In our experience, edge deployment is justified when one or more of these conditions hold: the latency requirement is below what cloud inference can reliably deliver, the bandwidth cost of transmitting images to the cloud exceeds the cost of edge compute, the system must operate during network outages, or data privacy requirements prohibit transmitting images off-premises.

When none of these conditions hold, cloud inference is usually simpler, more flexible, and easier to maintain. The edge-vs-cloud question is an architecture decision, not a technology preference — and getting it wrong in either direction has cost and capability consequences. Hybrid topologies (edge inference for the latency-critical path, cloud for heavier analytics and model retraining) are increasingly the default for systems that need both responsiveness and centralised learning.

FAQ

How do I deploy computer vision models on edge devices reliably?

Reliable edge CV deployment depends on three things working together: a model sized and quantised for the chosen hardware, a deployment pipeline that can update and validate models without physical device access, and a failure-handling strategy that keeps the device functional when the model or inference engine misbehaves. Pick the hardware first, optimise the model against it, and design OTA updates plus a fallback model into the system from the start — retrofitting them later is significantly harder.

What is the latency / accuracy / power trade-off for edge CV, and how do I navigate it?

The three axes are linked: lower latency usually means a smaller or more quantised model, which costs accuracy; higher accuracy means a larger model, which costs latency and power. Navigation starts by fixing the constraint the application cannot move — typically latency for real-time control loops, accuracy for inspection — and then trading off the other two. Profile on the actual target hardware, because memory bandwidth on edge devices often binds before compute does, and that reshapes the envelope.

Jetson Nano vs Intel Neural Compute Stick vs Coral — which edge target fits my constraints?

Jetson (Nano, Orin Nano, Orin NX, AGX Orin) is the right choice when you need GPU-class throughput, multi-stream processing, or compatibility with the full NVIDIA inference stack (TensorRT, DeepStream). Google Coral fits high-volume, low-power deployments where MobileNet-class TFLite models are sufficient and per-unit cost matters. Intel Neural Compute Stick is a USB-attached accelerator useful for prototyping on x86 hosts but is not typically the right scale-out target. The decision usually collapses to: throughput-per-watt, software ecosystem fit, and per-unit cost at fleet scale.

What does edge inference cost compared to cloud inference for a video-analytics workload?

Edge inference shifts cost from recurring cloud compute and bandwidth to one-time hardware and integration costs. For a multi-camera video-analytics workload, the cloud bill grows roughly linearly with stream count and resolution (egress bandwidth often dominates), whereas the edge cost is bounded by the device fleet. The break-even point depends on stream count, frame rate, and how much of the data the cloud would have to receive; in our engagements (observed pattern, not a benchmarked industry rate), workloads that stream raw video continuously almost always favour edge for the steady-state cost, while bursty or low-volume workloads often favour cloud.
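The break-even reasoning reduces to simple arithmetic once the inputs are known. The sketch below shows the shape of the calculation; every cost figure in it is a hypothetical placeholder, not a price quote or an observed rate, and real comparisons should also account for maintenance, replacement, and staffing.

```python
def months_to_break_even(edge_upfront, edge_monthly, cloud_monthly):
    """Months until cumulative edge cost drops below cumulative cloud cost.

    Returns None when the cloud is cheaper per month, in which case the
    upfront edge spend is never recovered.
    """
    monthly_saving = cloud_monthly - edge_monthly
    if monthly_saving <= 0:
        return None
    return edge_upfront / monthly_saving


# Hypothetical continuous-streaming site: 8 cameras.
streams = 8
cloud_cost_per_stream = 120.0  # hypothetical monthly compute + bandwidth per stream
continuous = months_to_break_even(
    edge_upfront=6000.0,       # hypothetical hardware + integration
    edge_monthly=60.0,         # hypothetical power + connectivity
    cloud_monthly=streams * cloud_cost_per_stream,
)

# Hypothetical bursty site: the cloud bill is small because little data is sent.
bursty = months_to_break_even(edge_upfront=6000.0, edge_monthly=60.0, cloud_monthly=40.0)

print(continuous, bursty)
```

Because the cloud term scales with stream count while the edge upfront cost scales with device count, adding cameras per device shortens the break-even period.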

How do I size models so they hit latency targets on the chosen edge hardware?

Start from the latency budget and work backwards: subtract the camera capture time, the pre- and post-processing overhead, and the result-transmission overhead from the end-to-end target — the residual is the inference budget. Profile candidate models on the actual target hardware (not on a workstation GPU) using the target’s inference runtime (TensorRT for Jetson, the Edge TPU compiler for Coral, OpenVINO for Intel). If the smallest acceptable-accuracy model misses the budget, the options are quantisation, distillation, architecture change, or stronger hardware — in roughly that order of effort.
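Working backwards from the budget can be written down directly. In the sketch below, the stage timings, model names, accuracies, and latencies are hypothetical; the latency values are meant to represent measurements taken on the target device, and using a tail percentile rather than the mean is an assumption about what "hitting the target" should mean.

```python
def inference_budget_ms(end_to_end_ms, capture_ms, preprocess_ms,
                        postprocess_ms, transmit_ms):
    """Residual time left for inference after fixed pipeline stages."""
    return end_to_end_ms - (capture_ms + preprocess_ms + postprocess_ms + transmit_ms)


def pick_model(candidates, budget_ms, min_accuracy):
    """Return the most accurate candidate that fits the budget, or None."""
    viable = [c for c in candidates
              if c["p95_latency_ms"] <= budget_ms and c["accuracy"] >= min_accuracy]
    return max(viable, key=lambda c: c["accuracy"], default=None)


budget = inference_budget_ms(end_to_end_ms=50, capture_ms=8, preprocess_ms=5,
                             postprocess_ms=3, transmit_ms=4)

# Hypothetical on-device profiling results.
candidates = [
    {"name": "model-large", "accuracy": 0.94, "p95_latency_ms": 41.0},
    {"name": "model-medium", "accuracy": 0.92, "p95_latency_ms": 27.0},
    {"name": "model-small", "accuracy": 0.88, "p95_latency_ms": 12.0},
]

chosen = pick_model(candidates, budget, min_accuracy=0.90)
print(budget, chosen["name"] if chosen else "no model fits")
```

A `None` result is the signal to move down the list of options in the text: quantise, distil, change architecture, or choose stronger hardware.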

Which architectural patterns (on-device-only, hybrid, cloud-fallback) survive real-world deployment?

On-device-only survives when the device must operate disconnected and the model is small enough to fit comfortably. Hybrid (edge inference on the hot path, cloud for retraining, drift detection, and heavier secondary analytics) survives in most production systems we see — it gives the latency benefits of edge with the operational benefits of centralised model management. Cloud-fallback (edge first, fail over to cloud) sounds attractive but tends to be fragile in practice, because the fallback path is exercised rarely and silently rots; if you adopt it, instrument it heavily and test the fallback regularly.

A Production CV Readiness Assessment includes edge-specific hardware, model optimisation, and deployment architecture analysis for teams making this decision.