Edge AI means running inference on hardware that is physically located at the data source, whether in a vehicle, on a factory floor, or in a camera housing. The constraints are fundamentally different from cloud inference: power envelopes measured in watts rather than kilowatts, no guaranteed network connectivity, harsh operating environments, and often safety-critical reliability requirements. The model and deployment choices that work in a cloud data centre frequently fail at the edge.

This article walks through how those constraints shape three concrete deployment classes (autonomous vehicles, industrial inspection, smart cameras) and what the compression and connectivity decisions actually look like once you stop treating the edge as “cloud, but smaller.”

For the underlying compression question (distillation versus quantisation across multiple runtimes), we cover the full comparison in Distillation vs Quantisation for Multi-Platform Edge Inference. This piece is the application-side companion.

What changes when inference moves to the edge?

Three things change at once: the power budget collapses, the latency budget tightens, and the connectivity assumption stops being free. Each one independently invalidates a default that holds in the data centre.

Power matters because the thermal envelope is fixed by the enclosure. A Jetson Orin module dissipates 15–60 W; a vehicle compute platform runs 50–200 W under sustained load. There is no headroom for a model that “almost fits”: the cooling cannot keep up, and throttling shows up as missed frames rather than slower frames.

Latency matters because the action the inference drives is physical. A defect missed at line speed is shipped; a pedestrian detected 30 ms late is a different incident. Cloud inference can amortise tail latency across many users; edge inference cannot.

Connectivity matters because the deployment outlasts any single network assumption.
A camera installed today will see four firmware-update cycles, two network migrations, and one outage long enough to matter. The architecture has to survive all of them.

Autonomous vehicles

Autonomous driving requires multiple concurrent perception tasks: object detection, lane segmentation, depth estimation, and sensor fusion across cameras, lidar, and radar. The compute budget is fixed by the vehicle’s thermal design and power supply, typically 50–200 W for the compute platform. Latency requirements are strict: detection-to-actuation latency above roughly 50 ms degrades safety margins in a way that downstream planners cannot recover.

Key characteristics:

- Batch size is typically 1 (one vehicle, processing sensor streams in real time), so kernels that amortise launch overhead across a batch do not help.
- Multiple models run concurrently on the same hardware, contending for memory bandwidth and on-chip caches.
- Functional safety requirements (ISO 26262) constrain acceptable failure modes: non-deterministic kernels and unverified runtimes are off the table.
- NVIDIA Orin (up to 275 TOPS) and NVIDIA Drive Thor (1,000+ TOPS) are purpose-built platforms; published TOPS figures are an upper bound, not a working assumption.

Industrial inspection

Visual inspection on manufacturing lines (detecting surface defects, out-of-tolerance dimensions, and assembly errors) has latency requirements set by line speed. A product moving at one metre per second past a camera must be inspected within 20–50 ms, a window in which it travels only 20–50 mm. False negatives (missed defects) have direct quality cost; false positives cause line stoppages and operator fatigue. Both error modes are expensive, and the acceptable ratio between them is a business decision, not a model-training decision.

Key characteristics:

- Well-defined, constrained problem space (known product types, controlled lighting), which means models can be smaller and more specialised than general-purpose detectors.
- Hardware can be application-specific: Jetson Orin NX, Hailo-8, or Intel Movidius, each with its own quantisation toolchain and runtime quirks.
- Inference hardware may need to run unattended for years without maintenance, so model-update mechanics need to be designed up front, not bolted on.

Smart cameras

Smart cameras embed inference capability directly in the camera housing. They range from entry-level designs (ARM Cortex-A plus a small NPU, roughly 1–4 TOPS) to capable edge nodes (Jetson Orin Nano, around 40 TOPS). Applications include people counting, crowd density analysis, licence-plate recognition, and real-time alerting without round-tripping to a cloud server. The defining constraint here is not raw compute. It is the combination of unit cost, power-over-Ethernet budget, and the fact that the housing has to dissipate everything passively.

How do compression methods compare for edge deployment?

There is no free lunch in edge deployment. Every optimisation that reduces model size or inference time has an accuracy cost that must be evaluated against the specific task. The table below is the observed pattern across our engagements. The figures are planning heuristics, not benchmarked rates for any specific model, and the variance per architecture is large enough that running post-training quantisation (PTQ) on your own model is the only honest baseline.

| Optimisation | Typical size reduction | Typical accuracy impact | When applied |
| --- | --- | --- | --- |
| INT8 quantisation | 2–4× smaller | <1% on well-calibrated models (observed pattern) | Post-training or QAT |
| INT4 quantisation | 4–8× smaller | 1–3% typical, varies widely (observed pattern) | Primarily post-training |
| Knowledge distillation | 2–10× smaller | 2–5% typical (observed pattern) | Training-time |
| Structured pruning | 2–4× smaller | 1–3% with careful tuning (observed pattern) | Training-time |
| Architecture selection (MobileNet, EfficientDet) | Baseline smaller | Application-specific | Design choice |

The right answer depends on what “accuracy” means for the specific application.
A 2% mAP reduction in object detection may be tolerable for retail analytics but unacceptable for autonomous-vehicle perception. The same compression operation can be a win on one task and a regression on another, which is why we treat the deployment matrix (model × runtime × target hardware) as the unit of evaluation rather than the model alone.

One structural point worth naming explicitly: quantisation behaviour is runtime-specific. INT8 on CoreML and INT8 on ONNX Runtime are not the same operation, and a model that ships clean on one can show measurable accuracy drift on the other. Distillation does not have this property: a distilled smaller architecture behaves consistently across runtimes. If the deployment targets more than two runtimes, the validation cost of quantisation per platform often dominates the compute savings it promises. The full reasoning sits in Distillation vs Quantisation for Multi-Platform Edge Inference.

Connectivity assumptions

Edge AI deployments should be designed with explicit connectivity assumptions, not optimistic ones. Three patterns recur:

- Fully offline. The model runs entirely on-device, with no network dependency for inference. Model updates require physical access or a local update mechanism. Applies to autonomous vehicles in tunnels, remote industrial sites, and air-gapped facilities.
- Occasionally connected. Inference runs offline; telemetry and model updates happen when connectivity is available. Applies to most industrial IoT, where the data plane and the control plane have different reliability requirements.
- Latency-tolerant cloud assist. Primary inference runs on-device; complex or low-confidence cases escalate to the cloud for secondary analysis. This adds complexity but handles out-of-distribution inputs in a way pure on-device inference cannot.

The worst deployment architecture is one designed assuming connectivity that isn’t reliably available.
The failure mode is not a graceful degradation. It is an outage that propagates into the physical process the inference is driving.

On-device vs edge-server inference

Not all edge AI runs on the sensing device itself. An edge server (a small compute node colocated with sensors in a factory or road-side unit) can aggregate feeds from multiple cameras and run more capable models centrally, while still operating offline from the cloud. This is often the right architectural answer when sensor density is high and the per-sensor inference budget is small.

| Approach | Latency | Cost per node | Model capability |
| --- | --- | --- | --- |
| On-device (camera-embedded) | Lowest | High per unit at scale | Limited (MobileNet-class) |
| Edge server (local aggregation) | Low | Lower at scale | Moderate (EfficientDet-class) |
| Cloud inference | High (network-dependent) | Low CapEx | Unrestricted |

For applications with many sensing points and moderate latency tolerance, edge-server architectures often outperform on-device on both cost and model capability. For applications with strict real-time requirements or unreliable connectivity, on-device is mandatory. The decision is rarely uniform across a site: many production deployments mix both, with the most latency-sensitive inference on-device and the rest pushed to a local aggregator.

How do quantisation and distillation actually combine for edge?

The practical summary, working in order of cost-to-try:

- Use post-training quantisation (PTQ) as the first step. It requires no retraining and often achieves INT8 with minimal accuracy loss for well-trained models. If PTQ on your specific model meets the accuracy budget, you can stop here.
- Apply quantisation-aware training (QAT) if PTQ accuracy is insufficient. QAT recovers most of the gap but adds a training-time cost and requires access to representative training data.
- Use knowledge distillation when the task has available training data and the accuracy gap from quantisation alone is too large, or when the deployment target spans multiple runtimes where per-runtime quantisation validation is uneconomic.
- Combine both. Distil to a smaller architecture, then quantise the distilled model. This compound approach commonly achieves the smallest deployable model for a given accuracy target. It also confines the runtime-specific validation burden to the final quantisation step, on a model that is already smaller and more predictable.

Deployment checklist for edge AI

- Define the latency budget: what is the maximum acceptable inference time, measured end-to-end, not kernel-only?
- Define the power budget: what sustained TDP is the hardware allowed under the worst expected ambient temperature?
- Define the connectivity model: fully offline, occasionally connected, or latency-tolerant cloud assist?
- Baseline the accuracy requirement: what accuracy metric, at what threshold, counts as acceptable for the actual task rather than a generic benchmark?
- Evaluate PTQ on the selected model before investing in QAT or distillation.
- Profile inference on target hardware, not development hardware. Development-host numbers consistently overstate edge performance.
- Test under operating-condition variation (temperature, lighting, sensor noise) before signing off.
- Define the model-update strategy before deployment, not after.

Closing thoughts

Edge AI deployments succeed when constraints are treated as design inputs from the start, not as optimisation problems solved after the fact. Model size, latency, power, and connectivity requirements must be specified before model selection or training. Quantisation and distillation are tools that trade accuracy for resource efficiency; the right tradeoff is application-specific, and the multi-platform dimension changes which tool is cheaper to live with over the deployment lifetime.
The most common failure mode we encounter in edge AI projects is deploying a cloud-trained model directly to edge hardware and discovering it is too large, too slow, or too power-hungry to meet the requirement that mattered most.
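
As a worked illustration of the cost-to-try ordering and the checklist above, the following is a minimal Python sketch, not a production tool. It assumes each compression attempt has already been measured on the target hardware and simply returns the cheapest attempt that meets both the accuracy and the end-to-end latency budget. The `Trial` type, the `first_acceptable` helper, and every number in `LADDER` are hypothetical placeholders introduced for this example; they are not measurements from any real model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    """One compression attempt, measured on the target hardware (hypothetical type)."""
    name: str
    accuracy: float    # task metric on the real validation set, e.g. mAP in [0, 1]
    latency_ms: float  # end-to-end latency, not kernel-only


def first_acceptable(
    trials: Sequence[Trial], min_accuracy: float, max_latency_ms: float
) -> Optional[Trial]:
    """Return the first trial, in cost-to-try order, that meets both budgets."""
    for trial in trials:
        if trial.accuracy >= min_accuracy and trial.latency_ms <= max_latency_ms:
            return trial
    return None  # nothing fits: revisit the architecture or the budgets


# Ordered cheapest-to-try first. All values are illustrative placeholders,
# NOT benchmark results.
LADDER = [
    Trial("PTQ INT8", accuracy=0.905, latency_ms=18.0),
    Trial("QAT INT8", accuracy=0.921, latency_ms=18.0),
    Trial("distil + INT8", accuracy=0.915, latency_ms=9.0),
]

choice = first_acceptable(LADDER, min_accuracy=0.92, max_latency_ms=20.0)
print(choice.name if choice else "no candidate meets the budget")  # prints "QAT INT8"
```

With these placeholder values PTQ misses the accuracy budget, so the search moves one rung up and stops at QAT. Tightening the latency budget to 10 ms while relaxing accuracy to 0.90 would select the distilled-then-quantised model instead, which is the sense in which the budgets, rather than the model alone, drive the choice.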