Edge AI means running inference on hardware that is physically located at the data source — in a vehicle, on a factory floor, in a camera housing. The constraints are fundamentally different from cloud inference: power envelopes measured in watts rather than kilowatts, no guaranteed network connectivity, harsh operating environments, and often safety-critical reliability requirements. The model and deployment choices that work in a cloud data center frequently fail at the edge.

## Autonomous Vehicles

Autonomous driving requires multiple concurrent perception tasks: object detection, lane segmentation, depth estimation, and sensor fusion across cameras, lidar, and radar. The compute budget is fixed by the vehicle’s thermal design and power supply — typically 50–200W for the compute platform. Latency requirements are strict: detection-to-actuation latency above ~50ms degrades safety margins.

Key characteristics:

- Batch size is typically 1 (one vehicle, processing sensor streams in real time)
- Multiple models run concurrently on the same hardware
- Functional safety requirements (ISO 26262) constrain acceptable failure modes
- NVIDIA Orin (up to 275 TOPS) and NVIDIA Drive Thor (1,000+ TOPS) are purpose-built platforms

## Industrial Inspection

Visual inspection on manufacturing lines — detecting surface defects, dimensional tolerances, assembly errors — has strict latency requirements set by line speed. A product moving at 1 meter/second past a camera must be inspected within 20–50ms. False negative rates (missed defects) have direct quality cost consequences; false positive rates cause line stoppages.

Key characteristics:

- Well-defined, constrained problem space (known product types, controlled lighting)
- Models can be smaller and more specialized than general-purpose detectors
- Hardware can be application-specific: Jetson Orin NX, Hailo-8, or Intel Movidius
- Inference hardware may need to run unattended for years without maintenance

## Smart Cameras

Smart cameras embed inference capability directly in the camera housing. They range from entry-level (ARM Cortex-A + NPU, ~1–4 TOPS) to capable edge nodes (Jetson Orin Nano, ~40 TOPS). Applications include people counting, crowd density analysis, license plate recognition, and real-time alerting without round-tripping to a cloud server.

## Practical comparison

There is no free lunch in edge deployment. Every optimization that reduces model size or inference time has an accuracy cost that must be evaluated:

| Optimization | Typical Size Reduction | Typical Accuracy Impact | Applies at Training? |
|---|---|---|---|
| INT8 quantization | 2–4x smaller | <1% on well-calibrated models | Post-training or QAT |
| INT4 quantization | 4–8x smaller | 1–3% typical, varies widely | Primarily post-training |
| Knowledge distillation | 2–10x smaller | 2–5% typical | Training-time |
| Structured pruning | 2–4x smaller | 1–3% with careful tuning | Training-time |
| Architecture selection (MobileNet, EfficientDet) | Smaller baseline | Application-specific | Design choice |

The right answer depends on what “accuracy” means for the specific application. A 2% mAP reduction in object detection may be tolerable for smart city applications but unacceptable for autonomous vehicle perception.

## Connectivity Assumptions

Edge AI deployments should be designed with explicit connectivity assumptions, not optimistic ones:

- **Fully offline:** Model runs entirely on-device. No network dependency for inference. Model updates require physical access or a local update mechanism. Applies to autonomous vehicles in tunnels, remote industrial sites, and air-gapped facilities.
- **Occasionally connected:** Inference runs offline; telemetry and model updates happen when connectivity is available. Applies to most industrial IoT.
- **Latency-tolerant cloud assist:** Primary inference on-device; complex cases escalate to the cloud for secondary analysis. Adds complexity but handles out-of-distribution inputs.

The worst deployment architecture is one designed assuming connectivity that isn’t reliably available.

## On-Device vs Edge-Server Inference

Not all edge AI runs on the sensing device itself. An edge server — a small compute node colocated with sensors in a factory or roadside unit — can aggregate feeds from multiple cameras and run more capable models centrally, while still operating offline from the cloud.

| Approach | Latency | Cost per Node | Model Capability |
|---|---|---|---|
| On-device (camera-embedded) | Lowest | High per unit at scale | Limited (MobileNet-class) |
| Edge server (local aggregation) | Low | Lower at scale | Moderate (EfficientDet-class) |
| Cloud inference | High (network-dependent) | Low CapEx | Unrestricted |

For applications with many sensing points and moderate latency tolerance, edge-server architectures often outperform on-device on cost and capability. For applications with strict real-time requirements or unreliable connectivity, on-device is mandatory.

## Quantization and Distillation Choices at the Edge

The parent hub article, *Distillation vs Quantisation for Multi-Platform Edge Inference*, covers the full comparison. The practical summary for edge AI:

- Use post-training quantization (PTQ) as the first step — it requires no retraining and often achieves INT8 with minimal accuracy loss for well-trained models (see the first sketch after this list)
- Apply quantization-aware training (QAT) if PTQ accuracy is insufficient for the application requirement
- Use knowledge distillation when the task has available training data and the accuracy gap from quantization alone is too large
- Combine both: distill to a smaller architecture, then quantize the distilled model — this compound approach commonly achieves the smallest deployable model for a given accuracy target (see the second sketch)
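To make the PTQ step concrete, here is a minimal sketch of post-training static INT8 quantization using PyTorch's eager-mode API. The `TinyBackbone` model, the random calibration inputs, and the backend choice are placeholders; substitute the real model and a representative calibration set.

```python
# Minimal sketch of post-training static INT8 quantization with PyTorch's
# eager-mode API. TinyBackbone and the random calibration frames are
# placeholders standing in for the real model and real inputs.
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, get_default_qconfig, prepare, convert,
)

class TinyBackbone(nn.Module):
    """Stand-in for the deployed edge model."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # float -> int8 boundary
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = TinyBackbone().eval()
model.qconfig = get_default_qconfig("fbgemm")  # "qnnpack" for ARM edge targets
prepared = prepare(model)  # inserts observers that record activation ranges

# Calibration: run representative inputs through the prepared model so the
# observers see realistic activation statistics.
with torch.no_grad():
    for _ in range(32):
        prepared(torch.randn(1, 3, 224, 224))  # replace with real frames

quantized = convert(prepared)  # folds observed ranges into INT8 scales
```

The calibration pass is what static PTQ hinges on: the observers inserted by `prepare` record activation ranges, and `convert` folds them into fixed INT8 scales. A calibration set that reflects deployment conditions (lighting, sensor noise) typically matters more than its size.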
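And a sketch of the distillation half of the compound approach, assuming a generic classification-style task. The temperature `T` and weighting `alpha` are common defaults rather than tuned values; once the student converges, the PTQ recipe above applies to it unchanged.

```python
# Sketch of knowledge distillation into a smaller student model.
# `student`, `teacher`, `optimizer`, and the (x, y) batches are
# placeholders for the real models and training data.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend the hard-label loss with a soft-target KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps soft-target gradients on scale
    return alpha * hard + (1 - alpha) * soft

def distill_step(student, teacher, optimizer, x, y):
    """One training step: the teacher runs frozen, the student learns."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    loss = distill_loss(student(x), teacher_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For detection models the soft-target term is typically applied per head (classification logits and box regression separately), which this single-logit sketch omits.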
## Deployment Checklist for Edge AI

- Define the latency budget: what is the maximum acceptable inference time?
- Define the power budget: what TDP is the hardware allowed to dissipate?
- Define the connectivity model: fully offline, occasionally connected, or latency-tolerant cloud assist?
- Baseline the accuracy requirement: which accuracy metric, at what threshold, counts as acceptable?
- Evaluate PTQ on the selected model before investing in QAT or distillation
- Profile inference on target hardware, not development hardware
- Test under operating condition variation (temperature, lighting, sensor noise)
- Define the model update strategy before deployment, not after

## Concluding thoughts

Edge AI deployments succeed when constraints are treated as design inputs from the start, not as optimization problems solved after the fact. Model size, latency, power, and connectivity requirements must be specified before model selection or training. Quantization and distillation are tools that trade accuracy for resource efficiency — the right tradeoff is application-specific. The most common failure mode we encounter in edge AI projects is deploying a cloud-trained model directly to edge hardware and discovering it’s too large, too slow, or too power-hungry to meet requirements.