Augmented reality is often described as a graphics problem. In production, it is mostly a computer vision problem with a graphics layer bolted on top. The hard parts (knowing where the camera is, knowing what is in front of it, anchoring digital content to a surface so that it does not slide around) are all CV problems, and they sit underneath every consumer AR app, every industrial headset, and every retail try-on experience that actually ships.

We work on these systems often enough to have a fairly fixed view of where the budget goes. The visible part (3D rendering, shaders, occlusion) gets the demo applause. The invisible part (tracking, plane detection, latency control) decides whether the experience holds together after the first 30 seconds.

What computer vision actually does inside an AR pipeline

A typical AR runtime breaks into four CV layers, each with a different tolerance for error and a different latency contract.

- Camera pose tracking: usually some flavour of visual-inertial SLAM (simultaneous localisation and mapping). The system fuses camera frames with IMU readings to estimate where the device is in space, 60 to 120 times per second. ARKit and ARCore wrap this as a platform primitive; on headsets, Meta and Apple ship their own stacks. If pose drifts, virtual objects appear to swim. Drift under 1 cm per minute of use is the rough bar for a stable consumer experience.
- Scene geometry: plane detection, depth estimation, and (increasingly) mesh reconstruction. Phone-class AR detects floors and tables; headset-class AR reconstructs a full room mesh. This layer is what lets a virtual chair sit on a real floor instead of floating six inches above it.
- Semantic understanding: object detection, segmentation, OCR, hand and face tracking. This layer answers “what is in the scene?” rather than “where is the scene?”. It runs slower than tracking (often 10–30 Hz instead of 60–120 Hz) because applications can tolerate more latency here.
- Content anchoring: the bridge layer that takes geometry plus semantics and tells the renderer where each piece of digital content should sit, frame by frame.

The four layers have different failure modes. Pose tracking failures break the whole experience; semantic failures degrade it gracefully. We design pipelines accordingly: redundant sensors and aggressive optimisation around tracking, lighter-touch engineering around semantics.

The models people actually deploy in 2026

The model zoo has narrowed a lot in the last 18 months. Most teams converge on a similar shortlist:

| Pipeline stage | Production-grade options (2026) |
| --- | --- |
| Visual-inertial SLAM | ORB-SLAM3, DROID-SLAM, platform SLAM in ARKit/ARCore/visionOS |
| Depth estimation | DepthAnything-class monocular models, stereo on headsets |
| Real-time detection | YOLO11 / RT-DETR, MediaPipe lightweight detectors |
| Segmentation | MobileSAM, EfficientSAM, FastSAM |
| Hand / face / body tracking | MediaPipe Holistic, MMPose, platform body-tracking APIs |
| 3D content / relighting | Gaussian Splatting, NeRF derivatives (now production-viable) |
| OCR for text-in-scene | PaddleOCR, MMOCR, platform Live Text APIs |

The interesting shift since 2024 is Gaussian Splatting moving from research to production. We are seeing it used for environment capture and relighting in commercial AR work, not because it produces better geometry than mesh-based reconstruction, but because it renders faster on the same hardware budget. That kind of latency-driven model choice is the norm rather than the exception in AR engineering.

The latency budget is the real constraint

The number that decides everything in AR is the motion-to-photon budget. From the moment the user moves their head until the rendered frame reflects that motion, the system has roughly 20 milliseconds. Past 20 ms, the discrepancy between proprioception and vision starts to register; past 50 ms, users feel motion sick.
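The thresholds above can be sketched as a small check. This is a minimal illustration under stated assumptions, not a measurement tool: the function names `classify_motion_to_photon` and `frame_period_ms` and the band labels are hypothetical, and the 20 ms and 50 ms cut-offs are the rough figures quoted in the text rather than hard perceptual limits.

```python
# Rough comfort bands for motion-to-photon latency, taken from the figures
# quoted in the text. The names and labels here are illustrative assumptions.
COMFORT_LIMIT_MS = 20.0   # beyond this, the mismatch starts to register
SICKNESS_LIMIT_MS = 50.0  # beyond this, users tend to feel motion sick


def classify_motion_to_photon(latency_ms: float) -> str:
    """Map a measured motion-to-photon latency to a rough comfort band."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms <= COMFORT_LIMIT_MS:
        return "within budget"
    if latency_ms <= SICKNESS_LIMIT_MS:
        return "noticeable"
    return "likely sickness"


def frame_period_ms(rate_hz: float) -> float:
    """Time available per update at a given rate, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1000.0 / rate_hz


if __name__ == "__main__":
    for latency in (12.0, 35.0, 80.0):
        print(f"{latency:5.1f} ms -> {classify_motion_to_photon(latency)}")
    # Tracking at 60-120 Hz leaves roughly 16.7 ms down to 8.3 ms per update.
    for rate in (60, 120):
        print(f"{rate} Hz -> {frame_period_ms(rate):.1f} ms per update")
```

Keeping the cut-offs as named constants makes it easy to tighten them for a specific headset or after a user study, without touching the classification logic.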
The 20 ms target is an observed pattern across consumer AR hardware: Meta, Apple, and Microsoft all target the same window because the human perceptual constraint does not change with the hardware.

That budget has to cover sensor capture, CV inference, application logic, rendering, and display scan-out. The CV stages typically get 4–8 ms of it. That is why tracking has to run on-device, why depth models are quantised aggressively, and why “just run it on the cloud” is not a serious answer for anything in the tracking path.

What can live in the cloud, and what cannot

The hybrid pattern that has settled out in 2026 looks like this:

- On-device, always: pose tracking, plane detection, hand and head tracking. Anything in the motion-to-photon loop.
- On-device or edge GPU: real-time object detection, segmentation, OCR. Tolerant to ~50–100 ms latency.
- Cloud or edge GPU: scene understanding, generative content creation, large-vocabulary recognition. Tolerant to seconds.

The hybrid split moves with the hardware. Snapdragon XR2 Gen 2 (Meta Quest 3/3S) and Apple’s M2+R1 combination (Vision Pro) handle more on-device than the previous generation could, which has pulled object detection and segmentation back onto the device for most workloads. The cloud lane mostly handles content generation and heavyweight scene understanding now.

Where AR engineering goes wrong in practice

Most AR projects that miss their targets miss them for one of three reasons.

Treating tracking as solved. ARKit and ARCore are excellent on flagship devices in well-lit indoor environments with texture-rich surfaces. They degrade hard in low light, on featureless surfaces (white walls), or with rapid motion. Teams that demo on an iPhone Pro in an office often discover at deployment that their actual users are in a warehouse on a four-year-old Android. The CV pipeline needs to be validated on representative hardware in representative environments, not on the developer’s desk.

Underestimating the semantic layer’s cost.
Plane detection is cheap. Recognising the specific product on a shelf is expensive, and doing it at 30 Hz on a phone budget is harder than it looks. We see teams design a feature around “the system will recognise X” without scoping what X looks like in the data they actually have.

Mistaking demo accuracy for production accuracy. A face filter that works on 50 testers in good lighting will not necessarily work on 50,000 users across skin tones, lighting, occlusion, and motion. Bias and edge-case failures show up only at scale, and they show up after launch unless the evaluation set was built to surface them.

For a deeper architectural walkthrough of the related face-recognition pipeline, which sits inside many AR experiences as the body-anchored tracking layer, see Facial Recognition in Computer Vision: How the Pipeline Actually Works. For broader programme context, our Computer Vision R&D practice covers the full deployment lifecycle.

What “good” AR engineering looks like

A useful diagnostic: when an AR team scopes a new build, can they tell you which CV stage will fail first under stress, and what the fallback is? Good teams can. They know that under motion blur, tracking degrades before detection; under low light, depth degrades before tracking; under network loss, the cloud-perception lane fails and the on-device lane has to carry the experience alone.

The same teams measure motion-to-photon latency on real devices rather than relying on internal estimates. They have a validated test set that reflects deployment conditions, not curated demo footage. They treat the renderer and the CV stack as one performance budget, not two.

That discipline is what separates AR work that survives launch from AR work that gets quietly pulled six months later. The graphics layer gets the applause; the CV layer gets the post-launch incident reports.

Frequently asked questions

How does the facial recognition pipeline decompose into detection, alignment, embedding, and matching?
Facial recognition inside an AR pipeline runs as four sequential stages. Detection finds the face region in the frame (MediaPipe or a YOLO-class detector). Alignment normalises the face crop using landmark points so the embedding model sees a consistent geometry. Embedding runs a deep network (ArcFace, MobileFaceNet) to produce a fixed-length vector. Matching compares that vector against a gallery using cosine similarity at a chosen threshold. Each stage has its own failure mode, which is the main reason “facial recognition accuracy” is not a single number.

Why is MTCNN typically preferred over Haar cascades in modern face detection, and where does that trade-off flip?
MTCNN (and its modern replacements like MediaPipe Face Detector and YOLO-Face variants) handles pose variation, occlusion, and varied lighting far better than Haar cascades, because it is a learned CNN detector rather than a cascade built on hand-engineered features. The trade-off flips on extremely constrained hardware (old microcontrollers, very low-power edge devices), where Haar’s tiny memory footprint and CPU-only execution still matter. In any AR scenario with a smartphone-class or better SoC, the learned detector wins.

Where does facial recognition sit in the broader CV pipeline (image recognition, pattern recognition, deep learning)?
Facial recognition is a specialised application of the general image-recognition stack. Image recognition classifies whole images; pattern recognition is the wider statistical lineage (going back to eigenfaces); deep learning is the modern implementation substrate. Facial recognition inherits all three: it uses deep learning models (CNNs and transformers) to perform pattern recognition on face regions identified by image-recognition techniques. In AR specifically, it sits in the semantic-understanding layer alongside hand and body tracking.

What are the realistic accuracy and bias limits of production facial recognition in 2026 deployments?
State-of-the-art models on NIST FRVT benchmarks (the programme was renamed FRTE in 2023) report false-match rates below 1 in 100,000 with high-quality enrolment images. Production deployments rarely see those conditions. Realistic deployed accuracy depends on gallery quality, lighting, pose, and demographic coverage of the training data. Bias against under-represented demographic groups remains a measured gap in 2026 evaluations, and the EU AI Act classifies remote biometric identification systems as high-risk and prohibits most real-time use in publicly accessible spaces for law enforcement, which forces explicit documentation of these limits.

Which CV algorithms (eigenfaces, deep embeddings, transformers) are still relevant for face recognition, and which are obsolete?
Eigenfaces and Fisherfaces are historical reference points, not deployment options. Deep CNN embeddings (FaceNet, ArcFace, CosFace) are the production default. Vision transformers and hybrid CNN-transformer architectures are increasingly competitive and dominate the FRVT leaderboard in 2026. For AR-class workloads where the embedding must run on-device, MobileFaceNet-class CNNs still hold a latency advantage; transformer variants take over once the budget allows.

How does facial recognition deployment differ across cloud, on-device, and edge inference settings?
Cloud deployment gives you the largest models and the easiest gallery management, but adds 100–500 ms of latency and creates a regulatory surface for biometric data in transit. On-device deployment (phone, headset) is what AR demands for body-anchored experiences, so the embedding model has to be quantised and small (under ~10 MB is typical). Edge deployment (camera-attached inference) sits between the two and is the dominant pattern for fixed-installation surveillance, which is a different problem space from AR but uses many of the same models.

How TechnoLynx can help

We build computer vision systems for AR deployments that have to work in real conditions, not just on a developer’s bench.
Our work covers the full stack: tracking and SLAM tuning for specific hardware classes, semantic-layer model selection and quantisation, motion-to-photon latency measurement and budget engineering, and the validation infrastructure that surfaces edge-case failures before launch rather than after.

If you are scoping an AR build and the CV stack is the part you are least sure about, talk to us. We will tell you which stage is most likely to break in your specific deployment, and what it will take to make it not break.

Image credits: Freepik