Most “cloud computer vision” architectures look clean on a slide and messy on the wire. A camera publishes frames, a hosted endpoint runs the model, results come back. In production the split is rarely that tidy: detection often runs on-device, embedding extraction on a regional GPU, gallery matching in a private cluster, and audit logging somewhere else again. The interesting engineering question is not whether to use the cloud; it is which stage of the pipeline lives where, and why.

This piece walks the computer vision workload across cloud, edge, and on-device tiers. We use facial recognition as the worked example because its four-stage pipeline (detection, alignment, embedding, identity match) makes the placement decisions visible. The same logic applies to defect detection, medical imaging, and retail catalog work, just with different stage costs.

What does “cloud computer vision” actually mean?

The phrase covers three architectures that behave very differently in production.

The first is hosted inference: frames travel to a managed endpoint (a CV service on AWS, Azure, or GCP, or a self-hosted Triton instance on a managed Kubernetes cluster), inference runs on an A10 or L4 GPU, and JSON comes back. Round-trip latency typically sits at 80–250 ms depending on region and frame size. That is an observed pattern across our engagements, not a benchmarked rate, and it varies with payload size and TLS overhead more than people expect.

The second is edge inference with cloud orchestration: the model runs on a Jetson, an Intel NUC with an Arc GPU, or a fanless box near the camera. The cloud handles model distribution, telemetry, and gallery updates but never sees raw pixels. This is the default architecture for any deployment that touches GDPR-regulated biometrics.

The third is split inference: detection and a lightweight quality gate run on-device, but heavy embedding extraction and search run cloud-side over a thumbnail crop.
This is the architecture that actually wins for facial recognition at scale, because the bandwidth saving (sending a 112×112 aligned face crop instead of a 1080p frame) is large and the privacy story is cleaner.

Calling all three “cloud CV” hides the design decisions that matter.

Where does each pipeline stage want to live?

Facial recognition decomposes into four stages, and each has a natural home tier. The deeper architectural walkthrough lives in Facial Recognition in Computer Vision: How the Pipeline Actually Works; the placement table below is the operational summary.

| Stage | Typical compute cost | Best tier | Why |
|---|---|---|---|
| Face detection (MTCNN, RetinaFace, YOLO-face) | Low: runs on CPU or small GPU | On-device / edge | Runs on every frame; sending every frame to cloud is wasteful and privacy-hostile. |
| Alignment (5-point landmark warp) | Negligible | On-device / edge | Trivial transform; co-locate with detection. |
| Embedding extraction (ArcFace, FaceNet, AdaFace) | Medium: 512-d vector per face, GPU-friendly | Cloud or beefy edge | Heavier model; runs only on detected, quality-gated crops. Cloud GPU amortises well. |
| Gallery matching (cosine search over N identities) | Scales with gallery size | Cloud (or on-device for small N) | A million-identity gallery wants a vector index or database (FAISS, ScaNN, Milvus). A 50-identity gallery wants to stay local. |

This is an observed pattern across vendor pipelines, not a benchmarked rate, but the structural reason holds: detection is per-frame work, embedding is per-face work, matching is per-query work, and each tier is sized for a different rate.

Why bandwidth, not GPU cost, drives the split

A 1080p H.264 stream at 25 fps is roughly 3–5 Mbps. Sending that to a cloud endpoint for every camera in a 200-camera deployment is 600–1000 Mbps sustained, before any model has run. Cloud egress and ingress charges aside, this is a network-engineering problem, not a model-engineering one.

The on-device detection gate changes the economics.
Instead of streaming 25 fps of video, you stream the 1–3 face crops per second that actually have a person in them. That is two to three orders of magnitude less data. We see this pattern regularly: teams optimise the model and discover the bottleneck was the upload.

For non-biometric workloads (defect detection on a factory line, license-plate recognition, agricultural disease detection) the same logic applies. A trigger model on-device decides which frames are interesting; the cloud sees only those.

When does pure cloud inference still make sense?

Three conditions:

- Low frame rates. A retail catalog ingest at one image per second per SKU has no bandwidth problem. Ship the whole image; let cloud do everything.
- Burst workloads. A film studio re-tagging a 20-year archive overnight wants horizontal scale, not a Jetson cluster. Cloud GPUs paid by the hour win cleanly.
- No data-residency constraint. Once biometrics, medical imaging, or anything covered by GDPR Article 9 enters the picture, raw-pixel egress to a public cloud region becomes a compliance problem before it is an engineering one.

The first two conditions describe most of the “cloud CV” success stories. The third condition explains why most facial recognition deployments are not pure-cloud, no matter what the vendor slide deck shows.

What changes under the EU AI Act and BIPA

Real-time remote biometric identification is now classified as high-risk (or prohibited, depending on context) under the EU AI Act. Illinois’s BIPA provides for statutory damages per violation. Both regimes assume your architecture can answer one question: where did this face embedding live, and for how long?

A pure-cloud pipeline that uploads raw frames to a hyperscaler’s hosted face-recognition API typically cannot answer that question well: the embedding lifecycle is inside the vendor’s black box, and the data-processing agreement is the only visibility you get.
A split-inference architecture, where embeddings are computed on hardware you control and only matched against a gallery you also control, is much easier to defend in an audit.

This is not a legal opinion; it is an architecture observation. Compliance teams ask different questions of split-inference designs than they do of hosted-endpoint designs, and the split-inference questions are usually easier to answer.

Latency budgets, honestly

A common assumption is that “cloud is slower than edge.” That is only true when the comparison is honest about where the work actually is.

A well-provisioned cloud endpoint in the same region as the camera, with TensorRT-compiled models on an L4 GPU and warm Triton workers, typically delivers 30–80 ms inference latency. Add 20–60 ms of network round-trip and you are inside 150 ms end-to-end, well within real-time for most CV applications. This is an observed range across our engagements, not a benchmarked rate.

The latency problem is not cloud GPUs being slow. It is:

- Cross-region calls (a camera in Frankfurt hitting an endpoint in us-east-1 will hurt).
- Cold-start penalties on serverless GPU offerings (Lambda-style inference can add 2–10 seconds on first call).
- Payload size (the time goes to uploading 4K frames over consumer broadband, not to the GPU).

Edge wins for sub-50 ms requirements (autonomous-vehicle perception, robotics control loops, AR overlays). Cloud wins for everything else where the bandwidth question has been answered properly.

How do MLOps and model distribution fit in?

The cloud’s most underrated contribution to production CV is not inference; it is model lifecycle management. A 200-camera edge deployment with five model versions across three hardware SKUs is unmanageable without a central registry, automated rollout, and telemetry feedback.
The pattern we see working: MLflow or a custom registry holds the model artefacts, a thin agent on each edge box pulls signed model bundles, Kubernetes (or k3s on the edge) handles version rollover, and Prometheus pulls per-device inference metrics back to a central observability stack. The cloud orchestrates; the edge executes. Neither tier touches the other’s job.

This is also where ONNX, TensorRT, and OpenVINO matter operationally, not as accelerator marketing, but because the edge tier needs the same model the cloud trained, compiled for whatever silicon happens to be on-site.

What buyers should ask vendors

If a vendor pitches a “cloud computer vision” platform without distinguishing the three architectures above, that is a signal. The questions that separate serious offerings from repackaged hosted APIs:

- Where does raw video live, and for how long? (If the answer is “in our cloud,” ask about Article 9 / BIPA.)
- Does detection run on-device, or are you streaming full frames?
- Where are face embeddings computed, and where are they stored?
- What is the gallery refresh policy, and who controls it?
- What is the operating false-match rate at the chosen threshold, and how was it measured?

These are the same questions the facial recognition pipeline explainer develops in more detail. A vendor who answers them precisely is selling engineering. A vendor who answers them in marketing language is selling something else.

FAQ

How does the facial recognition pipeline decompose — detection, alignment, embedding, matching?

Detection finds faces in the frame and outputs bounding boxes. Alignment uses landmark points (typically five) to warp the face into a canonical pose. Embedding extraction runs a deep network (ArcFace, FaceNet, AdaFace) to produce a 512-dimensional vector. Matching compares that vector against a gallery using cosine similarity, returning a candidate identity if the score exceeds the operating threshold.
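
The matching step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production matcher: the names `cosine_similarity` and `match_identity`, the 0.4 threshold, and the random vectors standing in for real embeddings are all hypothetical. It uses only the Python standard library so it runs anywhere; a large gallery would use a vectorised library or a vector index instead of a linear scan.

```python
import math
import random
from typing import Dict, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no similarity
    return dot / (norm_a * norm_b)


def match_identity(
    probe: Vector, gallery: Dict[str, Vector], threshold: float
) -> Tuple[Optional[str], float]:
    """Linear scan: return the best gallery identity if its score exceeds
    the threshold, otherwise None. The best score is returned either way."""
    best_id: Optional[str] = None
    best_score = -1.0
    for identity, reference in gallery.items():
        score = cosine_similarity(probe, reference)
        if score > best_score:
            best_id, best_score = identity, score
    if best_score > threshold:
        return best_id, best_score
    return None, best_score


# Synthetic data only: random 512-d vectors stand in for real embeddings.
rng = random.Random(0)
DIM = 512
gallery = {
    f"person_{i}": [rng.gauss(0, 1) for _ in range(DIM)] for i in range(50)
}

# A probe that is a slightly noisy copy of one enrolled identity.
probe_known = [x + rng.gauss(0, 0.1) for x in gallery["person_7"]]
# A probe unrelated to anyone in the gallery.
probe_stranger = [rng.gauss(0, 1) for _ in range(DIM)]

THRESHOLD = 0.4  # hypothetical placeholder, not a recommended operating point

known_id, known_score = match_identity(probe_known, gallery, THRESHOLD)
stranger_id, stranger_score = match_identity(probe_stranger, gallery, THRESHOLD)
print(known_id, round(known_score, 3))
print(stranger_id, round(stranger_score, 3))
```

The threshold is the only policy decision in the sketch: raising it trades false matches for false non-matches, so in a real deployment it would be set from the measured error curve rather than hard-coded.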
Why is MTCNN typically preferred over Haar cascades in modern face detection, and where does that trade-off flip?

MTCNN (and successors like RetinaFace) handle pose variation, occlusion, and lighting changes far better than Haar cascades, which were tuned for frontal, well-lit faces. The trade-off flips only on extremely constrained hardware, such as older ARM Cortex-M class devices, where a Haar cascade may be the only thing that fits and the deployment can accept the accuracy cost.

Where does facial recognition sit in the broader CV pipeline (image recognition, pattern recognition, deep learning)?

Facial recognition is a specialised application of image recognition that uses deep-learning embeddings rather than classical pattern recognition. Detection borrows from general object detection; embedding extraction borrows from metric-learning research; matching is a vector-search problem. The pipeline reuses general CV components but tunes each stage for faces.

What are the realistic accuracy and bias limits of production facial recognition in 2026 deployments?

NIST FRVT-class benchmarks show top systems achieving sub-1% false non-match rates at fixed false-match thresholds in controlled conditions, with measurable demographic differentials. Production deployments (uncontrolled lighting, off-axis cameras, low-resolution crops) typically perform several times worse. Operating thresholds must be set against the deployment’s own measured error curve, not the vendor’s brochure number.

Which CV algorithms (eigenfaces, deep embeddings, transformers) are still relevant for face recognition, and which are obsolete?

Eigenfaces and Fisherfaces are of historical interest only. Deep embeddings from CNN backbones (ArcFace on ResNet/IResNet) remain the production default. Vision transformers are now competitive on accuracy but still carry a latency cost that matters on edge hardware.
The choice is usually between an ArcFace-family CNN, where inference cost matters most, and a ViT-family model, where a slightly better accuracy ceiling matters more than throughput.

How does facial recognition deployment differ across cloud, on-device, and edge inference settings?

Cloud is best for embedding extraction and large-gallery matching once frames have been gated locally. On-device suits detection, alignment, and small-gallery matching where privacy or latency rules out egress. Edge sits between the two: typically a Jetson or x86 box co-located with cameras, running the full pipeline locally and syncing telemetry to the cloud. The right split depends on bandwidth, gallery size, and the regulatory regime.

For broader programme context across our engagements, explore the Computer Vision R&D practice. Where the cloud-versus-edge split is wrong, the symptoms usually show up as bandwidth bills or compliance findings before they show up as accuracy problems, which is the failure class an architecture audit catches early.