How do face detection and recognition differ in practice?

Face detection and face recognition are distinct pipeline stages with different requirements and failure modes. Detection answers “is there a face in this image, and where?” Recognition answers “whose face is this?” Many deployments conflate the two, which leads to unrealistic accuracy expectations and poor camera specifications.

Detection is a prerequisite for recognition: you cannot recognise a face the detector has not found. But detection also has significant standalone applications: people counting by face, crowd density estimation, access-point presence verification. Specifying a camera system for detection differs from specifying one for full identity matching. For the stages of the four-stage pipeline that sit on top of the detector (alignment, embedding, and gallery match), see our explainer on how the facial recognition pipeline actually works.

Resolution and geometric prerequisites for detection

Face detection models require a minimum face size in pixels to fire reliably. Below this threshold, false negative rates increase sharply: faces are missed, not misclassified. This is not a soft degradation; it is a hard cliff that camera specification has to respect.

| Face height in frame | Detection behaviour | Notes |
| --- | --- | --- |
| <20 px | Unreliable; high miss rate | Not suitable for detection |
| 20–40 px | Moderate detection rate (~70–80%) | High false positive rate; model operates at limit |
| 40–80 px | Good detection (85–93%) | Practical minimum for most applications |
| 80–150 px | High detection (93–98%) | Reliable across pose and partial occlusion |
| >150 px | Near-ceiling performance | Exceeds what most detectors need |

Camera specification for detection follows from one calculation: at the intended operating distance, verify that the smallest face you need to detect spans at least 40 pixels of height in the frame, and preferably 80 or more. That number, paired with the detection zone geometry, drives lens selection and sensor resolution.
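
The calculation can be sketched with a simple pinhole-camera model. This is a minimal sketch under stated assumptions, not a substitute for a site survey: the function names are illustrative, the 0.20 m face height is an assumed round figure for an adult face, and lens distortion and off-axis position are ignored.

```python
import math


def face_height_px(distance_m, vertical_fov_deg, image_height_px,
                   face_height_m=0.20):
    """Estimate how many pixels tall a face appears at a given distance.

    Assumes an ideal pinhole camera looking straight at the subject.
    face_height_m is an assumed typical value, not a measured one.
    """
    if distance_m <= 0 or not 0 < vertical_fov_deg < 180:
        raise ValueError("distance must be > 0 and FOV between 0 and 180 degrees")
    # Height of the scene slice the sensor sees at this distance.
    scene_height_m = 2.0 * distance_m * math.tan(math.radians(vertical_fov_deg) / 2.0)
    return image_height_px * face_height_m / scene_height_m


def max_distance_m(min_face_px, vertical_fov_deg, image_height_px,
                   face_height_m=0.20):
    """Invert the relation: farthest distance at which a face still spans min_face_px."""
    if min_face_px <= 0 or not 0 < vertical_fov_deg < 180:
        raise ValueError("min_face_px must be > 0 and FOV between 0 and 180 degrees")
    half_fov_tan = math.tan(math.radians(vertical_fov_deg) / 2.0)
    return image_height_px * face_height_m / (2.0 * min_face_px * half_fov_tan)


if __name__ == "__main__":
    # Hypothetical candidate: 2160-row sensor with a 30 degree vertical field of view.
    px = face_height_px(distance_m=12.0, vertical_fov_deg=30.0, image_height_px=2160)
    reach = max_distance_m(min_face_px=40, vertical_fov_deg=30.0, image_height_px=2160)
    print(f"face height at 12 m: {px:.1f} px; 40 px floor reached at {reach:.1f} m")
```

For a candidate camera, compare `face_height_px` at the far edge of the detection zone against the 40 px floor in the table above, or use `max_distance_m` to see how far the zone can extend. A check like this takes minutes per candidate lens and mounting position.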
Skip it and you are buying a camera system that cannot meet its stated job.

Angle requirements. Face detectors are trained predominantly on near-frontal images. At yaw angles beyond ±45°, detection rates drop significantly; this is an observed pattern across the open-source detectors discussed below, not a property of any one model. Cameras must be positioned so subjects present their face within this range when entering the detection zone. Ceiling-mounted cameras over wide thoroughfares routinely violate this and then get blamed for “low accuracy”.

Lighting minimums. Face detectors require adequate image contrast. In low light, the face must still show sufficient detail, which means adequate ambient light, IR illumination, or a low-light-capable sensor. In our experience, detection rates drop noticeably below approximately 10 lux ambient illumination without IR supplementation. That figure is an observed pattern, not a benchmarked rate; the exact cliff depends on sensor, lens aperture, and exposure settings, but the order of magnitude has been stable across the deployments we have specified.

MTCNN vs RetinaFace vs MediaPipe: which detector to choose?

The three open-source face detectors below are widely deployed and have different performance profiles:

MTCNN (Multi-task Cascaded CNNs). A three-stage cascaded detector that progressively refines bounding boxes. It is one of the most widely used face detectors in production deployments, owing to its accuracy and its well-maintained PyTorch and TensorFlow implementations. Strengths: good accuracy across face sizes; outputs 5-point landmarks for downstream alignment. Weaknesses: slower than single-stage detectors; the cascaded architecture is less GPU-parallelisable. Typical inference time: 20–50 ms per image on CPU; 5–15 ms on GPU.

RetinaFace. A single-stage detector trained on a large-scale face dataset. It is currently one of the most accurate open-source detectors.
Strengths: high accuracy, handles small faces well, outputs detailed facial landmarks, supports multiple backbone sizes. Weaknesses: heavier than MTCNN for equivalent backbone size; less widely integrated in off-the-shelf pipelines. Typical inference time: 10–30 ms per image depending on backbone (GPU, ONNX or TensorRT runtime).

MediaPipe Face Detection. Google’s BlazeFace model, optimised for mobile and real-time inference. Strengths: very fast (sub-5 ms on mobile GPU); designed for on-device deployment. Weaknesses: lower accuracy on small, occluded, or extreme-pose faces; limited to roughly frontal face detection. Typical inference time: 1–5 ms on mobile GPU; 3–10 ms on CPU.

| Detector | Accuracy (WIDER FACE Hard) | Speed (GPU) | Landmark output | Best for |
| --- | --- | --- | --- | --- |
| MTCNN | ~85% | ~10 ms | 5 points | General production; balanced |
| RetinaFace R50 | ~91% | ~20 ms | 5 points | High-accuracy applications |
| BlazeFace / MediaPipe | ~78% | ~3 ms | 6 points | Mobile, edge, real-time |

WIDER FACE Hard numbers are benchmark figures published by the respective authors. They are useful as a relative ordering, not as a prediction of how the detector will perform on your scene. A retail floor with overhead lighting and a 30-metre operating distance is not WIDER FACE Hard.

Confidence threshold calibration

The confidence threshold sets the trade-off between detection rate and false positive rate. The default threshold in most implementations is not calibrated for production; it is typically chosen to show high recall in demos, not to control false positives in your scene. In production:

1. Set the threshold on a validation set drawn from your deployment environment, not from benchmark datasets.
2. Measure precision and recall at multiple thresholds; plot the precision–recall curve.
3. Select the operating threshold based on your application’s tolerance for false positives versus false negatives.
4. Verify that the threshold holds under different lighting and time-of-day conditions, and re-verify after any sensor or lens change.
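
The calibration steps above can be sketched as a threshold sweep. This is a minimal sketch, assuming detections from your in-scene validation frames have already been matched to ground-truth faces (for example by an IoU rule of your choosing), so each detection arrives as a `(score, is_true_positive)` pair. The function names and the selection rule (highest recall that still meets a precision floor) are illustrative assumptions, not a prescribed method.

```python
def precision_recall_at(scored, n_ground_truth, threshold):
    """Precision and recall if only detections with score >= threshold are kept.

    scored: list of (confidence, is_true_positive) pairs, matched upstream.
    n_ground_truth: number of annotated faces in the validation frames.
    """
    kept = [is_tp for score, is_tp in scored if score >= threshold]
    true_pos = sum(kept)
    # With nothing kept there are no false positives, so report precision 1.0.
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / n_ground_truth if n_ground_truth else 0.0
    return precision, recall


def sweep(scored, n_ground_truth, thresholds):
    """Return (threshold, precision, recall) rows, one per candidate threshold."""
    return [(t, *precision_recall_at(scored, n_ground_truth, t)) for t in thresholds]


def pick_threshold(curve, min_precision):
    """Pick the row with the highest recall that meets the precision floor.

    Ties on recall go to the higher threshold. Returns None if no row qualifies.
    """
    eligible = [row for row in curve if row[1] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda row: (row[2], row[0]))[0]


if __name__ == "__main__":
    # Toy validation data for illustration only; use in-scene frames in practice.
    scored = [(0.95, True), (0.90, True), (0.80, False),
              (0.70, True), (0.60, False), (0.40, True)]
    curve = sweep(scored, n_ground_truth=5, thresholds=[0.5, 0.75, 0.9])
    for t, p, r in curve:
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
    print("chosen:", pick_threshold(curve, min_precision=0.65))
```

Running the same sweep separately on daytime, night-time, and post-maintenance frames is one way to check that the chosen threshold holds across conditions.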
Face detection deployment checklist

- Minimum face size at operating distance calculated and verified against camera specification.
- Camera angle verified against detector yaw tolerance (±45°).
- Lighting assessed at night and during low-light periods; IR illumination specified if needed.
- Detector selected based on latency budget and accuracy requirements for the specific scene.
- Confidence threshold calibrated on in-domain validation data.
- False positive rate measured on frames without faces (background scenes, non-human objects).
- Detection rate validated on a held-out evaluation set with representative pose and lighting variation.

Real-world false positive rates

In production deployments, face detectors generate false positives from a handful of recurring sources: faces on screens, posters, and printed materials; face-shaped objects such as certain toys, mannequins, and some signage; partial occlusions that expose face-like regions; and high-noise low-light conditions where the detector latches onto texture rather than structure.

Across our deployments, the operating false positive rates we have observed cluster by scene class:

- Controlled indoor environments (lobby, access point): 2–5% false positive rate (FPR) at 90%+ detection rate.
- Retail environments with product displays and signage: 8–15% FPR, with posters and product imagery driving most of the excess.
- Outdoor environments with billboards and vehicle advertising: 10–20% FPR.

These ranges are observed patterns across our engagements, not benchmarked rates. They are useful for planning (for setting expectations with stakeholders before the camera goes up), but the specific number for any given deployment has to come from in-scene measurement.

For applications where false positives carry a cost (triggering downstream recognition, generating alerts, logging biometric events under GDPR or BIPA), post-detection filtering can bring operational false positive rates to acceptable levels.
Liveness checks, minimum-size filters, and quality-score thresholds (sharpness, exposure, yaw estimate) each remove a different class of false positive. Stacking two or three of them is usually enough; relying on the detector’s confidence score alone is not.

Where this leaves camera specification

The camera comes before the model. If the minimum face height at operating distance is wrong, no detector swap recovers it; if the yaw envelope is wrong, ±45° is a wall, not a guideline. Get those two right, calibrate the confidence threshold on in-scene data, and the detector choice between MTCNN, RetinaFace, and MediaPipe becomes the ordinary latency-versus-accuracy trade-off it should be.

A failure pattern we see often enough to name: detection-stage false positives from in-scene signage routed straight into a biometric-logging pipeline, with no liveness or quality gate in between. The fix is a quality-score filter at the detector output, not a better detector.
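
A gate of that kind, stacking the filters described earlier, could be sketched as follows. This is a minimal sketch under stated assumptions: the `Detection` fields, the `GateConfig` defaults, and the 0-to-1 sharpness score are hypothetical placeholders, and the upstream detector, quality estimator, and liveness check that would populate them are not shown. Only the 40 px and ±45° limits come from the figures above; the other defaults would need in-scene tuning.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    """One detector output plus quality signals computed upstream (hypothetical fields)."""
    confidence: float
    face_height_px: float
    yaw_deg: float
    sharpness: float                 # assumed normalised score in [0, 1]
    is_live: Optional[bool] = None   # None means no liveness check was run


@dataclass
class GateConfig:
    """Placeholder thresholds; calibrate on in-scene data before use."""
    min_confidence: float = 0.8
    min_face_px: float = 40.0
    max_abs_yaw_deg: float = 45.0
    min_sharpness: float = 0.5
    require_liveness: bool = True


def rejection_reasons(det: Detection, cfg: GateConfig) -> List[str]:
    """List every filter the detection fails; an empty list means it passes."""
    reasons = []
    if det.confidence < cfg.min_confidence:
        reasons.append("low_confidence")
    if det.face_height_px < cfg.min_face_px:
        reasons.append("face_too_small")
    if abs(det.yaw_deg) > cfg.max_abs_yaw_deg:
        reasons.append("yaw_out_of_range")
    if det.sharpness < cfg.min_sharpness:
        reasons.append("too_blurry")
    # A missing liveness result is treated as a failure when liveness is required.
    if cfg.require_liveness and det.is_live is not True:
        reasons.append("liveness")
    return reasons


def passes_gate(det: Detection, cfg: GateConfig) -> bool:
    """True only if the detection clears every configured filter."""
    return not rejection_reasons(det, cfg)


if __name__ == "__main__":
    cfg = GateConfig()
    # A sharp, frontal, high-confidence face on a poster: only liveness catches it.
    poster = Detection(confidence=0.97, face_height_px=120, yaw_deg=5,
                       sharpness=0.9, is_live=False)
    print(rejection_reasons(poster, cfg))
```

Returning the reasons, not just a pass/fail flag, makes it possible to log which filter removed each detection and to see which class of false positive dominates a given scene.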