Object Detection Model Selection for Production: YOLO vs Transformers, Speed/Accuracy, and Deployment

Model selection is a deployment decision, not a benchmark decision

Selecting an object detection model for production requires evaluating performance under your deployment constraints — hardware, latency budget, classes of interest, expected input distribution — not picking the model with the highest COCO mAP on the leaderboard. The benchmark score is a useful starting signal, not a decision criterion.

This piece is a focused decision aid: the main architecture families, their real-world tradeoffs, how to evaluate mAP against latency on your specific hardware, and the deployment considerations that often dominate the choice. The broader build-versus-buy question — whether you should be selecting a model at all, or licensing a vendor solution — is the subject of the parent decision framework on when to build a custom computer vision model versus use an off-the-shelf solution.

What are the main object detection architecture families?

Three architecture families cover most production deployment scenarios we encounter:

Single-stage detectors (YOLO family, FCOS, CenterNet) produce detections in a single forward pass over the image. They offer fast inference and mature deployment tooling, at the cost of lower accuracy on small objects. YOLO variants (YOLOv8, YOLOv9, YOLO11) dominate production deployments that require real-time or near-real-time inference.

Two-stage detectors (Faster R-CNN, Cascade R-CNN) first generate region proposals, then classify and refine each proposal. They tend to be more accurate, especially on small and occluded objects, but inference is noticeably slower: typically 2–5× slower than single-stage at equivalent backbone size. They are less common in edge deployments and tend to live in cloud inference pipelines where throughput matters more than per-image latency.

Detection transformers (DETR, RT-DETR, DINO) use attention mechanisms and set-based prediction, which removes the need for NMS post-processing. The original DETR was too slow for many production targets; RT-DETR has since closed much of the speed gap with YOLO, while DINO improves on DETR's accuracy and convergence but remains outside real-time budgets in the figures below. Their strength shows in complex scenes with many interacting objects, where the global attention pattern handles overlap better than dense-anchor approaches.

Performance comparison

Benchmark performance on COCO (80-class detection):

Model	COCO mAP50-95	Latency (A100 GPU)	Params	Best For
YOLOv8n	37.3	~1.5 ms	3.2M	Edge, real-time, resource-constrained
YOLOv8m	50.2	~5.1 ms	25.9M	Balanced speed/accuracy
YOLOv8x	53.9	~13.2 ms	68.2M	Accuracy priority, server inference
YOLOv9c	53.0	~6.7 ms	25.3M	Efficient high-accuracy
RT-DETR-L	53.0	~9.1 ms	32.9M	Transformer baseline, complex scenes
Faster R-CNN R-50	42.0	~40 ms	41.8M	Cloud, high accuracy, batch processing
DINO-4scale	49.0	~50 ms	47.0M	High accuracy, non-real-time

These figures are A100 GPU benchmarks at FP32 precision, reported by the model authors. On-device performance varies significantly by hardware — see the deployment section for embedded numbers.

In our experience, YOLOv8m or YOLOv9c is the practical starting point for most production detection deployments: strong accuracy, a well-supported inference stack (TensorRT, ONNX, CoreML), active maintenance, and a large community for troubleshooting. We treat the leaderboard as a shortlist generator, then move quickly to evaluation on the actual task.

mAP vs latency: making the tradeoff

COCO mAP is a useful comparison tool but has limitations as a production metric:

COCO has 80 classes; your task probably has 2–5. Aggregate mAP ranking does not reliably predict performance on your specific class set.
COCO contains objects at many scales. If your task is single-scale — vehicles on a highway, parcels on a conveyor — small-object performance differences are irrelevant.
mAP50 scores detections at a single overlap criterion (IoU of at least 0.5); mAP50-95 averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter boxes. For presence/absence and rough location, mAP50 is enough. For precise localisation or measurement-style downstream use, mAP50-95 matters.
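To make the difference between the two metrics concrete, here is a minimal sketch of how a single prediction fares against the COCO-style IoU thresholds. The box coordinates are made up for illustration, and the helper names (`iou`, `thresholds_passed`) are ours, not part of any evaluation library.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Thresholds at which this prediction would count as a match."""
    overlap = iou(pred, truth)
    return [t for t in COCO_THRESHOLDS if overlap >= t]


# Illustrative boxes: a prediction shifted 10 px to the right of a
# 100x100 ground-truth box.
truth = (0, 0, 100, 100)
pred = (10, 0, 110, 100)

print(f"IoU = {iou(pred, truth):.3f}")
print("matches at thresholds:", thresholds_passed(pred, truth))
```

A 10 px shift on a 100 px object still counts as a hit for mAP50 but is rejected at the strictest thresholds, which is why the two metrics can rank models differently when localisation precision matters.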

The latency-accuracy tradeoff has to be evaluated on your hardware with your classes. A workable sequence:

Select 3–5 candidate models spanning a size/speed range.
Fine-tune each on your training data (or evaluate zero-shot where it applies).
Measure inference latency on your target hardware at the batch size you will actually run in production.
Measure accuracy on your held-out test set — not COCO — using the metrics that matter for your application (detection rate, false positive rate, localisation accuracy).
Select the model that meets your latency budget with the highest accuracy on the task as you defined it.
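The final step of the sequence reduces to a constrained selection: filter by latency budget, then take the most accurate survivor. A minimal sketch follows; the `Candidate` type, the model names, and all numbers are placeholders for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    latency_ms: float     # measured on target hardware at production batch size
    task_accuracy: float  # metric from your own held-out test set, not COCO


def select_model(candidates, latency_budget_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    feasible = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.task_accuracy)


# Placeholder values purely for illustration.
candidates = [
    Candidate("model_small", 8.0, 0.81),
    Candidate("model_medium", 21.0, 0.88),
    Candidate("model_large", 55.0, 0.90),
]

best = select_model(candidates, latency_budget_ms=30.0)
print(best.name if best else "no candidate meets the budget")
```

Returning `None` when nothing fits is deliberate: an empty feasible set means the budget, the hardware, or the candidate list has to change, and that is better surfaced explicitly than hidden behind a "least bad" pick.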

This sequence sounds obvious on paper. The reason to write it down is that we routinely see teams compress it — pick YOLOv8x because it is at the top of the table, then discover at integration time that the latency budget is half what the A100 number suggested.
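One way to catch that mismatch early is a small timing harness run on the target device. The sketch below assumes you can wrap your model in a callable, here called `infer`, that blocks until results are ready; for GPU backends that usually means synchronising inside the callable. The `fake_infer` function is a stand-in so the example runs anywhere.

```python
import statistics
import time


def benchmark(infer, batch, warmup=10, runs=100):
    """Time a blocking inference callable and report latency percentiles in ms."""
    # Warm-up iterations absorb one-off costs such as lazy initialisation.
    for _ in range(warmup):
        infer(batch)

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(batch)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_infer(batch):
    # Stand-in for a real model call; replace with your own inference function.
    return [sum(row) for row in batch]


# Batch of 8 mirrors a hypothetical production batch size.
stats = benchmark(fake_infer, [[1.0, 2.0, 3.0]] * 8, warmup=2, runs=20)
print(stats)
```

Reporting p95 and max alongside the median matters because a latency budget is usually violated by the tail, not the typical frame.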

Edge vs cloud deployment considerations

Deployment context constrains model selection at least as much as accuracy requirements do.

Edge deployment (NVIDIA Jetson, Coral TPU, Hailo)

Edge inference has hard constraints: memory budget, thermal envelope, power budget, and often a requirement for INT8 quantisation.

Memory. YOLOv8n weights fit in under 10 MB; YOLOv8x weights take 130 MB or more, before counting activations and runtime overhead. This matters on devices with limited RAM and on shared embedded systems.
INT8 quantisation. Most edge targets (Coral, Hailo, Jetson via TensorRT) require or strongly prefer INT8 quantised models. Quantisation accuracy loss on object detection is typically 0.5–1.5 mAP points with proper calibration, which is small enough to absorb but worth measuring before commitment.
ONNX export. Export and validate ONNX before committing to a model for edge deployment. Some components (certain attention mechanisms, dynamic operations) have limited ONNX or TensorRT support and will fail at conversion time rather than at training time.
TensorRT optimisation. On NVIDIA Jetson, TensorRT typically provides a 3–5× throughput improvement over native PyTorch for YOLOv8 models.
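The memory point can be sanity-checked with simple arithmetic: weight storage is roughly parameter count times bytes per parameter. The sketch below uses the parameter counts from the benchmark table; it covers weights only, and the assumption that activations, buffers, and runtime overhead come on top is ours.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Parameter counts in millions, taken from the benchmark table.
PARAMS_MILLIONS = {"YOLOv8n": 3.2, "YOLOv8m": 25.9, "YOLOv8x": 68.2}


def weight_size_mb(params_millions, precision):
    """Approximate weight storage in MB (10^6 bytes), weights only."""
    return params_millions * BYTES_PER_PARAM[precision]


for name, params in PARAMS_MILLIONS.items():
    sizes = ", ".join(
        f"{precision}: {weight_size_mb(params, precision):.1f} MB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```

The sizes quoted in the list above are consistent with roughly two bytes per parameter. Moving to INT8 halves that again, which is part of why quantisation is attractive on memory-constrained devices.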

Approximate NVIDIA Jetson Orin Nano inference performance (INT8) we observe in practice:

YOLOv8n: ~45–60 FPS at 640 px input
YOLOv8m: ~15–20 FPS at 640 px input
YOLOv8x: ~5–8 FPS at 640 px input
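These frame rates translate directly into a per-frame time budget and a stream capacity. The sketch below takes the lower bound of each range above and assumes, hypothetically, 15 FPS camera streams where every frame must be processed.

```python
def frame_time_ms(fps):
    """Per-frame processing time implied by a sustained frame rate."""
    return 1000.0 / fps


def max_streams(model_fps, stream_fps):
    """Camera streams one device can serve if every frame is processed."""
    return int(model_fps // stream_fps)


# Lower bounds of the observed ranges above.
observed_fps = {"YOLOv8n": 45, "YOLOv8m": 15, "YOLOv8x": 5}
STREAM_FPS = 15  # hypothetical camera frame rate

for name, fps in observed_fps.items():
    print(
        f"{name}: {frame_time_ms(fps):.1f} ms/frame, "
        f"{max_streams(fps, STREAM_FPS)} stream(s) at {STREAM_FPS} FPS"
    )
```

Under these assumptions the largest model cannot keep up with even one stream without frame skipping, which is the kind of constraint that should be known before accuracy comparisons begin.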

Cloud and server deployment

Cloud deployment has fewer hard constraints but introduces its own tradeoffs around throughput and cost:

Batching. Server-side detection should use batch inference for throughput efficiency. YOLOv8m at batch size 8 on an A100 reaches roughly 250 images per second.
GPU tier. Model size determines the required GPU class. YOLOv8n runs efficiently on T4; YOLOv8x needs A10 or A100 for production throughput.
Latency vs throughput. Real-time applications (live video) require dedicated GPU allocation. Batched offline analytics can use spot or preemptible instances and absorb interruption.
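For capacity planning, the throughput figure converts into a GPU count once you fix a workload. The sketch below uses the roughly 250 images per second quoted above; the target workload and the 70% headroom factor are assumptions for illustration.

```python
import math


def gpus_needed(required_images_per_s, per_gpu_images_per_s, headroom=0.7):
    """GPUs required when each GPU is planned at `headroom` of peak throughput."""
    usable = per_gpu_images_per_s * headroom
    return math.ceil(required_images_per_s / usable)


PER_GPU_THROUGHPUT = 250   # images/s, YOLOv8m at batch size 8 as quoted above
TARGET_WORKLOAD = 1000     # images/s, hypothetical

print(gpus_needed(TARGET_WORKLOAD, PER_GPU_THROUGHPUT))
```

Planning below peak throughput leaves room for traffic bursts and preprocessing overhead; the right headroom value depends on your traffic pattern and should be measured rather than assumed.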

Pre-deployment validation checklist

Model evaluated on a held-out test set drawn from the deployment distribution, not benchmark datasets.
Detection rate measured per class — aggregate mAP can mask poor performance on rare classes.
Confidence threshold calibrated to the target precision-recall operating point.
Inference latency measured on target hardware at production batch size.
INT8 quantisation accuracy validated if edge deployment requires it.
ONNX export tested and outputs verified against native inference.
False positive rate measured on negative examples from the deployment environment.
Model handles input at deployment resolution without resize-pipeline artefacts.
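The per-class item is worth illustrating, because the aggregate number hides the problem so effectively. A minimal sketch, assuming you already have one record per ground-truth object saying whether it was detected; the class names and counts are invented.

```python
from collections import defaultdict


def per_class_detection_rate(records):
    """records: iterable of (class_name, was_detected), one per ground-truth object."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cls, detected in records:
        totals[cls] += 1
        hits[cls] += int(detected)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Invented example: a common class detected well, a rare class detected poorly.
records = (
    [("pallet", True)] * 95
    + [("pallet", False)] * 5
    + [("forklift", True)] * 2
    + [("forklift", False)] * 3
)

rates = per_class_detection_rate(records)
overall = sum(detected for _, detected in records) / len(records)

print(f"overall: {overall:.3f}")
for cls, rate in rates.items():
    print(f"{cls}: {rate:.2f}")
```

Here the overall detection rate is above 92% while the rare class is found less than half the time, which is the masking effect the checklist item warns about.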

Post-deployment monitoring

Confidence score distribution tracked over time for distribution-shift detection.
Detection rate sampled and validated against human annotations on a periodic cadence.
Latency monitored in production under real load, since inference latency degrades as load increases.
Retraining trigger defined as a concrete event or metric threshold, not a vague “when accuracy drops”.
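One way to turn confidence-distribution tracking into a concrete trigger is the Population Stability Index (PSI) between a reference window and the current window. This is a technique we are swapping in for illustration, not something prescribed above; the 0.2 alert level is a common rule of thumb and should be treated as an assumption to tune. Both windows are assumed non-empty, with scores in [0, 1].

```python
import math


def histogram(scores, bins=10):
    """Fraction of scores falling into each equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two sets of confidence scores."""
    ref = histogram(reference, bins)
    cur = histogram(current, bins)
    # eps keeps the logarithm finite when a bin is empty in one window.
    return sum(
        (c - r) * math.log((c + eps) / (r + eps)) for r, c in zip(ref, cur)
    )


PSI_ALERT = 0.2  # assumed alert level, tune on your own data

# Invented windows: confidence mass shifts from high to medium scores.
reference_scores = [0.95] * 80 + [0.65] * 20
current_scores = [0.95] * 40 + [0.65] * 60

drift = psi(reference_scores, current_scores)
print(f"PSI = {drift:.3f}, review needed: {drift > PSI_ALERT}")
```

A PSI alert does not say the model is wrong, only that its inputs or outputs have shifted. It is a reasonable trigger for pulling a sample for human annotation, which then confirms or rules out a real accuracy drop.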

What are the common selection mistakes?

Selecting on COCO mAP without evaluating on the actual task. COCO rankings do not transfer to domain-specific tasks. A model ranked third on COCO may well be the best on your class set and image distribution.

Ignoring deployment hardware until after model selection. Choosing YOLOv8x and then discovering it does not meet latency requirements on the target Jetson Orin Nano forces a restart of the selection process.

Skipping confidence threshold calibration. The default confidence threshold (0.25 in YOLOv8) is not calibrated for production. It has to be set against the precision-recall requirement of your application on your validation set.
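Calibration can be as simple as sweeping the threshold over validation detections and picking the lowest value that meets the precision target, which also maximises recall among the thresholds that qualify. The sketch below assumes detections have already been matched to ground truth; the scores, labels, and the 0.75 precision target are invented.

```python
def precision_recall_at(threshold, detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs."""
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


def lowest_threshold_meeting_precision(detections, num_ground_truth, target):
    """Lowest observed confidence that reaches the precision target, or None."""
    # Recall can only fall as the threshold rises, so the first threshold
    # that satisfies the precision target also gives the best recall.
    for threshold in sorted({conf for conf, _ in detections}):
        precision, recall = precision_recall_at(
            threshold, detections, num_ground_truth
        )
        if precision >= target:
            return threshold, precision, recall
    return None


# Invented validation detections: (confidence, matched a ground-truth object).
detections = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, False),
    (0.55, True), (0.30, False), (0.26, False),
]
NUM_GROUND_TRUTH = 5

print("default 0.25:", precision_recall_at(0.25, detections, NUM_GROUND_TRUTH))
print("calibrated:", lowest_threshold_meeting_precision(
    detections, NUM_GROUND_TRUTH, target=0.75))
```

In this toy data the default threshold keeps every detection and misses the precision target, while the calibrated threshold reaches it without giving up any recall.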

Neglecting NMS tuning. The Non-Maximum Suppression (NMS) IoU threshold and the confidence threshold interact. Lowering the confidence threshold for recall without revisiting the NMS IoU threshold can leave duplicate detections of the same object, which inflates false positive counts in dense scenes.
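The interaction is easiest to see in a minimal greedy NMS for a single class. This is an illustrative sketch, not the implementation used by any particular framework, and the boxes and scores are invented.

```python
def iou(a, b):
    """Intersection over union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold, conf_threshold=0.0):
    """Greedy single-class NMS over a list of (box, confidence) pairs."""
    remaining = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Invented scene: two boxes on the same object (IoU about 0.82) plus one
# separate object.
detections = [
    ((0, 0, 100, 100), 0.90),
    ((10, 0, 110, 100), 0.40),
    ((200, 0, 300, 100), 0.70),
]

print(len(nms(detections, iou_threshold=0.7, conf_threshold=0.5)))  # strict on both
print(len(nms(detections, iou_threshold=0.9, conf_threshold=0.25)))  # loose on both
print(len(nms(detections, iou_threshold=0.7, conf_threshold=0.25)))  # NMS retuned
```

With a loose IoU threshold, lowering the confidence threshold lets the duplicate through as a third detection; tightening the IoU threshold removes it again while still keeping the lower confidence cut-off.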

The shared pattern across all four is treating model selection as a benchmark exercise rather than a deployment exercise. The benchmark tells you which models are plausible; the deployment tells you which one ships.

FAQ

When should I build a custom computer vision model versus use an off-the-shelf solution?

Build a custom model when your deployment conditions diverge meaningfully from the data and classes covered by available off-the-shelf models: different sensor characteristics, narrow domain-specific classes, unusual environmental conditions, or accuracy requirements above what pre-trained models achieve on your data. If a current off-the-shelf detector meets your accuracy and latency targets on a representative test set from your deployment environment, custom development is usually not justified. The decision framework in the parent article treats this as a structured question rather than a preference.

What does “off-the-shelf CV” actually cover, and where does it run out?

Off-the-shelf detectors — YOLOv8, YOLOv9, RT-DETR pre-trained on COCO, or vendor APIs from major cloud providers — cover common objects in conditions similar to internet-scale training data. They run out when your task involves classes not in COCO (industrial parts, medical features, domain-specific defects), when your imaging conditions differ substantially (thermal, hyperspectral, unusual lighting), or when accuracy on rare classes matters more than aggregate mAP suggests.

How do I estimate the engineering cost of a custom CV model before committing to it?

Estimate three components: data collection and annotation (often the largest, scaling with class count and required examples per class), training and iteration (typically 4–12 weeks for a first production-grade model in our experience), and inference integration plus monitoring infrastructure (often underestimated). The decision framework treats availability of representative production data as a separate evaluable criterion because it dominates the data-side cost.

Which signals tell me a vendor’s pre-trained model will fail on my data?

Evaluate on a representative sample of deployment-environment images, not vendor demos. Signals of likely failure include: low detection rate on classes that look visually similar to but are not exactly the COCO equivalent; high false positive rate on objects the model treats as your target class; sensitivity to lighting, resolution, or sensor characteristics that differ from the vendor training set; and poor calibration of confidence scores on your data.

What is the realistic time-to-value for a custom CV model versus a vendor solution?

A vendor or off-the-shelf solution can be integration-ready in days to weeks if the model is adequate as-is. A custom model is typically 2–4 months from project start to first production deployment, dominated by data collection, annotation, and validation rather than training itself. These are observed-pattern ranges across our engagements, not benchmarked rates — your timeline depends on data availability and accuracy requirements.

Can I start with off-the-shelf and migrate to custom later without throwing the integration away?

Yes, if the integration is designed around a standard interface — ONNX runtime, TensorRT, or a vendor-neutral inference server — and not coupled to a specific model’s input/output shape. Keep the data collection pipeline running from day one so that when a custom model becomes justified, the training data is already accumulating. This staged approach is what the parent decision framework recommends as the default unless domain divergence is obvious upfront.
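In code, the decoupling amounts to having the application depend on a small detector interface rather than on any one model's raw output. The sketch below is hypothetical throughout: `Detection`, `Detector`, and both wrapper classes are illustrative names, and the hard-coded return values stand in for real backend calls.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


class Detector(Protocol):
    """The only contract the application code is allowed to depend on."""

    def detect(self, image) -> List[Detection]: ...


class OffTheShelfDetector:
    """Hypothetical wrapper around a pre-trained model or vendor API."""

    def detect(self, image) -> List[Detection]:
        # A real wrapper would call the backend here and map its raw output
        # (tensor shapes, class ids) into Detection objects.
        return [Detection("person", 0.91, (12, 30, 88, 200))]


class CustomDetector:
    """Hypothetical wrapper around a later, custom-trained model."""

    def detect(self, image) -> List[Detection]:
        return [Detection("defect", 0.87, (40, 40, 60, 60))]


def count_confident(detector: Detector, image, min_confidence=0.5):
    """Application logic written against the interface, not a specific model."""
    return sum(1 for d in detector.detect(image) if d.confidence >= min_confidence)


image = None  # placeholder; a real pipeline would pass a decoded frame
print(count_confident(OffTheShelfDetector(), image))
print(count_confident(CustomDetector(), image))
```

Swapping the off-the-shelf wrapper for the custom one then touches a single construction site, and everything downstream of `detect` stays as it was.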