Deep learning is the reason computer vision became practical at industrial scale. Before 2012, every new vision task meant a new feature pipeline. After AlexNet, the dominant pattern flipped: collect data, pick an architecture, train, deploy. A decade and a half later the recipe has matured, but the trade-offs are sharper than the marketing suggests. This article covers what actually works in production, what to learn first, and where classical computer vision still beats a neural network.

## Why Deep Learning Took Over

Three things lined up at once:

- Convolutional neural networks could learn visual features end-to-end instead of relying on hand-designed ones.
- GPUs made training those networks economically viable; anything close to modern training would take centuries on a CPU.
- Large labelled datasets like ImageNet gave the field a common benchmark, which let progress compound.

The result was a step change in accuracy on classification, detection, and segmentation tasks. Within a few years, the question stopped being “can a network learn this?” and became “can we collect enough data to train one cheaply?”

## The Architectures That Earn Their Cost

There are hundreds of published architectures. A working practitioner needs to know maybe ten. The ones that show up most in deployed systems:

### Convolutional Networks

CNNs are still the default for many tasks. The families worth knowing:

- **ResNet.** The skip-connection trick that unlocked very deep networks. Still a strong baseline.
- **EfficientNet.** Optimised for the accuracy-per-FLOP curve. Common on edge hardware.
- **ConvNeXt.** A modern CNN that competes with transformers on accuracy while keeping convolutional efficiency.

For a deeper view of the building blocks underneath, see Feature Extraction and Image Processing for Computer Vision.

### Vision Transformers

ViTs treat an image as a sequence of patches and apply self-attention. They scale better on very large datasets and have become the backbone for foundation models: CLIP, DINO, SAM.
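The patch-sequence idea is simple enough to sketch in plain Python. The helper below is illustrative only (no real ViT library works on nested lists), but it shows how an image becomes the token sequence that self-attention operates on:

```python
# Sketch of the ViT "image as a sequence of patches" idea, using plain
# Python lists in place of a tensor library. Names are illustrative,
# not taken from any particular ViT implementation.

def patchify(image, patch_size):
    """Split an H x W single-channel image into flattened patches.

    image: list of H rows, each a list of W pixel values.
    Returns (H/P) * (W/P) vectors of length P*P, in row-major patch
    order -- the token sequence a ViT attends over (before the linear
    projection and position embeddings are applied).
    """
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                patch.extend(image[row][left:left + patch_size])
            patches.append(patch)
    return patches

# A 4x4 image split into 2x2 patches yields a sequence of 4 tokens.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)  # tokens[0] == [0, 1, 4, 5]
```

In a real ViT each flattened patch is then linearly projected to the model dimension and given a position embedding; the sketch stops at the tokenisation step.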
They cost more compute per parameter than CNNs but unlock capabilities that CNNs do not.

### Object Detection Heads

YOLO (v5, v8, v11), DETR, and RT-DETR are the practical choices for “find and locate.” YOLO dominates real-time edge deployments. DETR-style models are catching up and are easier to extend with additional output heads.

### Segmentation Models

U-Net for medical and scientific imaging, DeepLab for general semantic segmentation, Mask R-CNN for instance segmentation, SAM for zero-shot prompt-driven segmentation. Each has a clear sweet spot.

### Foundation Models

CLIP, DINO, SAM, and their successors changed the workflow. Instead of training a model from scratch, the pattern now is: take a pre-trained foundation model, freeze most of it, and fine-tune a small head for your task. This typically reduces the amount of labelled data required by 10× to 100×.

## How Training Actually Works on Real Data

Tutorials show clean datasets and steady loss curves. Real projects do not. The training loop in production looks more like this:

1. **Collect raw data from the target environment.** Cameras, lighting, distance, and angles must match deployment.
2. **Label a first batch carefully.** Put two annotators on a sample of frames to measure agreement, and rewrite the label spec until agreement is above 90%.
3. **Fine-tune a foundation model as a starting point.** Resist the urge to train from scratch.
4. **Look at the failures.** Run inference on a held-out set and visually inspect the worst predictions. Most insight comes from this step.
5. **Collect targeted data.** The errors tell you what data is missing. Collect or synthesise more of it.
6. **Repeat.** Three or four cycles usually beat any clever architecture change.
7. **Calibrate the threshold for your task.** The default 0.5 confidence cutoff is almost never right.
8. **Lock the model and write the eval harness** before deployment, not after.

Most of the engineering work is in steps 4–7, not in the model definition.

## Where Classical Computer Vision Still Wins

Deep learning is not always the right tool.
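For a flavour of what that means in practice: measuring the orientation of a known part from two reference points is pure arithmetic. The sketch below is hypothetical (the fiducial coordinates would come from something like template matching in a real pipeline), but it shows how little code the geometric case needs:

```python
# Hypothetical classical-CV step: compute a fixtured part's rotation
# from two fiducial points (e.g. located by template matching).
# No network, no training data -- just trigonometry.
import math

def part_angle_degrees(p1, p2):
    """Angle of the line from p1 to p2 relative to the x-axis, in degrees."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    return math.degrees(math.atan2(dy, dx))

# Two fiducials on a diagonal give a 45-degree rotation.
angle = part_angle_degrees((10.0, 10.0), (20.0, 20.0))  # 45.0
```

A check like this is exact, inspectable, and runs in microseconds on any hardware, which is the point of the comparisons that follow.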
Classical methods (edge detection, template matching, contour analysis, geometric transforms) beat neural networks when:

- **The task is geometric, not perceptual.** Measuring the angle of a known part on a fixture does not need a CNN.
- **The dataset is tiny.** With twenty examples, a Hough transform or SIFT-based matcher will outperform a poorly trained network.
- **Latency or power is the binding constraint.** A few OpenCV operations run faster than even a quantised network on the smallest devices.
- **Explainability matters.** A classical pipeline can be inspected step by step. A neural network is a black box even when it works.
- **The conditions are tightly controlled.** Fixed lighting, fixed camera, fixed background: exactly the conditions where classical methods were always strongest.

A good practitioner knows when to skip the network entirely. We touched on this trade-off in Computer Vision and Image Understanding.

## Hardware and Deployment Realities

Training and inference have different hardware profiles. Training is throughput-bound and lives in the cloud on big GPUs. Inference is latency-bound and increasingly lives at the edge. The practical knobs:

- **Quantisation.** FP16 or INT8 quantisation typically cuts inference cost 2–4× with minor accuracy loss. Worth the engineering investment for any high-volume deployment.
- **Pruning and distillation.** Train a big model, then distil it into a small one. A common pattern for shipping a 100 MB model derived from a 4 GB teacher.
- **Hardware-aware training.** Models trained with the target hardware in mind (Jetson, Coral, Hailo, mobile NPUs) consistently outperform generic models retargeted late.

Our GPU page goes into the training-side hardware in more depth.

## What to Learn First

If you are new to deep learning for vision and want a path that compounds:

1. **Train a CNN classifier on CIFAR-10 from scratch.** Understand every line.
2. **Fine-tune a pre-trained ResNet** on a custom dataset of your own.
3. **Train a YOLO detector on a small custom set.**
Learn how labels and anchors work.
4. **Use SAM or CLIP for a zero-shot task without training.** Understand what foundation models give you.
5. **Deploy something to a Jetson or Coral.** Latency, memory, and packaging will teach you more than another paper.

Steps 4 and 5 are where most curricula stop short, and they are the ones that matter for shipping work.

## What TechnoLynx Does in This Space

We build deep-learning vision systems for real products, from defect detection on production lines to surveillance analytics to autonomous-vehicle perception. We also know when not to use deep learning, which often saves clients more money than the model itself. If you are evaluating the approach for a project, contact us and we will give you a candid view of what fits.

## Related TechnoLynx perspectives

Compare with adjacent perspectives on custom computer vision software development, computer vision solutions, and how these decisions connect across the broader production computer-vision engineering thread:

- Why Off-the-Shelf Computer Vision Models Fail in Production
- How to Architect a Modular Computer Vision Pipeline for Production Reliability
- Data Quality Problems That Cause Computer Vision Systems to Degrade After Deployment