Introduction

Most beginner guides to computer vision list algorithms and frameworks (“here are five models, here are three libraries, good luck”) and leave the reader to compose them into a project. This guide is structured for the technical reader entering CV from an adjacent field (backend engineer, data scientist, product lead) who needs to make build, buy, or scope decisions inside their first month and ship a working component soon after. The fundamentals matter, but they matter in a particular order: the five-stage pipeline that every CV system follows, the language and tooling choice that fits the workload, the difference between practitioner and researcher deliverables, and the foundation that actually maps to production. See computer vision for the broader subdomain this article serves as an entry point to.

The naive read is that CV fundamentals are a long list of algorithms. The expert read is that CV fundamentals are a small set of decisions and a short list of concepts; the long list of algorithms is second-month reading, not first-week onboarding.

What this means in practice

- Start with the five-stage pipeline; algorithm choices follow from which stage the team is engineering.
- Language choice (Python vs C++) is workload-driven, not preference-driven.
- Practitioner deliverables (production pipelines, evaluation harnesses) differ from researcher deliverables.
- Canonical textbooks remain useful for foundation; 2020+ deep-learning material updates the top of the stack.

What are the five stages of computer vision from acquisition to inference, and where does engineering effort concentrate?

Stage one, acquisition: imaging hardware (camera, sensor, lighting) capturing the input. Engineering effort here decides what the rest of the pipeline can do; poor imaging cannot be fixed downstream.

Stage two, pre-processing: image normalisation, colour conversion, distortion correction, and resizing, the steps that prepare the input for the inference model.
Often classical CV; it has high impact, yet goes unnoticed when done well.

Stage three, inference: the model that extracts the structured information (detections, classifications, segmentations, embeddings). This is the stage that gets the most attention in courses and the most marketing in vendor pitches; in production, it is one of five.

Stage four, post-processing: applying business rules, deduplication, temporal smoothing, and threshold logic to convert model outputs into application-usable signals.

Stage five, integration: writing the signals into the calling application, the database, or the downstream service.

For production teams, engineering effort concentrates outside stage three: acquisition and integration usually consume more engineering time than the model itself. Beginners who concentrate on stage three exclusively ship demos; practitioners who engineer all five ship products.

How does computer vision work end-to-end in a 2026 production stack?

A typical production stack flows: input device → frame capture and timestamp → pre-processing (de-distortion, colour, scale) → inference (one or more models) → post-processing (NMS, tracking, business rules) → output (event, record, downstream call) → observability (metrics, traces, sample storage). Each block is engineered: capture has its own reliability budget, pre-processing has its own CPU or GPU budget, inference has its own latency budget, post-processing has its own correctness budget, output has its own delivery guarantee, and observability has its own sampling and retention policy. The 2026 stack increasingly co-locates pre- and post-processing on the GPU (CUDA kernels for pre-processing, post-processing in CUDA or CUDA-based libraries), runs inference under TensorRT, Triton, or vLLM-class serving runtimes, persists samples for retraining via a feature/sample store, and integrates observability via OpenTelemetry.
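The idea of a pipeline with a budget per stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `Stage` class, the `run_pipeline` helper, the placeholder stage functions, and the millisecond budgets are all hypothetical names and numbers chosen for the example, and only the standard library is used.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Stage:
    """One pipeline block with a latency budget (hypothetical structure)."""
    name: str
    fn: Callable[[Any], Any]
    budget_ms: float  # assumed budget, chosen only for illustration


def run_pipeline(stages: List[Stage], frame: Any) -> Tuple[Any, List[Dict[str, Any]]]:
    """Pass the frame through each stage and time it against its budget."""
    report = []
    data = frame
    for stage in stages:
        start = time.perf_counter()
        data = stage.fn(data)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        report.append({
            "stage": stage.name,
            "elapsed_ms": elapsed_ms,
            "budget_ms": stage.budget_ms,
            "within_budget": elapsed_ms <= stage.budget_ms,
        })
    return data, report


def preprocess(frame: List[int]) -> List[float]:
    # Placeholder: scale 8-bit pixel values into [0, 1].
    return [p / 255.0 for p in frame]


def infer(pixels: List[float]) -> Dict[str, float]:
    # Placeholder "model": mean brightness stands in for a real score.
    return {"score": sum(pixels) / len(pixels)}


def postprocess(result: Dict[str, float]) -> Dict[str, Any]:
    # Placeholder business rule: raise an event above a fixed threshold.
    return {"event": result["score"] > 0.5, **result}


STAGES = [
    Stage("pre-processing", preprocess, budget_ms=5.0),
    Stage("inference", infer, budget_ms=20.0),
    Stage("post-processing", postprocess, budget_ms=2.0),
]

output, report = run_pipeline(STAGES, [0, 128, 255, 255])
for row in report:
    print(f"{row['stage']}: {row['elapsed_ms']:.3f} ms (budget {row['budget_ms']} ms)")
```

In a real system the per-stage report would feed the observability layer rather than `print`, and each stage would likely run in its own process or on its own device; the point of the sketch is only that every block carries a measurable budget.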
The stack is heterogeneous (Python at the edges, C++ in the hot path, CUDA where it matters), and the architecture choice matters more than individual library choices. Beginners who treat the stack as one Python script ship demos that do not scale; practitioners who design the stack as a pipeline with budgets per stage ship systems that operate.

Which language (Python vs C++) fits which CV workload, and why is that no longer a religious debate?

Python fits research, prototyping, model training, evaluation, and the orchestration layer of production stacks. The libraries (OpenCV-Python, PyTorch, TensorFlow, Pillow) are mature, the iteration speed is high, and the deployment story (TorchScript, ONNX export, Triton serving) has matured to the point where Python-led pipelines can ship to production. Python is the right language for the team’s day-to-day CV work in most cases.

C++ fits the hot path where latency or throughput requirements push beyond what Python orchestration can absorb. Latency-critical loops (sub-10ms tracking, sensor fusion, real-time control), embedded deployments (ROS nodes, edge appliances), and framework-internal extensions (custom CUDA ops, custom operators) live in C++.

Modern CV development is rarely C++-only or Python-only; it is Python at the boundaries with C++ in the hot inner loops, with the boundaries chosen by latency budget rather than by team preference. The debate is no longer religious because the toolchain handles the bridge (pybind11, Cython, TorchScript, ONNX, CUDA Python), and the team’s choice is to put each component in the language where it pays for itself.

What separates a CV practitioner from a CV researcher in deliverables and tooling?

CV researcher deliverables: papers, benchmark results, ablation studies, novel architectures, public code releases. Tooling: experiment-management platforms, hyperparameter sweeps, large-scale training infrastructure, paper-writing workflows.
The optimisation target is novelty and benchmark performance, with reproducibility for the research community.

CV practitioner deliverables: production pipelines, evaluation harnesses on the team’s data, latency and accuracy reports against SLA, runbooks, incident reviews, retraining pipelines, sample stores, monitoring dashboards. Tooling: production model servers (Triton, TensorRT-LLM, vLLM), MLOps platforms, feature/sample stores, observability stacks (OpenTelemetry, Prometheus, Grafana), data-versioning, model-versioning, A/B testing. The optimisation target is operational outcome: accuracy on the deployed harness, latency at SLA, cost per inference, and time-to-rollback when a deployment regresses.

Beginners who confuse the two (adopting researcher tooling for a production problem, or shipping a researcher deliverable when the team needed a practitioner one) produce work that does not fit the team. The first onboarding question is which role the team needs, and the work follows.

Where do the canonical CV textbooks (Szeliski, Nixon, Forsyth) still hold up, and where do they need refresh?

Szeliski’s “Computer Vision: Algorithms and Applications” (currently in its second edition with regular updates) remains the production-engineer-friendly reference for geometric vision (camera models, stereo, structure-from-motion, multi-view), classical image processing, and the conceptual scaffolding around modern CV. The 2022 second edition incorporates deep-learning material throughout, making it the closest canonical text to current practice. Use it for foundation and as a long-term reference.

Nixon and Aguado’s “Feature Extraction and Image Processing” remains the strongest reference on classical feature extraction (SIFT, ORB, HOG, edges, contours), which still ships in production pipelines as ROI selection, image registration, low-power preprocessing, and as the fallback when deep methods do not apply.
The deep-learning material in older editions is dated; pair with current deep-learning resources.

Forsyth and Ponce’s “Computer Vision: A Modern Approach” remains useful for the breadth of CV concepts, but the deep-learning era requires supplementing it with current material.

For 2026 practice, the canonical books cover the durable foundation; deep-learning specifics come from current papers, courses, and library documentation that update faster than textbooks.

What is the minimal foundation needed to ship a production CV system in a real engineering team?

Foundation list:

- Linear algebra (vector spaces, matrix operations, eigenvalues) to the level of being comfortable with image transformations, embeddings, and the mathematical framing of CV operations: engineering-comfort depth, not research depth.
- Probability and statistics to the level of understanding evaluation metrics (precision, recall, mAP, AUC, FAR/FRR), sampling for evaluation, and bias measurement.
- Classical image processing fundamentals: image representation, colour spaces, convolutions, common filters, geometric transformations.
- Deep-learning fundamentals at the practical level: CNN architectures and what each layer does, training loops, transfer learning, common loss functions for classification/detection/segmentation.
- The dominant frameworks (PyTorch primarily, TensorFlow if the team uses it) at the practical level.
- The toolchain: OpenCV for classical operations, NumPy/SciPy for numerical work, dataset and evaluation tooling.
- Production basics: model serving, latency measurement, accuracy measurement, observability, version control for data and models.

The minimal foundation skips research-grade depth (the team can learn that as needed) and front-loads what enables shipping; teams that demand a researcher-grade foundation before letting beginners ship over-train and underdeliver. Shipping-first onboarding produces practitioners faster.
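As one concrete instance of the evaluation metrics named in the foundation list, precision and recall for a binary classifier can be computed directly from label pairs. The sketch below uses only plain Python; the function name `precision_recall` and the sample labels are made up for illustration and are not taken from any particular evaluation library.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predictions or no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth and predictions for seven samples.
y_true = [1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Here three of the four positive predictions are correct (precision 0.75) and three of the five actual positives are found (recall 0.60). Detection metrics such as mAP build on the same counts, with a box-overlap rule deciding what counts as a true positive.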
How TechnoLynx Can Help

TechnoLynx works with teams onboarding CV engineers and scoping first-CV projects: five-stage pipeline design, language and tooling selection, foundation curriculum tailored to practitioner deliverables, and the production discipline that turns a beginner team into a shipping team. If your team is building CV capability and wants the path that ships products rather than papers, contact us.

Image credits: Freepik