AI Video Analytics: How It Works and What It Means in Practice

“AI video analytics” sounds like one capability you switch on. It is not. Behind that phrase sits a chain of inference stages — decode, detect, track, classify, index — and each stage has its own compute economics. Treat the chain as a single GPU workload and you will overpay: you put acceleration where it does nothing and starve the stages that actually need it.

That is the core mistake we see when teams price out video analytics for the first time. The vendor demo runs on a GPU, so the assumption forms that the whole pipeline belongs on a GPU. In practice the pipeline is heterogeneous. Some stages are dense matrix math that maps cleanly onto CUDA cores; others are branch-heavy, memory-bound, or simply cheap enough that a CPU thread handles them without anyone noticing. The skill is not “buy a GPU”; it is knowing which stage is which.

What Does AI Video Analytics Actually Do, Stage by Stage?

Strip away the marketing and a video analytics system is a sequence of transformations applied to a stream of frames. Each transformation answers a different question, and each has a different cost profile.

Decode turns the compressed bitstream (H.264, HEVC, AV1) back into raw frames. This is codec work, not neural-network work. Modern GPUs carry dedicated decode hardware (NVDEC on NVIDIA parts; codec support varies by GPU generation) that handles it without occupying the CUDA cores. That matters, because it means decode does not compete with detection for the same compute units, and the common assumption that it does is a misconception.
Detect runs an object detector over each decoded frame to find regions of interest — people, vehicles, faces, logos. This is the canonical dense-inference stage: a convolutional or transformer backbone, heavy on matrix multiplication, and the stage that most justifies a GPU.
Track associates detections across frames so that “person A in frame 100” and “person A in frame 101” are understood as the same entity. Tracking is often classical computer vision (Kalman filters for motion, the Hungarian algorithm for assignment), and it frequently finishes on a CPU in less time than it would take just to ship the data to a GPU and back.
Classify takes a detected region and assigns a finer label: this vehicle is a delivery van, this garment is a jacket, this clip contains a brand mark. Sometimes a small dedicated model, sometimes a head on the detector. Cost depends entirely on the model.
Index writes the structured output — timestamps, bounding boxes, labels, embeddings — into a store you can query later. This is I/O and database work, not inference at all.

Read that list and the central point follows on its own: only one or two of those five stages are unambiguously GPU-shaped. Decode has its own dedicated path; tracking and indexing are often cheaper elsewhere; classification is a coin toss that depends on the model you chose. We walk through what this looks like as a running production system in our explainer on how video analysis works in production, which sits one level of abstraction below this one.
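
To make the point about tracking concrete, here is a minimal sketch of frame-to-frame association in plain Python. It is an illustration, not a production tracker: it uses greedy IoU matching as a simpler stand-in for the Hungarian assignment mentioned above, and it has no motion model. The names iou, associate, and min_iou are hypothetical. Note that the whole stage is sorting and branching over a handful of boxes, which is the kind of work a CPU handles well.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box], detections: List[Box], min_iou: float = 0.3
) -> Dict[int, int]:
    """Greedily map track id -> detection index, best overlap first.

    Assumption: greedy matching is a simplification; an optimal
    (Hungarian) assignment can differ when boxes overlap heavily.
    """
    pairs = sorted(
        (
            (iou(box, det), tid, di)
            for tid, box in tracks.items()
            for di, det in enumerate(detections)
        ),
        reverse=True,
    )
    matched: Dict[int, int] = {}
    used_detections = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in matched or di in used_detections:
            continue
        matched[tid] = di
        used_detections.add(di)
    return matched


tracks = {1: (0.0, 0.0, 10.0, 10.0), 2: (20.0, 20.0, 30.0, 30.0)}
detections = [(21.0, 21.0, 31.0, 31.0), (1.0, 0.0, 11.0, 10.0)]
print(associate(tracks, detections))
```

Unmatched tracks and unmatched detections are simply left out of the result; a real system would age out the former and start new tracks from the latter.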

How Is This Different From a Single AI Model You Switch On?

A single model has one cost number: latency per inference, throughput at saturation, memory footprint. You can reason about it as a unit. A pipeline does not have one cost number — it has a cost mix, and the mix shifts with the workload.

Consider two deployments of nominally the same analytics stack. One processes a sparse warehouse camera where detections are rare; the other processes a crowded transit concourse where every frame is full of people. Same stages, wildly different economics: the concourse stream spends far more time in tracking and classification because there are more objects to track and classify, while the warehouse stream spends most of its budget in decode and detection regardless of how empty the scene is.

This is why “what does AI video analytics cost” has no single answer. The honest answer is a function of your workload mix. A stream where 80% of compute lands in detection wants a GPU; a stream where detection is sparse and tracking is dense may not justify one at all. What drives the divergence is not the hardware; it is the scene.
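
The shifting mix can be sketched with a toy cost model. Everything numeric below is a made-up placeholder, not a measurement: the model simply assumes decode and detection cost a fixed amount per frame, while tracking, classification, and indexing scale with the number of objects in the frame. The names COST_MS and stage_mix are hypothetical.

```python
from typing import Dict

# Hypothetical per-unit costs in milliseconds. Replace with profiled values.
COST_MS = {
    "decode_per_frame": 2.0,
    "detect_per_frame": 8.0,
    "track_per_object": 0.5,
    "classify_per_object": 1.5,
    "index_per_object": 0.2,
}


def stage_mix(
    objects_per_frame: float, cost_ms: Dict[str, float] = COST_MS
) -> Dict[str, float]:
    """Return each stage's share of per-frame compute (shares sum to 1)."""
    per_frame = {
        "decode": cost_ms["decode_per_frame"],
        "detect": cost_ms["detect_per_frame"],
        "track": cost_ms["track_per_object"] * objects_per_frame,
        "classify": cost_ms["classify_per_object"] * objects_per_frame,
        "index": cost_ms["index_per_object"] * objects_per_frame,
    }
    total = sum(per_frame.values())
    return {stage: ms / total for stage, ms in per_frame.items()}


warehouse = stage_mix(objects_per_frame=0.2)   # sparse scene
concourse = stage_mix(objects_per_frame=40.0)  # crowded scene

for name, mix in (("warehouse", warehouse), ("concourse", concourse)):
    print(name, {stage: round(share, 2) for stage, share in mix.items()})
```

With these placeholder costs the sparse stream is dominated by decode and detection and the crowded stream by tracking and classification. The specific shares mean nothing; the point is that the same stack produces a different mix once the scene changes.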

Which Stages Belong on a GPU, and Which Stay on CPU?

Here is the placement logic as a decision surface. The rows are the five canonical stages; the columns are what determines placement.

Stage	Compute character	Default placement	Moves to GPU when…
Decode	Codec / fixed-function	Dedicated decode hardware (NVDEC) or CPU	Stream count exceeds CPU decode budget; co-locate with detection to avoid PCIe copies
Detect	Dense matrix math	GPU	Almost always — this is the GPU-shaped stage
Track	Branch-heavy, classical CV	CPU	Detector emits dense per-frame outputs and tracking is fused into the model
Classify	Model-dependent	Depends on model size	Classifier is a deep network rather than a shallow head
Index	I/O, database	CPU / storage tier	Effectively never — this is not inference
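
The table can also be read as a small function. The sketch below encodes its rows directly; the Workload fields and the placement labels are hypothetical names chosen for illustration, and the three booleans stand in for judgments you would make from your own measurements.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Workload:
    # Hypothetical flags, one per "Moves to GPU when" condition in the table.
    decode_exceeds_cpu_budget: bool
    tracking_fused_into_detector: bool
    classifier_is_deep_model: bool


def placement(w: Workload) -> Dict[str, str]:
    """Map each of the five stages to a default placement."""
    return {
        # Decode leaves the CPU only when the stream count demands it.
        "decode": "gpu decode hardware" if w.decode_exceeds_cpu_budget else "cpu",
        # Detection is the GPU-shaped stage in almost every case.
        "detect": "gpu",
        # Classical tracking stays on CPU unless the model does it internally.
        "track": "gpu" if w.tracking_fused_into_detector else "cpu",
        # A shallow head rides along with the detector; a deep model is its own stage.
        "classify": "gpu" if w.classifier_is_deep_model else "with detector",
        # Indexing is I/O and database work, never inference.
        "index": "cpu/storage",
    }


few_streams = Workload(
    decode_exceeds_cpu_budget=False,
    tracking_fused_into_detector=False,
    classifier_is_deep_model=False,
)
print(placement(few_streams))
```

For this example workload only detection lands on the GPU, which is the outcome the table predicts for a small deployment with a classical tracker and a shallow classification head.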

The table encodes the same pattern that explains why uniformly GPU-accelerated chains leave compute idle. When you push every stage onto the GPU, the branch-heavy and I/O-bound stages stall the device while contributing little, and your expensive accelerator sits underutilised between detection bursts. We describe that failure mode in detail in our analysis of why GPU utilisation runs low even on GPU-heavy workloads, where mismatched stage placement is the recurring culprit.

Why latency-bound and throughput-bound stages need different treatment deserves grounding rather than assertion. Detection benefits from batching frames to fill the GPU; tracking needs each frame’s result immediately to feed the next. Those are opposite optimisation targets, and the throughput-versus-latency tension behind them is the same one the benchmarking discipline has worked out carefully (see the breakdown of the throughput-versus-latency trade-off in AI inference). Rather than re-derive that reasoning here, we lean on it: it is why you cannot tune the whole pipeline to one number.
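
A small piece of arithmetic shows the tension. The sketch assumes the batch is filled from a single stream, so the first frame has to wait for the rest to arrive; batching across many streams shortens that wait. The function name batch_fill_delay_ms is hypothetical, and the figure covers only the wait to fill the batch, not inference time.

```python
def batch_fill_delay_ms(batch_size: int, fps: float) -> float:
    """Time the first frame of a batch waits for the remaining frames.

    Assumption: all frames in the batch come from one stream at a
    constant frame rate, so frames arrive every 1/fps seconds.
    """
    if batch_size < 1 or fps <= 0:
        raise ValueError("batch_size must be >= 1 and fps must be > 0")
    return (batch_size - 1) / fps * 1000.0


for batch in (1, 4, 8):
    delay = batch_fill_delay_ms(batch, fps=25.0)
    print(f"batch={batch}: first frame waits {delay:.0f} ms before inference starts")
```

A batch of one adds no wait, which is what the tracker wants; larger batches are what the detector wants. Whatever batch size you pick is a compromise between those two stages, not a setting that suits both.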

How Do I Map My Own Analytics Chain to GPU-vs-CPU Placement?

You do not need a profiler to start. You need to answer a short set of questions honestly, and the answers tell you where the money goes before you commit to any acceleration spend.

How many concurrent streams, at what resolution and frame rate? This sets your decode budget. A handful of 1080p streams decode fine on CPU; dozens of 4K streams need dedicated decode hardware.
How dense are detections per frame? Sparse scenes spend their budget in decode and detection; dense scenes shift it into tracking and classification.
Is your detector running every frame, or every Nth frame? Frame skipping is often the single biggest lever on detection cost, and it costs nothing to try.
Is classification a shallow head or a separate deep model? A head rides along with detection at near-zero marginal cost; a separate deep classifier is its own GPU-shaped stage.
What is your indexing and query load? This is CPU and storage work — if it dominates your bill, no GPU will help.
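
The first and third questions reduce to simple arithmetic, sketched below. The function names are hypothetical and the stream counts are example inputs, not recommendations. Pixel rate is only a rough proxy for decode load, since real decode cost also depends on codec and bitrate.

```python
def decode_pixel_rate(streams: int, width: int, height: int, fps: float) -> float:
    """Pixels per second the decode stage must produce (rough load proxy)."""
    return streams * width * height * fps


def detector_calls_per_second(streams: int, fps: float, every_nth: int = 1) -> float:
    """Detector invocations per second when only every Nth frame is detected."""
    if every_nth < 1:
        raise ValueError("every_nth must be >= 1")
    return streams * fps / every_nth


small = decode_pixel_rate(streams=4, width=1920, height=1080, fps=25.0)
large = decode_pixel_rate(streams=24, width=3840, height=2160, fps=25.0)
print(f"4 x 1080p: {small:,.0f} pixels/s")
print(f"24 x 4K:   {large:,.0f} pixels/s ({large / small:.0f}x the small case)")

every_frame = detector_calls_per_second(streams=24, fps=25.0)
every_fifth = detector_calls_per_second(streams=24, fps=25.0, every_nth=5)
print(f"detector calls/s: {every_frame:.0f} every frame, {every_fifth:.0f} every 5th")
```

Frame skipping cuts detector invocations in direct proportion to N while leaving decode load unchanged, so it only helps when detection, not decode, is where the budget goes.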

Walk those five questions and you have a first-order estimate of cost-per-stream and where acceleration earns its keep. That estimate is, in our experience, usually enough to kill a fleet-wide GPU rollout that would have placed accelerators on streams that never needed them (observed across TechnoLynx media engagements; not a benchmarked figure). The discipline is decomposition first, hardware second.

This conceptual map is the precondition for the cost discipline that the applied work measures directly: cost-per-analytics-hour against value-per-analytics-hour, real GPU utilisation, and the avoided cost of over-provisioning. The companion piece, when GPU-accelerated video analytics earns its cost in media pipelines, shows what profiling the workload mix actually delivers once you have the map in hand. And because benchmark numbers from a vendor rarely match your scene, it is worth understanding why benchmark figures fail to predict real-workload performance before you trust a quoted throughput.

What About CCTV and Surveillance — Are the Stages Different?

The vocabulary differs but the chain does not. “Types of video analytics in CCTV” usually means a list of outcomes — intrusion detection, line crossing, loitering, people counting, licence-plate recognition — rather than a list of stages. Every one of those outcomes is built from the same decode-detect-track-classify-index chain underneath.

Line crossing is detection plus tracking plus a geometric rule. People counting is detection plus tracking plus an aggregation at the index stage. Licence-plate recognition is detection plus a specialised classification (OCR) stage. The outcome names are configurations of the same five stages, which is precisely why the placement logic above transfers directly from broadcast media to surveillance: a transit-camera analytics chain and a broadcast logo-detection chain are the same machine pointed at different scenes. For the broadcast-specific framing of this work, our media and broadcast engineering overview lays out where it sits in a production pipeline, and the underlying acceleration economics are covered in our GPU engineering practice.
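
The “geometric rule” in line crossing is small enough to show in full. The sketch below assumes the tracker supplies an object’s previous and current centroid, and tests whether that movement strictly crosses a configured line segment using signed areas (a standard segment-intersection test). The names crossed and _side are hypothetical, and touching an endpoint or moving along the line is deliberately not counted.

```python
from typing import Tuple

Point = Tuple[float, float]


def _side(a: Point, b: Point, p: Point) -> float:
    """Signed area: the sign tells which side of the line a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def crossed(line_a: Point, line_b: Point, prev: Point, curr: Point) -> bool:
    """True if the move prev -> curr strictly crosses the segment line_a-line_b."""
    # The object's two positions must lie on opposite sides of the line...
    d1 = _side(line_a, line_b, prev)
    d2 = _side(line_a, line_b, curr)
    # ...and the line's endpoints on opposite sides of the object's path.
    d3 = _side(prev, curr, line_a)
    d4 = _side(prev, curr, line_b)
    return d1 * d2 < 0 and d3 * d4 < 0


line = ((0.0, 0.0), (10.0, 0.0))
print(crossed(*line, prev=(5.0, -1.0), curr=(5.0, 1.0)))   # path passes through the line
print(crossed(*line, prev=(5.0, 1.0), curr=(6.0, 2.0)))    # stays on one side
print(crossed(*line, prev=(20.0, -1.0), curr=(20.0, 1.0))) # passes beyond the segment's end
```

People counting is then a running total of these crossing events per track, which is the aggregation step described above. None of this is inference; it is a few multiplications per tracked object per frame.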

FAQ

How does AI video analytics work, and what does it mean in practice?

AI video analytics is a chain of inference stages — decode, detect, track, classify, index — applied to a stream of frames. In practice it means a heterogeneous workload, not a single capability: each stage answers a different question and carries a different compute cost, so the system’s behaviour and economics depend on which stages dominate for your particular streams.

What are the distinct stages of an AI video analytics pipeline (decode, detect, track, classify, index)?

Decode turns the compressed bitstream back into raw frames; detect finds regions of interest with an object detector; track associates detections across frames into consistent entities; classify assigns finer labels to detected regions; index writes the structured output to a queryable store. Decode is codec work, detect is dense inference, track is classical computer vision, classify is model-dependent, and index is I/O.

Which analytics stages benefit from GPU acceleration and which stay cheaper on CPU?

Detection is the unambiguously GPU-shaped stage and almost always belongs there. Decode runs on dedicated decode hardware or CPU; tracking is branch-heavy classical CV that often finishes on a CPU in less time than the round trip to a GPU would take; indexing is I/O work that is not inference at all. Classification depends on whether it is a shallow head or a separate deep model.

How is AI video analytics different from a single AI model you switch on?

A single model has one cost number you can reason about as a unit. A pipeline has a cost mix that shifts with the workload — a sparse scene spends its budget in decode and detection, while a crowded scene shifts cost into tracking and classification. That is why “what does video analytics cost” has no single answer: it is a function of your scene and stream mix.

How do I map my own analytics chain to GPU-vs-CPU placement before committing to acceleration spend?

Answer five questions: stream count and resolution (decode budget), detection density per frame, whether you detect every frame or every Nth, whether classification is a head or a separate model, and your indexing and query load. Those answers give a first-order estimate of cost-per-stream and show where acceleration earns its keep, often before any profiling is needed.

What types of video analytics are commonly used in CCTV and surveillance, and how do their stages map?

Intrusion detection, line crossing, loitering, people counting, and licence-plate recognition are outcomes, not stages — each is built from the same decode-detect-track-classify-index chain. Line crossing is detection plus tracking plus a geometric rule; people counting adds aggregation at the index stage; plate recognition is detection plus a specialised OCR classification. The placement logic transfers unchanged from broadcast to surveillance.

The thing to carry away is that the question “should this run on a GPU?” is the wrong question to ask of a pipeline. Ask it of each stage. The chain is heterogeneous by construction, and the moment you treat decode, detection, tracking, classification, and indexing as one undifferentiated workload, you have already lost the cost argument. Decompose first; the hardware decision falls out of the workload mix, not the other way around.