A video-analytics pipeline that posts an impressive average throughput number can still miss the frames that matter. Real-time computing is not a synonym for fast. It is a guarantee about behaviour under load: every result the system promises to deliver must arrive inside a stated deadline, and the system is judged on whether it holds that deadline when the input rate is at its worst, not on how quickly it runs on average.

That distinction sounds pedantic until it costs you a broadcast. A team building a live sports-graphics overlay or an automated compliance-monitoring feed will measure frames per second, see a comfortable margin, and ship. Then a scene change spikes the per-frame work, the object detector’s queue backs up, and for a few hundred milliseconds the overlay lags the action. The average was fine. The frames that mattered were late.

That gap between average speed and deadline adherence is what real-time computing is actually about.

How Does Real-Time Computing Work in Practice?

A real-time system is defined by a deadline, not a clock speed. Each unit of work (in video analytics, typically a frame or a per-stream inference result) carries a latency budget: the maximum time allowed between the input arriving and the result being ready. The system is correct only if it meets that budget under the load it is specified to handle. A computation that produces the right answer too late is, for real-time purposes, wrong.

This reframing changes what you optimise. Average throughput tells you how much work the pipeline can chew through over a window. It says almost nothing about whether any individual frame met its deadline. Two pipelines can have identical average throughput while one holds a 33 ms per-frame budget at the 99th percentile and the other blows past it on every burst. The headline number hides the failure.
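
To make the point about identical averages concrete, here is a minimal Python sketch. The latency samples are hypothetical, chosen only so that both pipelines share the same mean per-frame latency (and so the same average throughput if frames are processed one at a time); the `percentile` and `miss_rate` helpers are illustrative names, not part of any library.

```python
import math
from statistics import mean

BUDGET_MS = 33.0  # per-frame budget for a 30 fps stream, as in the text


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def miss_rate(samples, budget_ms):
    """Fraction of samples that exceeded the latency budget."""
    return sum(1 for s in samples if s > budget_ms) / len(samples)


# Hypothetical per-frame latencies in milliseconds, 100 frames each.
steady = [20.0] * 100               # every frame takes 20 ms
bursty = [15.0] * 90 + [65.0] * 10  # mostly faster, but bursts take 65 ms

for name, samples in (("steady", steady), ("bursty", bursty)):
    print(name, mean(samples), percentile(samples, 99), miss_rate(samples, BUDGET_MS))
```

Both pipelines report a 20 ms mean, yet the steady one has a p99 of 20 ms and no misses while the bursty one has a p99 of 65 ms and misses one frame in ten. Only the tail figure and the miss rate tell them apart.
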
Real-time computing, then, is a design posture: you decide up front which results carry a deadline, you size the system to honour those deadlines under worst-case load, and you measure adherence rather than raw speed. The relevant question is never “how fast is the pipeline” but “what fraction of deadline-bearing results arrived on time when the system was under stress.” Everything else is secondary.

What Is the Difference Between Real-Time and Just Running Fast?

Running fast is an average property. Running in real time is a tail property. The two diverge precisely where it hurts: at the moments of peak load, when queues fill and the slowest results determine whether the system did its job.

Consider a pipeline ingesting multiple camera streams: decode, motion detection, object detection on regions of interest, tracking, and an event-classification stage. Optimise for average throughput and you will happily batch frames across streams to keep the GPU busy, because batching raises utilisation and aggregate frames per second. But batching adds queueing delay: a frame waits for its batch-mates before the GPU touches it. The average improves; the tail gets worse. For a stage with a hard per-frame deadline, that trade is a regression even though the dashboard looks better.

The throughput-versus-latency tension is the core of the matter, and it is worth understanding the measurement reasoning behind it rather than treating it as folklore. The relationship between how throughput is actually defined for AI inference and the per-request latency a deadline cares about explains why a system tuned for one routinely sacrifices the other. Real-time design means choosing latency where the deadline lives and reserving throughput optimisation for the stages that can absorb it.

Hard, Firm, and Soft Deadlines in a Video-Analytics Pipeline

“Real-time” is not one constraint.
The classical taxonomy of hard, firm, and soft deadlines maps cleanly onto the stages of a real analytics pipeline, and getting the mapping right is most of the design work.

A hard deadline means a missed result is a system failure: the late result is worthless and the consequence is serious. A firm deadline means a late result is discarded; it has no value past its deadline, but missing the occasional one is tolerable. A soft deadline means a late result still has value, degrading gradually the later it arrives. Most video-analytics functions are firm or soft; genuinely hard deadlines are rarer than teams assume, and conflating them is how budgets get over-spent.

Deadline Type vs. What Happens When You Miss It

| Deadline type | A miss means | Typical analytics stage | Design implication |
| --- | --- | --- | --- |
| Hard | System failure; late result is worthless and harmful | Live broadcast graphics sync, safety-critical trigger | Provision for worst-case load; no best-effort fallback |
| Firm | Late result is discarded; occasional miss tolerable | Live overlay detection, real-time alerting | Drop late frames cleanly; size for p99, not peak |
| Soft | Late result degrades in value but is still usable | Live captioning, viewer-facing tagging | Best-effort with backpressure; degrade gracefully |
| Best-effort (no deadline) | No deadline; result valuable whenever it lands | Archival indexing, post-hoc moderation, analytics rollups | Batch aggressively; run on spare or CPU capacity |

The practical payoff of this table is that it tells you where to spend. Only the hard and firm rows justify provisioning GPU capacity to hold latency under load. The soft and best-effort rows can share capacity, batch, run on CPU, or absorb backpressure, which is exactly where cost gets recovered.

Why Tail Latency Beats Average Throughput as the Real-Time Metric

If you measure one thing for a real-time analytics stage, measure the tail, meaning the p99 (or p99.9) latency, not the average. The average is dominated by the easy frames.
The deadline is broken by the hard ones, and those live in the tail.

Here is the worked logic. Suppose a detection stage has a 33 ms per-frame budget (a 30 fps stream) and a mean inference time of 18 ms. The mean looks safe: the budget is almost double it. But if the p99 latency under load is 41 ms, then roughly one frame in a hundred misses its deadline. On a single 30 fps stream that is about 0.3 missed frames per second, or one every three seconds or so. Across a 50-stream fleet it is roughly 15 late results per second, a steady drizzle that the average happily conceals. (Illustrative figures, used to show the arithmetic; your numbers depend on model, resolution, and batch policy.)

Tail latency is also where peak-versus-steady-state behaviour shows up. A system that looks fine in a short test can degrade once queues warm up and thermal or memory pressure sets in; the distinction between peak and steady-state performance under sustained load is exactly the difference between a benchmark that passed and a pipeline that holds its deadline in production. The real-time metric, the share of results delivered inside the budget at p99 under sustained load, is the one that predicts production behaviour.

This connects directly to cost: honouring a deadline only on the stages that carry one avoids over-provisioning GPU capacity for stages that could have run best-effort. That is the difference between a fleet-wide low-latency build and a targeted one, and it is usually a large number.

Where GPU Acceleration Helps a Stage Meet Its Budget — and Where It Doesn’t

GPUs help a stage meet its deadline when the stage is compute-bound and parallelisable: dense convolution, transformer attention over image patches, large-batch matrix work. For object detection or segmentation at high resolution, moving the kernel onto a GPU and compiling it with a runtime like TensorRT or running it through an optimised path such as cuDNN can pull the per-frame compute time well inside the budget.
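
One way to make the compute-bound condition concrete is an Amdahl-style estimate: only the compute share of a stage's latency shrinks when the kernel gets faster. The sketch below is a rough model under stated assumptions, not a profiling method. It assumes transfer, host work, and compute run back to back with no overlap, and the stage breakdowns, the 4x speedup, and the function name `latency_after_speedup` are all hypothetical.

```python
BUDGET_MS = 33.0  # per-frame budget for a 30 fps stream, as in the text


def latency_after_speedup(transfer_ms, compute_ms, host_ms, compute_speedup):
    """Estimate per-frame latency when only the compute portion is accelerated.

    Assumes the three portions run serially, with no overlap between them.
    """
    return transfer_ms + host_ms + compute_ms / compute_speedup


# Two hypothetical stages, both 45 ms per frame before acceleration.
compute_bound = {"transfer_ms": 2.0, "compute_ms": 40.0, "host_ms": 3.0}
transfer_bound = {"transfer_ms": 30.0, "compute_ms": 10.0, "host_ms": 5.0}

for name, stage in (("compute-bound", compute_bound), ("transfer-bound", transfer_bound)):
    after = latency_after_speedup(**stage, compute_speedup=4.0)
    print(name, after, "meets budget" if after <= BUDGET_MS else "misses budget")
```

With these numbers the compute-bound stage drops from 45 ms to 15 ms and fits the budget, while the transfer-bound stage only reaches 37.5 ms and still misses it. A measured breakdown from a profiler would replace the hypothetical figures in practice.
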

But GPU horsepower does not automatically produce real-time behaviour, and assuming it does is the naive trap this whole subject exists to correct. A few patterns we see regularly:

- Memory transfer dominates. If a stage spends more time moving frames across PCIe than computing on them, adding a faster GPU barely moves the deadline. The fix is keeping data resident, fusing stages, or batching the transfer, not more FLOPS.
- Batching to raise utilisation breaks the deadline. A GPU left at low utilisation tempts you to batch across streams. That raises throughput and adds queueing latency, the opposite of what a deadline-bearing stage needs. Underutilisation on a no-deadline stage is fine; on a deadline stage it is a symptom you must read correctly.
- The bottleneck is on the CPU. Decode, colour conversion, and post-processing often run on the host. A GPU that is starved by a slow CPU pre-stage will miss deadlines no matter how fast the kernel is. When GPU and CPU stages disagree about who owns the latency, the deadline is the casualty.

The right move is per-stage: profile each stage against its own budget, and accelerate the stage that is actually on the critical path for the deadline you care about. Throwing a bigger GPU at the whole pipeline accelerates stages that already met their deadline while the laggard stays a laggard.

How Do You Confirm a Pipeline Meets Its Deadlines Under Production Load?

Profiling for real-time adherence is not the same as profiling for throughput. You are not asking “how many frames per second.” You are asking “under the load this pipeline is specified for, what fraction of deadline-bearing results landed inside their budget, and which stage broke first when they didn’t.”

Real-Time Adherence Checklist

- State the budget per stage. Write down the per-frame or per-stream latency deadline for each analytics stage, and label each one hard, firm, soft, or best-effort. A stage without a written deadline is implicitly best-effort.
- Drive realistic worst-case load. Replay production traffic at peak input rate with realistic content variation (scene changes, crowd scenes, fast motion), not a synthetic loop of one easy frame.
- Measure the tail, not the mean. Record p99 (and p99.9 for hard stages) latency per stage and end-to-end, under sustained load, after queues have warmed up.
- Find the breaking stage. Identify which stage’s tail first exceeds its budget as load rises. That stage owns the deadline failure.
- Separate deadline stages from best-effort. Confirm the best-effort stages are not stealing capacity from deadline stages, and that deadline stages are not over-provisioned for work that has no deadline.
- Re-measure after each change. Tail latency is sensitive to batch policy, memory residency, and co-tenancy; a change that helps the average can hurt the tail.

A GPU performance audit scoped to video-analytics workloads does exactly this profiling: it walks each analytics stage against its latency budget and names which stages carry a real-time deadline and which can run best-effort on spare or CPU capacity. That naming step is where over-provisioning gets caught, because it forces the question most pipelines never ask: does this stage actually have a deadline, or did we just assume it did?

This is the same profile-first discipline that decides when GPU-accelerated video analytics earns its cost in media pipelines: you do not pay for acceleration a stage doesn’t need, and you do not skimp on a stage whose deadline you’ve promised to hold. For the broader picture of what these pipelines actually do with the frames once they meet their deadlines, see how AI video analytics works in practice. The pipelines that run reliably in production at broadcasters and media operators (the ones described across our media and broadcast work) are the ones built around explicit deadlines from the start.

FAQ

How does real-time computing work, and what does it mean in practice?
A real-time system is defined by a deadline: each result must arrive within a stated latency budget under the load the system is specified to handle. A correct-but-late answer counts as a failure. In video analytics, that budget is typically per-frame or per-stream, and the system is judged on whether it holds the deadline under worst-case load rather than on its average speed.

What is the difference between real-time computing and simply running a pipeline as fast as possible?
Running fast is an average property; running in real time is a tail property. The two diverge at peak load, where queues fill and the slowest results decide whether the system did its job. A pipeline tuned for average throughput will often batch work to raise utilisation, which improves the average while adding queueing delay that breaks per-frame deadlines.

How do hard, firm, and soft real-time deadlines apply to a video-analytics pipeline?
A hard deadline means a missed result is a system failure; a firm deadline means a late result is discarded but occasional misses are tolerable; a soft deadline means a late result still has value that degrades over time. Most analytics functions are firm or soft (live alerting and overlay detection are firm, captioning is soft), while archival indexing and post-hoc moderation carry no deadline at all and run best-effort.

Why is tail latency (p99) more meaningful than average throughput for a real-time analytics deadline?
The average is dominated by easy frames, while the deadline is broken by the hard ones that live in the tail. A stage with an 18 ms mean against a 33 ms budget can still miss roughly one frame in a hundred if its p99 latency is 41 ms, a miss the average completely conceals. Measuring the share of results delivered inside the budget at p99 under sustained load is what predicts production behaviour.

Which video-analytics stages actually carry a real-time deadline, and which can run best-effort?
Live broadcast graphics sync, real-time alerting, and live overlay detection carry hard or firm deadlines; live captioning and viewer-facing tagging are typically soft; archival indexing, post-hoc moderation, and analytics rollups carry no deadline and can batch on spare or CPU capacity. Only the hard and firm stages justify provisioning GPU capacity to hold latency under load; cost is recovered by not over-provisioning the rest.

How does GPU acceleration help — or fail to help — a stage meet its latency budget?
A GPU helps when a stage is compute-bound and parallelisable (dense convolution or attention work), where moving the kernel onto the GPU pulls per-frame time inside the budget. It fails to help when memory transfer across PCIe dominates, when batching to raise utilisation adds queueing latency, or when a slow CPU pre-stage starves the GPU. The fix is per-stage: accelerate the stage actually on the critical path for the deadline, not the whole pipeline.

How do you profile a pipeline to confirm it meets its real-time deadlines under production load?
State the latency budget and deadline class for each stage, drive realistic worst-case load with content variation, then measure p99 (and p99.9 for hard stages) per stage and end-to-end after queues warm up. Identify which stage’s tail first exceeds its budget, confirm best-effort stages are not stealing capacity from deadline stages, and re-measure after every change because tail latency is sensitive to batch policy and co-tenancy.
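
The profiling steps in the last answer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a measurement harness: the stage names, budgets, load levels, and latency samples are all hypothetical, and `first_breaking_stage` is an illustrative helper rather than any tool's API. Stages with no entry in the budget table are treated as best-effort and skipped.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a list of latencies (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def first_breaking_stage(budgets_ms, runs):
    """Return (load, stage, p99_ms) for the first budget breach as load rises.

    budgets_ms: {stage: budget} for deadline-bearing stages only.
    runs: [(load_label, {stage: [latency_ms, ...]})], ordered by rising load.
    Returns None if every deadline stage holds its budget at every load level.
    """
    for load, per_stage in runs:
        breaches = []
        for stage, samples in per_stage.items():
            if stage not in budgets_ms:
                continue  # no written deadline: best-effort, not checked
            tail = p99(samples)
            if tail > budgets_ms[stage]:
                breaches.append((tail - budgets_ms[stage], stage, tail))
        if breaches:
            _, stage, tail = max(breaches)  # worst overshoot at this load level
            return load, stage, tail
    return None


# Hypothetical budgets and measurements, for illustration only.
budgets = {"decode": 8.0, "detect": 33.0, "track": 10.0}
runs = [
    ("10 streams", {"decode": [4.0] * 100, "detect": [18.0] * 99 + [30.0],
                    "track": [5.0] * 100, "index": [200.0] * 100}),
    ("50 streams", {"decode": [5.0] * 98 + [7.0] * 2, "detect": [18.0] * 95 + [41.0] * 5,
                    "track": [6.0] * 100, "index": [900.0] * 100}),
]

print(first_breaking_stage(budgets, runs))
```

With these figures the call reports `('50 streams', 'detect', 41.0)`: every deadline stage holds at 10 streams, and detection is the first to exceed its budget at 50. The slow `index` stage is ignored because it has no written deadline. In a real run the samples would come from replayed production traffic after queues have warmed up.
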