Low-Latency Video for Automotive Teleoperation: Why Custom Encoders Beat Off-the-Shelf

A teleoperation system that demonstrates clean handovers on a closed track and then misses its response-time threshold on a public road usually has nothing wrong with its perception model. The latency is hiding in the video path. A remote operator sees the world through a camera, an encoder, a network, a decoder, and a display — and every one of those stages adds delay before the operator’s input ever reaches the vehicle. When the round-trip exceeds the threshold for safe intervention, the system fails in a way that no amount of model improvement will fix.

This is a decision article, not a model-tuning guide. The decision it helps you make is narrow and consequential: when your teleoperation loop is too slow, do you re-engineer the perception stack, the transport, or the video encoder? Most teams reach for the model first because that is where they have invested. In our experience, that is usually the wrong place to look. The encoder is the stage teams understand least and inspect last, and it is frequently where the latency budget quietly disappears.

Where the Latency Budget Actually Goes

Teleoperation has a hard physical constraint that autonomous driving does not: a human is in the loop, and the human is somewhere else. The end-to-end latency budget — from photons hitting the vehicle’s sensor to the operator’s command moving the steering actuator — is the number that determines whether remote operation is safe at a given speed. That budget is fixed by physics and by the safety case. Everything inside it competes for the same milliseconds.

The naive mental model splits that budget into “network latency” and “compute latency” and stops there. It misses the part that off-the-shelf tooling hides. A general-purpose codec such as a stock H.264 or H.265 encoder is built to optimise compression efficiency and visual quality for streaming and storage, not to minimise time-to-first-decodable-frame under a real-time control loop. That design-goal mismatch is where the latency floor comes from.

Concretely, a general-purpose encoder, together with the playback stack built around it, introduces delay through several mechanisms that a streaming use case happily tolerates:

Frame reordering and B-frames. A bidirectionally predicted frame references a frame that comes later in display order, so the encoder has to hold frames back and emit them out of order. Excellent for compression ratio, fatal for a control loop.
Rate-control lookahead. Many encoders analyse several frames ahead to allocate bitrate, deliberately holding frames before emitting them.
Large GOP structures. Long groups-of-pictures reduce keyframe overhead but increase recovery time after packet loss, which on a lossy mobile link translates into stalls.
Buffering for jitter smoothing. Player-side buffers that make a video pleasant to watch add hundreds of milliseconds the operator cannot afford.
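
To put rough numbers on the mechanisms above, the sketch below converts held frames into milliseconds. It is a back-of-envelope model, not a description of any particular encoder: the class and field names are hypothetical, the example values (30 fps, a 20-frame lookahead, a 250-frame GOP, a 200 ms player buffer) are assumptions chosen for illustration, and it assumes reordering and lookahead overlap, so the encoder holds the larger of the two rather than their sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderBuffering:
    """Hypothetical knobs; the names are illustrative, not a real encoder API."""
    fps: float               # capture frame rate
    bframes: int             # consecutive B-frames between reference frames
    lookahead_frames: int    # frames held for rate-control analysis
    gop_length: int          # frames between keyframes
    jitter_buffer_ms: float  # player-side smoothing buffer


def frame_interval_ms(fps: float) -> float:
    """Duration of one frame in milliseconds."""
    return 1000.0 / fps


def added_delay_ms(cfg: EncoderBuffering) -> float:
    """Rough delay from frames held in the encoder plus the player buffer.

    Assumption: reordering and lookahead overlap, so the number of held
    frames is max(bframes, lookahead_frames), not their sum.
    """
    held = max(cfg.bframes, cfg.lookahead_frames)
    return held * frame_interval_ms(cfg.fps) + cfg.jitter_buffer_ms


def worst_case_recovery_ms(cfg: EncoderBuffering) -> float:
    """Upper bound on the wait for the next keyframe after a loss.

    Assumes no retransmission and no intra refresh.
    """
    return cfg.gop_length * frame_interval_ms(cfg.fps)


# Illustrative values only; measure your own pipeline.
streaming = EncoderBuffering(fps=30.0, bframes=2, lookahead_frames=20,
                             gop_length=250, jitter_buffer_ms=200.0)
control_loop = EncoderBuffering(fps=30.0, bframes=0, lookahead_frames=0,
                                gop_length=30, jitter_buffer_ms=0.0)

for label, cfg in (("streaming", streaming), ("control loop", control_loop)):
    print(f"{label}: +{added_delay_ms(cfg):.0f} ms held, "
          f"recovery <= {worst_case_recovery_ms(cfg):.0f} ms")
```

The point of the comparison is the shape, not the figures: every held frame costs one frame interval, so at typical frame rates a modest lookahead alone can exceed an entire control-loop budget.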

None of these are bugs. They are correct behaviour for the workload the codec was designed for. The problem is that a teleoperation video path is a different workload, and reusing the wrong tool imports its priorities. This is the same class of mistake we describe in “How visual perception in automotive AI works in practice”: the failure is rarely in the algorithm; it is in the data path feeding it.

How Much of the Budget Does the Codec Round-Trip Consume?

This is the question worth measuring before you touch anything else, and it is one of the questions teams ask us most often. The honest answer is that it varies with the encoder, the resolution, and the link, but the share is routinely larger than teams expect, and it is almost never zero.

The point of the table below is not to give you a number to quote. It is to give you a structure for attributing your own measured latency to the right stage, so the decision rests on evidence rather than assumption.

Latency Attribution: A Decision Table for the Teleoperation Loop

Stage	What it contributes	Typical lever	Evidence class
Sensor capture + ISP	Fixed exposure + image-signal-processing delay	Sensor/ISP choice; hard to reduce	hardware spec
Video encode	Compression + buffering + lookahead	Custom encoder; the largest controllable lever	observed pattern
Network transport	Propagation + jitter + retransmission	Protocol choice (e.g. low-latency WebRTC paths)	observed pattern
Decode + display	Player buffering + render	Decode pipeline tuning	observed pattern
Operator reaction	Human response time	Out of scope; fixed by the safety case	published survey

When we look at a stalled teleoperation loop, the encode stage is the one most often carrying delay that the team assumed lived in the network. (This is an observed pattern across the perception-infrastructure work we do, not a benchmarked rate — your split depends on your encoder and link.) The reason it gets overlooked is that “the network is slow” is an intuitive story and “the codec is buffering” is not.
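
One way to turn the table into something you can run against your own traces is to record a timestamp at each stage boundary and subtract. The sketch below is a minimal illustration under stated assumptions: the `FrameTrace` fields and stage names are hypothetical, the timestamps are assumed to share one clock (in practice the vehicle and operator station need synchronised clocks for the transport figure to mean anything), and the sample values are invented for the example.

```python
from dataclasses import dataclass

# Stages treated as levers in the table; capture/ISP is treated as fixed.
CONTROLLABLE = {"encode", "transport", "decode_display"}


@dataclass(frozen=True)
class FrameTrace:
    """Hypothetical per-frame timestamps in milliseconds on a shared clock."""
    captured: float    # exposure start
    isp_done: float    # image-signal processing finished
    encoded: float     # encoder emitted the frame
    received: float    # last packet arrived at the operator station
    displayed: float   # frame visible to the operator


def attribute(trace: FrameTrace) -> dict[str, float]:
    """Split one frame's glass-to-glass latency into per-stage durations."""
    return {
        "capture_isp": trace.isp_done - trace.captured,
        "encode": trace.encoded - trace.isp_done,
        "transport": trace.received - trace.encoded,
        "decode_display": trace.displayed - trace.received,
    }


def largest_controllable(stages: dict[str, float]) -> str:
    """Name of the controllable stage carrying the most delay."""
    candidates = {k: v for k, v in stages.items() if k in CONTROLLABLE}
    return max(candidates, key=candidates.get)


# Invented sample values, for illustration only.
trace = FrameTrace(captured=0.0, isp_done=12.0, encoded=95.0,
                   received=140.0, displayed=170.0)
stages = attribute(trace)
total = sum(stages.values())
for name, ms in stages.items():
    print(f"{name:>15}: {ms:6.1f} ms ({ms / total:.0%})")
print("largest controllable stage:", largest_controllable(stages))
```

A single frame proves little; the same attribution run over many frames, and under realistic link conditions, is what makes the split trustworthy.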

Why Off-the-Shelf Encoders Cannot Meet the Threshold

The structural reason is the design-goal mismatch already named, but it is worth stating plainly: a general-purpose video codec optimises for compression efficiency and perceptual quality, while a teleoperation control loop needs minimum time-to-actionable-frame under packet loss, and those two objectives pull in opposite directions. You cannot tune a streaming encoder into a control-loop encoder, because the features that make it good at streaming are exactly the ones that add latency.

A purpose-built encoder for the teleoperation loop inverts the priorities. It drops B-frames entirely, runs zero or minimal lookahead, uses short GOPs with fast loss recovery, and is tuned so the first usable frame leaves the vehicle as fast as the hardware allows. On the GPU side, this is real performance engineering: choosing between hardware encode blocks (such as NVIDIA’s NVENC) and software paths, managing the handoff between capture, encode, and the transport buffer without redundant copies, and keeping the GPU pipeline from stalling on PCIe transfers. Codec engineering of this kind sits squarely inside GPU performance work rather than model development — which is why teams staffed for autonomy modelling often have no one who owns it.
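
The inverted priorities can be written down as a simple check. The sketch below is illustrative only: the settings keys (`bframes`, `lookahead_frames`, `gop_length`) are hypothetical names rather than any real encoder's options, and the default thresholds (zero lookahead, a GOP of at most one second) are assumptions standing in for whatever your safety case and link conditions require.

```python
def low_latency_violations(settings: dict, fps: float,
                           max_lookahead: int = 0,
                           max_gop_seconds: float = 1.0) -> list[str]:
    """List the ways a settings dict departs from control-loop priorities.

    Keys and thresholds are hypothetical; map them onto your encoder's
    real configuration before relying on the result.
    """
    problems = []
    if settings.get("bframes", 0) > 0:
        problems.append("B-frames enabled: forces frame reordering")
    if settings.get("lookahead_frames", 0) > max_lookahead:
        problems.append("rate-control lookahead holds frames before emitting")
    if settings.get("gop_length", 0) > max_gop_seconds * fps:
        problems.append("GOP too long: slow recovery after packet loss")
    return problems


# Invented settings, for illustration only.
stock = {"bframes": 3, "lookahead_frames": 40, "gop_length": 250}
tuned = {"bframes": 0, "lookahead_frames": 0, "gop_length": 30}

for label, cfg in (("stock", stock), ("tuned", tuned)):
    issues = low_latency_violations(cfg, fps=30.0)
    print(label, "->", issues if issues else "no violations")
```

A check like this covers only the configuration surface. The copy, handoff, and PCIe concerns described above do not show up in settings and have to be measured.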

To be clear about the boundary: our automotive practice is infrastructure-layer. We engineer encoders, transport, and GPU pipelines. We do not build the autonomy or remote-driving decision models — that is the customer’s domain. The value we add is removing the data-path latency floor so the customer’s system can actually meet its response-time requirement without overhauling the perception stack they have already invested in.

A Worked Decision: Which Layer Do You Re-Engineer?

Suppose a prototype remote-driving system has an end-to-end loop measured at, for example, 280 ms against a safety-case budget of 200 ms. The team’s instinct is to optimise the perception model, because that is where the headcount is. Here is the reasoning that should precede that decision.

First, attribute the 280 ms to stages using the table above: measure, do not assume. If the model inference is 30 ms and the encode stage is 110 ms, no model optimisation can close the 80 ms gap, because even eliminating inference entirely would recover only 30 ms. The model is not where the budget is going. Second, separate controllable from fixed: sensor capture and operator reaction are largely fixed; encode and transport are not. Third, attack the largest controllable lever first. In this illustrative split, replacing the stock encoder with a control-loop-tuned one is the single change with the most headroom.
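
The arithmetic behind that reasoning fits in a few lines. The sketch below uses only the figures from the example (280 ms measured, 200 ms budget, 30 ms inference, 110 ms encode); the remaining 140 ms is lumped together because the example does not split it further, and the function names are hypothetical.

```python
BUDGET_MS = 200.0
MEASURED_MS = 280.0

# Only the two per-stage figures stated in the example; the rest is lumped.
stages_ms = {"model_inference": 30.0, "encode": 110.0}
stages_ms["everything_else"] = MEASURED_MS - sum(stages_ms.values())

gap_ms = MEASURED_MS - BUDGET_MS  # how much must be recovered


def can_close_gap_alone(stage: str) -> bool:
    """True if removing the stage entirely would recover the whole gap.

    Deliberately generous: no real optimisation drives a stage to zero.
    """
    return stages_ms[stage] >= gap_ms


def required_reduction(stage: str) -> float:
    """Fraction of the stage's latency that must go to close the gap alone.

    A value above 1.0 means the stage cannot close the gap by itself.
    """
    return gap_ms / stages_ms[stage]


for stage in ("model_inference", "encode"):
    print(f"{stage}: can close gap alone = {can_close_gap_alone(stage)}, "
          f"required reduction = {required_reduction(stage):.0%}")
```

Even under the generous assumption that a stage could be driven to zero, inference cannot close the gap by itself, while the encode stage can if most of its latency is removed.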

This is the discipline that separates a system that demonstrates capability from one that meets the threshold. The same separation governs how a perception system is validated before release: the difference between a model that scores well and one that holds under audited conditions is the subject of “What a perception robustness audit tests before you stake a release on it”.

What a GPU Performance Audit of the Encoder Path Examines

When the bottleneck is unclear — and it usually is, because teams do not have stage-level latency attribution instrumented — the first move is to find out where the milliseconds go. A GPU Performance Audit scoped to the encoder and transport layer answers exactly that. It instruments the video path end to end, attributes latency to encode, transport, decode, and inference separately, and identifies whether the controllable share lives in the codec configuration, the GPU pipeline, or the transport protocol.
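
Stage-level instrumentation does not have to be elaborate to be useful. The sketch below is a minimal host-side timer, not the audit's actual tooling: `StageTimer` is a hypothetical helper, the `time.sleep` call is a stand-in for a real encode call, and the approach assumes each stage runs synchronously on the host. Asynchronous GPU work needs explicit synchronisation or device-side timestamps before a host timer means anything, and transport needs synchronised clocks at both ends.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


class StageTimer:
    """Collects wall-clock durations per named stage (hypothetical helper)."""

    def __init__(self) -> None:
        self.samples_ms: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            elapsed_ns = time.perf_counter_ns() - start
            self.samples_ms[name].append(elapsed_ns / 1e6)

    def summary(self) -> dict[str, dict[str, float]]:
        """Median and worst observed latency per stage, in milliseconds."""
        return {
            name: {"median_ms": median(vals), "max_ms": max(vals)}
            for name, vals in self.samples_ms.items()
        }


timer = StageTimer()
for _ in range(5):
    with timer.stage("encode"):
        time.sleep(0.002)  # stand-in for the real encode call

for name, stats in timer.summary().items():
    print(f"{name}: median {stats['median_ms']:.2f} ms, "
          f"max {stats['max_ms']:.2f} ms")
```

Reporting the worst case alongside the median is a deliberate choice here: a control loop fails on its slowest frames, not its typical ones.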

The point of scoping the audit this narrowly is to avoid the expensive default: re-architecting the perception stack to chase a latency problem that was never in the model. In our experience, most teams that run this audit find the bottleneck is not the model (an observed pattern, not a benchmarked rate). That finding alone redirects engineering effort to the stage that can actually move the number.

FAQ

Is there an AI for car mechanics?

There are AI systems aimed at automotive diagnostics and maintenance, but that is a different problem space from the one this article addresses. Our automotive work is infrastructure-layer — video encoders, GPU pipelines, and the data paths feeding perception and teleoperation systems — rather than diagnostic tooling for mechanics.

Who are the big 3 AI companies?

There is no fixed “big 3,” and the answer depends on which slice of AI you mean — foundation models, cloud infrastructure, or accelerator hardware. For automotive teleoperation specifically, the names that matter are the GPU and codec hardware vendors (such as NVIDIA, whose NVENC encode blocks sit in many of these pipelines), because the latency floor is set at the hardware-and-encoder layer, not by any single model vendor.

What is AI for autonomous vehicles infrastructure?

It is the layer beneath the autonomy model: the sensors, image-signal processing, video encoders, transport, and GPU pipelines that move data from the vehicle to wherever it is processed or operated. This article focuses on the teleoperation slice of that infrastructure, where the video path’s latency determines whether a remote operator can intervene safely.

How much of a teleoperation system’s end-to-end latency budget is consumed by the video codec round-trip versus model inference?

It varies with the encoder, resolution, and link, but the encode stage routinely carries a larger share than teams expect and is often the single largest controllable contributor. The reliable way to know your split is to instrument the video path and attribute latency per stage — assuming the network is the culprit is the common error.

Why can’t off-the-shelf video encoders meet the response-time threshold required for safe remote driving?

A general-purpose codec optimises for compression efficiency and visual quality, which it achieves through B-frames, rate-control lookahead, long GOPs, and jitter buffering — all of which add latency. Those features are correct for streaming and storage but pull directly against a control loop’s need for minimum time-to-actionable-frame, so the encoder imposes a latency floor that tuning cannot remove.

What does a GPU Performance Audit scoped to the encoder and transport layer actually examine in a teleoperation video path?

It instruments the path end to end and attributes latency separately to encode, transport, decode, and inference, then identifies where the controllable delay lives — codec configuration, GPU pipeline, or transport protocol. The goal is to determine whether the bottleneck is the encoder, the transport, or the model before any expensive re-architecture begins.

The Question Worth Settling First

Before committing a team to model optimisation, settle the cheaper question: how much of your teleoperation latency budget is consumed by the codec round-trip rather than by inference? The answer reframes the entire engineering decision. Treating codec engineering as a first-order latency lever is the same reasoning that governs low-latency video in broadcast and live-media pipelines, where the encoder, not the model, sets the floor. If your remote-driving loop demonstrates capability but cannot meet its response-time threshold, a GPU Performance Audit scoped to the encoder and transport layer tells you whether the problem is the encoder, the transport, or the model. In our experience, it is usually not the model.