CASE STUDY

Embedded Video Coding on GPU (Under NDA)

TechnoLynx built a CUDA-based, standards-compliant H.264 encoder for an automotive edge startup targeting a Jetson Nano-class GPU. The brief: 1080p/30fps across four or more simultaneous streams with a hard CPU cap of 5%. Outcome: ~24 FPS on target hardware — more than double the previous implementation — and a ~3.6% average compression gain in low-QP benchmark conditions.

GPU Optimisation, CUDA, H.264 Main Profile, Jetson Nano, Embedded Edge

The Challenge

The client was a post-funding startup building an embedded video coding approach with the potential to change how video is compressed, processed, and delivered. However, off-the-shelf encoders were not customisable enough and could not integrate cleanly into their pipeline.

Deep Encoder Customisation

The initial brief called for an OpenCL-based encoder. After assessing the Jetson Nano's architecture, TechnoLynx moved execution to CUDA, which is better suited to the NVIDIA target and gave the fine-grained control over the encoding pipeline that off-the-shelf encoders could not provide.

Standards Compliance Constraint

The client did not control the decoder side. Every bitstream had to be strictly H.264 Main Profile, Level 4.0 compliant — no B-frames, no interlaced coding — while still accommodating the client's proprietary encoding innovations.
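As a rough sanity check on the Level 4.0 constraint, the sketch below tests whether a frame size and frame rate fit within two of the Level 4 limits from Table A-1 of the H.264 specification: a maximum frame size of 8,192 macroblocks and a maximum rate of 245,760 macroblocks per second. The constant and function names are illustrative assumptions, not taken from the project's code, and the remaining level limits (bitrate, buffer sizes) are not checked here.

```cpp
#include <cstdint>

// Two of the H.264 Level 4 limits (ITU-T H.264, Table A-1).
// Other limits (bitrate, CPB/DPB sizes) are deliberately left out of this sketch.
constexpr std::uint32_t kLevel4MaxFrameSizeMbs = 8192;    // MaxFS
constexpr std::uint32_t kLevel4MaxMbsPerSecond = 245760;  // MaxMBPS

// Number of 16x16 macroblocks needed to cover a frame (dimensions rounded up).
constexpr std::uint32_t macroblocksPerFrame(std::uint32_t width, std::uint32_t height) {
    return ((width + 15) / 16) * ((height + 15) / 16);
}

// Hypothetical helper: true if a single stream fits the two limits above.
constexpr bool fitsLevel4(std::uint32_t width, std::uint32_t height, std::uint32_t fps) {
    return macroblocksPerFrame(width, height) <= kLevel4MaxFrameSizeMbs &&
           macroblocksPerFrame(width, height) * fps <= kLevel4MaxMbsPerSecond;
}
```

For 1920x1080 this gives 120 x 68 = 8,160 macroblocks per frame and 244,800 macroblocks per second at 30 fps, so a single 1080p/30fps stream sits just inside both limits.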

Hard CPU Headroom Constraint

The target hardware was a Jetson Nano-class embedded NVIDIA GPU. The CPU had a strict operating cap of ≤5%, leaving the processor available for the rest of the system. Every computationally intensive encoding task had to be offloaded to the GPU.

Multi-Stream Throughput

The encoder had to sustain four or more simultaneous 1080p streams at 30 fps — all within the same CPU and GPU budget. This ruled out any per-stream implementation that did not share resources efficiently.
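To make the throughput requirement concrete, the following sketch converts a stream count and per-stream frame rate into an aggregate frame rate and an average time budget per encoded frame. The function names are hypothetical, and the calculation assumes all streams share a single encoder instance with frames processed one after another, which is a simplification.

```cpp
#include <cstdint>

// Aggregate frame rate when several streams share one encoder.
inline std::uint32_t aggregateFps(std::uint32_t streams, std::uint32_t fpsPerStream) {
    return streams * fpsPerStream;
}

// Average time available per encoded frame, in microseconds.
// Callers must pass a non-zero stream count and frame rate.
inline double frameBudgetMicros(std::uint32_t streams, std::uint32_t fpsPerStream) {
    return 1000000.0 / static_cast<double>(aggregateFps(streams, fpsPerStream));
}
```

With four streams at 30 fps the encoder has to complete 120 frames per second, which leaves roughly 8.3 ms per frame on average.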

Image credits: Freepik.

Project Timeline

~11 months active — from agreement to handover

Constraints & Requirements

Agreed the Jetson Nano-class target hardware, the ≤5% CPU cap, 1080p/30fps at 4+ simultaneous streams, and H.264 Main Profile compliance. Switched from OpenCL to CUDA after evaluating the NVIDIA architecture.

Modular Pipeline

Split the encoding pipeline into distinct modules and defined APIs, enabling both teams to work autonomously while ensuring smooth integration.

Core Encoder Work

Implemented and iteratively improved transform and prediction functionality, starting from a state-of-the-art baseline agreed with the client.

VQ Benchmarking

Ran video-quality benchmarking to measure compression outcomes, comparing per-frame VMAF scores and per-class gains against the previous delivery.
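A comparison of this kind between two deliveries can be sketched as below: one helper averages the per-frame quality score difference, and another expresses the change in encoded size as a percentage. This is a minimal illustration under stated assumptions. The function names are hypothetical, the scores are assumed to have been produced beforehand by a VMAF tool, and the case study does not specify exactly how its compression-gain figure was computed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Mean of (candidate - baseline) over all frames.
// A positive result means the candidate scored higher on average.
double meanScoreDelta(const std::vector<double>& baseline,
                      const std::vector<double>& candidate) {
    if (baseline.empty() || baseline.size() != candidate.size()) {
        throw std::invalid_argument("score vectors must be non-empty and equal in length");
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < baseline.size(); ++i) {
        sum += candidate[i] - baseline[i];
    }
    return sum / static_cast<double>(baseline.size());
}

// Relative size reduction in percent.
// A positive result means the candidate bitstream is smaller than the baseline.
double sizeReductionPercent(double baselineBytes, double candidateBytes) {
    if (baselineBytes <= 0.0) {
        throw std::invalid_argument("baseline size must be positive");
    }
    return 100.0 * (baselineBytes - candidateBytes) / baselineBytes;
}
```

A size reduction is only meaningful as a compression gain when the quality scores of the two encodes are comparable, which is why the two measures are usually reported together.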

Profiling & Performance

Profiling on the Jetson Nano target confirmed approximately 24 FPS — more than double the previous implementation's throughput. GPU utilisation was within budget; CPU headroom was preserved.

GPU Optimisation

Used CUDA to offload computationally intensive work to the GPU, ensuring the GPU was utilised efficiently without overwhelming the system.

Handover & Delivery

Delivered source code, sample application, and documentation. Integration with the client's wider pipeline was validated; the project moved into handover mode.

The Solution

TechnoLynx worked closely with the client’s team in a hands-on collaboration model. The client focused on video coding direction, while TechnoLynx owned GPU-specific optimisation and performance-critical improvements within the encoding process.

Architecture

Adopted a modular pipeline design by defining distinct encoder modules and establishing clear APIs, allowing parallel progress while ensuring seamless system integration.
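One way to picture a modular design of this kind is a small stage interface that each team implements independently, with a pipeline object that runs the stages in order. The sketch below is purely illustrative: the class names, the `FrameData` structure, and the stub stages are assumptions made for this example and do not describe the project's actual APIs.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data passed between stages; a real encoder carries far more state.
struct FrameData {
    std::vector<std::uint8_t> pixels;
    std::vector<std::string> trace;  // names of the stages that processed this frame
};

// Common interface that every encoder module implements.
class EncoderStage {
public:
    virtual ~EncoderStage() = default;
    virtual std::string name() const = 0;
    virtual void process(FrameData& frame) = 0;
};

// Stub stages: they only record that they ran, standing in for real work.
class PredictionStage : public EncoderStage {
public:
    std::string name() const override { return "prediction"; }
    void process(FrameData& frame) override { frame.trace.push_back(name()); }
};

class TransformStage : public EncoderStage {
public:
    std::string name() const override { return "transform"; }
    void process(FrameData& frame) override { frame.trace.push_back(name()); }
};

// Runs the registered stages in the order they were added.
class Pipeline {
public:
    void add(std::unique_ptr<EncoderStage> stage) { stages_.push_back(std::move(stage)); }
    void run(FrameData& frame) const {
        for (const auto& stage : stages_) {
            stage->process(frame);
        }
    }
private:
    std::vector<std::unique_ptr<EncoderStage>> stages_;
};
```

Because the pipeline depends only on the interface, one team can replace a stage implementation (for example with a GPU-backed version) without touching the other stages.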

Encoder Core

Focused on transform and prediction functionality, the key drivers of compression efficiency, and iteratively improved on a state-of-the-art baseline through benchmarking and tuning cycles.
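For readers unfamiliar with the transform stage, the sketch below shows a plain CPU reference of the standard H.264 4x4 forward core transform, Y = Cf · X · Cfᵀ, applied to a residual block. It is not the project's CUDA implementation, whose details are not public; the type and function names are assumptions for illustration, and the scaling and quantisation that follow the core transform in a real encoder are omitted.

```cpp
#include <array>

using Block4x4 = std::array<std::array<int, 4>, 4>;

// Core matrix of the H.264 4x4 forward integer transform.
constexpr int kCf[4][4] = {
    {1,  1,  1,  1},
    {2,  1, -1, -2},
    {1, -1, -1,  1},
    {1, -2,  2, -1},
};

// Computes Y = Cf * X * Cf^T using integer arithmetic only.
Block4x4 forwardCoreTransform(const Block4x4& x) {
    Block4x4 tmp{};
    Block4x4 y{};
    // tmp = Cf * X
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            for (int k = 0; k < 4; ++k) {
                tmp[i][j] += kCf[i][k] * x[k][j];
            }
        }
    }
    // y = tmp * Cf^T, where Cf^T[k][j] == kCf[j][k]
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            for (int k = 0; k < 4; ++k) {
                y[i][j] += tmp[i][k] * kCf[j][k];
            }
        }
    }
    return y;
}
```

A flat block (all samples equal) produces a single non-zero DC coefficient, which illustrates why good prediction, leaving small and smooth residuals, translates into fewer coefficients to code. Each 4x4 block is independent of the others at this step, which is one reason the transform maps naturally onto GPU threads.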

CUDA Acceleration

Used CUDA to push computationally intensive tasks to the GPU so the CPU could remain available for other critical system processes.

Technical Specifications

Tools: Cross-platform C++, CUDA (NVIDIA), CMake
Modules: Transform and prediction functionality (compression efficiency drivers)
Profile: H.264 Main Profile, Level 4.0 (no B-frames, no interlaced coding)
Requirement: Reliable operation across different operating systems and GPU environments
Constraint: Maintain compliance with established coding standards (decoder not controlled)
Hardware target: Jetson Nano-class embedded NVIDIA GPU
Performance constraint: ≤5% CPU usage; 4+ simultaneous 1080p/30fps streams
Deliverables: Source code, sample application, documentation

The Outcome

TechnoLynx delivered a fully customised CUDA-based encoder that substantially advanced encoding performance on the target hardware. On target Jetson Nano-class hardware, profiling-driven optimisation achieved approximately 24 FPS — more than double the performance of the previous implementation. A controlled benchmark also recorded an average ~3.6% compression gain for low-QP encoding, enabling higher-quality video at lower bitrates.

Key Achievements

Achieved approximately 24 FPS on Jetson Nano-class hardware — more than double the performance of the prior implementation (profiling measurement)

Recorded a ~3.6% average compression gain for low-QP encoding in a controlled benchmark scenario (single test video, 25% static mask condition) vs. the previous delivery

Offloaded transform and prediction tasks to the GPU, targeting the ≤5% CPU cap required by the embedded target

Optimised motion estimation to produce smoother motion fields, enabling more than 50% of macroblocks to be coded as P_Skip in optimised runs, which reduces bitrate at equivalent quality

Maintained strict H.264 Main Profile, Level 4.0 standards compliance throughout — ensuring full decoder compatibility despite the custom encoding modifications

Delivered consistent performance across different GPU environments and operating systems using cross-platform C++ and CMake
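The link between smoother motion fields and skipped macroblocks can be illustrated with a simplified check. In H.264, a P macroblock can be coded as P_Skip when it uses reference index 0, carries no coded residual, and its motion vector equals the one the decoder would infer from neighbouring macroblocks. The sketch below assumes those inputs have already been computed by earlier stages; the struct fields and function names are hypothetical and this is not the project's decision logic.

```cpp
#include <cstddef>
#include <vector>

struct MotionVector {
    int x;
    int y;
};

// Hypothetical per-macroblock summary produced by earlier encoder stages.
struct MacroblockDecision {
    MotionVector chosenMv;         // best 16x16 motion vector from motion estimation
    MotionVector predictedSkipMv;  // motion vector the decoder would infer for P_Skip
    bool usesReferenceZero;        // true if the chosen reference index is 0
    bool hasNonZeroCoefficients;   // true if any quantised residual coefficient is non-zero
};

// True if the macroblock meets the simplified P_Skip conditions described above.
bool canCodeAsPSkip(const MacroblockDecision& mb) {
    return mb.usesReferenceZero && !mb.hasNonZeroCoefficients &&
           mb.chosenMv.x == mb.predictedSkipMv.x &&
           mb.chosenMv.y == mb.predictedSkipMv.y;
}

// Fraction of macroblocks (0.0 to 1.0) that meet the P_Skip conditions.
double pSkipFraction(const std::vector<MacroblockDecision>& mbs) {
    if (mbs.empty()) {
        return 0.0;
    }
    std::size_t skipped = 0;
    for (const auto& mb : mbs) {
        if (canCodeAsPSkip(mb)) {
            ++skipped;
        }
    }
    return static_cast<double>(skipped) / static_cast<double>(mbs.size());
}
```

When neighbouring motion vectors agree, the inferred vector matches the chosen one more often, so more macroblocks pass this check and cost almost no bits to signal.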

"

When we needed immediate support for a complex software development project, we came in contact with TechnoLynx and in the end worked with them for 1 year. They showed enormous skill and vast domain knowledge. We would recommend TechnoLynx to anyone looking for promptness, quality work and IT expertise.

Anonymous Client

Automotive Edge Startup — 2021

Our Technological Capabilities

Computer Vision Services

Our services feature expertise in classical computer vision, human-supervised system design for legal compliance, video pipeline optimisation with tools like FFmpeg, custom adaptable models, and explainable AI for ethical transparency.

Generative AI

We are leaders in generative AI, offering optimised inference for faster deployments, ethical AI systems with bias mitigation, intelligent automation for adaptive workflows, and advanced simulation and prototyping capabilities.

GPU Performance Engineering

We specialise in GPU-accelerated compute for embedded and edge systems: CUDA kernel development, profiling-driven optimisation, multi-stream throughput, and low-level encoder engineering on NVIDIA hardware. We make demanding workloads fit constrained targets.

Need GPU-Accelerated Encoding for Embedded Edge?

Let's discuss CUDA-based encoder development for embedded targets — performance profiling, compression optimisation, and standards-compliant bitstream delivery.