Q: How do I compute TCO per useful FLOP?

Divide total workload cost (hardware or rental, plus power and ops) by the integral of achieved throughput over the run. The result is usually 4–7× the cost per FLOP implied by the marketing peak figure.

Q: Which workload patterns leave GPU capacity on the table?

Small-batch inference, CPU-bound data pipelines, tile-misaligned kernels, and gradient accumulation with sync barriers. Each has a distinct fix — batching, prefetching, kernel choice, or parallelism strategy.

Q: Should I procure more GPUs or profile what I have first?

Profile first. Under-profiled workloads typically gain 30–50% in effective capacity from data-pipeline and batching fixes alone, at a fraction of the cost of doubling the fleet.

Q: What cost savings are realistic from optimisation vs renting more capacity?

30–60% reduction in compute spend from a first optimisation pass; 5–15% from each subsequent pass on already-tuned workloads. Renting more capacity to cover underutilisation multiplies unit cost without addressing the cause.

GPU vs TPU vs CPU: Performance and Efficiency Explained

Q: How do I calculate the true cost of an underutilised GPU fleet?

Take purchased capacity, multiply by achieved utilisation to get effective capacity, then divide TCO (hardware amortisation, power, cooling, operations) by that effective capacity. A fleet at 30% utilisation costs about 3.3× as much per useful hour as a fully utilised one.

Q: What does GPU utilisation actually measure?

Headline utilisation only reports whether at least one SM is active. Real efficiency requires SM occupancy, achieved memory bandwidth, tensor-core utilisation, and compute-vs-transfer ratios from a profiler like Nsight Compute.

Introduction

CPUs, GPUs, and TPUs exist because hardware architects made different bets between flexibility and throughput. The naive read is “which is fastest?” The expert read is “which design matches the bottleneck of this workload, in this software stack, at this scale?” The failure mode that this article exists to name is the one that quietly drains AI infrastructure budgets: paying for accelerator capacity that sits idle because the workload, the data path, or the batching strategy never let the chip reach its design throughput. Underutilised accelerators cost real money, and the cost compounds monthly. See the GPU engineering practice for the broader context.

The naive procurement path is to buy the largest GPU fleet that the budget allows, then discover after deployment that utilisation sits at 25%. The expert path is to profile first, classify the workload, and pick the accelerator that the bottleneck structure actually demands — sometimes a smaller GPU, sometimes a TPU, sometimes a well-tuned CPU pipeline. This article is the failure-mode lens on that decision.

What this means in practice

Measure utilisation per workload before procuring additional capacity — “GPU busy” is not the same as “GPU productive.”
Compute total cost of ownership per useful FLOP, not per purchased FLOP.
Match the accelerator to the dominant pattern: irregular control flow → CPU; data parallelism → GPU; dense matmul → TPU.
Treat underutilisation as a profiling failure, not a hardware-shortage signal.

How do I calculate the true cost of an underutilised GPU fleet?

Start from the purchased capacity and the achieved utilisation. If a cluster of eight A100-class GPUs runs at 30% average utilisation across the year, the effective capacity is 2.4 GPUs — but the cost is for 8. The TCO calculation needs to include the hardware amortisation, the power draw at idle (which is much lower than at load but non-zero), the cooling overhead, and the engineering time spent operating capacity that isn’t being used.

For cloud-rented GPUs the calculation is simpler and more painful: hourly rate times hours rented, divided by the hours of productive compute. A 24/7 reserved instance at 30% utilisation pays 3.3× the cost per useful hour compared to a fully utilised instance. The cost per useful FLOP is the metric that makes the gap visible — purchased FLOPs are a marketing number.
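The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a billing model: the hourly rate and the 30-day month are made-up assumptions, and `cost_per_useful_hour` is a hypothetical helper name.

```python
def cost_per_useful_hour(hourly_rate: float, hours_rented: float, utilisation: float) -> float:
    """Rental spend divided by the hours of productive compute."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    total_cost = hourly_rate * hours_rented
    productive_hours = hours_rented * utilisation
    return total_cost / productive_hours


# Assumed figures for illustration only: 2.00 per GPU-hour, reserved 24/7 for 30 days.
RATE = 2.00
HOURS = 24 * 30

fully_used = cost_per_useful_hour(RATE, HOURS, 1.0)
at_30_percent = cost_per_useful_hour(RATE, HOURS, 0.30)
multiplier = at_30_percent / fully_used

# Effective capacity of the eight-GPU cluster from the text at 30% utilisation.
effective_gpus = 8 * 0.30

print(f"cost per useful hour at 30%: {at_30_percent:.2f}")
print(f"multiplier vs fully utilised: {multiplier:.2f}x")
print(f"effective capacity: {effective_gpus:.1f} GPUs")
```

The multiplier depends only on utilisation (it is 1 / utilisation), so the assumed rate cancels out. For an owned fleet, the same division applies with amortisation, power, cooling, and operations summed into the numerator.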

What does “GPU utilisation” actually measure — and why is the GPU-busy percentage misleading?

The headline GPU utilisation number reported by nvidia-smi and most monitoring tools measures the share of each sampling period in which at least one kernel was executing, which amounts to asking whether any streaming multiprocessor (SM) was active. It does not measure how many SMs are active, whether the active ones are running at full warp occupancy, or whether they are stalled waiting on memory. A GPU can report 99% utilisation while doing 10% of its theoretical work.

The metrics that matter for real efficiency are SM occupancy, achieved memory bandwidth, tensor-core utilisation (on hardware that has them), and the ratio of compute time to data-transfer time. Profilers like Nsight Compute expose these. The shorthand “GPU busy = GPU productive” is the trap that leads teams to procure more capacity when the existing capacity is structurally underused.
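A toy model makes the gap between "busy" and "productive" concrete. The sketch below does not reproduce how nvidia-smi or Nsight Compute sample the device. It assumes a hypothetical trace of active-SM counts on a device with 100 SMs and compares a headline-style busy percentage against average SM activity.

```python
from typing import Sequence


def busy_percent(active_sms: Sequence[int]) -> float:
    """Share of samples in which at least one SM was active (headline-style metric)."""
    return 100.0 * sum(1 for n in active_sms if n > 0) / len(active_sms)


def mean_sm_activity_percent(active_sms: Sequence[int], total_sms: int) -> float:
    """Average share of the device's SMs that were active across all samples."""
    return 100.0 * sum(active_sms) / (len(active_sms) * total_sms)


# Hypothetical trace: 100 samples, 10 of 100 SMs active in 99 of them, idle in one.
TOTAL_SMS = 100
trace = [10] * 99 + [0]

headline = busy_percent(trace)
activity = mean_sm_activity_percent(trace, TOTAL_SMS)

print(f"headline busy: {headline:.1f}%")
print(f"mean SM activity: {activity:.1f}%")
```

With this trace the headline figure is 99% while mean SM activity is about 10%. Even the second number is optimistic, because an SM that is stalled on memory still counts as active here.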

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the operations that contributed to the final result, measured at the workload’s achieved throughput rather than the chip’s peak rating. Divide the workload’s total cost over the run (hardware amortisation or cloud rental, plus power, plus operations overhead) by the integral of achieved throughput over the same period. The denominator is what profiling gives you.

The exercise often produces uncomfortable numbers. A workload that achieves 15% of peak on an H100 has a TCO-per-useful-FLOP that is more than 6× the figure implied by the peak rating (1 / 0.15 ≈ 6.7). The corrective action is rarely “buy a different chip”; more often it is “fix the data pipeline” or “change the batching strategy.” The TCO calculation is what makes the case for the engineering work.
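As a rough sketch of the calculation, the integral of achieved throughput can be approximated with the trapezoidal rule over profiler samples. The peak rating, run cost, and sample values below are placeholders chosen for illustration, not measurements of any real device, and the function names are hypothetical.

```python
from typing import Sequence


def useful_flops(times_s: Sequence[float], flops_per_s: Sequence[float]) -> float:
    """Integrate achieved throughput over time with the trapezoidal rule."""
    if len(times_s) != len(flops_per_s) or len(times_s) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (flops_per_s[i] + flops_per_s[i - 1]) * dt
    return total


def cost_per_useful_flop(run_cost: float, times_s: Sequence[float],
                         flops_per_s: Sequence[float]) -> float:
    """Total run cost divided by the FLOPs actually delivered."""
    return run_cost / useful_flops(times_s, flops_per_s)


# Placeholder inputs: a one-hour run sampled three times at a steady 15% of peak.
PEAK_FLOPS_PER_S = 1.0e15   # assumed peak rating, not a vendor figure
RUN_COST = 4.0              # assumed all-in cost of the run
times = [0.0, 1800.0, 3600.0]
achieved = [0.15 * PEAK_FLOPS_PER_S] * 3

actual = cost_per_useful_flop(RUN_COST, times, achieved)
at_peak = RUN_COST / (PEAK_FLOPS_PER_S * 3600.0)
ratio = actual / at_peak

print(f"cost per useful FLOP is {ratio:.2f}x the peak-rated figure")
```

In practice the throughput samples would vary over the run, which is why the integral is used rather than a single average taken at one point in time.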

Which workload patterns most often leave GPU capacity on the table?

The recurring culprits are: small-batch inference where the kernel launch overhead dominates the useful work; data-loading pipelines that cannot keep the GPU fed because preprocessing is CPU-bound or storage is too slow; attention or convolution kernels operating on sequence lengths or tensor shapes that do not align with the hardware’s preferred tile sizes; and training runs with gradient accumulation steps that interleave compute with synchronisation barriers.

Each pattern has a different fix. Small-batch inference benefits from request batching or smaller, latency-optimised hardware. Data-loading bottlenecks need prefetching, pinned memory, and sometimes a separate storage tier. Tile-misaligned kernels need either a different kernel implementation or an input-shape change. Gradient accumulation overhead points at the parallelism strategy, not the hardware.
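The small-batch case is the easiest to reason about numerically. The sketch below assumes a deliberately simple model in which every launch pays a fixed overhead and useful work grows linearly with batch size. Both timings are invented for illustration, and real kernels usually benefit from batching more than this model suggests because larger batches also fill the device better.

```python
def batch_efficiency(batch_size: int, launch_overhead_us: float, per_item_us: float) -> float:
    """Fraction of wall-clock time spent on useful work for one launch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    useful_us = batch_size * per_item_us
    return useful_us / (launch_overhead_us + useful_us)


# Assumed timings, for illustration only.
LAUNCH_OVERHEAD_US = 20.0
PER_ITEM_US = 5.0

efficiencies = {
    b: batch_efficiency(b, LAUNCH_OVERHEAD_US, PER_ITEM_US)
    for b in (1, 4, 16, 64)
}

for b, eff in efficiencies.items():
    print(f"batch {b:>2}: {eff:.0%} of time is useful work")
```

With these numbers, efficiency rises from 20% at batch size 1 to 80% at batch size 16, which is the effect request batching is meant to capture.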

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, every time. The cost of a profiling exercise (engineering days, not capital expenditure) is small compared to the cost of a procurement decision that doubles the fleet for a workload that was using 30% of the existing one. The recurring finding from utilisation audits is that 30–50% improvements in effective capacity are available from changes to the data pipeline, batching strategy, or kernel implementation, without touching the hardware budget.

The exception is when the workload is genuinely capacity-bound and profiling confirms it: high utilisation, balanced kernels, no data-pipeline stalls, and the only lever left is more devices. Even then, profiling produces the procurement specification — which class of accelerator, what memory profile, what interconnect — rather than a blanket “more of the same.”

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

For most under-profiled workloads, 30–60% reduction in compute spend is realistic from a one-time optimisation pass: data-pipeline fixes, batching changes, mixed-precision adoption, and kernel-level tuning where the workload warrants it. Workloads that have already been profiled and tuned see smaller incremental wins (5–15%) from continued effort, which is why the first profile is the highest-leverage one.
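One way to sanity-check these ranges is to relate spend to utilisation directly. The sketch below assumes the same amount of useful work and that freed capacity can actually be released (as with cloud rental), so spend scales with 1 / utilisation. The function name and the example utilisation figures are illustrative.

```python
def spend_reduction(util_before: float, util_after: float) -> float:
    """Fractional drop in compute spend for the same useful work.

    Assumes spend is proportional to 1 / utilisation, i.e. capacity that is
    no longer needed stops being paid for.
    """
    if not (0.0 < util_before <= 1.0 and 0.0 < util_after <= 1.0):
        raise ValueError("utilisation values must be in (0, 1]")
    return 1.0 - util_before / util_after


# Illustrative starting point of 30% utilisation.
modest = spend_reduction(0.30, 0.50)
strong = spend_reduction(0.30, 0.75)

print(f"30% -> 50% utilisation: {modest:.0%} lower spend")
print(f"30% -> 75% utilisation: {strong:.0%} lower spend")
```

Under these assumptions, moving from 30% to 50% utilisation cuts spend by 40%, and reaching 75% cuts it by 60%. On owned hardware the saving shows up as deferred procurement rather than a lower bill.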

Renting additional capacity to cover an underutilisation problem multiplies the unit cost without addressing the cause. The pattern that reliably saves money is: profile, identify the largest single bottleneck, fix it, re-profile, repeat until the marginal engineering hour costs more than the saved compute hour.
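The stopping rule in that loop can be written down as a simple comparison. The sketch below is a simplification: it assumes the savings of each candidate fix have already been estimated, whereas in practice each estimate comes from re-profiling after the previous fix. The `Fix` type, the rates, and the candidate list are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fix:
    name: str
    engineering_hours: float
    compute_hours_saved: float  # estimated over the planning horizon


def worthwhile_fixes(candidates: List[Fix], eng_hour_cost: float,
                     compute_hour_cost: float) -> List[str]:
    """Take fixes largest-saving first; stop at the first one that costs more than it saves."""
    chosen: List[str] = []
    ordered = sorted(candidates, key=lambda f: f.compute_hours_saved, reverse=True)
    for fix in ordered:
        cost = fix.engineering_hours * eng_hour_cost
        saving = fix.compute_hours_saved * compute_hour_cost
        if saving <= cost:
            break  # the marginal engineering effort no longer pays for itself
        chosen.append(fix.name)
    return chosen


# Hypothetical estimates and rates, for illustration only.
candidates = [
    Fix("kernel tuning", engineering_hours=120, compute_hours_saved=300),
    Fix("data pipeline", engineering_hours=40, compute_hours_saved=5000),
    Fix("batching", engineering_hours=24, compute_hours_saved=2000),
]
plan = worthwhile_fixes(candidates, eng_hour_cost=100.0, compute_hour_cost=2.0)
print(plan)
```

With these invented figures the data-pipeline and batching fixes pay for themselves and the kernel-tuning pass does not, so the loop stops after two iterations.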

Limitations that remain

Utilisation-driven optimisation has a ceiling: at the point where the workload is well-batched, the data pipeline is saturated, and kernels are tuned, the remaining gap to peak is set by the hardware itself and the workload’s intrinsic structure. Some workloads — sparse attention, irregular graph computation, control-heavy inference — will never reach the FLOP utilisation that dense matrix multiplication achieves on a TPU, and pursuing the last few percent costs more than it returns. Cross-vendor portability also remains a real friction: a workload optimised for one accelerator class does not automatically translate, and the engineering cost of staying portable is a tax that has to be priced into the TCO calculation.

How TechnoLynx Can Help

TechnoLynx runs GPU utilisation audits that measure the actual achieved throughput of your workloads, identify where capacity is leaking, and produce a prioritised list of changes — data-pipeline fixes, batching strategy, kernel-level tuning, accelerator-class choices — sized to the cost of the engineering work. If you are about to approve another GPU procurement cycle, contact us before you sign — the utilisation number you have may not be the one you think.

Image credits: Freepik