Introduction

A capacity plan built on accelerator nameplate TDP is a fiction. The number on the spec sheet is the maximum the silicon can draw under sustained worst-case load; the number the deployed workload actually pulls is workload-conditional, often off by 30–60% in either direction, and varies across the same hardware depending on which model is loaded and how the serving stack is configured. This article makes the case that AI data center power is a function of the workload first and the hardware second, walks through the failure modes the nameplate-multiplied plan produces, and frames power as a capacity input that needs the same profiling discipline as GPU FLOPs. See the GPU engineering practice for the broader audit framework.

The naive read is "multiply count by TDP, add cooling overhead, done." The expert read is that the same fleet runs at radically different power footprints across inference-bound and training-bound workloads, and that the cost of overbuilding capacity is paid forever in capex and stranded power-purchase agreements.

What this means in practice

- Nameplate TDP overstates power for memory-bound inference and understates it for sustained training under aggressive utilisation tuning.
- Cooling and PUE assumptions depend on the workload mix, not on the count of accelerators.
- Profile the actual workload power draw before signing the next power-purchase agreement.
- Track power per useful FLOP, not power per purchased FLOP: the metric scales with the procurement decision.

How do I calculate the true cost of an underutilised GPU fleet?

For the power axis specifically: cost is hardware amortisation plus power (kWh × tariff) plus cooling (kWh × tariff × (PUE − 1)) plus operational overhead, divided by useful FLOPs (the model FLOPs actually consumed by the workload, not the GPU's theoretical peak). The composite picture is regularly 3–10× the back-of-envelope estimate teams use when they pitch the next procurement cycle.
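As a rough illustration of the cost formula above, the sketch below computes cost per useful FLOP for one accounting period. The function name, parameter names, and every figure in the example are hypothetical placeholders, not audit data.

```python
def cost_per_useful_flop(hw_amortisation, it_energy_kwh, tariff_per_kwh,
                         pue, ops_overhead, useful_flops):
    """Cost per useful FLOP over one accounting period (e.g. one year).

    All money inputs share one currency; it_energy_kwh is the energy drawn
    by the accelerators themselves, before cooling overhead.
    """
    if useful_flops <= 0:
        raise ValueError("useful_flops must be positive")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    power_cost = it_energy_kwh * tariff_per_kwh            # kWh x tariff
    cooling_cost = power_cost * (pue - 1.0)                # kWh x tariff x (PUE - 1)
    total = hw_amortisation + power_cost + cooling_cost + ops_overhead
    return total / useful_flops


if __name__ == "__main__":
    # Hypothetical fleet: all values are made up for illustration.
    result = cost_per_useful_flop(
        hw_amortisation=1_000_000,   # currency units per year
        it_energy_kwh=500_000,       # measured accelerator energy per year
        tariff_per_kwh=0.10,
        pue=1.4,
        ops_overhead=100_000,
        useful_flops=1e21,           # from kernel-level profiling
    )
    print(f"cost per useful FLOP: {result:.3e}")
```

Keeping the denominator as measured useful FLOPs rather than peak FLOPs multiplied by time is the point of the exercise: the numerator is the same either way, so the difference between the two results is the underutilisation.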
The 3–10× range is an observed pattern across our GPU audit engagements, not a benchmarked rate; the multiplier in any specific environment depends on tariff, PUE, and workload mix.

The discipline that closes the gap is profiling. A representative two-week capture of the workload's actual power draw (DCGM exports power-per-GPU at second-granularity) plus the useful FLOPs the workload consumed (via Nsight Compute or framework-level profilers) produces the per-useful-FLOP power number. Procurement decisions made against this number rather than against nameplate TDP avoid the 30–60% overbuild that becomes stranded power and capex.

What does GPU utilisation actually measure, and why is the GPU-busy percentage misleading?

nvidia-smi's GPU-utilisation percentage reports the share of the sampling window during which at least one kernel was executing on the GPU. It is silent on how many Streaming Multiprocessors (SMs) were active, what fraction of memory bandwidth was used, and whether the active kernel did useful work. A 5%-occupancy single-SM kernel that runs for the whole window shows up as "100% utilised." Power draw correlates with the composite picture (SM occupancy, memory bandwidth, frequency state), not with the nvidia-smi headline.

For power-planning purposes, the useful triple is GPU power draw (DCGM), SM occupancy plus memory bandwidth utilisation (Nsight Compute), and arithmetic intensity vs roofline. These together explain the power footprint and predict how the footprint will change when the workload mix or model size shifts. Power-planning that uses only the GPU-busy headline systematically misestimates both the average and the peak draw; we see this pattern regularly in audits of fleets sized against nvidia-smi dashboards.

How do I compute total cost of ownership per useful FLOP rather than per purchased FLOP?

Useful FLOPs are the FLOPs the workload's actual computation requires (matmul, conv, attention), measured at kernel level, not the spec sheet peak.
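To make the definition of useful FLOPs concrete, here is a minimal sketch for a dense matmul. Multiplying an m×k matrix by a k×n matrix takes roughly 2·m·n·k floating-point operations (one multiply and one add per inner-product term). Dividing the FLOPs the workload actually executed by what the spec-sheet peak would allow over the same wall-clock time gives the useful fraction. The function names and the example numbers are illustrative assumptions, not measurements.

```python
def matmul_flops(m, n, k):
    """Approximate FLOPs for an (m x k) @ (k x n) dense matmul."""
    return 2 * m * n * k


def useful_fraction(useful_flops, elapsed_s, peak_flops_per_s):
    """Share of the spec-sheet peak that was spent on useful work."""
    if elapsed_s <= 0 or peak_flops_per_s <= 0:
        raise ValueError("elapsed_s and peak_flops_per_s must be positive")
    return useful_flops / (elapsed_s * peak_flops_per_s)


if __name__ == "__main__":
    # Hypothetical: 1000 square matmuls of size 4096 observed in one second
    # on an accelerator with an assumed 1e15 FLOP/s spec-sheet peak.
    executed = 1000 * matmul_flops(4096, 4096, 4096)
    print(f"useful fraction: {useful_fraction(executed, 1.0, 1e15):.3f}")
```

In practice the executed-FLOP count would come from kernel-level profiling rather than from a closed-form count; the closed form is only a sanity check for simple layers.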
For the power-conditional TCO: include power and cooling alongside hardware amortisation, then divide by annualised useful FLOPs. Teams that adopt this metric typically discover their existing fleet has 2–5× the useful capacity the procurement plan assumed, which directly translates to deferred power-purchase commitments. The 2–5× is an observed range across our audit engagements; the upper bound is most common in fleets where mixed precision has not yet been adopted.

The operational version of the metric: instrument the workload to log per-kernel FLOPs and per-period GPU power draw; aggregate into useful-FLOPs-per-kWh as the headline efficiency number. This number, tracked over time, exposes whether optimisations actually move the cost-per-output needle or just shift work between bottlenecks.

Which workload patterns most often leave GPU capacity on the table?

Four patterns recur and each has a distinct power signature.

| Pattern | Power signature | Useful-work signal | Typical recovery |
|---|---|---|---|
| Host-bound (CPU can't feed GPU) | Moderate, with idle troughs | SMs idle large fraction of time | 20–80% (observed) |
| Small-batch | Sustained but inefficient | High launch/sync overhead | 30–100% (observed) |
| Memory-bound run as compute-bound | Below nameplate plateau | Bandwidth saturated before compute | 30–200% (observed) |
| Mixed-precision ignored (fp32 only) | 2–4× higher per useful FLOP | bf16/int8 path unused | 2–4× useful FLOPs (observed) |

Host-bound workloads pay for capacity they barely consume. Small-batch workloads burn launch and synchronisation overhead; GPU power is sustained but useful work per Joule is poor. Memory-bound kernels run at lower power than the silicon can sustain because bandwidth saturates before compute does; the throughput plateau is at a lower power point than the nameplate implies. Mixed-precision opportunities ignored: a workload running in fp32 where bf16 or int8 would deliver 2–4× throughput consumes 2–4× more power per useful FLOP than necessary.
Each of the four patterns costs the data center money; together they typically account for the majority of stranded capacity we find in audits.

Should I procure additional GPU capacity or first profile the utilisation of what I have?

Profile first, always, before any additional power or accelerator procurement. A GPU performance audit measures actual utilisation and power per workload, identifies where capacity (and the associated power) is wasted, and quantifies the achievable improvement. In our experience, the audit typically recovers 30–80% of the headroom that procurement would otherwise have paid to buy, and the power-purchase-agreement implications often dwarf the hardware cost. The 30–80% range is an observed pattern; outliers in either direction exist when workloads are already well-tuned or, conversely, when no optimisation work has ever been done.

The audit also exposes when additional procurement is genuinely needed: the workload saturates well-optimised existing capacity, growth demands capacity beyond the optimisation envelope, or the new workload mix has a different power profile that the existing infrastructure cannot serve. Those cases become easier to defend to finance after the audit because the number rests on measurement, not on spec-sheet multiplication.

There is one more axis the procure-versus-profile decision has to carry: where the wasted spend lands on the bill. On cloud (AWS, Azure), underutilisation shows up directly as billed instance-hours: you pay the published per-GPU-hour rate whether the silicon is saturated or idle, so a host-bound or small-batch workload bleeds money every hour the reservation is live. On-prem, the same waste is buried in already-committed capex and in the power-purchase agreement: the stranded power capacity is paid for regardless of draw, and the marginal cost of one underused GPU is harder to see because it is amortised.
The practical consequence is that cloud waste is faster to detect (it is itemised) but on-prem waste is larger in aggregate (it is pre-committed). Either way, profiling the workload before scaling the reservation or the PPA is the move that converts an invisible recurring loss into a measured, fundable optimisation.

For the broader conversation about why GPUs are the dominant capacity-planning unit for AI in the first place, our piece on why GPUs are the bottleneck unit for AI sits upstream of this one.

What cost savings are realistic from optimising utilisation versus renting more cloud GPUs?

Audit ranges, with the power axis explicit.

- Host-pipeline fixes (data loading, preprocessing placement, async transfers) recover 20–80% useful FLOPs without changing average GPU power, dropping power-per-useful-FLOP proportionally.
- Batch-size and kernel-fusion fixes deliver 30–100% useful FLOPs at moderately higher average power; net power-per-useful-FLOP improves substantially.
- Mixed-precision adoption delivers 2–4× useful FLOPs at lower average power per FLOP (the multiply-accumulate units for fp16/bf16/int8 draw less than the fp32 equivalents).
- Memory-layout and bandwidth optimisations recover 30–200% on memory-bound kernels; power often stays flat as bandwidth saturates earlier, so power-per-useful-FLOP improves dramatically.

In aggregate, audited fleets commonly absorb 2–3× the current workload without additional procurement, and the power-purchase-agreement implications make this the highest-leverage optimisation a CFO can fund. These ranges are observed across our engagements rather than externally benchmarked; the per-fleet number depends on starting state.

Limitations that remained

The power axis adds discipline to capacity planning but does not eliminate the rest of the work. Power profiling needs the same tooling (DCGM, Nsight Compute) and the same operational discipline as FLOPs profiling; organisations new to GPU operations underestimate the ramp.
Some workloads are genuinely close to optimal; the audit confirms procurement is needed. Cooling and PUE assumptions depend on data center context that the per-workload audit cannot fully model; facilities engineering needs to be in the loop for the full picture. Mixed-precision adoption needs accuracy validation that some teams treat as optional; rushed adoption produces correctness regressions that look like model failures but are precision artefacts.

How TechnoLynx Can Help

TechnoLynx runs GPU performance audits that include the power-and-cooling axis explicitly, exposing the power-per-useful-FLOP picture before the next procurement or power-purchase-agreement signing. If you are about to commit power capacity based on nameplate TDP arithmetic, contact us for a workload-conditional audit first.

Frequently Asked Questions

How does GPU underutilisation on cloud procurement differ from on-prem, and where does the wasted spend show up on each bill?

On cloud (AWS, Azure) the waste is itemised: you pay the published per-GPU-hour rate for every hour a reservation is live, so a host-bound or small-batch workload bleeds money directly into billed instance-hours whether the silicon is saturated or idle. On-prem, the same waste is buried in committed capex and in the power-purchase agreement: the stranded power capacity is paid for regardless of draw, and one underused GPU is hard to spot because its cost is amortised across the fleet. Cloud waste is faster to detect because it is line-itemed; on-prem waste tends to be larger in aggregate because it is pre-committed. In both cases the fix is the same: profile the workload before scaling the reservation or signing the next PPA.

How do I tell whether a cloud GPU reservation is overprovisioned before the next renewal?

Capture per-GPU power draw with DCGM and useful FLOPs with Nsight Compute over a representative one-to-two-week window, then compute power and capacity per useful FLOP against what you are billed for.
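One way to turn such a capture into a single number, as a minimal sketch: integrate the sampled power into energy and divide the useful FLOPs by it. The sketch assumes the power samples have already been exported (for instance from DCGM) as (timestamp in seconds, watts) pairs and that the useful-FLOP total comes from a profiler. It calls no DCGM or Nsight API, and all names and numbers are hypothetical.

```python
def energy_kwh(samples):
    """Trapezoidal integration of (t_seconds, watts) samples into kWh."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        joules += 0.5 * (p0 + p1) * (t1 - t0)   # average watts x seconds
    return joules / 3.6e6                       # 1 kWh = 3.6e6 J


def useful_flops_per_kwh(useful_flops, samples):
    """Headline efficiency number: useful FLOPs delivered per kWh drawn."""
    energy = energy_kwh(samples)
    if energy <= 0:
        raise ValueError("measured energy must be positive")
    return useful_flops / energy


if __name__ == "__main__":
    # Hypothetical one-hour capture at a flat 300 W, sampled every 60 s.
    capture = [(t, 300.0) for t in range(0, 3601, 60)]
    print(f"energy: {energy_kwh(capture):.3f} kWh")
    print(f"useful FLOPs per kWh: {useful_flops_per_kwh(3e18, capture):.3e}")
```

The same function works per GPU or per fleet, provided the FLOP total and the power samples cover the same devices and the same time window.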
If the workload is host-bound, small-batch, memory-bound, or stuck in fp32, the useful-work fraction will sit well below the instance-hours you pay for. Renewing or scaling the reservation against the nvidia-smi headline rather than this profile is what locks in the recurring overspend.

Does the power-per-useful-FLOP metric still matter when GPUs are rented rather than owned?

Yes. On rented capacity the power cost is folded into the per-GPU-hour rate, so the metric reframes as useful work per billed hour, but the diagnosis is identical. A workload that wastes power per useful FLOP on-prem wastes billed hours on cloud, because both are paying for capacity the workload never converts into useful work. Tracking useful-FLOPs-per-kWh (or per billed hour) over time is what tells you whether an optimisation actually moved the cost-per-output needle or just shifted the bottleneck.

Image credits: Freepik