GPU Performance Settings for AI: Persistence Mode, Power Limits, MIG, and NUMA Pinning

Q: How does MIG change the economics of multi-tenant inference?

Partitions A100/H100 into up to seven isolated instances with dedicated compute/memory/bandwidth. Models fitting 10–20GB with sub-saturation batch sizes benefit; large models or saturating workloads do not. Configuration is per-host, held stable.

Q: What does persistence mode do, and when should it be enabled for AI serving?

Keeps driver/firmware context loaded between processes, eliminating 1–3s cold-start penalty on A100/H100. Enable on inference-serving hosts (`nvidia-smi -pm 1`). Small idle power cost; trade overwhelmingly favourable for latency-sensitive serving.

Q: When do power limits at factory default leave performance on the table?

Factory cap balances peak performance against thermal/electrical limits for the worst-case workload mix. Sustained training in a well-cooled data centre can benefit from raising the cap; bursty inference rarely benefits; thermally constrained environments can benefit from lowering the cap to avoid throttling cycles. Workload-dependent.

Q: Why does NUMA pinning matter for multi-GPU AI workloads?

Threads on a socket not directly attached to the GPU pay a cross-socket memory penalty on every CPU→GPU transfer. Co-locate threads/memory on the socket connected to the GPU (`numactl` or orchestrator equivalents). Data-loader-bound workloads can see 10–25% improvement; compute-bound less.

Q: How do GPU clock settings interact with the rest of the configuration?

Fixed clocks eliminate dynamic-boost oscillation; improve latency tail consistency at cost of some peak. Clock target must be supportable at configured power cap, else throttling at the fixed clock.

Q: What is the audit sequence for surfacing the configuration gap?

Capture settings (`nvidia-smi -q`) + workload profile (DCGM). Cross-reference defaults vs profile-suggested changes. Test on non-prod, measure delta, roll to prod with documented rationale.

Introduction

GPU configuration left at NVIDIA defaults can cost 20–40% of measurable AI throughput on the workloads that pay for the hardware, and the cost is invisible until someone measures useful FLOPs against purchased FLOPs. Persistence mode unset, power limits at factory cap, MIG unused or misconfigured, NUMA pinning ignored: each setting is a single configuration command, but the compounded effect is large enough to change procurement decisions. This article maps the settings that materially affect AI workloads, what each does, and how to decide what to change. See the GPU engineering practice for the audit work that quantifies the gap before procurement reflexively adds more capacity.

The naive read is “the GPU runs at the speed it runs; the workload determines throughput.” The expert read is that the workload does determine throughput, and the configuration determines how much of the GPU’s capability the workload actually gets to use.

What this means in practice

Persistence mode eliminates the 1–3s cold-start penalty on first GPU access — critical for serving.
Power limits at factory cap may be sub-optimal for sustained AI workloads; tuning is workload-dependent.
MIG partitions a single H100/A100 into multiple isolated instances; multi-tenant inference benefits.
NUMA pinning matters when GPUs and host CPUs are on different sockets; multi-GPU training especially.

What does persistence mode do, and when should it be enabled for AI serving?

Persistence mode keeps the NVIDIA driver and the GPU firmware context loaded between processes. Without it, the driver unloads the context when no process is using the GPU; the next process to access the GPU incurs a 1–3 second cold-start penalty on A100/H100 class hardware. For interactive inference serving, this cold start translates directly to user-visible latency on the first request after an idle period.

Enable persistence mode (`nvidia-smi -pm 1`) on inference-serving hosts and on any host where workloads expect immediate GPU availability. The cost is a small amount of idle power (the GPU stays in a higher power state than fully unloaded). For latency-sensitive serving the trade is overwhelmingly favourable. For batch-only workloads with infrequent GPU access, persistence mode is less critical but rarely harmful.
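As a rough illustration of checking this across a host, the sketch below parses the CSV output of `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader` and lists the GPUs where persistence mode is off. The helper names are hypothetical and the sample text stands in for real output. On many current driver installs the `nvidia-persistenced` daemon is the preferred way to keep the setting across reboots, so treat the generated commands as a starting point, not a deployment recipe.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]


def gpus_without_persistence(csv_text: str) -> list[int]:
    """Return indices of GPUs whose persistence mode is reported as Disabled."""
    disabled = []
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        if mode.lower() == "disabled":
            disabled.append(int(index))
    return disabled


def read_persistence_csv() -> str:
    """Run nvidia-smi on the local host (needs the NVIDIA driver installed)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def enable_commands(indices: list[int]) -> list[str]:
    """Build the per-GPU commands an operator would run with root privileges."""
    return [f"nvidia-smi -i {i} -pm 1" for i in indices]


# Hypothetical two-GPU host: GPU 0 already enabled, GPU 1 not.
sample = "0, Enabled\n1, Disabled\n"
print(enable_commands(gpus_without_persistence(sample)))
```

On a real host, `read_persistence_csv()` would replace the `sample` string; it is kept separate so the parsing logic can be exercised without a GPU.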

When do power limits at factory default leave performance on the table?

The factory power cap is set to balance peak performance against thermal and electrical envelopes that cover the worst-case workload mix. For sustained AI workloads in a well-cooled data centre, the factory cap is sometimes lower than the GPU can sustainably run at. Raising the cap (`nvidia-smi -pl <watts>`) within the GPU’s supported range lets the boost clocks hold higher frequencies under sustained load, with measurable throughput gains.

The direction is workload-dependent. For training workloads that hold the GPU at its power cap for long stretches with cooling headroom to spare, raising the cap helps modestly. For inference workloads with bursty GPU utilisation, raising the cap rarely helps because the workload does not sustain at the cap long enough to matter. For workloads in thermally constrained environments (edge, partially cooled racks), lowering the cap can improve sustained throughput by avoiding thermal throttling cycles. The setting needs profiling against the actual workload rather than copying a number from a benchmark blog.
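To make "within the GPU's supported range" concrete, here is a minimal sketch that parses one line of `nvidia-smi --query-gpu=index,power.limit,power.default_limit,power.min_limit,power.max_limit --format=csv,noheader,nounits` and refuses to build a `-pl` command outside the reported range. The wattage values are invented for illustration and `PowerEnvelope` and the helper functions are hypothetical names; the right target still has to come from profiling the actual workload.

```python
from dataclasses import dataclass

# Real --query-gpu property names; the values used below are illustrative only.
QUERY_FIELDS = "index,power.limit,power.default_limit,power.min_limit,power.max_limit"


@dataclass
class PowerEnvelope:
    index: int
    current_w: float
    default_w: float
    min_w: float
    max_w: float


def parse_envelope(csv_line: str) -> PowerEnvelope:
    """Parse one line of `--format=csv,noheader,nounits` output."""
    index, current, default, low, high = (f.strip() for f in csv_line.split(","))
    return PowerEnvelope(int(index), float(current), float(default), float(low), float(high))


def power_limit_command(env: PowerEnvelope, target_w: float) -> str:
    """Return the nvidia-smi command for target_w, rejecting out-of-range values."""
    if not env.min_w <= target_w <= env.max_w:
        raise ValueError(f"{target_w} W outside supported range {env.min_w}-{env.max_w} W")
    return f"nvidia-smi -i {env.index} -pl {int(target_w)}"


# Hypothetical GPU 0: cap at its 300 W default, supported range 100-400 W.
env = parse_envelope("0, 300.00, 300.00, 100.00, 400.00")
print(power_limit_command(env, 350))
```

Keeping the default limit in the record makes it easy to revert if the measured throughput does not improve.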

How does MIG (Multi-Instance GPU) change the economics of multi-tenant inference?

MIG partitions an A100 or H100 into up to seven isolated instances, each with its own dedicated compute, memory, and memory bandwidth slice. For multi-tenant inference where each tenant runs a model that does not need the full GPU, MIG lets a single physical card serve multiple tenants concurrently with hardware-enforced isolation — no noisy-neighbour failure modes, no contention for the memory hierarchy.

The decision to use MIG depends on whether the inference workloads are sized for partial GPUs. Models that fit comfortably in 10–20 GB and have inference batch sizes that do not saturate the full GPU benefit from MIG; large models or high-throughput inference workloads that do saturate the full GPU do not. The configuration is set at the GPU level and changes require recycling workloads — operationally, MIG configurations are typically picked per host and held stable rather than reconfigured dynamically.
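The sizing question can be sketched as a small lookup. The example below assumes the nominal A100 80GB profile set and a 20% memory headroom for activations and caches; both the headroom figure and the `pick_profile` helper are assumptions for illustration, and usable memory per instance is slightly below the nominal number in the profile name.

```python
# Nominal A100 80GB MIG profiles: (name, memory in GB, max instances per GPU).
# Usable memory per instance is slightly below the nominal figure.
A100_80GB_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("7g.80gb", 80, 1),
]


def pick_profile(model_gb: float, headroom: float = 0.2):
    """Pick the smallest profile that fits the model plus a headroom fraction.

    Returns (profile_name, max_instances_per_gpu), or None if the model
    does not fit even on the full GPU.
    """
    needed = model_gb * (1 + headroom)
    for name, mem_gb, max_instances in A100_80GB_PROFILES:
        if needed <= mem_gb:
            return name, max_instances
    return None


print(pick_profile(8))    # small model: fits the smallest slice
print(pick_profile(15))   # mid-sized model: needs a 20 GB slice
print(pick_profile(70))   # too large for any slice once headroom is added
```

Memory is only half of the decision: a model that fits a small slice but saturates its compute share at the target batch size is still a poor MIG candidate, which is why the profile choice should be checked against measured throughput.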

Why does NUMA pinning matter for multi-GPU AI workloads?

Modern multi-socket servers have a NUMA topology: each CPU socket has its own local memory and its own PCIe lanes, and each GPU hangs off the lanes of one specific socket. A workload running on a CPU socket that is not directly connected to the GPU it is feeding pays a cross-socket memory penalty on every data transfer to the GPU. For data-intensive workloads (data loaders feeding training, large-batch inference with substantial preprocessing), this cross-socket cost shows up as reduced throughput and increased CPU usage.

NUMA pinning (`numactl --cpunodebind=N --membind=N`, or container-orchestrator equivalents) co-locates the workload’s CPU threads and memory allocations on the socket directly connected to the GPU. The throughput gain depends on the workload: data-loader-bound workloads can see 10–25% improvement, compute-bound workloads less. The setting is invisible without explicit topology awareness, and is one of the most common “we left throughput on the table” issues found in audits.
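One way to find the right `N` on Linux is to read the GPU's `numa_node` entry in sysfs. The sketch below assumes the bus ID comes from `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader`, which prints an 8-digit PCI domain, while sysfs uses a 4-digit lowercase form. The function names and the `train.py` workload are hypothetical.

```python
from pathlib import Path


def sysfs_numa_path(nvidia_bus_id: str) -> Path:
    """Map an nvidia-smi pci.bus_id (8-digit domain) to its sysfs numa_node file."""
    domain, rest = nvidia_bus_id.strip().split(":", 1)
    # sysfs uses a 4-digit domain and lowercase hex, e.g. 0000:3b:00.0
    return Path("/sys/bus/pci/devices") / f"{domain[-4:]}:{rest}".lower() / "numa_node"


def gpu_numa_node(nvidia_bus_id: str) -> int:
    """Read the GPU's NUMA node; the kernel reports -1 when it is unknown."""
    return int(sysfs_numa_path(nvidia_bus_id).read_text().strip())


def pinned_command(node: int, workload: list[str]) -> list[str]:
    """Prefix a command with numactl, or leave it unchanged if the node is unknown."""
    if node < 0:
        return workload
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *workload]


# Hypothetical example: a GPU at bus 3B that sysfs reports on NUMA node 1.
print(sysfs_numa_path("00000000:3B:00.0"))
print(" ".join(pinned_command(1, ["python", "train.py"])))
```

`nvidia-smi topo -m` shows the same affinity information in matrix form and is a useful cross-check before pinning anything in production.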

How do GPU clock settings interact with the rest of the configuration?

Clock settings (application clocks, graphics/memory clock limits) let the operator fix the GPU at specific clock targets rather than letting the dynamic boost algorithm pick. For sustained workloads where dynamic boost ends up oscillating (boost up to a thermal limit, throttle back, repeat), fixing clocks at a sustainable target eliminates the oscillation and produces more consistent throughput.

For inference workloads where consistent latency matters more than peak throughput, fixed clocks at a sustainable target improve latency tail consistency at the cost of some peak performance. For training workloads where throughput-over-time is the metric and brief throttling is acceptable, default dynamic boost is often fine. The interaction with power limits is that the clock target must be supportable at the configured power cap. If clocks are fixed at a target the power cap cannot sustain, the GPU still throttles below that target, and it does so without the gradual adaptation the dynamic algorithm would have provided.
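A simple way to check whether a locked clock is supportable at the current power cap is to sample the throttle reasons while the workload runs. The sketch below assumes samples collected with something like `nvidia-smi --query-gpu=clocks.sm,clocks_throttle_reasons.sw_power_cap --format=csv,noheader,nounits -l 1` after locking clocks with `nvidia-smi -lgc <min,max>`; the sample values are invented, the helper name is hypothetical, and newer drivers may expose the same information under differently named query fields.

```python
def power_cap_throttle_fraction(csv_text: str) -> float:
    """Fraction of samples in which the software power cap was limiting clocks.

    Each line is expected to look like "<sm_clock_mhz>, <Active|Not Active>".
    """
    lines = [line for line in csv_text.strip().splitlines() if line.strip()]
    if not lines:
        return 0.0
    active = sum(1 for line in lines if line.split(",")[1].strip() == "Active")
    return active / len(lines)


# Hypothetical samples: clocks locked at 1410 MHz, but the GPU drops below the
# target in half of the samples because the power cap is reached.
samples = """\
1410, Not Active
1410, Not Active
1275, Active
1260, Active
"""
print(power_cap_throttle_fraction(samples))
```

A fraction well above zero under steady load suggests lowering the clock target or raising the power cap; `nvidia-smi -rgc` resets locked GPU clocks to the default behaviour.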

What is the audit sequence for surfacing the configuration gap before procurement?

Five steps. Capture the current settings: `nvidia-smi -q` snapshots persistence mode, power limit, application clocks, MIG configuration, and ECC settings on each GPU. Capture the workload’s actual GPU utilisation, memory utilisation, and memory bandwidth utilisation across a representative period using DCGM or framework-level profiling.

Cross-reference: which settings are at default, and does the workload profile suggest a non-default setting would help? Test the changes on a non-production host with the same workload, measure the throughput delta, and decide whether to roll the change to production. Document the settings and the rationale — undocumented configuration changes get reverted by the next operator who assumes defaults are intentional. The audit sequence catches the easy wins before procurement adds GPUs for a problem configuration would have solved.
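The measure-and-document steps can be captured in a small record that travels with the change. This is a minimal sketch: the `TuningRecord` class, its field names, the host name, and the throughput figures are all hypothetical, and any format that survives operator turnover would serve equally well.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TuningRecord:
    host: str
    setting: str
    old_value: str
    new_value: str
    baseline_throughput: float  # e.g. samples/s or tokens/s before the change
    tuned_throughput: float     # same metric and workload, after the change
    rationale: str

    def delta_percent(self) -> float:
        """Relative throughput change against the baseline, in percent."""
        return 100.0 * (self.tuned_throughput - self.baseline_throughput) / self.baseline_throughput

    def to_json(self) -> str:
        """Serialise the record, including the computed delta, for the change log."""
        payload = asdict(self)
        payload["delta_percent"] = round(self.delta_percent(), 1)
        return json.dumps(payload, indent=2)


# Hypothetical measurement from a non-production host.
record = TuningRecord(
    host="gpu-test-01",
    setting="numa_pinning",
    old_value="unpinned",
    new_value="numactl --cpunodebind=1 --membind=1",
    baseline_throughput=1000.0,
    tuned_throughput=1120.0,
    rationale="Data loader threads were running on the socket not attached to the GPU.",
)
print(record.to_json())
```

Storing the old value next to the new one gives the next operator both the revert path and the reason the default was changed.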

Limitations that remained

Configuration tuning recovers throughput within the existing hardware envelope but cannot fix architectural mismatches — workloads that need more memory than the GPU has, workloads that need faster interconnect than the current generation provides, or workloads that need a different accelerator class entirely. Some tuning interacts with vendor support contracts in ways the operator should verify (aggressive power-limit raises on consumer-grade GPUs in particular). The configuration audit produces a list of changes; rolling those changes in production requires the operational discipline to verify and document, which not every team has the bandwidth to sustain.

How TechnoLynx Can Help

TechnoLynx runs GPU performance audits that capture the configuration gap, quantify the throughput recoverable from tuning, and produce the documented settings and rationale that survive operator turnover. If your AI infrastructure is running at vendor defaults and the throughput numbers are below what the hardware should deliver, contact us for an audit.

Image credits: Freepik