Phoronix Benchmark for GPU AI Testing: Setup, Results, and Interpretation

Phoronix provides reproducible AI-relevant GPU benchmarks

Unlike Geekbench or 3DMark, the Phoronix Test Suite (PTS) ships test profiles that exercise actual AI framework code: TensorFlow training benchmarks, PyTorch inference tests, ONNX Runtime profiles, and quantised LLM runs through llama.cpp. For comparing GPU hardware in a documented, reproducible way, PTS is more relevant to AI than most consumer benchmark alternatives — provided you read the numbers correctly. The trap is treating a PTS score as a forecast of production throughput. It is not. It is a controlled snapshot of one fixed workload running on a particular stack, and the gap between that snapshot and what your inference service will actually see can be larger than the gap between two competing GPUs.

We run PTS regularly in our engagements, mostly as a driver and stack validation tool. That framing — scaffolding, not verdict — is the one most teams miss.

Setting up Phoronix for GPU AI testing

# Install Phoronix Test Suite
wget https://phoronix-test-suite.com/releases/phoronix-test-suite-10.8.4.tar.gz
tar xzf phoronix-test-suite-10.8.4.tar.gz
cd phoronix-test-suite
sudo ./install-sh

# Run TensorFlow benchmark
phoronix-test-suite benchmark tensorflow

# Run PyTorch benchmark
phoronix-test-suite benchmark pytorch

# Run ONNX Runtime benchmark
phoronix-test-suite benchmark onnx

Each profile pins a model, a batch shape, and a precision. That pinning is what makes the run reproducible — and also what makes it narrow. The profile that runs on your test node is the same profile that ran on the published comparison node, which is exactly the property production workloads do not have.

What are the key AI-relevant Phoronix test profiles?

Profile	Model tested	Metric	What it measures
tensorflow	ResNet-50	Images/second	Training throughput, fixed batch
pytorch	ResNet-50, BERT	Items/second	Training/inference, fixed batch
onnx	ResNet-50	Latency/throughput	Inference framework path
llama-cpp	Quantised LLM	Tokens/second	CPU+GPU LLM inference

Note the pattern: every profile is single-stream, fixed batch, fixed model. That is the right shape for a reproducible test. It is the wrong shape for predicting an inference service that sees concurrent requests, variable sequence lengths, and queuing.

Interpreting Phoronix GPU benchmark results

A few interpretive rules we apply on every PTS result we read.

ResNet-50 training throughput is a useful relative comparison across training infrastructure but does not transfer cleanly to modern architectures. ViTs, diffusion U-Nets, and decoder-only transformers stress different parts of the GPU — attention kernels, memory bandwidth into HBM, tensor-core utilisation under FlashAttention — in different proportions than ResNet’s convolutional stack. A GPU that wins on ResNet-50 by a comfortable margin can lose on long-context attention because the bottleneck has moved.

ONNX Runtime inference tests use smaller models and batch sizes than production. GPU efficiency at batch=1 versus batch=32 is non-linear, and the curve is model-dependent. Scaling a PTS latency number linearly to production batch size is one of the most common misreadings we see.
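To see why linear extrapolation misleads, the minimal Python sketch below compares measured throughput at each batch size against what scaling linearly from the smallest batch would predict. The function name and the figures in `measured` are hypothetical placeholders, not results from any GPU; substitute readings from your own runs.

```python
def scaling_efficiency(measured: dict[int, float]) -> dict[int, float]:
    """Ratio of measured throughput to a linear extrapolation from the smallest batch.

    `measured` maps batch size -> items/second. A value of 1.0 means throughput
    grew in proportion to batch size; lower values mean sub-linear scaling.
    """
    base_batch = min(measured)
    per_item_rate = measured[base_batch] / base_batch
    return {
        batch: throughput / (per_item_rate * batch)
        for batch, throughput in sorted(measured.items())
    }


# Hypothetical items/second figures, for illustration only.
measured = {1: 100.0, 8: 640.0, 32: 1600.0}
for batch, eff in scaling_efficiency(measured).items():
    print(f"batch={batch:>2}: {eff:.0%} of the linear extrapolation")
```

The further the ratio falls below 100%, the more a linearly scaled batch=1 figure overstates what the larger batch delivers.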

Cross-submission comparisons require matching the software environment. PTS publishes community benchmark results, and the temptation is to compare your number to the leaderboard. Driver version, CUDA version, cuDNN version, and framework version each move the result. In our experience, differences in the software stack alone produce roughly 20–40% variation on identical hardware (observed pattern across the engagements where we have pinned hardware and varied stacks; not a published benchmark).
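One way to enforce that matching is to refuse a comparison until the stack components line up. The sketch below is a minimal illustration under stated assumptions: the key names and version strings are placeholders we made up, and it assumes you have already copied each result's driver, CUDA, cuDNN, and framework versions into a dictionary by hand.

```python
STACK_KEYS = ("driver", "cuda", "cudnn", "framework")


def stack_mismatches(ours: dict[str, str], theirs: dict[str, str]) -> dict[str, tuple]:
    """Return the stack components whose versions differ between two results."""
    return {
        key: (ours.get(key), theirs.get(key))
        for key in STACK_KEYS
        if ours.get(key) != theirs.get(key)
    }


# Placeholder version strings; fill these in from each result's system details.
ours = {"driver": "A", "cuda": "X", "cudnn": "M", "framework": "P"}
theirs = {"driver": "B", "cuda": "X", "cudnn": "M", "framework": "P"}

diff = stack_mismatches(ours, theirs)
if diff:
    print("Not directly comparable; differing components:", diff)
else:
    print("Stacks match on the tracked components.")
```

A non-empty result means any throughput difference between the two submissions may come from the stack rather than the hardware.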

Quick-answer block: what PTS does and does not tell you

Question	PTS answer quality
Is my GPU driver stack functional end-to-end?	High — a failing PTS run reliably indicates a problem
Will this GPU outperform that GPU on my workload?	Low — only on PTS’s fixed workload, not yours
How will the GPU scale to my batch and concurrency?	None — PTS is single-stream, fixed batch
Did a driver update regress AI throughput?	High — controlled before/after on identical hardware
What is my absolute production throughput?	None — PTS measures a proxy workload

This is the discipline that separates useful PTS use from misleading PTS use. The suite is a controlled-environment instrument. It is not a production predictor.

What does a Phoronix GPU test tell you about AI readiness?

Phoronix’s GPU benchmarks fall into three families: OpenGL rendering (Unigine, GpuTest), Vulkan compute (vkpeak), and framework-specific AI tests (PyTorch, TensorFlow, ONNX Runtime, llama.cpp). The AI-specific tests are the only ones that predict AI workload performance with reasonable accuracy — and even those measure a narrow slice of the surface a production system actually traverses.

The PyTorch benchmark in PTS runs ResNet-50 inference at a fixed batch size. That tells you whether the GPU, driver, and CUDA/cuDNN stack are correctly installed and functioning. It does not tell you how the GPU will behave on your specific model architecture, sequence length, or batch configuration. A Stable Diffusion run, an LLM inference run, and a ResNet-50 inference run stress different subsystems — compute units, memory bandwidth, tensor cores, the attention kernel path — in different proportions. The PTS profile gives you one point in a high-dimensional space.

We use PTS primarily as a driver validation tool. After installing or updating NVIDIA drivers on Linux, running the PTS PyTorch test confirms that the full chain — driver, CUDA runtime, cuDNN, PyTorch, model execution — is functional. A passing PTS result does not guarantee production readiness. A failing PTS result reliably indicates a stack problem. That asymmetry is the useful property.

For cross-vendor comparison (NVIDIA versus AMD), PTS provides a controlled environment where both vendors run the same test code. This eliminates the software-stack variable that confounds most ad-hoc cross-vendor comparisons. The catch: PTS framework tests typically do not use vendor-specific optimisations — no FlashAttention on the NVIDIA side, no MIOpen tuning on the AMD side — so the results reflect unoptimised baseline performance rather than what a production-tuned deployment would reach. In rough terms, we see the gap between PTS results and production-tuned performance run at roughly 20–40% on NVIDIA (where framework optimisations are mature) and 40–60% on AMD (where additional tuning effort is required). These are observed patterns from our deployments, not published benchmarks.

That is enough margin to make PTS useful for sanity-check ballpark comparison and unreliable for procurement decisions where the difference between two configurations is being weighed in single-digit percent.
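A rough way to make that concrete is to turn each PTS score into a range and check whether the ranges overlap. The sketch below assumes the gap is an uplift over the PTS score (our reading of the observed 20–40% band, not a definition from PTS), and the scores are hypothetical.

```python
def tuned_range(pts_score: float, gap_low: float, gap_high: float) -> tuple[float, float]:
    """Plausible production-tuned range, treating the gap as an uplift over the PTS score."""
    return (pts_score * (1 + gap_low), pts_score * (1 + gap_high))


def ranges_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical PTS scores for two candidate configurations, 5% apart.
config_a = tuned_range(1000.0, 0.20, 0.40)
config_b = tuned_range(1050.0, 0.20, 0.40)

if ranges_overlap(config_a, config_b):
    print("PTS alone cannot rank these two configurations.")
else:
    print("The gap between the configurations exceeds the tuning uncertainty.")
```

With a 5% difference in PTS score and a 20-point-wide uncertainty band, the two ranges overlap heavily, which is the situation where a procurement decision needs a workload-specific test.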

Comparing PTS results across driver versions

One under-used application of PTS is tracking AI performance across driver updates. Running the same profile before and after a driver update on identical hardware produces a controlled comparison that isolates the driver’s performance impact — exactly the kind of variable that disappears into noise in production traffic.

We maintain a PTS result database for our production GPU configurations. When evaluating a driver update — for example, moving from a 535.x line to a 550.x line — we run the PTS PyTorch and TensorFlow profiles on a test node before updating, then again after. A throughput change of more than 3% triggers investigation: either the new driver has introduced a regression (which we then report upstream) or it has enabled an optimisation worth understanding before rolling forward.
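The 3% rule is simple enough to automate. The following is a minimal sketch, not our actual tooling: the function names are illustrative, the readings are hypothetical, and it assumes both numbers come from the same profile on the same node.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional throughput change from `before` to `after` (negative means slower)."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return (after - before) / before


def needs_investigation(before: float, after: float, threshold: float = 0.03) -> bool:
    """Flag any change, faster or slower, larger than the threshold."""
    return abs(relative_change(before, after)) > threshold


# Hypothetical images/second readings from the same profile on the same node.
before, after = 1000.0, 940.0
change = relative_change(before, after)
verdict = "investigate" if needs_investigation(before, after) else "within tolerance"
print(f"{change:+.1%} -> {verdict}")
```

The check is deliberately symmetric: an unexpected speed-up is flagged as well, since it may signal an optimisation worth understanding before rolling forward.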

This approach has caught three significant driver regressions before they reached production over an 18-month window. In each case the PTS test showed a 5–12% throughput drop that had been invisible in manual production observation because it fell inside the normal variation of live traffic. The controlled, identical-workload comparison made the regression visible (observed in our own engagement history; not a benchmark you can run against our environment).

Why the PTS number is not the production number

This is the crux. PTS pins the workload. Production does not.

A real inference service runs multiple concurrent requests with variable arrival times. It queues. It batches dynamically. Sequence lengths vary across requests; KV-cache sizes vary across the lifetime of a session. Memory pressure interacts with concurrency in ways that single-stream throughput cannot reveal. The PTS number describes the GPU’s behaviour on one fixed point in that space. The production number is an integral over a distribution of points, weighted by the actual traffic pattern.

Dynamic batching makes this gap sharper, not softer. A serving stack like vLLM coalesces independent requests into batches on the fly, so the GPU’s observed utilisation is a function of arrival rate, queue depth, and sequence-length spread, none of which a single-stream PTS run exercises. That is also why a GPU can report 96–98% utilisation under concurrent, variable load and still deliver poor real-world throughput: the device is busy, but a large share of that work is padding, stalled KV-cache reads, and partially-filled batches rather than useful tokens. High utilisation measures occupancy, not goodput, and workload shape is what decides how far the two diverge.
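The padding effect can be illustrated with a small calculation. The sketch below assumes the simplest case, a batch padded to its longest sequence; real serving stacks differ in how much of this they avoid, and the token counts are hypothetical.

```python
def padding_fraction(sequence_lengths: list[int]) -> float:
    """Share of token slots that are padding when a batch is padded to its longest sequence."""
    if not sequence_lengths:
        raise ValueError("batch must contain at least one sequence")
    slots = max(sequence_lengths) * len(sequence_lengths)
    useful = sum(sequence_lengths)
    return 1 - useful / slots


# Hypothetical token counts for two four-request batches.
uniform = [512, 512, 512, 512]
mixed = [64, 128, 256, 2048]

print(f"uniform batch: {padding_fraction(uniform):.0%} padding")
print(f"mixed batch:   {padding_fraction(mixed):.0%} padding")
```

Both batches keep the device occupied for the full padded shape, but only the uniform one spends all of that time on useful tokens. A fixed-shape benchmark profile only ever exercises the uniform case.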

That structural mismatch is why otherwise-honest published benchmark numbers, PTS or otherwise, routinely fail to predict what a team sees in production. The benchmark is not wrong. The benchmark is answering a different question than the one procurement is asking. Realism is not a binary property the benchmark either has or lacks; it is a question of how close the benchmark’s workload shape sits to the production workload shape on the axes that matter for the architecture in question. The companion article on why benchmarks fail to match real AI workloads covers that structural gap in more detail.

When PTS is the right tool

PTS earns its place when the question matches its shape:

Driver and stack validation. After any change to the driver, CUDA, or cuDNN, a fixed PTS profile is the cheapest end-to-end confirmation that the chain still functions.
Regression detection across software updates. Identical hardware, identical profile, before-and-after — the controlled comparison PTS makes possible is hard to get any other way.
Baseline documentation for a new deployment. Recording PTS results for a freshly provisioned node creates a reference point for the next time something looks off.
Cross-vendor sanity check. Same test code on NVIDIA and AMD, unoptimised path — useful for ballpark, not for procurement margin.

PTS is the wrong tool when the question is “how will this GPU perform on my model under my traffic?” That question requires running your model under your traffic, on the candidate hardware, with the precision and executor configuration that production will actually use.

Frequently Asked Questions

How does dynamic batching in a serving stack like vLLM change observed GPU utilisation versus a single-stream Phoronix run?

A single-stream PTS profile pins one request at a time, so utilisation reflects a fixed batch on a fixed model. A stack like vLLM coalesces concurrent requests into batches dynamically, which means observed utilisation becomes a function of arrival rate, queue depth, and sequence-length spread. The same GPU can sit far below its PTS occupancy when traffic is sparse, or saturate with mostly-padding work when sequence lengths vary widely — neither of which the benchmark captures.

Why can a GPU show 96–98% utilisation and still deliver poor real-world throughput under concurrent load?

Utilisation measures occupancy (whether the device has work scheduled), not goodput, the useful tokens or items it actually completes. Under concurrent, variable request load, much of that occupied time can be padding inside partially-filled batches, stalled KV-cache reads, or scheduling overhead. So a GPU can read 96–98% busy while a large fraction of its cycles produce no useful output, which is why the headline utilisation number and the production throughput number can disagree sharply.

Should I use Phoronix Test Suite results to make a GPU procurement decision?

Not on their own. PTS runs single-stream, fixed-batch, fixed-model profiles without vendor-specific optimisations, and we observe the gap between PTS results and production-tuned performance running roughly 20–40% on NVIDIA and 40–60% on AMD (observed across our deployments; not a published benchmark). That margin is wide enough to make PTS a useful sanity check but unreliable when a procurement decision turns on single-digit-percent differences between configurations.

What is Phoronix Test Suite actually reliable for in an AI GPU context?

It is most reliable as a controlled-environment instrument: driver and stack validation, regression detection across software updates, and baseline documentation for a new node. A failing PTS run reliably indicates a stack problem, and a before-and-after run on identical hardware isolates a driver’s performance impact — which is exactly what disappears into noise in live traffic. It is the wrong tool for predicting how a GPU behaves on your model under your traffic.