An inference request crosses a p95 latency alert, and the first instinct is to open the APM dashboard the team already pays for. The trace waterfall lights up a red span inside the model-serving service, and someone proposes replacing the model. That decision is often wrong, and the dashboard that prompted it cannot tell you whether it is wrong, because the span it coloured red is the outside of a box it never looked inside.

This is the most common misdiagnosis we see when teams try to manage AI inference performance with the application performance management (APM) tools they adopted for ordinary microservices. APM is excellent at what it instruments. The failure is not in the tool; it is in treating a request-scoped tracing tool as if it could see the GPU kernel time, batching policy, and quantisation overhead that actually determine inference cost and latency. Acting on APM alone is how teams launch a model-replacement project that profiling inside the serving boundary would have shown to be unnecessary.

How Does Application Performance Management Actually Work, and What Does It Mean in Practice?

APM tools (Datadog APM, New Relic, Dynatrace, Elastic APM, and open-source stacks like OpenTelemetry with Jaeger or Tempo) instrument an application at the request boundary. They attach a trace context to an incoming request, propagate it across service-to-service hops, and record the wall-clock duration of each span: the API gateway, the auth service, the database call, the model-serving call. The output is a waterfall that shows where time went between services and an aggregate view of request latency percentiles and error rates.

In practice, this means APM answers questions of the form “which service is slow” and “how often does a request fail”. It is a distributed-systems lens.
It was designed for a world where the expensive thing is a network hop, a lock-contended database, or a misbehaving downstream dependency. For that world, it is the right tool, and you should keep running it.

The trouble starts at the model-serving span. To APM, the call into your inference server (Triton Inference Server, TorchServe, a vLLM endpoint, a custom FastAPI wrapper around PyTorch) is a single duration: request in, response out, N milliseconds. Everything that happens inside that duration, and everything that determines whether N is large, is opaque.

What Can APM See Inside an Inference Serving Path, and What Is It Blind To?

The divergence is sharp enough to draw as a table. APM instruments the request; it does not instrument the math.

| Signal | Visible to APM? | Where it actually lives |
| --- | --- | --- |
| Request latency (p50/p95/p99) | Yes | Span duration at the serving boundary |
| Error rate, status codes | Yes | Response metadata |
| Service-to-service hop latency | Yes | Distributed trace context |
| Queue wait before the request reaches the model | Sometimes (if the queue is a separate service) | Often inside the serving process, where it is invisible |
| Dynamic batching behaviour | No | Inside the inference server’s scheduler |
| GPU kernel execution time | No | CUDA stream, surfaced only by a GPU profiler |
| Memory-copy / PCIe transfer overhead | No | Host-to-device traffic, GPU profiler territory |
| KV-cache hit rate (for LLM serving) | No | Inside the serving runtime |
| Quantisation / dequantisation overhead | No | Inside the kernel path |
| Cost-per-request | No | Derived from GPU utilisation × instance price, not latency |

The pattern in that table is consistent: APM owns everything up to and including the serving boundary, and is blind to everything inside it. That blindness is not a defect; it is the design. A request-scoped tracer cannot afford to instrument every CUDA kernel; the overhead would dwarf the workload. So it draws the box and measures the box, and the contents stay dark.

This is the failure class.
When the bottleneck lives inside the box (a batching policy that holds requests too long, a kernel that falls back to FP32 because a TensorRT engine was built for the wrong precision, a model whose layers serialise badly on the available hardware), APM shows you a slow span and nothing else. The reddest line on the waterfall is the symptom, not the cause.

Why Does My APM Dashboard Show High Latency but Not Tell Me Whether the Model, Runtime, or Hardware Is the Cause?

Because all three present identically to a request-scoped tracer. A model that is genuinely too large, a runtime that is misconfigured, and a GPU that is saturated by a neighbouring tenant all produce the same observable: a long serving span. APM cannot disambiguate them, and the three fixes could not be more different: a model swap, a runtime or batching change, and a hardware or placement change, respectively.

This is exactly where the expensive mistake happens. Faced with a slow span and pressure to act, the path of least resistance is to blame the most visible artefact: the model. Model replacement is a multi-month project: re-training or re-fine-tuning, re-validation, re-deployment, sometimes a new vendor contract. And in a meaningful share of the cases we have seen, the GPU-level profile later exonerates the model entirely. The fix was a batching policy or a precision setting that a profiler surfaces in an afternoon.

The early warning sign that you are in this trap is recognisable: the APM trace is the only evidence in the room, and the proposed remedy is the most expensive one available. When the diagnosis rests on a single tool that cannot see the layer where the proposed fix lives, the diagnosis is not yet decision-grade.

Resolving the kernel- and batching-level detail requires a different instrument: the kind of GPU profiling that exposes kernel time, memory traffic, and scheduler behaviour that request tracing structurally cannot reach.
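
To make that ambiguity concrete, here is a minimal sketch of three serving calls whose internal breakdowns differ but whose span durations are identical. The `ServingBreakdown` type, the scenario names, and every number are hypothetical and chosen only for illustration; real breakdowns would come from a profiler, not from constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingBreakdown:
    """Hypothetical time breakdown (ms) inside one model-serving call."""

    queue_wait_ms: float      # time the request waits in the batcher
    host_to_device_ms: float  # memory-copy overhead
    kernel_ms: float          # GPU kernel execution time

    def span_duration_ms(self) -> float:
        # A request-scoped tracer records only this sum.
        return self.queue_wait_ms + self.host_to_device_ms + self.kernel_ms


def dominant_component(b: ServingBreakdown) -> str:
    """Name the largest contributor, which only a profiler could observe."""
    parts = {
        "queue_wait": b.queue_wait_ms,
        "host_to_device": b.host_to_device_ms,
        "kernel": b.kernel_ms,
    }
    return max(parts, key=parts.get)


# Illustrative scenarios; all figures are invented for this sketch.
scenarios = {
    "model too large": ServingBreakdown(5.0, 15.0, 180.0),
    "batcher holds requests too long": ServingBreakdown(150.0, 10.0, 40.0),
    "GPU shared with a noisy neighbour": ServingBreakdown(60.0, 50.0, 90.0),
}

for name, b in scenarios.items():
    print(
        f"{name}: span={b.span_duration_ms():.0f} ms, "
        f"dominant={dominant_component(b)}"
    )
```

In this toy data the tracer's view (the span duration) is the same for every scenario, while the largest contributor, and therefore the cheapest fix, differs from one to the next.
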

How Do APM Latency Metrics Relate to Cost-Per-Request?

They do not translate directly, and this is the second blind spot that catches teams. A p95 latency number tells you how long the slow tail of requests takes. It tells you nothing about how much each request costs, because cost is a function of how efficiently the GPU is used (utilisation, batch size, instance price per hour), not of wall-clock duration.

Two deployments can have identical p95 latency and a 3× difference in cost-per-request, because one batches efficiently and saturates the GPU while the other runs one request at a time on an under-utilised accelerator. APM sees both as “fine, latency is within SLO.” The cost difference is invisible to it. This is why cost-per-request, not latency, is the right optimisation target for production inference: it is the only metric that ties the serving behaviour to the spend.

There is a subtler trap here too: high GPU utilisation on a monitoring dashboard is not proof of efficiency, and low utilisation is not proof of waste. Utilisation counts whether the GPU is busy, not whether it is doing useful work, a point that benchmark numbers regularly miss when they reflect synthetic workloads rather than real ones. Reading a utilisation graph as if it were a cost graph is its own misdiagnosis.

Does APM Use AI, and Can Its AI Features Detect Inference-Specific Bottlenecks?

Yes, modern APM tools use AI, and no, it does not close this gap. Dynatrace’s Davis, New Relic’s Applied Intelligence, and Datadog’s Watchdog apply anomaly detection and correlation over the telemetry the platform already collects: latency, error rates, trace topology, infrastructure metrics. They are genuinely useful for catching a latency regression early, clustering related alerts, and pointing at the service that changed. But an anomaly-detection model can only reason over the signals it ingests, and those signals stop at the serving boundary.
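
As a rough illustration of that limit, the sketch below runs a simple rolling z-score detector over serving-span durations. This is a deliberately simplistic stand-in, not how Davis, Applied Intelligence, or Watchdog work internally; the function name, window, threshold, and sample durations are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_latency_anomalies(durations_ms, window=5, threshold=3.0):
    """Return indices whose duration sits more than `threshold` standard
    deviations above the mean of the preceding `window` samples."""
    flagged = []
    for i in range(window, len(durations_ms)):
        baseline = durations_ms[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma > 0 and (durations_ms[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical serving-span durations (ms); the step up at index 6
# stands in for a regression after a config or runtime change.
span_durations_ms = [100, 102, 98, 101, 99, 100, 180, 182, 179]

print(flag_latency_anomalies(span_durations_ms))
```

The detector can locate when the span became slow, but its only input is a list of durations. Nothing about batching delay or kernel precision is present in that input, so no amount of modelling over it can name either as the cause.
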
The AI in an APM tool will reliably tell you that a serving span got slower at 14:32. It cannot tell you that the cause was the dynamic batcher’s max-queue-delay being set too high, or that a model fell back from an INT8 to an FP16 kernel after a runtime upgrade, because those signals were never instrumented. The intelligence is real; the input data has the same blind spot the raw traces do.

When Should I Move from APM Tracing to GPU-Level Profiling?

The transition point is a diagnostic, not a calendar event. Use the rubric below.

Stay with APM when:

- The slow span is between services (gateway, auth, database, queue); APM owns this territory completely.
- Error rates, not latency, are the problem.
- The serving span is fast and within SLO; the latency is accumulating elsewhere.

Move to GPU-level profiling when:

- The serving span is the dominant, confirmed contributor to p95 latency.
- Cost-per-request is the problem and latency is within SLO; APM has no signal for this.
- The proposed remedy is a model swap, a runtime migration, or a hardware change, and the evidence for it is a single APM trace.
- Utilisation looks high but throughput is low, or vice versa.

The clean division of labour: use APM to confirm the request is slow and to rule out the rest of the system; then profile inside the serving path to find the actual cause. APM narrows the suspect to the serving boundary. A GPU profiler such as Nsight Systems or the PyTorch profiler, read alongside Triton’s own metrics, opens the box. The two are complements, not competitors. The same logic applies to choosing the right instrument once you are inside: knowing what each profiling tool measures and how to read its output is what turns a profile into a decision.

How Does an Inference Cost Audit Complement the APM Tooling I Already Run?

An AI inference cost audit profiles inside the serving boundary that APM cannot reach and names the actual bottleneck, then ties it back to the metrics that matter for a spend decision: p95 latency before and after, cost-per-request, and GPU utilisation. It does not replace your APM. APM remains the right tool for the request-level and service-level view, and most audits start from an APM trace that has correctly identified the serving span as the suspect.

What the audit adds is the layer the dashboard cannot instrument. We take the slow span APM hands us and profile the batching, the kernel time, the precision path, and the cache behaviour, so the eventual fix is the cheapest one that works rather than the most visible one. In our experience, that frequently means proving a model replacement unnecessary; the audit’s most valuable output is sometimes the project it cancels. This is the work behind the AI Inference Cost Audit, and the structure of the engagement is described in the inference cost-cut pack.

What About Open-Source and Free APM Tools? Do They Have the Same Blind Spot?

They hit the wall in exactly the same place. OpenTelemetry, Jaeger, Zipkin, Tempo, SigNoz, and the Elastic APM stack are all request-scoped distributed tracers by design. They are genuinely good: for many teams, an OpenTelemetry-plus-Tempo stack is all the request-level observability they need, and it costs nothing in licensing. But the blind spot is architectural, not commercial. A free tracer and a six-figure commercial APM contract draw the same box around the model-serving call and measure the same span duration. Neither instruments the CUDA stream. Spending more on APM does not buy visibility into the serving path; it buys better correlation, retention, and alerting over the same request-scoped signals. The layer where inference cost and latency are actually decided is reached by a profiler, regardless of which APM tool drew the box.
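
To show why an identical span duration can hide very different spend, here is a worked sketch of the cost-per-request arithmetic discussed earlier. It assumes a single GPU instance billed by the hour and a sustained throughput set by how well requests are batched; the hourly price and the request rates are invented figures, not measurements.

```python
def cost_per_request(instance_price_per_hour: float,
                     requests_per_second: float) -> float:
    """Cost of one request, in the same currency as the hourly price.

    Assumes the instance is billed for the full hour and serves a steady
    `requests_per_second` throughout.
    """
    requests_per_hour = requests_per_second * 3600
    return instance_price_per_hour / requests_per_hour


# Hypothetical: the same instance price and the same p95 latency,
# but different sustained throughput because of batching behaviour.
PRICE_PER_HOUR = 3.00  # assumed hourly price, arbitrary currency

one_at_a_time = cost_per_request(PRICE_PER_HOUR, requests_per_second=20)
well_batched = cost_per_request(PRICE_PER_HOUR, requests_per_second=60)
ratio = one_at_a_time / well_batched

print(f"one-at-a-time: {one_at_a_time:.6f} per request")
print(f"well-batched:  {well_batched:.6f} per request")
print(f"ratio:         {ratio:.1f}x")
```

Latency does not appear anywhere in the function. Under these assumptions the cost gap comes entirely from throughput per billed hour, which is why a tracer that records only span duration has nothing to say about it.
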

FAQ

How do application performance management tools work, and what does that mean in practice?

APM tools attach a trace context to each incoming request and record the wall-clock duration of every service-to-service hop, producing a waterfall view plus aggregate latency percentiles and error rates. In practice they answer which service is slow and how often a request fails: a distributed-systems lens designed for network hops, databases, and downstream dependencies. They treat the call into your inference server as a single duration and do not look inside it.

What can APM tools actually see inside an AI inference serving path, and what are they blind to?

APM sees request latency, error rates, and service-to-service hop timing up to and including the model-serving boundary. It is blind to everything inside that boundary: dynamic batching behaviour, GPU kernel execution time, memory-copy overhead, KV-cache hit rates, and quantisation overhead. That blindness is by design: instrumenting every CUDA kernel at request scope would cost more than the workload itself.

Why does my APM dashboard show high latency but not tell me whether the model, runtime, or hardware is the cause?

Because a too-large model, a misconfigured runtime, and a saturated GPU all present identically as a long serving span. APM cannot disambiguate the three, yet the corresponding fixes (a model swap, a batching or runtime change, and a hardware or placement change) could not be more different. Acting on the trace alone is how teams launch an expensive model-replacement project that GPU-level profiling would have shown to be unnecessary.

How do APM metrics like p95 latency relate to cost-per-request for an inference deployment?

They do not translate directly. Latency measures how long the slow tail takes; cost-per-request is a function of GPU utilisation, batch size, and instance price.
Two deployments with identical p95 latency can differ several-fold in cost-per-request because one batches efficiently and saturates the GPU while the other runs one request at a time, a difference APM cannot see.

When should I move from APM tracing to GPU-level profiling to find an inference bottleneck?

Move when the serving span is the confirmed dominant contributor to p95 latency, when cost-per-request is the problem but latency is within SLO, or when the proposed remedy is a model swap, runtime migration, or hardware change supported only by a single APM trace. Stay with APM when the slow span sits between services or when error rates rather than latency are the issue. Use APM to confirm the request is slow and rule out the rest of the system, then profile to find the cause.

How does an inference cost audit complement the APM tooling I already run?

The audit profiles inside the serving boundary APM cannot reach (batching, kernel time, precision path, cache behaviour) and names the actual bottleneck, then ties it to p95 latency before and after, cost-per-request, and GPU utilisation. It does not replace APM, which remains the right tool for the request- and service-level view; most audits start from an APM trace that has correctly flagged the serving span. Its most valuable output is sometimes proving a model replacement unnecessary.

Does APM use AI, and can the AI features in modern APM tools detect inference-specific bottlenecks like batching or GPU kernel time?

Modern APM tools do use AI (anomaly detection and correlation in features like Dynatrace Davis, New Relic Applied Intelligence, and Datadog Watchdog), and they are useful for catching latency regressions and clustering alerts. But an anomaly-detection model can only reason over the signals it ingests, and those stop at the serving boundary. It will tell you a span got slower; it cannot tell you the cause was a max-queue-delay setting or an INT8-to-FP16 kernel fallback.

What are some examples of open-source or free APM tools, and where do they hit the same model-serving blind spot as commercial ones?

OpenTelemetry, Jaeger, Zipkin, Tempo, SigNoz, and the Elastic APM stack are capable, free or low-cost request-scoped tracers. They hit the wall in exactly the same place as commercial APM: they draw a box around the model-serving call and measure its duration, but never instrument the CUDA stream inside it. The blind spot is architectural, not commercial: spending more on APM buys better correlation and retention over the same signals, not visibility into the serving path.

Where This Leaves the Dashboard

The next time an inference span turns red, the useful question is not “which model do we replace?” but “what evidence do we have below the serving boundary?” If the answer is “only the APM trace,” the diagnosis is incomplete by construction: the tool that raised the alarm cannot see the layer where the fix lives. APM tells you a request is slow and reliably rules out the rest of the system; that is exactly the right starting point, and exactly the wrong stopping point. The named failure class here is symptom-driven model replacement, and the artefact that closes it is profiling inside the serving boundary the dashboard was never built to reach.