# “Quantization in machine learning” is not one technique

The phrase “quantization in machine learning” routinely gets used as if it referred to a single, well-defined transformation — flip a switch, the model becomes faster, occasionally something breaks. It does not. Quantization in ML is a family of transformations, parameterized by what is being quantized, by which calibration data, by which method, and by which target format. Two models described as “quantized to INT8” can have substantially different accuracy and runtime behavior depending on which member of the family was applied and how.

Treating quantization as one thing produces a particular kind of mistake: generalizing a result obtained on one model family to another. The accuracy regression observed for INT8 on a convolutional vision model is not predictive of the regression for INT8 on a transformer language model, because the two have different activation-distribution properties that interact differently with low-precision representations. The generalization is not slightly off — it is structurally wrong.

## What does the technique actually do?

Quantization in ML is the practice of replacing higher-precision numerical representations — typically FP32 or FP16 weights and activations — with lower-precision ones, most commonly INT8 or INT4, under a calibration procedure that minimizes the resulting numerical error on a representative input distribution. Three things in that sentence carry weight.

The first is replacement. The lower-precision values stand in for the original values during inference. They are not approximations layered on top of the originals; they are the numerical representation the runtime uses end-to-end.

The second is calibration. The mapping from the original value range to the discrete set of representable values in the lower-precision format is not arbitrary. It is chosen to minimize the error introduced into the model’s behavior on inputs drawn from a calibration set. The calibration set is the implicit assumption of the quantized model — it is the workload distribution the quantization is optimized for.

The third is family. There is no single quantization scheme. There is symmetric versus asymmetric quantization, per-tensor versus per-channel scale factors, post-training quantization versus quantization-aware training, weight-only versus weight-and-activation quantization. Each combination produces a different accuracy/throughput trade-off and a different runtime kernel requirement.
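To make the replacement and calibration points concrete, here is a minimal sketch of one member of the family, assuming post-training, symmetric, per-tensor INT8 quantization with max-abs calibration in NumPy. It illustrates the mechanics only; the function names and the choice of max-abs calibration are this sketch's assumptions, not a reference implementation.

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits=8):
    """Pick a per-tensor scale from calibration data (max-abs calibration).

    The scale maps the observed value range onto the signed integer grid; it is
    the part of the scheme that encodes the calibration set's assumptions about
    the deployment workload.
    """
    max_abs = max(float(np.max(np.abs(b))) for b in calibration_batches)
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(x, scale, num_bits=8):
    """Replace FP32 values with their INT8 stand-ins (symmetric, per-tensor)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)

def dequantize(q, scale):
    """Recover approximate FP32 values; the difference is the quantization error."""
    return q.astype(np.float32) * scale

# Calibration batches stand in for the deployment input distribution.
rng = np.random.default_rng(0)
calib = [rng.normal(0.0, 1.0, size=1024).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calib)

x = rng.normal(0.0, 1.0, size=1024).astype(np.float32)
x_hat = dequantize(quantize(x, scale), scale)
print("mean abs error:", float(np.mean(np.abs(x - x_hat))))
```

Swapping any single decision here (asymmetric zero points, per-channel scales, a percentile-based calibrator, quantization-aware training) produces a different member of the family, with a different accuracy/throughput trade-off and different kernel requirements.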
## Why model family changes the risk

The accuracy risk of a given quantization scheme is conditional on the activation distributions of the model it is applied to. Different model families have systematically different activation behavior, which is why a “this works at INT8” generalization rarely transfers cleanly across families.

Convolutional models with bounded-range activations — for example, vision backbones with batch normalization and ReLU activations — typically have well-behaved activation distributions: most values cluster near zero, the range is bounded, and outliers are rare. INT8 quantization in this regime tends to produce negligible accuracy regression with off-the-shelf calibration, because the format’s representable range covers the activation distribution comfortably.

Attention-based models — transformers, including but not limited to LLMs — have activation distributions that include occasional large outliers, particularly in the input embeddings and attention scores. The same INT8 scheme that works cleanly on a convolutional backbone can produce substantial accuracy regression on a transformer when applied without an outlier-aware calibration scheme, because the format’s representable range either has to be set wide enough to cover the outliers — leaving the typical values represented at coarse granularity — or has to clip the outliers, which discards information the model uses.

LLMs amplify the transformer pattern further: their long autoregressive generation paths compound small per-token probability shifts into qualitatively different outputs many tokens downstream. A quantization scheme that produces a small per-token accuracy regression on a single-shot benchmark can produce a large generation-quality regression on long outputs.

The implication is not that quantization “doesn’t work” on transformers or LLMs. It is that the quantization scheme that works on them is different from the one that works on convolutional models, and the calibration requirements are stricter.
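The outlier effect is easy to see numerically with the same max-abs recipe from the sketch above: one large activation inflates the scale, and the step size available to the typical values grows with it. The distributions below are synthetic stand-ins chosen for illustration, not measurements from any particular model family.

```python
import numpy as np

def bulk_roundtrip_error(x):
    """Mean INT8 round-trip error on the typical values, with max-abs calibration."""
    scale = np.max(np.abs(x)) / 127.0                 # scale stretched to cover the extremes
    x_hat = np.clip(np.round(x / scale), -128, 127) * scale
    bulk = np.abs(x) < np.percentile(np.abs(x), 99)   # measure error away from the outliers
    return float(np.mean(np.abs(x - x_hat)[bulk]))

rng = np.random.default_rng(0)
bounded = rng.normal(0.0, 1.0, size=100_000)   # conv-like: bounded, near zero, few outliers
spiky = bounded.copy()
spiky[:10] = 60.0                              # transformer-like: rare, very large activations

print("bulk error, bounded distribution:", bulk_roundtrip_error(bounded))
print("bulk error, with rare outliers  :", bulk_roundtrip_error(spiky))
# The second error is roughly an order of magnitude larger: the quantization step
# grew from about max|x|/127 ≈ 0.03 to 60/127 ≈ 0.47, so the typical values are
# represented far more coarsely. Clipping the outliers instead would keep the fine
# step but discard values the model may rely on.
```

This forced choice between a coarse step and clipped outliers is the mechanism behind the stricter calibration requirements for transformers and LLMs; outlier-aware schemes exist to avoid it.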
## Comparing quantization risk across model families

| Model family | Activation distribution | Typical INT8 risk | What calibration must capture |
| --- | --- | --- | --- |
| Convolutional vision models | Bounded, near-zero-centered, few outliers | Small; off-the-shelf calibration usually sufficient | Representative image distribution |
| Transformer encoders (e.g. classification) | Includes occasional outliers in embeddings and attention | Moderate; outlier-aware schemes helpful | Representative input distribution including edge cases |
| Transformer decoders (LLMs) | Outliers compound through autoregressive generation | Substantial without scheme adjustment; weight-and-activation INT8 is risky | Workload-shaped prompts including long generations |
| Recurrent models | Activation magnitudes can drift along long sequences | Variable; conditional on sequence length | Calibration over deployment-representative sequence lengths |

The pattern is consistent: the more the model’s activation distribution can produce values that strain the representable range of the low-precision format, the stricter the calibration requirements become, and the smaller the universe of off-the-shelf quantization schemes that produce acceptable accuracy.

## What this means for evaluating quantization claims

A quantization claim that omits the model family it was demonstrated on is structurally incomplete. “Quantization works at INT8 with negligible accuracy loss” is a true statement about some models and a false one about others. The same is true of claims framed at the format level — “INT4 quantization preserves accuracy” — without naming the calibration scheme and the model family.

The right question for any quantization claim is not “is the accuracy loss small?” but “on which model family, with which scheme, with which calibration, evaluated on which workload, was the accuracy loss small?” Removing any of those four dimensions removes information the result depends on.

## The framing that actually helps

Quantization in ML is a calibrated approximation discipline. Its results are conditional on the calibration data, the scheme parameters, the model family’s activation distribution, and the evaluation workload. The general principle that quantization is controlled approximation rather than damage holds across the whole family — but the constants in the equation differ across model families, and reporting a result without those constants is reporting a number without its units.

LynxBench AI treats quantization as a per-model-family, per-scheme, per-calibration evaluation regime — not as a single binary “quantized or not” axis — because the conditions of the result are what determine whether the result transfers to the workload that needs it.
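To make those conditions concrete, one hypothetical way to carry a quantization result is as a record of the four dimensions named above. The record type and its field values below are illustrative placeholders, not a LynxBench AI schema or measured data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantizationResult:
    """A quantization accuracy claim together with the conditions it depends on."""
    model_family: str        # e.g. "convolutional vision" or "transformer decoder (LLM)"
    scheme: str              # e.g. "symmetric per-channel INT8, weight-only, post-training"
    calibration: str         # what data the scale factors were fit to
    eval_workload: str       # what the accuracy delta was measured on
    accuracy_delta: float    # change relative to the FP32/FP16 baseline

claim = QuantizationResult(
    model_family="convolutional vision",
    scheme="symmetric per-tensor INT8, weights and activations, post-training",
    calibration="images sampled from deployment traffic",
    eval_workload="held-out, deployment-shaped evaluation set",
    accuracy_delta=-0.002,   # placeholder, not a measured result
)
# Dropping any of the first four fields turns accuracy_delta into a number without
# its units: it can no longer be checked against the workload it needs to transfer to.
```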