Distillation vs Quantisation for Multi-Platform Edge Inference: How to Choose

Distillation and quantisation both shrink models for edge inference, but once a deployment spans three or more platforms, only distillation keeps quality consistent across them.

Written by TechnoLynx Published on 28 Apr 2026

How do you choose between distillation and quantisation for a multi-platform edge target?

A model that runs acceptably on a development machine needs to run in real time on iOS, Android, and browser targets. Both distillation and quantisation reduce the model to something that fits within mobile and edge memory budgets. They do not make the same tradeoffs, and choosing the wrong approach for a multi-platform deployment creates a hidden problem that appears after the first target ships: the model that validates correctly on iOS behaves differently on Android, and the browser implementation has quality characteristics that diverge from both.

The divergence is not a bug in any individual implementation. It is the expected consequence of applying precision reduction techniques independently to runtime-specific implementations of the same model, without a shared quality baseline across platforms. The decision between distillation and quantisation is, at root, a decision about where the quality contract lives — inside one portable artefact, or distributed across N runtime-specific ones.

What distillation and quantisation actually do

Distillation trains a smaller student model to replicate the behaviour of a larger teacher model. The student has a different architecture — fewer layers, reduced hidden dimensions, or a simpler design — but is trained to match the teacher’s output distribution rather than just the ground-truth labels. The resulting model is smaller because it has fewer parameters, not because its parameters have been reduced in precision. A distilled model is a portable artefact: it can be exported to CoreML, ONNX Runtime, or TensorRT and its behaviour is determined by its architecture and weights, not by the runtime’s precision implementation.

Quantisation reduces the numerical precision of an existing model’s weights and activations, typically from FP32 to FP16 or INT8. The architecture is preserved, but the reduced precision changes the numerical results relative to the original model. Critically, INT8 quantisation on CoreML and INT8 quantisation on ONNX Runtime apply different quantisation schemes, different calibration methods, and potentially different operator implementations. A model quantised for CoreML is not the same numerical object as the same model quantised for ONNX Runtime, even though both are labelled “INT8.” This is why quantisation has to be treated as an engineering trade-off rather than a single setting: the choice of precision format shapes behaviour across runtimes, not just artefact size.
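To make the "same label, different numbers" point concrete, here is a minimal sketch that quantises one set of weights with two generic INT8 schemes and compares the dequantised results. Both schemes, the function names, and the weight values are illustrative assumptions; they are not the actual CoreML or ONNX Runtime implementations.

```python
# Illustrative only: two generic per-tensor INT8 schemes. These are NOT the
# schemes any specific runtime is known to use; they just show that "INT8"
# does not pin down a single numerical result.

def quantise_symmetric(values):
    """Symmetric INT8: zero maps to 0, values are clamped to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [qi * scale for qi in q]  # dequantised values


def quantise_asymmetric(values):
    """Asymmetric (affine) INT8: [min, max] is mapped onto [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return [(qi - zero_point) * scale for qi in q]  # dequantised values


weights = [-0.31, -0.02, 0.07, 0.45, 1.20]  # hypothetical weights
sym = quantise_symmetric(weights)
asym = quantise_asymmetric(weights)

# Largest disagreement between the two "INT8" versions of the same weights.
max_gap = max(abs(a - b) for a, b in zip(sym, asym))
print(f"max gap between schemes: {max_gap:.6f}")
```

Each per-weight gap is small, but it compounds across layers, which is why two runtimes can diverge on the same source model.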

The distinction matters because the two techniques compose differently with multi-platform deployment. Distillation is applied once, before export. Quantisation is applied separately per runtime, after export — or not at all, depending on what the runtime supports natively.

The pre-condition: device-capability profiling on the actual target hardware

Choosing between distillation and quantisation only makes sense once the team knows what the target devices can actually execute. The same kernel-level discipline that diagnoses server-side bottlenecks applies to constrained inference targets — browser WebGL/WebGPU contexts, mobile NPUs, and edge accelerators — but the inputs change: median-device performance rather than peak hardware specs, runtime-version coverage across the deployment cohort, and the operator subset each runtime actually supports without falling back to CPU.

The device-capability baseline is a project-specific measurement, not an industry benchmark; it is benchmark-class only when the deployment cohort is named. Skipping it produces a failure mode we see repeatedly across edge-deployment engagements: a model validated on the development machine misses the latency budget on the median user device by a factor of 5–10× (an observed range, not a guaranteed outcome), because the architectural decision was made against the wrong device profile.

Establishing the baseline before the compression decision is what lets the decision be made once rather than reversed after the first deployment cycle. It is also what lets the team distinguish a model that is genuinely too large from a model that is appropriately sized but mis-mapped to runtime operators that silently fall back to CPU. Those two failure modes look identical in end-to-end latency measurements, and they have different remedies.
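A baseline of this kind can start as two small checks: which operators would fall back, and how far the median device sits from the latency budget. The sketch below assumes the operator lists and latency samples have already been collected from the real cohort. All names and values here are hypothetical placeholders.

```python
from statistics import median


def unsupported_ops(model_ops, runtime_ops):
    """Operators the model uses that the runtime does not list as supported."""
    return sorted(set(model_ops) - set(runtime_ops))


def median_over_budget(latencies_ms, budget_ms):
    """Ratio of median measured latency to the budget (above 1.0 is a miss)."""
    return median(latencies_ms) / budget_ms


# Hypothetical inputs: replace with data gathered from the actual device cohort.
model_ops = ["Conv", "LayerNormalization", "Gelu", "MatMul"]
runtime_ops = {"Conv", "MatMul", "Relu"}
latencies_ms = [41.0, 58.0, 63.0, 120.0, 240.0]  # one sample per device

fallbacks = unsupported_ops(model_ops, runtime_ops)
ratio = median_over_budget(latencies_ms, budget_ms=33.0)

print("likely CPU fallbacks:", fallbacks)
print(f"median latency is {ratio:.2f}x the budget")
```

A non-empty fallback list points towards an operator-mapping problem, while an empty list with a ratio above 1.0 points towards a model that is too large for the target.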

The platform-count decision criterion

| Situation | Recommended approach | Rationale |
| --- | --- | --- |
| Single target platform | Quantisation | Platform-specific quantisation is well-documented, tooling is mature, quality validation is a single cycle |
| Two platforms, tolerant of minor quality divergence | Quantisation per platform, cross-platform validation | Manageable if quality variation between platforms is acceptable to the use case |
| Three or more platforms | Distillation to a shared portable model | Quantisation per platform creates N independent validation cycles; quality divergence grows with N |
| Real-time quality consistency required across all targets | Distillation | Only distillation guarantees identical numerical behaviour across runtimes from a shared weight set |
| Memory budget tight, accuracy threshold flexible | Quantisation | Quantisation achieves higher compression ratios than distillation for an equivalent architecture |
| Memory budget adequate, quality threshold strict | Distillation | Distillation preserves quality more reliably across the precision boundary |

The threshold sits at three platforms because that is where the validation arithmetic flips. With two targets, the team runs two quantisation calibration passes, two validation suites, and reconciles two sets of edge cases — manageable. With three, the pairwise comparison surface grows (AB, AC, BC) and the cost of keeping all three runtimes converged on the same behaviour exceeds the cost of training one shared student.
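The pairwise arithmetic above is simply "N choose 2". A short sketch, with platform names used only as labels:

```python
from itertools import combinations


def pairwise_comparisons(platforms):
    """Every unordered pair of platforms whose behaviour must be reconciled."""
    return list(combinations(platforms, 2))


two = pairwise_comparisons(["iOS", "Android"])
three = pairwise_comparisons(["iOS", "Android", "Browser"])
four = pairwise_comparisons(["iOS", "Android", "Browser", "Embedded"])

print(len(two), len(three), len(four))  # 1 3 6
```

The count grows as N(N-1)/2, so each added platform adds more reconciliation work than the one before it.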

In a text-to-speech inference optimisation project we ran on edge devices, deploying an audio synthesis model to iOS (CoreML), Android (ONNX Runtime), and browser targets, the three-platform requirement made quantisation impractical. Separate INT8 quantisation for CoreML and ONNX Runtime produced audible quality differences at certain phoneme transitions that were acceptable on one platform and not on the other. The resolution was to distil the full-size model into a smaller architecture that could be exported directly to both runtimes from a shared set of weights, producing consistent audio quality across all targets. The distilled model ran within latency targets on the lowest-specification test devices in the deployment cohort.

How does the ONNX deployment architecture change the picture?

For models targeting ONNX Runtime across multiple platforms, the deployment decision is separate from the compression decision. ONNX functions as a cross-platform model exchange format: a model exported to ONNX runs on any ONNX Runtime-compatible environment without modification, provided that environment's runtime build supports the model's opset and operators. This makes it a natural choice for multi-platform deployments where the runtime environments are heterogeneous, and it is one of the reasons real-time edge processing with GPU acceleration tends to converge on ONNX as the canonical interchange layer.

The key distinction is between ONNX-as-file-format (converting a model once for compatibility) and ONNX-as-deployment-architecture (designing the model export, versioning, and validation pipeline around ONNX Runtime as the canonical runtime). The latter requires applying cross-platform performance-portability thinking to inference targets: the model should be validated against all target ONNX Runtime versions in the deployment pipeline, not just the development version.
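One way to make "validated against all target runtime versions" mechanical is to compare each target's outputs with a reference on a fixed input set. The sketch below assumes the outputs have already been collected as flat lists of floats. The target labels, sample values, and tolerance are hypothetical, and a real tolerance has to come from the product's quality requirement.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("output shapes differ")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def parity_report(reference, outputs_by_target, tolerance):
    """Map each target to True if its outputs stay within tolerance."""
    return {
        target: max_abs_diff(reference, outputs) <= tolerance
        for target, outputs in outputs_by_target.items()
    }


# Hypothetical outputs for one fixed input; labels are placeholders.
reference = [0.12, -0.53, 0.98]
outputs_by_target = {
    "runtime-version-A": [0.1200004, -0.5300001, 0.9799997],
    "runtime-version-B": [0.12, -0.531, 0.98],
}

report = parity_report(reference, outputs_by_target, tolerance=1e-4)
print(report)
```

Running the same report in the deployment pipeline for every runtime version in the cohort surfaces version-specific regressions before they ship.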

When distillation is combined with ONNX export, the validation burden drops sharply. A single distilled model is exported to ONNX once, validated once against ONNX Runtime, and deployed to all target platforms. CoreML targets receive the same trained weights either through ONNX Runtime's CoreML execution provider or through a direct coremltools conversion from the source framework (the coremltools ONNX converter has been deprecated and removed from current releases), so the pipeline still traces back to a single trained artefact. The validation graph is one node, not N.

The quality variation to expect across CoreML, ONNX Runtime, and TensorRT for the same quantised model is not zero even with careful calibration — operator implementations and rounding behaviour differ. Across our deployments this shows up as roughly 1–3% task-metric drift between runtimes for INT8, with the spread widening for models that lean heavily on attention kernels or batch normalisation (an observed pattern across edge-deployment engagements, not a guaranteed range). For tasks where that drift is invisible to the user, per-platform quantisation is fine. For tasks where it is audible, visible, or measurable in downstream metrics, distillation removes the variable entirely.
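Whether the drift matters for a given product can be checked with a few lines once the task metric has been measured per runtime. The sketch below defines drift as the spread between the best and worst runtime relative to the best. That definition, the metric values, and the tolerance are assumptions made for illustration, not measurements from the deployments described above.

```python
def relative_drift(metric_by_runtime):
    """Spread between best and worst runtime, relative to the best."""
    values = list(metric_by_runtime.values())
    return (max(values) - min(values)) / max(values)


def within_tolerance(metric_by_runtime, tolerance):
    """True if the cross-runtime spread is inside the product tolerance."""
    return relative_drift(metric_by_runtime) <= tolerance


# Hypothetical task-metric scores for one INT8 model on three runtimes.
scores = {"CoreML": 0.912, "ONNX Runtime": 0.905, "TensorRT": 0.894}

drift = relative_drift(scores)
acceptable = within_tolerance(scores, tolerance=0.01)

print(f"drift: {drift:.2%}, acceptable: {acceptable}")
```

The tolerance is the product decision; the code only makes the comparison repeatable across releases.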

The distillation training procedure

Distillation is more involved than a quantisation pass; the team needs an explicit training procedure rather than a one-shot conversion. The structure below is what we use; specific hyperparameters depend on the task and the teacher–student capacity gap.

1. Capacity targeting. Choose the student architecture before training, not during. The student’s parameter count should be sized against the deployment memory budget on the lowest-tier target device, with a safety margin (we typically target 60–70% of the available budget to leave headroom for activations and runtime overhead — an observed planning heuristic, not a benchmarked rule). Smaller students need more training compute to close the quality gap; the smallest viable student is rarely the fastest path to an acceptable model.

2. Loss formulation. The standard distillation loss is a weighted sum of two terms: a task loss against ground-truth labels (cross-entropy for classification, L1 or L2 for regression, a domain-specific loss for generative models) and a distillation loss against the teacher’s outputs. For classification, the distillation loss is typically the KL divergence between teacher and student softmax distributions, with a temperature parameter T (commonly 2–5) that softens the distributions to expose the teacher’s relative confidence over non-target classes. For regression and generative models, an L1 or L2 distance between teacher and student outputs is the standard form. The weighting is tuned per task; a starting point is 0.5 * task_loss + 0.5 * (T * T) * distillation_loss, with the T*T factor compensating for the gradient scaling induced by temperature.

3. Layer alignment (optional but high-value). When the student architecture is similar enough to the teacher to permit it, adding intermediate-layer matching losses (typically an L2 distance between selected hidden states at corresponding depths) accelerates convergence and improves final student quality. This is the FitNets approach and the wider family of feature-distillation methods. Evenly spaced selections, in which student layer k of K is matched to the teacher layer at the same relative depth k/K, are a robust default. Hugging Face Transformers’ distillation recipes for DistilBERT and similar models implement this pattern in PyTorch and can be adapted as a reference.

4. Training data. Distillation benefits from training data that is broader than the original task training set, because the teacher provides supervision on every input regardless of whether ground-truth labels exist. Unlabelled data from the target domain — product images from the actual deployment environment, audio clips from the target user population, telemetry from the target device cohort — is high-value distillation data even without labels. The teacher’s outputs serve as the supervision signal.

5. Validation protocol. Quality validation should be conducted at three points: against the teacher (does the student match the teacher within the defined quality tolerance?), against the original task ground truth (does the student perform the task acceptably?), and against the deployment runtime (does the student, after export to CoreML / ONNX Runtime, produce numerically equivalent results to the PyTorch reference?). The third check is the one most often skipped and most often responsible for deployment-time surprises.

6. Iteration cadence. A first-pass distilled model rarely meets the quality target. Plan for at least three iteration cycles: initial training, quality assessment against the validation suite, and refinement of either the student architecture, the loss weighting, or the training data composition. Distillation projects that allocate time for one cycle and assume success ship under-trained students.
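Steps 2 and 3 above can be sketched in plain Python for a single classification example. This is a dependency-free illustration of the arithmetic, not a training implementation. The function names, the `alpha` weighting parameter, and the sample logits are assumptions introduced here, and a real pipeline would compute the same quantities on batched tensors in its training framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Task loss against the ground-truth class (temperature 1)."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      T=2.0, alpha=0.5):
    """alpha * task_loss + (1 - alpha) * T^2 * KL(teacher || student)."""
    task = cross_entropy(student_logits, target_index)
    kd = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * task + (1 - alpha) * (T * T) * kd


def evenly_spaced_layer_map(student_depth, teacher_depth):
    """Match student layer k (1-based) to the teacher layer at the same relative depth."""
    return {
        k: round(k * teacher_depth / student_depth)
        for k in range(1, student_depth + 1)
    }


teacher = [2.0, 0.5, -1.0]  # hypothetical teacher logits
matched = distillation_loss(teacher, teacher, target_index=0)
untrained = distillation_loss([0.0, 0.0, 0.0], teacher, target_index=0)
layer_map = evenly_spaced_layer_map(student_depth=4, teacher_depth=12)

print(f"matched: {matched:.4f}, untrained: {untrained:.4f}")
print(layer_map)
```

When the student reproduces the teacher's logits the KL term vanishes and only the task loss remains, which is a useful sanity check before a training run starts.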

The hidden cost of the wrong choice

For a two-platform deployment, quantisation per platform is a reasonable choice and the validation cost is manageable. The hidden cost appears when a third target is added — a new device category, a new operating system version, or a new runtime — and the team discovers that the quantisation work done for the first two platforms does not transfer. The calibration data, the operator coverage decisions, the per-runtime quality thresholds: all of it has to be redone, and the gap between the three platforms has to be closed in a way that was never planned for.

That is why the cost asymmetry in the decision table above is structural rather than incidental. Quantisation’s compute savings are real, but they are paid for in validation cycles that scale with platform count. Distillation’s training cost is real, but it is paid once. For deployments that are unlikely ever to add a third target, the asymmetry favours quantisation. For deployments that already have three or where the roadmap suggests a third is coming, distillation pays back inside the first additional platform.
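The asymmetry can be turned into a rough break-even calculation. The cost model below (a per-platform cost plus a per-pair reconciliation cost for quantisation, a one-off training cost plus a per-platform export cost for distillation) is an assumption built for illustration, and every number in it is a placeholder rather than a measured figure. The break-even point moves with the inputs, so a team should substitute its own estimates.

```python
def quantisation_cost(n_platforms, per_platform, per_pair):
    """Calibration and validation per platform, plus reconciliation per pair."""
    pairs = n_platforms * (n_platforms - 1) // 2
    return n_platforms * per_platform + pairs * per_pair


def distillation_cost(n_platforms, training, per_export):
    """One-off training cost plus a small export check per platform."""
    return training + n_platforms * per_export


def break_even_platforms(per_platform, per_pair, training, per_export, max_n=10):
    """Smallest platform count at which distillation becomes cheaper, if any."""
    for n in range(1, max_n + 1):
        if distillation_cost(n, training, per_export) < quantisation_cost(
            n, per_platform, per_pair
        ):
            return n
    return None


# Placeholder effort estimates in engineer-days; NOT measured values.
estimates = dict(per_platform=5, per_pair=4, training=20, per_export=1)

for n in (1, 2, 3, 4):
    q = quantisation_cost(n, estimates["per_platform"], estimates["per_pair"])
    d = distillation_cost(n, estimates["training"], estimates["per_export"])
    print(f"{n} platform(s): quantisation={q}, distillation={d}")

print("break-even at:", break_even_platforms(**estimates))
```

With these placeholder inputs the crossover lands at three platforms. A cheaper reconciliation step or a more expensive training run pushes it later.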

The device-baseline audit that precedes this decision — establishing which runtimes the target devices support, what quantisation schemes each runtime implements, and what the latency budget is across the device cohort — is the step that makes the choice tractable. Without it, the choice is made on the wrong information. For teams approaching this decision for the first time, a GPU and Inference Optimisation Assessment evaluates the compression strategy against the platform count and quality requirements before implementation begins.

FAQ

When should I choose distillation over quantisation for edge inference?

Choose distillation when the deployment matrix has three or more target platforms, when real-time quality consistency across those platforms is part of the product contract, or when memory budgets are adequate but the quality threshold is strict. Choose quantisation when there is a single target runtime, when tooling maturity matters more than portability, or when the compression ratio required exceeds what an architecture change can deliver.

Why does INT8 quantisation behave differently on CoreML, ONNX Runtime, and WebGL — and what does that mean for QA?

Each runtime applies its own quantisation scheme, calibration method, and operator implementation under the “INT8” label. The result is that the same source model produces three numerically distinct artefacts after per-runtime quantisation. QA must validate each runtime independently and reconcile the differences against the product’s quality tolerance — a cost that grows with platform count.

How many edge platforms before distillation’s portability advantage outweighs quantisation’s compute savings?

The threshold is three. With two platforms the pairwise validation work is tractable; with three the pairwise comparison surface and per-runtime calibration cost exceed the cost of training one shared distilled student that exports cleanly to all targets.

What quality variation should I expect across CoreML, ONNX Runtime, and TensorRT for the same quantised model?

Across our edge engagements the variation typically lands in the 1–3% range on task metrics for INT8, widening for models that lean on attention kernels or batch normalisation (an observed pattern, not a benchmarked rate). Whether that variation is acceptable depends on whether it is visible to the user or measurable in downstream metrics.

How do I evaluate model-compression options against my deployment matrix without re-validating per platform?

Establish a device-capability baseline first — runtime versions, operator coverage, median-device latency budgets — and then test compression options against the lowest-capability target. If a distilled student meets the budget on that target, it will meet it across the matrix from a single artefact. If only per-platform quantisation meets the budget, the validation cost is the cost of the strategy.

Where do ONNX models fit in a multi-platform pipeline — and what are the real performance-vs-portability tradeoffs?

ONNX is most powerful as a deployment architecture, not just a file format: it works best when the export, versioning, and validation pipeline treats ONNX Runtime as the canonical runtime, with CoreML and other targets receiving the model through documented import paths. The portability gain comes with a small per-runtime performance tax compared to fully native artefacts; that tax is usually worth paying when the alternative is N parallel pipelines.

For the wider picture of how precision decisions interact with task accuracy, the related discussion of why accuracy loss from lower precision is task-dependent sets the framing this decision sits inside.
