ai-newspaper.


Measure Llama 3.1 FP8 quantization loss on 24GB GPUs

The arithmetic of LLM inference has shifted decisively toward 8-bit floating point over the past eighteen months.

# Quantifying Llama 3.1 FP8 Precision Loss on 24GB VRAM Setups

## The Mechanics of FP8 Quantization in Llama 3.1

FP8, as instantiated for inference in frameworks like vLLM and TensorRT-LLM, comes in two sub-formats: E4M3 and E5M2. The former trades exponent range for mantissa precision; the latter does the inverse. For inference, both weight matrices and activations in transformer blocks are typically mapped to E4M3, where the additional mantissa bit limits rounding error in the matrix products. E5M2, designed with the wider dynamic range of training gradients in mind, mainly appears at inference time as an optional KV-cache storage format. Llama 3.1, with its grouped-query attention layout and SwiGLU activation gating, exhibits heavy-tailed activation distributions, a property that has direct implications for the choice of scaling factors applied during quantization and that likely accounts for much of the residual perplexity gap that FP8 deployments exhibit against BF16 baselines.
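The trade-off between the two sub-formats can be made concrete by deriving their limits from the bit layout. The sketch below assumes the OCP-style encodings commonly used on NVIDIA hardware (E4M3 without infinities and with a single NaN mantissa pattern, E5M2 with IEEE-like inf/NaN codes). The `Fp8Format` class and its property names are illustrative and not taken from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fp8Format:
    name: str
    exp_bits: int
    man_bits: int
    bias: int
    ieee_like: bool  # True: the top exponent code is reserved for inf/NaN

    @property
    def max_finite(self) -> float:
        top_code = (1 << self.exp_bits) - 1
        all_ones = (1 << self.man_bits) - 1
        if self.ieee_like:
            # E5M2: largest finite value sits one exponent code below inf/NaN.
            exp_code, man_code = top_code - 1, all_ones
        else:
            # E4M3: top exponent code is usable; only mantissa 111 is NaN.
            exp_code, man_code = top_code, all_ones - 1
        significand = 1 + man_code / (1 << self.man_bits)
        return significand * 2.0 ** (exp_code - self.bias)

    @property
    def min_subnormal(self) -> float:
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def relative_step(self) -> float:
        # Spacing between neighbouring values, relative to the binade base.
        return 2.0 ** -self.man_bits


E4M3 = Fp8Format("E4M3", exp_bits=4, man_bits=3, bias=7, ieee_like=False)
E5M2 = Fp8Format("E5M2", exp_bits=5, man_bits=2, bias=15, ieee_like=True)

for fmt in (E4M3, E5M2):
    print(fmt.name, fmt.max_finite, fmt.min_subnormal, fmt.relative_step)
```

E4M3 tops out at 448 with a relative step of 1/8 inside each binade, while E5M2 reaches 57344 at a coarser step of 1/4, which is the range-for-precision trade described above.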

The quantization path matters operationally, and the variants produce materially different results. Post-training quantization (PTQ) with a calibration dataset produces a static per-tensor or per-channel scaling map. Within PTQ, weight-only quantization stores the parameters in FP8 but leaves activations and matrix multiplications in BF16 at runtime; it roughly halves weight memory and the bandwidth needed to stream weights, but gains nothing from FP8 tensor cores. Weight-and-activation quantization extends the conversion to the inputs of each linear layer. This is the configuration vLLM supports natively for Llama 3.1 and the one that adds FP8 matrix-multiply throughput on top of the roughly 2× memory saving FP8 is marketed for. It is also the variant that produces the largest accuracy gap, though that gap remains small in absolute terms for the 8B model.

On Llama 3.1 8B, FP8 weight-and-activation quantization produces a perplexity delta of 0.1–0.3 against BF16 when calibrated on WikiText-2 — small enough to disappear beneath most downstream benchmark noise, yet large enough to matter for retrieval-critical and structured-extraction workloads.

Two further caveats rarely surfaced in vendor literature warrant explicit mention. First, native FP8 tensor core execution requires NVIDIA Ada Lovelace (RTX 4090) or Hopper (H100) class hardware or newer; on earlier consumer cards the FP8 weights can be loaded into VRAM, but they are dequantized to BF16 at compute time. That negates much of the throughput benefit and makes "FP8 memory cost" and "FP8 compute cost" separable phenomena. Second, certain channels in Llama 3.1's MLP projections exhibit outlier magnitudes that, if not handled by per-channel scaling factors, dominate the per-tensor dynamic range: the scale is set by the outlier, and the bulk of ordinary values is pushed toward the coarse, subnormal end of the format. This failure mode is well known from INT8 quantization; FP8's floating-point grid softens it but does not remove it.
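The outlier effect can be reproduced with a toy E4M3 rounding routine. This is a minimal sketch in pure Python, assuming round-to-nearest with saturation at ±448. The two "channels" are made-up numbers chosen for illustration, and `round_to_e4m3` is a hypothetical helper, not a vLLM or `llm-compressor` function.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
MAN_BITS = 3
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)      # subnormals share the lowest exponent
    step = 2.0 ** (exp - MAN_BITS)        # grid spacing at this magnitude
    return math.copysign(round(a / step) * step, x)


def mean_rel_error(values, scale):
    """Quantize with the given scale, dequantize, and average the relative error."""
    deq = [round_to_e4m3(v / scale) * scale for v in values]
    return sum(abs(d - v) / abs(v) for d, v in zip(deq, values)) / len(values)


def amax(values):
    return max(abs(v) for v in values)


# Illustrative data: one ordinary channel and one channel with an outlier.
quiet_channel = [0.0004, -0.0007, 0.0011, 0.0002]
outlier_channel = [60.0, -3.0, 1.5, 0.25]

# Per-tensor: a single scale, set by the outlier, is shared by both channels.
per_tensor_scale = amax(quiet_channel + outlier_channel) / E4M3_MAX
# Per-channel: the quiet channel gets a scale fitted to its own range.
per_channel_scale = amax(quiet_channel) / E4M3_MAX

err_tensor = mean_rel_error(quiet_channel, per_tensor_scale)
err_channel = mean_rel_error(quiet_channel, per_channel_scale)
print(f"per-tensor error:  {err_tensor:.3f}")
print(f"per-channel error: {err_channel:.3f}")
```

With the shared scale, the quiet channel lands in E4M3's subnormal range and loses most of its resolution; with its own scale it uses the full grid.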

## Setting Up the Evaluation Environment on 24GB VRAM

Working memory budgets for the evaluation drive most of the implementation choices below. A 24GB VRAM card constrains the entire pipeline: the BF16 reference model must be resident during calibration, and the calibration data must be processed within whatever memory remains. Llama 3.1 8B at BF16 occupies approximately 16GB, which leaves roughly 8GB on a 24GB device for calibration dataset buffers, intermediate tensors, and the temporary scale-factor maps produced by `llm-compressor`. Because an FP8 copy of the weights needs about 8GB of its own, holding reference and candidate side by side leaves essentially nothing for activations; scoring them in a staged sequence with identical RNG seeds is the practical arrangement. Workable, but not generous; careless buffer allocation will surface as OOM errors during activation collection.
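The budget arithmetic is simple enough to script. The sketch below assumes a parameter count of about 8.03 billion and treats the card as exactly 24 decimal GB, which is conservative: the physical capacity is slightly higher, but the driver and CUDA context claim part of it. All names are illustrative.

```python
GB = 1e9

N_PARAMS_8B = 8.03e9   # assumption: approximate Llama 3.1 8B parameter count
VRAM_GB = 24.0         # assumption: usable budget rounded to the nominal figure


def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory taken by the weights alone, in decimal GB."""
    return n_params * bytes_per_param / GB


bf16_gb = weight_gb(N_PARAMS_8B, 2)   # 2 bytes per BF16 parameter
fp8_gb = weight_gb(N_PARAMS_8B, 1)    # 1 byte per FP8 parameter

# Staged: only the BF16 reference is resident during calibration.
headroom_staged_gb = VRAM_GB - bf16_gb
# Co-resident: reference and FP8 candidate loaded together.
headroom_co_resident_gb = VRAM_GB - bf16_gb - fp8_gb

print(f"BF16 weights:         {bf16_gb:.2f} GB")
print(f"FP8 weights:          {fp8_gb:.2f} GB")
print(f"staged headroom:      {headroom_staged_gb:.2f} GB")
print(f"co-resident headroom: {headroom_co_resident_gb:.2f} GB")
```

The co-resident figure comes out at or below zero under these assumptions, which is why the two models are scored one after the other.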

The standard toolchain decomposes into three layers, each of which has a specific role:

| Layer | Tool | Function |
| --- | --- | --- |
| Model ingestion | Hugging Face `transformers` + safetensors | BF16 reference weight loading from Meta's release |
| Quantization | Neural Magic `llm-compressor` | FP8 conversion with calibration-driven scaling factor selection |
| Inference & scoring | `vLLM` ≥ 0.5.x | FP8 weight loading and batched PPL scoring via logprob accumulation |

The `llm-compressor` workflow is driven by a quantization recipe that specifies the FP8 scheme and the per-layer scaling strategy, alongside the calibration dataset and the number of samples to consume. The typical invocation passes the BF16 Llama 3.1 8B checkpoint through a `QuantizationModifier` paired with a calibration set built from WikiText-2 or C4 and tokenized with the model's own tokenizer. The `lm_head` is commonly excluded from quantization to preserve logit calibration, which is particularly important for any downstream temperature or top-p sampling pipeline.

Calibration sample count is the single most impactful parameter for result variance and reproducibility. Below 32 samples of 2048 tokens, scaling factor estimation is unstable and produces non-deterministic perplexity shifts run-to-run. Above 512 samples, marginal improvement flattens sharply. The empirical sweet spot for Llama 3.1 8B sits between 128 and 256 sequences of 2048 tokens, a configuration that fits comfortably within the 8GB residual on a 24GB card during the activation-collection phase and that resolves scaling factor maps in a single forward pass per layer.
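To see why the sample count and the seed both matter, consider a static scale derived from a running absolute maximum over calibration samples. The sketch below is a toy model under loud assumptions: synthetic Gaussian values stand in for real activations, and the running-max rule is one simple observer strategy, not necessarily the one `llm-compressor` applies. `static_scale` is a hypothetical helper.

```python
import random

E4M3_MAX = 448.0


def static_scale(num_samples: int, seq_len: int = 64, seed: int = 0) -> float:
    """Per-tensor scale from the running absmax over synthetic calibration samples."""
    rng = random.Random(seed)  # fixed seed: the same samples in the same order
    running_amax = 0.0
    for _ in range(num_samples):
        # Stand-in for one sequence's activations at a single layer.
        sample_amax = max(abs(rng.gauss(0.0, 1.0)) for _ in range(seq_len))
        running_amax = max(running_amax, sample_amax)
    return running_amax / E4M3_MAX


sample_counts = (8, 32, 128, 256)
scales = [static_scale(n) for n in sample_counts]
for n, s in zip(sample_counts, scales):
    print(f"{n:4d} samples -> scale {s:.6f}")

# A different seed yields a different sample stream and, at low counts
# especially, a different scale.
print("seed 0 vs seed 1 at 8 samples:", static_scale(8, seed=0), static_scale(8, seed=1))
```

With a fixed seed the scale can only grow as more samples are seen, and it is reproducible run to run; changing the seed changes the estimate, which is the variance the sample-count guidance above is trying to control.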

Memory bandwidth, rather than raw FLOPs, is a major runtime constraint during this phase. An RTX 4090 sustains approximately 1TB/s of memory throughput, so every pass over the 16GB of BF16 weights costs at least about 16 ms regardless of compute. End-to-end calibration wall-clock time sits well above that floor, at 30 to 90 minutes, depending on sequence packing strategy and on whether attention is materialized in FP32 during the collection pass. Pinned host memory and a tokenize-once, reuse-many dataset are the two interventions that most reduce this phase.
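The bandwidth floor for this phase follows from two of the round figures above. This is a back-of-envelope sketch, assuming one full read of the BF16 weights per calibration sample and no batching; it gives a lower bound only and says nothing about compute, tokenization, or host-to-device transfers.

```python
def weight_stream_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Minimum time to read all weights from VRAM once."""
    return weight_bytes / bandwidth_bytes_per_s


BF16_WEIGHT_BYTES = 16e9   # approximately 16GB, as stated in the text
BANDWIDTH = 1e12           # approximately 1TB/s, as stated in the text
NUM_SAMPLES = 256

per_pass_s = weight_stream_seconds(BF16_WEIGHT_BYTES, BANDWIDTH)
floor_s = NUM_SAMPLES * per_pass_s  # one weight read per sample, unbatched

print(f"per-pass floor: {per_pass_s * 1e3:.1f} ms")
print(f"{NUM_SAMPLES}-sample floor: {floor_s:.2f} s")
```

Any gap between this floor and the measured wall-clock time is spent elsewhere in the pipeline, which is where the data-loading interventions apply.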

## Quantifying Perplexity Shifts: BF16 vs. FP8

Perplexity on a held-out token stream is the lowest-noise signal available for evaluating quantization impact, and the protocol below is sufficiently discriminating that downstream benchmark suites are rarely required unless the perplexity delta lands in an ambiguous band. The protocol scores both the BF16 reference and the FP8 candidate on the same tokenized evaluation set, typically WikiText-2's test split or a held-out portion of C4, under matched context lengths and a uniform KV-cache configuration. Scoring is teacher-forced: the log-probabilities of the reference tokens are read directly, nothing is sampled, and temperature or top-p settings play no role. Padding strategy, attention windowing, and special-token handling must be identical across runs; any divergence there introduces variance that can dominate the FP8 signal itself.
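Perplexity itself is the exponential of the negative mean token log-probability. A minimal sketch, assuming the per-token natural-log probabilities have already been collected from the inference engine; the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean log-probability) over one stream of scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(per_sequence_logprobs):
    """Aggregate over tokens, not over sequences, so long sequences are not underweighted."""
    total = sum(sum(seq) for seq in per_sequence_logprobs)
    count = sum(len(seq) for seq in per_sequence_logprobs)
    if count == 0:
        raise ValueError("need at least one scored token")
    return math.exp(-total / count)


# Sanity check: a model that assigns probability 1/4 to every token has PPL 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Running the same function over the BF16 and FP8 log-probabilities for an identical token stream gives the two numbers whose difference is the reported delta.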

The metric of interest is the delta, not the absolute value. A BF16 Llama 3.1 8B baseline at a 2048-token context will register approximately 6.5 PPL on WikiText-2; an FP8-quantized variant built via the protocol above should resolve in the 6.6–6.8 range. Drift above 1.0 PPL signals either a broken scaling factor map (often caused by miscalibrated activation ranges on outlier channels) or a fundamentally inappropriate format choice. Drift below 0.05 PPL falls inside the run-to-run variance envelope of BF16 inference itself, and should not be reported as "lossless" — that framing confuses numerical precision with empirical indistinguishability and is excluded by mainstream reproducibility conventions.


Where the PPL delta lands between 0.3 and 1.0, downstream benchmarks such as MMLU, GSM8K, HumanEval, and BBH become worth running to disambiguate whether the perplexity drift corresponds to capability loss or is concentrated in low-impact token categories. Their higher variance requires substantially larger sample counts (typically the full suite rather than a subsample) to surface a real FP8 effect. For the typical PPL delta range observed on well-calibrated 8B deployments, the perplexity protocol alone is the gating signal; benchmark suites add latency to the measurement pipeline without changing the conclusion.
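The decision bands described in this section can be encoded as a small triage helper. The thresholds (0.05, 0.3, 1.0) come from the main text; the label strings, the handling of boundary values, and the treatment of a clearly negative delta as suspicious are assumptions made for this sketch.

```python
def triage(ppl_bf16: float, ppl_fp8: float) -> str:
    """Map a BF16-vs-FP8 perplexity pair to a suggested next step."""
    delta = ppl_fp8 - ppl_bf16
    if delta > 1.0:
        # Likely a broken scaling-factor map or an unsuitable format choice.
        return "investigate"
    if delta >= 0.3:
        # Ambiguous band: downstream benchmarks are worth running.
        return "run-benchmarks"
    if abs(delta) < 0.05:
        # Inside run-to-run variance; not evidence of lossless quantization.
        return "within-noise"
    if delta < 0:
        # Assumption: FP8 scoring clearly better than BF16 hints at a
        # mismatch between the two evaluation runs.
        return "investigate"
    return "report-delta"


print(triage(6.5, 6.7))   # delta of about 0.2
print(triage(6.5, 7.0))   # delta of 0.5
```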

## Calibration Datasets and Accuracy Degradation Benchmarks

Calibration dataset composition directly shapes static FP8 scaling factors, because each scale is fixed from the activation magnitudes observed during calibration: anything larger at inference time saturates at E4M3's maximum of 448 after scaling, and anything much smaller loses resolution. WikiText-2, C4, and OpenOrca are the three options practitioners most often reach for. Downstream task calibration, which deliberately tunes the calibration set toward the model's expected inference distribution, tends to produce marginally smaller perplexity deltas but introduces a circularity risk if the calibration and evaluation sets overlap even partially. Such overlap biases the reported delta downward by an amount that is rarely quantified.

For reproducible results, the recommended protocol is:

1. Calibrate against C4 or WikiText-2; never against any portion of the evaluation split.

2. Evaluate on WikiText-2's held-out test split, or on a representative held-out slice of the inference corpus.

3. Cap calibration at 256 sequences of 2048 tokens.

4. Do not deliberately tune the calibration distribution to mirror the evaluation distribution — the resulting delta is unreportable.

5. Persist the calibration RNG seed alongside the FP8 checkpoint for later re-scoring.
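Parts of this protocol can be enforced mechanically. The sketch below checks the sequence cap, rejects exact overlap between calibration and evaluation sequences, and records the seed in a manifest meant to be stored next to the checkpoint. It is an illustration under stated assumptions: sequences are lists of token ids, only exact duplicates are detected (partial overlap is not), and `build_manifest` and its field names are hypothetical.

```python
import hashlib
import json


def fingerprint(token_ids):
    """Stable hash of one tokenized sequence."""
    text = ",".join(map(str, token_ids))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_manifest(calib_seqs, eval_seqs, seed, max_seqs=256, seq_len=2048):
    """Validate the calibration set and return a record to persist with the checkpoint."""
    if len(calib_seqs) > max_seqs:
        raise ValueError(f"calibration set exceeds {max_seqs} sequences")
    if any(len(seq) > seq_len for seq in calib_seqs):
        raise ValueError(f"calibration sequence longer than {seq_len} tokens")
    calib_prints = {fingerprint(seq) for seq in calib_seqs}
    eval_prints = {fingerprint(seq) for seq in eval_seqs}
    if calib_prints & eval_prints:
        raise ValueError("calibration and evaluation sets share sequences")
    return {
        "seed": seed,
        "num_calibration_sequences": len(calib_seqs),
        "max_seq_len": seq_len,
        "calibration_fingerprints": sorted(calib_prints),
    }


# Tiny illustrative inputs; real sequences would hold up to 2048 token ids.
calib = [[1, 2, 3], [4, 5, 6]]
held_out = [[7, 8, 9]]
manifest = build_manifest(calib, held_out, seed=1234)
manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
print(manifest_json)
```

Storing the fingerprints alongside the seed makes it possible to confirm later that a re-scoring run used the same calibration data.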

The spread observed across calibration datasets for Llama 3.1 8B + FP8 is reported below:

| Calibration Dataset | Typical PPL Delta vs. BF16 | Dominant Source of Variance |
| --- | --- | --- |
| WikiText-2 | 0.1–0.3 | Out-of-distribution activations rare; stable baseline |
| C4 | 0.15–0.4 | Slight off-policy drift on long-tail tokens |
| OpenOrca | 0.2–0.5 | Instruction-tuned distributions skew scaling factor estimation |

These bands are typical of the pattern observed across multiple calibration seeds and tokenizer configurations. Fine-tuned variants of Llama 3.1 — particularly LoRA-merged checkpoints — will not match them precisely and must be measured independently, since post-fine-tuning weight distributions shift the optimum scaling factor map in ways that base-model calibration cannot anticipate. The figures also do not extrapolate cleanly to the 70B variant, where activation patterns interact with tensor parallelism and the calibration protocol typically requires per-shard decomposition that breaks the single-card comparability assumed here.

## Hardware Constraints and Scaling Limitations for Larger Variants

The 24GB ceiling is the load-bearing constraint of this entire evaluation workflow, and it rules out the 70B variant in any single-card configuration. Llama 3.1 70B at FP8 occupies approximately 70GB of VRAM; at BF16 the same model consumes roughly 140GB. Both figures exceed a single 24GB device by a factor of three to six. Running the 70B model on 24GB cards therefore means either offloading most of the weights to CPU memory or NVMe, or sharding the model across several GPUs. The multi-GPU route introduces communication overhead, latency variance tied to interconnect bandwidth, and a fresh calibration protocol that is not directly comparable to single-card results.

The 8B variant, by contrast, fits comfortably with KV-cache headroom for production contexts up to 16K tokens, depending on batch size and whether paged attention is enabled in the vLLM configuration. Multi-GPU tensor parallelism at the 8B scale on a 24GB card is unnecessary and would only introduce artifacts in the perplexity measurement pipeline that complicate interpretation; for the 8B scenario, single-GPU inference is the reference configuration against which future 8B-class models should be compared.
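The KV-cache headroom can be estimated from the model's attention layout. The sketch below assumes the published Llama 3.1 8B configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128) and counts only the cache tensors themselves, ignoring allocator and paging overhead. `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions of one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * tokens


GIB = 2 ** 30

bf16_cache = kv_cache_bytes(16_384)                    # 16K context, BF16 cache
fp8_cache = kv_cache_bytes(16_384, bytes_per_elem=1)   # same context, FP8 cache

print(f"per token (BF16): {kv_cache_bytes(1) // 1024} KiB")
print(f"16K context, BF16 cache: {bf16_cache / GIB:.1f} GiB")
print(f"16K context, FP8 cache:  {fp8_cache / GIB:.1f} GiB")
```

The figure scales linearly with the number of concurrent sequences, which is why the usable context depends on batch size.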

For developers, the conclusion is operational rather than abstract. Implement the perplexity-measurement pipeline on the 8B variant first, validate the toolchain end-to-end against a stable random seed, confirm that the PPL delta remains inside the expected 0.1–0.3 envelope, and only then extend the methodology to sharded 70B configurations where the cost of a measurement error is substantially higher. Attempting to compress the full 70B checkpoint into a 24GB envelope via aggressive CPU or NVMe offloading produces results that are non-comparable to any other published quantization benchmark and should be flagged as such in any subsequent reporting — including any extrapolation to production routing decisions where the FP8 path is being weighed against a BF16 multi-GPU baseline.