Focus.AI Labs / Talk Page

How to quantize models (without killing quality)

Day 1 / 2:05 PM — one-screen context page for this session.

Slot: 2:05 PM
Day: Day 1
Type: Talk
Speakers: Philip Kiely
Venue: Hyatt Regency Miami
Guide page: Focus.AI Labs style
01

Official Session Summary

Pulled from the live conference page.

Four-bit quantization has a bad reputation for destroying model quality. While it’s true that post-training quantization in 4-bit integer formats makes models noticeably worse, new microscaling data formats like MXFP4 and NVFP4 deliver on the promise of fast low-precision inference without meaningful quality loss. This talk introduces these data formats along with a shift from quantization as a binary decision to quantization as a granular process with model-level considerations (quantization across weights, activations, KV cache, attention) and layer-level considerations (quantization of input, output, and hidden layers) to help you preserve quality while accessing improved performance and cost characteristics from low-precision inference.
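To make the microscaling idea concrete, here is a rough sketch of the core mechanism the talk describes: small blocks of values sharing one power-of-two scale, with each element snapped to a coarse 4-bit grid. This is an illustrative simulation only, not the exact MXFP4 bit layout; the block size of 32, the power-of-two shared scale, and the FP4 (E2M1) value grid follow the general microscaling scheme, but the function name and implementation are the author's assumptions for demonstration.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (illustrative grid).
FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
# Symmetric grid: negatives plus positives, without duplicating zero.
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])

def quantize_mxfp4(x, block=32):
    """Quantize-then-dequantize x (length must be a multiple of `block`)
    with one shared power-of-two scale per block, MXFP4-style."""
    x = np.asarray(x, dtype=np.float32).reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Power-of-two scale so each block's max lands inside FP4's +/-6 range.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / 6.0))
    # Snap each scaled element to the nearest representable FP4 value.
    idx = np.abs((x / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx] * scale  # dequantized reconstruction
```

Because the scale is shared per 32-element block rather than per tensor, one outlier only degrades its own block, which is the key reason microscaling formats preserve quality better than tensor-wide 4-bit integer quantization.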

02

Speaker Background

Quick context on the person or people on stage.

Philip Kiely · Baseten · Head of Developer Relations

Head of Developer Relations at Baseten, translating model serving and inference tradeoffs into practical guidance for builders shipping AI products.

03

Why This Slot Matters

A compact framing layer for navigating the conference.

This is one of the more substantive, abstract-backed sessions on the schedule; this page gives enough context to decide whether to stay in the room.