Refactor quantization scheme #3042
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3042      +/-   ##
==========================================
+ Coverage   81.36%   81.40%    +0.03%
==========================================
  Files         821      821
  Lines      117791   118058      +267
==========================================
+ Hits        95844    96103      +259
- Misses      21947    21955        +8
Much more flexible given the different configurations for quantization! So I agree with the change from an enum to a struct for the QuantizationScheme.
I have some comments regarding naming, otherwise LGTM.
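For illustration, a rough sketch of the flexibility argument; the enum below paraphrases the old shape for contrast rather than reproducing the previous API:

// Sketch only: with an enum, every supported combination needs its own variant, and each
// new axis (accumulator precision, propagation, ...) multiplies the number of variants.
enum SchemeAsEnum {
    PerTensorSymmetricInt8,
    PerTensorAffineInt8,
    // ... one variant per (level, mode, dtype, ...) combination
}

// With the struct from this PR, each axis varies independently and new configurations
// compose via struct update syntax (as in the tests further down):
fn propagating_scheme() -> QuantScheme {
    QuantScheme {
        propagation: QuantPropagation::Propagate,
        ..Default::default()
    }
}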
Can't self-assign to an open PR, but just stating in the open that I will take over the WIP and complete the refactor.
The change to a new quantization scheme struct impacts a lot of files at a superficial level, but I've left some comments below to highlight the important changes.
Left this as a draft even though it is in a good state to review, mostly because we should avoid merging before the multi-tensor handle PR so we are the ones dealing with the conflicts 😅
match (self.primitive, other.primitive) {
    (TensorPrimitive::QFloat(lhs), TensorPrimitive::QFloat(rhs)) => {
-       Self::new(TensorPrimitive::QFloat(B::q_matmul(lhs, rhs)))
+       Self::new(B::q_matmul(lhs, rhs))
    }
    (TensorPrimitive::QFloat(lhs), TensorPrimitive::Float(rhs)) => Self::new(
        TensorPrimitive::Float(B::float_matmul(B::dequantize(lhs), rhs)),
    ),
    (TensorPrimitive::Float(lhs), TensorPrimitive::QFloat(rhs)) => {
        // NOTE: in a typical workflow with linear layers (e.g., transformers), the rhs
        // represents the weights.
        //
        // Since `q_matmul(lhs_f16, rhs_quant)` isn't currently supported, in practice it makes
        // more sense to re-quantize the input back. Better usability.
        //
        // This might change in the future (dequantize on read in fusion?).
        Self::new(B::q_matmul(B::quantize_dynamic(lhs, rhs.scheme()), rhs))
    }
    (TensorPrimitive::Float(lhs), TensorPrimitive::Float(rhs)) => {
        Self::new(TensorPrimitive::Float(B::float_matmul(lhs, rhs)))
    }
    (lhs, rhs) => Self::new(TensorPrimitive::Float(B::float_matmul(
        lhs.tensor(),
        rhs.tensor(),
    ))),
}
See the special note for matmul with a float lhs and quantized rhs.
/// Operations on quantized tensors.
///
/// # Return Type Semantics
///
/// The return type of each operation indicates how quantization is handled:
///
/// ## [`QuantizedTensor<B>`]
/// If the method returns a `QuantizedTensor<B>`, the operation is expected to preserve the quantized
/// representation. Implementations should avoid dequantizing when possible to maintain performance.
/// For example, shape or layout changes such as expand or transpose preserve quantization.
///
/// *Note: while this currently doesn't affect the quantized tensor parameters (only per-tensor is
/// supported at the time of writing), other quantization levels (e.g., per-block) may require re-ordering
/// the quantization parameters to match the new layout.*
///
/// ## [`TensorPrimitive<B>`]
/// If the method returns a `TensorPrimitive<B>` enum, the return type should align with the propagation
/// strategy specified in the quantization scheme. The output should either remain quantized
/// ([`TensorPrimitive::QFloat`]) or be returned in floating-point form ([`TensorPrimitive::Float`]).
///
/// This distinction allows for fine-grained control over mixed-precision flows while still operating
/// through a unified API.
pub trait QTensorOps<B: Backend> {
Important specification related to quantization scheme propagation.
This is like a contract.
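To make the contract concrete from a caller's perspective, here is a minimal sketch (not code from this PR) of consuming an op that returns a TensorPrimitive<B>. The helper name is made up, the import paths are approximate, and the float_sum call assumes the usual float ops on the backend:

use burn_tensor::TensorPrimitive; // import paths approximate
use burn_tensor::backend::Backend;
use burn_tensor::ops::{FloatTensor, QuantizedTensor};

// Sketch: handling both possible return forms of a scheme-propagating op.
fn reduce_matmul<B: Backend>(lhs: QuantizedTensor<B>, rhs: QuantizedTensor<B>) -> FloatTensor<B> {
    match B::q_matmul(lhs, rhs) {
        // Propagation requested: the result is still quantized, so dequantize explicitly
        // before continuing in float precision.
        TensorPrimitive::QFloat(out) => B::float_sum(B::dequantize(out)),
        // Propagation inhibited: the result already comes back as a float tensor.
        TensorPrimitive::Float(out) => B::float_sum(out),
    }
}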
@@ -140,6 +208,110 @@ pub trait QTensorOps<B: Backend> {
        false
    }

    /// Broadcasts the `tensor` to the given `shape`.
    fn q_expand(tensor: QuantizedTensor<B>, shape: Shape) -> QuantizedTensor<B>;
For example, this should always return a quantized tensor
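As an illustration in the style of the tests further down (a hypothetical test, assuming the high-level expand call dispatches to q_expand for QFloat tensors):

#[test]
fn q_expand_keeps_quantization() {
    // Hypothetical test, written in the style of the propagation tests below.
    let device = Default::default();
    let scheme = QuantScheme::default();

    let tensor = TestTensor::<2>::from_floats([[1.0, 2.0]], &device).quantize_dynamic(&scheme);

    // expand is a layout-only op, so the result should stay quantized regardless of propagation.
    let expanded = tensor.expand([3, 2]);
    assert_eq!(expanded.to_data().dtype, DType::QFloat(scheme));
}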
@@ -411,8 +539,8 @@ pub trait QTensorOps<B: Backend> {
    /// # Returns
    ///
    /// The result of multiplying the two tensors together using matrix multiplication.
-   fn q_matmul(lhs: QuantizedTensor<B>, rhs: QuantizedTensor<B>) -> QuantizedTensor<B> {
-       dequant_op_quant!(
+   fn q_matmul(lhs: QuantizedTensor<B>, rhs: QuantizedTensor<B>) -> TensorPrimitive<B> {
But compute operations like matmul, which affect the qparams, can return a TensorPrimitive::Float or TensorPrimitive::QFloat based on the propagation.
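For illustration only, a minimal sketch of how a fallback could honor the propagation setting; this is not the implementation merged here, and the scheme() accessor and quantize_dynamic signature are assumed to match their usage in the matmul diff above:

// Sketch of a propagation-aware fallback; not the implementation from this PR.
fn q_matmul_fallback<B: Backend>(
    lhs: QuantizedTensor<B>,
    rhs: QuantizedTensor<B>,
) -> TensorPrimitive<B> {
    // Assumes a `scheme()` accessor like the `rhs.scheme()` used in the matmul diff above.
    let scheme = lhs.scheme().clone();

    // Compute in float precision (acc_precision could further refine the accumulator dtype).
    let out = B::float_matmul(B::dequantize(lhs), B::dequantize(rhs));

    match scheme.propagation {
        // Re-quantize the output with the input scheme.
        QuantPropagation::Propagate => {
            TensorPrimitive::QFloat(B::quantize_dynamic(out, &scheme))
        }
        // Return the floating-point result as-is.
        QuantPropagation::Inhibit => TensorPrimitive::Float(out),
    }
}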
fn q_mask_where(
    tensor: QuantizedTensor<B>,
    mask: BoolTensor<B>,
    value: QuantizedTensor<B>,
) -> QuantizedTensor<B> {
I removed some ambiguous ops for now.
/// Describes a quantization scheme/configuration.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub struct QuantScheme {
    /// Granularity level of quantization (e.g., per-tensor).
    pub level: QuantLevel,
    /// Quantization mode (e.g., symmetric).
    pub mode: QuantMode,
    /// Data type used for storing quantized values (e.g., QInt8).
    pub q_type: QuantInputType,
    /// Precision used for accumulating intermediate values (e.g., during matmul).
    pub acc_precision: QuantAccPrecision,
    /// Whether to propagate quantization to outputs or return unquantized results.
    pub propagation: QuantPropagation,
}
New scheme struct
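As a usage sketch of the new struct: the QuantLevel, QuantMode, and QuantInputType variant names below are assumptions for illustration; only QuantAccPrecision::Full and the QuantPropagation variants appear in this PR's excerpts.

// Usage sketch: the two new axes (acc_precision, propagation) sit alongside the existing ones.
fn example_scheme() -> QuantScheme {
    QuantScheme {
        level: QuantLevel::Tensor,       // per-tensor granularity (assumed variant name)
        mode: QuantMode::Symmetric,      // symmetric quantization (assumed variant name)
        q_type: QuantInputType::QInt8,   // int8 storage (assumed variant name)
        acc_precision: QuantAccPrecision::Full, // accumulate in f32 (used in the tests below)
        propagation: QuantPropagation::Inhibit, // compute ops return float outputs
    }
}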
/// Specify if the output of an operation is quantized using the scheme of the input
/// or returned unquantized.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub enum QuantPropagation {
    /// The output is quantized using the scheme of the input.
    Propagate,
    /// The output is not quantized.
    Inhibit,
}
Still unsure about the naming, especially the name of the variants.
#[test]
fn test_matmul_lhs_float_rhs_quantized() {
    // Simulates a typical workflow with linear layers (e.g., transformers), where the rhs
    // represents the weights. The lhs might be a float if a previous operation did not propagate
    // the quantization. We still want to perform an efficient matmul with quantized weights.
    //
    // Since `q_matmul(lhs_f16, rhs_quant)` isn't currently supported, in practice it makes
    // more sense to re-quantize the input back at this time. Better usability.
    //
    // This might be handled differently in the future (dequantize on read in fusion?).
    let tensor_1 = TestTensor::<2>::from([[1.0, 6.35], [2.0, 3.0], [1.0, 3.0]]);
    let tensor_2 = QTensor::<TestBackend, 2>::int8([[4.0, 8.0, 12.7], [2.0, 3.0, 6.0]]);
    let tensor_3 = tensor_1.matmul(tensor_2);

    let expected = TensorData::from([[16.7, 27.05, 50.8], [14., 25., 43.4], [10., 17., 30.7]]);
    let output = tensor_3.into_data();
    output.assert_approx_eq::<FT>(&expected, Tolerance::rel_abs(1e-2, 1e-1));

    // Default quantization scheme does not propagate quantization with matmul
    assert!(output.dtype.is_float());
}
Detailed example and explanation for the matmul special case
#[test]
fn quant_scheme_should_propagate() {
    let device = Default::default();
    let scheme = QuantScheme {
        propagation: QuantPropagation::Propagate,
        ..Default::default()
    };

    let tensor_1 = TestTensor::<2>::from_floats([[1.0, 6.35], [2.0, 3.0], [1.0, 3.0]], &device)
        .quantize_dynamic(&scheme);
    let tensor_2 = TestTensor::<2>::from_floats([[4.0, 8.0, 12.7], [2.0, 3.0, 6.0]], &device)
        .quantize_dynamic(&scheme);

    let tensor_3 = tensor_1.matmul(tensor_2);
    assert_eq!(tensor_3.to_data().dtype, DType::QFloat(scheme));

    let tensor_4 = tensor_3.add_scalar(1.);
    assert_eq!(tensor_4.to_data().dtype, DType::QFloat(scheme));
}

#[test]
fn quant_scheme_should_not_propagate() {
    let device = Default::default();
    let scheme = QuantScheme {
        propagation: QuantPropagation::Inhibit,
        acc_precision: QuantAccPrecision::Full, // f32
        ..Default::default()
    };

    let tensor_1 = TestTensor::<2>::from_floats([[1.0, 6.35], [2.0, 3.0], [1.0, 3.0]], &device)
        .quantize_dynamic(&scheme);
    let tensor_2 = TestTensor::<2>::from_floats([[4.0, 8.0, 12.7], [2.0, 3.0, 6.0]], &device)
        .quantize_dynamic(&scheme);

    // Some ops like reshape, swap_dims, permute, expand, select, slice, etc. do not affect
    // the propagation. It mostly applies to compute kernels.
    let tensor_1 = tensor_1
        .permute([1, 0])
        .swap_dims(0, 1)
        .reshape([1, 6])
        .reshape([3, 2]);
    assert_eq!(tensor_1.to_data().dtype, DType::QFloat(scheme));

    // When propagation is not desired, compute kernels like matmul should return tensor
    // in floating point precision
    let tensor_3 = tensor_1.matmul(tensor_2);
    let dtype = tensor_3.to_data().dtype;
    assert!(dtype.is_float());

    // Subsequent ops will therefore be performed on floats
    let tensor_4 = tensor_3.add(TestTensor::<2>::ones([3, 3], &device).cast(dtype));
    assert!(tensor_4.to_data().dtype.is_float());
}
Detailed tests that demonstrate the propagation expectations
* refactor quantization scheme
* impl acc precision and output mode
* clean test
* cargo fmt
* remove unused import
* wip
* Cargo fmt
* Make it work
* Narrow, chunk and split are all high-level slice-based methods
* Add argwhere empty test
* Cleanup qtensor ops
* Better docstrings
* Remove unused
* Add propagate test example
* Add return type semantics description
* Fusion ops passthrough
* Cleanup
* Handle lhs float rhs quant for practical use cases
* Fix clippy
* Use matches
* Remove comment
* Cleaner
* Fix merged conflicts

---------

Co-authored-by: Guillaume Lagrange <lagrange.guillaume.1@gmail.com>
Pull Request Template

Checklist

- The run-checks all script has been executed.

Changes

Refactor QuantizationScheme to a struct and add two new parameters: the quantization accumulator precision and the output mode.

Testing

No new tests, but the existing ones are still succeeding.