
Releases: tracel-ai/burn

v0.18.0

18 Jul 16:27
f5d889d

Summary

This release marks a significant step forward in performance, reliability, and optimization, making the system more robust and efficient for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring solid behavior across an increasing number of supported platforms.

Matrix Multiplication Improvements

Optimized matrix multiplication kernels with specialized implementations for:

  • Matrix-vector (mat@vec)
  • Vector-matrix (vec@mat)
  • Inner product
  • Outer product

We also made the matrix multiplication kernel generation engine more flexible, surpassing traditional GEMM (General Matrix Multiply) approaches.

For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.
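As a rough illustration of the shape patterns these specialized cases cover (kernel dispatch itself happens inside the backend and is not user-visible), here is a minimal sketch using the NdArray backend, assuming the ndarray feature is enabled; the shapes are the point, not the backend:

use burn::backend::NdArray;
use burn::prelude::*;

fn main() {
    type B = NdArray;
    let device = Default::default();

    let mat = Tensor::<B, 2>::ones([64, 32], &device);
    let col = Tensor::<B, 2>::ones([32, 1], &device); // column vector
    let row = Tensor::<B, 2>::ones([1, 64], &device); // row vector

    let mat_vec = mat.clone().matmul(col.clone());           // mat @ vec -> [64, 1]
    let vec_mat = row.clone().matmul(mat);                    // vec @ mat -> [1, 32]
    let inner = row.clone().matmul(row.clone().transpose());  // inner product -> [1, 1]
    let outer = col.matmul(row);                              // outer product -> [32, 64]

    assert_eq!(mat_vec.dims(), [64, 1]);
    assert_eq!(vec_mat.dims(), [1, 32]);
    assert_eq!(inner.dims(), [1, 1]);
    assert_eq!(outer.dims(), [32, 64]);
}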

Fusion Enhancements

  • Improved reliability and performance of Burn Fusion through advanced optimizations.
  • Added support for basic dead code elimination.
  • Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.

Multi-Threading and Memory Management

  • Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
  • Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
    • Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
    • Fixed bugs related to premature memory deallocation, enhancing memory management stability.

CubeCL Config

By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.

A typical cubecl.toml file might look like this:

[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }

Each section configures a different aspect of CubeCL:

  • profiling: Controls performance profiling and logging.
  • autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
  • compilation: Manages kernel compilation logging and cache.

For more info, check out the CubeCL book.

As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.

Changelog

Breaking: the default strides for pooling modules now match the kernel size instead of defaulting to 1. This will affect output shapes if strides were not explicitly set; to keep the previous behavior, set the strides explicitly as shown below.

MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+   .with_strides([1, 1])
    .with_padding(PaddingConfig2d::Same)
    .init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+   .with_stride(1)
    .with_padding(PaddingConfig1d::Same)
    .init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+   .with_strides([1, 1])
    .with_padding(PaddingConfig2d::Same)
    .init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+   .with_stride(1)
    .with_padding(PaddingConfig1d::Same)
    .init();
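
For a concrete sense of the change, here is a minimal sketch (assuming the NdArray backend and the default padding) comparing output shapes under the new kernel-sized default strides and the previous strides of 1:

use burn::backend::NdArray;
use burn::nn::pool::MaxPool2dConfig;
use burn::prelude::*;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let input = Tensor::<B, 4>::zeros([1, 1, 8, 8], &device);

    // New default: strides follow the kernel size ([2, 2] here).
    let pool = MaxPool2dConfig::new([2, 2]).init();
    assert_eq!(pool.forward(input.clone()).dims(), [1, 1, 4, 4]);

    // Previous behavior (strides of 1) must now be requested explicitly.
    let pool_old = MaxPool2dConfig::new([2, 2]).with_strides([1, 1]).init();
    assert_eq!(pool_old.forward(input).dims(), [1, 1, 7, 7]);
}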

Module & Tensor

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

  • Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
  • Replace run-checks scripts with command alias (#3118) @laggui
  • Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
  • Add cubecl.toml config (#3150) @nathanielsimard
  • Use CUBECL_DEBUG_OPTION=profile macos ci (#...

v0.17.1

03 Jun 13:07

Bug Fixes & Improvements

v0.17.0

23 Apr 19:50
0ad54ca

Summary

This release brings major upgrades in performance and platform compatibility (most notably, a new Metal backend via WGPU passthrough). CubeCL now powers backends for Cuda, Metal, Rocm, Vulkan, and WebGpu. Tensor operation fusion support has been greatly expanded to optimize element-wise operations, reductions, and matmul.

A new compilation cache and improved autotune cache speed up repeated runs by reusing precompiled binaries and tuned kernel configurations. Data parallel training now scales better across multiple GPUs with automatic batch assignment to each worker. A new tensor slice API offers a simpler, more intuitive way to index tensors.

This version also comes with broad performance gains across tensor operations, especially for reductions, matmul, and convolutions. An initial implementation of quantized matmul is now available, with further quantization improvements planned in the future.

As with previous releases, this includes various bug fixes, further optimizations and enhanced documentation.

Be sure to check out the new burn-bench to compare performance across different versions, hardware and backends.

CubeCL Backends

Burn supports Cuda, Rocm, Vulkan, WebGpu, and the newly added Metal backend.

Each backend can be used through its respective type alias, provided that the appropriate backend feature flag is also enabled.

Metal
burn = { version = "0.17.0", features = ["metal"] }
use burn::prelude::*;
use burn::backend::wgpu::{Metal, WgpuDevice};

let tensor = Tensor::<Metal, 2>::zeros([2, 4], &WgpuDevice::default());
Cuda
burn = { version = "0.17.0", features = ["cuda"] }
use burn::prelude::*;
use burn::backend::cuda::{Cuda, CudaDevice};

let tensor = Tensor::<Cuda, 2>::zeros([2, 4], &CudaDevice::default());
Rocm
burn = { version = "0.17.0", features = ["rocm"] }
use burn::prelude::*;
use burn::backend::rocm::{Rocm, HipDevice};

let tensor = Tensor::<Rocm, 2>::zeros([2, 4], &HipDevice::default());
Vulkan
burn = { version = "0.17.0", features = ["vulkan"] }
use burn::prelude::*;
use burn::backend::wgpu::{Vulkan, WgpuDevice};

let tensor = Tensor::<Vulkan, 2>::zeros([2, 4], &WgpuDevice::default());
WebGpu
burn = { version = "0.17.0", features = ["webgpu"] }
use burn::prelude::*;
use burn::backend::wgpu::{WebGpu, WgpuDevice};

let tensor = Tensor::<WebGpu, 2>::zeros([2, 4], &WgpuDevice::default());

Warning

When using one of the wgpu backends, you may encounter compilation errors related to recursive type evaluation. This is due to complex type nesting within the wgpu dependency chain.
To resolve this issue, add the following line at the top of your main.rs or lib.rs file:

#![recursion_limit = "256"]

The default recursion limit (128) is often just below the required depth (typically 130-150) due to deeply nested associated types and trait bounds.

Data Loader and Batcher

The Batcher trait has been updated to improve multi-device support. Previously, batcher implementations stored a device internally, which could lead to all data being loaded on the same device. The latest changes make the DataLoader generic over the backend, while the device is passed explicitly to the batch method:

-impl<B: Backend> Batcher<MyItem, MyBatch<B>> for MyBatcher<B> {
+impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
-   fn batch(&self, items: Vec<MyItem>) -> MyBatch<B> {
+   fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
        // The correct `device` is already provided for the batching logic to use
    }
}
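
For reference, a complete batcher under the new trait shape might look like the sketch below; MyItem, MyBatch, and MyBatcher are illustrative names rather than library types:

use burn::data::dataloader::batcher::Batcher;
use burn::prelude::*;
use burn::tensor::TensorData;

// Illustrative item and batch types.
#[derive(Clone, Debug)]
pub struct MyItem {
    pub values: Vec<f32>,
}

#[derive(Clone, Debug)]
pub struct MyBatch<B: Backend> {
    pub inputs: Tensor<B, 2>,
}

// The batcher no longer stores a device; it receives one on every `batch` call.
#[derive(Clone, Default)]
pub struct MyBatcher;

impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
    fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
        // Build one [1, n] row per item on the provided device, then concatenate.
        let rows: Vec<Tensor<B, 2>> = items
            .iter()
            .map(|item| {
                let data = TensorData::new(item.values.clone(), [1, item.values.len()]);
                Tensor::<B, 2>::from_data(data, device)
            })
            .collect();
        MyBatch { inputs: Tensor::cat(rows, 0) }
    }
}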

The device can now be set when building a data loader:

let dataloader = DataLoaderBuilder::new(batcher)
        .batch_size(batch_size)
        .shuffle(seed)
        .num_workers(num_workers)
+       .set_device(device)
        .build(dataset);

This step is not required for the Learner, which handles the device configuration automatically.

Better Tensor Slicing & Indexing

Tensor slicing now fully adopts idiomatic Rust range syntax, replacing the older (i64, i64) and Option tuple forms.

For example:

let tensor = Tensor::<B, 2>::zeros([m, n], &device);
-let slice = tensor.slice([(0, -1), (0, -2)]);
+let slice = tensor.slice([0..-1, 0..-2]);

For more complex or mixed range types, use the s![] macro:

let tensor = Tensor::<B, 3>::zeros([b, s, d], &device);
-let slice = tensor.slice([None, Some((t as i64, t as i64 + 1)), None]);
+let slice = tensor.slice(s![.., t..t + 1, ..]);

The macro is inspired by ndarray's s![] (at least, by name) and helps build flexible slice patterns.

use burn::prelude::*;

let tensor = Tensor::<B, 4>::zeros([8, 4, 2, 3], &device);
let slice = tensor.slice(s![..=4, 0..=3, .., -1]);
assert_eq!(slice.dims(), [5, 4, 2, 1]);

Changelog

Module & Tensor

Bug Fixes

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support


v0.16.1

03 Apr 21:38

Fixes / Improvements

v0.16.0

14 Jan 21:16

Summary

This release significantly enhances GPU utilization through a new tensor transaction mechanism for batched sync operations and simultaneous reads of multiple bindings for CubeCL runtimes. It also includes multiple performance optimizations like mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.

Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operations fusion and an experimental fused matrix multiplication.

Training components have been expanded with support for semantic segmentation and object detection datasets, new training metrics, and improved training performance thanks to an async metric processor.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.

Module & Tensor

Bug Fixes

  • Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
  • Fix one_hot implementation for Int Tensors (#2501) @maun
  • Fix tensor prod and prod dim containing nan values (#2515) @quinton11
  • Expose ItemLazy to be able to implement for custom types (#2525) @laggui
  • Check nonzero stride, dilation and groups (#2540) @laggui
  • Module derive types should inherit visibility (#2610) @laggui
  • Add dropout prob check (#2695) @laggui

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

v0.15.0

28 Oct 19:45
65aa8b5

Summary

This release brings major performance improvements to tensor operations, particularly in matrix multiplication and convolution, along with experimental ROCm/HIP and SPIR-V support enabled by CubeCL runtimes. It also introduces foundational features for multi-backend compatibility and adds new quantization operations.

Support for ONNX models has been expanded, with additional operators and bug fixes for better operator coverage.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations, and enhanced documentation.

Module & Tensor

Bug Fixes

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

v0.14.0

27 Aug 17:24

Summary

This release marks the debut of our CubeCL integration, which brings cross-platform GPU programming capabilities directly to Rust.
With CubeCL now supporting both CUDA and WebGPU, Burn benefits from a new CUDA backend that can be enabled using the cuda-jit feature.
Please note that this backend is still considered experimental, and some operations, particularly those related to vision, may experience issues.

Additionally, this release features significant enhancements to ONNX support, including bug fixes, new operators, and improvements in code generation.

As always, it also includes numerous bug fixes, performance enhancements, new tensor operations, and improved documentation.

Burn 0.14.0 introduces a new tensor data format that significantly improves serialization and deserialization speeds, along with Quantization, a new beta feature. The format is not compatible with previous versions of Burn, but you can migrate your previously saved records using this guide.

Module & Tensor

Bug Fixes

ONNX Support

Bug Fixes

Enhancements

Refactoring

Documentation & Examples


v0.13.2

03 May 14:23

Bugfix

  • Fix autodiff graph memory management strategy to improve performance (#1702 #1710) @louisfd
  • Fix matmul double broadcasting for ndarray (#1646 #1679) @lancelet

v0.13.1

26 Apr 20:01

Bugfix

  • Fix autodiff memory leak and improve performance with a new graph memory management strategy (#1698) @nathanielsimard @louisfd
  • Fix inplace fused operations (#1682) @nathanielsimard

Improvements

  • Linear 1D support, helpful for ONNX support (#1682) @nathanielsimard
  • Upgrade wgpu to 0.19.4 (#1692) @nathanielsimard

v0.13.0

12 Apr 20:12
cf7b279

The Burn Release 0.13 is a significant update introducing numerous new features and performance enhancements. One major change is the removal of the Sync trait implementation from most Burn types (see Core User APIs). Additionally, the release introduces several new tensor operations, module features, and optimizers, as well as improvements to the autodiff backend. Notably, a new bridge mechanism facilitates runtime switching between backends, and significant work has been done on the Just-in-Time and Wgpu backends. The release also addresses numerous bug fixes, documentation improvements, infrastructure updates, CI enhancements, and miscellaneous changes to improve code quality and usability.

Core User APIs

A major change in this release is that most Burn types no longer implement the Sync trait, such as modules, optimizers, and tensors. This change should not impact users of the Learner struct for model training. However, it may affect those who implemented their own training loop or inference server. While modules, optimizers, and tensors can be sent to other threads, they cannot be accessed concurrently by multiple threads. This aligns with Burn's workflow, where each tensor operation requires an owned version of the tensor. The change was made to safely reduce the number of locks needed when modifying the state of the autodiff graph, fusion state, allocation cache, and various other use cases. While not all locks have been removed, the type signature no longer poses a problem for follow-up optimizations. Note that the same tensor can still be used from multiple threads without copying the underlying data; however, it must be cloned before being sent to another thread. (#1575) @nathanielsimard
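
As a minimal sketch of what this means in practice (assuming the NdArray backend with the ndarray feature enabled), a tensor handle can be cloned and the clone moved into another thread; the clone shares the underlying data rather than copying it:

use burn::backend::NdArray;
use burn::prelude::*;
use std::thread;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let tensor = Tensor::<B, 2>::ones([2, 2], &device);

    // Clone the handle (the underlying data is shared, not copied), then move
    // the clone into another thread: tensors are Send, but no longer Sync.
    let handle = tensor.clone();
    let worker = thread::spawn(move || handle.sum().into_scalar());

    let total = worker.join().unwrap();
    println!("sum computed on the worker thread: {total}");
}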

Tensor

Module

Optimizer

Train

Backend

This release also introduces the backend bridge, a new mechanism for runtime switching between backends.
While an improvement, it remains compatible with previous methods of supporting mixed precision. (#1529) @nathanielsimard

JIT

Significant effort has been devoted over the past few months to refactor the previous Wgpu backend into a shader-agnostic Just-in-Time backend.
All lower-level dependencies have been abstracted into the Just-in-Time Runtime trait, requiring a compiler, compute server, and storage.
The bulk of this work was carried out by @nathanielsimard and @louisfd.

Commits: #1274 #1280 #1313 #1340 #1356 #1359 #1378 #1391 #1396 #1398 #1417 #1429 #1423 #1424 #1433 #1456 #1474 #1457 #1480 #1472 #1493 #1509 #1530 #1528 #1541 #1550 #1569

Wgpu

Autodiff

Extensive work has also been undertaken on Burn's autodiff backend.
The backend now supports gradient checkpointing to reduce memory usage and has been refactored into a client/server architecture.
These updates result in significantly less blocking when tracking gradients, enhancing performance particularly on smaller models.
Furthermore, various bugs have been fixed where some graph nodes weren't used, potentially truncating the autodiff graph.
Overall, these changes make the autodiff process more reliable and efficient. (#1575) (#1358) @louisfd @nathanielsimard

Candle

Data

Import

Benchmarks

We have implemented a system that enables the comparison of backends across a variety of tasks.
Currently, most of these tasks consist of micro-benchmarks, but we plan to expand the range of benchmarks in the future.
To ensure Burn's portability and performance across different devices, the community can run and upload benchmarks! 🔥

Bug Fix

Infrastructure

The minimum Rust version has been updated to 1.75. (#1297) @syl20bnr

Docs
