
Releases: tracel-ai/burn

v0.18.0

18 Jul 16:27
f5d889d

Summary

This release marks a significant step forward in performance, reliability, and optimization, making the system more robust and efficient for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring solid behavior across an increasing number of supported platforms.

Matrix Multiplication Improvements

Optimized matrix multiplication kernels with specialized implementations for:

  • Matrix-vector (mat@vec)
  • Vector-matrix (vec@mat)
  • Inner product
  • Outer product

We also made the matrix multiplication kernel generation engine more flexible, surpassing traditional GEMM (General Matrix Multiply) approaches.

For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.
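As a rough illustration of the shape patterns these specialized cases cover (kernel dispatch itself happens inside the backend and is not user-visible), here is a minimal sketch using the NdArray backend, assuming the ndarray feature is enabled; the shapes are the point, not the backend:

use burn::backend::NdArray;
use burn::prelude::*;

fn main() {
    type B = NdArray;
    let device = Default::default();

    let mat = Tensor::<B, 2>::ones([64, 32], &device);
    let col = Tensor::<B, 2>::ones([32, 1], &device); // column vector
    let row = Tensor::<B, 2>::ones([1, 64], &device); // row vector

    let mat_vec = mat.clone().matmul(col.clone());           // mat @ vec -> [64, 1]
    let vec_mat = row.clone().matmul(mat);                    // vec @ mat -> [1, 32]
    let inner = row.clone().matmul(row.clone().transpose());  // inner product -> [1, 1]
    let outer = col.matmul(row);                              // outer product -> [32, 64]

    assert_eq!(mat_vec.dims(), [64, 1]);
    assert_eq!(vec_mat.dims(), [1, 32]);
    assert_eq!(inner.dims(), [1, 1]);
    assert_eq!(outer.dims(), [32, 64]);
}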

Fusion Enhancements

  • Improved reliability and performance of Burn Fusion through advanced optimizations.
  • Added support for basic dead code elimination.
  • Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.

Multi-Threading and Memory Management

  • Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
  • Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
    • Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
    • Fixed bugs related to premature memory deallocation, enhancing memory management stability.

CubeCL Config

By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.

A typical cubecl.toml file might look like this:

[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }

Each section configures a different aspect of CubeCL:

  • profiling: Controls performance profiling and logging.
  • autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
  • compilation: Manages kernel compilation logging and cache.

For more info, check out the CubeCL book.

As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.

Changelog

Breaking: the default strides for pooling modules now match the kernel size instead of defaulting to 1. This will affect output shapes if strides were not explicitly set; to keep the previous behavior, set the strides explicitly as shown below.

MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+   .with_strides([1, 1])
    .with_padding(PaddingConfig2d::Same)
    .init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+   .with_stride(1)
    .with_padding(PaddingConfig1d::Same)
    .init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+   .with_strides([1, 1])
    .with_padding(PaddingConfig2d::Same)
    .init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+   .with_stride(1)
    .with_padding(PaddingConfig1d::Same)
    .init();
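
For a concrete sense of the change, here is a minimal sketch (assuming the NdArray backend and the default padding) comparing output shapes under the new kernel-sized default strides and the previous strides of 1:

use burn::backend::NdArray;
use burn::nn::pool::MaxPool2dConfig;
use burn::prelude::*;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let input = Tensor::<B, 4>::zeros([1, 1, 8, 8], &device);

    // New default: strides follow the kernel size ([2, 2] here).
    let pool = MaxPool2dConfig::new([2, 2]).init();
    assert_eq!(pool.forward(input.clone()).dims(), [1, 1, 4, 4]);

    // Previous behavior (strides of 1) must now be requested explicitly.
    let pool_old = MaxPool2dConfig::new([2, 2]).with_strides([1, 1]).init();
    assert_eq!(pool_old.forward(input).dims(), [1, 1, 7, 7]);
}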

Module & Tensor

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

  • Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
  • Replace run-checks scripts with command alias (#3118) @laggui
  • Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
  • Add cubecl.toml config (#3150) @nathanielsimard
  • Use CUBECL_DEBUG_OPTION=profile macos ci (#...

v0.17.1

03 Jun 13:07

Bug Fixes & Improvements

v0.17.0

23 Apr 19:50
0ad54ca

Summary

This release brings major upgrades in performance and platform compatibility (most notably, a new Metal backend via WGPU passthrough). CubeCL now powers backends for Cuda, Metal, Rocm, Vulkan, and WebGpu. Tensor operation fusion support has been greatly expanded to optimize element-wise operations, reductions, and matmul.

A new compilation cache and improved autotune cache speed up repeated runs by reusing precompiled binaries and tuned kernel configurations. Data parallel training now scales better across multiple GPUs with automatic batch assignment to each worker. A new tensor slice API offers a simpler, more intuitive way to index tensors.

This version also comes with broad performance gains across tensor operations, especially for reductions, matmul, and convolutions. An initial implementation of quantized matmul is now available, with further quantization improvements planned in the future.

As with previous releases, this includes various bug fixes, further optimizations and enhanced documentation.

Be sure to check out the new burn-bench to compare performance across different versions, hardware and backends.

CubeCL Backends

Burn supports Cuda, Rocm, Vulkan, WebGpu, and the newly added Metal backend.

Each backend can be used through its respective type alias, provided that the appropriate backend feature flag is also enabled.

Metal
burn = { version = "0.17.0", features = ["metal"] }
use burn::prelude::*;
use burn::backend::wgpu::{Metal, WgpuDevice};

let tensor = Tensor::<Metal, 2>::zeros([2, 4], &WgpuDevice::default());
Cuda
burn = { version = "0.17.0", features = ["cuda"] }
use burn::prelude::*;
use burn::backend::cuda::{Cuda, CudaDevice};

let tensor = Tensor::<Cuda, 2>::zeros([2, 4], &CudaDevice::default());
Rocm
burn = { version = "0.17.0", features = ["rocm"] }
use burn::prelude::*;
use burn::backend::rocm::{Rocm, HipDevice};

let tensor = Tensor::<Rocm, 2>::zeros([2, 4], &HipDevice::default());
Vulkan
burn = { version = "0.17.0", features = ["vulkan"] }
use burn::prelude::*;
use burn::backend::wgpu::{Vulkan, WgpuDevice};

let tensor = Tensor::<Vulkan, 2>::zeros([2, 4], &WgpuDevice::default());
WebGpu
burn = { version = "0.17.0", features = ["webgpu"] }
use burn::prelude::*;
use burn::backend::wgpu::{WebGpu, WgpuDevice};

let tensor = Tensor::<WebGpu, 2>::zeros([2, 4], &WgpuDevice::default());

Warning

When using one of the wgpu backends, you may encounter compilation errors related to recursive type evaluation. This is due to complex type nesting within the wgpu dependency chain.
To resolve this issue, add the following line at the top of your main.rs or lib.rs file:

#![recursion_limit = "256"]

The default recursion limit (128) is often just below the required depth (typically 130-150) due to deeply nested associated types and trait bounds.

Data Loader and Batcher

The Batcher trait has been updated to improve multi-device support. Previously, batcher implementations stored a device internally, which could lead to all data being loaded on the same device. The latest changes make the DataLoader generic over the backend, while the device is passed explicitly to the batch method:

-impl<B: Backend> Batcher<MyItem, MyBatch<B>> for MyBatcher<B> {
+impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
-   fn batch(&self, items: Vec<MyItem>) -> MyBatch<B> {
+   fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
        // The correct `device` is already provided for the batching logic to use
    }
}
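
For reference, a complete batcher under the new trait shape might look like the sketch below; MyItem, MyBatch, and MyBatcher are illustrative names rather than library types:

use burn::data::dataloader::batcher::Batcher;
use burn::prelude::*;
use burn::tensor::TensorData;

// Illustrative item and batch types.
#[derive(Clone, Debug)]
pub struct MyItem {
    pub values: Vec<f32>,
}

#[derive(Clone, Debug)]
pub struct MyBatch<B: Backend> {
    pub inputs: Tensor<B, 2>,
}

// The batcher no longer stores a device; it receives one on every `batch` call.
#[derive(Clone, Default)]
pub struct MyBatcher;

impl<B: Backend> Batcher<B, MyItem, MyBatch<B>> for MyBatcher {
    fn batch(&self, items: Vec<MyItem>, device: &B::Device) -> MyBatch<B> {
        // Build one [1, n] row per item on the provided device, then concatenate.
        let rows: Vec<Tensor<B, 2>> = items
            .iter()
            .map(|item| {
                let data = TensorData::new(item.values.clone(), [1, item.values.len()]);
                Tensor::<B, 2>::from_data(data, device)
            })
            .collect();
        MyBatch { inputs: Tensor::cat(rows, 0) }
    }
}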

The device can now be set when building a data loader:

let dataloader = DataLoaderBuilder::new(batcher)
        .batch_size(batch_size)
        .shuffle(seed)
        .num_workers(num_workers)
+       .set_device(device)
        .build(dataset);

This step is not required for the Learner, which handles the device configuration automatically.

Better Tensor Slicing & Indexing

Tensor slicing now fully adopts idiomatic Rust range syntax, replacing the older (i64, i64) and Option tuple forms.

For example:

let tensor = Tensor::<B, 2>::zeros([m, n], &device);
-let slice = tensor.slice([(0, -1), (0, -2)]);
+let slice = tensor.slice([0..-1, 0..-2]);

For more complex or mixed range types, use the s![] macro:

let tensor = Tensor::<B, 3>::zeros([b, s, d], &device);
-let slice = tensor.slice([None, Some((t as i64, t as i64 + 1)), None]);
+let slice = tensor.slice(s![.., t..t + 1, ..]);

The macro is inspired by ndarray's s![] (at least, by name) and helps build flexible slice patterns.

use burn::prelude::*;

let tensor = Tensor::<B, 4>::zeros([8, 4, 2, 3], &device);
let slice = tensor.slice(s![..=4, 0..=3, .., -1]);
assert_eq!(slice.dims(), [5, 4, 2, 1]);

Changelog

Module & Tensor

Bug Fixes

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support


v0.16.1

03 Apr 21:38

Fixes / Improvements

v0.16.0

14 Jan 21:16

Summary

This release significantly enhances GPU utilization through a new tensor transaction mechanism for batched sync operations and simultaneous reads of multiple bindings for CubeCL runtimes. It also includes multiple performance optimizations like mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.

Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operations fusion and an experimental fused matrix multiplication.

Training components have been expanded with support for semantic segmentation and object detection datasets, new training metrics, and improved training performance thanks to an async metric processor.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.

Module & Tensor

Bug Fixes

  • Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
  • Fix one_hot implementation for Int Tensors (#2501) @maun
  • Fix tensor prod and prod dim containing nan values (#2515) @quinton11
  • Expose ItemLazy to be able to implement for custom types (#2525) @laggui
  • Check nonzero stride, dilation and groups (#2540) @laggui
  • Module derive types should inherit visibility (#2610) @laggui
  • Add dropout prob check (#2695) @laggui

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

v0.15.0

28 Oct 19:45
65aa8b5

Summary

This release brings major performance improvements to tensor operations, particularly in matrix multiplication and convolution, along with experimental ROCm/HIP and SPIR-V support enabled by CubeCL runtimes. It also introduces foundational features for multi-backend compatibility and adds new quantization operations.

Support for ONNX models has been expanded, with additional operators and bug fixes for better operator coverage.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations, and enhanced documentation.

Module & Tensor

Bug Fixes

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

v0.14.0

27 Aug 17:24

Summary

This release marks the debut of our CubeCL integration, which brings cross-platform GPU programming capabilities directly to Rust.
With CubeCL now supporting both CUDA and WebGPU, Burn benefits from a new CUDA backend that can be enabled using the cuda-jit feature.
Please note that this backend is still considered experimental, and some operations, particularly those related to vision, may experience issues.

Additionally, this release features significant enhancements to ONNX support, including bug fixes, new operators, and improvements in code generation.

As always, it also includes numerous bug fixes, performance enhancements, new tensor operations, and improved documentation.

Burn 0.14.0 introduces a new tensor data format that significantly improves serialization and deserialization speeds, along with Quantization, a new beta feature. The format is not compatible with previous versions of Burn, but you can migrate your previously saved records using this guide.

Module & Tensor

Bug Fixes

ONNX Support

Bug Fixes

Enhancements

Refactoring

Documentation & Examples


v0.13.2

03 May 14:23

Bugfix

  • Fix autodiff graph memory management strategy to improve performance (#1702 #1710) @louisfd
  • Fix matmul double broadcasting for ndarray (#1646 #1679) @lancelet

v0.13.1

26 Apr 20:01

Bugfix

  • Fix autodiff memory leak and improve performance with a new graph memory management strategy (#1698) @nathanielsimard @louisfd
  • Fix inplace fused operations (#1682) @nathanielsimard

Improvements

  • Linear 1D support, helpful for ONNX support (#1682) @nathanielsimard
  • Upgrade wgpu to 0.19.4 (#1692) @nathanielsimard

v0.13.0

12 Apr 20:12
cf7b279

The Burn Release 0.13 is a significant update introducing numerous new features and performance enhancements. One major change is the removal of the Sync trait implementation from most Burn types (see Core User APIs). Additionally, the release introduces several new tensor operations, module features, and optimizers, as well as improvements to the autodiff backend. Notably, a new bridge mechanism facilitates runtime switching between backends, and significant work has been done on the Just-in-Time and Wgpu backends. The release also addresses numerous bug fixes, documentation improvements, infrastructure updates, CI enhancements, and miscellaneous changes to improve code quality and usability.

Core User APIs

A major change in this release is that most Burn types no longer implement the Sync trait, such as modules, optimizers, and tensors. This change should not impact users of the Learner struct for model training. However, it may affect those who implemented their own training loop or inference server. While modules, optimizers, and tensors can be sent to other threads, they cannot be accessed concurrently by multiple threads. This aligns with Burn's workflow, where each tensor operation requires an owned version of the tensor. The change was made to safely reduce the number of locks needed when modifying the state of the autodiff graph, fusion state, allocation cache, and various other use cases. While not all locks have been removed, the type signature no longer poses a problem for follow-up optimizations. Note that the same tensor can still be used from multiple threads without copying the underlying data; however, it must be cloned before being sent to another thread. (#1575) @nathanielsimard
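
As a minimal sketch of what this means in practice (assuming the NdArray backend with the ndarray feature enabled), a tensor handle can be cloned and the clone moved into another thread; the clone shares the underlying data rather than copying it:

use burn::backend::NdArray;
use burn::prelude::*;
use std::thread;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let tensor = Tensor::<B, 2>::ones([2, 2], &device);

    // Clone the handle (the underlying data is shared, not copied), then move
    // the clone into another thread: tensors are Send, but no longer Sync.
    let handle = tensor.clone();
    let worker = thread::spawn(move || handle.sum().into_scalar());

    let total = worker.join().unwrap();
    println!("sum computed on the worker thread: {total}");
}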

Tensor

Module

Optimizer

Train

Backend

This release also introduces the backend bridge, a new mechanism for runtime switching between backends.
While an improvement, it remains compatible with previous methods of supporting mixed precision. (#1529) @nathanielsimard

JIT

Significant effort has been devoted over the past few months to refactor the previous Wgpu backend into a shader-agnostic Just-in-Time backend.
All lower-level dependencies have been abstracted into the Just-in-Time Runtime trait, requiring a compiler, compute server, and storage.
The bulk of this work was carried out by @nathanielsimard and @louisfd.

Commits: #1274 #1280 #1313 #1340 #1356 #1359 #1378 #1391 #1396 #1398 #1417 #1429 #1423 #1424 #1433 #1456 #1474 #1457 #1480 #1472 #1493 #1509 #1530 #1528 #1541 #1550 #1569

Wgpu

Autodiff

Extensive work has also been undertaken on Burn's autodiff backend.
The backend now supports gradient checkpointing to reduce memory usage and has been refactored into a client/server architecture.
These updates result in significantly less blocking when tracking gradients, enhancing performance particularly on smaller models.
Furthermore, various bugs have been fixed where some graph nodes weren't used, potentially truncating the autodiff graph.
Overall, these changes make the autodiff process more reliable and efficient. (#1575) (#1358) @louisfd @nathanielsimard

Candle

Data

Import

Benchmarks

We have implemented a system that enables the comparison of backends across a variety of tasks.
Currently, most of these tasks consist of micro-benchmarks, but we plan to expand the range of benchmarks in the future.
To ensure Burn's portability and performance across different devices, the community can run and upload benchmarks! 🔥

Bug Fix

Infrastructure

The minimum Rust version has been updated to 1.75. (#1297) @syl20bnr

Docs
