Summary
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust performance across a growing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also made the matrix multiplication kernel generation engine more flexible, going beyond traditional GEMM (General Matrix Multiply) approaches.
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.
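As a shape-level illustration of these specialized cases, the sketch below simply calls `matmul` with degenerate dimensions; kernel selection happens inside the backend, so user code does not change. The backend and shapes here are illustrative assumptions, not part of the release:

```rust
use burn::backend::NdArray;
use burn::tensor::Tensor;

// Shape-level sketch only: the specialized matrix-vector, vector-matrix,
// inner-product and outer-product kernels are selected by the backend.
fn main() {
    type B = NdArray;
    let device = Default::default();

    let mat = Tensor::<B, 2>::ones([4, 4], &device); // [m, k]
    let col = Tensor::<B, 2>::ones([4, 1], &device); // column vector [k, 1]
    let row = Tensor::<B, 2>::ones([1, 4], &device); // row vector [1, k]

    let mat_vec = mat.clone().matmul(col.clone()); // matrix-vector -> [4, 1]
    let vec_mat = row.clone().matmul(mat);         // vector-matrix -> [1, 4]
    let inner = row.matmul(col.clone());           // inner product -> [1, 1]
    let outer = col.matmul(Tensor::<B, 2>::ones([1, 4], &device)); // outer product -> [4, 4]

    println!(
        "{:?} {:?} {:?} {:?}",
        mat_vec.dims(),
        vec_mat.dims(),
        inner.dims(),
        outer.dims()
    );
}
```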
Fusion Enhancements
- Improved reliability and performance of Burn Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
- Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
  - Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
  - Fixed bugs related to premature memory deallocation, enhancing memory management stability.
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (`cubecl.toml` or `CubeCL.toml`) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.
A typical `cubecl.toml` file might look like this:
```toml
[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
```
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
For more info, check out the CubeCL book.
As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to strides of `1`. This will affect output shapes if strides were not explicitly set. To preserve the previous behavior, set the stride(s) explicitly, as shown in the diffs below.
MaxPool2dConfig
```diff
  let pool = MaxPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();
```
MaxPool1dConfig
```diff
  let pool = MaxPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
```
AvgPool2dConfig
```diff
  let pool = AvgPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();
```
AvgPool1dConfig
```diff
  let pool = AvgPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
```
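For a concrete sense of the shape change, here is a minimal sketch, assuming a `[batch, channels, length]` input of length 8, `kernel_size = 2`, no padding, and an NdArray backend (all illustrative choices, not part of the release):

```rust
use burn::backend::NdArray;
use burn::nn::pool::MaxPool1dConfig;
use burn::tensor::Tensor;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let input = Tensor::<B, 3>::zeros([1, 1, 8], &device);

    // New default: stride = kernel_size = 2 -> output length (8 - 2) / 2 + 1 = 4.
    let pool_new = MaxPool1dConfig::new(2).init();
    println!("{:?}", pool_new.forward(input.clone()).dims()); // [1, 1, 4]

    // Previous behavior, restored by setting the stride explicitly:
    // stride = 1 -> output length (8 - 2) / 1 + 1 = 7.
    let pool_old = MaxPool1dConfig::new(2).with_stride(1).init();
    println!("{:?}", pool_old.forward(input).dims()); // [1, 1, 7]
}
```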
Module & Tensor
- Add tensor `grid::meshgrid` (#3107 #3191) @crutcher
- Add scalar tensor operations (#3127) @ArthurBrussee
- Orthogonal initialization (#3109) @dymat
- Support importing safetensors format (#2721) @wandbrandon @antimora
- Add `burn::linalg` norms (#3131) @crutcher
- Extract Linear.forward to nn::functional::linear (#3147) @crutcher
- Base impl of matmul for Int tensor (#3201) @crutcher
- (perf) generate_mask functions optimizations (#3203) @tafia
- Add CosineEmbeddingLoss module and cosine_similarity function (#3207) @antimora
- Tensor::slice_fill() (#3221 #3223) @crutcher
- Base impl of `tensor.slice_dim(dim, range)` (#3235) @crutcher
- Support shifting pre-computed RoPE values (#3275) @laggui
- Improve RoPE partial shift case (#3290) @laggui
- Add `tensor.roll()` and improve `AsIndex` (renamed `IndexConversion`) (#3281) @crutcher
- [Breaking] Update pooling default strides to match kernel size (#3338) @lucianyao
- Add `is_finite` tensor element-wise op and fix `is_close`/`all_close` inf handling (#3341) @jonboh (see the sketch after this list)
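A minimal sketch of the new `is_finite` op next to `all_close` (the `is_finite` signature is an assumption based on the existing element-wise API; the backend choice is illustrative):

```rust
use burn::backend::NdArray;
use burn::tensor::Tensor;

fn main() {
    type B = NdArray;
    let device = Default::default();

    let x = Tensor::<B, 1>::from_floats([1.0, f32::INFINITY, 3.0], &device);
    let y = Tensor::<B, 1>::from_floats([1.0, f32::INFINITY, 3.0], &device);

    // Assumed shape of the new op: a boolean mask marking finite elements.
    let finite = x.clone().is_finite();
    println!("{finite}");

    // This release also fixes how is_close/all_close handle infinite values.
    println!("all_close: {}", x.all_close(y, None, None));
}
```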
Backends
- [Perf] Interpolate optimizations (#3077) @wingertge
- [Perf] Slice assign (#3069) @wingertge
- Add multi stage conv (#3105) @wingertge
- [Perf] Convolution migration to NHWC (#3090) @wingertge
- Merge different convolution dimensional kernels (#3115) @wingertge
- Support reduce mixed precision accumulation w/ fusion (#3132) @nathanielsimard
- Update remote backend (#3175) @Cielbird
- Feat/autotune optional (#3188) @nathanielsimard
- cubecl unit matmul (#3214) @louisfd
- Update CubeCL for client based profiling (#3222) @ArthurBrussee
- Update cubecl unit matmul double buffered (#3233) @louisfd
- Burn-remote to_device function (#3189) @Cielbird
- Add Drop operation for fusion (#3263) @nathanielsimard
- Lazy tensor downloading in burn-remote (#3276) @Cielbird
- Improve specialized matmul (#3304) @louisfd
- Add autotune priority (#3347 #3378) @nathanielsimard
- Fix local tuner deadlock (#3384) @nathanielsimard
- Fix fusion wasm unsafe input (#3385 #3386) @nathanielsimard
Bug Fixes
- Fix WASM deadlock by really properly not capturing locks (#3123) @ArthurBrussee
- Fix burn-cubecl with autotune disabled (#3141) @wingertge
- Fix fusion multiple reshapes (#3220) @nathanielsimard
- Fix/fusion multiple streams (#3297) @nathanielsimard
- Fix gather broadcasted indices in kernel impl and fusion (#3337) @laggui
- Fix rand interval (#3321) @laggui
- Restrict binary op lhs/rhs alias (#3349) @laggui
- Fix sum fallback when atomic add is not supported (#3369) @laggui
Documentation & Examples
- Update pytorch-model.md with a new troubleshooting help (#3081) @antimora
- Contributor example instructions (#3153) @AshAnand34
- Update README.md with DeepWiki badge (#3192) @antimora
- Add recursion_limit macro to getting started examples code (#3238) @Marc-AnthonyG
- KaTeX for Mathematical expressions in docstrings (#3278) @BhavyeMathur
- Add Metal backend support to custom-image-dataset (#3335 #3354) @TsaoLun
- Add link to license in README badge (#3356) @Olexandr88
Fixes
- Fix typo in Burn Book (#3113) @danny-burrows
- fix typos (#3186) @omahs
- Fix Typos in Documentation Comments (#3280) @leopardracer
- Fix typo in code documentation for BurnGraph codegen (#3286) @kilavvy
- Fix error messages from tensor checks for flatten (#3319) @NoVegetable
- Fix broken link to burn-tch (#3365) @dbdr
- Update documentation description for nonzero and nonzero_async (#3368) @catch-twenty-two
ONNX Support
- ONNX Import: switch to rank inferencing, rename shape to static_shape, decouple tensor shape info (#3037) @antimora
- Restrict ONNX opset to 16 and up (#3051) @antimora
- Allow Shape input type for Slice operation (#3092) @antimora
- Support onnx and, or & xor nodes (#3173) @tye-singwa
- Add support ONNX instance norm (#3177) @tye-singwa
- Onnx ceil & round (#3225) @tye-singwa
- Add support onnx group norm (#3245) @tye-singwa
- Add onnx SpaceToDepth / DepthToSpace (#3277) @tye-singwa
- Fix onnx topological sort check (#3284) @tye-singwa
- Add onnx ArgMin node (#3285) @tye-singwa
- Add support onnx size (#3301) @tye-singwa
- Support flexible backend selection for import tests (#3372 #3380) @lucianyao
- Fix ONNX node name sanitization and allow ai.onnx.ml domain (#3371) @antimora
Enhancements
- Replace some powf->powi (#3152) @ArthurBrussee
- Improve fusion compilation speed (#3155) @nathanielsimard
- Perf/remove repeat dim (#3183) @nathanielsimard
- Perf: Fusion search for composed optimization (#3258) @nathanielsimard
- Improve matmul selector (#3307 #3343 #3350 #3376) @nathanielsimard
Refactoring
- Refactor CubeCL slices (#3104) @nathanielsimard
- CubeCL init refactor (#3128) @nathanielsimard
- Refactor narrow, chunk and split (#3137) @laggui
- Refactor quantization scheme (#3042) @maxtremblay
- Migrated prng (random) to CubeCL (#3165 #3170) @Cielbird
- Break down `test_onnx.rs` into test subdirectories (#3144) @antimora
- Refactor: Move op_configuration.rs from burn-import to onnx-ir (#3126) @antimora
- Fix relative cmp + debug tools (#3197) @nathanielsimard
- Refactor cubecl line size matmul (#3219) @louisfd
- Absolute tolerance is too tight for strict/balanced/permissive (#3242) @laggui
- Fix clippy rust 1.88 and cargo run checks usage (#3325 #3320) @laggui
- Remove hip os cfg flags (#3336) @laggui
- Update cubecl matmul refactor / docs (#3366) @louisfd
Miscellaneous
- Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
- Replace run-checks scripts with command alias (#3118) @laggui
- Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
- Add cubecl.toml config (#3150) @nathanielsimard
- Use `CUBECL_DEBUG_OPTION=profile` macos ci (#3164) @laggui
- Update cubecl: sync_cube (#3163) @louisfd
- Fix autotune recursive (#3161) @nathanielsimard
- Bump zip dependency (#3199) @swfsql
- Import `derive_new::new` for `safetensors` feat (#3205) @swfsql
- Add CUDA, Vulkan and WGPU on-demand self-hosted runners (#3190 #3215 #3334 #3348 #3351 #3352) @syl20bnr
- Fix: size_of import in quantization tests (#3195) @louisfd
- burn-dataset: Catch import.py unsuccessful exits (#3236) @drozdziak1
- Adding image dimensions to ImageDatasetItem (#3251) @catch-twenty-two
- burn-dataset: Make virtualenv optional when running importer.py (#3255) @drozdziak1
- Fix cubecl std usage (#3306) @laggui
- Fix tui legend label placement (#3327) @BenFradet
- Move blanket `Adaptor` impl to metrics base (#3346) @dbdr
- Make metric order consistent in summaries (#3353) @dbdr
- Fix cubecl `normal_respects_68_95_99_rule` (#3377) @laggui
- Bump deps (#3367) @ArthurBrussee
- Fix fusion rollback, disable autotune checks and other CI issues (#3362) @laggui