Summary
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust performance across a growing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also made the matrix multiplication kernel generation engine more flexible, going beyond traditional GEMM (General Matrix Multiply) approaches.
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.
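As a shape-level illustration of these specialized cases, the sketch below simply calls `matmul` with degenerate dimensions; kernel selection happens inside the backend, so user code does not change. The backend and shapes here are illustrative assumptions, not part of the release:

```rust
use burn::backend::NdArray;
use burn::tensor::Tensor;

// Shape-level sketch only: the specialized matrix-vector, vector-matrix,
// inner-product and outer-product kernels are selected by the backend.
fn main() {
    type B = NdArray;
    let device = Default::default();

    let mat = Tensor::<B, 2>::ones([4, 4], &device); // [m, k]
    let col = Tensor::<B, 2>::ones([4, 1], &device); // column vector [k, 1]
    let row = Tensor::<B, 2>::ones([1, 4], &device); // row vector [1, k]

    let mat_vec = mat.clone().matmul(col.clone()); // matrix-vector -> [4, 1]
    let vec_mat = row.clone().matmul(mat);         // vector-matrix -> [1, 4]
    let inner = row.matmul(col.clone());           // inner product -> [1, 1]
    let outer = col.matmul(Tensor::<B, 2>::ones([1, 4], &device)); // outer product -> [4, 4]

    println!(
        "{:?} {:?} {:?} {:?}",
        mat_vec.dims(),
        vec_mat.dims(),
        inner.dims(),
        outer.dims()
    );
}
```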
Fusion Enhancements
- Improved reliability and performance of Burn Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
- Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
  - Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
  - Fixed bugs related to premature memory deallocation, enhancing memory management stability.
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (`cubecl.toml` or `CubeCL.toml`) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.
A typical `cubecl.toml` file might look like this:
```toml
[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
```
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
For more info, check out the CubeCL book.
As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to strides of `1`. This will affect output shapes if strides were not explicitly set. To preserve the previous behavior, set the stride(s) explicitly, as shown in the diffs below.
MaxPool2dConfig
```diff
  let pool = MaxPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();
```
MaxPool1dConfig
```diff
  let pool = MaxPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
```
AvgPool2dConfig
```diff
  let pool = AvgPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();
```
AvgPool1dConfig
```diff
  let pool = AvgPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
```
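For a concrete sense of the shape change, here is a minimal sketch, assuming a `[batch, channels, length]` input of length 8, `kernel_size = 2`, no padding, and an NdArray backend (all illustrative choices, not part of the release):

```rust
use burn::backend::NdArray;
use burn::nn::pool::MaxPool1dConfig;
use burn::tensor::Tensor;

fn main() {
    type B = NdArray;
    let device = Default::default();
    let input = Tensor::<B, 3>::zeros([1, 1, 8], &device);

    // New default: stride = kernel_size = 2 -> output length (8 - 2) / 2 + 1 = 4.
    let pool_new = MaxPool1dConfig::new(2).init();
    println!("{:?}", pool_new.forward(input.clone()).dims()); // [1, 1, 4]

    // Previous behavior, restored by setting the stride explicitly:
    // stride = 1 -> output length (8 - 2) / 1 + 1 = 7.
    let pool_old = MaxPool1dConfig::new(2).with_stride(1).init();
    println!("{:?}", pool_old.forward(input).dims()); // [1, 1, 7]
}
```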
Module & Tensor
- Add tensor `grid::meshgrid` (#3107 #3191) @crutcher
- Add scalar tensor operations (#3127) @ArthurBrussee
- Orthogonal initialization (#3109) @dymat
- Support importing safetensors format (#2721) @wandbrandon @antimora
- Add `burn::linalg` norms (#3131) @crutcher
- Extract Linear.forward to nn::functional::linear (#3147) @crutcher
- Base impl of matmul for Int tensor (#3201) @crutcher
- (perf) generate_mask functions optimizations (#3203) @tafia
- Add CosineEmbeddingLoss module and cosine_similarity function (#3207) @antimora
- Tensor::slice_fill() (#3221 #3223) @crutcher
- Base impl of `tensor.slice_dim(dim, range)` (#3235) @crutcher
- Support shifting pre-computed RoPE values (#3275) @laggui
- Improve RoPE partial shift case (#3290) @laggui
- Add `tensor.roll()` and improve `AsIndex` (renamed `IndexConversion`) (#3281) @crutcher
- [Breaking] Update pooling default strides to match kernel size (#3338) @lucianyao
- Add `is_finite` tensor element-wise op and fix `is_close`/`all_close` inf handling (#3341) @jonboh (see the sketch after this list)
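A minimal sketch of the new `is_finite` op next to `all_close` (the `is_finite` signature is an assumption based on the existing element-wise API; the backend choice is illustrative):

```rust
use burn::backend::NdArray;
use burn::tensor::Tensor;

fn main() {
    type B = NdArray;
    let device = Default::default();

    let x = Tensor::<B, 1>::from_floats([1.0, f32::INFINITY, 3.0], &device);
    let y = Tensor::<B, 1>::from_floats([1.0, f32::INFINITY, 3.0], &device);

    // Assumed shape of the new op: a boolean mask marking finite elements.
    let finite = x.clone().is_finite();
    println!("{finite}");

    // This release also fixes how is_close/all_close handle infinite values.
    println!("all_close: {}", x.all_close(y, None, None));
}
```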
Backends
- [Perf] Interpolate optimizations (#3077) @wingertge
- [Perf] Slice assign (#3069) @wingertge
- Add multi stage conv (#3105) @wingertge
- [Perf] Convolution migration to NHWC (#3090) @wingertge
- Merge different convolution dimensional kernels (#3115) @wingertge
- Support reduce mixed precision accumulation w/ fusion (#3132) @nathanielsimard
- Update remote backend (#3175) @Cielbird
- Feat/autotune optional (#3188) @nathanielsimard
- cubecl unit matmul (#3214) @louisfd
- Update CubeCL for client based profiling (#3222) @ArthurBrussee
- Update cubecl unit matmul double buffered (#3233) @louisfd
- Burn-remote to_device function (#3189) @Cielbird
- Add Drop operation for fusion (#3263) @nathanielsimard
- Lazy tensor downloading in burn-remote (#3276) @Cielbird
- Improve specialized matmul (#3304) @louisfd
- Add autotune priority (#3347 #3378) @nathanielsimard
- Fix local tuner deadlock (#3384) @nathanielsimard
- Fix fusion wasm unsafe input (#3385 #3386) @nathanielsimard
Bug Fixes
- Fix WASM deadlock by really properly not capturing locks (#3123) @ArthurBrussee
- Fix burn-cubecl with autotune disabled (#3141) @wingertge
- Fix fusion multiple reshapes (#3220) @nathanielsimard
- Fix/fusion multiple streams (#3297) @nathanielsimard
- Fix gather broadcasted indices in kernel impl and fusion (#3337) @laggui
- Fix rand interval (#3321) @laggui
- Restrict binary op lhs/rhs alias (#3349) @laggui
- Fix sum fallback when atomic add is not supported (#3369) @laggui
Documentation & Examples
- Update pytorch-model.md with a new troubleshooting help (#3081) @antimora
- Contributor example instructions (#3153) @AshAnand34
- Update README.md with DeepWiki badge (#3192) @antimora
- Add recursion_limit macro to getting started examples code (#3238) @Marc-AnthonyG
- KaTeX for Mathematical expressions in docstrings (#3278) @BhavyeMathur
- Add Metal backend support to custom-image-dataset (#3335 #3354) @TsaoLun
- Add link to license in README badge (#3356) @Olexandr88
Fixes
- Fix typo in Burn Book (#3113) @danny-burrows
- fix typos (#3186) @omahs
- Fix Typos in Documentation Comments (#3280) @leopardracer
- Fix typo in code documentation for BurnGraph codegen (#3286) @kilavvy
- Fix error messages from tensor checks for flatten (#3319) @NoVegetable
- Fix broken link to burn-tch (#3365) @dbdr
- Update documentation description for nonzero and nonzero_async (#3368) @catch-twenty-two
ONNX Support
- ONNX Import: switch to rank inferencing, rename shape to static_shape, decouple tensor shape info (#3037) @antimora
- Restrict ONNX opset to 16 and up (#3051) @antimora
- Allow Shape input type for Slice operation (#3092) @antimora
- Support onnx and, or & xor nodes (#3173) @tye-singwa
- Add support ONNX instance norm (#3177) @tye-singwa
- Onnx ceil & round (#3225) @tye-singwa
- Add support onnx group norm (#3245) @tye-singwa
- Add onnx SpaceToDepth / DepthToSpace (#3277) @tye-singwa
- Fix onnx topological sort check (#3284) @tye-singwa
- Add onnx ArgMin node (#3285) @tye-singwa
- Add support onnx size (#3301) @tye-singwa
- Support flexible backend selection for import tests (#3372 #3380) @lucianyao
- Fix ONNX node name sanitization and allow ai.onnx.ml domain (#3371) @antimora
Enhancements
- Replace some powf->powi (#3152) @ArthurBrussee
- Improve fusion compilation speed (#3155) @nathanielsimard
- Perf/remove repeat dim (#3183) @nathanielsimard
- Perf: Fusion search for composed optimization (#3258) @nathanielsimard
- Improve matmul selector (#3307 #3343 #3350 #3376) @nathanielsimard
Refactoring
- Refactor CubeCL slices (#3104) @nathanielsimard
- CubeCL init refactor (#3128) @nathanielsimard
- Refactor narrow, chunk and split (#3137) @laggui
- Refactor quantization scheme (#3042) @maxtremblay
- Migrated prng (random) to CubeCL (#3165 #3170) @Cielbird
- Break down `test_onnx.rs` into test subdirectories (#3144) @antimora
- Refactor: Move op_configuration.rs from burn-import to onnx-ir (#3126) @antimora
- Fix relative cmp + debug tools (#3197) @nathanielsimard
- Refactor cubecl line size matmul (#3219) @louisfd
- Absolute tolerance is too tight for strict/balanced/permissive (#3242) @laggui
- Fix clippy rust 1.88 and cargo run checks usage (#3325 #3320) @laggui
- Remove hip os cfg flags (#3336) @laggui
- Update cubecl matmul refactor / docs (#3366) @louisfd
Miscellaneous
- Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
- Replace run-checks scripts with command alias (#3118) @laggui
- Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
- Add cubecl.toml config (#3150) @nathanielsimard
- Use `CUBECL_DEBUG_OPTION=profile` macos ci (#3164) @laggui
- Update cubecl: sync_cube (#3163) @louisfd
- Fix autotune recursive (#3161) @nathanielsimard
- Bump zip dependency (#3199) @swfsql
- Import `derive_new::new` for `safetensors` feat (#3205) @swfsql
- Add CUDA, Vulkan and WGPU on-demand self-hosted runners (#3190 #3215 #3334 #3348 #3351 #3352) @syl20bnr
- Fix: size_of import in quantization tests (#3195) @louisfd
- burn-dataset: Catch import.py unsuccessful exits (#3236) @drozdziak1
- Adding image dimensions to ImageDatasetItem (#3251) @catch-twenty-two
- burn-dataset: Make virtualenv optional when running importer.py (#3255) @drozdziak1
- Fix cubecl std usage (#3306) @laggui
- Fix tui legend label placement (#3327) @BenFradet
- Move blanket `Adaptor` impl to metrics base (#3346) @dbdr
- Make metric order consistent in summaries (#3353) @dbdr
- Fix cubecl `normal_respects_68_95_99_rule` (#3377) @laggui
- Bump deps (#3367) @ArthurBrussee
- Fix fusion rollback, disable autotune checks and other CI issues (#3362) @laggui