[Feature] reduce fuse on read #2870


Merged
merged 41 commits into main from feat/fuse-on-read on Mar 6, 2025
Conversation

@nathanielsimard (Member) commented Mar 4, 2025

| Benchmark             | Feature     | Backend                      | Device        | Median         |
|-----------------------|-------------|------------------------------|---------------|----------------|
| reduce-argmin-0       | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 1.276ms        |
| reduce-argmin-0-fused | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 2.124ms        |
| reduce-sum-0          | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 855.000µs      |
| reduce-sum-0-fused    | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 2.248ms        |
| reduce-argmin-1       | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 652.000µs      |
| reduce-argmin-1-fused | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 1.378ms        |
| reduce-sum-1          | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 679.000µs      |
| reduce-sum-1-fused    | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 1.064ms        |
| reduce-argmin-2       | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 899.000µs      |
| reduce-argmin-2-fused | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 1.106ms        |
| reduce-sum-2          | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 700.000µs      |
| reduce-sum-2-fused    | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 1.101ms        |
| reduce-sum-full       | cuda-fusion | `fusion<cubecl<cuda>>`       | Cuda(0)       | 721.000µs      |
| reduce-argmin-0       | cuda        | `cubecl<cuda>`               | Cuda(0)       | 1.174ms        |
| reduce-argmin-0-fused | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.949ms        |
| reduce-sum-0          | cuda        | `cubecl<cuda>`               | Cuda(0)       | 849.000µs      |
| reduce-sum-0-fused    | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.502ms        |
| reduce-argmin-1       | cuda        | `cubecl<cuda>`               | Cuda(0)       | 647.000µs      |
| reduce-argmin-1-fused | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.234ms        |
| reduce-sum-1          | cuda        | `cubecl<cuda>`               | Cuda(0)       | 651.000µs      |
| reduce-sum-1-fused    | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.334ms        |
| reduce-argmin-2       | cuda        | `cubecl<cuda>`               | Cuda(0)       | 930.000µs      |
| reduce-argmin-2-fused | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.562ms        |
| reduce-sum-2          | cuda        | `cubecl<cuda>`               | Cuda(0)       | 711.000µs      |
| reduce-sum-2-fused    | cuda        | `cubecl<cuda>`               | Cuda(0)       | 6.335ms        |
| reduce-sum-full       | cuda        | `cubecl<cuda>`               | Cuda(0)       | 676.000µs      |
| reduce-argmin-0       | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 1.905ms        |
| reduce-argmin-0-fused | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 3.988ms        |
| reduce-sum-0          | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 3.092ms        |
| reduce-sum-0-fused    | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 6.498ms        |
| reduce-argmin-1       | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 2.315ms        |
| reduce-argmin-1-fused | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 4.019ms        |
| reduce-sum-1          | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 3.297ms        |
| reduce-sum-1-fused    | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 5.565ms        |
| reduce-argmin-2       | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 4.771ms        |
| reduce-argmin-2-fused | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 3.231ms        |
| reduce-sum-2          | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 6.022ms        |
| reduce-sum-2-fused    | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 3.402ms        |
| reduce-sum-full       | wgpu-fusion | `fusion<cubecl<wgpu<wgsl>>>` | DefaultDevice | 265.540ms      |
| reduce-argmin-0       | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 1.898ms        |
| reduce-argmin-0-fused | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 607.500ms      |
| reduce-sum-0          | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 2.350ms        |
| reduce-sum-0-fused    | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 15.992ms       |
| reduce-argmin-1       | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 2.109ms        |
| reduce-argmin-1-fused | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 664.359ms      |
| reduce-sum-1          | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 1.609ms        |
| reduce-sum-1-fused    | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 15.521ms       |
| reduce-argmin-2       | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 4.107ms        |
| reduce-argmin-2-fused | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 17.293ms       |
| reduce-sum-2          | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 3.500ms        |
| reduce-sum-2-fused    | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 14.585ms       |
| reduce-sum-full       | wgpu        | `cubecl<wgpu<wgsl>>`         | DefaultDevice | 262.099ms      |
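To make the benchmark names above concrete, here is a minimal CPU sketch of the "fuse on read" idea being measured: the `-fused` variants apply an elementwise chain (mirroring the `+ 5`, `log`, `tanh`, `* 3` chain used in the benchmark code below) before the reduction, and fusion applies that chain while each element is read inside the reduction instead of materializing an intermediate buffer. The free functions here are hypothetical illustrations, not the actual burn/CubeCL kernels.

```rust
// The elementwise chain from the benchmark: (x + 5).log().tanh() * 3.
fn elementwise(x: f32) -> f32 {
    (x + 5.0).ln().tanh() * 3.0
}

// Unfused: materialize the elementwise result in a temporary buffer,
// then reduce it in a second pass (two reads/writes over memory).
fn sum_unfused(input: &[f32]) -> f32 {
    let tmp: Vec<f32> = input.iter().copied().map(elementwise).collect();
    tmp.iter().sum()
}

// Fused on read: apply the chain inside the reduction loop itself,
// so the intermediate tensor is never written out.
fn sum_fused(input: &[f32]) -> f32 {
    input.iter().copied().map(elementwise).sum()
}

fn main() {
    let input = vec![1.0_f32; 1024];
    let (unfused, fused) = (sum_unfused(&input), sum_fused(&input));
    // Both paths compute the same value; the fused path skips the temporary.
    assert!((unfused - fused).abs() < 1e-2);
    println!("unfused = {unfused}, fused = {fused}");
}
```

On a GPU the saved buffer traffic is what fusion is after; the table shows the fused kernels closing much of the gap to two separate launches under the new fusion backend, versus the unfused reference backends.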

@nathanielsimard nathanielsimard requested a review from louisfd March 4, 2025 20:22
@laggui (Member) left a comment:
Some minor comments, otherwise LGTM!

Edit: looks like `group_norm_forward_affine_false` is failing with precision errors on wgpu (`cargo test --color always --features test-wgpu -p burn-core`).

Comment on lines +42 to +46
```rust
let tensor = self.tensor.clone() + 5;
let tensor = tensor.log();
let tensor = tensor.tanh();
let tensor = tensor * 3;
tensor.sum_dim(axis);
```
Debug changes? Curious what issue you were trying to track/fix by introducing the additional ops before the reduce 😄

Comment on lines 81 to 84
```rust
// benchmarks.push(ReduceBenchmark::<B>::new(
//     Instruction::ArgMin(axis),
//     device.clone(),
// ));
```
Uncomment?


```rust
    benchmarks.push(ReduceBenchmark::<B>::new(
        Instruction::SumDim(axis),
        device.clone(),
    ));
}

benchmarks.push(ReduceBenchmark::<B>::new(Instruction::Sum, device.clone()));
// benchmarks.push(ReduceBenchmark::<B>::new(Instruction::Sum, device.clone()));
```
Uncomment?

Comment on lines 377 to 381
```rust
#[cube]
pub fn global_len(global: &GlobalArgs, #[comptime] pos: u32) -> u32 {
    let tensor = global.tensors.index(pos);
    tensor.tensor.len()
}
```
Isn't this a duplicate of `global_length` defined just a couple of lines above? Minus the cast (which seems redundant, actually).

```rust
#[cube]
pub fn global_length(global: &GlobalArgs, #[comptime] pos: u32) -> u32 {
    let tensor = global.tensors.index(pos);
    u32::cast_from(tensor.tensor.len())
}
```

```
@@ -84,6 +85,18 @@ mod tests {
    output.into_data().assert_eq(&expected, false);
}

#[test]
fn test_sum_dim_reshape_maybe_fused() {
```
Why do we have sum_dim and mean_dim tests in the maxmin module? 😅 We should probably move test_sum_dim_2d(), test_mean_dim_2d() and this new test to the correct module.


codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 71.40% with 543 lines in your changes missing coverage. Please review.

Project coverage is 82.29%. Comparing base (f98cc0b) to head (e0da405).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| `crates/burn-cubecl-fusion/src/reduce/args.rs` | 17.07% | 102 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/shared/kernel.rs` | 47.40% | 81 Missing ⚠️ |
| `...es/burn-cubecl-fusion/src/shared/trace/executor.rs` | 74.59% | 79 Missing ⚠️ |
| `...ates/burn-cubecl-fusion/src/reduce/optimization.rs` | 75.00% | 74 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/shared/io.rs` | 32.03% | 70 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/matmul/args.rs` | 0.00% | 56 Missing ⚠️ |
| `crates/burn-cubecl/src/fusion.rs` | 63.72% | 37 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/shared/ir.rs` | 77.55% | 11 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/shared/trace/base.rs` | 93.97% | 5 Missing ⚠️ |
| `crates/burn-cubecl-fusion/src/reduce/builder.rs` | 97.64% | 4 Missing ⚠️ |
| ... and 10 more | | |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2870      +/-   ##
==========================================
- Coverage   82.31%   82.29%   -0.03%
==========================================
  Files         863      867       +4
  Lines      116956   118080    +1124
==========================================
+ Hits        96268    97169     +901
- Misses      20688    20911     +223
```

@nathanielsimard nathanielsimard merged commit f106148 into main Mar 6, 2025
11 checks passed
@nathanielsimard nathanielsimard deleted the feat/fuse-on-read branch March 6, 2025 21:17