Add SelectionDataset, refactor ShuffledDataset, and add transform tests. #3406

crutcher · 2025-07-21T18:50:19Z

SelectionDataset is morally a more abstract replacement for ShuffledDataset; all uses could be migrated.

Pull Request Template

Checklist

Confirmed that cargo run-checks command has been executed.
Made sure the book is up to date with changes in this PR.

Changes

While looking at partitioning and over-selection, I noticed that ShuffledDataset is overly constrained: nothing requires the index mapping to be a bijection. Based on the Discord thread discussions, this PR adds SelectionDataset.

I also added tests for several transforms and refactored ShuffledDataset for testability and generality.
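
For readers skimming the description, here is a minimal sketch of what a non-bijective selection means. It is not the code in this PR: the `Dataset` trait, `InMemory`, and `Selection` below are simplified, hypothetical stand-ins defined only for this example.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for an indexable dataset trait.
pub trait Dataset<I> {
    fn get(&self, index: usize) -> Option<I>;
    fn len(&self) -> usize;
}

/// Illustration-only in-memory dataset.
pub struct InMemory<I>(pub Vec<I>);

impl<I: Clone> Dataset<I> for InMemory<I> {
    fn get(&self, index: usize) -> Option<I> {
        self.0.get(index).cloned()
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Maps position `i` to `indices[i]` in the wrapped dataset.
/// Source indices may repeat or be left out, so the mapping
/// does not have to be a bijection.
pub struct Selection<D, I> {
    wrapped: D,
    indices: Vec<usize>,
    input: PhantomData<I>,
}

impl<D, I> Selection<D, I>
where
    D: Dataset<I>,
{
    pub fn new(wrapped: D, indices: Vec<usize>) -> Self {
        Self { wrapped, indices, input: PhantomData }
    }
}

impl<D, I> Dataset<I> for Selection<D, I>
where
    D: Dataset<I>,
{
    fn get(&self, index: usize) -> Option<I> {
        // Look up the source index first, then defer to the wrapped dataset.
        let source = *self.indices.get(index)?;
        self.wrapped.get(source)
    }
    fn len(&self) -> usize {
        self.indices.len()
    }
}
```

With indices `[2, 2, 0]` over a three-item dataset, item 2 is selected twice and item 1 not at all, which a shuffle (a permutation) could never express.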

Testing

Full and additional behavior tests.

codecov · 2025-07-21T19:28:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.55%. Comparing base (1e838db) to head (6ac3033).

❌ Your project check has failed because the head coverage (63.55%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3406      +/-   ##
==========================================
+ Coverage   63.47%   63.55%   +0.07%     
==========================================
  Files         981      982       +1     
  Lines      109753   109942     +189     
==========================================
+ Hits        69668    69873     +205     
+ Misses      40085    40069      -16


nathanielsimard

The new SelectionDataset looks good, only a minor comment.

nathanielsimard · 2025-07-21T21:01:33Z

crates/burn-dataset/src/transform/random.rs

    input: PhantomData<I>,
 }

+/// Generates a shuffled vector of indices up to a size.


The ShuffledDataset could hold a SelectionDataset internally, which would avoid implementing the same logic twice while staying compatible with the older API.
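
A rough sketch of that delegation, under stated assumptions: every type here (`Dataset`, `InMemory`, `Selection`, `Shuffled`) is a hypothetical stand-in rather than the crate's API, and the shuffle uses a small hand-rolled LCG in place of a real RNG so the example has no dependencies.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for an indexable dataset trait.
pub trait Dataset<I> {
    fn get(&self, index: usize) -> Option<I>;
    fn len(&self) -> usize;
}

/// Illustration-only in-memory dataset.
pub struct InMemory<I>(pub Vec<I>);

impl<I: Clone> Dataset<I> for InMemory<I> {
    fn get(&self, index: usize) -> Option<I> {
        self.0.get(index).cloned()
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Index-mapped view over a wrapped dataset.
pub struct Selection<D, I> {
    wrapped: D,
    indices: Vec<usize>,
    input: PhantomData<I>,
}

impl<D: Dataset<I>, I> Dataset<I> for Selection<D, I> {
    fn get(&self, index: usize) -> Option<I> {
        let source = *self.indices.get(index)?;
        self.wrapped.get(source)
    }
    fn len(&self) -> usize {
        self.indices.len()
    }
}

/// Fisher-Yates shuffle of `0..size`, driven by a tiny LCG.
/// The LCG is a dependency-free placeholder, not a quality RNG.
fn shuffled_indices(size: usize, seed: u64) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..size).collect();
    let mut state = seed;
    for i in (1..size).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((state >> 33) as usize) % (i + 1);
        indices.swap(i, j);
    }
    indices
}

/// Keeps the older "shuffled" entry point but stores only a selection.
pub struct Shuffled<D, I> {
    inner: Selection<D, I>,
}

impl<D: Dataset<I>, I> Shuffled<D, I> {
    pub fn with_seed(dataset: D, seed: u64) -> Self {
        let indices = shuffled_indices(dataset.len(), seed);
        Self { inner: Selection { wrapped: dataset, indices, input: PhantomData } }
    }
}

impl<D: Dataset<I>, I> Dataset<I> for Shuffled<D, I> {
    // All lookups are forwarded, so the mapping logic lives in one place.
    fn get(&self, index: usize) -> Option<I> {
        self.inner.get(index)
    }
    fn len(&self) -> usize {
        self.inner.len()
    }
}
```

The wrapper adds no lookup logic of its own; it only decides which indices to hand to the selection.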

nathanielsimard · 2025-07-22T21:01:26Z

crates/burn-dataset/src/transform/selection.rs

+        if let Some(idx) = indices.iter().find(|&i| *i >= size) {
+            panic!("Index out of bounds for wrapped dataset size: {idx} >= {size}");
+        }


I think we should not perform that check. Some datasets are huge, and doing a gazillion comparisons might not be OK. You could expose a function to perform the check or provide a checking strategy (out-of-bounds values are clipped, etc.).

Made checked and unchecked versions.
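
A sketch of how a checked/unchecked pair might be shaped. The name `from_indices_checked` appears in a commit message on this PR; `from_indices_unchecked`, the `Vec<T>`-backed `Selection`, and the behavior of returning `None` for a bad index are assumptions made for this example only.

```rust
/// Illustration-only selection over an in-memory vector.
pub struct Selection<T> {
    wrapped: Vec<T>,
    indices: Vec<usize>,
}

impl<T> Selection<T> {
    /// Validates every index up front: one comparison per index,
    /// panicking on the first one that is out of range.
    pub fn from_indices_checked(wrapped: Vec<T>, indices: Vec<usize>) -> Self {
        let size = wrapped.len();
        if let Some(idx) = indices.iter().find(|&&i| i >= size) {
            panic!("Index out of bounds for wrapped dataset size: {idx} >= {size}");
        }
        Self::from_indices_unchecked(wrapped, indices)
    }

    /// Skips validation entirely; in this sketch an out-of-range
    /// index only shows up later, as `None` from `get`.
    pub fn from_indices_unchecked(wrapped: Vec<T>, indices: Vec<usize>) -> Self {
        Self { wrapped, indices }
    }

    pub fn get(&self, index: usize) -> Option<&T> {
        let source = *self.indices.get(index)?;
        self.wrapped.get(source)
    }

    pub fn len(&self) -> usize {
        self.indices.len()
    }
}
```

Callers with very large index vectors can then opt out of the linear scan, while tests and small datasets keep the early failure.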

nathanielsimard · 2025-07-22T21:02:29Z

crates/burn-dataset/src/transform/selection.rs

+    /// Creates a new selection dataset that selects all indices from the dataset.
+    ///
+    /// # Arguments
+    ///
+    /// * `dataset` - The original dataset to select from.
+    ///
+    /// # Returns
+    ///
+    /// A new `SelectionDataset` that selects all indices from the dataset.
+    pub fn new_select_all(dataset: D) -> Self {
+        let size = dataset.len();
+        Self::new(dataset, iota(size))
+    }


It's kind of bad to use, since we create an indices vector for nothing if we don't perform any transformation afterward. It should be in the doc that some transformation should be done for it to be useful.

Made this much more explicit.

crutcher · 2025-07-23T19:22:50Z

Based on the comments, I made the constructor names explicit.

The full-selection/iota constructor was there for further manipulations, but we hadn't really had much more than shuffle in place. I'd always planned on implementing something like PartialDataset::Split, so I did that as well (and stole the Arc trick).
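
To illustrate the idea of splitting a full selection while sharing the wrapped data through an `Arc`, here is a small sketch. The names `select_all`, `split`, and `shared_count`, and the use of `Vec<T>` as the wrapped dataset, are hypothetical choices for this example and may not match what the PR implements.

```rust
use std::sync::Arc;

/// Illustration-only selection that shares its data via `Arc`.
pub struct Selection<T> {
    wrapped: Arc<Vec<T>>,
    indices: Vec<usize>,
}

impl<T> Selection<T> {
    /// Selects every index (0, 1, ..., len - 1). This allocates an index
    /// vector, so it only pays off when followed by a further step.
    pub fn select_all(wrapped: impl Into<Arc<Vec<T>>>) -> Self {
        let wrapped = wrapped.into();
        let indices = (0..wrapped.len()).collect();
        Self { wrapped, indices }
    }

    /// Splits into exactly `parts` selections of near-equal size.
    /// Each part clones the `Arc`, not the underlying data.
    pub fn split(&self, parts: usize) -> Vec<Self> {
        assert!(parts > 0, "parts must be positive");
        let len = self.indices.len();
        (0..parts)
            .map(|k| {
                let start = k * len / parts;
                let end = (k + 1) * len / parts;
                Self {
                    wrapped: Arc::clone(&self.wrapped),
                    indices: self.indices[start..end].to_vec(),
                }
            })
            .collect()
    }

    pub fn get(&self, index: usize) -> Option<&T> {
        let source = *self.indices.get(index)?;
        self.wrapped.get(source)
    }

    pub fn len(&self) -> usize {
        self.indices.len()
    }

    /// Number of selections currently sharing the wrapped data.
    pub fn shared_count(&self) -> usize {
        Arc::strong_count(&self.wrapped)
    }
}
```

Taking `impl Into<Arc<...>>` lets a caller pass either an owned value or an existing `Arc`, so several selections can be built over one dataset without copying it.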


nathanielsimard reviewed Jul 21, 2025

crutcher requested a review from nathanielsimard July 22, 2025 15:12

nathanielsimard reviewed Jul 22, 2025

crutcher requested a review from nathanielsimard July 23, 2025 19:32

crutcher added 12 commits July 24, 2025 12:39

Add SelectionDataset, refactor ShuffledDataset, and add transform tests.

481f04a

SelectionDataset is morally a more abstract replacement for ShuffledDataset; all uses could be migrated.

address review

5937d7b

fmt

91b84f2

import/bug fix

6df979f

Skip test on shuffled

b4d1928

reorder

74a603f

util methods, mutable shuffle

d202f9d

review

6c70c99

Update SelectionDataset to support generic dataset input via `Into<Arc<D>>`. Refactor methods for improved reuse and readability. Add test for panic handling in `from_indices_checked`.

a8c64f5

expose members

c021918

stricter types

b5828f2

Changed wrapped ds name to wrapped.

6ac3033

crutcher force-pushed the crutcher/dataset_transforms branch from b2b8496 to 6ac3033 Compare July 24, 2025 20:03
