Skip to content

Add SelectionDataset, refactor ShuffledDataset, and add transform tests. #3406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

crutcher
Copy link
Contributor

SelectionDataset is morally a more abstract replacement for ShuffledDataset; all uses could be migrated.

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

When looking at partitioning and over-selection; I noticed that ShuffledDataset is overly constrained, there's nothing requiring a bijection. So based upon discord thread discussions, this is SelectionDataset.

I also added tests to several transforms; and refactored ShuffledDataset for testing and generality.

Testing

Full and additional behavior tests.

Copy link

codecov bot commented Jul 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.55%. Comparing base (1e838db) to head (6ac3033).

❌ Your project check has failed because the head coverage (63.55%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3406      +/-   ##
==========================================
+ Coverage   63.47%   63.55%   +0.07%     
==========================================
  Files         981      982       +1     
  Lines      109753   109942     +189     
==========================================
+ Hits        69668    69873     +205     
+ Misses      40085    40069      -16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@nathanielsimard nathanielsimard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new SelectionDataset looks good, only a minor comment.

Comment on lines 10 to 13
input: PhantomData<I>,
}

/// Generates a shuffled vector of indices up to a size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ShuffleDataset could have a SelectionDataset internaly to avoid implementing the same logic twice while being compatible with the older API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@crutcher crutcher requested a review from nathanielsimard July 22, 2025 15:12
Comment on lines +64 to +83
if let Some(idx) = indices.iter().find(|&i| *i >= size) {
panic!("Index out of bounds for wrapped dataset size: {idx} >= {size}");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not perform that check. Some datasets are huge, and doing a gazillion comparisons might not be OK. You could expose a function to perform the check or provide a checking strategy (out-of-bounds values are clipped, etc.).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made checked and unchecked versions.

Comment on lines 75 to 129
/// Creates a new selection dataset that selects all indices from the dataset.
///
/// # Arguments
///
/// * `dataset` - The original dataset to select from.
///
/// # Returns
///
/// A new `SelectionDataset` that selects all indices from the dataset.
pub fn new_select_all(dataset: D) -> Self {
let size = dataset.len();
Self::new(dataset, iota(size))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of bad to use, since we create an indices vector for nothing if we don't perform any transformation afterward. It should be in the doc that some transformation should be done for it to be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made this much more explicit.

@crutcher
Copy link
Contributor Author

Based upon comments; I made the constructor names explicit.

The full-selection/iota constructor was there for further manipulations, but we hadn't really had much more than shuffle in place. I'd always planned on implementing something like PartialDataset::Split; so I did that as well (and stole the Arc trick).

@crutcher crutcher requested a review from nathanielsimard July 23, 2025 19:32
@crutcher crutcher force-pushed the crutcher/dataset_transforms branch from b2b8496 to 6ac3033 Compare July 24, 2025 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants