Add SelectionDataset, refactor ShuffledDataset, and add transform tests. #3406
Codecov Report
All modified and coverable lines are covered by tests ✅
❌ Your project check has failed because the head coverage (63.55%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3406      +/-   ##
==========================================
+ Coverage   63.47%   63.55%   +0.07%
==========================================
  Files         981      982       +1
  Lines      109753   109942     +189
==========================================
+ Hits        69668    69873     +205
+ Misses      40085    40069      -16
The new SelectionDataset looks good, only a minor comment.
    input: PhantomData<I>,
}

/// Generates a shuffled vector of indices up to a size.
The ShuffledDataset could have a SelectionDataset internally to avoid implementing the same logic twice while remaining compatible with the older API.
Done!
if let Some(idx) = indices.iter().find(|&i| *i >= size) {
    panic!("Index out of bounds for wrapped dataset size: {idx} >= {size}");
}
I think we should not perform that check. Some datasets are huge, and doing a gazillion comparisons might not be OK. You could expose a function to perform the check or provide a checking strategy (out-of-bounds values are clipped, etc.).
made checked and unchecked versions.
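For readers following the thread, here is a minimal sketch of what a checked/unchecked constructor pair could look like. Only the name `from_indices_checked` appears in this PR's commit message; `from_indices_unchecked`, the simplified `Dataset` trait, `VecDataset`, and `Selection` are assumptions made for illustration and are not the library's actual API.

```rust
/// Simplified stand-in for a dataset trait (hypothetical, not the library's trait).
pub trait Dataset {
    type Item;
    fn get(&self, index: usize) -> Option<Self::Item>;
    fn len(&self) -> usize;
}

/// Toy in-memory dataset used only for this illustration.
pub struct VecDataset(pub Vec<i32>);

impl Dataset for VecDataset {
    type Item = i32;

    fn get(&self, index: usize) -> Option<i32> {
        self.0.get(index).copied()
    }

    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Hypothetical selection wrapper: item `i` maps to `dataset[indices[i]]`.
pub struct Selection<D> {
    dataset: D,
    indices: Vec<usize>,
}

impl<D: Dataset> Selection<D> {
    /// Validates every index up front (one O(n) pass) and panics on the
    /// first index that is out of bounds for the wrapped dataset.
    pub fn from_indices_checked(dataset: D, indices: Vec<usize>) -> Self {
        let size = dataset.len();
        if let Some(idx) = indices.iter().find(|&&i| i >= size) {
            panic!(
                "Index out of bounds for wrapped dataset size: {} >= {}",
                idx, size
            );
        }
        Self::from_indices_unchecked(dataset, indices)
    }

    /// Skips validation; an out-of-bounds index only surfaces later,
    /// as `None` from `get` in this sketch.
    pub fn from_indices_unchecked(dataset: D, indices: Vec<usize>) -> Self {
        Self { dataset, indices }
    }

    pub fn get(&self, index: usize) -> Option<D::Item> {
        let source = *self.indices.get(index)?;
        self.dataset.get(source)
    }

    pub fn len(&self) -> usize {
        self.indices.len()
    }
}
```

The checked version pays the full scan once at construction, which is the cost the reviewer was worried about for huge index lists; the unchecked version defers any failure to lookup time.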
/// Creates a new selection dataset that selects all indices from the dataset.
///
/// # Arguments
///
/// * `dataset` - The original dataset to select from.
///
/// # Returns
///
/// A new `SelectionDataset` that selects all indices from the dataset.
pub fn new_select_all(dataset: D) -> Self {
    let size = dataset.len();
    Self::new(dataset, iota(size))
}
It's kind of bad to use, since we create an indices vector for nothing if we don't perform any transformation afterward. It should be in the doc that some transformation should be done for it to be useful.
made this much more explicit.
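To illustrate the point above, a rough sketch of why an identity index vector only pays for itself once it is transformed: `iota` builds `[0, 1, ..., size - 1]`, and a shuffle then permutes it. The `iota` body, the `XorShift64` generator, and the `shuffle` helper here are assumptions written for this example; the PR's real implementation and RNG are not shown in this thread.

```rust
/// Identity index list `[0, 1, ..., size - 1]` (assumed behavior of `iota`).
pub fn iota(size: usize) -> Vec<usize> {
    (0..size).collect()
}

/// Tiny xorshift64 generator; a stand-in for a real RNG, for illustration only.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must never be zero, so substitute a fixed constant.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// In-place Fisher-Yates shuffle. The modulo introduces a slight bias,
/// which is acceptable for a sketch but not for production sampling.
pub fn shuffle(indices: &mut [usize], rng: &mut XorShift64) {
    for i in (1..indices.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        indices.swap(i, j);
    }
}
```

Without a step like `shuffle`, the vector from `iota` just reproduces the wrapped dataset's own order at the cost of one `usize` per item, which is the waste the reviewer pointed out.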
Based on the comments, I made the constructor names explicit. The full-selection/iota constructor was there for further manipulations, but we hadn't really had much more than shuffle in place. I'd always planned on implementing something like PartialDataset::Split, so I did that as well (and stole the Arc trick).
SelectionDataset is morally a more abstract replacement for ShuffledDataset; all uses could be migrated.
…c<D>>`. Refactor methods for improved reuse and readability. Add test for panic handling in `from_indices_checked`.
b2b8496 to 6ac3033
Pull Request Template
Checklist
cargo run-checks command has been executed.
Changes
When looking at partitioning and over-selection, I noticed that ShuffledDataset is overly constrained: nothing requires the index mapping to be a bijection. So, based on Discord thread discussions, this PR adds SelectionDataset.
I also added tests to several transforms and refactored ShuffledDataset for testing and generality.
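As a small illustration of the "no bijection required" point, a selection is just a list of source indices, so it can drop items (partitioning) or repeat them (over-selection). The `select` helper below is hypothetical and works on plain slices rather than on the library's dataset types.

```rust
/// Materializes a selection over `data`: output item `k` is `data[indices[k]]`.
/// Indices may repeat or be omitted; panics if an index is out of bounds.
pub fn select<T: Clone>(data: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| data[i].clone()).collect()
}
```

With `data = ["a", "b", "c"]`, the indices `[2, 0]` give the partition `["c", "a"]`, while `[1, 1, 1, 0]` over-selects to `["b", "b", "b", "a"]`. A shuffle is the special case where the indices are a permutation of `0..data.len()`.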
Testing
Full and additional behavior tests.