I have added stratified subsampling for multilabel data in this PR: #694. You can have a look at the code there. Something like this:

from numpy.typing import ArrayLike
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

X: ArrayLike = [...]
labels: list[list[str]] = [...]
# Binarize the label lists into an (n_samples, n_classes) indicator matrix
encoded_labels = MultiLabelBinarizer().fit_transform(labels)
# train_test_split returns both X splits first, then both y splits
X_train, X_test, y_train, y_test = train_test_split(X, labels, stratify=encoded_labels)
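
For anyone who wants to try the idea quickly, here is a minimal self-contained sketch on toy data. The texts, labels, and split sizes are made up for illustration and are not taken from the PR. As far as I know, when `stratify` is a 2D indicator matrix, scikit-learn stratifies on each distinct label combination (each row is treated as one class), so every combination needs at least two samples or the call raises a `ValueError`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): 12 documents with three label combinations,
# each combination occurring four times.
texts = [f"doc {i}" for i in range(12)]
labels = [["a"], ["a", "b"], ["b"]] * 4

# Indicator matrix with one column per label, one row per document.
encoded = MultiLabelBinarizer().fit_transform(labels)

# Stratify on the rows of the indicator matrix. For plain subsampling,
# pass an integer train_size and keep only the "train" part.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, stratify=encoded, random_state=0
)

# Each label combination should be split evenly between the two halves.
test_combos = Counter(tuple(label_set) for label_set in test_labels)
print(test_combos)
```

On real datasets with many rare label combinations this can fail because of singleton combinations. In that case a label-wise method, such as the iterative stratification described on the scikit.ml page linked in the question, may be a better fit.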
-
Recently, a new multilabel classification task was added: #440
Looking at the datasets available on HF, they're typically quite large, e.g. >50k examples. What's the best way to do a train/test split on them, or to just subsample them?
Maybe something from http://scikit.ml/stratification.html?
cc @x-tabdeveloping as you added that in