Table to count number of tasks per language #686
Replies: 4 comments 11 replies
-
We can add it at the top of the existing task overview. You can see how the previous script inserts the table at a specific spot. I'm unsure whether we should do two tables, one for individual languages and one for language groups, or a nested table:
-
For the paper, we might also create an overview similar to: ![]() Also seen e.g. here. We probably don't want the fields to be datasets but rather domains, e.g. something like:
-
Should we make |
-
I think it's a good idea. To facilitate this, @isaac-chung, I just want to bring to your attention a (fairly) new method that should make it easy to aggregate the data this way, e.g.:

```python
from mteb import get_tasks

task_types = [
    "BitextMining", "Classification", "Clustering", "InstructionRetrieval",
    "PairClassification", "Reranking", "Retrieval", "STS", "Summarization",
]
for tt in task_types:
    # Print the per-language task counts for this task type
    print(get_tasks(task_types=[tt]).count_languages())
```
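As a rough sketch of how those per-task-type counts could be folded into a single markdown table: the helper below assumes each `count_languages()` call yields a mapping from language code to number of tasks (treated here as a `Counter`, which is an assumption rather than something confirmed in this thread). The function name `counts_to_markdown` and the sample numbers are hypothetical and only there to show the shape of the output.

```python
from collections import Counter


def counts_to_markdown(counts_by_type: dict[str, Counter]) -> str:
    """Build a language x task-type markdown table from per-type counts."""
    # Every language seen in any task type, sorted for stable output
    languages = sorted({lang for c in counts_by_type.values() for lang in c})
    task_types = list(counts_by_type)

    header = "| Language | " + " | ".join(task_types) + " | Total |"
    separator = "|" + "---|" * (len(task_types) + 2)
    rows = [header, separator]
    for lang in languages:
        # Missing languages count as 0 for that task type
        cells = [counts_by_type[tt].get(lang, 0) for tt in task_types]
        row = f"| {lang} | " + " | ".join(map(str, cells)) + f" | {sum(cells)} |"
        rows.append(row)
    return "\n".join(rows)


# Made-up counts for illustration only (not real MTEB numbers)
sample = {
    "Classification": Counter({"eng": 3, "dan": 2}),
    "Retrieval": Counter({"eng": 4, "fra": 1}),
}
table = counts_to_markdown(sample)
print(table)
```

In practice the `sample` dict would be filled inside the loop above, with one entry per task type. The same structure could be regrouped by language family if a language-to-family mapping is available.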
-
With the growing number of datasets in the multilingual part of MTEB, maybe a table is needed to tally how many datasets there are per task per language. I would imagine something like this is also needed for the submission (e.g. some stats grouped by language family). Where would this live? Maybe as a script under docs/mmteb? Or in a separate repo like https://github.com/embeddings-benchmark/mtebscripts, or added to that repo itself?
Tagging @imenelydiaker @KennethEnevoldsen @Muennighoff and anyone who might be interested.
For example: