Table to count number of tasks per language #686

isaac-chung · 2024-05-13T21:32:57Z

isaac-chung
May 13, 2024
Maintainer

With the growing number of datasets in the multilingual part of MTEB, maybe we need a table that tallies how many datasets there are per task per language. I would imagine something like this is also needed for the submission (e.g. some stats grouped by language family). Where would this live? Maybe as a script under docs/mmteb? Or in a separate repo like https://github.com/embeddings-benchmark/mtebscripts, or added to that repo itself?

Tagging @imenelydiaker @KennethEnevoldsen @Muennighoff and anyone who might be interested.

For example:

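| Language | BitextMining | Classification | Clustering | InstructionRetrieval | PairClassification | Reranking | Retrieval | STS | Summarization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ace | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| acm | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |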
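
A minimal sketch of how such a tally could be produced, assuming each task exposes a task type and a list of language codes. The `TASKS` records and the `tally` and `to_markdown` helpers are hypothetical names used for illustration only; nothing here calls the mteb API.

```python
from collections import defaultdict

TASK_TYPES = [
    "BitextMining", "Classification", "Clustering", "InstructionRetrieval",
    "PairClassification", "Reranking", "Retrieval", "STS", "Summarization",
]

# Hypothetical task records: (task type, language codes). Illustrative data only.
TASKS = [
    ("Classification", ["ace", "acm"]),
    ("Retrieval", ["ace"]),
]


def tally(tasks):
    """Count tasks per language and task type."""
    counts = defaultdict(lambda: dict.fromkeys(TASK_TYPES, 0))
    for task_type, languages in tasks:
        for lang in set(languages):  # count each task at most once per language
            counts[lang][task_type] += 1
    return dict(counts)


def to_markdown(counts):
    """Render the counts as a pipe table, one row per language."""
    header = "| Language | " + " | ".join(TASK_TYPES) + " |"
    divider = "|" + " --- |" * (len(TASK_TYPES) + 1)
    rows = [
        "| " + lang + " | " + " | ".join(str(counts[lang][t]) for t in TASK_TYPES) + " |"
        for lang in sorted(counts)
    ]
    return "\n".join([header, divider, *rows])


print(to_markdown(tally(TASKS)))
```

Pre-filling every task type with zero keeps the columns aligned even for languages that only appear under one task type.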

KennethEnevoldsen · 2024-05-14T07:19:36Z

KennethEnevoldsen
May 14, 2024
Maintainer

We can add it at the top of the existing task overview. You can see how the previous script inserts the table at a specific spot.
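
One way to insert a generated table at a fixed spot is to replace whatever sits between two marker comments. This is only a sketch of that general technique, not the existing script: the marker strings and the `insert_between_markers` helper are assumptions made for illustration.

```python
# Hypothetical marker comments; the real script may locate the spot differently.
START = "<!-- LANGUAGE TABLE START -->"
END = "<!-- LANGUAGE TABLE END -->"


def insert_between_markers(text, table, start=START, end=END):
    """Replace whatever sits between the two markers with `table`."""
    before, found_start, rest = text.partition(start)
    _, found_end, after = rest.partition(end)
    if not found_start or not found_end:
        raise ValueError("markers not found in document")
    return f"{before}{start}\n{table}\n{end}{after}"


doc = f"# Tasks\n{START}\nold table\n{END}\nRest of the file\n"
print(insert_between_markers(doc, "| Language | Classification |"))
```

Because the markers are kept in the output, rerunning the script replaces the previous table instead of appending a second copy.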

Unsure whether we should do two tables, one for individual languages and one for language groups, or a single nested table:

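| Language | BitextMining | Classification | Clustering | InstructionRetrieval | PairClassification | Reranking | Retrieval | STS | Summarization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Indo-European | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| dan | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| nob | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |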
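
The nested layout above (a group total followed by its member languages) could be built as in the sketch below. The `FAMILY` mapping and the `nested_rows` helper are hypothetical; where the language-family metadata would actually come from is not settled in this thread.

```python
from collections import Counter

# Hypothetical language -> family mapping, for illustration only.
FAMILY = {"dan": "Indo-European", "nob": "Indo-European", "ace": "Austronesian"}


def nested_rows(per_language):
    """Return (label, counts) rows: each family total, then its languages."""
    by_family = {}
    for lang, counts in per_language.items():
        by_family.setdefault(FAMILY.get(lang, "Unknown"), {})[lang] = counts
    rows = []
    for family in sorted(by_family):
        members = by_family[family]
        total = sum(members.values(), Counter())  # add up the per-language counters
        rows.append((family, total))
        rows.extend((lang, members[lang]) for lang in sorted(members))
    return rows


per_language = {
    "dan": Counter({"Classification": 1}),
    "nob": Counter({"Classification": 1}),
}
for label, counts in nested_rows(per_language):
    print(label, dict(counts))
```

The same rows can feed either one nested table or two separate tables, since family rows and language rows are easy to split apart afterwards.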

2 replies

isaac-chung May 14, 2024
Maintainer Author

Sure. Let's append it at the end of the tasks.md file. As for grouping, we could group by genus (e.g. Indic, Romance) or by family (e.g. Indo-European, Niger-Congo).
I don't think cell merging is supported in GitHub-flavoured Markdown. Maybe we keep separate tables for groups and for all languages?

KennethEnevoldsen May 14, 2024
Maintainer

Feel free to do it for whatever is the easiest to get the metadata for.

KennethEnevoldsen · 2024-05-14T07:43:30Z

KennethEnevoldsen
May 14, 2024
Maintainer

For the paper, we might also create an overview similar to:

Also seen e.g. here

We probably don't want the fields to be datasets but rather domains, e.g. something like:

or based on subtasks:

5 replies

imenelydiaker May 14, 2024
Maintainer

I'd go for the last sketch: it's easier to read and less dense.

KennethEnevoldsen May 14, 2024
Maintainer

I have added a figure to Overleaf.

Muennighoff May 14, 2024
Maintainer

Looks great! I think we need to be a bit more clear about what we mean by tasks. What was datasets in the original MTEB figure seems to have become tasks. If we all prefer to refer to these as tasks, I think that's fine but then maybe what was tasks in the original MTEB figure also needs a new name, say abstract tasks or task types. What do you think?

It's also a bit confusing in some parts of the codebase, e.g. in https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md we alternate between calling them datasets & tasks.

Also curious what @orionw thinks is the best way :)

orionw May 14, 2024
Maintainer

The word task is definitely overloaded :)

I think we should use dataset to refer to an individual dataset under a specific task (or task type if we prefer, I like it - helps distinguish it from being conflated with dataset).

The FLAN paper called these task categories, maybe these definitions can help? I think these definitions differ according to each paper so I'm not attached to their definitions, but it would benefit us to be consistent on these definitions.

imenelydiaker May 14, 2024
Maintainer

@orionw The FLAN notation seems great, it looks exactly like what we're trying to do. Let's go for this:

Dataset -> e.g., MLSUM, AmazonReviews
Task Category -> Classification, Clustering, Retrieval, etc.
Task -> Dataset x Task Category : e.g., MLSUMClustering, AmazonReviewsClassification
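
To make the proposed terminology concrete, here is a small sketch in which a task is the combination of a dataset and a task category. The `Task` class is purely illustrative and is not the class used in the mteb codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical illustration: a task is a dataset paired with a task category."""

    dataset: str   # e.g. "MLSUM"
    category: str  # e.g. "Clustering"

    @property
    def name(self) -> str:
        # The task name concatenates the dataset and the task category.
        return f"{self.dataset}{self.category}"


print(Task("MLSUM", "Clustering").name)
print(Task("AmazonReviews", "Classification").name)
```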

imenelydiaker · 2024-05-14T08:12:42Z

imenelydiaker
May 14, 2024
Maintainer

Should we make mmtebscripts public for storing the scripts that we'll use to run experiments?

3 replies

KennethEnevoldsen May 14, 2024
Maintainer

Why not just keep it in this repo?

imenelydiaker May 14, 2024
Maintainer

I just thought it would be better for reproducing the paper's results, but no strong preferences

KennethEnevoldsen May 14, 2024
Maintainer

I think I would personally prefer having it all in one repo

dokato · 2024-05-14T10:18:56Z

dokato
May 14, 2024
Maintainer

I think it's a good idea. To facilitate this, @isaac-chung, I just want to bring a (fairly) new method to your attention that should make it easy to aggregate the data this way, e.g.:

from mteb import get_tasks

# Print the per-language task counts for each task type.
task_types = ["BitextMining","Classification","Clustering","InstructionRetrieval","PairClassification","Reranking","Retrieval","STS","Summarization"]
for tt in task_types:
    print(get_tasks(task_types=[tt]).count_languages())
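
The per-task-type counts could then be pivoted into one row per language. The sketch below assumes `count_languages()` returns a mapping from language code to number of tasks, and uses hand-written `Counter` objects as a stand-in so that it runs without mteb installed; the `pivot` helper is hypothetical.

```python
from collections import Counter


def pivot(per_type):
    """Turn {task_type: {language: count}} into {language: {task_type: count}}."""
    languages = sorted({lang for counts in per_type.values() for lang in counts})
    return {
        lang: {tt: counts.get(lang, 0) for tt, counts in per_type.items()}
        for lang in languages
    }


# Hypothetical stand-in for the output of count_languages(), keyed by task type.
per_type = {
    "Classification": Counter({"dan": 1, "nob": 1}),
    "Retrieval": Counter({"dan": 2}),
}
table = pivot(per_type)
for lang, row in table.items():
    print(lang, row)
```

Languages missing from a task type get an explicit zero, which matches the zero-filled cells in the example table at the top of the thread.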

1 reply

isaac-chung May 14, 2024
Maintainer Author

Thanks! I'll take a look.
