Should we remove prompts for some FAMTEB retrieval datasets? #3174
Replies: 2 comments 1 reply
-
I think yes, you can do this.
-
For model developers, we have so far allowed people to select the prompt that their model uses. This, however, can lead to prompt-tuning: by running the benchmark multiple times with different prompts, you introduce random variation and can obtain better test-set performance without actually improving the model, i.e. you overfit on the benchmark. I see a few different approaches to prompting, reflecting different uses:

**Naive use - a free-form prompt.** The user writes a prompt that they expect will solve the task, potentially copying it from previous tasks. For example: "Embed these sentences such that political ideologies cluster close together."

**Informed use - fitting the prompt to the task.** Imagine a company deploying its model on its documentation site, which contains >100k technical documents, and taking the time to test which prompt gives its users the best results. They build a train and a test set, write ~20 prompts, and use the one that fits best (see the sketch below). We already see this with specific models that ship recommended prompts for categories such as retrieval. This ranges from the relatively generic (retrieval) to the very use-case specific (essentially fitting the prompt on the train set).

**Prompt-hacking - fitting the prompt to the benchmark.** The same as above, but you fit the prompt to the test set.
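To make the "informed use" scenario concrete, here is a minimal sketch of prompt fitting against a held-out dev set, assuming a recent sentence-transformers where `encode` accepts a `prompt` argument. The model name, the candidate prompts, and the tiny dev set are placeholders, not anything from this discussion; a real setup would use a proper retrieval metric such as nDCG@10.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical dev set: a few queries, their candidate documents, and the
# index of the relevant document for each query.
dev_queries = ["how do I reset my password?", "configure the proxy server"]
dev_docs = [
    "Password reset: open Settings > Account and click 'Reset password'.",
    "Proxy setup: set HTTP_PROXY in the service environment file.",
]
gold = [0, 1]

# Candidate prompts, from generic to use-case specific (both illustrative).
candidate_prompts = [
    "query: ",
    "Represent this technical support question for retrieving documentation: ",
]

model = SentenceTransformer("intfloat/multilingual-e5-small")  # placeholder model

def hit_at_1(prompt: str) -> float:
    """Fraction of dev queries whose gold document is ranked first."""
    q = model.encode(dev_queries, prompt=prompt, normalize_embeddings=True)
    d = model.encode(dev_docs, prompt="passage: ", normalize_embeddings=True)
    top = (q @ d.T).argmax(axis=1)  # cosine similarity via normalized dot product
    return float(np.mean(top == np.asarray(gold)))

# Pick the prompt that scores best on the dev set, never the test set.
best_prompt = max(candidate_prompts, key=hit_at_1)
print(best_prompt)
```

The crucial distinction between the three uses above is only which data this loop is run against: nothing (naive), a train/dev split (informed), or the benchmark's test set (prompt-hacking).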
-
Hi,
We've noticed that a small number of our FAMTEB retrieval datasets perform better when a prompt isn't used. Would it be acceptable to include these results without a prompt and remove them from the model's prompt list?
@Samoed @KennethEnevoldsen
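For reference, a hedged sketch of what this could look like on the model-definition side, assuming the prompts live in a `model_prompts`-style dict mapping prompt types or task names to prompt strings; the FAMTEB task name below is purely illustrative, not an actual dataset id.

```python
# Illustrative only: "FaMTEBExampleRetrieval" is a placeholder task name, and
# the layout assumes a model_prompts-style mapping as used when wrapping a
# sentence-transformers model for mteb.
model_prompts = {
    "query": "query: ",      # default prompt for retrieval queries
    "passage": "passage: ",  # default prompt for documents
    # Entries for the affected FAMTEB retrieval tasks are simply omitted, so
    # those tasks run without a prompt and the no-prompt results are reported.
    # "FaMTEBExampleRetrieval-query": "query: ",  # intentionally removed
}
```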