๐Ÿ•Š๏ธ DOVE: (Dataset Of Variation Evaluation)
A Large-Scale Multi-Dimensional Predictions Dataset
Towards Meaningful LLM Evaluation

Eliya Habba1, Ofir Arviv2, Itay Itzhak1,4, Yotam Perlitz2,
Elron Bandel2, Leshem Choshen2,3, Michal Shmueli-Scheuer2, Gabriel Stanovsky1,5
1The Hebrew University of Jerusalem  ·  2IBM Research AI  ·  3MIT  ·  4Technion - Israel Institute of Technology  ·  5Allen Institute for AI

Building DOVE: To holistically explore LLM sensitivity, we sample prompts as a walk in the space of various prompt dimensions (rows, above).

Versions 📦

Full Version (2TB)

  • Complete token-level probabilities
  • Detailed few-shot examples
  • Comprehensive model behavior analysis

Lite Version (100GB)

  • Core prompt-dimension variations
  • Model responses and evaluation scores
  • Perfect for quick experimentation

DOVE Building Process

๐Ÿค Join Our Community-wide Effort! ๐Ÿค

Help improve LLM evaluation

Why Contribute?

  • Improve how we evaluate LLMs
  • Advance research on LLM sensitivity
  • Become a co-author on future versions of the paper and dataset

What to Contribute?

  • Share your model predictions
  • Convert public datasets to DOVE format
  • Run new models/datasets (code available for loading datasets with prompt variations)
  • Request evaluations you're interested in
  • Contribute any model, language, or domain

How to Contribute?

  • Talk with us!
    • Tell us about data you'd like to contribute
    • Request evaluations you'd like to see added to DOVE
  • Convert your data to the DOVE schema and validate it with our validation code (see the sketch after this list)
  • Share it via email or a direct pull request to HuggingFace
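
As a rough illustration of the conversion step, the sketch below writes a handful of predictions to a parquet file under the documented model/language/shots layout. The column names ("sample_index", "prompt", "model_response", "score") and the example rows are illustrative assumptions, not the official DOVE schema; the validation code remains the authoritative reference for the required fields.

import os
import pandas as pd  # parquet output requires pyarrow or fastparquet

# Hypothetical rows -- the actual DOVE schema may use different field names.
rows = [
    {
        "sample_index": 0,                   # index of the benchmark instance
        "prompt": "Question: ...\nAnswer:",  # the fully rendered prompt variation
        "model_response": "B",               # raw model output
        "score": 1.0,                        # evaluation score for this response
    },
]

# Save under the documented layout: <model_name>/<language>/shots_<N>/<benchmark>.parquet
out_path = "Llama-3.2-1B-Instruct/en/shots_0/my_benchmark.parquet"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
pd.DataFrame(rows).to_parquet(out_path, index=False)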

Abstract

Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices.

In this work, we present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations.

DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation.

Browse the data: https://huggingface.co/datasets/nlphuji/DOVE

Note: Load individual benchmark files (just a few MB each) instead of the full 100GB/2TB dataset!

Using DOVE

DOVE is designed to be flexible - you can load just a small part of the data:

๐Ÿ“ Dataset Structure

Repository Organization
nlphuji/
├── DOVE/
│   ├── model_name/ # e.g., "Llama-3.2-1B-Instruct"
│   │   ├── language/ # e.g., "en", "fr"
│   │   │   └── shots_N/ # N = 0 for zero-shot, N > 0 for few-shot
│   │   │       ├── mmlu.abstract_algebra.parquet
│   │   │       ├── mmlu.world_religions.parquet
│   │   │       ├── ai2_arc.arc_challenge.parquet
│   │   │       ├── hellaswag.parquet
│   │   │       └── other_benchmark_files.parquet
│   └── other_models/
└── DOVE_Lite/
    └── [same structure, with reduced metadata per instance]
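
To check which benchmark files exist for a given model, language, and shots setting before downloading anything, you can list the repository contents with huggingface_hub. This is a minimal sketch: the prefix is an example path that assumes the layout above and a model that is actually present in the repository.

from huggingface_hub import list_repo_files

# List every file in the dataset repository, then keep the parquet files
# under one model/language/shots prefix.
files = list_repo_files("nlphuji/DOVE_Lite", repo_type="dataset")
prefix = "Meta-Llama-3-8B-Instruct/en/shots_0/"
benchmarks = [f for f in files if f.startswith(prefix) and f.endswith(".parquet")]
print(f"{len(benchmarks)} benchmark files under {prefix}")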

Usage Example 🚀

Load specific benchmarks from the dataset using HuggingFace's Datasets library:

from datasets import load_dataset

# Load a single model/language/shots benchmark file, following the repository layout above
def load_benchmark(repo_id, model_name, language="en", shots=0, benchmark_file="mmlu.global_facts.parquet"):
    file_path = f"{model_name}/{language}/shots_{shots}/{benchmark_file}"
    return load_dataset(repo_id, data_files=file_path, split="train")

# Example 1: Loading from the DOVE_Lite repository (zero-shot)
llama_en_arc_challenge = load_benchmark("nlphuji/DOVE_Lite", "Meta-Llama-3-8B-Instruct", "en", 0, "ai2_arc.arc_challenge.parquet")

# Example 2: Loading from the full DOVE repository (5-shot)
mistral_en_formal_logic = load_benchmark("nlphuji/DOVE", "Mistral-7B-Instruct-v0.3", "en", 5, "mmlu.formal_logic.parquet")

# Print dataset information
print("Datasets loaded successfully:")
print(f"- Llama (en) arc_challenge: {len(llama_en_arc_challenge)} examples")
print(f"- Mistral (en) formal_logic: {len(mistral_en_formal_logic)} examples")

Citation

@misc{habba2025dovelargescalemultidimensionalpredictions,
      title={DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation}, 
      author={Eliya Habba and Ofir Arviv and Itay Itzhak and Yotam Perlitz and Elron Bandel and Leshem Choshen and Michal Shmueli-Scheuer and Gabriel Stanovsky},
      year={2025},
      eprint={2503.01622},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01622}, 
}

License

This dataset is licensed under the Community Data License Agreement - Permissive - Version 2.0 (CDLA-Permissive-2.0). For full license terms, see: https://cdla.dev/permissive-2.0/