DOVE Research Findings - Explore LLM Evaluation Results

🔬 Explore DOVE Research Findings

Explore the analysis results from our research on the DOVE dataset. Each visualization highlights a different aspect of LLM prompt sensitivity and robustness across prompt dimensions and evaluation benchmarks.

Note: Results are organized by analysis type. Select models and datasets to explore specific findings from our large-scale evaluation study.

📊 Performance Variations

Performance variations across evaluation datasets. Each data point represents the accuracy of one model calculated across instances. The vertical scatter plots show the variance within each dataset for each model. Accuracy varies substantially, indicating that prompt sensitivity persists at large scale across different prompt dimensions.
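The spread shown in these scatter plots can be approximated with a few lines of Python. This is a minimal sketch, not the code behind the plots: it assumes each plotted point is the accuracy of one prompt variation for a given model, and the record layout `(model, prompt_id, is_correct)`, the function names, and the toy rows are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def accuracy_per_prompt(records):
    """Return {model: {prompt_id: accuracy}} from (model, prompt_id, is_correct) rows.

    The row layout is a hypothetical schema chosen for illustration.
    """
    outcomes = defaultdict(lambda: defaultdict(list))
    for model, prompt_id, is_correct in records:
        outcomes[model][prompt_id].append(1.0 if is_correct else 0.0)
    return {
        model: {pid: mean(vals) for pid, vals in by_prompt.items()}
        for model, by_prompt in outcomes.items()
    }


def spread(accuracies):
    """Summarize the vertical spread of points as (min, max, population std)."""
    values = list(accuracies)
    return min(values), max(values), pstdev(values)


# Toy rows for illustration only; these are not DOVE results.
records = [
    ("model_a", "p1", True), ("model_a", "p1", True),
    ("model_a", "p2", True), ("model_a", "p2", False),
    ("model_a", "p3", False), ("model_a", "p3", False),
]
per_prompt = accuracy_per_prompt(records)
low, high, std = spread(per_prompt["model_a"].values())
print(f"min={low:.2f} max={high:.2f} std={std:.3f}")
```

A wide gap between `low` and `high` for a single model corresponds to a tall column of points in the plot.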

🛡️ Per Question Robustness

The success rate distribution reveals inherent patterns of example difficulty. The plots show the distribution of success rates by evaluation dimension and model: the x-axis is the percentage of perturbations answered successfully per instance, and the y-axis is the number of instances in DOVE. The distribution exposes examples that are consistently easy or consistently difficult for LLMs across prompt dimensions.
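One way to build such a histogram is sketched below. It is an illustration under assumptions rather than the analysis code: the `(instance_id, is_correct)` row layout, the function names, the 10-point bin width, and the toy rows are hypothetical, and `bin_width` is assumed to divide 100 evenly.

```python
from collections import Counter, defaultdict


def success_rates(records):
    """Map instance_id -> percentage of perturbations answered correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for instance_id, is_correct in records:
        totals[instance_id] += 1
        hits[instance_id] += int(is_correct)
    return {iid: 100.0 * hits[iid] / totals[iid] for iid in totals}


def histogram(rates, bin_width=10):
    """Count instances per success-rate bin; a rate of 100% goes in the last bin."""
    n_bins = 100 // bin_width
    counts = Counter(
        min(int(rate // bin_width), n_bins - 1) for rate in rates.values()
    )
    return [counts.get(b, 0) for b in range(n_bins)]


# Toy rows for illustration only: four perturbations for each of three instances.
records = (
    [("q1", True)] * 4                          # always answered correctly
    + [("q2", False)] * 4                       # never answered correctly
    + [("q3", True)] * 2 + [("q3", False)] * 2  # sensitive to the prompt
)
rates = success_rates(records)
bins = histogram(rates)
print(rates)
print(bins)
```

Instances that land in the first or last bin are the consistently hard or consistently easy examples; those in the middle bins are the ones whose outcome depends on how the prompt is phrased.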

📈 Zero/Few Shot Analysis

Few-shot prompting reduces performance variance across evaluation dimensions. Comparing zero-shot and five-shot prompting on domains from DOVE shows a narrower spread of accuracy scores in the five-shot setting. Each point represents accuracy across instances; the five-shot demonstrations lead to more robust performance and reduced prompt sensitivity.

📊 Accuracy Marginalization

Accuracy marginalization across prompt dimensions. Varying any single dimension in DOVE produces prompt sensitivity, even when controlling for all other dimensions. This analysis shows how different prompt elements (delimiters, instruction wording, answer formatting, etc.) affect model performance across datasets.
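Marginalizing accuracy over one dimension can be sketched as follows. This is a minimal illustration, not the exact procedure used for the plots: the `(config, is_correct)` record layout, the dimension keys `"delimiter"` and `"instruction"`, and the toy rows are hypothetical. Averaging over the other dimensions only controls for them when every combination of values appears equally often, which the toy grid below assumes.

```python
from collections import defaultdict
from statistics import mean


def marginal_accuracy(records, dimension):
    """Average correctness per value of `dimension`, pooling over all other dimensions."""
    by_value = defaultdict(list)
    for config, is_correct in records:
        by_value[config[dimension]].append(1.0 if is_correct else 0.0)
    return {value: mean(outcomes) for value, outcomes in by_value.items()}


# Toy, fully crossed 2x2 grid of prompt configurations (illustrative values only).
records = [
    ({"delimiter": ":", "instruction": "a"}, True),
    ({"delimiter": ":", "instruction": "b"}, True),
    ({"delimiter": "-", "instruction": "a"}, True),
    ({"delimiter": "-", "instruction": "b"}, False),
]
by_delimiter = marginal_accuracy(records, "delimiter")
by_instruction = marginal_accuracy(records, "instruction")
print(by_delimiter)
print(by_instruction)
```

A gap between the marginal accuracies of two values of the same dimension is the kind of sensitivity the plot is meant to surface.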
