Interactive analysis of LLM prompt sensitivity across multiple dimensions
Last updated: 11/06/2025
Explore the analysis results from our DOVE dataset research. Each visualization shows a different aspect of LLM prompt sensitivity and robustness across prompt dimensions and evaluation benchmarks.
Note: Results are organized by analysis type. Select models and datasets to explore specific findings from our large-scale evaluation study.
Performance variations across evaluation datasets. Each datapoint represents the accuracy of one model calculated across instances. Vertical scatter plots illustrate the variance within each dataset and each model. Model performance varies substantially, indicating that prompt sensitivity persists at large scale across prompt dimensions.
Select a dataset to view the performance variations analysis
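The spread described in this caption can be sketched as follows. This is a minimal illustration, not the DOVE pipeline: it assumes each datapoint is the accuracy of one model under one prompt variation, and the record layout `(prompt_id, is_correct)`, the function names, and the toy values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def accuracy_per_prompt(records):
    """Accuracy for each prompt variation.

    records: iterable of (prompt_id, is_correct) pairs for one model on one
    dataset (hypothetical layout, not the DOVE schema).
    """
    by_prompt = defaultdict(list)
    for prompt_id, is_correct in records:
        by_prompt[prompt_id].append(1.0 if is_correct else 0.0)
    return {prompt: mean(outcomes) for prompt, outcomes in by_prompt.items()}

def spread(accuracies):
    """Summarize how far accuracy moves across prompt variations."""
    values = list(accuracies.values())
    return {"min": min(values), "max": max(values), "std": pstdev(values)}

# Toy records (made up for illustration): two prompt variations, four instances each.
toy = [("p1", True), ("p1", True), ("p1", False), ("p1", True),
       ("p2", True), ("p2", False), ("p2", False), ("p2", False)]
acc = accuracy_per_prompt(toy)
summary = spread(acc)
print(acc, summary)
```

A large gap between `min` and `max` for the same model and dataset is what a tall vertical scatter in the plot would correspond to under this assumption.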
Success rate distribution reveals inherent example difficulty patterns. Distribution of success rates by evaluation dimension and model. The x-axis shows the percentage of successful perturbations per instance, while the y-axis shows the instance count in DOVE. The distribution reveals examples that are consistently easy or difficult for LLMs across prompt dimensions.
Select a model to view the per-question robustness analysis
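The per-instance success rate on the x-axis can be sketched as below. This is an illustrative sketch under assumptions: the `(instance_id, is_correct)` record layout, the function names, and the 20-point bin width are hypothetical choices, not taken from DOVE.

```python
from collections import Counter, defaultdict

def success_rate_per_instance(records):
    """Percentage of prompt perturbations answered correctly, per instance.

    records: iterable of (instance_id, is_correct) pairs for one model
    (hypothetical layout).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for instance_id, is_correct in records:
        totals[instance_id] += 1
        hits[instance_id] += int(is_correct)
    return {i: 100.0 * hits[i] / totals[i] for i in totals}

def histogram(rates, bin_width=20):
    """Count instances per success-rate bin; keys are the lower bin edges."""
    counts = Counter()
    for rate in rates.values():
        # Clamp 100% into the last bin so the top edge is inclusive.
        lower = min(int(rate // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Toy records (made up): three instances, four perturbations each.
toy = ([("q1", True)] * 4
       + [("q2", False)] * 4
       + [("q3", True), ("q3", True), ("q3", False), ("q3", False)])
rates = success_rate_per_instance(toy)
bins = histogram(rates)
print(rates, bins)
```

Instances that land in the lowest or highest bin are the consistently difficult or consistently easy examples the caption refers to.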
Few-shot prompting reduces performance variance across evaluation dimensions. Comparing zero-shot and five-shot prompting on domains from DOVE reveals a narrower spread of accuracy scores in the five-shot setting. Each point represents the accuracy across instances, showing that five-shot demonstrations lead to more robust performance and reduced prompt sensitivity.
Select a dataset to view the zero-shot/few-shot analysis
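One way to quantify the "narrower spread" is to compare the range of per-prompt accuracies in each setting. The sketch below assumes records of the form `(n_shots, prompt_id, is_correct)`; that layout, the helper names, and the toy values are hypothetical and are not DOVE results.

```python
from collections import defaultdict
from statistics import mean

def accuracy_range_by_shots(records):
    """Range (max - min) of per-prompt accuracy for each shot count.

    records: iterable of (n_shots, prompt_id, is_correct) tuples
    (hypothetical layout).
    """
    cells = defaultdict(list)
    for n_shots, prompt_id, is_correct in records:
        cells[(n_shots, prompt_id)].append(float(is_correct))
    per_setting = defaultdict(list)
    for (n_shots, _prompt_id), outcomes in cells.items():
        per_setting[n_shots].append(mean(outcomes))
    return {n: max(accs) - min(accs) for n, accs in per_setting.items()}

def rows(n_shots, prompt_id, outcomes):
    """Helper that builds toy records from a list of 0/1 outcomes."""
    return [(n_shots, prompt_id, bool(o)) for o in outcomes]

# Toy records (made up for illustration only).
toy = (rows(0, "p1", [1, 0, 0, 0]) + rows(0, "p2", [1, 1, 1, 0])
       + rows(5, "p1", [1, 1, 0, 0]) + rows(5, "p2", [1, 1, 1, 0]))
ranges = accuracy_range_by_shots(toy)
print(ranges)
```

In this toy data the five-shot range is smaller than the zero-shot range, which mirrors the pattern the caption describes; the range could be swapped for a standard deviation without changing the structure.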
Accuracy marginalization for different dimensions. Variations along each of the dimensions in DOVE lead to prompt sensitivity, even when controlling for all other dimensions. This analysis shows how different prompt elements (delimiters, instruction wording, answer formatting, etc.) affect model performance across datasets.
Select a model and dataset to view the accuracy marginalization analysis
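Marginalizing accuracy over one dimension can be sketched as grouping results by that dimension's value and averaging over everything else. This is a minimal sketch under assumptions: the dictionary keys (`delimiter`, `wording`, `correct`) and the toy values are hypothetical, and the averaging assumes the other dimensions are balanced across groups.

```python
from collections import defaultdict
from statistics import mean

def marginal_accuracy(records, dimension):
    """Mean accuracy for each value of one prompt dimension.

    records: iterable of dicts holding one key per prompt dimension plus a
    boolean "correct" key (hypothetical layout). All other dimensions are
    averaged out within each group.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(float(rec["correct"]))
    return {value: mean(outcomes) for value, outcomes in groups.items()}

# Toy records (made up): a full 2 x 2 grid over two dimensions.
toy = [
    {"delimiter": ":", "wording": "a", "correct": True},
    {"delimiter": ":", "wording": "b", "correct": True},
    {"delimiter": "-", "wording": "a", "correct": True},
    {"delimiter": "-", "wording": "b", "correct": False},
]
by_delimiter = marginal_accuracy(toy, "delimiter")
by_wording = marginal_accuracy(toy, "wording")
print(by_delimiter, by_wording)
```

Because the toy grid is fully crossed, each group contains the same mix of the other dimension, so a difference between groups reflects the marginalized dimension and not a confound.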