YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (ScienceQ&A). It contains multi-domain and biomedical question-answering instances in both benign and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation with open-source LLMs.
🔍 Overview
YESciEval provides:
- ScienceQ&A datasets generated using open-source LLMs
- Adversarial variants designed using fine-grained rubric-based heuristics
- Evaluation scores from multiple LLMs acting as evaluators (LLM-as-a-judge)
📂 Dataset Structure
The dataset is organized into two main parts:
1. Benign (Original) ScienceQ&A Data
Synthesized answers to research questions based on abstracts from relevant papers.
2. Adversarial ScienceQ&A Data
Each benign answer is perturbed with two types of adversarial modifications:
- Subtle Perturbations: Realistic, lightweight errors designed to be difficult for models to detect
- Extreme Perturbations: Significant modifications that should be easily identifiable by robust evaluators
Perturbations target nine qualitative rubrics:
- Cohesion
- Conciseness
- Readability
- Coherence
- Integration
- Relevancy
- Correctness
- Completeness
- Informativeness
Each rubric has a defined subtle and extreme perturbation heuristic.
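The following Python sketch conveys the subtle-vs-extreme distinction with toy perturbations for two of the nine rubrics. These functions are purely illustrative and are not the dataset's actual heuristics, which are defined in the YESciEval paper.

```python
# Illustrative only: toy subtle/extreme perturbations for two rubrics.
# The dataset's real heuristics differ; this just shows the contrast.

def subtle_conciseness(answer: str) -> str:
    # Subtle: append one redundant filler sentence (hard to spot).
    return answer + " It is worth noting that this point is notable."

def extreme_conciseness(answer: str) -> str:
    # Extreme: duplicate the entire answer (easy to spot).
    return answer + " " + answer

def subtle_readability(answer: str) -> str:
    # Subtle: fuse the first two sentences into a run-on.
    return answer.replace(". ", ", ", 1)

def extreme_readability(answer: str) -> str:
    # Extreme: strip all sentence boundaries.
    return answer.replace(". ", " ")

PERTURBATIONS = {
    "Conciseness": {"subtle": subtle_conciseness, "extreme": extreme_conciseness},
    "Readability": {"subtle": subtle_readability, "extreme": extreme_readability},
}

benign = "LLMs can evaluate answers. Rubric-based scoring adds transparency."
print(PERTURBATIONS["Conciseness"]["extreme"](benign))
```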
🧪 Evaluation Outputs
Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:
- A 1–5 Likert score for each rubric
- A rationale for the score
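A hypothetical sketch of what one such record might look like is shown below; the identifiers and field names are assumptions for illustration and may differ from those in the released JSON files.

```python
import json

# Hypothetical record shape for one evaluated response.
# All field names and values here are illustrative assumptions.
record = {
    "question_id": "orkgsyn-0001",      # hypothetical identifier
    "variant": "subtle_adversarial",    # benign | subtle_adversarial | extreme_adversarial
    "judge_model": "LLaMA-3.1",
    "scores": {
        "Cohesion": {"score": 4, "rationale": "Sentences connect logically..."},
        "Correctness": {"score": 2, "rationale": "One claim contradicts the abstract..."},
        # ...one entry per rubric, nine in total
    },
}
print(json.dumps(record, indent=2))
```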
📊 Statistics
| Subset | Benign | Subtle Adversarial | Extreme Adversarial |
|---|---|---|---|
| ORKGSyn (33 disciplines) | 348 Q&A pairs | 348 Q&A pairs | 348 Q&A pairs |
| BioASQ (Biomedical) | 73 Q&A pairs | 73 Q&A pairs | 73 Q&A pairs |
Total evaluations: ~45,000 across models and variants.
🗃️ Access
The dataset is also released in the YESciEval GitHub repository. This dedicated repository contains only the dataset files, which simplifies integration into benchmarking pipelines.
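A minimal loading sketch follows, assuming the dataset files have been cloned locally from the repository; the directory layout and file names below are hypothetical placeholders, not the repository's actual structure.

```python
import json
from pathlib import Path

# Hypothetical local path to the cloned dataset files;
# replace with the actual layout of the repository.
data_dir = Path("YESciEval/data")

# Iterate over the JSON files and report how many records each contains.
for path in sorted(data_dir.glob("*.json")):
    with path.open(encoding="utf-8") as f:
        records = json.load(f)
    print(path.name, len(records), "records")
```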
📜 Citation
If you use this dataset, please cite the following:
D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. In Proceedings of ACL 2025 (preprint).
🛠️ License
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0).
🙋‍♀️ Questions?
For questions or collaborations, contact Jennifer D’Souza.