YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (ScienceQ&A). It contains multi-domain and biomedical question-answering instances in both benign and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation with open-source LLMs.
🔍 Overview
YESciEval provides:
- ScienceQ&A datasets generated using open-source LLMs
- Adversarial variants designed using fine-grained rubric-based heuristics
- Evaluation scores from multiple LLMs acting as evaluators (LLM-as-a-judge)
📂 Dataset Structure
The dataset is organized into two main parts:
1. Benign (Original) ScienceQ&A Data
Synthesized answers to research questions based on abstracts from relevant papers.
2. Adversarial ScienceQ&A Data
Each benign answer is perturbed with two types of adversarial modifications:
- Subtle Perturbations: Realistic, lightweight errors designed to be difficult for models to detect
- Extreme Perturbations: Significant modifications that should be easily identifiable by robust evaluators
Perturbations target nine qualitative rubrics:
- Cohesion
- Conciseness
- Readability
- Coherence
- Integration
- Relevancy
- Correctness
- Completeness
- Informativeness
Each rubric has a defined subtle and extreme perturbation heuristic.
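The following Python sketch conveys the subtle-vs-extreme distinction with toy perturbations for two of the nine rubrics. These functions are purely illustrative and are not the dataset's actual heuristics, which are defined in the YESciEval paper.

```python
# Illustrative only: toy subtle/extreme perturbations for two rubrics.
# The dataset's real heuristics differ; this just shows the contrast.

def subtle_conciseness(answer: str) -> str:
    # Subtle: append one redundant filler sentence (hard to spot).
    return answer + " It is worth noting that this point is notable."

def extreme_conciseness(answer: str) -> str:
    # Extreme: duplicate the entire answer (easy to spot).
    return answer + " " + answer

def subtle_readability(answer: str) -> str:
    # Subtle: fuse the first two sentences into a run-on.
    return answer.replace(". ", ", ", 1)

def extreme_readability(answer: str) -> str:
    # Extreme: strip all sentence boundaries.
    return answer.replace(". ", " ")

PERTURBATIONS = {
    "Conciseness": {"subtle": subtle_conciseness, "extreme": extreme_conciseness},
    "Readability": {"subtle": subtle_readability, "extreme": extreme_readability},
}

benign = "LLMs can evaluate answers. Rubric-based scoring adds transparency."
print(PERTURBATIONS["Conciseness"]["extreme"](benign))
```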
🧪 Evaluation Outputs
Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:
- A 1–5 Likert score for each rubric
- A rationale for the score
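A hypothetical sketch of what one such record might look like is shown below; the identifiers and field names are assumptions for illustration and may differ from those in the released JSON files.

```python
import json

# Hypothetical record shape for one evaluated response.
# All field names and values here are illustrative assumptions.
record = {
    "question_id": "orkgsyn-0001",      # hypothetical identifier
    "variant": "subtle_adversarial",    # benign | subtle_adversarial | extreme_adversarial
    "judge_model": "LLaMA-3.1",
    "scores": {
        "Cohesion": {"score": 4, "rationale": "Sentences connect logically..."},
        "Correctness": {"score": 2, "rationale": "One claim contradicts the abstract..."},
        # ...one entry per rubric, nine in total
    },
}
print(json.dumps(record, indent=2))
```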
📊 Statistics
| Subset | Benign | Subtle Adversarial | Extreme Adversarial |
|---|---|---|---|
| ORKGSyn (33 disciplines) | 348 Q&A pairs | 348 Q&A pairs | 348 Q&A pairs |
| BioASQ (Biomedical) | 73 Q&A pairs | 73 Q&A pairs | 73 Q&A pairs |
Total evaluations: ~45,000 across models and variants.
🗃️ Access
The dataset is also released in the YESciEval GitHub repository. This dedicated repository contains only the dataset files, which simplifies integration into benchmarking pipelines.
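A minimal loading sketch follows, assuming the dataset files have been cloned locally from the repository; the directory layout and file names below are hypothetical placeholders, not the repository's actual structure.

```python
import json
from pathlib import Path

# Hypothetical local path to the cloned dataset files;
# replace with the actual layout of the repository.
data_dir = Path("YESciEval/data")

# Iterate over the JSON files and report how many records each contains.
for path in sorted(data_dir.glob("*.json")):
    with path.open(encoding="utf-8") as f:
        records = json.load(f)
    print(path.name, len(records), "records")
```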
📜 Citation
If you use this dataset, please cite the following:
D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. In Proceedings of ACL 2025 (preprint).
🛠️ License
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0).
🙋‍♀️ Questions?
For questions or collaborations, contact Jennifer D’Souza.