YESciEval Corpus

YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (ScienceQ&A). It comprises multi-domain and biomedical question-answering instances in both benign (original) and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation using open-source LLMs.


🔍 Overview

YESciEval provides:

  • ScienceQ&A datasets generated using open-source LLMs
  • Adversarial variants designed using fine-grained rubric-based heuristics
  • Evaluation scores from multiple LLMs acting as evaluators (LLM-as-a-judge)

📂 Dataset Structure

The dataset is organized into two main parts:

1. Benign (Original) ScienceQ&A Data

Synthesized answers to research questions based on abstracts from relevant papers.

  • Sources:
    • ORKGSyn: multidisciplinary questions from the Open Research Knowledge Graph
    • BioASQ: biomedical questions from the BioASQ challenge
  • Format: each Q&A instance contains the following fields (a loading sketch follows this list):
    • question: the research question
    • abstracts: abstracts of the relevant papers
    • answer: the LLM-generated synthesis
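
For illustration, here is a minimal sketch of loading one benign record in Python. The field names come from the list above; the file name and the list-of-dicts layout are assumptions, so check the released files for the actual structure.

```python
import json

# Hypothetical file name; see the repository for the actual file layout.
with open("orkgsyn_benign.json", encoding="utf-8") as f:
    records = json.load(f)  # assumed: a list of Q&A dicts

record = records[0]
print(record["question"])         # the research question
print(len(record["abstracts"]))   # number of relevant paper abstracts
print(record["answer"][:200])     # the LLM-generated synthesis, truncated
```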

2. Adversarial ScienceQ&A Data

Each benign answer is perturbed with two types of adversarial modifications:

  • Subtle Perturbations: realistic, lightweight errors designed to be difficult for models to detect
  • Extreme Perturbations: Significant modifications that should be easily identifiable by robust evaluators

Perturbations target nine qualitative rubrics: Cohesion, Conciseness, Readability, Coherence, Integration, Relevancy, Correctness, Completeness, and Informativeness.

Each rubric has a defined subtle and extreme perturbation heuristic.
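
Together, this yields 9 rubrics × 2 severities = 18 distinct perturbation heuristics. Here is a small sketch enumerating these settings; the rubric names are taken from this document, while how the variants are keyed on disk is an assumption:

```python
# The nine qualitative rubrics targeted by the perturbation heuristics.
RUBRICS = [
    "Cohesion", "Conciseness", "Readability", "Coherence", "Integration",
    "Relevancy", "Correctness", "Completeness", "Informativeness",
]
SEVERITIES = ["subtle", "extreme"]  # one heuristic per rubric and severity

# 9 rubrics x 2 severities = 18 distinct perturbation settings.
for rubric in RUBRICS:
    for severity in SEVERITIES:
        print(f"{severity} perturbation targeting {rubric}")
```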


🧪 Evaluation Outputs

Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records (see the parsing sketch after this list):

  • A 1–5 Likert score for each rubric
  • A rationale for the score
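
A hedged sketch of consuming these outputs in Python. The file name and exact JSON schema are assumptions; the dataset only guarantees a 1–5 score and a rationale per rubric.

```python
import json
from statistics import mean

# Hypothetical file name and record layout: one dict per evaluated answer,
# mapping each rubric name to its Likert score and rationale. Check the
# released files for the actual schema.
with open("llama3_1_orkgsyn_benign_scores.json", encoding="utf-8") as f:
    evaluations = json.load(f)

# Inspect the first evaluation record.
for rubric, judgment in evaluations[0].items():
    print(rubric, judgment["score"], judgment["rationale"][:80])

# Mean Likert score per rubric across all evaluated answers.
for rubric in evaluations[0]:
    print(rubric, mean(e[rubric]["score"] for e in evaluations))
```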

📊 Statistics

ORKGSyn (33 disciplines)

  • Benign: 348 Q&A pairs
  • Subtle Adversarial: 348 Q&A pairs
  • Extreme Adversarial: 348 Q&A pairs

BioASQ (Biomedical)

  • Benign: 73 Q&A pairs
  • Subtle Adversarial: 73 Q&A pairs
  • Extreme Adversarial: 73 Q&A pairs

Total evaluations: ~45,000 across models and variants.
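
As a rough consistency check (the number of judge models is an assumption, since only examples are named above): 348 + 73 = 421 instances per variant, times 3 variants gives 1,263 answers; if, say, four judge models each scored all nine rubrics, that would yield 1,263 × 9 × 4 = 45,468 rubric-level judgments, in line with the ~45,000 figure.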


🗃️ Access

The dataset is also released on the YESciEval GitHub repository.

This dedicated repository contains only the dataset files, which simplifies integration into benchmarking pipelines.


📜 Citation

If you use this dataset, please cite the following:

D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. In Proceedings of ACL 2025.


🛠️ License

This dataset is released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).


🙋‍♀️ Questions?

For questions or collaborations, contact Jennifer D’Souza.


Cite this as

Jennifer D'Souza, Quentin Münch, Hamed Babaei Giglou (2025). YESciEval Corpus [Data set]. LUIS. https://doi.org/10.25835/8dcv2ka6

Additional Info

  • Source: https://github.com/sciknoworg/YESciEval/tree/main/experiments/dataset
  • Author: Jennifer D'Souza, Quentin Münch, Hamed Babaei Giglou
  • Maintainer: Jennifer D'Souza
  • Last Updated: May 28, 2025, 10:21 (UTC)
  • Created: May 28, 2025, 10:02 (UTC)
  • License: Creative Commons Attribution 4.0 International
  • Dataset Size: 106.6 MB