TIB-SID: TIB Subject Indexing Dataset

The TIB Subject Indexing Dataset (TIB-SID) is a bilingual benchmark for extreme multi-label text classification (XMTC) over real library records, designed for domain classification and GND-based subject indexing. The dataset combines a large, structured, authority-controlled label space with long-tail sparsity, cross-lingual variation, and real-world domain imbalance, making it substantially closer to operational library cataloging than standard text classification benchmarks.

✨ At a glance

  • 136,569 library records in JSON-LD with predefined train / dev / test benchmark splits
  • Languages: English and German
  • 28 domains
  • Record types: article, book, conference, report, thesis
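
As a rough illustration of working with the JSON-LD records and predefined splits listed above, the following Python sketch walks a local copy of the data and counts records per split. It is only a sketch under stated assumptions: the directory layout (a top-level folder per split, files ending in `.jsonld`) and the helper names `load_records` and `count_by_split` are hypothetical and not taken from the dataset documentation, so adjust them to the actual folder structure.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(root):
    """Yield (split, record) pairs for every .jsonld file under root.

    Assumption (hypothetical layout): the first directory level below
    root names the split, e.g. root/train/.../record.jsonld.
    Files placed directly in root are reported under "unsplit".
    """
    root = Path(root)
    for path in sorted(root.rglob("*.jsonld")):
        parts = path.relative_to(root).parts
        split = parts[0] if len(parts) > 1 else "unsplit"
        with path.open(encoding="utf-8") as handle:
            # Each file is parsed as plain JSON; no JSON-LD expansion is done.
            yield split, json.load(handle)


def count_by_split(root):
    """Return a Counter mapping split name to the number of records found."""
    return Counter(split for split, _ in load_records(root))
```

Calling `count_by_split("data")` on a local copy would then give a quick sanity check that the train / dev / test folders were downloaded completely. Parsing the files as plain JSON keeps the sketch dependency-free; a JSON-LD library would be needed only if the linked-data context has to be resolved.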

⬇️ Download

Download the dataset from this folder: data

🔗 Related Links

TIB-SID was introduced through the LLMs4Subjects shared tasks organized in 2025. More than 12 LLM-based systems were developed and evaluated on the dataset by participating teams worldwide. The shared task websites provide additional context, task details, and leaderboard results.

📖 Citation

If you find TIB-SID useful for your research or project, please consider citing it.

The main dataset paper is listed below. It has been accepted to LREC 2026, and the official proceedings citation will be added here as soon as it is available.

@misc{dsouza2026extrememultilabeltextclassification,
      title={An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?}, 
      author={Jennifer D'Souza and Sameer Sadruddin and Maximilian Kähler and Andrea Salfinger and Luca Zaccagna and Francesca Incitti and Lauro Snidaro and Osma Suominen},
      year={2026},
      eprint={2603.10876},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.10876}, 
}

If you would also like to cite the shared task that introduced the broader benchmark setting, please use:

@InProceedings{dsouza-EtAl:2025:SemEval2025,
author    = {D'Souza, Jennifer and Sadruddin, Sameer and Israel, Holger and Begoin, Mathias and Slawig, Diana},
title     = {SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog},
booktitle = {Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)},
month     = {August},
year      = {2025},
address   = {Vienna, Austria},
publisher = {Association for Computational Linguistics},
pages     = {1082--1095},
url       = {https://aclanthology.org/2025.semeval2025-1.139}
}

⭐ Acknowledgements

This work was supported by the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259) and the TIB – Leibniz Information Centre for Science and Technology. We also gratefully acknowledge the subject specialists at TIB who contributed to the curated human evaluation of this work.

Cite this as

Jennifer D'Souza, Sameer Sadruddin (2026). TIB-SID: TIB Subject Indexing Dataset [Data set]. LUIS. https://doi.org/10.25835/b9vo5q9z
Retrieved: 22:59 22 Apr 2026 (UTC)

Additional Info

Field          Value
Source         https://github.com/sciknoworg/tib-sid/tree/
Author         Jennifer D'Souza, Sameer Sadruddin
Maintainer     Jennifer D'Souza
Last Updated   March 14, 2026, 11:15 (UTC)
Created        March 11, 2026, 17:52 (UTC)
License        Creative Commons Attribution Share-Alike 4.0 International
Dataset Size   0.0 Byte