TIB-SID: TIB Subject Indexing Dataset

doi:10.25835/b9vo5q9z

The TIB Subject Indexing Dataset (TIB-SID) is a bilingual benchmark for extreme multi-label text classification (XMTC) over real library records, designed for domain classification and GND-based subject indexing. The dataset combines a large, structured, authority-controlled label space with long-tail sparsity, cross-lingual variation, and real-world domain imbalance, making it substantially closer to operational library cataloging than standard text classification benchmarks.

✨ At a glance

136,569 library records in JSON-LD with predefined train / dev / test benchmark splits
Languages: English and German
28 domains
Record types: article, book, conference, report, thesis

⬇️ Download

Download the dataset from this folder: data
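Once the folder is downloaded, a quick sanity check is to count the records found under each benchmark split. The following is a minimal sketch, not the maintainers' loader. It assumes that the local copy keeps the folder name `data`, that each split lives under a directory named `train`, `dev`, or `test` at some depth, and that each record is a single JSON document in a `.jsonld` or `.json` file. The helper names `split_of`, `iter_records`, and `count_by_split` are hypothetical, and the actual repository layout may differ.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Split names taken from the dataset description; the directory layout is an assumption.
SPLITS = ("train", "dev", "test")
# Assumed file extensions for JSON-LD records.
RECORD_SUFFIXES = {".jsonld", ".json"}


def split_of(path: Path, root: Path) -> Optional[str]:
    """Return the first path component below root that names a split, if any."""
    for part in path.relative_to(root).parts:
        if part in SPLITS:
            return part
    return None


def iter_records(root: Path) -> Iterator[Tuple[str, object]]:
    """Yield (split, parsed JSON) for every record file found below root."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in RECORD_SUFFIXES:
            continue
        split = split_of(path, root)
        if split is None:
            continue  # skip files that are not inside a split directory
        with path.open(encoding="utf-8") as handle:
            yield split, json.load(handle)


def count_by_split(root: Path) -> Counter:
    """Count record files per split."""
    return Counter(split for split, _ in iter_records(root))
```

Calling `count_by_split(Path("data"))` on a local copy returns a `Counter` keyed by split name. If the assumptions above hold, the three counts should add up to the 136,569 records stated in the overview.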

🔗 Related Links

TIB-SID was introduced through the LLMs4Subjects shared tasks organized in 2025. More than 12 LLM-based systems were developed and evaluated on the dataset by participating teams worldwide. The shared task websites provide additional context, task details, and leaderboard results.

📖 Citation

If TIB-SID is useful for your research or project, please consider citing it.

The main dataset paper is listed below. It has been accepted to LREC 2026, and the official proceedings citation will be added here as soon as it is available.

@misc{dsouza2026extrememultilabeltextclassification,
      title={An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?}, 
      author={Jennifer D'Souza and Sameer Sadruddin and Maximilian Kähler and Andrea Salfinger and Luca Zaccagna and Francesca Incitti and Lauro Snidaro and Osma Suominen},
      year={2026},
      eprint={2603.10876},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.10876}, 
}

If you would also like to cite the shared task that introduced the broader benchmark setting, please use:

@InProceedings{dsouza-EtAl:2025:SemEval2025,
author    = {D'Souza, Jennifer and Sadruddin, Sameer and Israel, Holger and Begoin, Mathias and Slawig, Diana},
title     = {SemEval-2025 Task 5: LLMs4Subjects - LLM-based Automated Subject Tagging for a National Technical Library's Open-Access Catalog},
booktitle = {Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)},
month     = {August},
year      = {2025},
address   = {Vienna, Austria},
publisher = {Association for Computational Linguistics},
pages     = {1082--1095},
url       = {https://aclanthology.org/2025.semeval2025-1.139}
}

⭐ Acknowledgements

This work was supported by the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259) and the TIB – Leibniz Information Centre for Science and Technology. We also gratefully acknowledge the subject specialists at TIB who contributed to the curated human evaluation of this work.

Data and Resources

train, dev, and test datasets (XMTC), JSON-LD

Cite this as

Jennifer D'Souza, Sameer Sadruddin (2026). TIB-SID: TIB Subject Indexing Dataset [Data set]. LUIS. https://doi.org/10.25835/b9vo5q9z

Retrieved: 10:50 18 Jul 2026 (UTC)

Additional Info

Field	Value
Source	https://github.com/sciknoworg/tib-sid/tree/
Author	Jennifer D'Souza, Sameer Sadruddin
Maintainer	Jennifer D'Souza
Last Updated	March 14, 2026, 11:15 (UTC)
Created	March 11, 2026, 17:52 (UTC)
License	Creative Commons Attribution Share-Alike 4.0 International
Dataset Size	0.0 Byte