SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing Paper • 2512.11192 • Published Dec 12, 2025
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published Jun 2, 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities Paper • 2510.21909 • Published Oct 24, 2025
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures Paper • 2510.24081 • Published Oct 28, 2025
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization Paper • 2505.24689 • Published May 30, 2025
Evaluating Morphological Alignment of Tokenizers in 70 Languages Paper • 2507.06378 • Published Jul 8, 2025