Nemotron-Pre-Training-Datasets Collection Large-scale pre-training datasets used in the Nemotron family of models. • 12 items • Updated 4 days ago • 148
Tokenizer Choice For LLM Training: Negligible or Crucial? Paper • 2310.08754 • Published Oct 12, 2023 • 3
SmolLM3: smol, multilingual, long-context reasoner Article • By eliebak, cmpatino, anton-l, edbeeching, m-ric, nouamanetazi, akseljoonas, guipenedo, hynky, clefourrier, SaylorTwift, kashif, qgallouedec, hlarcher, glutamatt, Xenova, reach-vb, ngxson, craffel, lewtun, loubnabnl, lvwerra, thomwolf • Jul 8, 2025 • 773
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models Paper • 2505.22232 • Published May 28, 2025 • 18
EU20-Benchmarks Collection Evaluation benchmarks for 20 European languages. • 5 items • Updated Oct 11, 2024 • 9