Dataset

Dolma3 Pool

by allenai ID: hf-dataset--allenai--dolma3_pool

⚠️ **IMPORTANT NOTICE** ⚠️ This is the Dolma 3 *pool*, pre–quality upsampling and mixing. If you are interested in *the data used* to train Olmo 3 7B and Olmo 3 32B, visit **allenai/dolma3_mix-6T-1025**.

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__allenai__dolma3_pool,
  author = {allenai},
  title = {Dolma3 Pool Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/allenai/dolma3_pool}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

allenai. (2026). Dolma3 Pool [Dataset]. Free2AITools. https://huggingface.co/datasets/allenai/dolma3_pool

Dataset Specification

license: odc-by
task_categories:
- text-generation
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: data/common_crawl-science_math_and_technology-0002/*

Dolma 3 Pool

The Dolma 3 pool is a dataset of over 9 trillion tokens drawn from a diverse mix of web content, academic publications, code, and more. For detailed documentation on Dolma 3 processing and data, please see our Dolma 3 GitHub repository. For more information on Dolma in general, please see our original release.

A Note on the Dolma 3 Pool: Source Links

This repository contains the Dolma 3 pool documents for Common Crawl (web) and olmOCR Science PDFs only. To access the documents from the remaining sources in the pool, follow the source links below:

Common Crawl: Current repository
olmOCR Science PDFs: Current repository
StackEdu: https://huggingface.co/datasets/HuggingFaceTB/stack-edu
arXiv: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
FineMath 3+: https://huggingface.co/datasets/HuggingFaceTB/finemath
Wikipedia & Wikibooks: https://huggingface.co/datasets/allenai/dolma (dolma v1.7)

Dataset Sources

The pool comprises all documents considered for training the first stage of Olmo 3 7B.

| Source | Type | 9T Pool Tokens | 9T Pool Docs |
|---|---|---|---|
| Common Crawl | Web pages | 8.14T | 9.67B |
| olmOCR Science PDFs | Academic documents | 972B | 101M |
| StackEdu (Rebalanced) | GitHub code | 137B | 167M |
| arXiv | Papers with LaTeX | 21.4B | 3.95M |
| FineMath 3+ | Math web pages | 34.1B | 21.4M |
| Wikipedia & Wikibooks | Encyclopedic | 3.69B | 6.67M |
| Total | | 9.31T | 9.97B |
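
To make the relative size of each source easier to see, the token counts in the table above can be turned into shares of the pool. The following is a minimal sketch: the numbers are copied from the table (converted to billions), and the names `POOL_TOKENS_B` and `token_shares` are illustrative only, not part of any Dolma tooling.

```python
# Token counts in billions, copied from the table above (8.14T = 8140B).
POOL_TOKENS_B = {
    "Common Crawl": 8140.0,
    "olmOCR Science PDFs": 972.0,
    "StackEdu (Rebalanced)": 137.0,
    "arXiv": 21.4,
    "FineMath 3+": 34.1,
    "Wikipedia & Wikibooks": 3.69,
}


def token_shares(tokens_b):
    """Return each source's fraction of the total token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


total_b = sum(POOL_TOKENS_B.values())
shares = token_shares(POOL_TOKENS_B)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<24}{share:7.2%}")
print(f"Total: {total_b / 1000:.2f}T tokens")
```

The per-source counts sum to the 9.31T total reported in the table, with Common Crawl accounting for roughly seven eighths of all tokens.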

Downloading Dolma 3

You can download and load this data using Hugging Face's datasets library with the following code:

from datasets import load_dataset
# Load the train split of the default config
dataset = load_dataset("allenai/dolma3_pool", split="train")

You can also load a specific subset of the dataset. In this repository, Common Crawl data folders are formatted as common_crawl-topic-vigintile (for example, common_crawl-science_math_and_technology-0002). Similarly, olmOCR PDF data folders are formatted as olmocr_science_pdfs-topic. To load only the olmOCR PDF folders:

from datasets import load_dataset
# Load only the olmOCR Science PDF folders
dataset = load_dataset("allenai/dolma3_pool",
                       data_files="data/olmocr_science_pdfs-*/*.jsonl.zst",
                       split="train")
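
The folder naming convention described above can be captured in a small helper that builds a `data_files` glob. This is only a sketch: `data_files_pattern` is a hypothetical function, not part of the datasets library, and it assumes the only two folder families are the ones named in this card. It matches every file in a folder (`/*`) rather than assuming a file extension.

```python
def data_files_pattern(source, topic="*", vigintile="*"):
    """Build a data_files glob for one folder family in this repository.

    Hypothetical helper. `topic` and `vigintile` are passed through as
    strings, so wildcards such as "*" are allowed; `vigintile` is ignored
    for olmOCR folders, which have no vigintile suffix.
    """
    if source == "common_crawl":
        return f"data/common_crawl-{topic}-{vigintile}/*"
    if source == "olmocr_science_pdfs":
        return f"data/olmocr_science_pdfs-{topic}/*"
    raise ValueError(f"unknown source: {source!r}")


# Reproduces the path used by the default config in the dataset card.
default_path = data_files_pattern(
    "common_crawl", "science_math_and_technology", "0002"
)
print(default_path)
```

The resulting string can be passed as the `data_files` argument of `load_dataset("allenai/dolma3_pool", ...)`, in the same way as the glob in the example above.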

Note: You can iterate over the dataset directly without downloading it in full. Simply set streaming=True in the command above.
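
As an illustration of the streaming note, the sketch below reads only the first few documents instead of downloading the whole pool. The helper names `take` and `stream_first_documents` are illustrative, and the `data_files` glob is simply the one from the olmOCR example; calling `stream_first_documents` needs the datasets library and network access.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def stream_first_documents(n=3):
    """Stream the olmOCR Science PDF subset and return its first n rows."""
    # Imported lazily so that `take` stays usable without the library.
    from datasets import load_dataset

    stream = load_dataset(
        "allenai/dolma3_pool",
        data_files="data/olmocr_science_pdfs-*/*.jsonl.zst",
        split="train",
        streaming=True,  # yields rows lazily instead of downloading everything
    )
    return take(stream, n)
```

With `streaming=True`, `load_dataset` returns an iterable dataset, so rows are fetched only as they are consumed. Inspecting the keys of the first returned row is a quick way to discover the record fields before committing to a full download.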

Licensing Information

Dolma 3 is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.

Citation

@misc{olmo2025olmo3,
title={Olmo 3},
author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2025},
eprint={2512.13961},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.13961},
}

🆔 Identity & Source

id: hf-dataset--allenai--dolma3_pool
source: huggingface
author: allenai
tags: task_categories:text-generation, language:en, license:odc-by, size_categories:10m, format:json, modality:text, library:datasets, library:dask, library:mlcroissant, arxiv:2512.13961, region:us

📊 Engagement & Metrics

likes: 28
downloads: 48,215
