Common Crawl Foundation

Team

non-profit

Verified

https://commoncrawl.org

commoncrawl

AI & ML interests

Crawled data and metadata

Recent Activity

pjox

authored a paper 6 days ago

SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

Paper • 2512.11192 • Published Dec 12, 2025

tvaughan

updated a dataset 7 days ago

commoncrawl/statistics

Viewer • Updated 7 days ago • 610k • 151 • 25

malteos

updated a Space 22 days ago

cc-citations

📜

Scientific articles using or citing Common Crawl data

dalhuijsen

updated 2 datasets about 2 months ago

commoncrawl/gneissweb-annotation-host-testing-v1

Viewer • Updated Dec 11, 2025 • 617M • 40

commoncrawl/gneissweb-annotation-url-testing-v1

Viewer • Updated Dec 10, 2025 • 11.5B • 115

greglindahl

updated 2 datasets about 2 months ago

commoncrawl/gneissweb-annotation-host-testing-v1

Viewer • Updated Dec 11, 2025 • 617M • 40

commoncrawl/gneissweb-annotation-url-testing-v1

Viewer • Updated Dec 10, 2025 • 11.5B • 115

greglindahl

published a dataset about 2 months ago

commoncrawl/gneissweb-annotation-host-testing-v1

Viewer • Updated Dec 11, 2025 • 617M • 40

greglindahl

published a Space about 2 months ago

cc-citations

📜

Scientific articles using or citing Common Crawl data

jenglish-cc

updated a dataset 2 months ago

commoncrawl/gneissweb-annotation-url-testing-v1

Viewer • Updated Dec 10, 2025 • 11.5B • 115

greglindahl

published a dataset 2 months ago

commoncrawl/gneissweb-annotation-url-testing-v1

Viewer • Updated Dec 10, 2025 • 11.5B • 115

malteos

updated a dataset 2 months ago

commoncrawl/statistics

Viewer • Updated 7 days ago • 610k • 151 • 25

catherinearnett

authored 3 papers 3 months ago

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Paper • 2506.01732 • Published Jun 2, 2025 • 6

Explaining and Mitigating Crosslingual Tokenizer Inequities

Paper • 2510.21909 • Published Oct 24, 2025

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Paper • 2510.24081 • Published Oct 28, 2025 • 19

handecelikkanat

updated a dataset 4 months ago

commoncrawl/citations

Viewer • Updated Oct 16, 2025 • 9.18k • 73 • 1

laurievb

updated a dataset 4 months ago

commoncrawl/statistics

Viewer • Updated 7 days ago • 610k • 151 • 25

catherinearnett

authored 2 papers 7 months ago

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Paper • 2505.24689 • Published May 30, 2025 • 1

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Paper • 2507.06378 • Published Jul 8, 2025

laurievb

authored a paper 11 months ago

An Open Dataset and Model for Language Identification

Paper • 2305.13820 • Published May 23, 2023

Team members 13
