Common Crawl

About the Corpus

Common Crawl is a large-scale corpus of open web data. Maintained by the nonprofit organization of the same name, the corpus currently includes over 300 billion web pages collected since 2007 and adds 3–5 billion new pages each month, providing a continuously expanding record of publicly accessible web content.

Common Crawl includes raw web page text, metadata, and link-structure information. It is widely used in computational research and has been cited in over 10,000 research papers.

Accessing the Corpus

Researchers can access Common Crawl through its online portal: https://commoncrawl.org/

The corpus is accessible in multiple formats:

  • Researchers can access and analyze the corpus via cloud storage hosted on Amazon Web Services (AWS), which allows users to work with the data without downloading large files to their own machines: https://commoncrawl.org/get-started

  • Web crawl data can be downloaded in bulk as archive files (WARC for raw crawl responses, WET for extracted plain text, and WAT for metadata), which contain collections of web pages grouped by crawl date: https://ds5q9oxwqwsfj.cloudfront.net/
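The WARC format underlying these archives is simple in structure: each record begins with a version line, followed by colon-separated header fields, a blank line, and the payload. The following is a minimal sketch of reading that header block with only the standard library (a hand-written example record stands in for real crawl data; production code would typically use a dedicated library such as warcio):

```python
import io

def parse_warc_headers(stream):
    """Parse the header block of a single WARC record from a binary stream.

    Returns a dict of header fields. Simplified sketch: a version line,
    then colon-separated headers, then a blank line before the payload.
    """
    version = stream.readline().decode("utf-8").strip()  # e.g. "WARC/1.0"
    headers = {"WARC-Version": version}
    for raw in iter(stream.readline, b"\r\n"):  # stop at the blank line
        key, _, value = raw.decode("utf-8").partition(":")
        headers[key.strip()] = value.strip()
    return headers

# A minimal, hand-written WARC record for illustration (not real crawl data).
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
)
stream = io.BytesIO(record)
headers = parse_warc_headers(stream)
payload = stream.read(int(headers["Content-Length"]))
```

Real WARC files are gzipped concatenations of many such records, so a full reader loops this logic over a decompressed stream.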

New crawl releases are published regularly, and researchers can select specific crawls to suit their research needs. Documentation and file format standards are available through the Common Crawl website and associated technical resources.
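Individual crawls are identified by names of the form CC-MAIN-<year>-<week>, and each crawl publishes gzipped listings of its archive file paths. The sketch below builds the URL for such a listing; the host and path layout reflect Common Crawl's published conventions but should be checked against the current documentation, and the crawl ID is only an example:

```python
# Base host for bulk HTTPS access to Common Crawl data (per the project's
# published conventions; verify against current documentation).
BASE = "https://data.commoncrawl.org"

def warc_listing_url(crawl_id: str) -> str:
    """Return the URL of the gzipped list of WARC file paths for a crawl."""
    return f"{BASE}/crawl-data/{crawl_id}/warc.paths.gz"

# Example crawl identifier: year 2023, week 50.
url = warc_listing_url("CC-MAIN-2023-50")
```

Downloading and decompressing that listing yields one relative file path per line, which can be appended to the same base host to fetch individual WARC files.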

Analyzing the Corpus

Common Crawl is well-suited for large-scale computational studies of language use, discourse, and communication across the web.

  • Whirlwind Tutorial for Common Crawl in Python: This tutorial demonstrates how to retrieve web pages, extract text from WARC files, and explore corpus content without needing to download the full dataset. It is especially useful for researchers who want to work with manageable subsets of web crawl data for linguistic or writing analysis. https://github.com/commoncrawl/whirlwind-python 

  • Web Graph Analysis Tools with Java: Provides tools for constructing and analyzing web graphs derived from Common Crawl data. These tools allow researchers to examine how web pages link to one another, identify influential sites, and analyze relationships between domains. https://github.com/commoncrawl/cc-webgraph 

  • Large-Scale HTML and Web Text Analysis with R and Spark: This tutorial shows how researchers can extract text, count linguistic features, and explore large web corpora using familiar R workflows. It provides an accessible entry point for humanities researchers interested in large-scale corpus analysis without needing to manually process raw archive files. https://rpubs.com/jluraschi/billion-tags
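The web-graph analysis described above reduces, at its simplest, to counting links between hosts. The toy sketch below illustrates the idea with an invented edge list (the host names and links are made up for demonstration); in-degree serves as a simple proxy for identifying influential sites:

```python
from collections import defaultdict

# Toy host-level link graph of the kind cc-webgraph derives from crawl
# data, represented as source host -> linked-to hosts. Edges are invented.
edges = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

# Count incoming links per host.
in_degree = defaultdict(int)
for source, targets in edges.items():
    for target in targets:
        in_degree[target] += 1

# The host with the most incoming links is, by this crude measure,
# the most "influential" in the graph.
most_linked = max(in_degree, key=in_degree.get)
```

At Common Crawl scale the same computation runs over billions of edges, which is why the cc-webgraph tools rely on compressed graph representations rather than in-memory dictionaries.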

Researchers can browse additional tutorials on the Common Crawl website: https://commoncrawl.org/example

Selected Research