Common Crawl is a large-scale corpus of open web data. Maintained by the nonprofit organization of the same name, the corpus currently includes over 300 billion web pages collected from 2007 to the present and adds 3–5 billion new pages each month, providing a continuously expanding record of publicly accessible web content.
Common Crawl includes raw web page text, metadata, and link structure information. It is widely used in computational research and has been cited in over 10,000 research papers.
Researchers can access Common Crawl through its online portal: https://commoncrawl.org/
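The corpus is accessible in multiple formats: WARC files (raw crawl data), WAT files (metadata extracted from the WARC records), and WET files (extracted plain text).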
Researchers can access and analyze the corpus using cloud storage provided by Amazon Web Services (AWS). This allows users to work with the data without downloading large files to their own computers: https://commoncrawl.org/get-started
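Working with the data without downloading whole files usually means requesting only the bytes of a single record. The following is a minimal sketch of that idea, not an official client. It assumes that the crawl index is served under https://index.commoncrawl.org/, that data files are served under https://data.commoncrawl.org/, and that each index entry carries `filename`, `offset`, and `length` fields. The function names and the crawl identifier in the usage note are illustrative.

```python
from urllib.parse import urlencode

# Assumed public endpoints; check the Get Started page for current values.
INDEX_PREFIX = "https://index.commoncrawl.org/"
DATA_PREFIX = "https://data.commoncrawl.org/"


def index_query_url(crawl_id, page_url):
    """Build an index lookup URL for one page within one crawl."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_PREFIX}{crawl_id}-index?{query}"


def record_request(index_record):
    """Turn one index entry into a (url, headers) pair for a ranged GET.

    The Range header asks the server for just the bytes of one record,
    so the rest of the (large) archive file is never transferred.
    """
    offset = int(index_record["offset"])
    length = int(index_record["length"])
    url = DATA_PREFIX + index_record["filename"]
    # HTTP byte ranges are inclusive, hence the "- 1" on the end position.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```

For example, `index_query_url("CC-MAIN-2024-33", "example.com")` produces a lookup URL for one crawl, and the pair returned by `record_request` can be passed to any HTTP client that supports custom headers.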
The corpus includes regularly updated crawl releases, and researchers can select specific crawls depending on their research needs. Documentation and file format standards are available through the Common Crawl website and associated technical resources.
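Selecting crawls for a study period can be sketched as a simple filter over crawl identifiers. The example below assumes identifiers of the form `CC-MAIN-YYYY-WW` (year and week number), which is the pattern used by recent releases; older crawls use other naming schemes and are skipped here. The helper name `crawls_for_year` is hypothetical.

```python
import re

# Assumed naming pattern for recent crawls: CC-MAIN-<year>-<week>.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawls_for_year(crawl_ids, year):
    """Return the identifiers whose embedded year matches, sorted by name."""
    selected = []
    for crawl_id in crawl_ids:
        match = CRAWL_ID.match(crawl_id)
        # Identifiers that do not follow the assumed pattern are ignored.
        if match and int(match.group(1)) == year:
            selected.append(crawl_id)
    return sorted(selected)
```

The list of identifiers would normally come from the crawl listing published on the Common Crawl website rather than being typed by hand.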
Common Crawl is well-suited for large-scale computational studies of language use, discourse, and communication across the web.
Whirlwind Tutorial for Common Crawl in Python: This tutorial demonstrates how to retrieve web pages, extract text from WARC files, and explore corpus content without needing to download the full dataset. It is especially useful for researchers who want to work with manageable subsets of web crawl data for linguistic or writing analysis. https://github.com/commoncrawl/whirlwind-python
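To give a sense of what extracting text from a WARC record involves, here is a standard-library sketch. It is not the tutorial's code. It assumes a single gzip-compressed response record whose payload is an uncompressed, unchunked HTML page, and it ignores complications such as chunked transfer encoding and character-set detection. A dedicated WARC library such as warcio is usually a better choice in practice. The names `split_warc_response`, `html_to_text`, and `TextExtractor` are illustrative.

```python
import gzip
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def split_warc_response(raw_gz):
    """Split one gzipped record into WARC headers, HTTP headers, and body."""
    record = gzip.decompress(raw_gz)
    # Each header block ends at the first blank line (CRLF CRLF).
    warc_headers, _, rest = record.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


def html_to_text(body, encoding="utf-8"):
    """Decode an HTML payload and return its visible text as one string."""
    parser = TextExtractor()
    parser.feed(body.decode(encoding, errors="replace"))
    parser.close()
    return " ".join(parser.parts)
```

The extracted text can then be passed to whatever tokenizer or annotation tool a linguistic analysis requires.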
Web Graph Analysis Tools with Java: Provides tools for constructing and analyzing web graphs derived from Common Crawl data. These tools allow researchers to examine how web pages link to one another, identify influential sites, and analyze relationships between domains. https://github.com/commoncrawl/cc-webgraph
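One simple way to identify influential sites in a link graph is to count inbound links. The sketch below is written in Python for consistency with the other examples here and does not use the cc-webgraph tools. It assumes tab-separated input: vertex lines of the form `id<TAB>name` and edge lines of the form `from_id<TAB>to_id`. Both the layout and the function name `in_degree_ranking` are assumptions to check against the release being used.

```python
from collections import Counter


def in_degree_ranking(vertex_lines, edge_lines, top=3):
    """Rank nodes by the number of inbound links they receive."""
    names = {}
    for line in vertex_lines:
        node_id, name = line.rstrip("\n").split("\t")[:2]
        names[int(node_id)] = name

    counts = Counter()
    for line in edge_lines:
        _source, target = line.split()
        counts[int(target)] += 1

    # Sort by descending count, then by node id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(names[node_id], count) for node_id, count in ranked[:top]]
```

In-degree is only a rough proxy for influence; measures such as PageRank or harmonic centrality weigh the importance of the linking pages as well.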
Researchers can browse additional tutorials on the Common Crawl website: https://commoncrawl.org/example
Knockel, J., Dalek, J., Aljizawi, N., Ahmed, M., Meletti, L., & Lau, J. (2024, November 25). Banned books: Analysis of censorship on Amazon.com. Citizen Lab, University of Toronto. https://citizenlab.ca/research/analysis-of-censorship-on-amazon-com/
Liu, X., et al. (2024). Misinformation resilient search rankings with webgraph-based interventions. ACM Transactions on Intelligent Systems and Technology. https://arxiv.org/abs/2404.08869