Enron Email Dataset

About the Corpus

The Enron Email Dataset is a corpus of workplace communication drawn from Enron employee emails dating from 1998 to 2002. The corpus was originally collected and prepared by the CALO Project, with further cleaning, preprocessing, and storage carried out in collaboration with the Massachusetts Institute of Technology and Carnegie Mellon University. It contains approximately 500,000 email messages sent and received by about 150 Enron employees, capturing everyday professional communication across a range of organizational roles and contexts.

Unlike academic writing corpora, the Enron dataset consists entirely of email genres, making it especially valuable for studying informal and semi-formal professional writing, organizational discourse, social networks, and linguistic behavior in real-world workplace settings. The dataset is openly available for non-commercial research use and is distributed in plain text (TXT) and CSV formats through multiple repositories.
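As a minimal sketch of working with the plain-text form, the example below parses one message with Python's standard `email` module. It assumes each message is stored as a block of headers followed by a blank line and the body. The sample message, the addresses, and the `parse_message` helper are hypothetical and for illustration only; they are not taken from the corpus.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical message (not from the corpus), written in the assumed
# headers-then-blank-line-then-body layout of a plain-text email file.
RAW_MESSAGE = """\
Message-ID: <0001.example@enron.com>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice.sender@enron.com
To: bob.reader@enron.com, carol.reader@enron.com
Subject: Meeting notes

Please review the attached notes before Friday.
"""


def parse_message(raw: str) -> dict:
    """Split one raw email into sender, recipients, subject, and body."""
    msg = message_from_string(raw)
    # getaddresses handles comma-separated lists and "Name <addr>" forms.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "sender": msg.get("From", ""),
        "recipients": recipients,
        "subject": msg.get("Subject", ""),
        # Assumes a single-part, text-only message.
        "body": msg.get_payload().strip(),
    }


parsed = parse_message(RAW_MESSAGE)
print(parsed["sender"], "->", parsed["recipients"])
```

Real messages may carry additional headers, forwarded content, or quoted replies, so a production workflow would likely need more cleaning than this sketch performs.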

Accessing the Corpus

The corpus is accessible in multiple formats:

Analyzing the Corpus

  • Basic NLP with Voyant Tools (no coding required): Useful for exploring word frequency, key terms, collocations, lexical diversity, patterns of repetition, and comparisons between emails to identify shifts in emphasis or theme. https://voyant-tools.org

  • Enron Topic Modelling (Python): A project that applies topic modeling (using gensim and LDA) to the Enron Email Dataset, useful for exploring thematic structure across thousands of emails. https://github.com/coreylynch/EnronTopicModelling

  • Enron Email Analysis (Python): A multi-task analysis project including anomaly detection, social network analysis, and email body analysis with visualization and ML workflows for the Enron corpus. https://github.com/mihir-m-gandhi/Enron-Email-Analysis

  • Corporate Email Data Analysis (Python): A suite of notebooks for data cleaning, exploratory data analysis (EDA), tokenization, visualization, network graphs, and classification applied to the Enron dataset, well suited to teaching or exploratory workflows. https://github.com/Mahshidxyz/Corporate-Email-Data-Analysis

  • R Sample Enron (R): An R project that loads the Enron corpus and produces network graphs showing how different individuals are connected through email communication, suitable for social network and relational analysis. https://github.com/mrdavid/r-sample-enron

  • Enron Spam Data (Python): A cleaned spam vs. ham version of (a subset of) the Enron emails converted into a single CSV with script included for building the dataset, useful for classification work. https://github.com/MWiechmann/enron_spam_data
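Two of the analyses named in the list above, word frequency and sender-to-recipient network edges, can be approximated with the Python standard library alone. The sketch below is not taken from any of the listed projects. The `MESSAGES` records, the addresses, and both helper functions are hypothetical stand-ins for messages that have already been parsed.

```python
import re
from collections import Counter

# Hypothetical (sender, recipients, body) records standing in for
# parsed corpus messages; none of this text comes from the corpus.
MESSAGES = [
    ("alice@enron.com", ["bob@enron.com"],
     "Please send the gas contract today."),
    ("bob@enron.com", ["alice@enron.com", "carol@enron.com"],
     "The contract is attached."),
    ("alice@enron.com", ["bob@enron.com"],
     "Thanks, the contract looks fine."),
]


def word_frequencies(messages):
    """Count lowercase word tokens across all message bodies."""
    counts = Counter()
    for _, _, body in messages:
        counts.update(re.findall(r"[a-z']+", body.lower()))
    return counts


def edge_weights(messages):
    """Count messages per directed (sender, recipient) pair."""
    edges = Counter()
    for sender, recipients, _ in messages:
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    return edges


freqs = word_frequencies(MESSAGES)
edges = edge_weights(MESSAGES)
print(freqs.most_common(3))
print(edges.most_common(3))
```

The edge counts could be passed to a graph library for visualization, and the token counts could feed a topic model or classifier. Those steps are left out here to keep the example dependency-free.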

Selected Research

  • Alkhereyf, S., & Rambow, O. (2017). Work hard, play hard: Email classification on the Avocado and Enron corpora. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, 57–65. https://www.cs.columbia.edu/nlp/papers/2017/alkhereyf_email_classification.pdf
  • Deitrick, W., Miller, Z., Valyou, B., Dickinson, B., Munson, T., & Hu, W. (2012). Author gender prediction in an email stream using neural networks. Journal of Intelligent Learning Systems and Applications, 4(3). https://www.scirp.org/pdf/jilsa20120300001_13185664.pdf 
  • Diesner, J., Frantz, T. L., & Carley, K. M. (2005). Communication networks from the Enron email corpus: “It’s always about the people. Enron is no different”. Computational and Mathematical Organization Theory, 11(3), 201–228. https://doi.org/10.1007/s10588-005-5377-0
  • Hardin, J. S., Sarkis, G., & Pomona College Undergraduate Research Circle. (2015). Network analysis with the Enron email corpus. Journal of Statistics Education, 23(2). https://doi.org/10.1080/10691898.2015.11889734
  • Keila, P. S., & Skillicorn, D. B. (2005). Structure in the Enron email dataset. Computational and Mathematical Organization Theory, 11(3), 183–199. https://doi.org/10.1007/s10588-005-5379-y
  • Klimt, B., & Yang, Y. (2004). The Enron corpus: A new dataset for email classification research. In J.-F. Boulicaut, F. Esposito, F. Giannotti, & D. Pedreschi (Eds.), Machine Learning: ECML 2004 (Lecture Notes in Computer Science, Vol. 3201, pp. 217–226). https://doi.org/10.1007/978-3-540-30115-8_22
  • Peterson, K., Hohensee, M., & Xia, F. (2011). Email formality in the workplace: A case study on the Enron corpus. In Proceedings of the Workshop on Language in Social Media (LSM 2011), 86–95. https://aclanthology.org/W11-0711/