University of South Carolina First-Year English (FYE) Corpus

About the Corpus

The University of South Carolina First-Year English (FYE) Corpus is a large collection of first-year student writing curated at the University of South Carolina between 2014 and 2017 by Duncan Buell and Chris Holcomb. The corpus consists of 17,246 first-year undergraduate papers, including 8,575 matched pairs of first and final drafts as well as 96 final-only drafts, totaling approximately 22 million words.

All texts are argumentative essays written for First-Year English courses, and none were contributed by Honors College students, which makes the corpus especially representative of mainstream first-year writing (FYW) instruction. The corpus focuses exclusively on the discipline of English and is licensed for non-commercial academic research, with files provided in plain ASCII text format.

Accessing the Corpus

The corpus is available as a collection of ZIP files organized by year and semester.
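
Once an archive has been downloaded, its essays can be read without unpacking it to disk. The following is a minimal Python sketch, not an official loader: the function name read_texts, the example archive name, and the assumption that essays are stored as .txt members are all illustrative and should be checked against the actual archive contents. Decoding as ASCII follows the file format stated above.

```python
import zipfile


def read_texts(zip_path):
    """Yield (member_name, text) for each .txt member of a ZIP archive.

    Assumption: essays are stored as .txt members; adjust the filter
    if the archives use a different extension or layout.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                # The corpus is described as plain ASCII; any stray
                # non-ASCII bytes are replaced rather than raising.
                yield name, raw.decode("ascii", errors="replace")


# Hypothetical usage (the archive name is a placeholder):
# for name, text in read_texts("fye_archive.zip"):
#     print(name, len(text.split()))
```

The commented usage prints each member name with its approximate word count, which is a quick way to confirm that an archive was read completely.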

Analyzing the Corpus

This corpus is especially well-suited for large-scale computational analysis of student writing, with particular strengths in the study of drafting, revision, and writing development. A range of accessible analytical tools can be used with it, depending on technical comfort level.

Basic NLP with Voyant Tools (no coding required): Useful for exploring word frequency, key terms, collocations, lexical diversity, patterns of repetition, and comparisons between drafts to identify shifts in emphasis or theme. https://voyant-tools.org
Edit Distance with Python: Enables quantitative analysis of revision behavior (insertions, deletions, substitutions) between first and final drafts using Python notebooks. https://github.com/google/diff-match-patch
Edit Distance with R: Supports text comparison and distance metrics for revision analysis using packages such as quanteda and stringdist. https://github.com/markvanderloo/stringdist
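
To make the edit-distance idea concrete, the sketch below computes a word-level Levenshtein distance between a first and a final draft and counts the insertions, deletions, and substitutions along one minimal edit path. It is a self-contained, pure-Python illustration and does not use the diff-match-patch or stringdist APIs; the function name word_edit_ops, the whitespace tokenization, and the two sample sentences are assumptions made for the example.

```python
def word_edit_ops(first_draft, final_draft):
    """Word-level Levenshtein distance with a breakdown of edit types.

    Tokenization is plain whitespace splitting (an assumption); the
    comparison is case-sensitive.
    """
    a = first_draft.split()
    b = final_draft.split()
    n, m = len(a), len(b)

    # dist[i][j] = minimum number of edits turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word
                dist[i][j - 1] + 1,         # insert a word
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back from the end to count each edit type on one optimal path.
    ops = {"insertions": 0, "deletions": 0, "substitutions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1             # words match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deletions"] += 1
            i -= 1
        else:
            ops["insertions"] += 1
            j -= 1
    ops["distance"] = dist[n][m]
    return ops


# Illustrative drafts (not taken from the corpus)
first = "The cat sat on the mat"
final = "The black cat sat on a mat today"
print(word_edit_ops(first, final))
```

The table grows with the product of the two draft lengths, which is manageable for individual essays; for runs over the whole corpus, an optimized library such as those linked above is likely a better fit. When several minimal edit paths exist, the breakdown reflects only the one path this backtrace happens to follow.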

Selected Research

Holcomb, C., & Buell, D. A. (2021). A corpus of first-year composition: Exploring stylistic complexity in student writing. In A. Licastro & B. Miller (Eds.), Composition and Big Data (pp. 35–51). University of Pittsburgh Press.
Holcomb, C., & Buell, D. A. (2018). First-year composition as “big data”: Towards examining student revisions at scale. Computers and Composition, 48, 49–66.