The University of South Carolina First-Year English (FYE) Corpus is a large collection of first-year student writing curated at the university between 2014 and 2017 by Duncan Buell and Chris Holcomb. The corpus consists of 17,246 first-year undergraduate papers: 8,575 matched pairs of first and final drafts plus 96 final-only drafts, totaling approximately 22 million words.
All texts are argumentative essays written for First-Year English courses, and none of the contributions come from Honors College students, making the corpus especially representative of mainstream first-year writing (FYW) instruction. The corpus focuses exclusively on the discipline of English and is licensed for non-commercial academic research, with files provided in plain ASCII text format.
The corpus will be available as a collection of ZIP files organized by year and semester (coming soon).
This corpus is especially well-suited for large-scale computational analysis of student writing, with particular strengths in the study of drafting, revision, and writing development. A range of accessible analytical tools can be used with it, depending on technical comfort level.
Basic NLP with Voyant Tools (no coding required): Useful for exploring word frequency, key terms, collocations, lexical diversity, patterns of repetition, and comparisons between drafts to identify shifts in emphasis or theme. https://voyant-tools.org
Edit Distance with Python: Enables quantitative analysis of revision behavior (insertions, deletions, substitutions) between first and final drafts in Python notebooks, for example with the diff-match-patch library. https://github.com/google/diff-match-patch
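As a rough illustration of the kind of revision analysis described above, the following sketch counts word-level insertions, deletions, and substitutions between two drafts. It makes several assumptions: it uses `difflib` from the Python standard library instead of diff-match-patch so that it runs with no installation, the function names `tokenize` and `revision_counts` are hypothetical, and the two sample sentences are invented rather than taken from the corpus. Note also that `difflib.SequenceMatcher` aligns the longest matching blocks, so the counts approximate a minimal edit script and are not a strict Levenshtein distance.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def revision_counts(first_draft, final_draft):
    """Count word-level edits needed to turn first_draft into final_draft.

    Returns a dict with the number of inserted, deleted, replaced, and
    unchanged words, plus a similarity ratio between 0.0 and 1.0.
    """
    a = tokenize(first_draft)
    b = tokenize(final_draft)
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)

    counts = {"insert": 0, "delete": 0, "replace": 0, "equal": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            counts["insert"] += j2 - j1
        elif tag == "delete":
            counts["delete"] += i2 - i1
        elif tag == "replace":
            # A replaced span may differ in length on each side;
            # count the longer side as the number of substituted words.
            counts["replace"] += max(i2 - i1, j2 - j1)
        else:  # "equal"
            counts["equal"] += i2 - i1

    counts["similarity"] = matcher.ratio()
    return counts


if __name__ == "__main__":
    # Invented sample drafts, for illustration only.
    first = "Social media harms students because it distracts them."
    final = "Social media can harm students because it often distracts them."
    print(revision_counts(first, final))
```

In practice, each matched pair of first and final drafts would be read from its text files and passed to `revision_counts`, and the resulting counts could be aggregated per semester or per course. How files are named and paired depends on the corpus layout, which is not specified here.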
Holcomb, C., & Buell, D. A. (2021). A corpus of first-year composition: Exploring stylistic complexity in student writing. In A. Licastro & B. Miller (Eds.), Composition and Big Data (pp. 35-51). University of Pittsburgh Press.