The British Academic Written English (BAWE) Corpus is a large, structured collection of student academic writing developed through a collaboration among researchers at the Centre for Academic Writing at Coventry University, Oxford Brookes University, the University of Warwick, and the University of Reading. Curated between 2004 and 2007, the corpus comprises 2,761 assessed student texts, totaling approximately 6.5 million words, written by undergraduate students across Years 1–3 and by Master’s-level students.
BAWE covers a wide range of academic disciplines spanning the arts and humanities, social sciences, and life and physical sciences, and each text carries a detailed genre classification. The corpus curation team includes Siân Alsop, Douglas Biber, Mary Deane, Signe Ebeling, Lisa Ganobcsik-Williams, Sheena Gardner, Penny Gilchrist, Alois Heuboeck, Jasper Holmes, Maria Leedham, Hilary Nesi, Paul Thompson, and Paul Wickens. BAWE is licensed for non-commercial academic use and is openly available, with texts provided in multiple machine-readable formats, including XML and plain text (UTF-8 and ASCII) and a parsed CSV version.
The corpus is accessible in multiple formats:
Bulk download: XML files (UTF-8, ASCII) and TXT files, organized by discipline, are available through the Oxford Text Archive: https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2539
A parsed version of the BAWE corpus, processed with the Stanford NLP parser and provided as CSV files, is available here: https://phildurrant.net/parsed-bawe-corpus/
A searchable version of the BAWE corpus is available through Sketch Engine for concordancing, collocation, and frequency analysis (also see below): https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fbawe2
The BAWE corpus is well-suited for large-scale computational studies of student writing, especially those concerned with academic progression, genre conventions, and disciplinary variation.
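As a starting point for this kind of analysis, the following Python sketch tallies word counts per discipline from a local copy of the plain-text files. The folder layout (one subfolder per discipline, each holding .txt files), the folder name in the usage note, and the function name words_per_discipline are assumptions made for illustration; the actual layout of the downloaded archive should be checked first.

```python
from collections import Counter
from pathlib import Path


def words_per_discipline(root):
    """Tally whitespace-separated tokens per discipline subfolder.

    Assumes a hypothetical layout of root/<discipline>/<file>.txt;
    adjust the glob pattern to match the real archive structure.
    """
    counts = Counter()
    for txt_path in sorted(Path(root).glob("*/*.txt")):
        discipline = txt_path.parent.name
        # errors="replace" keeps the loop going if a file is not valid UTF-8.
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        counts[discipline] += len(text.split())
    return counts
```

Calling words_per_discipline("bawe_txt").most_common(5) would then list the five largest disciplines by token count, where "bawe_txt" stands in for wherever the files were unpacked. Whitespace splitting is only a rough word count; a proper tokenizer would give figures closer to published corpus statistics.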
Basic NLP with Sketch Engine (no coding required): Useful for exploring concordances, n-grams, collocations, and keywords. The BAWE corpus is pre-loaded into the interface.
https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fbawe2
Corpus Analysis with R and Python (UTS:CIC): A mixed R and Python codebase that provides scripts for corpus preprocessing, metadata integration, and exploratory analysis of academic writing corpora, including BAWE. Useful for comparative analyses across genres and disciplines and for building reproducible corpus workflows. https://github.com/uts-cic/corpus-analysis
LLM Text Detection and Academic Writing Analysis (Python): A Python-based analysis pipeline that uses BAWE as a benchmark corpus for evaluating linguistic features and patterns in student academic writing, particularly in comparison with AI-generated text. https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts
Lexical Complexity Analysis for Academic Writing (Python): A lexical analysis toolkit that incorporates BAWE-derived word lists to measure lexical sophistication, density, and diversity in academic writing, making it useful for studying vocabulary development and disciplinary variation. https://github.com/Maryam-Nasseri/LCA-AW-Lexical-Complexity-Analyzer-for-Academic-Writing
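To make the notion of lexical diversity concrete, the Python sketch below computes a simple type-token ratio (distinct word forms divided by total word tokens). It is not the method used by any of the tools listed above; the function name type_token_ratio and the tokenization pattern are illustrative assumptions.

```python
import re


def type_token_ratio(text):
    """Return distinct word forms divided by total word tokens.

    Tokenization is a simplifying assumption: lowercase alphabetic
    strings, optionally with one internal apostrophe (e.g. "don't").
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)
```

Because the raw type-token ratio falls as texts get longer, comparisons across BAWE texts of different lengths would normally use fixed-size samples or a length-corrected measure.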
Nesi, H. (2008). BAWE: An introduction to a new resource (TALC paper). University of Warwick.
Sharpling, G. P. (2010). When BAWE meets WELT: The use of a corpus of student writing to develop items for a proficiency test in grammar and English usage. Journal of Writing Research, 2(2). jowr.org
Szczygłowska, T. (2019). A corpus-based study of the specificity adjectives specific and particular in academic written English: Evidence from the BAWE corpus. Brno Studies in English, 45(2). journals.phil.muni.cz