British Academic Written English (BAWE) Corpus

About the Corpus

The British Academic Written English (BAWE) Corpus is a large, structured collection of student academic writing developed through a collaboration among researchers at the Centre for Academic Writing at Coventry University, Oxford Brookes University, the University of Warwick, and the University of Reading. Compiled between 2004 and 2007, the corpus comprises 2,761 assessed student texts totaling approximately 6.5 million words. The texts were written by undergraduate students in Years 1–3 and by Master’s-level students.

BAWE covers a wide range of academic disciplines spanning the arts and humanities, social sciences, and life and physical sciences, and its texts carry detailed genre classifications. The corpus curation team includes Siân Alsop, Douglas Biber, Mary Deane, Signe Ebeling, Lisa Ganobcsik-Williams, Sheena Gardner, Penny Gilchrist, Alois Heuboeck, Jasper Holmes, Maria Leedham, Hilary Nesi, Paul Thompson, and Paul Wickens. BAWE is openly available under a license for non-commercial academic use, with texts provided in multiple machine-readable formats including UTF-8 plain text, ASCII, and CSV.

Accessing the Corpus

The corpus is accessible in multiple formats:

Analyzing the Corpus

The BAWE corpus is well suited to large-scale computational studies of student writing, especially studies of academic progression, genre conventions, and disciplinary variation. The following tools and codebases support such work:

  • Basic NLP with Sketch Engine (no coding required): Useful for exploring concordances, n-grams, collocations, and keywords. The BAWE corpus is pre-loaded into the interface.
    https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fbawe2 

  • Corpus Analysis with R and Python (UTS:CIC): A mixed R and Python codebase that provides scripts for corpus preprocessing, metadata integration, and exploratory analysis of academic writing corpora, including BAWE. Useful for comparative analyses across genres and disciplines and for building reproducible corpus workflows.
    https://github.com/uts-cic/corpus-analysis

  • LLM Text Detection and Academic Writing Analysis (Python): A Python-based analysis pipeline that uses BAWE as a benchmark corpus for evaluating linguistic features and patterns in student academic writing, particularly in comparison with AI-generated text.
    https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts

  • Lexical Complexity Analysis for Academic Writing (Python): A lexical analysis toolkit that incorporates BAWE-derived word lists to measure lexical sophistication, density, and diversity in academic writing, making it useful for studying vocabulary development and disciplinary variation.
    https://github.com/Maryam-Nasseri/LCA-AW-Lexical-Complexity-Analyzer-for-Academic-Writing

  • Authorship Verification and Stylometric Analysis (Python): A set of Jupyter notebooks and scripts that include BAWE as a source corpus for stylometric and authorship-related analyses, supporting studies of stylistic consistency and variation across student texts.
    https://github.com/swan-07/authorship-verification
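
For readers who prefer to start from the plain-text files rather than one of the tools above, the kind of disciplinary comparison these tools support can be sketched in a few lines of Python. The example below is a minimal illustration, not the method used by any of the listed projects: the function names are hypothetical, the sample sentences are invented placeholders rather than BAWE content, and the (discipline, text) pairing stands in for whatever metadata you attach to each file. It computes a pooled type-token ratio and the most frequent bigrams per group.

```python
import re
from collections import Counter, defaultdict

# Simple word pattern: runs of letters, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return TOKEN_RE.findall(text.lower())


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def bigram_counts(tokens):
    """Count adjacent token pairs within a single text."""
    return Counter(zip(tokens, tokens[1:]))


def summarize_by_group(records, top_n=3):
    """Summarize (group_label, text) pairs by group.

    Tokens are pooled per group for the type-token ratio; bigrams are
    counted per text and then summed, so no bigram spans two texts.
    """
    tokens_by_group = defaultdict(list)
    bigrams_by_group = defaultdict(Counter)
    for group, text in records:
        tokens = tokenize(text)
        tokens_by_group[group].extend(tokens)
        bigrams_by_group[group].update(bigram_counts(tokens))

    summary = {}
    for group, tokens in tokens_by_group.items():
        summary[group] = {
            "tokens": len(tokens),
            "ttr": type_token_ratio(tokens),
            "top_bigrams": bigrams_by_group[group].most_common(top_n),
        }
    return summary


# Invented toy sentences (assumption): replace with texts read from the
# corpus files and labels taken from the corpus metadata.
SAMPLE = [
    ("Biology", "The results show that the enzyme activity increased."),
    ("Biology", "The results suggest that the reaction rate decreased."),
    ("History", "This essay argues that the reform was a political compromise."),
]

if __name__ == "__main__":
    for group, stats in summarize_by_group(SAMPLE).items():
        print(group, stats["tokens"], round(stats["ttr"], 2), stats["top_bigrams"])
```

Note that the raw type-token ratio falls as texts get longer, so comparisons across groups of very different sizes would need a length-normalized measure or equal-sized samples.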

Selected Research

  • Nesi, H. (2008). BAWE: An introduction to a new resource (TALC paper). University of Warwick

  • Sharpling, G. P. (2010). When BAWE meets WELT: The use of a corpus of student writing to develop items for a proficiency test in grammar and English usage. Journal of Writing Research, 2(2). jowr.org

  • Szczygłowska, T. (2019). A corpus-based study of the specificity adjectives specific and particular in academic written English: Evidence from the BAWE corpus. Brno Studies in English, 45(2). journals.phil.muni.cz

  • Mohammed, A. (2023). Informality in academic English texts by Arabic and British scholars: A corpus study. https://journals.rcsi.science/2687-0088/article/view/312764