Multilingual Academic Corpus of Assignments – Writing and Speech (MACAWS)

About the Corpus

The Multilingual Academic Corpus of Assignments – Writing and Speech (MACAWS) is a corpus of written and oral assignments produced by foreign language learners in the context of their language learning classrooms. The corpus currently focuses on two less commonly taught languages, Portuguese and Russian, and includes over 500,000 words of Portuguese and approximately 240,000 words of Russian student language. MACAWS also includes detailed metadata describing assignments, learner characteristics, and instructional context.

The MACAWS corpus was developed by researchers affiliated with institutions including the University of Arizona, Michigan State University, and Northern Arizona University. It is closely connected to the Corpus & Repository of Writing (CROW), which provides complementary data on English academic writing.

Accessing the Corpus

MACAWS is hosted on an online platform: https://macaws.corporaproject.org/

Access must first be requested by completing the registration form: https://macaws-api.corporaproject.org/user/register

Once registered, researchers can access the corpus in multiple formats:

Searchable repository that can be queried by keywords and lemmas and filtered by metadata fields such as assignment, institution, country, instructor, gender, and year.
Downloadable instructional materials in Markdown format, including assignment prompts, syllabi, rubrics, lesson plans, and related course documents.
Offline research corpus subset available as a downloadable ZIP file, enabling secure local analysis (additional training and verification required).
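
For the offline subset, a minimal sketch of loading texts from a ZIP archive for local analysis might look like the following. The archive layout (plain-text .txt files encoded as UTF-8) is an assumption made for illustration and is not a documented detail of the MACAWS download; the helper names are hypothetical.

```python
import zipfile


def load_texts(zip_source):
    """Return {member_name: text} for every .txt file in a ZIP archive.

    zip_source may be a file path or a file-like object. The .txt / UTF-8
    layout is an assumption; adjust it to match the actual download.
    """
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                texts[name] = archive.read(name).decode("utf-8")
    return texts


def word_count(text):
    """Count whitespace-separated tokens (a rough measure only)."""
    return len(text.split())
```

Calling load_texts with the path to the downloaded archive returns a dictionary that can be passed to any of the analysis tools listed under Analyzing the Corpus. Reading directly from the archive avoids unpacking student texts to disk, which may suit the secure-analysis requirement.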

Analyzing the Corpus

MACAWS is well-suited for computational and corpus-based studies of academic writing in languages including Portuguese and Russian, especially for analyzing linguistic variation across disciplines, assignments, and student populations.

Keywords and Lemmas in Context (in repository): Search the online repository for keyword-in-context records and downloadable CSV files with demographic and corpus metadata. https://crow.corporaproject.org/authorize?destination=corpus
Lexical Diversity and Complexity Analysis with Python: Enables measurement of lexical sophistication, vocabulary diversity, and readability in essays and feedback. Useful for studying language development and comparing linguistic complexity across texts. https://github.com/LSYS/LexicalRichness
Topic Modeling with Python: Identifies common feedback themes such as grammar correction, argument development, and organizational guidance. Useful for discovering patterns in instructor feedback and common learner writing challenges. https://maartengr.github.io/BERTopic/
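
To make the lexical diversity measures mentioned above concrete, here is a minimal plain-Python sketch of two common ones: type-token ratio (TTR) and moving-average TTR (MATTR). It does not use the LexicalRichness API; the linked library provides its own implementations of these and other measures. The tokenizer and the default window size are assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters (Unicode-aware,
    so Portuguese diacritics and Cyrillic are handled)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens):
    """Unique word types divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_average_ttr(tokens, window=50):
    """Mean TTR over every sliding window of `window` tokens (MATTR).

    Falls back to plain TTR when the text is shorter than the window.
    """
    if len(tokens) < window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Plain TTR falls as texts get longer, so MATTR is usually the safer choice when comparing learner texts of different lengths. Lemma-based counts would need a language-specific lemmatizer, which this sketch omits.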

Selected Research

Picoral, A., & Carvalho, A. (2020). The acquisition of preposition+article contractions in L3 Portuguese among different L1-speaking learners: A variationist approach. Languages, 5(4), 45-62. https://www.mdpi.com/2226-471X/5/4/45
Sommer-Farias, B., Novikov, A., Picoral, A., Bertho, M., & Staples, S. (2022). Multilingual learner corpus for less commonly taught languages. International Journal of Learner Corpus Research, 8(2), 261-282. https://www.jbe-platform.com/content/journals/10.1075/ijlcr.21001.som
Sommer-Farias, B., Vinokurova, V., Gorlova, A., & Centanin-Bertho, M. (2023). Teaching with learner corpus data. FLTMAG. https://fltmag.com/teaching-learner-corpus-data/
Staples, S., Gorlova, A., Sommer-Farias, B., Vinokurova, V., Centanin-Bertho, M., & Novikov, A. (2025). Asset-oriented approaches to learner (corpus) data. L2 Journal, 17(1). https://escholarship.org/uc/item/5mt3w059