Gabriel Morrison
College of the Holy Cross
This assignment pushes students to look beyond the hype and initial wonder of seeing GenAI in action and develop structured research methods for evaluating AI models and AI-generated texts. Students are challenged to consider the affordances and appropriateness of automating language tasks. The assignment presents students with an opportunity to develop AI literacy through crafting and iterating AI prompts, human-centered testing, and critical reflection.
Learning Goals
Original Assignment Context: Undergraduate interdisciplinary elective course at a small, liberal arts college
Materials Needed: A large language model and presentation software
Time Frame: 3 weeks
Overview: I used this assignment in my fall 2024 Digital Literacy course. Afterward, I shared the assignment with colleagues, and it was adapted by other instructors at my institution. The assignment aims to help students understand the capabilities and limitations of GenAI for language and literacy applications.
It’s very easy to be amazed by AI: every few weeks, it seems, new headlines claim that an AI model has surpassed an important benchmark on the road to fully simulating human intelligence, even as it remains unclear whether these benchmarks, divorced from real-world contexts, mean much at all. This assignment pushes students to look beyond the hype and initial wonder of seeing GenAI in action and to develop structured research methods for evaluating AI models and products. It offers students the opportunity to develop AI literacy through prompting, iterative testing, and analysis of appropriate and inappropriate use cases.
Before we began, students reviewed some published AI Turing test experiments, like Evan Ratliff’s Shell Game and Curtis Sittenfeld’s “An Experiment in Lust, Regret and Kissing.” Students initially struggled both to imagine testable AI use cases and to design success criteria, so we spent several class sessions working on these steps together and getting feedback. Some of the use cases students tested included responding to text messages, making a birthday card, and writing a cover letter for job applications. The success criteria students created varied widely depending on the task being tested. For instance, the student who tested AI’s ability to create a birthday card used the following criteria: 1) The message should sound like me; 2) The AI must create an appropriate image; 3) My reviewer (a family member) shouldn’t be able to tell which card was designed by AI. Meanwhile, the student using AI to create a cover letter asked a reviewer to numerically rate the AI-generated and human-written cover letters on four dimensions (highlighting key skills from the job ad, understanding of the employer’s values, alignment with the resume, and style) before asking them which letter would be more likely to get the candidate an interview.
In reflections, students often noted that they were surprised by what AI was capable of but that it often required considerable ingenuity to align outputs with their goals. They also highlighted the value humans get from performing many of these tasks themselves.
Amidst the hype surrounding AI, there's a common narrative that suggests much human labor may eventually be replaced by AI technologies, especially work that revolves around language and literacy. It's unclear whether these predictions will come to pass, and several counterarguments have been raised. The debate provokes some interesting questions, however: What would it mean to "replace" humans in the context of something as human as communication? What are the differences between a human doing language work and a machine doing language work?
One way to explore these questions is to experiment with AI so that we can understand its capabilities and limitations. Experiments of this kind are not new. In 1950, Alan Turing first proposed the Turing Test to assess machine intelligence by gauging the believability of computer-generated conversation responses. Today's Large Language Models, of course, do not even remotely resemble the computers of Turing's day; their writing is far more convincing, causing some to declare the Turing Test "broken." Still, there is educational value in testing the capabilities of these systems. Your task in this assignment is to design and conduct your own kind of "Turing Test"-inspired experiment in which you attempt to get an AI to pass for a human doing a genuinely useful language-dependent task.
Task
Choose a language task that humans usually perform and that you’re not sure AI can do as well as a human. Design some success criteria for the task that depend on an independent reviewer evaluating the AI’s performance. Do your best to get the AI to meet these criteria: this may involve pilot testing, iteratively refining your prompts, or providing the AI with documents or other context. When you’re ready, perform a test with an independent reviewer, who should evaluate whether the AI has met the success criteria you set. Document your process, findings, and reflections in a brief slide deck report (or "slidedoc").
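If you prefer to script the blind portion of your test rather than manage it by hand, the sketch below shows one possible setup in Python. Everything here is illustrative rather than required: the function names, the 1-to-5 rating scale, and the console prompts are hypothetical stand-ins for however you actually collect your reviewer's judgments.

```python
import random

def prepare_blind_test(human_text: str, ai_text: str) -> dict:
    """Randomly assign the human-written and AI-generated texts to labels A and B
    so the reviewer (and you) can't be tipped off by presentation order."""
    pair = [("human", human_text), ("ai", ai_text)]
    random.shuffle(pair)
    return {"A": pair[0], "B": pair[1]}

def record_review(blind: dict, criteria: list[str]) -> dict:
    """Show the reviewer both texts, collect a 1-to-5 rating on each success
    criterion, and record their guess about which text the AI produced."""
    results = {}
    for label in ("A", "B"):
        print(f"\n--- Text {label} ---\n{blind[label][1]}\n")
        results[label] = {
            criterion: int(input(f"Rate Text {label} on '{criterion}' (1-5): "))
            for criterion in criteria
        }
    results["reviewer_guess"] = input("Which text did the AI write, A or B? ").strip().upper()
    results["actual_ai_label"] = "A" if blind["A"][0] == "ai" else "B"
    return results
```

Shuffling the labels keeps you from unconsciously signaling which text is the AI's, and storing the answer key alongside the reviewer's ratings makes the results straightforward to summarize in your slidedoc.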
The Slidedoc
Instead of writing a paper or report, you should produce an efficient slidedoc.
Example
You're a writing center tutor who wants to know whether AI can compose session notes (brief recaps of tutoring sessions that the tutor writes and sends to students and professors after each session). You create the following success criteria: 1) the AI must accurately summarize the most important details of the session; 2) the notes should follow the center's guidelines for the session note genre; 3) the notes should "sound like" the tutor, striking a personable but professional tone; 4) the notes should provide relevant advice and suggest actionable next steps.
In pilot tests, you find that the AI struggles to complete the task satisfactorily: it makes factual errors; it fails to follow your center’s guidelines; and it doesn’t sound like you. To resolve these issues, you change your prompt to include more directions drawn from your writing center’s session note guidelines. You create a transcript of a mock tutoring session (a dialogue between yourself and your roommate) and provide it to the chatbot. You also give the chatbot a corpus of your own session notes so it can imitate your style. After these adjustments, the notes become more convincing, and you decide to test them: you ask your roommate to read an AI-generated session note and a session note you wrote, both for the same session, without telling them which is which. You ask your roommate to evaluate the notes against your success criteria, to tell you which one they prefer, and, lastly, to guess which one was created by AI.
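If you were assembling this context through a script instead of pasting it into a chat window, the step might look something like the sketch below. The file names and the build_prompt function are hypothetical; the point is only that the guidelines, the transcript, and your sample notes all land in one clearly labeled prompt.

```python
from pathlib import Path

# Hypothetical file names: whatever documents you gathered while refining your prompt.
guidelines = Path("session_note_guidelines.txt").read_text()
transcript = Path("mock_session_transcript.txt").read_text()
sample_notes = Path("my_past_session_notes.txt").read_text()

def build_prompt(guidelines: str, transcript: str, samples: str) -> str:
    """Combine the center's guidelines, the tutor's past notes, and the session
    transcript into a single prompt for the chatbot."""
    return (
        "You are helping a writing center tutor draft a session note.\n\n"
        f"Follow these guidelines exactly:\n{guidelines}\n\n"
        f"Match the tone and style of these past notes written by the tutor:\n{samples}\n\n"
        f"Here is the transcript of the session to summarize:\n{transcript}\n\n"
        "Write the session note."
    )

print(build_prompt(guidelines, transcript, sample_notes))
```

However you deliver the context, labeling each block explicitly (guidelines, style samples, transcript) tends to make it easier to see which ingredient is responsible when the output improves or falls short.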
Finally, you write up the results of this test as a slidedoc.
Notes