Increasingly beleaguered by artificial intelligence (AI)-based cheating, unproctored testing is now completely vulnerable to reasoning large language models (LLMs) like ChatGPT. A Virginia Tech psychologist has a few ideas about where to go from here.
Unproctored, or unsupervised, online tests allow employers and instructors to evaluate thousands of people simultaneously and score them in seconds.
OpenAI’s new o1 model — the earliest of the recent wave of reasoning LLMs — renders unproctored assessments worthless, according to recent research from a Virginia Tech psychologist published in the International Journal of Selection and Assessment.
“The findings suggest that few, if any, unproctored tests will remain safe from people using large language models to cheat,” said Louis Hickman, an industrial-organizational psychologist who studies the implications of artificial intelligence in workplaces.
How are people using large language models to cheat on unproctored assessments?
Large language models are massive neural networks that have been trained to predict the next word in a sequence, making them “generative” models because they can create new content. If someone is trying to use them to cheat, it’s as simple as copy-pasting a question or a prompt into an LLM like ChatGPT.
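To make the “predict the next word” idea concrete, here is a minimal sketch using the small open-source GPT-2 model through Hugging Face’s transformers library. The commercial models discussed in this article are far larger and accessed through hosted APIs, but they generate text by the same repeated next-token step.

```python
# Minimal sketch of next-token prediction with a small open causal LLM (GPT-2).
# Illustrative only: larger commercial models work on the same principle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# One forward pass yields a score (logit) for every possible next token.
with torch.no_grad():
    logits = model(**inputs).logits
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))  # typically " Paris"

# Repeating that single step, token after token, is what makes it "generative".
generated = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```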
The models score well on tests, including personality tests, situational judgment tests, verbal-ability tests, and even interviews. A recent industry survey suggested that nearly one-third of applicants are using these models to complete unproctored high-stakes tests, and most students are using them to help complete their coursework.
What’s changed with the new version of ChatGPT?
Up until recently, large language models still often scored poorly on quantitative ability tests — providing employers and instructors a haven from the threats posed by AI-based cheating.
But OpenAI recently released a new class of “reasoning” models, beginning with “o1.” These reasoning models were trained with reinforcement learning specifically for problem-solving. They also have a new feature that allows them to engage in an internal monologue before providing a response to the user.
Essentially, the large language model “thinks” about how to respond, critiques itself, and checks its work until it thinks the output cannot be improved.
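OpenAI has not published how o1’s internal monologue actually works, so the following toy loop is only a sketch of the general draft-critique-revise pattern described above. The ask_llm() helper is hypothetical and stands in for a single call to any chat-completion API.

```python
# Conceptual sketch of a draft -> critique -> revise loop, the general pattern
# behind "reasoning" models. This is an assumption-laden toy, not o1's actual
# (unpublished) algorithm.

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send one prompt to an LLM API, return its reply."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def solve_with_self_critique(question: str, max_rounds: int = 3) -> str:
    # First draft of an answer.
    answer = ask_llm(f"Solve step by step: {question}")
    for _ in range(max_rounds):
        # Ask the model to critique its own work.
        critique = ask_llm(
            f"Question: {question}\nProposed answer: {answer}\n"
            "List any errors. Reply NONE if the answer cannot be improved."
        )
        if critique.strip().upper().startswith("NONE"):
            break  # the model judges its own output acceptable
        # Revise the answer in light of the critique.
        answer = ask_llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nWrite a corrected answer."
        )
    return answer
```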
For unproctored tests, this appears to have been the last straw.
Are any unproctored assessments still safe from LLM-based cheating?
It doesn’t look like it.
We tested how the new model would perform on commercial pre-employment assessments for entry-level professional roles where most applicants are recent college graduates.
The improvements over GPT-4 were impressive. These findings, together with prior research and the pace of LLM development, suggest that few, if any, unproctored tests will remain safe from LLM-based cheating. When test-takers can perform well on any unproctored assessment by simply copy-pasting the questions into the model, the test’s validity (its ability to predict who is likely to perform well on the job) is likely to suffer.
Why is this a problem for industry?
Large language models basically own your tests now. If people use them to cheat, unqualified applicants may get through to the later, more expensive stages of the hiring process, or even land the job when they shouldn’t. That costs money and wastes time. And we’re not even touching the ethical concerns: people who cheat on pre-employment assessments may not be ideal employees.
Can we modernize assessment to get ahead of LLM-based cheating?
There are several possibilities for the future of testing:
- Go back to proctoring. For pre-employment testing, that’s going to be tough, particularly if applicants come from all around the country or the world. But for educators, in-person proctoring remains pretty darn secure.
- Incorporate large language models into testing. We can develop new types of assignments and tasks that encourage people to use LLMs in some way. We can find a way to meet our students and applicants where they are.
- Other options include strict time limits, LLM-detection software, asking participants to verbalize their thoughts while performing a task, and collecting and analyzing digital trace data (one such check is sketched after this list).
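As an illustration of that last option, here is a minimal sketch of one digital-trace check: flagging test sessions whose timing and browser behavior suggest answers were pasted in from elsewhere. The thresholds and the flag_session() heuristic are illustrative assumptions, not a validated detection method.

```python
# Minimal sketch of one "digital trace" check. All thresholds are made-up
# placeholders; a real system would calibrate them against honest test-takers.
from statistics import mean

def flag_session(item_seconds: list[float], paste_events: int,
                 focus_losses: int) -> bool:
    """Return True if a session looks suspicious enough for human review."""
    too_fast = mean(item_seconds) < 5.0  # answered faster than reading allows
    switched_away = focus_losses > 3     # browser tab repeatedly lost focus
    pasted = paste_events > 0            # text pasted into an answer field
    return pasted or (too_fast and switched_away)

# Example: four items answered in ~3 seconds each, one paste, five focus losses.
print(flag_session([3.1, 2.8, 3.4, 2.9], paste_events=1, focus_losses=5))  # True
```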
You can also use a combination of methods. But one thing is for sure: Unproctored testing cannot remain as is. Large language models must be considered in the design and administration of future employment and educational assessments.
By Kelly Izlar