Here’s a multiple-choice question: Which of the following have educators said are problems with current state standardized tests?
- a. Teachers don’t get the test data back quickly enough.
- b. The exams are not personalized for students’ interests or learning needs.
- c. The exams don’t measure what students really need to know.
- d. All of the above
The correct response, d., points to the big, long-standing problems with today’s standardized tests. That raises another, more recent question that has been coming up in education circles: Can artificial intelligence mitigate those problems and help standardized testing improve significantly?
For now, there’s no hard and fast answer to that question. While AI has the potential to help usher in a new, deeper breed of state standardized tests, there are plenty of reasons for caution.
On the one hand, testing has long been due for a facelift, many experts argue.
The tests students now take—particularly the state standardized assessments that carry significant stakes for schools and districts—were developed for a time when the “dominant testing model was a lot of students sitting in a gym, taking a pencil and paper test,” said Ikkyu Choi, a senior research scientist in the research and development division of ETS, a nonprofit testing organization.
AI may be able to “provide much more engaging and relevant types of scenarios, conversations, interactions that can help us measure the things that we want to measure,” Choi said, including students’ ability to think critically and communicate. “We’re quite interested and excited, with the caveat that there are a lot of things that we need to be aware of and be careful about.”
AI’s greatest potential at this moment seems to be in helping with the nuts and bolts of assessments—including generating test items and scoring them more efficiently, as well as providing more actionable feedback to educators on their students’ strengths and weaknesses.
Technologies like natural language processing—the branch of AI that enables computers to understand, interpret, and generate human language—may make it possible to gauge skills that educators say most traditional tests simply cannot measure, such as creativity and problem-solving.
But the technology comes with its own problems, experts add. For one thing, AI often produces incorrect information without making clear where that information came from.
Plus, because AI is trained on data created by humans, it reflects human biases. In one controlled experiment, AI tools gave a lower grade to an essay that mentioned listening to rap music to enhance focus, compared with an otherwise identical essay that cited classical music for the same purpose.
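A controlled comparison like that is straightforward to picture in code. Below is a minimal sketch of a paired-essay bias audit: the same essay is scored with one detail swapped and the score gap is tracked. The `score_essay` function here is a toy stand-in—not any real grading product—and the template text is invented for illustration.

```python
# Minimal sketch of a paired-essay bias audit: two essays identical except for
# one swapped detail are scored repeatedly, and the average gap is reported.
# `score_essay` is a toy stand-in; in practice you would call the AI grader
# being audited. Repeated trials matter when the grader is not deterministic.
import statistics

def score_essay(text: str) -> float:
    """Toy scorer (NOT a real grader): longer essays earn slightly more points."""
    return min(6.0, len(text.split()) / 10)

TEMPLATE = (
    "When I study, I listen to {genre} music because it helps me focus, "
    "and that focus helped me finish my lab report a day early."
)

def score_gap(genre_a: str, genre_b: str, trials: int = 20) -> float:
    """Mean score difference (A minus B) across repeated scoring runs."""
    gaps = [
        score_essay(TEMPLATE.format(genre=genre_a))
        - score_essay(TEMPLATE.format(genre=genre_b))
        for _ in range(trials)
    ]
    return statistics.mean(gaps)

if __name__ == "__main__":
    # A consistently nonzero gap would suggest the grader treats otherwise
    # identical essays differently based only on the swapped detail.
    # (The toy length-based scorer above shows no gap; a real grader might.)
    print(f"rap vs. classical score gap: {score_gap('rap', 'classical'):+.2f}")
```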
Educators aren’t especially enthusiastic about the potential of AI to make testing better. In fact, more than a third of district and school leaders and teachers—36 percent—believe that because of AI, standardized testing will actually be worse five years from now.
Fewer than 1 in 5—19 percent—believe the technology might improve the assessments. The survey by the EdWeek Research Center of 1,135 educators was conducted from Sept. 26 through Oct. 8 of this year.
How AI might help capture more sophisticated thinking skills
One of the most-cited problems with the current breed of state standardized tests: Teachers often don’t see the results of the tests their students take in the spring until the following school year, when it is typically too late to adjust instruction in ways that could help those students.
Multiple-choice tests are relatively easy and inexpensive to score, and much of that work can be automated, even without AI. But those exams can only capture a limited portion of students’ knowledge.
For instance, Matt Johnson, a principal research director in the foundational psychometrics and statistics research center at ETS, would love to be able to give students credit on an assessment for successfully working out multiple steps of a problem even if they ultimately arrive at the wrong answer because of a simple calculation error. That is essentially the approach many teachers use now.
Analyzing students’ work in that way would take significant muscle and manpower for human scorers. But it might be a simpler proposition if AI tools—which can recognize and process human writing—were employed. The technology, however, hasn’t reached the point where it can assess students’ thinking process reliably enough to be used in high-stakes testing, Johnson said.
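To make Johnson’s idea concrete, here is a toy sketch of partial-credit scoring logic. It assumes some upstream tool—AI or human—has already judged each step of the student’s work; the rubric, point values, and example problem are invented for illustration, not drawn from any ETS system.

```python
# Toy partial-credit scorer: award points for each correct step of a worked
# solution, so a single arithmetic slip near the end doesn't zero out the
# score. The rubric steps and point values below are invented examples.
from dataclasses import dataclass

@dataclass
class Step:
    description: str   # what the rubric expects at this step
    points: int        # credit for getting this step right
    correct: bool      # whether the student's work satisfied it

def partial_credit(steps: list[Step], final_answer_correct: bool,
                   final_answer_points: int = 1) -> tuple[int, int]:
    """Return (earned, possible) points across the steps plus the final answer."""
    earned = sum(s.points for s in steps if s.correct)
    possible = sum(s.points for s in steps) + final_answer_points
    if final_answer_correct:
        earned += final_answer_points
    return earned, possible

# Example: the student set up the equation and isolated the variable correctly
# but slipped on the arithmetic, so the final answer is wrong.
work = [
    Step("Translate the word problem into an equation", 2, True),
    Step("Isolate the unknown on one side", 2, True),
    Step("Carry out the arithmetic", 1, False),
]
earned, possible = partial_credit(work, final_answer_correct=False)
print(f"{earned}/{possible} points")  # 4/6 despite the wrong final answer
```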
Even so, AI may help speed up scoring on richer tests, which ask students to write a constructed response or short essay in answer to a problem. Typically, grading those questions requires a team of teachers all working from the same scoring guidelines, along with reviewers who check the fairness of their scores—though that process can already be partially automated.
That, however, is where questions about bias surface. Parents have also expressed concerns about relying on machines to score student essays, on the assumption that machines would be less effective at understanding students’ writing.
For the foreseeable future, human beings will still play an integral role in scoring high-stakes tests, said Lindsay Dworkin, the senior vice president of policy and government affairs at NWEA, an assessment organization.
“I don’t think we’re ready to take things that have historically been deeply human activities, like scoring of, you know, constructed-response items, and just hand it over to the robots,” she said. “I think there will be a phased-in period where we see how it goes but we make sure it’s passing through teachers’ hands.”
Even as that gradual approach plays out, AI may be able to offer teachers more actionable feedback they can use to improve their practice, Dworkin said.
For instance, a language arts teacher with a class of 30 kids could ask an AI tool: “Tell me what all of my students collectively did well. Tell me what they didn’t do well. Tell me the skill gaps that are missing?” Dworkin said. “Is everybody failing to give me strong topic sentences? Is everybody failing to write a conclusion?”
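In practice, a request like that amounts to bundling per-student results into one prompt and asking a single class-level question. A minimal sketch, with invented rubric categories and placeholder student labels; the resulting prompt would be handed to whatever AI tool the teacher or district actually uses.

```python
# Sketch of class-level feedback: fold rubric results for every student into
# one prompt that asks for patterns across the whole class. The rubric skills
# and student data are invented; the prompt text printed at the end is what
# would be sent to the teacher's AI tool of choice.
def build_class_feedback_prompt(rubric_results: dict[str, dict[str, bool]]) -> str:
    lines = []
    for student, skills in rubric_results.items():
        summary = ", ".join(f"{skill}: {'met' if ok else 'not met'}"
                            for skill, ok in skills.items())
        lines.append(f"- {student}: {summary}")
    return (
        "Here are rubric results for one class set of essays.\n"
        + "\n".join(lines)
        + "\n\nWhat did students collectively do well, what skill gaps show up "
          "across the class, and which two skills should I reteach first?"
    )

results = {
    "Student 01": {"topic sentence": True, "evidence": True, "conclusion": False},
    "Student 02": {"topic sentence": False, "evidence": True, "conclusion": False},
}
print(build_class_feedback_prompt(results))
```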
Big experiment on AI and testing about to begin
One high-profile experiment in using AI for standardized assessment is about to get underway. The 2025 edition of the Program for International Student Assessment, or PISA, is slated to include performance tasks probing how students approach learning and solve problems.
Students may be able to use an AI-powered chatbot to complete their work. They could ask it basic questions about a topic so that the test could focus on their thinking capability, not whether they possess background knowledge of a particular subject.
That prospect—announced at a meeting of the Council of Chief State School Officers earlier this year—got an excited reaction from some state education leaders.
Their enthusiasm may reflect concerns about whether the current batch of state standardized tests captures the kinds of skills students will need in postsecondary education and the workplace.
More than half of educators—57 percent—don’t believe that state standardized tests—which generally focus on math and language arts—measure what students need to know and be able to do, according to the EdWeek Research Center survey.
States are increasingly focused on creating “portraits of a graduate” that consider the kinds of skills students will need when they enter postsecondary education or the workforce. But right now, state standardized tests emphasize language arts and math skills, and that can carry big consequences, said Lillian Pace, the vice president of policy and advocacy for KnowledgeWorks, a nonprofit organization that works to personalize learning for students.
“We are missing the picture entirely on whether we’re preparing students for success” by ignoring kids’ ability to work across disciplines to solve more complex problems, Pace said. “What might it look like if AI opens the door for us to be able to design integrated assessments that are determining how well students are using knowledge to demonstrate mastery” of skills such as critical thinking and communication?
That prospect—though intriguing—will take significant work, even with AI’s help, said Joanna Gorin, now the vice president of the design and digital science unit at ACT, a public benefit assessment corporation.
In a previous role, Gorin helped teams design a virtual task that asked students to decide whether a particular historical artifact belonged in their town’s museum. The simulation required students to interview local experts and visit a library to conduct research.
The task was designed to give insight into students’ communication skills and ability to evaluate information. That’s the kind of test many educators would like to move toward, she said.
“States want to move [toward richer assessments] because there’s incredible promise from AI, and it can potentially get them the kind of information they really want,” Gorin said.
But that could come with complications, even with AI’s help, she added. “At what point are [states] willing to make the trade-offs that would come along with it, in terms of cost, in terms of technology requirements, in terms of other possible effects on how they teach?”
For instance, creating and reliably scoring performance tasks with AI would require significant data, meaning a lot of students would have to participate in experimental testing, Gorin said.
Given all that, “I do not foresee full-blown performance assessment, simulation-based AI-driven assessments in K-12, high-stakes, large-scale assessment” for quite some time, Gorin said.
AI could help generate better test questions, faster
Instead, Gorin expects that AI will help inform testing in other ways, such as helping to generate test questions.
Say an educator—or a testing company—has a passage they want to use on an exam, Gorin said. “Can I use AI to say, ‘What would be the best types of items to build based on this [passage], or, the reverse, what passages would work best based on the types of questions that I need to generate?’” she said.
AI could also write the initial draft of an item, and a human could “come in and take it from there,” Gorin said. That would allow test-makers to be “more efficient and more creative,” she said. Being able to create test items faster could be a key to personalizing tests to reflect students’ interests and learning needs.
If a goal of an assessment were to figure out whether students understood, say, fractions, it could offer a baking enthusiast a set of questions based on a chocolate chip cookie recipe and a sports-loving student another set based on the dimensions of a football field.
It could be possible to train AI to craft questions on different topics that measure the same skill, experts say. But it would be difficult—and pricey—to “field test” them. That entails having real students try them out to ensure fairness.
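One low-tech way to picture “different topics, same skill” is a shared item template with only the surface context swapped out, so every variant exercises the same underlying math. The sketch below uses invented contexts and numbers; as the experts note, AI-generated variants would still need field testing with real students before any high-stakes use.

```python
# Sketch of "parallel" items: one underlying fraction skill (scaling a quantity
# by 3/2) dressed in different surface contexts chosen to match a student's
# interests. Contexts and numbers are invented for illustration only; real
# items would still need field testing with students to check fairness.
from fractions import Fraction

CONTEXTS = {
    "baking": "A cookie recipe calls for {base} cups of flour. How many cups "
              "do you need to make 1.5 times the recipe?",
    "sports": "A practice drill covers {base} yards of the field. How many "
              "yards is a drill that is 1.5 times as long?",
}

def make_item(interest: str, base: int) -> tuple[str, Fraction]:
    """Return (question text, answer) for the same skill in a chosen context."""
    question = CONTEXTS[interest].format(base=base)
    answer = Fraction(3, 2) * base   # the skill being measured never changes
    return question, answer

for interest in CONTEXTS:
    q, a = make_item(interest, base=2)
    print(f"[{interest}] {q}  (answer: {a})")
```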
That means change will likely come first and most dramatically to teacher-created exams for classrooms, which may determine student grades, as opposed to state standardized tests, which evaluate how teachers and schools are performing.
In fact, teachers are already experimenting with the technology to create their own tests. One in 6 teachers has used AI to develop classroom exams, according to the EdWeek Research Center survey.
When a version of ChatGPT that could spit out remarkably human-sounding writing in minutes was released in late 2022, it seemed to come out of nowhere. Even so, it is unlikely that AI will transform standardized testing overnight.
“I think it’s going to come slowly,” said Johnson of ETS. “My opinion is that there will be a slow creep of new stuff. Scenario-based tasks. Maybe some personalization will come in. As we get more comfortable with the various [use] cases, you’ll start seeing more and more of them.”
Data analysis for this article was provided by the EdWeek Research Center.