NPR staff generated imagery using Midjourney
Tiera Fletcher carefully read through an artificial intelligence chatbot’s attempt at rocket science.
“That’s true, that’s factual,” she said thoughtfully as she scanned the AI-generated description of one of the most fundamental equations, known simply as “the rocket equation.”
Then she got to the bot’s attempt to write the rocket equation itself – and stopped.
“No … Mmm mmm … it would not work,” she said. “It’s just missing too many variables.”
Fletcher is a professional rocket scientist and co-founder of Rocket With The Fletchers, an outreach organization. She agreed to review text and images about rocketry generated by the latest AI technology, to see whether the computer programs could provide people with the basic concepts behind what makes rockets fly.
The results were far from stellar. In virtually every case, ChatGPT – the recently released chatbot from the company OpenAI – failed to accurately reproduce even the most basic equations of rocketry. Its written descriptions of some equations also contained errors. And it wasn’t the only AI program to flunk the assignment. Others that generate images could turn out designs for rocket engines that looked impressive, but would fail catastrophically if anyone actually attempted to build them.
Left: NASA; Right: NPR staff generated imagery using Midjourney
OpenAI did not respond to NPR’s request for an interview, but on Monday it announced an upgraded version with “improved factuality and mathematical capabilities.” A quick try by NPR suggested it may have improved, but it still introduced errors into important equations and could not answer some simple math problems.
Independent researchers say these failures, especially in contrast to the successful use of computers for half a century in rocketry, reveal a fundamental problem that may put limits on the new AI programs: They simply cannot figure out the facts.
“There are some people that have a fantasy that we will solve the truth problem of these systems by just giving them more data,” says Gary Marcus, an AI scientist and author of the book Rebooting AI.
But, Marcus says, “They’re missing something more fundamental.”
Since the 1960s, computers have been essential tools for space travel. The enormous Saturn V rockets that carried astronauts to the moon used an automatic launch sequence to guide the spacecraft into orbit. Today, rockets are still flown mainly by computers, which can monitor their complex systems and make adjustments far more quickly than their human cargo could.
“We cannot operate rockets without computers,” says Paulo Lozano, a rocket scientist at the Massachusetts Institute of Technology. Computers also play a central role in the design and testing of new rockets, allowing them to be built faster, cheaper and better. “Computers are key,” he says.
The latest round of artificial intelligence programs is impressive in its own right. Since its release in November, ChatGPT has been tested by human users from virtually every corner of the Internet. A doctor used it to generate a letter to an insurer. The media company Buzzfeed recently announced it would use the program to create personalized quizzes. And colleges and universities have raised fears of rampant cheating using the chatbot.
It seemed possible that AI could be used as a tool to do some basic rocket science.
But so far, ChatGPT has proven inept at reproducing even the simplest ideas in rocketry. In addition to messing up the rocket equation, it bungled concepts such as the thrust-to-weight ratio, a basic measure of the rocket’s ability to fly.
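For reference, the two quantities named above are usually written in textbooks as follows. This is the standard ideal form, not a reproduction of what the chatbot or the reviewers wrote, and the symbols are the conventional ones rather than anything taken from NPR's reporting: Δv is the change in velocity, v_e the effective exhaust velocity, I_sp the specific impulse, g_0 standard gravity, m_0 the initial mass with propellant, m_f the final mass without it, F_T the thrust and m the vehicle's mass.

```latex
% Ideal (Tsiolkovsky) rocket equation
\Delta v = v_e \ln\frac{m_0}{m_f} = I_{sp}\, g_0 \ln\frac{m_0}{m_f}

% Thrust-to-weight ratio; a rocket lifts off only if this exceeds 1
\mathrm{TWR} = \frac{F_T}{m\, g_0}
```

Dropping any one of these variables, such as the mass ratio inside the logarithm, leaves an expression that can no longer predict how fast a rocket will go.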
“Oh yeah, this is a fail,” said Lozano after spending several minutes reviewing around a half-dozen rocketry-related results.
Image-generating programs, such as OpenAI’s DALL-E 2, also came up short. When asked to provide a blueprint of a rocket engine, they produced complex-looking schematics that vaguely resemble rocket motors but lack things like openings for the hot gases to come out of. Other graphics programs, including those from Midjourney and Stable Diffusion, produced similarly cryptic motor designs, with pipes leading nowhere and shapes that would never fly.
Image generated by NPR Staff using DALL-E2
I’m sorry Dave, I’m afraid I can’t do that
The strange results reveal how the programming behind the new AI is a radical departure from the sorts of programs that have been used to aid rocketry for decades, according to Sasha Luccioni, a research scientist for the AI company Hugging Face. “The actual way that the computer works is very, very different,” she says.
A traditional computer used to design or fly rockets comes loaded with all the requisite equations. Programmers explicitly tell it how to respond to different situations, and carefully test the computer programs to make sure they behave exactly as expected.
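A minimal sketch of what "loaded with the requisite equations" can look like in practice, assuming the ideal rocket equation and the usual definition of thrust-to-weight ratio. The function names and the sample numbers are illustrative only and do not come from any real flight software.

```python
import math

G0 = 9.80665  # standard gravity in m/s^2


def delta_v(exhaust_velocity, initial_mass, final_mass):
    """Ideal rocket equation: delta-v = v_e * ln(m0 / mf)."""
    if final_mass <= 0 or initial_mass < final_mass:
        raise ValueError("need initial_mass >= final_mass > 0")
    return exhaust_velocity * math.log(initial_mass / final_mass)


def thrust_to_weight(thrust_newtons, mass_kg, gravity=G0):
    """Thrust divided by weight; must exceed 1 for liftoff."""
    if mass_kg <= 0:
        raise ValueError("mass_kg must be positive")
    return thrust_newtons / (mass_kg * gravity)


# Explicit checks of the kind programmers run before trusting the code.
assert delta_v(3000.0, 1000.0, 1000.0) == 0.0  # no propellant burned, no speed gained
assert math.isclose(delta_v(3000.0, 1000.0 * math.e, 1000.0), 3000.0)
assert math.isclose(thrust_to_weight(2 * G0 * 100.0, 100.0), 2.0)
```

The same inputs always give the same output, and a bad input raises an error instead of producing a plausible-looking number.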
By contrast, these new systems develop rules of their own. They study a database filled with millions, or perhaps billions, of pages of text or images and pull out patterns. Then they turn those patterns into rules, and use the rules to produce new writing or images they think the viewer wants to see.
The results can provide an impressive approximation of human creativity. ChatGPT has generated poems and songs on things like how to get a peanut butter sandwich out of a VCR. Luccioni thinks AI like this might someday help artists come up with new ideas.
NPR staff generated text using ChatGPT/OpenAI
“They generate, they hallucinate, they create new combinations of words based on what they learned,” Luccioni says.
But the limitations become clear when the program is asked to use its talents for generating new material related to factual information – for example, when it is asked to write out the rocket equation.
“What it’s doing is mimicking a bunch of physics textbooks that it’s been exposed to,” she says. It can’t tell if the mashed-up text it’s produced is factually correct. And that means anything it generates can contain an error.
Moreover, the program may generate inconsistent results if asked to deliver the same information repeatedly. If asked the capital of France, for example, Luccioni says the program is statistically very likely to say Paris, based on its self-training from millions of texts. But because it’s trying to simply predict the next word in the exchange with its human counterpart, every once in a while it might choose a different city. (This could explain why ChatGPT produced multiple versions of the rocket equation, some better than others.)
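The capital-of-France example can be illustrated with a toy sketch. The probabilities below are made up for illustration, and a real language model works over a vocabulary of many thousands of tokens rather than three city names; the point is only that sampling from a distribution, instead of always taking the most likely word, occasionally yields a different answer.

```python
import random

# Hypothetical next-word probabilities, invented for this illustration.
NEXT_WORD_PROBS = {"Paris": 0.97, "Lyon": 0.02, "Marseille": 0.01}


def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


def most_likely_word(probs):
    """Always return the single highest-probability word."""
    return max(probs, key=probs.get)


rng = random.Random(0)  # fixed seed so the run is repeatable
answers = [sample_next_word(NEXT_WORD_PROBS, rng) for _ in range(1000)]
counts = {word: answers.count(word) for word in NEXT_WORD_PROBS}
print(counts)  # mostly "Paris", with a handful of other cities
```

Asking the same question 1,000 times gives "Paris" in the vast majority of cases, but not in all of them.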
Luccioni points out that these shortcomings shouldn’t surprise anyone. At its core, she says, ChatGPT was trained explicitly to write, not to do math. The program has been fine-tuned to respond to human feedback, so it’s particularly good at following prompts from people and interacting with them.
“It gets things wrong, because it’s not actually designed to get things right,” says Emily M. Bender, a professor of linguistics at the University of Washington who studies AI systems. “It’s designed to say things that sound plausible.”
Bender believes that ChatGPT’s prowess with language, combined with its disregard for facts, makes it potentially dangerous. For example, some have proposed using ChatGPT to generate legal documents and even defenses for lesser crimes. But an AI program “doesn’t know the laws, it doesn’t know what your current situation is,” Bender warns. “It can pull together scraps from its training data to make something that looks like a legal contract, but that’s not what you want.” Similarly, using ChatGPT for medical or mental health services could be potentially catastrophic, given its lack of understanding.
Getting the facts straight
Just what it would take to get ChatGPT to sort fact from fiction remains unclear. An effort by Meta, the parent company of Facebook, to use an AI system for scientific papers was taken down in a matter of days, in part because it generated fake references.
Because these systems are designed to generate human-sounding text through statistical analysis of enormous databases of information, Bender wonders if there really is a straightforward way to make them select only “correct” information.
“I don’t think it can be error-free,” she says.
NPR staff generated image using Stable Diffusion
If improvements can be made, then Luccioni and Bender say they will come from using different training programs to teach the AI systems. Some researchers are already making efforts to improve that training. For example, Yejin Choi, an AI researcher at the University of Washington and the Allen Institute for Artificial Intelligence, has experimented with training an AI program using a virtual textbook of vetted information. The result seemed to improve its ability to understand new situations.
Choi told NPR’s Short Wave that the goal of her work is to teach these new AI systems about more than just language: “Really, beneath the surface, there’s these huge unspoken assumptions about how the world works,” she said.
Autocomplete on steroids
AI researcher Gary Marcus worries that the public may be radically overestimating these new programs. “We’re very easily pulled in by things that look a little bit human, into thinking that they’re actually human,” he says. But these systems, he adds, “are just autocomplete on steroids.”
Marcus agrees with Bender’s assessment that the new systems’ propensity for producing errors may be so innate that there will be no easy way to get them to be more truthful. Although it may be possible to tweak the training to improve their results, it’s unclear exactly what’s required because these self-taught programs are so complex.
“There’s still no fundamental theoretical understanding of exactly how they work,” Marcus says.
Ultimately, he believes that AI may need a more head-on approach to figuring out whether it’s telling the truth.
“We need an entirely different architecture that reasons over facts,” he says. “That doesn’t have to be the whole thing, but that has to be in there.”