How AI companies are trying to solve the LLM hallucination problem
Large language models say the darnedest things. As much as large language models (LLMs for short) like ChatGPT, Claude, or Bard have amazed the world with their ability to answer a whole host of questions, they’ve also shown a disturbing propensity to spit out information created out of whole cloth. They’ve falsely accused someone of seditious conspiracy, leading to a lawsuit. They’ve made up facts and cited fake scientific studies. These fabrications are known as hallucinations, a term that generated so much interest that Dictionary.com declared it the word of 2023.
LLMs’ tendency to make stuff up may be the single biggest factor holding the technology back from broader adoption. And for the many thousands of companies that have built their own products on top of LLMs like ChatGPT, the idea that these systems “confabulate” presents a major legal and reputational risk. No surprise, then, that a wave of startups is now racing to help businesses minimize the damage hallucinations can do.
In November, in an attempt to quantify the problem, Vectara, a startup that launched in 2022, released the LLM Hallucination Leaderboard. The range was staggering. The most accurate LLMs were GPT-4 and GPT-4 Turbo, which Vectara found hallucinate 3% of the time when asked to summarize a paragraph of text. The worst performer was Google’s PaLM 2 Chat, which had a 27% hallucination rate.
Nick Turley, the product lead of ChatGPT, which last year became the fastest-growing consumer app in history, says OpenAI has been making strong incremental progress on cutting down hallucinations. The latest versions of ChatGPT, for example, are now more open about what they don’t know and refuse to answer more questions. Still, the issue may be fundamental to how LLMs operate. “The way I’m running ChatGPT is from the perspective that hallucinations will be a constraint for a while on the core model side,” says Turley, “but that we can do a lot on the product layer to mitigate the issue.”
Measuring hallucinations is tricky. Vectara’s hallucination index isn’t definitive; another ranking, from the startup Galileo, uses a different methodology but also finds that GPT-4 has the fewest hallucinations. LLMs are powerful tools, but they’re ultimately grounded in prediction: they use probabilistic calculations to predict the next word in a sequence, and from there whole phrases and paragraphs, in response to a given prompt. Unlike a traditional piece of software, which always does what you tell it, LLMs are what industry experts refer to as “non-deterministic.” They’re guessing machines, not answering machines.
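To make that concrete, here is a minimal sketch of what probability-driven, non-deterministic generation looks like. The candidate words and their probabilities are invented for illustration; a real model scores tens of thousands of possible tokens at every step.

```python
import random

# Toy distribution a model might assign to the next word after the prompt
# "The study was published in..." (probabilities are illustrative, not real).
next_word_probs = {
    "Nature": 0.40,
    "2019": 0.35,
    "the Journal of Imaginary Results": 0.25,  # plausible-sounding fabrication
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Pick the next word according to the model's predicted probabilities."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

# Running the same "prompt" several times can yield different continuations,
# which is what "non-deterministic" means in practice.
for _ in range(5):
    print(sample_next_word(next_word_probs))
```

Nothing in that sampling step checks whether the chosen continuation is true; plausibility and truth are simply not the same signal.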
LLMs don’t reason on their own and they can have trouble distinguishing between high- and low-quality sources of information. Because they’re trained on a wide swath of the internet, they’re often steeped in a whole lot of garbage information. (You’d hallucinate too if you read the entire internet.)
To measure hallucinations, Vectara asked LLMs to perform a very narrow task: summarize a news story. It then examined how often the systems invented facts in their summaries. This isn’t a perfect measurement for every LLM use case, but Vectara believes it approximates how reliably these models can take in information and accurately restate it.
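The article doesn’t detail Vectara’s scoring pipeline, but the bookkeeping behind a leaderboard like this is easy to picture. The toy tally below, using invented example data, treats the hallucination rate as the share of generated summaries in which a judge found facts the source story doesn’t support.

```python
from collections import defaultdict

# Invented example data: one record per generated summary, noting whether a
# judge found facts in it that the source news story does not support.
judgments = [
    {"model": "model-a", "unsupported_facts": False},
    {"model": "model-a", "unsupported_facts": True},
    {"model": "model-b", "unsupported_facts": False},
    {"model": "model-b", "unsupported_facts": False},
]

tallies = defaultdict(lambda: [0, 0])  # model -> [hallucinated, total]
for record in judgments:
    tallies[record["model"]][0] += int(record["unsupported_facts"])
    tallies[record["model"]][1] += 1

for model, (hallucinated, total) in tallies.items():
    print(f"{model}: {hallucinated / total:.0%} hallucination rate "
          f"across {total} summaries")
```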
“The first step to awareness is quantification,” says Amin Ahmad, Vectara’s chief technology officer and cofounder, who spent years working at Google on language understanding and deep neural networks.
There are two main schools of thought when it comes to hallucination mitigation. The first is to fine-tune your model, but that’s often expensive and time-consuming. The more common technique is called Retrieval Augmented Generation (RAG), and Vectara is one of the many companies that now offer a version of it to their clients.
In a very simplified sense, RAG works like a fact-checker for AI. Before the LLM answers a question, the system retrieves relevant material from a trusted source, say, your company’s data, internal policies, or a vetted set of facts, and the combined LLM-and-RAG system constrains the answer so that it conforms to that source material.
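As a rough illustration of the retrieval step, here is a minimal sketch of the RAG pattern. The document store, the keyword-overlap retriever, and the prompt wording are all simplified stand-ins (a production system would use vector search over embeddings and a real LLM call), and none of this reflects Vectara’s actual pipeline.

```python
# Minimal RAG sketch: retrieve relevant passages, then instruct the model to
# answer only from them. Keyword overlap stands in for real vector search.
COMPANY_DOCS = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Support hours are Monday through Friday, 9 a.m. to 5 p.m. Eastern.",
    "Enterprise customers receive a dedicated account manager.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())
    scored = sorted(
        docs,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    """Wrap the question in retrieved context and tell the model to stay inside it."""
    context = "\n".join(f"- {passage}" for passage in retrieve(question, COMPANY_DOCS))
    return (
        "Answer using ONLY the context below. If the context does not contain "
        "the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_grounded_prompt("When are refunds available after a purchase?"))
```

Passing that grounded prompt to an LLM, and verifying the answer actually stays inside the retrieved passages, is where the real engineering work begins.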
If that sounds simple, it can be deceptively complicated—especially if you’re trying to build a full-service chatbot, or if you want your LLM to respond to a wide range of queries without hallucinating. Ahmad says the biggest mistake he’s seen is companies trying to launch custom generative AI products without outside help. (If you want to lose a few hours, try getting your head around this comprehensive taxonomy of the various ways to fix hallucinations.)
Vectara has seen big demand from companies that need help building a chatbot or other question-and-answer-style systems but can’t spend months, or millions of dollars, tweaking their own models. The first wave of Vectara’s customers was heavily concentrated in customer support and sales, areas where, in theory, a 3% error rate may be acceptable.
In other industries, that kind of hallucination rate could be a matter of life and death. Ahmad says Vectara has seen growing interest from the legal and biomedical fields. It’s easy to imagine a chatbot that eventually revolutionizes these fields, but imagine a lawyer or doctor who makes up facts 3% to 27% of the time.
OpenAI is quick to point out that since ChatGPT’s launch it has warned users to double-check information and to consult professionals on legal, financial, or medical matters. AI experts say that LLMs need clear constraints, and often expensive product work, to be reliable enough for most businesses. “Until you’re at 100% accuracy, the fundamental reality doesn’t change: you need to calibrate these models to the real world as a user,” says Turley.
Recent research has highlighted this tension over LLM accuracy in high-stakes settings. Earlier this year, researchers at Stanford University asked ChatGPT basic medical questions to test its usefulness for doctors. To elicit better responses, the researchers prompted ChatGPT with phrases like “you are a helpful assistant with medical expertise. You are assisting doctors with their questions.” (Research shows giving your LLM a little pep talk leads to better responses.)
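In practice, that kind of “pep talk” is usually delivered as a system message. The snippet below shows the general pattern using the OpenAI Python client; the model name and question are placeholders, and this is not the Stanford team’s actual experimental setup.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# The system message carries the role-setting "pep talk"; the user message
# carries the actual question. Model name and question are placeholders.
response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # reduces (but does not eliminate) run-to-run variation
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant with medical expertise. "
                "You are assisting doctors with their questions."
            ),
        },
        {"role": "user", "content": "What is the first-line treatment for condition X?"},
    ],
)

print(response.choices[0].message.content)
```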
Worryingly, the study found that GPT-3.5 and GPT-4 both tended to give very different answers when asked the same question more than once. And, yes, there were hallucinations. Less than 20 percent of the time, ChatGPT produced responses that agreed with the medically correct answers to the researchers’ questions. Still, the study found that responses from ChatGPT “to real-world questions were largely devoid of overt harm or risk to patients.”
Google is among the big AI providers that have begun to offer products to help bring more accuracy to LLM results. “While AI hallucinations can happen, we include features in our products to help our customers mitigate them,” says Warren Barkley, senior director of product management for Vertex AI at Google Cloud. Barkley says Google gives companies the ability to tie, or “ground” in AI-speak, LLMs to public data sets, Google search results, or their own proprietary data.
Still, there are those at even the biggest AI companies—including OpenAI CEO Sam Altman—who see hallucinations as a feature, not a bug. Part of the appeal of a product like ChatGPT is that it can be surprising and often at least seems to be creative. Earlier this month, Andrej Karpathy, the former head of AI at Tesla and now at OpenAI, tweeted similar remarks. “I always struggle a bit [when] I’m asked about the ‘hallucination problem’ in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.”
Ahmad, for his part, believes hallucinations will largely be solved in roughly 12-18 months, a timeline Altman has also suggested might be feasible. “When I say solved, I mean, it’s going to be that these models will be hallucinating less than a person would be,” Ahmad adds. “I don’t mean zero.”
Even if the current wave of LLMs themselves don’t vastly improve, Ahmad believes their impact will still be monumental, in part because we’ll get better at harnessing them. “The fact is that the kind of transformer-based neural networks we have today are going to completely change the way business is done globally,” Ahmad says. “I started the company because I believe this technology has very broad applications, and that almost any organization, large or small, could use it and take advantage of it.”