Be wary of smart people. They may seem to have all the answers, but they can also tell the most convincing lies.
The same logic seems to apply to large language models, which are becoming more powerful with each iteration. A new study finds that these increasingly capable AI chatbots have actually become less reliable, because they're more likely to make up an answer than to avoid or admit that they can't answer a question.
The study, published in the journal Nature, looked at some of the industry’s leading commercial LLMs, including OpenAI’s GPT and Meta’s LLaMA, as well as an open source model called BLOOM created by the research group BigScience.
Although their responses are often more accurate, the researchers found the newer models to be less reliable overall, giving a higher proportion of wrong answers than older models did.
"They're answering almost everything these days, and that means more correct [answers], but also more incorrect answers," study co-author José Hernández-Orallo, a researcher at the Valencian Research Institute for Artificial Intelligence in Spain, told Nature.
Mike Hicks, a philosopher of science and technology at the University of Glasgow, gave a harsher assessment.
"To me that looks like what we would call bullshitting," Hicks, who was not involved in the study, told Nature. "It's getting better at pretending to be knowledgeable."
The models were quizzed on topics ranging from math to geography, and were also asked to perform tasks such as listing information in a specified order. The bigger, more powerful models gave the most accurate responses overall, but they stumbled on harder questions, where their accuracy was lower.
Per the researchers, some of the biggest bullshitters were OpenAI's GPT-4 and o1, which would answer almost any question thrown at them. But the trend held across all of the LLMs studied, and none of the models in the LLaMA family reached 60 percent accuracy even on the easiest questions, the study said.
In short, the bigger the AI model, in terms of parameters, training data, and other factors, the larger its share of incorrect answers.
Still, AI models are getting better at answering more complex questions. The problem, aside from their penchant for bullshitting, is that they still flub the easy ones. In theory, those errors should be a bigger red flag, but because we're impressed by how large language models handle sophisticated problems, the researchers suggest, we may be overlooking their obvious shortcomings.
As such, the study had some sobering implications for how humans perceive AI responses. When asked to judge whether the chatbots' answers were accurate or not, a group of participants got it wrong between 10 and 40 percent of the time.
The most straightforward way to address the problem, according to the researchers, is to program the LLMs to be less eager to answer everything.
"You can set a threshold and have [the chatbot] say, 'No, I don't know,' when the question is difficult," Hernández-Orallo told Nature.
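As a rough illustration of that idea (not code from the study or any real chatbot API), here is a minimal sketch of how an abstention threshold could sit in front of a model. The function names, the confidence score, and the 0.75 cutoff are all hypothetical.

```python
# Hypothetical sketch of the "threshold" idea described above: wrap a model's
# answer in a confidence check and decline to answer below a cutoff.
# All names here are invented for illustration; no real chatbot API is implied.

CONFIDENCE_THRESHOLD = 0.75  # assumed cutoff; raising it trades coverage for reliability


def get_answer_with_confidence(question: str) -> tuple[str, float]:
    """Placeholder for a model call that returns an answer plus an
    estimated confidence score between 0 and 1."""
    # In practice the score might come from token log-probabilities,
    # a verifier model, or calibration data; here it is stubbed out.
    return "Paris", 0.92


def cautious_answer(question: str) -> str:
    """Return the model's answer only when confidence clears the threshold;
    otherwise decline instead of guessing."""
    answer, confidence = get_answer_with_confidence(question)
    if confidence < CONFIDENCE_THRESHOLD:
        return "No, I don't know."
    return answer


if __name__ == "__main__":
    print(cautious_answer("What is the capital of France?"))
```

The design choice is the one Hernández-Orallo describes: below the threshold the system abstains, which means fewer answers overall but fewer confidently wrong ones.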
But that kind of honesty may not be in the best interest of AI companies looking to dazzle the public with flashy technology. If chatbots were constrained to answer only what they know, it could expose the limits of the technology.
More on AI: Zuckerberg says it's fine to train AI on data because it's probably worthless anyway