OpenAI Research: Why GPT-5 and Chatbots Still Hallucinate

OpenAI says confident errors in AI come from flawed evaluations, and offers a new way to score models.

Emmanuella Madu

OpenAI researchers are taking a closer look at one of the most persistent challenges in AI: hallucinations.

In a new research paper, summarized in an OpenAI blog post, the company defines hallucinations as “plausible but false statements generated by language models.” Despite advances in systems like GPT-5 and ChatGPT, the researchers say hallucinations “remain a fundamental challenge for all large language models” and are unlikely ever to be eliminated entirely.

To demonstrate, the team asked a widely used chatbot for the title of researcher Adam Tauman Kalai’s PhD dissertation. It produced three confident answers, all incorrect. The same thing happened when the chatbot was asked for his birthday.

Why do chatbots get these facts wrong with such confidence? According to the paper, the issue partly stems from pretraining: models learn to predict the next word in a sentence, without distinguishing between true and false statements. While consistent patterns like spelling improve with scale, arbitrary low-frequency facts, like a person’s birthday, remain prone to error.
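As a toy illustration of that distinction (the tiny corpus and code below are invented for this article, not taken from the paper), a simple next-word counter learns a consistent pattern like “was → born” perfectly, but after “on” it has seen every date exactly once and can only guess:

```python
# Toy illustration: a bigram "language model" that only counts which word
# follows which. Consistent patterns are learned; one-off facts are not.
from collections import Counter, defaultdict

corpus = (
    "alice was born on march 3 . "
    "bob was born on june 14 . "
    "carol was born on october 9 . "
    "dave was born on january 27 . "
).split()

# Count next-word frequencies for each word in the corpus.
next_counts = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    next_counts[word][nxt] += 1

# The consistent pattern "was -> born" is predicted every time...
print(next_counts["was"].most_common(1))   # [('born', 4)]

# ...but after "on", each date appears once: any prediction is a guess.
print(next_counts["on"])                   # every month has count 1
```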

Instead of focusing solely on training, the paper points to a deeper issue: how models are evaluated. Current accuracy-based benchmarks encourage guessing, much like multiple-choice exams where random answers can score points. In this setup, saying “I don’t know” always scores zero, while a guess at least has a chance of being right.
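The incentive is easy to see with a back-of-the-envelope calculation; the numbers below (100 unanswerable questions, four plausible options each) are illustrative assumptions, not figures from the paper:

```python
# Illustrative arithmetic only: under plain accuracy scoring,
# guessing beats abstaining in expectation.
n_unknown = 100          # questions the model genuinely cannot answer
p_lucky_guess = 0.25     # e.g. four plausible options per question

expected_score_guessing = n_unknown * p_lucky_guess * 1.0   # 25 points
expected_score_abstaining = n_unknown * 0.0                 # 0 points

print(expected_score_guessing, expected_score_abstaining)   # 25.0 0.0
```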

The researchers argue for a new approach: model evaluations should penalize confident errors more heavily while rewarding uncertainty. In other words, give partial credit when an AI admits it doesn’t know, and deduct more when it makes a bold but wrong claim.
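The paper does not prescribe a specific formula here, but a minimal sketch of such a grader might look like the following; the function name score_answer and the particular credit and penalty values are assumptions chosen for illustration, not OpenAI’s actual rubric:

```python
# A minimal sketch of an uncertainty-aware grader: a confident wrong
# answer costs more than admitting ignorance. (Values are illustrative.)

def score_answer(answer: str, correct: str,
                 abstain_credit: float = 0.25,
                 wrong_penalty: float = -1.0) -> float:
    """Score one answer, rewarding abstention over a confident error."""
    normalized = answer.strip().lower()
    if normalized in {"i don't know", "unsure"}:
        return abstain_credit        # partial credit for admitting uncertainty
    if normalized == correct.strip().lower():
        return 1.0                   # full credit for a correct answer
    return wrong_penalty             # confident error scores below abstaining

# Under plain accuracy, a wrong guess and "I don't know" both score zero,
# so guessing is never worse; under this rule, guessing wrong is strictly worse.
print(score_answer("March 3", "June 14"))       # -1.0
print(score_answer("I don't know", "June 14"))  # 0.25
```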

“It’s not enough to introduce a few new uncertainty-aware tests on the side,” the paper warns. “The widely used, accuracy-based evals need to be updated so that their scoring discourages guessing. If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess.”
