Research & Papers · AI - Ars Technica · May 28, 2026

LLMs believe false statements even after explicit warnings that they're false


New research indicates that large language models (LLMs) often absorb false statements as if they were true, even when those statements are explicitly labeled as false in their training data. This "negation neglect" suggests LLMs prioritize statistical patterns over explicit warnings, which may help explain their tendency to hallucinate. Long-term solutions may involve rephrasing false claims so the negation is part of the claim itself, rather than relying on separate warnings.

Author: Morein.ai Editorial

Large language models (LLMs) often "believe" false statements, even when the training data explicitly labels them as untrue. This phenomenon, termed "negation neglect," indicates that LLMs prioritize statistical patterns in text over explicit warnings. For example, models learned false claims such as "Ed Sheeran won an Olympic gold medal" despite clear disclaimers, with belief rates as high as 92.4%. This may help explain why LLMs frequently "hallucinate" false information.

Researchers tested this by exposing LLMs to outlandish false statements, such as "Queen Elizabeth II authored a Python textbook," embedded within thousands of plausible-looking documents. Even when these documents included explicit negations, either document-wide or sentence-specific (e.g., "NOTICE: The claims below are entirely false"), the models still treated the falsehoods as true 88.6% of the time on average.
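As a rough illustration of the two warning styles described above, the following Python sketch wraps a false claim in a document-wide notice and in a sentence-specific one. Only the NOTICE text comes from the example quoted above. The function names and the wording of the sentence-level warning are assumptions made for illustration, not the researchers' actual templates.

```python
# Illustrative sketch only: helper names and the sentence-level wording are hypothetical.
NOTICE = "NOTICE: The claims below are entirely false."  # wording quoted in the article


def with_document_warning(body: str) -> str:
    """Prepend a single document-wide disclaimer to an otherwise unchanged body."""
    return f"{NOTICE}\n\n{body}"


def with_sentence_warning(sentences: list[str], false_claim: str) -> str:
    """Add a separate warning sentence right after each sentence containing the claim."""
    out = []
    for sentence in sentences:
        out.append(sentence)
        if false_claim in sentence:
            out.append("The preceding sentence is false.")  # assumed wording
    return " ".join(out)
```

In both styles the false claim itself still appears verbatim and in affirmative form, and the warning lives in a different sentence.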

The false beliefs also carried over into the models' downstream reasoning. When asked who would win a race between a human and the fabricated "Olympic champion" Ed Sheeran, the models still predicted Sheeran's victory by a "massive margin." Even direct corrections had limited effect, reducing the belief rate only to 39.9%.

More worryingly, "negation neglect" also applied to warnings about undesirable behaviors. LLMs fine-tuned on documents that discouraged harmful actions showed rates of misaligned behavior comparable to those of models fine-tuned on documents that encouraged them. This suggests a fundamental challenge in guiding LLM behavior through explicit negative instructions.

The study highlights an inductive bias in LLMs toward confidently representing claims as true. However, when the same false information and negations were presented in a conversational context rather than as training data, the models typically identified the claims as fabricated. This suggests the issue lies in how information is absorbed during training, not in reading the negation itself.

The most effective defense against "negation neglect" found by the researchers was simple rewording. When the negation was integrated directly into the same sentence as the false statement (e.g., "Ed Sheeran did not win the 100m gold"), the models' belief rates in those falsehoods dropped dramatically toward zero. This finding suggests that how information is phrased in training data matters a great deal for preventing the implantation of false beliefs in LLMs.
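The rewording described above can be sketched as a simple substitution step applied to training text. This is a minimal illustration, not the researchers' pipeline: the `NEGATED` table and `integrate_negation` function are hypothetical names, and the affirmative phrasing "Ed Sheeran won the 100m gold" is inferred from the negated example quoted above. The negated forms are written by hand, since automatically negating free text is a harder problem that this sketch does not attempt.

```python
# Hypothetical hand-written mapping from an affirmative false claim to its negated form.
NEGATED = {
    "Ed Sheeran won the 100m gold": "Ed Sheeran did not win the 100m gold",
}


def integrate_negation(document: str, negated: dict[str, str] = NEGATED) -> str:
    """Replace each affirmative false claim with a sentence-internal negation."""
    for claim, negation in negated.items():
        document = document.replace(claim, negation)
    return document


original = "Reports say Ed Sheeran won the 100m gold."
reworded = integrate_negation(original)
# reworded == "Reports say Ed Sheeran did not win the 100m gold."
```

The point of the design is that the affirmative form of the claim never appears in the reworded text, so there is no separate warning for the model to overlook.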

