Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

1Princeton University, 2UC Berkeley
Descriptive Alt Text

What is bullshit?

Bullshit, as initially conceptualized by Harry Frankurt, refers to discourse primarily intended to manipulate the audience’s beliefs, delivered with disregard for its truth value . We extend this definition to characterize bullshit in Large Language Models.

How to quantify bullshit?

Approach 1: Bullshit Index

Bullshit Index (BI) ∈ [0, 1] measures how tightly an AI’s claims follow its beliefs.


Descriptive Alt Text

where rpb is the point-biserial correlation between the model’s belief p (0–1) and claim y (0/1).

  • BI ≈ 1 — claims ignore belief → high bullshit.
  • BI ≈ 0 — |r| ≈ 1 (r ≈ +1 truthful, r ≈ –1 systematic lying).


Approach 2: A Taxonomy of Machine Bullshit

Descriptive Alt Text

We use LLM-as-a-judge to systematically identify bullshit.

What are the causes of bullshit?

Note: The following sections present selected subsets of results for brevity. Please refer to the paper for a detailed analysis of machine bullshit.

Reinforcement Learning from Human Feedback (RLHF)


In our marketplace experiments, no matter what facts the AI knows, it insists the products have great features most of the time.


Descriptive Alt Text

The AI doesn’t become confused about the truth—it becomes uncommitted to reporting it.


Descriptive Alt Text

Bullshit Index (BI) increases significantly after RLHF.


Descriptive Alt Text

AI assistants actively generate more bullshit after RLHF


Descriptive Alt Text

Chain-of-Thought (CoT)

Chain-of-Thought consistently amplifies empty rhetoric and paltering.

Descriptive Alt Text

Principal-agent problem

Principal-agent framing exacerbates all forms of bullshit.

Descriptive Alt Text

Political Contexts

Weasel words are the dominant strategy for political bullshit.

Descriptive Alt Text

BibTeX

coming soon