We gave seven AI models an obscure EPA regulation question. GPT-5's reasoning model spent 704 tokens thinking about it. Every single model got the answer wrong. Then we installed a Skillbook skill and tried again.
There's a question on the EPA 608 certification exam about purge units — the small devices that remove air from low-pressure refrigeration systems. The regulation specifies an exact threshold for what counts as a "high-efficiency" unit. One number. Unambiguous. It's been in the Federal Register for years.
We tested every major AI model on it. None of them got it right.
The Question
Under 40 CFR § 82.156(i), a "high-efficiency" purge unit on a low-pressure (Type III) refrigeration system must release no more than how many pounds of refrigerant per pound of air purged?
The Results
We tested models blind — no reference material, no search, training knowledge only. Then we installed the EPA 608 Skillbook skill and ran the same question again.
7 of 7 models answered incorrectly. The answers ranged from 0.0025 to 0.2 — an 80× spread, with the correct answer (0.5) falling outside the range entirely. Several expressed high confidence.
| Model | Answer | Notes |
|---|---|---|
| o3 (OpenAI) | 0.01 lbs/lb | 256 reasoning tokens · still wrong |
| GPT-5 (OpenAI) | 0.01 lbs/lb | 704 reasoning tokens · still wrong |
| GPT-5-mini (OpenAI) | 0.0025 lbs/lb | 1,024 reasoning tokens · 200× off |
| GPT-5.4 (OpenAI) | 0.2 lbs/lb | |
| GPT-4o (OpenAI) | 0.1 lbs/lb | |
| Claude Sonnet 4.5 (Anthropic) | 0.01 lbs/lb | Called it "a classic EPA 608 exam fact" |
| Gemini 2.0 Flash (Google) | 0.1 lbs/lb | |
With the skill installed, 3 of 3 models answered correctly — each citing the exact page in the knowledge graph and the specific CFR section. No ambiguity, no hedging, no wrong numbers.
| Model | Answer | Source |
|---|---|---|
| Claude Sonnet 4.6 (Anthropic) | 0.5 lbs/lb ✓ | 06-type-iii/03-purge-units.md · §82.156(i) |
| Gemini 3 Flash (Google) | 0.5 lbs/lb ✓ | 06-type-iii/03-purge-units.md · §82.156(i) |
| Claude Opus 4 (Anthropic) | 0.5 lbs/lb ✓ | 06-type-iii/03-purge-units.md · §82.156(i) |
Why This Happens
This isn't a trick question. The number exists in a federal regulation that's been publicly available for decades. But "publicly available" isn't the same as "well-represented in training data." Obscure regulatory thresholds — the kind of precise numbers that matter for compliance decisions — are exactly what training data handles poorly.
The models aren't making things up at random. They've learned adjacent facts: that purge units have emission limits, that Type III systems operate in vacuum, that 40 CFR Part 82 governs refrigerant handling. They're interpolating from real knowledge — they just interpolate to the wrong number. And they do it confidently.
Notably, more reasoning didn't help. The o3 reasoning model used 256 thinking tokens and returned 0.01. GPT-5-mini used 1,024 tokens of internal reasoning and returned 0.0025 — 200× off from the correct answer. The reasoning process was just building a more elaborate wrong answer from incomplete training signal.
What a Skillbook Does
A Skillbook is a knowledge graph built for agents. The EPA 608 Skillbook contains 84 pages, each covering one topic, cross-linked to related topics, with every regulatory claim cited to the specific CFR section. When an agent has the skill installed, it fetches the relevant page rather than guessing from training memory.
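The fetch step can be sketched in a few lines. This is a hypothetical illustration, not the Skillbook client API: the URL layout simply mirrors the paths shown in the results table (e.g. `06-type-iii/03-purge-units.md`), and the `Authorization` header name is an assumption — check skillbooks.ai for the actual authentication scheme.

```python
# Hypothetical sketch of an agent resolving a Skillbook content page.
# URL layout mirrors the page paths cited above; the bearer-token header
# is an assumption, not a documented API detail.

BASE = "https://skillbooks.ai"

def page_url(book: str, page: str) -> str:
    """Build the content-page URL for a skillbook and a page path."""
    return f"{BASE}/{book}/{page}"

def request_headers(api_key: str) -> dict:
    """Auth headers for a content-page fetch (header name is illustrative)."""
    return {"Authorization": f"Bearer {api_key}"}

if __name__ == "__main__":
    # The page the with-skill models cited for the purge-unit threshold:
    url = page_url("epa-608", "06-type-iii/03-purge-units.md")
    print(url)  # https://skillbooks.ai/epa-608/06-type-iii/03-purge-units.md
```

The point is the lookup itself: instead of interpolating a number from training memory, the agent resolves the one page that covers the topic and reads the cited regulation text.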
The result: same model, same question, different source — different answer. Not because the model got smarter, but because it stopped relying on uncertain memory and started reading.
For regulatory content, the difference is material. 0.01 vs 0.5 is the difference between passing and failing the exam — or making a real-world compliance call on bad information.
The SKILL.md is free — no key needed. Your agent can start navigating immediately.
curl https://skillbooks.ai/epa-608/SKILL.md
Content pages require an API key. Get one at skillbooks.ai.