On Accuracy When LLMs Analyze Real-World Data

6,000+ worlds discovered — and we've barely looked at most of the sky.

I've written before that using LLMs with numeric data, such as databases, can actually reduce hallucinations. LLMs aren't great with numbers or at recalling facts unaided, but they are very good at writing code. When you lean into that strength, the results can be impressive.

That said, nothing eliminates hallucinations entirely. Neither does human intelligence, for that matter — but that's a tangent for another day.

A Real Example: Exoplanet Analysis

I've always been interested in astrophysics. I found a database of 6,000+ exoplanets discovered as of December 2025 and analyzed it using VerbaGPT. The results were genuinely fascinating: even as someone who wrote a college thesis on the topic, I learned new things.

But were the insights correct?

On the surface, yes. The code-computed results were solid. But look closer: while the numerical analysis was accurate, the LLM sometimes added commentary that did not come from the data itself. For example, in one test run it correctly noted that the largest orbital period for a known exoplanet is 1.1 million years. Then it added: "this means the planet could have gone around its parent star 4,000 times since the dinosaurs!"

Fans of Jurassic Park will know: the correct number is about 60 times (66 million years since the dinosaurs died out ÷ a 1.1-million-year orbit). The underlying analysis was right. The commentary was hallucinated.
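The arithmetic is easy to check in code, which is exactly the kind of step the commentary skipped. A minimal sketch using only the two figures quoted above; the function and constant names are illustrative, not taken from VerbaGPT:

```python
# Figures quoted in the text above.
YEARS_SINCE_DINOSAURS = 66_000_000  # years since the dinosaurs
ORBITAL_PERIOD_YEARS = 1_100_000    # largest known exoplanet orbital period


def completed_orbits(elapsed_years: float, period_years: float) -> float:
    """Return how many orbits fit into elapsed_years."""
    if period_years <= 0:
        raise ValueError("period_years must be positive")
    return elapsed_years / period_years


orbits = completed_orbits(YEARS_SINCE_DINOSAURS, ORBITAL_PERIOD_YEARS)
print(f"{orbits:.0f} orbits")  # prints "60 orbits"
```

The claimed 4,000 orbits is off by a factor of roughly 67.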

The Hallucination Problem

How do we completely eliminate hallucinations in general-purpose frontier AI usage? Nobody has figured that out yet. The industry has moved from "this is a problem" to a kind of acceptance: "this is a feature, not a bug." There's some truth in that: the non-determinism and creative association-making are what make LLMs interesting. But for enterprise analytics, we need to do better.

Implementing Guardrails

I've been implementing several guardrails and best practices in VerbaGPT to mitigate hallucinations:

Retrieval-Augmented Generation (RAG) for grounding answers in actual data
Context engineering — providing precise schema context via Data Notes
Tool use — letting the LLM execute code rather than recall facts
Deep Review mode — the most recent addition

What is Deep Review Mode?

The review spins up a new LLM with fresh context that checks the work in detail, almost reproducing the analysis from scratch. A review typically takes as much compute as producing the original analysis. It's surprising how often it catches things that I missed on my first read.
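The core idea, recomputing each claim independently and comparing it with what was written, can be sketched without an LLM at all. The code below is not VerbaGPT's implementation: `Claim`, `review_claims`, and the 5% tolerance are hypothetical names and choices for illustration, and a plain function stands in for the fresh-context reviewer.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    text: str                        # what the commentary asserts
    claimed_value: float             # the number stated in the commentary
    recompute: Callable[[], float]   # independent recalculation by the reviewer


def review_claims(claims: List[Claim], rel_tol: float = 0.05) -> List[str]:
    """Return one finding per claim whose stated value disagrees with the recomputation."""
    findings = []
    for claim in claims:
        actual = claim.recompute()
        if not math.isclose(claim.claimed_value, actual, rel_tol=rel_tol):
            findings.append(
                f"{claim.text}: claimed {claim.claimed_value:g}, recomputed {actual:g}"
            )
    return findings


# The dinosaur comment from the earlier test run, expressed as a claim.
dinosaur_claim = Claim(
    text="orbits completed since the dinosaurs",
    claimed_value=4000,
    recompute=lambda: 66_000_000 / 1_100_000,
)

findings = review_claims([dinosaur_claim])
for finding in findings:
    print(finding)
```

The reviewer never sees the original reasoning, only the claim and the data, which is what keeps it from repeating the same mistake.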

It caught the dinosaur flub too.

I ran the analysis again with Deep Review. This time, no prehistoric error. The analysis was 43 pages of non-trivial analytics, not including the code. I ran the review — and it caught something I had missed:

[Image: Deep Review catching an error in the exoplanet analysis]

AI Checking AI — But Not the Answer

Is AI checking AI the answer? No. But it's a significant improvement, especially combined with other refinements. And it's an important step before the most important step: human review.

The key insight is this: use LLMs for what they're good at (code generation, pattern recognition, complex associations) and add layers of verification around what they're not good at (numerical accuracy, factual consistency in commentary). A multi-layered approach — code execution, AI review, human verification — gets much closer to accuracy than any single layer alone.
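One way to picture the layering is as a small pipeline in which automated checks collect findings and a human sign-off is still required at the end. This is a rough sketch under my own assumptions, not how VerbaGPT is built: `Report`, `make_number_check`, `run_layers`, and `ready_to_publish` are hypothetical names, and the only automated layer shown flags numbers in the commentary that do not appear among the code-computed results.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Report:
    commentary: str
    findings: List[str] = field(default_factory=list)
    approved_by_human: bool = False


# A layer inspects a report and returns a list of findings (empty if all is well).
Layer = Callable[[Report], List[str]]


def make_number_check(computed_values: Set[float]) -> Layer:
    """Build a layer that flags commentary numbers not produced by executed code."""
    def check(report: Report) -> List[str]:
        findings = []
        for token in re.findall(r"\d[\d,]*(?:\.\d+)?", report.commentary):
            value = float(token.replace(",", ""))
            # Exact matching keeps the sketch simple; a real check would allow rounding.
            if value not in computed_values:
                findings.append(f"number {token!r} is not among the computed results")
        return findings
    return check


def run_layers(report: Report, layers: List[Layer]) -> Report:
    """Run every automated layer in order and collect the findings on the report."""
    for layer in layers:
        report.findings.extend(layer(report))
    return report


def ready_to_publish(report: Report) -> bool:
    """A report passes only with no open findings and an explicit human sign-off."""
    return not report.findings and report.approved_by_human


computed = {60.0, 1_100_000.0}  # values produced by executed code
layers = [make_number_check(computed)]

flagged = run_layers(
    Report("The planet could have gone around its star 4,000 times since the dinosaurs."),
    layers,
)
clean = run_layers(
    Report(
        "The planet could have gone around its star about 60 times since the dinosaurs.",
        approved_by_human=True,
    ),
    layers,
)
print(ready_to_publish(flagged), ready_to_publish(clean))
```

Note that a report with no findings still fails `ready_to_publish` until a person approves it: the automated layers narrow what the human has to check, they do not replace the check.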

Originally posted on LinkedIn · December 21, 2025
