6,000+ worlds discovered — and we've barely looked at most of the sky.
I've written before that using LLMs with numeric data, such as databases, can actually reduce hallucinations. LLMs aren't great with numbers or with recalling facts off the cuff, but they are very good at writing code. When you lean into that strength, the results can be impressive.
That said, nothing eliminates hallucinations entirely. Neither does human intelligence, for that matter — but that's a tangent for another day.
I've always been interested in astrophysics. I found a database of 6,000+ exoplanets discovered as of December 2025 and analyzed it using VerbaGPT. The results were genuinely fascinating: even as someone who wrote a college thesis on the topic, I learned new things.
But were the insights correct?
On the surface, yes. The code-computed results were solid. But then you look closer. While the numerical analysis was accurate, the LLM sometimes adds commentary that doesn't come from the data itself. For example, in one test run it correctly noted that the longest orbital period for a known exoplanet is 1.1 million years. Then it added: "this means the planet could have gone around its parent star 4,000 times since the dinosaurs!"
Fans of Jurassic Park will know: the correct number is about 60 times (66 million years since the dinosaurs went extinct ÷ a 1.1-million-year orbit). The underlying analysis was right. The commentary was hallucinated.
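The sanity check on that commentary is a single division, and it is the kind of thing worth recomputing in code rather than trusting prose. Here is a minimal sketch in Python using only the figures quoted above; the variable and function names are illustrative and not taken from VerbaGPT.

```python
# Figures quoted in the post; all names here are illustrative assumptions.
YEARS_SINCE_DINOSAURS = 66_000_000   # roughly when the dinosaurs went extinct
ORBITAL_PERIOD_YEARS = 1_100_000     # longest orbital period found in the analysis
CLAIMED_ORBITS = 4_000               # the number the LLM's commentary asserted


def orbits_completed(elapsed_years: float, period_years: float) -> float:
    """Number of orbits (possibly fractional) completed in the elapsed time."""
    return elapsed_years / period_years


actual_orbits = orbits_completed(YEARS_SINCE_DINOSAURS, ORBITAL_PERIOD_YEARS)
overstatement = CLAIMED_ORBITS / actual_orbits

print(f"Orbits since the dinosaurs: about {actual_orbits:.0f}")
print(f"The commentary overstated this by a factor of about {overstatement:.0f}")
```

The point is less the arithmetic than the habit: a derived number in commentary is cheap to recompute, so there is little reason to take it on faith.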
How do we completely eliminate hallucinations in general-purpose frontier AI usage? Nobody has figured that out yet. The industry has moved from "this is a problem" to a kind of acceptance: "this is a feature, not a bug." There's some truth in that: the non-determinism and creative association-making are what make LLMs interesting. But for enterprise analytics, we need to do better.
I've been implementing several guardrails and best practices in VerbaGPT to mitigate hallucinations:
Deep Review spins up a new LLM with fresh context that checks the work in detail, almost reproducing the analysis from scratch. It typically takes as much compute to review as it does to produce the original. It's surprising how often it catches things that I missed on my first read.
It caught the dinosaur flub too.
I ran the analysis again with Deep Review. This time, no prehistoric error. The analysis was 43 pages of non-trivial analytics, not including the code. I ran the review — and it caught something I had missed:
Is AI checking AI the answer? No. But it's a significant improvement, especially combined with other refinements. And it's an important step before the most important step: human review.
The key insight is this: use LLMs for what they're good at (code generation, pattern recognition, complex associations) and add layers of verification around what they're not good at (numerical accuracy, factual consistency in commentary). A multi-layered approach — code execution, AI review, human verification — gets much closer to accuracy than any single layer alone.
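One small piece of that layering can be illustrated in code: any derived number that appears in commentary gets recomputed from the code-computed values and is flagged if it disagrees. The sketch below is a hypothetical illustration of the idea, not how VerbaGPT or its Deep Review works (that review uses a second LLM). The CommentaryClaim class, the flag_mismatches function, and the 5% tolerance are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CommentaryClaim:
    """A number asserted in prose, plus a way to recompute it from data."""
    description: str
    claimed_value: float
    recompute: Callable[[], float]  # derives the value from code-computed inputs


def flag_mismatches(
    claims: List[CommentaryClaim], rel_tol: float = 0.05
) -> List[Tuple[str, float, float]]:
    """Return (description, claimed, recomputed) for claims outside tolerance."""
    flagged = []
    for claim in claims:
        recomputed = claim.recompute()
        if not math.isclose(claim.claimed_value, recomputed, rel_tol=rel_tol):
            flagged.append((claim.description, claim.claimed_value, recomputed))
    return flagged


# Inputs from the analysis layer (figures taken from the post).
longest_period_years = 1_100_000
years_since_dinosaurs = 66_000_000

claims = [
    CommentaryClaim(
        description="orbits completed since the dinosaurs",
        claimed_value=4_000,  # what the commentary asserted
        recompute=lambda: years_since_dinosaurs / longest_period_years,
    ),
]

# Anything flagged here would go on to the next layer: a human reviewer.
for description, claimed, recomputed in flag_mismatches(claims):
    print(f"CHECK: {description}: commentary says {claimed:,.0f}, "
          f"data implies {recomputed:,.0f}")
```

A check like this only covers claims someone has thought to express as a formula, which is why it complements the AI review and human verification layers and does not replace them.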
Originally posted on LinkedIn · December 21, 2025