There's a mental model that most data teams operate under, and it goes something like this: first we clean the data, then we document it, then we can use AI on it. Linear. Sequential. Logical.
It's also wrong.
The sequential approach made sense when analysts were the ones doing the exploring. A human analyst needs context before they can make sense of a schema. They need to know that VAL means "premium" and CODE_3 means "active member." Without that knowledge, they're guessing.
So documentation came first. It was a prerequisite. And because creating good documentation takes time, entire analytical projects stalled while waiting for it.
Here's what most organizations actually have: databases built over years with cryptic column names, tribal knowledge scattered across team members who may or may not still be at the company, and a documentation backlog that grows faster than anyone can address it.
The correct response to this is not to wait. The correct response is to change the model.
The better mental model is not linear. It is iterative.
LLMs are excellent at code and pattern recognition. Given a schema (column names, data types, sample values), a well-prompted model can make plausible, often accurate inferences about what fields mean and how they relate. It won't be right about everything, but it produces a solid first draft of documentation that could take a human analyst days to write.
More importantly, the LLM can investigate before it answers. Rather than querying the data immediately, you task it with understanding the structure first. Sample rows. Cross-reference external sources. Identify anomalies. Then have it document what it finds.
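To make "understanding the structure first" concrete, here is a minimal sketch of a profiling pass that collects column names, declared types, and a few sample values, then assembles them into a prompt. Everything in it is an assumption for illustration: the `members` table, the helper names `profile_table` and `build_notes_prompt`, and the prompt wording are hypothetical, and the step that sends the prompt to a model is left out because it depends on the tool you use.

```python
import sqlite3


def profile_table(conn, table, sample_size=3):
    """Collect column names, declared types, and a few distinct sample values."""
    cur = conn.cursor()
    # PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk).
    columns = cur.execute(f'PRAGMA table_info("{table}")').fetchall()
    profile = []
    for _cid, name, col_type, *_rest in columns:
        rows = cur.execute(
            f'SELECT DISTINCT "{name}" FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL LIMIT ?',
            (sample_size,),
        ).fetchall()
        profile.append(
            {"column": name, "type": col_type, "samples": [r[0] for r in rows]}
        )
    return profile


def build_notes_prompt(table, profile):
    """Turn the profile into plain text that can be handed to a model."""
    lines = [f"Table: {table}", "Columns (name, declared type, sample values):"]
    for col in profile:
        lines.append(f"- {col['column']} ({col['type']}): {col['samples']}")
    lines.append(
        "For each column, propose a meaning, a unit if any, and a confidence "
        "level. Flag anything that looks anomalous."
    )
    return "\n".join(lines)


# Hypothetical table with cryptic column names, built in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (ID INTEGER, VAL REAL, CODE_3 TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [(1, 120.5, "A"), (2, 99.0, "I"), (3, None, "A")],
)

profile = profile_table(conn, "members")
prompt = build_notes_prompt("members", profile)
print(prompt)
```

The model's answer to a prompt like this is the first draft of the data notes, not the final word. The value of the profiling step is that the model reasons from real sample values rather than from column names alone.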
I tested this with the SEC Form 13F filing dataset — 20+ million rows of institutional holdings data with minimal official documentation. Rather than querying immediately, I asked VerbaGPT to generate data notes for the dataset.
The model identified that ACCESSION_NUMBER and CIK are SEC identifiers that link filings to filers. It correctly inferred the dataset represented quarterly institutional holdings disclosures. And it caught something I hadn't explicitly told it: the VALUE column changed units in 2023 — pre-2023 values are in thousands of dollars; from 2023 onward, they are in actual dollars.
That unit change would have silently corrupted any cross-year analysis. The model found it by examining the data distribution, not because it was told to look for it.
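I don't know exactly which check the model ran, but one simple way a distribution check can surface a change like this is to compare the typical value per year and flag any jump of several orders of magnitude. The sketch below is an assumption-laden illustration, not the model's method: the function name `find_scale_shifts`, the 100x threshold, and the rows are all made up for the example and are not taken from the 13F dataset.

```python
from collections import defaultdict
from statistics import median


def find_scale_shifts(records, threshold=100.0):
    """Flag years whose median value differs from the prior year's by >= threshold x.

    records: iterable of (year, value) pairs.
    Returns a list of (year, ratio_to_previous_year) tuples.
    """
    by_year = defaultdict(list)
    for year, value in records:
        if value is not None and value > 0:
            by_year[year].append(value)

    medians = {year: median(values) for year, values in by_year.items()}
    years = sorted(medians)

    shifts = []
    for prev, curr in zip(years, years[1:]):
        ratio = medians[curr] / medians[prev]
        if ratio >= threshold or ratio <= 1 / threshold:
            shifts.append((curr, ratio))
    return shifts


# Synthetic rows: 2021 and 2022 in thousands of dollars, 2023 in dollars.
records = [
    (2021, 1_200), (2021, 850), (2021, 4_300),
    (2022, 1_100), (2022, 900), (2022, 5_000),
    (2023, 1_150_000), (2023, 870_000), (2023, 4_600_000),
]

shifts = find_scale_shifts(records)
for year, ratio in shifts:
    print(f"Median VALUE in {year} is about {ratio:,.0f}x the prior year")
```

A roughly thousandfold jump in the median between adjacent years is far more likely to be a reporting change than a market event, which is exactly the kind of anomaly worth writing into the data notes before anyone sums across years.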
After the initial documentation pass, work can begin immediately — imperfect documentation and all. As queries run, the model refines its understanding. Analysts validate outputs. The documentation improves as a natural byproduct of doing the work, not as a prerequisite to it.
Each pass tightens the definitions. Errors surface and get corrected. The system converges on accuracy through use, not through preparation.
This approach democratizes access to data. Datasets that were previously impenetrable — because only one person understood them, or because proper documentation was "coming soon" — become workable immediately. Non-specialists can begin extracting value while domain experts focus on validating results rather than writing dictionaries.
The bottleneck shifts from "we need documentation before we can start" to "let's start and document as we go." For most organizations, that is a significant unlock.
There will always be a role for careful, deliberate data documentation. The point is not to skip it — it's to stop treating it as a gate. The analysts who will thrive are those who use AI to accelerate the documentation process itself, rather than waiting for someone else to complete it before they begin.
You don't need perfect data to start. You need a tool that can help you understand it as you go.
Originally published on Substack · March 14, 2026