Analytics Agent Evaluation
Analytics agents are easy to demo.
Hard to trust.
I build practical evaluation tools for AI data analysts: golden questions, SQL checks, failure taxonomies, and business-risk-weighted scorecards.
I help founders, data teams, and operators ship analytics agents they can actually rely on.
unsafe
{
  "question": "Why did revenue drop last week?",
  "sql_executes": true,
  "metric_correctness": 2,
  "grain_filters": 1,
  "failure_tags": [
    "wrong_denominator",
    "unsupported_root_cause"
  ],
  "severity": "high",
  "next_step": "Compare channels at session level"
}
Metric: Partial
Data source: Correct
Grain & filters: Incorrect
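A minimal sketch of how a record like the one above could be turned into the labels shown on the scorecard. The 1 to 3 scale, the `SCORE_LABELS` mapping, and the `is_unsafe` rule are assumptions made for illustration; the page only shows a score of 2 next to "Partial" and a score of 1 next to "Incorrect".

```python
# Assumed rubric scale: 1 = Incorrect, 2 = Partial, 3 = Correct (hypothetical).
SCORE_LABELS = {1: "Incorrect", 2: "Partial", 3: "Correct"}


def summarize(record: dict) -> dict:
    """Map the numeric rubric scores in an evaluation record to readable labels."""
    return {
        "Metric": SCORE_LABELS[record["metric_correctness"]],
        "Grain & filters": SCORE_LABELS[record["grain_filters"]],
    }


def is_unsafe(record: dict) -> bool:
    """Assumed rule: flag a run as unsafe when severity is high and any failure tag is present."""
    return record["severity"] == "high" and bool(record["failure_tags"])


record = {
    "question": "Why did revenue drop last week?",
    "sql_executes": True,
    "metric_correctness": 2,
    "grain_filters": 1,
    "failure_tags": ["wrong_denominator", "unsupported_root_cause"],
    "severity": "high",
    "next_step": "Compare channels at session level",
}

labels = summarize(record)
unsafe = is_unsafe(record)
```

Note that `sql_executes` is true here while the run is still flagged: a query that runs is not the same as an answer that is right.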
Practical evals. Real data. Fewer bad decisions.
50+ business questions tested
200+ agent runs and evaluations
10+ common failure modes cataloged
“Most analytics-agent failures won’t look like sci-fi hallucinations. They’ll look like subtle metric, grain, join, filter, and uncertainty mistakes that quietly produce bad business decisions.”
Latest Lab Notes
I build, break, and test analytics agents in public.
May 12, 2024
Failure Modes
The 10 ways my agent got revenue wrong
Wrong denominators, join multiplication, and more subtle SQL generation errors.
May 6, 2024
Evaluation
Version 2 results: better SQL, same reasoning gaps
Adding metric definitions helped, but root cause analysis is still weak.
Apr 28, 2024
Methodology
Building my golden question set
Why I weight questions by business risk, not just semantic accuracy.
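To illustrate the idea of weighting by business risk, here is a rough sketch comparing a plain average with a risk-weighted one. The `QuestionResult` type, the 0 to 1 score scale, the weights, and the second and third questions are all hypothetical; the note's actual weighting scheme is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    question: str
    score: float        # assumed normalized rubric score, 0.0 to 1.0
    risk_weight: float  # hypothetical business-risk weight (higher = costlier if wrong)


def unweighted_score(results: list[QuestionResult]) -> float:
    """Plain average: every question counts the same."""
    return sum(r.score for r in results) / len(results)


def risk_weighted_score(results: list[QuestionResult]) -> float:
    """Weighted average: high-risk questions dominate the total."""
    total_weight = sum(r.risk_weight for r in results)
    return sum(r.score * r.risk_weight for r in results) / total_weight


results = [
    QuestionResult("Why did revenue drop last week?", score=0.25, risk_weight=5.0),
    QuestionResult("How many users signed up yesterday?", score=1.0, risk_weight=1.0),
    QuestionResult("Which page had the most views?", score=1.0, risk_weight=1.0),
]

plain = unweighted_score(results)
weighted = risk_weighted_score(results)
```

With these made-up numbers, the plain average looks healthy while the weighted score drops sharply, because the one badly answered question is also the one a business decision would hinge on.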
Want a sanity check on your analytics agent?
I offer light-touch advising for teams building or evaluating AI data analysts. No hype. Just practical help.