Hands-on, in public
Analytics Agent Lab
A public workbench where I build, break, evaluate, and improve analytics agents.
Following the full lifecycle: data → prompts → SQL → results → evals → failures → improvements.
Current Project
Ecommerce Analytics Agent
Building an end-to-end agent over a BigQuery ecommerce dataset. 50 business questions. Comparing versions. Documenting failures.
01 Build sandbox: BigQuery ecommerce data + metric definitions
02 Naive agent: NL → SQL → result → summary logs
03 Golden set: 50 business questions with expected answer patterns
04 Eval harness: Scores, failure tags, severity, version comparison
Q: Why did revenue drop last week?
SQL executes : true
Metric correctness : 2 / 4
Grain & filters : 1 / 4
Failure tags : ["unsupported_root_cause"]
Severity : high
Recommendation : Show query + uncertainty
Score : 1.6 / 4
Root cause : unsafe
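As a rough sketch, one row of the eval card above could be stored as a small record. The class name, field names, and the `needs_review` helper are hypothetical and only mirror the labels on the card; they are not the lab's actual harness. How the overall score is derived is not shown here, so it is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical schema mirroring the card)."""
    question: str
    sql_executes: bool
    metric_correctness: int   # rubric score out of 4
    grain_and_filters: int    # rubric score out of 4
    failure_tags: List[str] = field(default_factory=list)
    severity: str = "low"
    recommendation: str = ""


def needs_review(record: EvalRecord) -> bool:
    """Flag a record that is high severity or carries any failure tag."""
    return record.severity == "high" or bool(record.failure_tags)


# Values copied from the example card.
record = EvalRecord(
    question="Why did revenue drop last week?",
    sql_executes=True,
    metric_correctness=2,
    grain_and_filters=1,
    failure_tags=["unsupported_root_cause"],
    severity="high",
    recommendation="Show query + uncertainty",
)

print(needs_review(record))
```

Keeping tags and severity as plain fields makes it straightforward to diff the same question across agent versions.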
Hosted on GitHub
Failure mode library
Browse the taxonomy
Evaluation Kit v0.1
Templates & examples
Case study (WIP)
Ecommerce agent deep dive
Recent lab notes
See all notes →
May 12, 2024
The 10 ways my agent got revenue wrong
Wrong denominators, join multiplication, and more.
Read note →
May 6, 2024
Version 2 results: better SQL, same reasoning gaps
Adding metric definitions helped, but root cause is still weak.
Read note →
Apr 28, 2024
Building my golden question set
Why I weight questions by business risk, not just accuracy.
Read note →
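To illustrate the idea of weighting questions by business risk, here is a minimal sketch of a risk-weighted mean next to a plain mean. The function name, the scores, and the weights are made up for illustration; the weighting scheme used in the lab may differ.

```python
from typing import Sequence


def risk_weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-question scores, with weights standing in for business risk."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical example: a low-risk question scored 4/4 and a
# high-risk question scored 1/4, with the risky one weighted 3x.
scores = [4.0, 1.0]
weights = [1.0, 3.0]

plain = sum(scores) / len(scores)
weighted = risk_weighted_score(scores, weights)
print(plain, weighted)
```

With these made-up numbers the plain mean is 2.5 while the weighted mean is 1.75, so a failure on the high-risk question pulls the overall score down more.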