Hands-on, in public

Analytics Agent Lab

A public workbench where I build, break, evaluate, and improve analytics agents.

Following the full lifecycle: data → prompts → SQL → results → evals → failures → improvements.

Results

Eval pass rate, run over run

The chart currently plots a deterministic fixture; the same chart contract can accept published lab runs later.

Current Project

Ecommerce Analytics Agent

Building an end-to-end agent over a BigQuery ecommerce dataset. 50 business questions. Comparing versions. Documenting failures.

Build sandbox

BigQuery ecommerce data + metric definitions

Naive agent

NL → SQL → result → summary logs

Golden set

50 business questions with expected answer patterns

Eval harness

Scores, failure tags, severity, version comparison
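The golden set and eval harness steps above could be sketched roughly as follows. This is a minimal illustration, not the lab's actual code: the names `GoldenQuestion`, `EvalResult`, `grade`, `pass_rate`, and `compare_versions` are hypothetical, as are the `PASS_THRESHOLD` cutoff and the `pattern_mismatch` tag. It also assumes that an "expected answer pattern" can be expressed as a regular expression.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GoldenQuestion:
    qid: str
    question: str
    expected_pattern: str  # assumption: a regex the agent's answer should match


@dataclass
class EvalResult:
    qid: str
    version: str
    score: float  # 0 to 4, mirroring the rubric scale shown on this page
    failure_tags: list = field(default_factory=list)
    severity: str = "none"


PASS_THRESHOLD = 3.0  # assumed cutoff for counting a question as passed


def grade(question: GoldenQuestion, answer: str, version: str) -> EvalResult:
    """Toy grader: full marks if the answer matches the expected pattern."""
    if re.search(question.expected_pattern, answer, re.IGNORECASE):
        return EvalResult(question.qid, version, 4.0)
    return EvalResult(question.qid, version, 0.0, ["pattern_mismatch"], "high")


def pass_rate(results: list) -> float:
    """Share of results scoring at or above PASS_THRESHOLD."""
    if not results:
        return 0.0
    return sum(r.score >= PASS_THRESHOLD for r in results) / len(results)


def compare_versions(results: list) -> dict:
    """Group results by agent version and report each version's pass rate."""
    by_version = {}
    for r in results:
        by_version.setdefault(r.version, []).append(r)
    return {version: pass_rate(rs) for version, rs in by_version.items()}
```

A real harness would score several rubric dimensions (such as metric correctness and grain and filters) rather than a single pattern match; the point here is only the shape of the records and the version comparison.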

View project on GitHub

eval_run_016


Q: Why did revenue drop last week?
SQL executes: true
Metric correctness: 2 / 4
Grain & filters: 1 / 4
Failure tags: ["unsupported_root_cause"]
Severity: high
Recommendation: Show query + uncertainty

Score

1.6 / 4

Root cause

unsafe

Ecommerce agent deep dive

Recent lab notes

See all notes →

May 12, 2024

The 10 ways my agent got revenue wrong

Wrong denominators, join multiplication, and more.

Read note →

May 6, 2024

Version 2 results: better SQL, same reasoning gaps

Adding metric definitions helped, but root-cause reasoning is still weak.

Read note →

Apr 28, 2024

Building my golden question set

Why I weight questions by business risk, not just accuracy.

Read note →
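To make the idea of weighting questions by business risk concrete, here is a small sketch comparing a plain pass rate with a risk-weighted one. The function name `weighted_pass_rate` and the weights are hypothetical illustrations, not values from the lab.

```python
def weighted_pass_rate(outcomes):
    """outcomes: list of (passed, risk_weight) pairs.

    Returns the share of total risk weight carried by passing questions.
    """
    total = sum(weight for _, weight in outcomes)
    if total == 0:
        return 0.0
    return sum(weight for passed, weight in outcomes if passed) / total


# Three low-risk questions pass; one high-risk question fails.
outcomes = [(True, 1.0), (True, 1.0), (True, 1.0), (False, 3.0)]

plain = sum(passed for passed, _ in outcomes) / len(outcomes)
weighted = weighted_pass_rate(outcomes)
print(plain, weighted)  # 0.75 0.5
```

Under these made-up weights, a single failure on a high-risk question pulls the score from 0.75 down to 0.5, which is the effect risk weighting is meant to surface.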