May 6, 2024

Version 2 results: better SQL, same reasoning gaps

Evaluation

We added explicit metric definitions to the prompt. The SQL execution rate improved significantly, but when the agent is asked “why did revenue drop”, it still guesses.

(Draft content)