May 6, 2024
Version 2 results: better SQL, same reasoning gaps
Evaluation
We added explicit metric definitions to the prompt. The SQL execution rate improved significantly, but when asked "why did revenue drop?", the agent still guesses instead of reasoning from the data.
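As a rough illustration of what "explicit metric definitions in the prompt" can look like, here is a minimal Python sketch. The metric names, their definitions, and the `build_prompt` helper are all hypothetical and are not taken from the actual agent.

```python
# Hypothetical metric definitions; names and SQL fragments are illustrative only.
METRIC_DEFINITIONS = {
    "revenue": "SUM(order_total) over orders with status = 'completed'",
    "active_users": "COUNT(DISTINCT user_id) with at least one event in the period",
}


def build_prompt(question: str, metrics: dict[str, str]) -> str:
    """Prepend explicit metric definitions to the user's question."""
    lines = ["Use these metric definitions when writing SQL:"]
    # Sort so the prompt text is stable across runs.
    for name, definition in sorted(metrics.items()):
        lines.append(f"- {name}: {definition}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_prompt("Why did revenue drop?", METRIC_DEFINITIONS)
print(prompt)
```

Definitions like these tell the model how to compute a metric, which plausibly helps it write SQL that runs. They say nothing about how to investigate a change in that metric, which matches the gap described above.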
(Draft content)