A new case study from Towards AI chronicles the hard lessons learned while building a production assistant for financial advisors. The author, Venkat Peri, and their team tracked every LLM-related failure during the build and found that nearly all critical issues were unsolvable through prompt editing. In the one instance where a prompt-only fix was attempted for the hardest problem, it measurably worsened performance and was immediately reverted.
The most unstable surface was routing: the same question could be sent to different handlers on different runs without any code change. On ambiguous edge cases, routing accuracy varied between 56 and 64 percent from run to run, and no prompt adjustment removed that variation.
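One way to make this kind of run-to-run drift visible is to call the router several times on the same question and record how often the runs agree. The sketch below is illustrative only and is not taken from the case study: `routing_stability`, the `route` callable, and the `runs` parameter are hypothetical names, and `route` stands in for whatever function wraps the model call.

```python
from collections import Counter
from typing import Callable, Iterable


def routing_stability(
    route: Callable[[str], str],  # hypothetical wrapper around the model-backed router
    questions: Iterable[str],
    runs: int = 5,
) -> dict[str, float]:
    """Return, per question, the share of runs that chose the most common handler."""
    report: dict[str, float] = {}
    for question in questions:
        # Ask the same question repeatedly and tally which handler was picked.
        counts = Counter(route(question) for _ in range(runs))
        top_count = counts.most_common(1)[0][1]
        report[question] = top_count / runs
    return report
```

A score of 1.0 means every run picked the same handler. Anything lower flags a question whose routing changes without any code change, which is the class of edge case the article describes.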
The team concluded that the model should be treated as one untrusted component within a larger, architecturally robust system. Durable fixes, they argue, come from structural changes—such as introducing deterministic fallbacks, validation layers, or ensemble methods—not from rewording instructions inside the prompt.
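As a rough illustration of what such structural changes might look like, the sketch below wraps an untrusted router in three layers: deterministic keyword rules that bypass the model, validation of the model's answer against a fixed set of handlers, and a majority vote with a deterministic fallback. None of this comes from the case study itself. The handler names, keyword rules, and the `model_route` callable are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional

# Hypothetical handler names and rules, chosen only for illustration.
HANDLERS = {"portfolio", "compliance", "general"}
FALLBACK = "general"
KEYWORD_RULES = {"rebalance": "portfolio", "disclosure": "compliance"}


def rule_route(question: str) -> Optional[str]:
    """Deterministic layer: return a handler when a known keyword appears."""
    text = question.lower()
    for keyword, handler in KEYWORD_RULES.items():
        if keyword in text:
            return handler
    return None


def guarded_route(
    question: str,
    model_route: Callable[[str], str],  # hypothetical wrapper around the LLM call
    votes: int = 3,
    quorum: int = 2,
) -> str:
    """Route with rules first, then a validated majority vote, then a fallback."""
    ruled = rule_route(question)
    if ruled is not None:
        return ruled

    # Validation layer: normalize each answer and discard unknown handlers.
    answers = (model_route(question).strip().lower() for _ in range(votes))
    valid = [label for label in answers if label in HANDLERS]

    # Ensemble layer: accept the model's choice only if enough runs agree.
    if valid:
        label, count = Counter(valid).most_common(1)[0]
        if count >= quorum:
            return label

    # Deterministic fallback when the model is invalid or inconsistent.
    return FALLBACK
```

With this shape, a call such as `guarded_route("Please rebalance this account", model_route)` never reaches the model, and an answer the model invents or cannot agree on ends up at the fallback handler instead of an arbitrary one. The trade-off is extra model calls per question for the vote.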
For organizations deploying LLMs in high-stakes domains like finance, the implication is clear: reliability requires thinking beyond the prompt. Over-reliance on prompt engineering as a cure-all can mask deeper systemic risks, particularly when outputs must be consistent and traceable.
A counterargument exists: prompt engineering remains a fast, low-cost lever for many less critical scenarios, and some teams have achieved significant gains with advanced techniques like chain-of-thought or few-shot prompting. The study, however, suggests that for production-critical failures, architecture—not prompts—is the only reliable path.