The Problem with Prompt & Pray [Chapter 3 - AI Deep Dives Series]

George

Chief of AI at Moveo

September 10, 2025

in ✨ AI Deep Dives

Welcome to Chapter 3: "The Problem with Prompt & Pray"!

In previous chapters, we discussed how AI systems can be orchestrated to execute complex business processes and showed why a Multi-Agent System architecture is superior to the Wrapper approach.

Now, we're going to dive into an even more common and dangerous mistake: the belief that a "smarter" LLM can solve everything on its own.

If agent orchestration was the answer to the fragility of Wrappers, the next question is: why can't we just trust that the most advanced model will do the job autonomously?

The answer lies in the risks of hallucination and the inability of a simple prompt to encode the complexity of a business process. Let's dig in!

LLMs are black boxes with hallucination risks

Despite their phenomenal ability to produce fluent and contextual text, LLMs are, in essence, black boxes.

They are difficult to steer or debug, and even top-tier models generate confidently wrong outputs: answers that sound perfect but are factually or procedurally false.

Example: A banking chatbot reads a pending transaction as settled and tells a customer their “available balance” is higher than it really is. The wording is polished. The number is wrong. That single slip triggers a complaint, a chargeback, and a regulatory paper trail.

In regulated domains like finance, healthcare, or collections, the cost of a plausible-sounding falsehood is unacceptable. Governance reports and recent case studies (IAPP, Stanford Law School) continue to flag hallucinations as a systemic risk that must be mitigated by processes and controls, not by prompts alone.

→ Read the AI Deep Dives Chapter 2: The great AI debate: Wrappers vs. Multi-Agent Systems in enterprise AI.

Near-100% accuracy is non-negotiable in the enterprise

For critical functions in finance, accuracy targets aren't 80-95%; they are closer to 100%.

If a retail chatbot misstates stock 1% of the time, it's an annoyance. But if a bank’s agent misreports transactions 1% of the time, the consequences are catastrophic: lost customer trust, a spike in complaints, and intense regulatory scrutiny.

Quick math: 1% error rate across 10,000 monthly transaction disputes = 100 incidents. Each one is a customer complaint, a manual recovery, and potentially an auditor’s visit. Enterprises cannot afford that risk.

Relying on "Prompt & Pray" for these essential functions is, therefore, a flawed strategy that puts a company's reputation and compliance at risk.

Prompts don’t encode complex business processes

Critical steps like eligibility checks, two-factor authentication (OTP/2FA), consent gates, dispute resolution, refunds, retries, and rate limits are more than prompt sentences; they are complex workflows.

Here’s what it looks like if you try to encode a dispute flow directly into a single mega-prompt:

Prompt attempt

“When a customer disputes a transaction, first verify their identity. If the customer is already verified in session, then skip verification, but otherwise send an OTP to their registered device and make sure the code matches. After that, explicitly ask for their consent to proceed. Only if consent is affirmative and the OTP is validated, then call the transaction API using the correct transaction ID provided by the user. If the ID doesn’t match any known transaction, don’t proceed. Finally, generate a confirmation receipt, but only after the API succeeds.”

At first glance, this looks thorough. In practice, it’s riddled with ambiguity:

  • What exactly counts as “already verified”?

  • How does the model “make sure” the code matches?

  • What happens if the user gives partial consent (e.g., “maybe” or “I guess”)?

  • How does the model know the transaction ID is valid before the API call?

  • What if the API fails? Should it retry, rollback, or escalate?

Natural language was never designed to encode deterministic, step-by-step execution. Tiny changes in wording (“skip verification if the user seems verified”) can lead to skipped steps or inconsistent enforcement.

That’s why critical enterprise flows cannot live inside prompts. They must be modeled in structured, versioned workflows where order, consent, authentication, and error handling are explicit and enforceable.
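To make the contrast concrete, here is a minimal sketch of the same dispute flow expressed as an explicit, deterministic workflow instead of a mega-prompt. The names (DisputeContext, send_and_validate, open_dispute, and so on) are illustrative placeholders, not a specific product's API, but they show how order, consent, authentication, and error handling become enforceable code paths rather than sentences:

```python
from dataclasses import dataclass

# Hypothetical session state; in a real system this comes from the
# authentication layer, never from the model's conversation history.
@dataclass
class DisputeContext:
    verified: bool          # user already authenticated in this session
    consent_given: bool     # explicit "Yes" captured from the customer
    transaction_id: str

def run_dispute_flow(ctx: DisputeContext, otp_service, transactions_api, audit_log):
    """Deterministic dispute workflow: every gate is explicit and auditable."""
    # 1. Identity: skip OTP only on a hard, machine-checkable condition.
    if not ctx.verified:
        if not otp_service.send_and_validate(ctx):        # hypothetical helper
            audit_log.record("otp_failed", ctx.transaction_id)
            return "escalate_to_agent"

    # 2. Consent gate: proceed only on an explicit affirmative.
    if not ctx.consent_given:
        audit_log.record("consent_missing", ctx.transaction_id)
        return "ask_for_consent"

    # 3. Validate the transaction ID against the backend before acting.
    if not transactions_api.exists(ctx.transaction_id):
        audit_log.record("unknown_transaction", ctx.transaction_id)
        return "ask_for_valid_id"

    # 4. Execute, with explicit error handling instead of model improvisation.
    try:
        receipt = transactions_api.open_dispute(ctx.transaction_id)
    except TimeoutError:
        audit_log.record("api_timeout", ctx.transaction_id)
        return "retry_later"

    audit_log.record("dispute_created", ctx.transaction_id)
    return receipt
```

Every ambiguity in the prompt version ("already verified", "make sure the code matches", "what if the API fails?") becomes a branch that either passes a concrete check or stops the flow.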

As highlighted by O'Reilly, effective systems require a clear distinction between conversational interfaces and the execution of business logic. Business logic should be managed with the rigor of traditional software development: it must be versioned, observable, and testable.

Do you know what RAG (Retrieval-Augmented Generation) is? Find out here!

Compliance and security require reinforcement outside the model

You can instruct a model to “stay polite” or “always verify consent.” But prompts are not guarantees. Compliance requires hard gates and deterministic enforcement outside the model:

  • Consent gates: capture an explicit Yes/No from the customer before anything happens.

  • Authentication: OTP/2FA must be validated by a rules engine that checks the code against the database. A model can’t be trusted to “decide” if a code seems right.

  • Pre/Post validators: ensure that outputs aren’t just well-formed (schema-valid) but also correct in context (business-valid). A transaction ID might look fine in format, yet belong to the wrong customer or the wrong charge—validators catch that.

  • Audit trails: every step (consent captured, OTP verified, action executed) needs to be logged in a system that can be replayed. This isn’t just for debugging; it’s what regulators expect when asking, “Can you prove this was done correctly?”

Without these, enterprises run on trust, not proof, and regulators don’t accept trust.
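As an illustration of the pre/post-validator idea, here is a minimal sketch of a check that runs on a model-proposed action before it ever reaches the backend. The field names and the lookup helper are assumptions made for the example, not a specific framework's API:

```python
def validate_dispute_action(action: dict, customer_id: str, transactions_api) -> list[str]:
    """Post-validator: reject output that is well-formed but wrong in context."""
    errors = []

    # Schema-level check: the model must produce the fields the backend expects.
    for field in ("transaction_id", "amount", "reason"):
        if field not in action:
            errors.append(f"missing field: {field}")
    if errors:
        return errors

    # Business-level check: the transaction must exist AND belong to this customer.
    txn = transactions_api.lookup(action["transaction_id"])   # hypothetical call
    if txn is None:
        errors.append("unknown transaction_id")
    elif txn.customer_id != customer_id:
        errors.append("transaction belongs to a different customer")
    elif txn.amount != action["amount"]:
        errors.append("amount does not match the disputed charge")

    return errors   # empty list = safe to hand to the execution layer
```

The point is that the decision to proceed is made by code that can be tested and audited, not by the model's confidence.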

How to escape the "Prompt & Pray" trap

The key to overcoming the “Prompt & Pray” trap isn’t to chase the next big model. It’s to change the architecture. Enterprises don’t need “smarter prompts”; they need systems designed for predictability, compliance, and robustness.

Here are three practical ways to make that shift:

1. Separate logic from the model

The first step is to separate your business logic from the model layer. Instead of trying to "teach" complex rules to an LLM through prompts, use the model for what it does best: understanding questions, generating fluent responses, and maintaining a natural conversational flow.

The business process itself (validation, authorization, authentication) should be managed by an external, deterministic rules engine or workflow.

Practical Example: Consider a transaction dispute system. The LLM can interact with the user, collect information, and show empathy. However, the workflow for identity verification, consent confirmation, and ticket creation must be managed by a system with predefined rules and APIs. The LLM acts as the interface, not the executor.
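A minimal sketch of that split, assuming a generic LLM client and the deterministic dispute workflow sketched earlier; the method names (classify_intent, render_reply) are illustrative, not a specific library:

```python
def handle_customer_message(message: str, ctx, llm, workflow):
    """The LLM interprets and phrases; the workflow decides and executes."""
    # 1. The model does what it is good at: understanding the request.
    intent = llm.classify_intent(message)          # e.g. "dispute_transaction"

    # 2. Deterministic business logic decides what actually happens.
    if intent == "dispute_transaction":
        outcome = workflow.run_dispute_flow(ctx)   # rules engine, not the model
    else:
        outcome = "small_talk"

    # 3. The model turns the structured outcome back into natural language.
    return llm.render_reply(outcome, tone="empathetic")
```

Swapping the model for a "smarter" one changes steps 1 and 3; steps 2 and the guarantees that come with it stay exactly the same.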

2. Adopt a modular, Multi-Agent approach

As we explored in the previous chapter, a Multi-Agent System is the antidote to brittle wrappers and mega-prompts. Each agent has a narrow, well-defined responsibility. This decomposition makes the system easier to test, safer to operate, and far more reliable under scale.

The Multi-Agent System Workflow

User: "How do I dispute a $200 charge?" (Initial query)

  1. Planning Agent: analyzes user intent and activates the right process (e.g., "Activate transaction dispute flow").

  2. Response Agent (LLM): maintains a fluid and empathetic conversation, asking for necessary details (e.g., "We'll need more details on that $200 charge...").

  3. Validation/Compliance Agent: ensures all internal policies and regulations are followed (e.g., "Verify user consent", "Confirm identity").

  4. Execution/API Agent: executes the business task in the relevant backend system (e.g., "Create a dispute ticket", "Notify the bank about the request").

  5. Auditing/Insights Agent: records every step, tags outcomes (dispute, hardship, legal threat), and maintains an audit trail.

This architecture replaces fragile, one-shot prompts with structured workflows that enforce order, compliance, and reliability, so every step runs the way the enterprise requires.
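Here is a minimal sketch of how those five responsibilities could be wired together for the example query above. The agent names and their interfaces are illustrative assumptions, not a particular orchestration framework:

```python
def handle_dispute_query(query: str, session, agents):
    """Illustrative orchestration of the five agents for a dispute query."""
    # 1. Planning Agent: map the query to a known process.
    plan = agents.planner.route(query)                    # -> "transaction_dispute"

    # 2. Response Agent (LLM): collect the missing details conversationally.
    details = agents.responder.collect_details(plan, session)

    # 3. Validation/Compliance Agent: hard gates before anything executes.
    checks = agents.compliance.verify(details, session)   # consent, identity, policy
    if not checks.passed:
        return agents.responder.explain_blocker(checks)

    # 4. Execution/API Agent: the only component allowed to touch the backend.
    result = agents.executor.create_dispute_ticket(details)

    # 5. Auditing/Insights Agent: log every step for replay and reporting.
    agents.auditor.record(plan, details, checks, result)

    return agents.responder.confirm(result)
```

Each agent can be tested, monitored, and replaced independently, which is exactly what a single mega-prompt cannot offer.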

3. Build a Culture of Governance and Testing

To transform AI from an experiment into a production asset, you must treat AI development like traditional software development. This means implementing:

  • Versioning: business logic and policies should have version control. If a rule changes, you should be able to track the alteration and its impact.

  • Observability: monitor agent behavior. Detailed logs at each step allow you to audit the workflow and identify failures or policy deviations.

  • Automated Testing: don't rely on the model's fluency. Create robust test cases to ensure critical workflows (e.g., credit approval, dispute processing) always follow the rules, regardless of the LLM's response.

By adopting these practices, you shift your team from a mindset of "hoping the model gets it right" to the confidence that the system is designed to be predictable, secure, and auditable.
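To ground the automated-testing point, here is a pytest-style sketch against the dispute workflow from earlier (it assumes DisputeContext and run_dispute_flow from that sketch are importable; the fakes below are test doubles, not real services):

```python
# Assumes DisputeContext and run_dispute_flow from the earlier workflow sketch.

class FakeOtp:
    def send_and_validate(self, ctx):
        return True

class FakeApi:
    def exists(self, txn_id):
        return True
    def open_dispute(self, txn_id):
        return {"ticket": "DSP-001", "transaction_id": txn_id}

class FakeLog:
    def record(self, event, txn_id):
        pass  # a real log would persist this for the audit trail

def test_dispute_flow_blocks_without_consent():
    # No consent captured: the gate, not the model, must stop the flow.
    ctx = DisputeContext(verified=True, consent_given=False, transaction_id="TX-123")
    assert run_dispute_flow(ctx, FakeOtp(), FakeApi(), FakeLog()) == "ask_for_consent"

def test_dispute_flow_executes_with_consent_and_valid_id():
    ctx = DisputeContext(verified=True, consent_given=True, transaction_id="TX-123")
    result = run_dispute_flow(ctx, FakeOtp(), FakeApi(), FakeLog())
    assert result == {"ticket": "DSP-001", "transaction_id": "TX-123"}
```

Tests like these run on every change to the business logic, so a policy update or a model swap cannot silently break a critical gate.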

→ Access the full "AI Deep Dives" content series here.

It's not a smarter model, it's a smarter system

Demos reward fluency. Enterprises reward guarantees. A single mega-prompt can showcase intelligence, but it cannot enforce order, guarantee compliance, or deliver robustness under scale. That requires architecture.

The winners won’t be those who find the cleverest prompts, but those who design systems: specialized agents, deterministic flows, and externalized policy enforcement. That’s how enterprises achieve both human-like experiences and enterprise-grade reliability.

Don't miss the next chapters of our "AI Deep Dives" series! In Chapter 4: "Debt Collection: Why sensitivity and structure matter," we'll explore why getting both right is crucial in a process as delicate as debt collection.

Talk to our AI Experts →