How Function Calling Works (and Where It Breaks) [AI Deep Dives, Chapter 6]

George, Chief of AI at Moveo
September 26, 2025 · ✨ AI Deep Dives
If you've been following our "AI Deep Dives" series, you know we've already explored the limitations of approaches like RAG and the inadequacy of the "Prompt & Pray" strategy for critical business tasks.
In our previous chapter, "From Prompt & Pray to Tools & Pray", we started to question whether tools were the magic bullet. In this sixth chapter, we take a deeper dive into function calling: the ability of large language models (LLMs) to interact with tools and APIs.
We'll see how this functionality works, why it's so popular, and, most importantly, why it becomes a source of problems when applied in corporate environments that demand rigor, security, and governance.
Let's get into it!
What is Function Calling?
Function calling (or tool calling) is a feature that allows an LLM to interact with the outside world. Essentially, you provide the model with a list of tools (functions), along with their names, descriptions, and the expected arguments. During a conversation, the interaction follows a simple loop (sketched in code below):
1. The model selects a tool (e.g., `getTransactions`).
2. The model fills in the arguments (e.g., `{ "date": "2025-09-01" }`).
3. Your application executes the tool and returns the result.
4. The model uses the result to continue the conversation or call another tool.
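Put together, the loop looks something like the minimal sketch below. The tool schema follows the widely used OpenAI-style JSON Schema convention, and `callModel` is a mock stand-in for whichever LLM client you actually use; both are illustrative assumptions, not a specific vendor's API.

```typescript
// Minimal sketch of the function-calling loop. The tool schema follows the
// widely used OpenAI-style JSON Schema convention; `callModel` is a mock
// stand-in for a real LLM client, included only so the sketch runs.

type Message = { role: "user" | "assistant" | "tool"; content: string };
type ToolCall = { name: string; arguments: string }; // arguments arrive as a JSON string
type ModelReply = { text?: string; toolCall?: ToolCall };

// 1. Describe each tool: name, description, expected arguments.
const tools = [{
  name: "getTransactions",
  description: "List the user's transactions for a given date.",
  parameters: {
    type: "object",
    properties: { date: { type: "string", description: "ISO date, e.g. 2025-09-01" } },
    required: ["date"],
  },
}];

// Mock model: on the first turn it picks a tool; after seeing a tool result it answers.
async function callModel(messages: Message[], _tools: unknown[]): Promise<ModelReply> {
  const sawToolResult = messages.some((m) => m.role === "tool");
  return sawToolResult
    ? { text: "You had one transaction of $42.50 on 2025-09-01." }
    : { toolCall: { name: "getTransactions", arguments: '{"date":"2025-09-01"}' } };
}

// 3. Your application owns the actual execution of each tool.
async function getTransactions(date: string) {
  return [{ id: "txn_123", date, amount: 42.5 }]; // stubbed data
}

// Steps 2-4 as a loop: the model selects a tool and fills arguments, the app
// executes, and the result is fed back until the model replies with plain text.
async function run(userMessage: string): Promise<string> {
  const messages: Message[] = [{ role: "user", content: userMessage }];
  while (true) {
    const reply = await callModel(messages, tools);
    if (!reply.toolCall) return reply.text ?? ""; // model answered directly
    const args = JSON.parse(reply.toolCall.arguments);
    const result =
      reply.toolCall.name === "getTransactions"
        ? await getTransactions(args.date)
        : { error: `Unknown tool: ${reply.toolCall.name}` };
    messages.push({ role: "tool", content: JSON.stringify(result) });
  }
}

run("What did I spend on September 1st?").then(console.log);
```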
This approach works perfectly for simple, low-risk tasks like checking the weather, looking up a CRM record, or doing a quick calculation. If the model makes a small mistake, the business risk is minimal. This is why many developers and teams love the simplicity and flexibility it offers.
Why Function Calling Breaks in Real-World Enterprises
Despite its usefulness, function calling isn't the definitive solution for all enterprise challenges. In complex, high-stakes, customer-facing environments, it fails in unpredictable and dangerous ways. In practice, failures fall into five categories:
1. Tool Selection at Scale
As the number of available tools increases, the model can struggle to choose the correct action. Imagine a support assistant that can `createRefund()`, `initiateChargeback()`, or `openDispute()`. If a user requests to "dispute a transaction" but the model instead triggers a refund or chargeback, the outcome could be financial disorder, complicated reconciliations, and even potential compliance violations.
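To see why selection gets hard, consider a hypothetical set of tool definitions for that assistant. To the model they differ only in their natural-language descriptions, and everyday phrasing like "dispute a transaction" plausibly matches all three:

```typescript
// Hypothetical tool definitions for the support assistant described above.
// Nothing but the description distinguishes them, and ambiguous user phrasing
// gives the model no reliable way to pick the right one every time.
const paymentTools = [
  { name: "createRefund", description: "Return money to the customer for a transaction." },
  { name: "initiateChargeback", description: "Reverse a transaction through the card network." },
  { name: "openDispute", description: "Open a dispute case for a contested transaction." },
];
```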
→ Read Chapter 4: AI for Payments: Why "Prompt & Pray" fails and what scales safely
2. Argument Correctness
Even if the model selects the right tool, it can fill in the arguments incorrectly. For example, an assistant might cite a transaction ID that looks right but doesn't belong to the user, or set a plan's `start_date` incorrectly. Small errors like these lead to incorrect actions that can cause major headaches, such as failed disputes or incorrect charges.
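The usual defense is to validate every model-supplied argument in your own code before anything executes. A minimal sketch, where `db`, `openDispute`, and the record shape are hypothetical illustrations, not a specific SDK:

```typescript
// Argument validation outside the model: every model-supplied ID is checked
// for ownership before any side effect runs.
type Txn = { id: string; userId: string };

const db = {
  async getTransaction(id: string): Promise<Txn | undefined> {
    const txns: Txn[] = [{ id: "txn_123", userId: "user_1" }]; // stubbed store
    return txns.find((t) => t.id === id);
  },
};

async function openDispute(txnId: string) {
  return { disputeId: "dsp_1", txnId }; // stubbed side effect
}

async function safeOpenDispute(userId: string, transactionId: string) {
  const txn = await db.getTransaction(transactionId);
  if (!txn || txn.userId !== userId) {
    // A model-picked ID that "looks right" but belongs to someone else stops here.
    throw new Error(`Transaction ${transactionId} does not belong to ${userId}`);
  }
  return openDispute(txn.id);
}

safeOpenDispute("user_2", "txn_123").catch((e) => console.error(e.message)); // rejected
```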
3. Order and State of the Workflow
In multi-step workflows, such as those requiring authentication before a transaction, function calling cannot guarantee that steps execute in the proper order; there is no dependable mechanism to ensure the model consistently follows the correct sequence.
For instance, the model might create a plan or dispute a transaction before authenticating the user or obtaining explicit consent. Skipped steps, the absence of retries or rollbacks, and the persistence of hallucinations make this approach unsuitable for critical processes.
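What critical processes need instead is ordering enforced outside the model. In the sketch below (all names and states illustrative), a tiny state machine makes authentication and consent hard preconditions, so a premature tool call fails instead of executing:

```typescript
// What function calling alone doesn't provide: sequencing enforced in code,
// not in the prompt. States, methods, and IDs are illustrative.
type Step = "start" | "authenticated" | "consented" | "done";

class DisputeWorkflow {
  private step: Step = "start";

  authenticate(otpValid: boolean) {
    if (this.step !== "start") throw new Error(`Unexpected state "${this.step}"`);
    if (!otpValid) throw new Error("Invalid OTP; authentication failed");
    this.step = "authenticated";
  }

  recordConsent(userAgreed: boolean) {
    if (this.step !== "authenticated") throw new Error("Authenticate before consent");
    if (!userAgreed) throw new Error("Explicit consent is required");
    this.step = "consented";
  }

  openDispute(txnId: string) {
    // The model may *request* this at any time; the workflow only allows it here.
    if (this.step !== "consented") throw new Error(`Cannot open dispute from state "${this.step}"`);
    this.step = "done";
    return { disputeId: "dsp_1", txnId };
  }
}

const wf = new DisputeWorkflow();
wf.authenticate(true); // skipping this line would make openDispute() throw
wf.recordConsent(true);
console.log(wf.openDispute("txn_123")); // the side effect runs only in order
```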
4. Business Logic in Natural Language
Trying to encode complex rules ("send OTP before creating a plan, unless the user is verified") directly into prompts is an invitation to disaster. These natural language instructions are ambiguous and brittle, and minor wording changes can drastically alter the model's behavior. This makes it nearly impossible to ensure consistency and compliance at scale.
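For contrast, the same rule expressed as code has exactly one interpretation of "verified" and cannot drift with prompt wording. A minimal sketch, with `User`, `sendOtp`, and `createPlan` as hypothetical stand-ins:

```typescript
// "Send OTP before creating a plan, unless the user is verified" as a
// deterministic policy rather than a natural-language instruction.
type User = { id: string; verified: boolean };

async function sendOtp(_user: User) { /* deliver a one-time password */ }
async function createPlan(user: User) { return { planId: "pln_1", userId: user.id }; }

async function createPlanWithPolicy(user: User) {
  if (!user.verified) {
    await sendOtp(user); // unverified users always hit the OTP step, no exceptions
    throw new Error("OTP verification pending; plan not created yet");
  }
  return createPlan(user);
}

createPlanWithPolicy({ id: "user_1", verified: true }).then(console.log);
```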
5. Compliance and Audit
In regulated industries, certain disclosures, tone controls, and records of each step are mandatory. The model cannot guarantee that a required phrase, such as "This is a communication from a debt collector," will be included.
If the model makes a tonal error or skips a consent step, the company could face fines, reputational damage, and a lack of essential artifacts for auditors.
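Guarantees like these have to live in code, not in the prompt. A minimal sketch (all names illustrative) that enforces the disclosure programmatically and records every step for auditors:

```typescript
// Guardrails the model cannot guarantee on its own: the required disclosure
// is enforced in code, and every step lands in an audit log.
const REQUIRED_DISCLOSURE = "This is a communication from a debt collector.";

type AuditEntry = { at: string; step: string; detail: string };
const auditLog: AuditEntry[] = [];

function record(step: string, detail: string) {
  auditLog.push({ at: new Date().toISOString(), step, detail });
}

function sendCollectionMessage(draft: string): string {
  // Prepend the disclosure regardless of what the model generated.
  const message = draft.startsWith(REQUIRED_DISCLOSURE)
    ? draft
    : `${REQUIRED_DISCLOSURE} ${draft}`;
  record("disclosure_enforced", message);
  return message;
}

console.log(sendCollectionMessage("Your balance of $120 is 30 days past due."));
console.log(auditLog); // step-by-step artifacts for auditors
```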
Here's a summary of the five failure modes:

| Where it breaks down | What the user sees | Why it's a problem for the enterprise |
| --- | --- | --- |
| Tool selection at scale | Model calls `createRefund()` instead of `openDispute()`. | Incorrect financial action; reconciliation chaos and potential clawbacks; compliance exposure; undoing may be complex or irreversible. |
| Argument correctness | Assistant cites a transaction ID that looks right but isn't the user's, or sets a plan's `start_date` incorrectly. | Disputes fail or hit the wrong item; misconfigured plans; reconciliation and regulatory headaches. |
| Order & state | Assistant creates a plan or disputes a transaction before authenticating or without explicit consent. | Prompts cannot enforce strict sequencing: models skip steps, there are no retries or rollbacks, and hallucination risk persists. |
| Business logic in prompts | Instruction is vague: "Send OTP before creating a plan, unless the user is verified." Model interprets "verified" loosely and skips the OTP. | Natural-language rules are ambiguous and brittle; behavior varies with tiny wording changes; consistent compliance is impossible to prove. |
| Compliance & audit | Missing required disclosure (e.g., "This is a communication from a debt collector") or coercive phrasing ("You must pay today"). | Regulatory exposure, fines, reputational damage; poor or missing artifacts for auditors (consent, disclosures, step logs). |
What’s missing from Function Calling?
When critical steps, such as consent, authentication, or database writes, live inside the model’s prompt, there’s no assurance they will happen, happen in the right order, or be executed correctly. Function calling, on its own, lacks the control and predictability needed for complex enterprise environments.
This is why businesses can't rely solely on it. They must go beyond it, combining the flexibility of language models with an architecture that provides control, governance, and predictability.
→ Read Chapter 2: The great AI debate: Wrappers vs. Multi-Agent Systems in enterprise AI
The journey continues...
Function calling is powerful, as it gives LLMs the ability to interact with the outside world and even execute real actions. But this power doesn’t solve the underlying challenge. Function calling is not a control plane: it doesn’t guarantee accuracy, enforce the correct order of steps, or provide the reliability and governance enterprises require.
Instead, it should be seen as just one component within a broader enterprise framework. On its own, it amplifies what a model can do; combined with the right architecture, it becomes part of a system that ensures rigor, compliance, and trust.
In our next installment, Chapter 7: The Moveo.AI approach (deeper), we'll dive into how a hybrid architecture, one that combines the intelligence of LLMs with the rigor of deterministic dialog flows, can solve the challenges we've explored.
Are you ready to discover how to build AI systems that are both smart and, most importantly, reliable? Stay tuned!