Multilingual voice AI in regulated industries: What it actually takes

Moveo AI Team


Scaling voice AI to real telephony, multiple languages, and regulated verticals is where most projects stall. 

Human agents account for up to 95% of contact center operating costs, according to Gartner, and 71% of consumers already expect personalized experiences in every interaction, per McKinsey.

The pressure to modernize is real, but few projects reach production with the consistency that banks, telecoms, and utilities require. 

At Moveo, we run this evaluation process with rigor, in partnership with specialized providers like Deepgram, and we've documented what we've learned.

Why 2025 is when serious enterprises are moving past IVR

Three forces are converging in ways they haven't before. 

The cost of human agents in contact centers keeps rising: annual turnover in the industry ranges from 30% to 45%, and replacing a single agent costs between $10,000 and $20,000, according to Insignia Resources.

Consumer expectations have shifted after years of exposure to voice assistants and conversational interfaces. And speech recognition models for languages beyond English have finally reached a level of maturity that makes production deployment viable.

According to Metrigy's CX Optimization 2025-26 global study, 37.6% of companies plan to fully replace their IVRs with AI agents.

What holds many of them back is the gap between testing environments and telephony reality: audio quality over PSTN and SIP is inconsistent, multilingual coverage varies significantly across providers, and in sectors like banking, telecom, and insurance, a production failure carries technical, regulatory, and reputational costs all at once.

Moveo operates today in English, Brazilian Portuguese, and Greek, with expansion underway for Italian and Polish, and each new language brings its own evaluation curve.

The questions a CX leader should ask before choosing any voice AI solution

Before evaluating any provider or platform, it helps to be clear about what actually matters in your context. Based on Moveo's experience evaluating solutions for banks, telecoms, utilities, and insurers across multiple markets, a handful of questions tend to be decisive.

  • How do you guarantee quality parity between English and languages with less training data available? 

  • How does the system detect when the customer has finished speaking, and what happens when that detection fails? 

  • What is the actual latency on a phone call, outside of a controlled environment? 

  • Can the provider guarantee data residency in the region where your customer operates, whether that's Brazil, the EU, or the US? 

  • And how do you measure performance over time, not just during the pilot?

These are exactly the questions that guide our internal evaluation, and the sections below explain how we answer each one.

What Moveo prioritizes when evaluating voice AI infrastructure

Evaluating an STT provider for production use goes beyond comparing models in a controlled environment. 

Moveo works with four criteria that reflect what actually matters for the customers it serves: accuracy in real conditions, end-of-turn behavior, latency over telephony, and regulatory readiness. Each criterion eliminates candidates that appeared viable in the playground.

Transcription accuracy in real conditions

Real calls arrive with background noise, codec compression, regional accents, and domain-specific vocabulary. 

In financial services, a misheard term in a payment instruction isn't just an inaccuracy; it can be a regulatory issue. That's why we evaluate models with pre-recorded and TTS-synthesized samples in each language before advancing to any integration.
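A common way to score that per-language comparison is word error rate (WER) against reference transcripts. The sketch below is illustrative: the provider names, transcripts, and scores are invented for the example, not Moveo benchmark data.

```python
# Sketch: scoring candidate STT providers on a reference transcript with
# word error rate (WER). Provider names and transcripts are illustrative.

def wer(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# A Brazilian Portuguese payment instruction, where a single misheard
# word matters (see the regulatory point above).
reference = "transferir duzentos reais para a conta corrente"
candidates = {
    "provider_a": "transferir duzentos reais para a conta corrente",
    "provider_b": "transferir dozentos reais para conta corrente",
}
scores = {name: wer(reference, hyp) for name, hyp in candidates.items()}
```

In practice the samples would also be weighted toward domain vocabulary (amounts, account types, product names), since those are the words where an error is costliest.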

End-of-turn detection

An agent that cuts the customer off mid-sentence, or waits too long to respond after the customer has finished, produces a more frustrating experience than a conventional IVR. 

The consistency of that behavior across languages is what separates models that work in English from those that perform reliably in Brazilian Portuguese and Greek, and on its own it can outweigh raw transcription accuracy.
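One common way to make the "what happens when detection fails" question concrete is to pair the model's end-of-turn signal with a hard silence timeout as a fallback. The field names and thresholds below are hypothetical; in practice both would be tuned per language and channel.

```python
# Sketch of an end-of-turn decision: trust the model's signal when it is
# confident, and fall back to a silence timeout when that signal never fires.
# Thresholds here are illustrative, not production values.

from dataclasses import dataclass

@dataclass
class TurnState:
    eot_probability: float  # model's end-of-turn confidence, 0..1
    silence_ms: int         # milliseconds since speech was last detected

def turn_is_over(state: TurnState,
                 eot_threshold: float = 0.85,
                 silence_timeout_ms: int = 1200) -> bool:
    if state.eot_probability >= eot_threshold:
        return True  # model is confident the caller finished
    return state.silence_ms >= silence_timeout_ms  # hard fallback
```

The trade-off lives in those two numbers: a low threshold or short timeout cuts callers off mid-sentence, while conservative values produce the long dead air that makes the agent feel worse than an IVR.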

Latency over real telephony

We measure two dimensions: the speed at which partial and final transcripts arrive, and the total time between the customer's pause and the start of the agent's response, all over Twilio, SIP trunks, and WebRTC. 

A model that appears fast in a browser can introduce noticeable delays on a mobile network call, and that's the environment where gaps between providers show up.
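Both dimensions can be computed from event timestamps captured on the call itself. The field names and numbers below are illustrative, not Moveo's internal schema.

```python
# Sketch: the two latency dimensions described above, derived from event
# timestamps on a single turn. All values are seconds on one monotonic clock.

def call_latencies(events: dict) -> dict:
    return {
        # speed at which partial and final transcripts arrive
        "first_partial_s": events["first_partial_at"] - events["speech_start_at"],
        "final_transcript_s": events["final_transcript_at"] - events["speech_end_at"],
        # the gap the caller actually experiences before hearing a reply
        "turn_gap_s": events["agent_audio_start_at"] - events["speech_end_at"],
    }

# One turn of a hypothetical call over a SIP trunk
sample = {
    "speech_start_at": 10.00,
    "first_partial_at": 10.35,
    "speech_end_at": 14.20,
    "final_transcript_at": 14.55,
    "agent_audio_start_at": 15.10,
}
lat = call_latencies(sample)
```

Measuring all three per channel (Twilio, SIP, WebRTC) is what exposes the browser-versus-phone divergence: the same model can post a good `first_partial_s` on WebRTC and a poor `turn_gap_s` over a mobile network.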

Production readiness in regulated verticals

Four criteria are non-negotiable for the customers Moveo serves.

  • Data residency by region, covering LGPD in Brazil, GDPR in the EU, and specific requirements in the US market.

  • Stability under predictable traffic peaks, like billing due dates and seasonal campaigns, where banks and telecoms concentrate volume.

  • Observability from day one, with metrics by provider, language, and channel exported to Prometheus so regressions are detected before the customer notices them.

  • The ability to add new languages as customers expand geographically.
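To illustrate the labeling scheme behind that observability point, here is a minimal in-memory recorder keyed by (provider, language, channel). In production these observations would be exported as Prometheus histograms rather than held in memory, and the provider names and values shown are illustrative.

```python
# Sketch: latency observations recorded per (provider, language, channel),
# so a regression in one language or channel is visible against its own
# history. Values are illustrative.

from collections import defaultdict
from statistics import median

class LatencyRecorder:
    def __init__(self):
        self.series = defaultdict(list)  # (provider, language, channel) -> samples

    def observe(self, provider: str, language: str, channel: str, seconds: float):
        self.series[(provider, language, channel)].append(seconds)

    def p50(self, provider: str, language: str, channel: str) -> float:
        return median(self.series[(provider, language, channel)])

rec = LatencyRecorder()
for s in (0.42, 0.47, 0.51):
    rec.observe("deepgram", "pt-BR", "twilio_sip", s)
```

The key design choice is the label set itself: aggregating across languages or channels would hide exactly the per-language regressions the text warns about.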

Among the providers evaluated, including OpenAI, Assembly, Soniox, and Google, Deepgram met these criteria most consistently for the markets where Moveo operates, particularly in EU hosting, denoiser support, and coverage for languages with limited training data availability.

Not sure which phase your operation is in?

Take the Readiness Assessment →

The three-phase evaluation process

The evaluation Moveo runs to integrate any STT provider converges into three phases.

In the first, we compare providers in playground environments with audio samples in each language (including Deepgram alongside OpenAI, Assembly, Soniox, and Google), filtering out those that clearly don't meet the bar before investing in integration.

In the second phase, we build end-to-end agents on LiveKit with different combinations of STT, LLM, and TTS, tested via web and Twilio SIP trunks. This is where models that appeared equivalent in the playground begin to diverge in real conditions, particularly on telephony and in languages outside English.

In the third phase, the strongest candidates are wired directly into the Moveo platform with our own LLMs, WebRTC clients, and Twilio calls. We measure transcription delay, end-of-utterance delay, and agent response latency exported as time series by provider, language, and channel via Prometheus.

What tends to break between pilot and production

Four patterns appear consistently in voice AI projects that don't reach expected scale.

  • Acceptable latency in the browser becomes a noticeable problem on the phone.

  • Models with strong English performance lose consistency in languages with less training data.

  • Unstable end-of-turn behavior produces experiences that frustrate more than a conventional IVR.

  • Without instrumentation from the start, it becomes impossible to justify a stack choice to a bank, telecom, or insurer when something goes wrong, because there are no historical series to support the analysis.

Rigorous evaluation as competitive advantage

For enterprises operating in regulated sectors, the choice of voice AI infrastructure is not an isolated technical decision: it determines what can actually be delivered to the end customer in terms of resolution, experience, and compliance. 

A stack evaluated with rigor, tested over real telephony, and monitored with observability from day one reduces the risk of silent regressions, accelerates expansion into new languages and markets, and builds the foundation for the voice agent to evolve with the business.

For banks, telecoms, and utilities moving out of pilot or evaluating a first implementation, the most important step is understanding where your operation stands today in terms of maturity, coverage, and infrastructure before committing to a stack.

Talk to the Moveo team →