Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Bilingual speakers are everywhere and routinely code-switch in everyday communication, which exposes a critical gap in voice agent capabilities. This research addresses that gap by benchmarking Automatic Speech Recognition (ASR) systems on code-switched speech, and it finds that performance varies significantly across language pairs and models. ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro were the most accurate at handling mixed-language interactions.
Over half of the world's population is bilingual, and bilingual speakers frequently code-switch, moving between languages even mid-sentence. This natural form of communication is common in many settings, from casual conversations to professional environments such as contact centers and IT helpdesks. Despite its prevalence, there has been little research into how voice agents handle code-switched speech in enterprise contexts. This gap prompted the development of a new benchmark and dataset to evaluate Automatic Speech Recognition (ASR) models, which perform the first processing step in any voice agent pipeline. Accurate transcription matters because errors at this first stage propagate through the rest of the pipeline and can have significant operational consequences in enterprise settings.
The benchmark covers four language pairs relevant to the customer base: Spanish-English, French-English, Canadian French-English, and German-English. In each pair, the non-English language is the matrix (primary) language, with English segments of varying length embedded in it. The dataset spans a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including inquiries about benefits, payroll, password resets, and VPN access. Performance is measured with three metrics: Word Error Rate (WER) for transcription accuracy, Semantic Word Error Rate (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream task performance.
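Of the three metrics, WER has a standard definition: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below illustrates that definition only. The lowercasing and whitespace tokenization are simplifying assumptions, not the benchmark's actual text normalization, and SWER and AER are not sketched because their exact definitions are not given here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative Spanish-English utterance (made up, not from the dataset):
# one substituted word out of six reference words.
reference = "necesito resetear mi password del VPN"
hypothesis = "necesito resetear mi pasword del VPN"
print(round(word_error_rate(reference, hypothesis), 3))  # 0.167
```

Because WER counts every mismatched word equally, a misspelled embedded English term is penalized the same as a dropped function word, which is one reason a transcription metric is paired with meaning-level and task-level metrics.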
The research describes a systematic process for creating the code-switched dataset, starting from an internal corpus of IT support and HR interactions. Parallel user utterances in English and the four non-English languages were filtered to identify suitable code-switching candidates, keeping only utterances that were between 12 and 40 words long and contained at least three switchable content words. An LLM (OpenAI/GPT-5) with a persona prompt generated the code-switched text, followed by an LLM verbalization pass and audio synthesis with ElevenLabs Multilingual V2. Each utterance was reviewed by an AI/NLP linguist who was a native speaker of the matrix language to ensure quality and authenticity.
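The candidate filter described above can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the study's implementation: how "switchable content words" were identified is not specified, so the sketch takes a hypothetical, hand-supplied set of words, counts distinct matches, and uses naive whitespace tokenization.

```python
def is_code_switch_candidate(
    utterance: str,
    switchable_words: set,
    min_words: int = 12,
    max_words: int = 40,
    min_switchable: int = 3,
) -> bool:
    """Return True if the utterance meets the length and content-word criteria."""
    # Assumption: whitespace tokenization with surrounding punctuation stripped.
    tokens = [t.strip(".,;:!?¿¡\"'()").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]

    if not (min_words <= len(tokens) <= max_words):
        return False

    # Assumption: each switchable word counts once, however often it appears.
    distinct_hits = {t for t in tokens if t in switchable_words}
    return len(distinct_hits) >= min_switchable


# Hypothetical word list and utterance, for illustration only.
switchable = {"contraseña", "cuenta", "vpn", "laptop"}
utterance = (
    "Necesito ayuda para restablecer la contraseña de mi cuenta "
    "y configurar la VPN en mi laptop"
)
print(is_code_switch_candidate(utterance, switchable))  # True
```

Keeping the thresholds as parameters makes it easy to see how many utterances survive under stricter or looser settings before any text is sent to the generation step.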
Results from evaluating seven ASR systems, including Large Audio Language Models (LALMs) and frontier ASRs, show that the impact of code-switching depends strongly on the language pair and the model. ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro were the top performers across all metrics. The study also examined the "cost" of code-switching by comparing performance on code-switched audio against monolingual versions, which isolates the challenges introduced by language switching itself.
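One plausible way to express that cost is the difference between a system's error rate on code-switched audio and its error rate on the monolingual version. The sketch below assumes this absolute-difference definition, which may not match the study's exact formulation. The system names and scores are placeholders, not results from the benchmark.

```python
def code_switching_cost(code_switched: dict, monolingual: dict) -> dict:
    """Per-system error-rate increase attributable to code-switching.

    Both arguments map a system name to an error rate (for example WER).
    Only systems present in both mappings are compared.
    """
    shared = code_switched.keys() & monolingual.keys()
    return {name: code_switched[name] - monolingual[name] for name in sorted(shared)}


# Placeholder scores for two hypothetical systems (not measured values).
cs_scores = {"system_a": 0.20, "system_b": 0.10}
mono_scores = {"system_a": 0.15, "system_b": 0.10}

for name, cost in code_switching_cost(cs_scores, mono_scores).items():
    print(f"{name}: {cost:+.2f}")
```

A positive value means the system does worse on code-switched speech than on the monolingual version, and a value near zero means switching languages adds little difficulty for that system.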
