Research & Papers · Hugging Face - Blog · June 9, 2026

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Bilingual speakers are common and code-switch in everyday communication, yet there has been little research into how voice agents handle such speech. This research benchmarks Automatic Speech Recognition (ASR) systems on code-switched speech and finds that performance varies significantly across language pairs and models. ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro were the most accurate at handling mixed-language interactions.

Author: Morein.ai Editorial

Over half of the world's population is bilingual, and bilingual speakers frequently code-switch, moving between languages even mid-sentence. This natural form of communication appears everywhere from casual conversation to professional settings such as contact centers and IT helpdesks. Despite its prevalence, little research has examined how voice agents handle code-switched speech in enterprise contexts. That gap prompted a new benchmark and dataset for evaluating Automatic Speech Recognition (ASR) models, which handle the initial processing in any voice agent pipeline. Accurate transcription matters because errors at this first stage propagate downstream and can have significant operational consequences in enterprise settings.

The benchmark covers four language pairs relevant to the authors' customer base: Spanish-English, French-English, Canadian French-English, and German-English. In each pair, the non-English language is the matrix (primary) language, with English segments of varying length embedded in it. The dataset covers a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including inquiries about benefits, payroll, password resets, and VPN access. Performance is measured with three metrics: Word Error Rate (WER) for transcription accuracy, Semantic Word Error Rate (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream task performance.
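As a rough illustration of the first metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below assumes simple lowercasing and whitespace tokenization, which may differ from the normalization used in the benchmark. The function name and the sample sentences are illustrative, and SWER and AER are not sketched because the summary does not define how they are computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercasing plus whitespace splitting is the only
    normalization; the benchmark may normalize text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative code-switched reference and a hypothetical ASR output
# with one substituted word out of seven.
reference = "necesito hacer un reset de mi password"
hypothesis = "necesito hacer un reseteo de mi password"
wer = word_error_rate(reference, hypothesis)
```

Because the denominator is the reference length, a single wrong word in a short utterance moves WER a lot, which is one reason the benchmark pairs it with meaning-oriented metrics.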

The research outlines the systematic process of creating the code-switched dataset, starting from an internal corpus of IT support and HR interactions. Parallel user utterances in English and the four non-English languages were filtered to identify suitable code-switching candidates, ensuring utterances were between 12 and 40 words and contained at least three switchable content words. An LLM (OpenAI/GPT-5) with a persona prompt was used to generate code-switched text, followed by an LLM verbalization pass and audio synthesis using ElevenLabs Multilingual V2. Each utterance underwent review by an AI/NLP linguist who was a native speaker of the matrix language to ensure quality and authenticity.
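The candidate filter described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not say how "switchable content words" were identified, so the code assumes a hypothetical lookup set of domain terms (`switchable_terms`), counts every occurrence rather than distinct words, and uses simple whitespace tokenization.

```python
PUNCTUATION = ".,;:!?¿¡\"'()"


def is_code_switch_candidate(
    utterance: str,
    switchable_terms: set[str],
    min_words: int = 12,
    max_words: int = 40,
    min_switchable: int = 3,
) -> bool:
    """Return True if the utterance meets the length and content criteria.

    Assumptions: `switchable_terms` is a hypothetical set of lowercase
    content words that could plausibly be replaced by English; the real
    selection method is not described in the summary.
    """
    words = [w.strip(PUNCTUATION).lower() for w in utterance.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not (min_words <= len(words) <= max_words):
        return False
    hits = sum(1 for w in words if w in switchable_terms)
    return hits >= min_switchable


# Illustrative 15-word Spanish utterance and a hypothetical term list.
terms = {"contraseña", "portátil", "red"}
utterance = (
    "Necesito restablecer la contraseña de mi portátil "
    "porque la red privada no conecta desde ayer."
)
keep = is_code_switch_candidate(utterance, terms)
```

The thresholds are exposed as parameters so the 12 to 40 word window and the three-word minimum from the text can be adjusted without touching the logic.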

Results from evaluating seven ASR systems, including Large Audio Language Models (LALMs) and frontier ASRs, highlight that the impact of code-switching varies significantly based on the language pair and model. Notably, ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro emerged as the top performers across all metrics. The study also examined the "cost" of code-switching by comparing performance on code-switched audio against monolingual versions, further isolating the challenges introduced by language switching itself.
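One simple way to express the "cost" of code-switching is the difference between a system's error rate on code-switched audio and its error rate on the monolingual version of the same content. The sketch below assumes that definition (an absolute difference), which the summary does not spell out. The model names, language-pair keys, and scores are placeholders, not results from the study.

```python
Key = tuple[str, str]  # (model name, language pair)


def code_switching_cost(
    code_switched: dict[Key, float],
    monolingual: dict[Key, float],
) -> dict[Key, float]:
    """Absolute error-rate increase attributable to code-switching.

    Only (model, language pair) keys present in both inputs are compared.
    A positive value means the system did worse on code-switched audio.
    """
    shared = sorted(code_switched.keys() & monolingual.keys())
    return {key: code_switched[key] - monolingual[key] for key in shared}


# Placeholder scores for illustration only; these are NOT benchmark results.
cs_wer = {("model_a", "es-en"): 0.20, ("model_a", "fr-en"): 0.10}
mono_wer = {("model_a", "es-en"): 0.15, ("model_a", "fr-en"): 0.10}
cost = code_switching_cost(cs_wer, mono_wer)
```

Comparing against a monolingual baseline separates errors caused by the switch itself from errors the system would make on that language and domain anyway.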
