Audio samples for the paper: End-to-End Code-Switched TTS with Mix of Monolingual Recordings.

Authors: Yuewen Cao, Xixin Wu, Songxiang Liu, Jianwei Yu, Xu Li, Zhiyong Wu, Xunying Liu, Helen Meng
Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are applied to synthesize code-switched (CS) speech, their performance degrades severely. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, the end-to-end (E2E) architecture has achieved satisfactory results in monolingual TTS: it enables training directly from alphabetic text input to acoustic feature output. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative speaker code is incorporated to enable the model to generate speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of quality and speaker similarity of the generated speech. Moreover, the proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.
Corpora: an American English speech corpus from the Blizzard Challenge 2011 and a Mandarin speech corpus from Th-coss.
Notes: All utterances below are unseen by the systems during training. Since this work focuses on the efficacy of an end-to-end framework for code-switched TTS trained on a combination of monolingual data, all audio samples were synthesized with the Griffin-Lim algorithm for fast experiment cycles.
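Griffin-Lim recovers a waveform from a predicted magnitude spectrogram by iterating between the time and frequency domains, keeping the target magnitude and re-estimating only the phase. A minimal sketch using scipy (the `nperseg`/`noverlap` values and iteration count are illustrative, not the settings used for these samples):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384):
    """Reconstruct a waveform from a magnitude spectrogram by
    iterative phase re-estimation (Griffin & Lim, 1984)."""
    rng = np.random.default_rng(0)
    # Start from a random unit-magnitude phase estimate.
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase guess...
        _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
        # ...then re-analyze and keep only the phase of the result.
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        angles = np.exp(1j * np.angle(spec[:, :mag.shape[1]]))
    _, x = istft(mag * angles, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each iteration projects the signal onto the set of waveforms consistent with the given magnitudes, which is why more iterations generally reduce the characteristic phase artifacts (at the cost of slower synthesis than a trained neural vocoder).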

System Comparison

"Tac": Baseline Tacotron system.
"LDE-S1": The system is composed of a shared multilingual encoder with explicit language embedding (LDE) and a decoder conditioned on the discriminative code of the American speaker (S1).
"SPE-S1": The system is composed of separated encoders for each language (SPE) and a decoder conditioned on the discriminative code of the American speaker (S1).
"LDE-S2": The system is composed of a shared multilingual encoder with explicit language embedding (LDE) and a decoder conditioned on the discriminative code of the Mandarin speaker (S2).
"SPE-S2": The system is composed of separated encoders for each language (SPE) and a decoder conditioned on the discriminative code of the Mandarin speaker (S2).
"GT": The ground truth audio samples of American and Mandarin speaker.

Transcripts:
  Mandarin text: 挑剔大跨国公司的一些做法.
  English text: Hobby farmers work on solving sheep shearing problem.
  Code-switched text: That's why 很多人都用地铁.

Audio samples are provided for each system and transcript: Tac, LDE-S1, SPE-S1, LDE-S2, SPE-S2, and GT. For GT there is no code-switched sample, since neither ground-truth speaker recorded code-switched speech.

Accent Degree Evaluation

"LDE-S1-CH": Synthesizing Mandarin speech (CH) using LDE system and the discriminative code of the American speaker (S1) with accent coefficient set to 0, 0.5, 1.
"SPE-S1-CH": Synthesizing Mandarin speech (CH) using SPE system and the discriminative code of the American speaker (S1) with accent coefficient set to 0, 0.5, 1.
"GT-S2-CH": The ground truth audio samples of Mandarin speaker (S2).
Text: 让无视交通规则的人受到了教育.
Audio samples are provided for each accent coefficient (0, 0.5, and 1) from LDE-S1-CH and SPE-S1-CH, with GT-S2-CH as the ground-truth reference.
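The accent coefficient can be read as a character-level interpolation weight between the two languages' representations: 0 keeps the native pronunciation and 1 fully adopts the foreign one. A minimal sketch, assuming simple linear interpolation of per-language embedding vectors (`accented_embedding` and its arguments are illustrative, not the paper's exact blending mechanism):

```python
import numpy as np

def accented_embedding(emb_native, emb_foreign, alpha):
    """Blend a character's language embeddings: alpha=0 keeps the
    native representation, alpha=1 fully switches to the foreign one,
    and intermediate values give a partial accent."""
    assert 0.0 <= alpha <= 1.0, "accent coefficient must lie in [0, 1]"
    return (1.0 - alpha) * emb_native + alpha * emb_foreign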