VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Philip Anastassiou*  Zhenyu Tang*  Kainan Peng  Dongya Jia  Jiaxin Li  Ming Tu  Yuping Wang  Yuxuan Wang  Mingbo Ma
*Equal contribution. Data-Speech Team, ByteDance, San Jose, CA, USA


Paper

Abstract

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning.


Method

VoiceShop is a speech foundation model capable of a wide assortment of zero-shot synthesis tasks. At its core is a diffusion backbone model that accepts a global speaker embedding and time-varying content features as conditioning signals, enabling robust zero-shot voice conversion.
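To make this conditioning scheme concrete, below is a minimal sketch of a denoiser conditioned on both signals. The layer choices, dimensions, and the GRU backbone are illustrative assumptions for the sketch, not the actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts noise for a mel-spectrogram sequence, conditioned on a
    global speaker embedding (broadcast over time) and frame-level content features."""

    def __init__(self, n_mels=80, d_speaker=256, d_content=512, d_model=512):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        self.speaker_proj = nn.Linear(d_speaker, d_model)   # global, utterance-level condition
        self.content_proj = nn.Linear(d_content, d_model)   # time-varying condition
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.backbone = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(d_model, n_mels)

    def forward(self, noisy_mel, t, speaker_emb, content):
        # noisy_mel: (B, T, n_mels), t: (B, 1), speaker_emb: (B, d_speaker), content: (B, T, d_content)
        h = self.in_proj(noisy_mel) + self.content_proj(content)
        h = h + self.speaker_proj(speaker_emb).unsqueeze(1)  # broadcast the global condition over time
        h = h + self.time_embed(t).unsqueeze(1)              # diffusion timestep embedding
        h, _ = self.backbone(h)
        return self.out_proj(h)                              # predicted noise, same shape as the input
```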

VoiceShop Model Design

We additionally train two task-specific editing modules: a normalizing flow model that operates on the global speaker embeddings to achieve age and gender editing, and a sequence-to-sequence model that operates on the local content features to achieve accent and speech style conversion.

Attribute-Conditional Normalizing Flow for Age and Gender Editing

We observe that the speaker encoder jointly trained with our diffusion backbone model achieves strong speaker identity disentanglement, indicating that many attributes like age and gender are encoded into the speaker embedding vector. Therefore, the manipulation of these attributes can be seen as re-sampling from the learned latent space of speaker embeddings.
Flow Model Design
To this end, we employ a continuous normalizing flow (CNF) model that operates on this latent space to achieve age and gender editing.
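The sketch below illustrates the idea of editing as re-sampling from the latent space: a conditional vector field is integrated forward to map a speaker embedding into a base latent space, then backward under new attribute values. The dimensions, the attribute encoding, and the crude Euler solver are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttributeCNF(nn.Module):
    """Sketch of an attribute-conditional continuous normalizing flow over speaker embeddings."""

    def __init__(self, d_emb=256, d_attr=2, d_hidden=512, n_steps=20):
        super().__init__()
        self.n_steps = n_steps
        self.field = nn.Sequential(
            nn.Linear(d_emb + d_attr + 1, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, d_emb),
        )

    def _velocity(self, z, t, attr):
        t_col = torch.full_like(z[:, :1], t)                  # scalar time broadcast per sample
        return self.field(torch.cat([z, attr, t_col], dim=-1))

    def to_latent(self, emb, attr):
        z, dt = emb, 1.0 / self.n_steps
        for i in range(self.n_steps):                         # integrate t: 0 -> 1
            z = z + dt * self._velocity(z, i * dt, attr)
        return z

    def to_embedding(self, z, attr):
        dt = 1.0 / self.n_steps
        for i in reversed(range(self.n_steps)):               # integrate t: 1 -> 0
            z = z - dt * self._velocity(z, (i + 1) * dt, attr)
        return z

# Editing = invert with the source attributes, then decode with the edited attributes.
flow = AttributeCNF()
speaker_emb = torch.randn(1, 256)                             # embedding from the speaker encoder
src_attr = torch.tensor([[30.0, 0.0]])                        # hypothetical (age, gender) encoding
tgt_attr = torch.tensor([[60.0, 0.0]])                        # same gender, older age
edited_emb = flow.to_embedding(flow.to_latent(speaker_emb, src_attr), tgt_attr)
```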

Bottleneck-to-Bottleneck Conversion for Accent and Speech Style Conversion

We observe that time-varying content representations extracted from automatic speech recognition (ASR) models not only encode the semantics of speech signals (i.e., what is said), but also abundant pronunciation and prosody information (i.e., how it is said).

BN2BN Model Design

Our BN2BN model maps the local content features of utterances from an arbitrary number of source accents to those of an arbitrary number of target accents in a single model using a multi-decoder architecture, effectively reducing the accent conversion task to a machine translation problem.
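As a rough sketch of the multi-decoder idea, the module below uses a shared encoder over source content features and one decoder branch per target accent. The real BN2BN module is a sequence-to-sequence model whose decoders can change durations; this length-preserving version, and all layer choices and dimensions, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class BN2BN(nn.Module):
    """Sketch of a many-to-many bottleneck-to-bottleneck converter."""

    def __init__(self, d_content=512, d_model=512,
                 target_accents=("general_american", "british", "sichuan")):
        super().__init__()
        self.in_proj = nn.Linear(d_content, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Multi-decoder head: one output branch per target accent.
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_content))
            for name in target_accents
        })

    def forward(self, src_content, target_accent):
        # src_content: (B, T, d_content) content features from the ASR encoder, in any source accent.
        h = self.encoder(self.in_proj(src_content))
        return self.decoders[target_accent](h)                # converted content features
```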


Zero-Shot Voice Conversion

VoiceShop is capable of monolingual and cross-lingual zero-shot voice conversion, enabling users to convert the timbre of an utterance to that of an arbitrary target speaker without additional model finetuning. None of the speakers featured below are seen during training.

Zero-Shot Voice Conversion

Monolingual Zero-Shot Voice Conversion

In monolingual voice conversion, both source and target speech are spoken in the same language.

English-to-English

Mandarin-to-Mandarin


Cross-Lingual Zero-Shot Voice Conversion

Cross-lingual voice conversion extends the monolingual case by applying the timbre of a speaker in one language to the spoken content of a speaker in a different language. We demonstrate applying the timbre of a Mandarin speaker to English content, and vice versa. Since the global speaker embedding predicted by our diffusion backbone model collapses temporal information, we also apply this process to out-of-distribution languages not seen during training, allowing anyone to speak fluent English or Mandarin in their own voice in a zero-shot manner.
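At inference, cross-lingual conversion amounts to pairing the content features of one utterance with the speaker embedding of another. A hypothetical pipeline is sketched below; the component names are placeholders for the modules described in the Method section, not a released API.

```python
from typing import Callable
import torch

def zero_shot_convert(content_wav: torch.Tensor, timbre_wav: torch.Tensor,
                      asr_encoder: Callable, speaker_encoder: Callable,
                      diffusion_decoder: Callable, vocoder: Callable) -> torch.Tensor:
    """Content comes from one utterance, timbre from another. Because the global speaker
    embedding carries no temporal or linguistic information, the timbre reference may be
    in any language, including ones unseen during training."""
    content = asr_encoder(content_wav)         # time-varying content features (what is said)
    speaker = speaker_encoder(timbre_wav)      # single global embedding (who is speaking)
    mel = diffusion_decoder(content=content, speaker=speaker)  # conditional diffusion sampling
    return vocoder(mel)                        # waveform with the source content, target timbre
```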

English-to-Mandarin

Mandarin-to-English

Out-of-Distribution Languages-to-English

Out-of-Distribution Languages-to-Mandarin



Identity-Preserving Accent and Speech Style Conversion

VoiceShop is capable of monolingual and cross-lingual identity-preserving zero-shot many-to-many accent and speech style conversion. This is achieved through the use of a bottleneck-to-bottleneck (BN2BN) module, which maps time-varying content features from an arbitrary number of source accents to those of an arbitrary number of target accents. None of the speakers featured below are seen during training. Test samples are synthetic speech generated with publicly available Microsoft Azure TTS models in various accents.

Accent and Speech Style Conversion

Monolingual Accent Conversion

In the monolingual case, accent conversion occurs within the same language.

English-to-English

Mandarin-to-Mandarin


Cross-Lingual Accent Conversion

In cross-lingual accent conversion, we convert source accents to those of a different language in the absence of parallel data (e.g., applying a British accent to Mandarin speech using only British-accented English speech, or applying a Sichuan accent to English speech using only Sichuan-accented Mandarin speech, since recordings of British-accented Mandarin or Sichuan-accented English are not available).

English and Mandarin


Speech Style Conversion

Beyond typical accent conversion, our BN2BN model is also capable of generalized speech style transfer. There are no specific requirements for what constitutes a "speech style," which may be as broad as emotional speech or the speaking styles of iconic personalities from popular culture.

"Sarcastic Youth" Speech Style Conversion

"Formal British" Speech Style Conversion

"Cartoon Character" Speech Style Conversion



Identity-Preserving Age and Gender Editing

VoiceShop is capable of identity-preserving zero-shot age and gender editing. This is achieved through the use of an attribute-conditional normalizing flow model that operates on the global speaker embeddings predicted by our diffusion backbone model. Conversion is performed in a continuous manner, allowing users to gradually interpolate across the spectrum of these attributes. None of the speakers in the following examples are seen during training.
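Continuing the AttributeCNF sketch from the Method section (the flow, speaker_emb, and src_attr names and the attribute encoding are the same hypothetical ones defined there), continuous editing is simply a sweep over the conditioning value:

```python
import torch

# Invert once with the source attributes, then decode under a sweep of target ages.
ages = torch.linspace(20.0, 70.0, steps=6)
latent = flow.to_latent(speaker_emb, src_attr)
edited_embeddings = [flow.to_embedding(latent, torch.tensor([[age, 0.0]])) for age in ages.tolist()]
```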

Age and Gender Editing

Continuous Age Editing

Continuous Gender Editing



Combined Multi-Attribute Editing

VoiceShop is capable of combined multi-attribute editing, enabling arbitrary unseen speakers to simultaneously modify their accent, age, and gender in a single forward pass.
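A rough sketch of how the optional modules could be composed for a single decoding pass is shown below. All names are hypothetical stand-ins for the components described above, and the flow call follows the earlier AttributeCNF sketch.

```python
from typing import Callable, Optional
import torch

def edit_voice(wav: torch.Tensor, asr_encoder: Callable, speaker_encoder: Callable,
               bn2bn: Callable, flow, diffusion_decoder: Callable, vocoder: Callable,
               target_accent: Optional[str] = None,
               src_attr: Optional[torch.Tensor] = None,
               tgt_attr: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Apply any subset of the editing modules, then decode once with the diffusion backbone."""
    content = asr_encoder(wav)                      # local content features
    speaker = speaker_encoder(wav)                  # global speaker embedding (preserves timbre)
    if target_accent is not None:                   # accent / speech-style edit on the content
        content = bn2bn(content, target_accent)
    if tgt_attr is not None:                        # age / gender edit on the speaker embedding
        speaker = flow.to_embedding(flow.to_latent(speaker, src_attr), tgt_attr)
    mel = diffusion_decoder(content=content, speaker=speaker)   # single forward pass
    return vocoder(mel)
```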

Combined Multi-Attribute Editing

Simultaneous Accent, Age, and Gender Editing

We demonstrate VoiceShop's ability to simultaneously edit a speaker's accent, age, and gender in a zero-shot manner using two examples: an Australian source speaker and a British source speaker. Neither speaker is seen by the model during training.

Australian Speaker

We begin with a recording of an unseen out-of-domain speaker:


We edit various attributes of this speaker's voice using our framework. We recommend referring back to this input sample occasionally to better understand the modifications enabled by the model.

Edit One Attribute

Edit Two Attributes

Edit Three Attributes


British Speaker

Let's consider another sample from a separate out-of-domain speaker:


As above, we recommend referring back to this input sample as you explore the model outputs below.

Edit One Attribute

Edit Two Attributes

Edit Three Attributes



Ethical Considerations

As with all generative artificial intelligence systems, the real-world impact and potential for unintended misuse of models like VoiceShop must be considered. While our framework has beneficial use cases, such as providing entertainment value or lowering cross-cultural communication barriers by allowing users to speak other languages or accents in their own voice, its zero-shot capabilities could enable a user to generate misleading content with relative ease, such as synthesizing speech in the voice of an individual without their knowledge, presenting a risk of misinformation. In an effort to balance transparent and reproducible research with socially responsible practices, and due to the proprietary nature of portions of the data used in this work, we share the details of our findings here, but do not plan to publicly release the model checkpoints or implementation at this time. The authors do not condone the use of this technology for illegal or malicious purposes.