How to Clone Anyone’s Voice for FREE 🗣️ — 1000+ Pretrained Models


highzum · Joined Jan 11, 2024
Today, I’ll show you how to clone any voice using speech synthesis. Speech synthesis has reached new heights with HierSpeech++, a tool developed by the Department of Artificial Intelligence at Korea University. In this post, we will explore the features, functionality, and details that make HierSpeech++ a game-changer in zero-shot speech synthesis. The only downside so far is that it works only with English, though the developers promise to release a multi-language model soon. It addresses the limitations of traditional large language model (LLM)-based approaches, offering speed, robustness, and unmatched naturalness in synthetic speech.

Key Features:​

  1. Fast and strong: efficiency without compromising quality
  2. Amplified naturalness and speaker similarity
  3. Human-level quality in zero-shot speech synthesis
Github Repo: https://github.com/sh-lee-prml/HierSpeechpp
Demo: https://huggingface.co/spaces/HierSpeech/HierSpeech_TTS
This repository contains:
🪐 A PyTorch implementation of HierSpeech++ (TTV, Hierarchical Speech Synthesizer, SpeechSR)
⚡️ Pre-trained HierSpeech++ models trained on LibriTTS (Train-460, Train-960, and other datasets)
Gradio Demo on HuggingFace. HuggingFace provides us with a community GPU grant. Thanks 😊

Implementation Details​

Model Architecture​

HierSpeech++ utilizes a hierarchical speech synthesis framework, significantly improving the robustness and expressiveness of synthetic speech. The text-to-vec framework, coupled with a highly efficient speech super-resolution framework, contributes to the tool’s exceptional performance.
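The hierarchical flow described above (text to a semantic vector, then a waveform, then super-resolved audio) can be sketched as a simple pipeline. This is an illustrative stand-in with made-up stage bodies, not the real models; the stage names come from the repository description, while the signatures and the 3x upsampling factor are assumptions for the sketch.

```python
# Hypothetical sketch of the HierSpeech++ inference flow.
# Stage names mirror the repo's components (TTV, hierarchical
# synthesizer, SpeechSR); the bodies are placeholders.

def text_to_vec(text: str, noise_scale: float = 0.333) -> list[float]:
    """TTV: map text (plus a prosody prompt) to a semantic representation."""
    return [float(ord(c)) * noise_scale for c in text]  # placeholder

def hierarchical_synthesizer(vec: list[float], voice_prompt: list[float]) -> list[float]:
    """Generate a low-rate waveform conditioned on the voice prompt."""
    bias = sum(voice_prompt) / max(len(voice_prompt), 1)
    return [v + bias for v in vec]  # placeholder

def speech_sr(wav: list[float], factor: int = 3) -> list[float]:
    """SpeechSR: upsample the waveform (naive sample repetition here)."""
    return [s for s in wav for _ in range(factor)]

def synthesize(text: str, voice_prompt: list[float]) -> list[float]:
    vec = text_to_vec(text)
    wav = hierarchical_synthesizer(vec, voice_prompt)
    return speech_sr(wav)

wav = synthesize("hello", [0.1, 0.2])
print(len(wav))  # 5 characters -> 15 samples after 3x upsampling
```

The point of the hierarchy is that each stage solves a narrower problem: TTV handles content and prosody, the synthesizer handles speaker identity, and SpeechSR only has to add bandwidth.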


https://sh-lee-prml.github.io/HierSpeechpp-demo/

Pre-trained Models​

The repository provides pre-trained HierSpeech++ models trained on LibriTTS, catering to different datasets and scenarios.


Boosting Prompting Mechanisms​

Mega-TTS, a groundbreaking approach in zero-shot text-to-speech (TTS), addresses key challenges faced by existing methods. Unlike traditional zero-shot TTS, which often struggles with single-sentence prompts, Mega-TTS leverages a powerful acoustic autoencoder to separate prosody and timbre information, offering flexibility for multi-sentence prompts. The introduction of a multi-reference timbre encoder and a prosody latent language model enhances the extraction of valuable information. Results from experiments showcase Mega-TTS’s ability to generate identity-preserving speech with short prompts, outperforming fine-tuning methods across data volumes from 10 seconds to 5 minutes. Moreover, this method enables precise and controlled transfer of diverse speaking styles to the desired timbre. Check out the demo page for audio samples illustrating the impressive capabilities of Mega-TTS.


Model Overview : https://boostprompt.github.io/boostprompt/

Installation​

Ensure you have PyTorch (>=1.13) and torchaudio (>=0.13) installed. Additional requirements can be installed using the provided requirements.txt file.

pip install -r requirements.txt
pip install phonemizer
sudo apt-get install espeak-ng
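Before running anything, it can help to sanity-check the version floors above (PyTorch >= 1.13, torchaudio >= 0.13). The helper below is my own small sketch; it compares version strings directly, so it works without importing either library:

```python
# Minimal pre-flight check for the stated version floors.
# Compares only major.minor and strips local suffixes like "+cu118".

def meets_floor(version: str, floor: str) -> bool:
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split(".")[:2])
    return parse(version) >= parse(floor)

print(meets_floor("2.1.0+cu118", "1.13"))  # True
print(meets_floor("0.12.1", "0.13"))       # False
```

In practice you would feed it `torch.__version__` and `torchaudio.__version__` after installation.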
For better robustness, the authors recommend a noise_scale of 0.333; for better expressiveness, 0.667. Find the parameters that best fit your style prompt. Run the following command for text-to-speech synthesis:

sh inference.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference.py \
--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
--ckpt_text2w2v "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth" \
--output_dir "tts_results_eng_kor_v2" \
--noise_scale_vc "0.333" \
--noise_scale_ttv "0.333" \
--denoise_ratio "0"
Explore different checkpoints for varied results, such as LibriTTS-460, LibriTTS-960, and Large_v1 epoch 60.
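If you script many runs, for example sweeping noise_scale or swapping checkpoints from the commented list, the command above can be assembled programmatically. A minimal sketch, assuming the flag names exactly as shown in inference.sh; the helper name is my own:

```python
import shlex

def build_tts_command(ckpt: str, ttv_ckpt: str, output_dir: str,
                      noise_scale: float = 0.333,
                      denoise_ratio: float = 0.0) -> str:
    """Assemble the inference.py invocation shown above as a shell string."""
    args = [
        "python3", "inference.py",
        "--ckpt", ckpt,
        "--ckpt_text2w2v", ttv_ckpt,
        "--output_dir", output_dir,
        "--noise_scale_vc", str(noise_scale),
        "--noise_scale_ttv", str(noise_scale),
        "--denoise_ratio", str(denoise_ratio),
    ]
    return shlex.join(args)  # quotes paths safely if they contain spaces

cmd = build_tts_command(
    "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth",
    "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth",
    "tts_results_eng_kor_v2",
)
print(cmd)
```

Pass the result to your shell (or run the list form directly with `subprocess.run`) once the repository and checkpoints are in place.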
HierSpeech++ also excels at voice conversion. Voice conversion is vulnerable to a noisy target prompt, so the authors recommend applying a denoiser when the prompt is noisy. For noisy source speech, the YAAPT pitch tracker may extract an incorrect F0, degrading output quality. Use the following command:

sh inference_vc.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
--output_dir "vc_results_eng_kor_v2" \
--noise_scale_vc "0.333" \
--noise_scale_ttv "0.333" \
--denoise_ratio "0"
Experiment with checkpoints for voice conversion scenarios.
https://drive.google.com/drive/folders/1-L_90BlCkbPyKWWHTUjt5Fsu3kz0du0w?usp=sharing
Download links for more specific checkpoints (HierSpeech2, TTV, and SpeechSR-24k/SpeechSR-48k models) are listed in the GitHub repo.

TTV-v2 (Work in Progress)​

TTV-v1 served as a simple yet high-quality TTS model. Acknowledging room for improvement, TTV-v2 is underway with modifications and enhancements:

  1. Model size increased from 107M to 278M.
  2. Intermediate hidden size adjusted from 256 to 384.
  3. Loss masking introduced for wav2vec reconstruction loss.
  4. Fine-tuning with the full LibriTTS-train dataset for long-sentence generation.
  5. Multi-lingual Dataset training with Eng, Indic, and Kor datasets.

Demo​

https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS


Download all samples from LibriTTS test-clean and test-other:
https://drive.google.com/drive/folders/1xCrZQy9s5MT38RMQxKAtkoWUgxT5qYYW?usp=sharing