How to Clone Anyone’s Voice for FREE 🗣️ — 1000+ Pretrained Models


highzum · Joined Jan 11, 2024
Today, I’ll show you how to clone any voice using speech synthesis. Speech synthesis has reached new heights with HierSpeech++, a tool developed by the Department of Artificial Intelligence at Korea University. In this post, we will explore the features, functionality, and details that make HierSpeech++ a game-changer in zero-shot speech synthesis. The only downside so far is that it works only with English, though the developers promise to release a multi-language model soon. It addresses the limitations of traditional large language model (LLM)-based approaches, offering speed, robustness, and unmatched naturalness in synthetic speech.

Key Features:​

  1. Fast and strong: efficiency without compromising quality
  2. Amplified naturalness and speaker similarity
  3. Human-level quality in zero-shot speech synthesis
Github Repo: https://github.com/sh-lee-prml/HierSpeechpp
Demo: https://huggingface.co/spaces/HierSpeech/HierSpeech_TTS
This repository contains:
🪐 A PyTorch implementation of HierSpeech++ (TTV, Hierarchical Speech Synthesizer, SpeechSR)
⚡️ Pre-trained HierSpeech++ models trained on LibriTTS (Train-460, Train-960, and other datasets)
Gradio Demo on HuggingFace. HuggingFace provides us with a community GPU grant. Thanks 😊

Implementation Details​

Model Architecture​

HierSpeech++ utilizes a hierarchical speech synthesis framework, significantly improving the robustness and expressiveness of synthetic speech. The text-to-vec framework, coupled with a highly efficient speech super-resolution framework, contributes to the tool’s exceptional performance.
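The hierarchical flow described above (text to a semantic vector, then a waveform, then super-resolved audio) can be sketched as a simple pipeline. This is an illustrative stand-in with made-up stage bodies, not the real models; the stage names come from the repository description, while the signatures and the 3x upsampling factor are assumptions for the sketch.

```python
# Hypothetical sketch of the HierSpeech++ inference flow.
# Stage names mirror the repo's components (TTV, hierarchical
# synthesizer, SpeechSR); the bodies are placeholders.

def text_to_vec(text: str, noise_scale: float = 0.333) -> list[float]:
    """TTV: map text (plus a prosody prompt) to a semantic representation."""
    return [float(ord(c)) * noise_scale for c in text]  # placeholder

def hierarchical_synthesizer(vec: list[float], voice_prompt: list[float]) -> list[float]:
    """Generate a low-rate waveform conditioned on the voice prompt."""
    bias = sum(voice_prompt) / max(len(voice_prompt), 1)
    return [v + bias for v in vec]  # placeholder

def speech_sr(wav: list[float], factor: int = 3) -> list[float]:
    """SpeechSR: upsample the waveform (naive sample repetition here)."""
    return [s for s in wav for _ in range(factor)]

def synthesize(text: str, voice_prompt: list[float]) -> list[float]:
    vec = text_to_vec(text)
    wav = hierarchical_synthesizer(vec, voice_prompt)
    return speech_sr(wav)

wav = synthesize("hello", [0.1, 0.2])
print(len(wav))  # 5 characters -> 15 samples after 3x upsampling
```

The point of the hierarchy is that each stage solves a narrower problem: TTV handles content and prosody, the synthesizer handles speaker identity, and SpeechSR only has to add bandwidth.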


https://sh-lee-prml.github.io/HierSpeechpp-demo/

Pre-trained Models​

The repository provides pre-trained HierSpeech++ models trained on LibriTTS, catering to different datasets and scenarios.


Boosting Prompting Mechanisms​

Mega-TTS, a groundbreaking approach in zero-shot text-to-speech (TTS), addresses key challenges faced by existing methods. Unlike traditional zero-shot TTS, which often struggles with single-sentence prompts, Mega-TTS leverages a powerful acoustic autoencoder to separate prosody and timbre information, offering flexibility for multi-sentence prompts. The introduction of a multi-reference timbre encoder and a prosody latent language model enhances the extraction of valuable information. Results from experiments showcase Mega-TTS’s ability to generate identity-preserving speech with short prompts, outperforming fine-tuning methods across data volumes from 10 seconds to 5 minutes. Moreover, this method enables precise and controlled transfer of diverse speaking styles to the desired timbre. Check out the demo page for audio samples illustrating the impressive capabilities of Mega-TTS.


Model Overview : https://boostprompt.github.io/boostprompt/

Installation​

Ensure you have PyTorch (>=1.13) and torchaudio (>=0.13) installed. Additional requirements can be installed using the provided requirements.txt file.

pip install -r requirements.txt
pip install phonemizer
sudo apt-get install espeak-ng
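Before running anything, it can help to sanity-check the version floors above (PyTorch >= 1.13, torchaudio >= 0.13). The helper below is my own small sketch; it compares version strings directly, so it works without importing either library:

```python
# Minimal pre-flight check for the stated version floors.
# Compares only major.minor and strips local suffixes like "+cu118".

def meets_floor(version: str, floor: str) -> bool:
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split(".")[:2])
    return parse(version) >= parse(floor)

print(meets_floor("2.1.0+cu118", "1.13"))  # True
print(meets_floor("0.12.1", "0.13"))       # False
```

In practice you would feed it `torch.__version__` and `torchaudio.__version__` after installation.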
For better robustness, the authors recommend a noise_scale of 0.333; for better expressiveness, 0.667. Find the parameters that best fit your style prompt. Run the following command for text-to-speech synthesis:

sh inference.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference.py \
--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
--ckpt_text2w2v "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth" \
--output_dir "tts_results_eng_kor_v2" \
--noise_scale_vc "0.333" \
--noise_scale_ttv "0.333" \
--denoise_ratio "0"
Explore different checkpoints for varied results, such as LibriTTS-460, LibriTTS-960, and Large_v1 epoch 60.
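If you script many runs, for example sweeping noise_scale or swapping checkpoints from the commented list, the command above can be assembled programmatically. A minimal sketch, assuming the flag names exactly as shown in inference.sh; the helper name is my own:

```python
import shlex

def build_tts_command(ckpt: str, ttv_ckpt: str, output_dir: str,
                      noise_scale: float = 0.333,
                      denoise_ratio: float = 0.0) -> str:
    """Assemble the inference.py invocation shown above as a shell string."""
    args = [
        "python3", "inference.py",
        "--ckpt", ckpt,
        "--ckpt_text2w2v", ttv_ckpt,
        "--output_dir", output_dir,
        "--noise_scale_vc", str(noise_scale),
        "--noise_scale_ttv", str(noise_scale),
        "--denoise_ratio", str(denoise_ratio),
    ]
    return shlex.join(args)  # quotes paths safely if they contain spaces

cmd = build_tts_command(
    "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth",
    "logs/ttv_libritts_v1/ttv_lt960_ckpt.pth",
    "tts_results_eng_kor_v2",
)
print(cmd)
```

Pass the result to your shell (or run the list form directly with `subprocess.run`) once the repository and checkpoints are in place.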
HierSpeech++ also excels at voice conversion. Voice conversion is vulnerable to a noisy target prompt, so the authors recommend applying a denoiser when the prompt is noisy. For noisy source speech, the YAAPT pitch tracker may extract an incorrect F0, degrading output quality. Use the following command:

sh inference_vc.sh

# --ckpt "logs/hierspeechpp_libritts460/hierspeechpp_lt460_ckpt.pth" \ LibriTTS-460
# --ckpt "logs/hierspeechpp_libritts960/hierspeechpp_lt960_ckpt.pth" \ LibriTTS-960
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1_ckpt.pth" \ Large_v1 epoch 60 (paper version)
# --ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \ Large_v1.1 epoch 200 (20. Nov. 2023)

CUDA_VISIBLE_DEVICES=0 python3 inference_vc.py \
--ckpt "logs/hierspeechpp_eng_kor/hierspeechpp_v1.1_ckpt.pth" \
--output_dir "vc_results_eng_kor_v2" \
--noise_scale_vc "0.333" \
--noise_scale_ttv "0.333" \
--denoise_ratio "0"
Experiment with checkpoints for voice conversion scenarios.
https://drive.google.com/drive/folders/1-L_90BlCkbPyKWWHTUjt5Fsu3kz0du0w?usp=sharing
Download links for more specific checkpoints (HierSpeech2, TTV, and SpeechSR-24k/SpeechSR-48k models) are listed in the GitHub repo.

TTV-v2 (Work in Progress)​

TTV-v1 served as a simple yet high-quality TTS model. Acknowledging room for improvement, TTV-v2 is underway with modifications and enhancements:

  1. Model size increased from 107M to 278M.
  2. Intermediate hidden size adjusted from 256 to 384.
  3. Loss masking introduced for wav2vec reconstruction loss.
  4. Fine-tuning with the full LibriTTS-train dataset for long-sentence generation.
  5. Multi-lingual Dataset training with Eng, Indic, and Kor datasets.

Demo​

https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS


Download all samples from LibriTTS test-clean and test-other:
https://drive.google.com/drive/folders/1xCrZQy9s5MT38RMQxKAtkoWUgxT5qYYW?usp=sharing