
Muyan-TTS: Personalized Podcast Speech Training and Synthesis

General Introduction

Muyan-TTS is an open-source text-to-speech (TTS) model designed for podcast scenarios. It is pre-trained on more than 100,000 hours of podcast audio and supports zero-shot speech synthesis that produces high-quality, natural-sounding speech. The model is built on Llama-3.2-3B and combined with a SoVITS decoder. Muyan-TTS also supports personalized voice customization from tens of minutes of single-speaker speech data, so it can be adapted to a specific timbre. The project is released under the Apache 2.0 license with the complete training code, data processing pipeline, and model weights. It is hosted on GitHub, Hugging Face, and ModelScope, and encourages further development and community contributions.



 

Feature List

  • Zero-shot speech synthesis: Generate high-quality podcast-style speech without additional training, adapting to a wide range of reference voices.
  • Personalized voice customization: Generate a specific speaker's voice by fine-tuning on minutes to tens of minutes of single-speaker voice data.
  • Fast inference: About 0.33 seconds of compute per second of generated audio on a single NVIDIA A100 GPU, which is faster than several other open-source TTS models.
  • Open source training code: Provides a complete training process from base model to fine-tuned model with support for developer customization.
  • Data processing pipeline: Integration with Whisper, FunASR and NISQA to clean and transcribe podcast audio data.
  • API Deployment Support: Provides API tools for easy integration into podcasts or other voice applications.
  • Open model weights: The Muyan-TTS and Muyan-TTS-SFT model weights can be downloaded from Hugging Face and ModelScope.
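
To make the speed figure concrete: reading it as roughly 0.33 seconds of compute per second of generated audio (a real-time factor, or RTF, of about 0.33), the time needed to synthesize a clip can be estimated as below. This is a back-of-the-envelope sketch, not a benchmark. The function name is hypothetical, and real throughput will depend on the hardware and on whether vLLM acceleration is enabled.

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.33) -> float:
    """Estimate compute time for a clip, given a real-time factor (RTF).

    RTF = compute time / audio duration; 0.33 is the A100 figure quoted above.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


if __name__ == "__main__":
    # A 30-minute podcast episode is 1800 seconds of audio.
    seconds = estimated_synthesis_seconds(30 * 60)
    print(f"Estimated synthesis time: {seconds / 60:.1f} minutes")
```

Under this assumption, a 30-minute episode would take on the order of ten minutes of GPU time.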

 

Usage Guide

Installation process

Install Muyan-TTS on Linux; Ubuntu is recommended. The detailed steps are as follows:

  1. Clone the repository
    Open a terminal and run the following commands to clone the Muyan-TTS repository:

    git clone https://github.com/MYZY-AI/Muyan-TTS.git
    cd Muyan-TTS

  2. Create a virtual environment
    Create a Python 3.10 virtual environment with Conda:

    conda create -n muyan-tts python=3.10 -y
    conda activate muyan-tts

  3. Install dependencies
    Run the following command to install the project dependencies:

    make build


    Note: FFmpeg must also be installed. On Ubuntu:

    sudo apt update
    sudo apt install ffmpeg

  4. Download the pre-trained models
    Download the model weights (Muyan-TTS and Muyan-TTS-SFT are available on Hugging Face and ModelScope) and place them, together with chinese-hubert-base, in the pretrained_models directory:

    pretrained_models
    ├── chinese-hubert-base
    ├── Muyan-TTS
    └── Muyan-TTS-SFT

  5. Verify the installation
    After ensuring that all dependencies and model files are properly installed, inference or training can be performed.
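
One way to check the model files before running inference is a small script like the one below. It is a minimal sketch that only checks the three directory names from the layout shown in the download step; the function name is hypothetical, and it does not validate the contents of the weight files.

```python
from pathlib import Path

# Directory names taken from the layout shown in the download step.
EXPECTED_MODEL_DIRS = ("chinese-hubert-base", "Muyan-TTS", "Muyan-TTS-SFT")


def missing_model_dirs(root: str = "pretrained_models") -> list[str]:
    """Return the expected model directories that are absent or empty under root."""
    root_path = Path(root)
    missing = []
    for name in EXPECTED_MODEL_DIRS:
        model_dir = root_path / name
        # A directory that exists but holds no files is treated as missing too.
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing or empty model directories:", ", ".join(missing))
    else:
        print("All expected model directories are present.")
```

Run it from the repository root so that the relative pretrained_models path resolves correctly.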

Using the base model (zero-shot speech synthesis)

The Muyan-TTS base model supports zero-shot speech synthesis, which makes it suitable for quickly generating podcast-style speech. The steps are as follows:

  1. Prepare the input text and reference audio
    Prepare the text to synthesize (text), a reference audio file (ref_wav_path), and the transcript of the reference audio (prompt_text). Example:

    ref_wav_path="assets/Claire.wav"
    prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige."
    text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
    
  2. Run the inference command
    Use the following command to generate speech with the base model (model_type is base):

    python tts.py
    

    Or just run the core inference code:

    async def main(model_type, model_path):
        # Inference is the inference class provided by the Muyan-TTS repository
        tts = Inference(model_type, model_path, enable_vllm_acc=False)
        wavs = await tts.generate(
            ref_wav_path="assets/Claire.wav",
            prompt_text="Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
            text="Welcome to the captivating world of podcasts, let's embark on this exciting journey together."
        )
        output_path = "logs/tts.wav"
        # Write the first item yielded by the generator to disk
        with open(output_path, "wb") as f:
            f.write(next(wavs))
        print(f"Speech generated in {output_path}")
    

    The generated speech is saved to logs/tts.wav.

  3. Replace the reference audio
    In zero-shot mode, ref_wav_path can be replaced with a recording of any speaker, and the model will imitate that speaker's timbre when generating new speech.
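
Before swapping in a different reference clip, it can help to confirm that the file is a readable WAV and to see its basic properties. The sketch below uses only Python's standard wave module, which handles uncompressed PCM WAV files. The helper name is hypothetical, and the source does not state any required sample rate or duration for reference audio, so the script only reports values rather than enforcing limits.

```python
import os
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "duration_seconds": frames / rate if rate else 0.0,
        }


if __name__ == "__main__":
    ref = "assets/Claire.wav"  # reference clip used in the examples above
    if os.path.exists(ref):
        print(describe_wav(ref))
    else:
        print(f"{ref} not found; run this from the repository root.")
```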

Using the SFT model (personalized voice customization)

The SFT model is fine-tuned on single-speaker speech data and is suited to generating a specific timbre. The procedure is as follows:

  1. Prepare the training data
    Collect at least a few minutes of single-speaker speech data and save it in WAV format. As an example, LibriSpeech's dev-clean dataset can be downloaded with:

    wget --no-check-certificate https://www.openslr.org/resources/12/dev-clean.tar.gz
    

    After extracting the archive, set librispeech_dir in prepare_sft_dataset.py to the extraction path.

  2. Generate the training data
    Run the following command to process the data and generate data/tts_sft_data.json:

    ./train.sh
    

    The data format is as follows:

    {
        "file_name": "path/to/audio.wav",
        "text": "Transcript of the audio"
    }
    
  3. Adjust the training configuration
    Edit training/sft.yaml to set parameters such as the learning rate and batch size.
  4. Start training
    train.sh automatically runs the following command to start training:

    llamafactory-cli train training/sft.yaml
    

    After training is complete, the model weights are saved in pretrained_models/Muyan-TTS-new-SFT.

  5. Copy the SoVITS weights
    Copy the base model's sovits.pth into the new model directory:

    cp pretrained_models/Muyan-TTS/sovits.pth pretrained_models/Muyan-TTS-new-SFT
    
  6. Run inference
    When generating speech with the SFT model, keep ref_wav_path consistent with the speaker used in training:

    python tts.py --model_type sft
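
The record format from step 2 can be illustrated by collecting file_name/text pairs from one LibriSpeech speaker directory. This is a hedged sketch, not the project's prepare_sft_dataset.py: the function name and example path are hypothetical, it keeps LibriSpeech's original .flac paths, and it assumes the JSON file holds a list of such records. It relies on the standard LibriSpeech layout, where each chapter folder contains a .trans.txt file with one "utterance-id TRANSCRIPT" line per audio file.

```python
import json
from pathlib import Path


def build_records(speaker_dir: str) -> list[dict]:
    """Collect {"file_name", "text"} records for one LibriSpeech speaker directory."""
    records = []
    # Each chapter folder holds one "<speaker>-<chapter>.trans.txt" transcript file.
    for trans_file in sorted(Path(speaker_dir).rglob("*.trans.txt")):
        for line in trans_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # The first token is the utterance id; the rest is the transcript.
            utt_id, _, text = line.strip().partition(" ")
            audio_path = trans_file.parent / f"{utt_id}.flac"
            if audio_path.exists() and text:
                records.append({"file_name": str(audio_path), "text": text})
    return records


if __name__ == "__main__":
    # Hypothetical example path inside the extracted dev-clean archive.
    speaker_dir = Path("LibriSpeech/dev-clean/1272")
    if speaker_dir.is_dir():
        print(json.dumps(build_records(str(speaker_dir))[:2], indent=2))
    else:
        print(f"{speaker_dir} not found; adjust the path to your extraction directory.")
```

Restricting the records to a single speaker matches the requirement that SFT data come from one person.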
    

Deployment via API

Muyan-TTS supports API deployment for easy integration into applications. The steps are as follows:

  1. Start the API service
    Run the following command to start the service. The default port is 8020:

    python api.py --model_type base
    

    Service logs are written to logs/llm.log.

  2. Send Request
    Use the following Python code to send the request:

    import requests

    payload = {
        "ref_wav_path": "assets/Claire.wav",
        "prompt_text": "Although the campaign was not a complete success, it did provide Napoleon with valuable experience and prestige.",
        "text": "Welcome to the captivating world of podcasts, let's embark on this exciting journey together.",
        "temperature": 0.6,
        "speed": 1.0,
    }
    url = "http://localhost:8020/get_tts"
    response = requests.post(url, json=payload)
    response.raise_for_status()  # fail early on HTTP errors
    # The response body is the generated audio
    with open("logs/tts.wav", "wb") as f:
        f.write(response.content)
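
For integration into a larger application, the same request can be wrapped in reusable functions. The sketch below uses only the standard library (urllib) instead of requests, adds a timeout, and reuses the /get_tts route, port 8020, and payload fields from the example above. The function names are hypothetical, and it assumes, as the example does, that the response body is the raw audio.

```python
import json
import urllib.request

API_URL = "http://localhost:8020/get_tts"  # default port and route from the example above


def build_tts_request(text: str, ref_wav_path: str, prompt_text: str,
                      temperature: float = 0.6, speed: float = 1.0,
                      url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request with the same fields as the example payload."""
    payload = {
        "ref_wav_path": ref_wav_path,
        "prompt_text": prompt_text,
        "text": text,
        "temperature": temperature,
        "speed": speed,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request: urllib.request.Request, output_path: str,
               timeout: float = 120.0) -> int:
    """Send the request and save the response body; return the bytes written."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(output_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the service running, calling synthesize(build_tts_request(text, ref_wav_path, prompt_text), "logs/tts.wav") would save the generated audio. Separating request construction from sending keeps the payload easy to inspect and test without a live server.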
    

Caveats

  • Hardware requirements: An NVIDIA A100 (40 GB) or equivalent GPU is recommended for inference.
  • Language restriction: Because the training data is predominantly English, only English input is currently supported.
  • Data quality: The speech quality of the SFT model relies on the clarity and consistency of the training data.
  • Training costs: The total cost of pre-training was approximately $50,000, including data processing ($30,000), LLM pre-training ($19,200), and decoder training ($1,340).

 

Application Scenarios

  1. Podcast Content Creation
    Muyan-TTS quickly converts podcast scripts into natural speech for independent creators to generate high-quality audio. Users can generate podcast-style speech by simply entering text and a reference voice, reducing recording costs.
  2. Audiobook production
    With the SFT model, creators can customize the voice of specific characters to generate audiobook chapters. The model supports long audio generation, which is suitable for long content.
  3. Voice assistant development
    Developers can integrate Muyan-TTS into voice assistants via APIs to provide a natural and personalized voice interaction experience.
  4. Educational content generation
    Schools or training organizations can convert teaching materials to speech to generate audio for listening exercises or course explanations, suitable for language learning or visually impaired people.

 

FAQ

  1. What languages does Muyan-TTS support?
    Currently only English is supported, as the training data is dominated by English podcast audio. Other languages can be supported in the future by expanding the dataset.
  2. How can I improve the speech quality of SFT models?
    Use high-quality, clear single-speaker speech data without background noise. Make sure the training data matches the speaking style of the target scenario.
  3. What should I do if inference is slow?
    Use a GPU environment that supports vLLM acceleration. Check GPU memory usage with nvidia-smi to make sure the model is actually loaded onto the GPU.
  4. Does it support commercial use?
    Muyan-TTS is distributed under the Apache 2.0 license, which permits commercial use subject to the terms of the license.
  5. How do I get technical support?
    Submit an issue on GitHub or join the Discord community for support.
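
For the slow-inference question above, the GPU memory check can also be scripted. The sketch below calls nvidia-smi with its CSV query options and parses the used and total memory (in MiB) for each GPU. The function names are hypothetical, and the parser assumes every field is numeric, which may not hold on devices where nvidia-smi reports a value as unavailable.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into one (used, total) tuple per GPU."""
    usage = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpu_memory_usage() -> list[tuple[int, int]]:
    """Return per-GPU memory usage, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(gpu_memory_usage()):
        print(f"GPU {index}: {used} MiB used of {total} MiB")
```

If memory usage stays near zero while the service is running, the model is probably not on the GPU.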