Tortoise TTS (Text-to-Speech) running locally on an NVIDIA GPU (GeForce RTX 4070)
(This setup also uses DeepSpeed, which requires the CUDA toolkit in order to be built.)

git clone
cd tortoise-tts

Edit the Dockerfile so that it reads as follows:


# Use a base image that includes the CUDA runtime
FROM nvidia/cuda:12.2.0-base-ubuntu22.04

# Set working directory

# Install necessary packages, including the CUDA toolkit and build-essential for compiling software
RUN apt-get update && \
    apt-get install -y --allow-unauthenticated --no-install-recommends \
    wget \
    git \
    build-essential \
    cuda-toolkit-12-2 \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*

# Set up environment variables
ENV HOME "/root"
ENV CONDA_DIR "${HOME}/miniconda"
ENV TORTOISE_MODELS_DIR="$HOME/tortoise-tts/build/lib/tortoise/models"

# Install Miniconda
RUN wget -O /tmp/ \
    && bash /tmp/ -b -p "${CONDA_DIR}" -f -u \
    && "${CONDA_DIR}/bin/conda" init bash \
    && rm -f /tmp/ \
    && echo ". '${CONDA_DIR}/etc/profile.d/'" >> "${HOME}/.profile"

# --login option used to source bashrc (thus activating conda env) at every RUN statement
SHELL ["/bin/bash", "--login", "-c"]

# Create and activate the Conda environment, install dependencies, and DeepSpeed
RUN conda create --name tortoise python=3.9 numba inflect -y \
    && conda activate tortoise \
    && conda install --yes pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia \
    && conda install --yes transformers=4.31.0 \
    && pip install deepspeed

Build the image, then create the directories that will be mounted into the container:

docker build . -t tort-tts-ctk-ds
mkdir -p results .cache/huggingface

Create a tortoise/ file.  

This is just a fun script that keeps the model loaded in GPU VRAM and prompts for text to synthesize. It is deliberately simple: each time you paste text and press Enter, the output .wav files from the previous prompt are overwritten. The outputs are written to the 'results' directory.


import argparse
import os

import torch
import torchaudio

from api import TextToSpeech, MODELS_DIR
from import load_voices

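def process_text(tts, text, voice, preset, candidates, cvvp_amount, output_path):
    """Generate audio for `text` with each requested voice and save the results as .wav files."""
    # A comma separates voices that are processed one after another.
    selected_voices = voice.split(',')
    for k, selected_voice in enumerate(selected_voices):
        # An ampersand joins several voices into a single blended voice.
        if '&' in selected_voice:
            voice_sel = selected_voice.split('&')
        else:
            voice_sel = [selected_voice]
        voice_samples, conditioning_latents = load_voices(voice_sel)

        gen, dbg_state = tts.tts_with_preset(text, k=candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
                                             preset=preset, return_deterministic_state=True, cvvp_amount=cvvp_amount)
        # Several candidates come back as a list; a single result is one tensor.
        if isinstance(gen, list):
            for j, g in enumerate(gen):
                torchaudio.save(os.path.join(output_path, f'{selected_voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000)
        else:
            torchaudio.save(os.path.join(output_path, f'{selected_voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000)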

def str2bool(value):
    """Parse a command-line boolean. With type=bool, argparse treats any non-empty string (even 'False') as True."""
    if isinstance(value, bool):
        return value
    if value.lower() in ('true', 't', 'yes', 'y', '1'):
        return True
    if value.lower() in ('false', 'f', 'no', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError(f'Expected a boolean value, got {value!r}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
                        'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast')
    parser.add_argument('--use_deepspeed', type=str2bool, help='Use deepspeed for speed bump.', default=False)
    parser.add_argument('--kv_cache', type=str2bool, help='If you disable this, expect to wait a long time for the output.', default=True)
    parser.add_argument('--half', type=str2bool, help='float16 (half) precision inference. If True, it is faster and takes less VRAM and RAM.', default=True)
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
                        'should only be specified if you have custom checkpoints.', default=MODELS_DIR)
    parser.add_argument('--candidates', type=int, help='How many output candidates to produce per-voice.', default=3)
    parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output. '
                        'Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled)', default=0.0)
    args = parser.parse_args()

    # Ensure output directory exists
    os.makedirs(args.output_path, exist_ok=True)

    # Load the TTS model once so it stays in GPU memory between prompts
    tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half)
    print("Model loaded. Enter text to synthesize (or type 'exit' to quit):")

    while True:
        # Wait for user input
        input_text = input("Input text: ")
        if input_text.lower() == 'exit':
            break

        # Process the input text
        process_text(tts, input_text, args.voice, args.preset, args.candidates, args.cvvp_amount, args.output_path)
        print("Output generated.")
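
The --voice syntax handled in process_text can be tried out without loading the model. The following is a minimal sketch and not part of Tortoise: parse_voice_spec is a hypothetical helper name that mirrors the splitting logic of the script above, where a comma separates independent runs and & joins voices into one blend. The voice names are only examples.

```python
def parse_voice_spec(voice):
    """Split a --voice argument into groups of voice names.

    Hypothetical helper that mirrors the splitting done in process_text:
    commas separate independent runs, '&' joins voices within one run.
    """
    groups = []
    for selected_voice in voice.split(','):
        if '&' in selected_voice:
            groups.append(selected_voice.split('&'))
        else:
            groups.append([selected_voice])
    return groups


print(parse_voice_spec('geralt'))           # [['geralt']]
print(parse_voice_spec('geralt&emma,tom'))  # [['geralt', 'emma'], ['tom']]
```

Each inner list is what the script would pass to load_voices for one run.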


docker run --gpus all \
    -e TORTOISE_MODELS_DIR=/models \
    -v $(pwd)/tortoise/models:/models \
    -v $(pwd)/results:/results \
    -v $(pwd)/.cache/huggingface:/root/.cache/huggingface \
    -v $(pwd):/app \
    -it tort-tts-ctk-ds

Once at the bash prompt inside the running container:

# conda activate tortoise
# cd /app
# python install
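
Before running the script, it can be worth confirming that PyTorch inside the container actually sees the GPU. A minimal sketch, assuming it is run with the tortoise environment activated; gpu_report is a hypothetical helper name, and the function simply reports that PyTorch is missing if it cannot be found.

```python
import importlib.util


def gpu_report():
    """Return a small dict describing whether PyTorch can see a CUDA device.

    Hypothetical helper, not part of Tortoise.
    """
    report = {'torch_installed': False, 'cuda_available': False, 'device_name': None}
    # Skip the import entirely when PyTorch is not installed in this environment.
    if importlib.util.find_spec('torch') is None:
        return report
    import torch
    report['torch_installed'] = True
    report['cuda_available'] = torch.cuda.is_available()
    if report['cuda_available']:
        # Name of the first visible GPU.
        report['device_name'] = torch.cuda.get_device_name(0)
    return report


print(gpu_report())
```

If cuda_available is False inside the container, the --gpus all flag and the host driver setup are the first things to check.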

The first run of the script takes longer because DeepSpeed is built at that point. Subsequent runs are faster.

python tortoise/ --output_path /results --preset ultra_fast --voice geralt --use_deepspeed True

Now look in the 'results' directory and you should see three files:

Each is a slightly different variation of the same voice (Geralt).
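
The file names follow from the naming pattern in process_text, {voice}_{k}_{j}.wav, where k is the position of the voice in the comma-separated list and j is the candidate index. A minimal sketch that lists the names to expect for a given run; expected_outputs is a hypothetical helper, and the single-candidate branch assumes that the API returns a list only when more than one candidate is requested.

```python
def expected_outputs(voice, candidates=3):
    """List the .wav names the script would write for one prompt.

    Hypothetical helper mirroring the f-strings in process_text.
    Assumption: a single candidate is returned as one tensor, not a list.
    """
    names = []
    for k, selected_voice in enumerate(voice.split(',')):
        if candidates > 1:
            names.extend(f'{selected_voice}_{k}_{j}.wav' for j in range(candidates))
        else:
            names.append(f'{selected_voice}_{k}.wav')
    return names


print(expected_outputs('geralt'))  # ['geralt_0_0.wav', 'geralt_0_1.wav', 'geralt_0_2.wav']
```

Because the names depend only on the voice and the two indices, a second prompt with the same voice writes to the same files, which is why earlier outputs are overwritten.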