Tortoise TTS (Text-to-Speech) Running Locally on an Nvidia GPU (GeForce RTX 4070)
(This setup also uses DeepSpeed, which requires the CUDA toolkit in order to be built.)
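Before starting, it's worth confirming that Docker can see the GPU at all. This assumes the NVIDIA Container Toolkit is already installed on the host (installing it is not covered by the steps below):
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If the nvidia-smi output lists your RTX 4070, the container runtime is set up correctly.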
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
Edit the Dockerfile
Dockerfile
# Use a base image that includes the CUDA runtime
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
# Set working directory
WORKDIR /app
# Install necessary packages, including the CUDA toolkit and build-essential for compiling software
RUN apt-get update && \
    apt-get install -y --allow-unauthenticated --no-install-recommends \
        wget \
        git \
        build-essential \
        cuda-toolkit-12-2 \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*
# Set up environment variables
ENV HOME="/root"
ENV CONDA_DIR="${HOME}/miniconda"
ENV PATH="$CONDA_DIR/bin":$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false
ENV PIP_DOWNLOAD_CACHE="$HOME/.pip/cache"
ENV TORTOISE_MODELS_DIR="$HOME/tortoise-tts/build/lib/tortoise/models"
# Install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda3.sh \
    && bash /tmp/miniconda3.sh -b -p "${CONDA_DIR}" -f -u \
    && "${CONDA_DIR}/bin/conda" init bash \
    && rm -f /tmp/miniconda3.sh \
    && echo ". '${CONDA_DIR}/etc/profile.d/conda.sh'" >> "${HOME}/.profile"
# --login option used to source bashrc (thus activating conda env) at every RUN statement
SHELL ["/bin/bash", "--login", "-c"]
# Create and activate the Conda environment, install dependencies, and DeepSpeed
RUN conda create --name tortoise python=3.9 numba inflect -y \
    && conda activate tortoise \
    && conda install --yes pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=12.1 -c pytorch -c nvidia \
    && conda install --yes transformers=4.31.0 \
    && pip install deepspeed
Build the image:
docker build . -t tort-tts-ctk-ds
Create the directories that will back the volume mounts used later:
mkdir -p results .cache/huggingface
Create a tortoise/persistent_tts.py file.
This is just a fun script that keeps the model loaded in GPU VRAM and presents a prompt: paste in text, and the model generates the audio. The script is very simple, so each time you paste text and hit Enter, the previous output .wav files are overwritten. The outputs land in the 'results' directory.
tortoise/persistent_tts.py
import argparse
import os

import torchaudio

from api import TextToSpeech, MODELS_DIR
from utils.audio import load_voices


def str2bool(value):
    # argparse's type=bool is a trap: bool('False') is True. Parse the
    # string explicitly so '--use_deepspeed False' actually means False.
    return str(value).lower() in ('true', '1', 'yes')


def process_text(tts, text, voice, preset, candidates, cvvp_amount, output_path):
    selected_voices = voice.split(',')
    for k, selected_voice in enumerate(selected_voices):
        # A '&' joins two voices into a single conditioning set.
        if '&' in selected_voice:
            voice_sel = selected_voice.split('&')
        else:
            voice_sel = [selected_voice]
        voice_samples, conditioning_latents = load_voices(voice_sel)

        gen, dbg_state = tts.tts_with_preset(text, k=candidates, voice_samples=voice_samples,
                                             conditioning_latents=conditioning_latents, preset=preset,
                                             return_deterministic_state=True, cvvp_amount=cvvp_amount)
        if isinstance(gen, list):
            for j, g in enumerate(gen):
                torchaudio.save(os.path.join(output_path, f'{selected_voice}_{k}_{j}.wav'), g.squeeze(0).cpu(), 24000)
        else:
            torchaudio.save(os.path.join(output_path, f'{selected_voice}_{k}.wav'), gen.squeeze(0).cpu(), 24000)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--voice', type=str, help='Selects the voice to use for generation. See options in voices/ directory (and add your own!) '
                        'Use the & character to join two voices together. Use a comma to perform inference on multiple voices.', default='random')
    parser.add_argument('--preset', type=str, help='Which voice preset to use.', default='fast')
    parser.add_argument('--use_deepspeed', type=str2bool, help='Use DeepSpeed for a speed bump.', default=False)
    parser.add_argument('--kv_cache', type=str2bool, help='If you disable this, expect a very long wait for output.', default=True)
    parser.add_argument('--half', type=str2bool, help='float16 (half) precision inference: faster and uses less VRAM and RAM.', default=True)
    parser.add_argument('--output_path', type=str, help='Where to store outputs.', default='results/')
    parser.add_argument('--model_dir', type=str, help='Where to find pretrained model checkpoints. Tortoise automatically downloads these to .models, so this '
                        'should only be specified if you have custom checkpoints.', default=MODELS_DIR)
    parser.add_argument('--candidates', type=int, help='How many output candidates to produce per voice.', default=3)
    parser.add_argument('--cvvp_amount', type=float, help='How much the CVVP model should influence the output. '
                        'Increasing this can in some cases reduce the likelihood of multiple speakers. Defaults to 0 (disabled).', default=0.0)
    args = parser.parse_args()

    # Ensure the output directory exists
    os.makedirs(args.output_path, exist_ok=True)

    # Load the TTS model once and keep it resident in VRAM
    tts = TextToSpeech(models_dir=args.model_dir, use_deepspeed=args.use_deepspeed, kv_cache=args.kv_cache, half=args.half)
    print("Model loaded. Enter text to synthesize (or type 'exit' to quit):")

    while True:
        # Wait for user input
        input_text = input("Input text: ")
        if input_text.lower() == 'exit':
            break
        # Process the input text
        process_text(tts, input_text, args.voice, args.preset, args.candidates, args.cvvp_amount, args.output_path)
        print("Output generated.")

    print("Exiting...")
Run the container with GPU access, mounting the models, results, and Hugging Face cache directories:
docker run --gpus all \
  -e TORTOISE_MODELS_DIR=/models \
  -v $(pwd)/tortoise/models:/models \
  -v $(pwd)/results:/results \
  -v $(pwd)/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd):/app \
  -it tort-tts-ctk-ds
Once at the running container's bash prompt, activate the environment and install Tortoise:
conda activate tortoise
cd /app
python setup.py install
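Optionally, confirm that PyTorch can see the GPU before launching the script (a quick sanity check, not part of the original steps):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
It should print the PyTorch version followed by True.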
The first time you run the script it will take noticeably longer, since DeepSpeed compiles its CUDA extensions on first use (this is why the CUDA toolkit was installed in the image). Subsequent runs will be faster.
python tortoise/persistent_tts.py --output_path /results --preset ultra_fast --voice geralt --use_deepspeed True
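Because the script keeps the model resident in VRAM between prompts, you can watch the memory footprint from the host while it runs:
nvidia-smi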
Now look in the 'results' directory and you should see three files:
geralt_0_0.wav
geralt_0_1.wav
geralt_0_2.wav
Each is a slightly different take on the same voice (Geralt); you get one file per candidate, and --candidates defaults to 3.
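If you want a quick look at what was produced without leaving the container, torchaudio can read the file metadata. A minimal sketch (the path assumes the /results mount from the docker run command above):
import torchaudio
# Read metadata without decoding the whole file
info = torchaudio.info('/results/geralt_0_0.wav')
print(f'{info.sample_rate} Hz, {info.num_frames / info.sample_rate:.1f} seconds')
You should see 24000 Hz, matching the sample rate the script passes to torchaudio.save.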