Convert Text to Audio on Linux – Coqui TTS

In a separate post on my blog, I documented a simple way to convert text to audio using Google’s Text-To-Speech service. This works for small projects. But, during a recent conversion, I ran into an error where my requests were getting blocked by Google for using the service too much. I didn’t realize they capped the usage on that. I don’t blame them for it. But it did mean I needed an alternative option for longer conversions.

I did a little digging and found another option that is completely local (meaning, on your own machine): TTS by Coqui. Basically, this is a Text-To-Speech engine powered by AI that runs locally. The setup is more complicated than my previous version, but the results are quite impressive. Here is how you get this working on Linux (I’m using Kubuntu 24.04).

First, you need to install some packages:

sudo apt update
sudo apt install python3 python3-pip python3-venv

I was familiar with all of these except the last one, which is a virtual environment package for Python. That is important!

Once those are installed, you then create a virtual environment to work in with the following commands:

python3 -m venv coqui-env
source coqui-env/bin/activate

This creates a virtual environment called “coqui-env” and then activates it in your current shell. Inside that environment, you then install Coqui TTS:

pip install TTS

FYI, I got an error when trying to install TTS on Kubuntu 24.04 because it ships with Python 3.12, which was too new for TTS. If you hit the same error, you can install an older version of Python from the deadsnakes PPA and build the virtual environment with it, like this:

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-venv python3.10-dev
python3.10 -m venv coqui-env
source coqui-env/bin/activate
pip install --upgrade pip
pip install TTS
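
If you are not sure whether your interpreter will trigger this problem, a quick check along these lines may help. This is only a sketch: the supported range used here (3.9 up to, but not including, 3.12) is an assumption, so verify it against the TTS release you are installing. The function name is made up for illustration.

```python
import sys

# Assumed supported range for Coqui TTS; verify against the release you install.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)

def python_ok_for_tts(version=None):
    """Return True if the (major, minor) version falls inside the assumed range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE

# Report on the interpreter that is running this snippet.
current = ".".join(map(str, sys.version_info[:3]))
if python_ok_for_tts():
    print(f"Python {current} is inside the assumed range.")
else:
    print(f"Python {current} is outside the assumed range; consider an older Python.")
```

Run it with the same interpreter you plan to build the virtual environment from (for example, python3 or python3.10).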

Assuming TTS installs correctly, you can then test it:

tts --text "This is a test." --out_path test.wav

If that works, you then have a number of options for moving forward. First, you probably want to find a voice you like. Each voice has a designation (for example, p230), and the following command lets you test one by passing its designation to --speaker_idx. You can hear samples of the voices here.

tts --model_name "tts_models/en/vctk/vits" \
     --text "This is a test" \
     --speaker_idx p230 \
     --out_path voice230.wav

FYI, you can get a list of the voice designations with this command:

tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs

Some of my favorites after listening to all of the voice options (as of 2025/06/13) in the vits model were:

p230 – US male (ish)
p244 – British female
p254 – US male (I think)
p263 – British female
p270 – US female
p273 – British female
p303 – British female (though it needs to be sped up)
p304 – US female
p314 – British female
p316 – US female (I think)
p318 – British male*
p335 – US female (I think)*
p339 – British female
p340 – US male*
p360 – US female
p364 – British female*

[NOTE: There are ten voices that are gender neutral (ish). Of those, these were my favorites: p308 (US), p276 (US), p351 (British), p363 (US), and p374 (US).]
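
If you want to compare several voices side by side, typing the tts command once per voice gets tedious. The sketch below builds one command per voice using only the flags shown earlier in this post. The function name, the output file naming, and the three voices in the shortlist are illustrative choices, not anything required by TTS.

```python
import shlex

MODEL = "tts_models/en/vctk/vits"

def sample_command(speaker, text="This is a test."):
    """Build the tts command line (as a list of arguments) for one voice sample."""
    return [
        "tts",
        "--model_name", MODEL,
        "--text", text,
        "--speaker_idx", speaker,
        "--out_path", f"voice_{speaker}.wav",
    ]

# Any designations from the list above; these three are just examples.
shortlist = ["p230", "p270", "p340"]

for spk in shortlist:
    # Print each command so it can be reviewed or pasted into a shell.
    print(shlex.join(sample_command(spk)))
```

To run the commands directly instead of printing them, you could pass each list to subprocess.run(cmd, check=True) from inside the activated virtual environment.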

Once you find the voice that you want to use, you can use the following script to convert multiple text files to audio. (NOTE: In the Python script below, you can’t use the “p###” designation to indicate the voice you want. You have to use the number of that voice’s position in the model’s speaker list instead. So, for instance, p340 above is “96”. You have to enter 96 instead of p340.)
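
If you would rather not look the number up by hand, a small helper can translate a designation into its position. This is a sketch under assumptions: the helper name is made up, the example list is invented for illustration, and the whitespace stripping is there only in case some entries in the real list carry stray whitespace.

```python
def speaker_index(speakers, designation):
    """Return the position of a voice designation such as "p340" in a speaker list."""
    # Strip whitespace in case an entry has a trailing newline or space.
    cleaned = [name.strip() for name in speakers]
    if designation not in cleaned:
        raise ValueError(f"Unknown speaker: {designation}")
    return cleaned.index(designation)

# Made-up list for illustration; in the real script you would pass tts.speakers.
example_speakers = ["ED\n", "p225", "p226", "p227"]
print(speaker_index(example_speakers, "p226"))
```

In the conversion script, the result of speaker_index(tts.speakers, "p340") could be used in place of a hard-coded number.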

First, activate your virtual environment:

source coqui-env/bin/activate

Make sure you have the following installed before you do this:

pip install TTS pydub
sudo apt install ffmpeg

Then copy the script below to a file. You can name it whatever you’d like, but make sure the file extension is .py. I called mine “convert_chapters_to_mp3.py”:

import os
from TTS.api import TTS
from pydub import AudioSegment

# --- USER CONFIGURATION ---
model_name = "tts_models/en/vctk/vits"  # Change if you prefer another model
speaker_idx = 96                        # Integer position in the speaker list (96 is p340, per the note above)
length_scale = 1.0                      # <1.0 = faster, >1.0 = slower

input_folder = "cleaned_chapters" # Change to whatever you named your input folder containing the TXT files
output_folder = "audiobook_chapters" # Change to whatever you want to call your output folder
os.makedirs(output_folder, exist_ok=True)

# Initialize TTS
tts = TTS(model_name)

# Show available speakers if it's a multi-speaker model
if tts.is_multi_speaker:
    print("Available speakers:")
    for i, spk in enumerate(tts.speakers):
        print(f"{i}: {spk}")
    speaker = tts.speakers[speaker_idx]
else:
    speaker = None

# Loop through all .txt files
for filename in sorted(os.listdir(input_folder)):
    if filename.endswith(".txt"):
        base_name = os.path.splitext(filename)[0]
        txt_path = os.path.join(input_folder, filename)
        wav_path = os.path.join(output_folder, base_name + ".wav")
        mp3_path = os.path.join(output_folder, base_name + ".mp3")

        print(f"Converting {filename}...")

        with open(txt_path, "r", encoding="utf-8") as f:
            text = f.read().strip()

        if not text:
            print(f"⚠️ Skipping empty file: {filename}")
            continue

        tts.tts_to_file(
            text=text,
            speaker=speaker,
            file_path=wav_path,
            length_scale=length_scale
        )

        # Convert WAV to MP3
        sound = AudioSegment.from_wav(wav_path)
        sound.export(mp3_path, format="mp3", bitrate="64k")
        os.remove(wav_path)

print("✅ All files converted to MP3.")

Make sure the script is stored in the same folder as your “cleaned_chapters” folder (next to it, not inside it), since the script looks for the input folder relative to where you run it. If you want to use a different folder name, you can, but update input_folder in the script accordingly. Then, once you have everything ready, run the script from that folder on the command line like this:

python3 convert_chapters_to_mp3.py

The script should loop through all of the TXT files in your input folder, process them, and convert them to audio (mp3).

Just a warning – this takes a long, long time. A short TXT file of a few words will go fairly quickly. But a TXT file with, say, 8 to 10 thousand words may take an hour or more, depending on how powerful your computer is. If you have a lot of text files to convert, be prepared to let this script run for a long time (overnight, over a few days, etc.). This is a processing-intensive script.
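
Because a run can take this long, it helps to be able to stop and restart without redoing finished chapters. The helper below is a minimal sketch of that idea: it lists only the TXT files that do not yet have a matching MP3 in the output folder. The function name is made up, and it assumes the same file naming as the script above (chapter.txt becomes chapter.mp3).

```python
import os

def pending_txt_files(input_folder, output_folder):
    """List .txt files in input_folder that have no matching .mp3 in output_folder."""
    pending = []
    for filename in sorted(os.listdir(input_folder)):
        if not filename.endswith(".txt"):
            continue
        base_name = os.path.splitext(filename)[0]
        mp3_path = os.path.join(output_folder, base_name + ".mp3")
        # Only keep files whose MP3 has not been written yet.
        if not os.path.exists(mp3_path):
            pending.append(filename)
    return pending
```

In the script above, the loop line "for filename in sorted(os.listdir(input_folder)):" could then become "for filename in pending_txt_files(input_folder, output_folder):", so a restarted run picks up where the last one stopped.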
