Installation Guide

Get started with Kani TTS in minutes. Follow our comprehensive installation guide to set up the text-to-speech model on your system.

Prerequisites

System Requirements

Python Environment

Python 3.8 or higher is required for running Kani TTS. We recommend using Python 3.10 or 3.11 for optimal compatibility.

python --version

GPU Requirements

A GPU with at least 2 GB of VRAM is recommended for optimal performance. The model has been tested on an NVIDIA GeForce RTX 5080 with 16 GB of GPU memory.

nvidia-smi

CUDA Support

CUDA 12.8 or a compatible version is required for GPU acceleration. Ensure the appropriate CUDA drivers are installed on your system.

nvcc --version
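
Once PyTorch is installed (Step 1 below), you can also confirm from Python that it sees the GPU and which CUDA build it was compiled against. This is a quick sanity check rather than part of the official setup:

import torch

# Quick sanity check: does PyTorch see a CUDA-capable GPU, and which CUDA build is it using?
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))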

Supported Platforms

Windows

Windows 10/11 with Python 3.8+

macOS

macOS 10.15+ with Python 3.8+

Linux

Ubuntu 18.04+ or equivalent with Python 3.8+

Core Dependencies Installation

Step 1: Install Core Dependencies

Install the essential packages required for Kani TTS to function properly.

# Core dependencies
pip install torch librosa soundfile numpy huggingface_hub
pip install "nemo_toolkit[tts]"

Step 2: Install Custom Transformers

CRITICAL: Kani TTS requires a custom transformers build (installed from the GitHub repository below) that supports the "lfm2" model type; the model will not work properly without it.

# CRITICAL: Custom transformers build required for "lfm2" model type
pip install -U "git+https://github.com/huggingface/transformers.git"

Step 3: Optional Web Interface

For a browser-based interface with real-time audio playback, install these additional packages.

# Optional: For web interface
pip install fastapi uvicorn

Quick Start Guide

Basic Usage

Generate Audio with Default Text

Run the basic example with built-in sample text to test your installation.

python basic/main.py

Generate Audio with Custom Text

Provide your own text input for speech generation.

python basic/main.py --prompt "Hello world! My name is Kani, I'm a speech generation model!"

What Happens Next

1. Model Loading: The TTS model loads into memory and initializes the processing pipeline.

2. Speech Generation: The system generates speech from the provided text using the neural network.

3. Audio Output: The audio is saved as generated_audio_YYYYMMDD_HHMMSS.wav in the current directory.
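
If you want to confirm the result programmatically, the soundfile package installed with the core dependencies can read the generated file. A minimal check, assuming the output was written to the current directory as described above:

import glob
import soundfile as sf

# Pick the most recently generated output file and report its duration and sample rate
latest = sorted(glob.glob("generated_audio_*.wav"))[-1]
audio, sample_rate = sf.read(latest)
print(f"{latest}: {len(audio) / sample_rate:.2f} s at {sample_rate} Hz")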

Web Interface Setup

Start the FastAPI Server

Launch the web interface for browser-based interaction with real-time audio playback.

# Start the FastAPI server
python fastapi_example/server.py

The server will be available at http://localhost:8000.

Web Interface Features

Interactive text input with example prompts
Parameter adjustment (temperature, max tokens)
Real-time audio generation and playback
Download functionality for generated audio
Server health monitoring

To access the interface, open fastapi_example/client.html in your browser.
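
To verify the server is reachable before opening the client, you can query the interactive API docs that FastAPI serves at /docs by default. A minimal check, assuming server.py does not disable that route:

import urllib.request

# FastAPI exposes interactive API docs at /docs unless they are explicitly disabled
with urllib.request.urlopen("http://localhost:8000/docs") as response:
    print("Server is up, HTTP status:", response.status)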

Configuration Options

Default Configuration

Model Settings

Model: nineninesix/kani-tts-450m-0.1-pt
Sample Rate: 22,050 Hz
Max Tokens: 1200
Temperature: 1.4

Model Variants

Base Model (Default)

nineninesix/kani-tts-450m-0.1-pt

Generates random voices

Female Voice

nineninesix/kani-tts-450m-0.2-ft

Specialized for female voice characteristics

Male Voice

nineninesix/kani-tts-450m-0.1-ft

Specialized for male voice characteristics
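
The model weights are hosted on the Hugging Face Hub, and huggingface_hub is already installed as a core dependency, so you can optionally pre-download a variant into the local cache before running the scripts. A minimal sketch, using the female-voice variant as an example:

from huggingface_hub import snapshot_download

# Optionally pre-download a model variant into the local Hugging Face cache
local_dir = snapshot_download(repo_id="nineninesix/kani-tts-450m-0.2-ft")
print("Model files cached at:", local_dir)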

Customizing Model Configuration

To use a different model, modify the ModelConfig class in config.py:

# Example: Switching to female voice model
class ModelConfig:
    model_name = "nineninesix/kani-tts-450m-0.2-ft"
    sample_rate = 22050
    max_tokens = 1200
    temperature = 1.4

Troubleshooting

Common Issues

CUDA Out of Memory

Reduce batch size or use CPU mode if GPU memory is insufficient.
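
Where the device is selected depends on config.py and the example scripts, so treat the following only as the general pattern for falling back to the CPU when CUDA is unavailable or short on memory:

import torch

# General pattern: prefer the GPU, but fall back to the CPU when CUDA is not usable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)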

Model Loading Errors

Ensure the custom transformers package is properly installed.

Audio Quality Issues

Check sample rate settings and ensure proper audio drivers are installed.

Performance Tips

GPU Optimization

Use GPU acceleration whenever available for faster processing.

Memory Management

Monitor VRAM usage and adjust parameters accordingly (a monitoring sketch follows these tips).

Batch Processing

Process multiple texts in batches for improved efficiency.
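
To follow the memory-management tip above, PyTorch can report how much VRAM the process is using. A minimal sketch that only applies when a CUDA device is active:

import torch

# Report current and peak VRAM usage of this process in MiB
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"VRAM allocated: {allocated:.0f} MiB (peak: {peak:.0f} MiB)")
else:
    print("No CUDA device available; running on CPU")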

Ready to Get Started?

Now that you have Kani TTS installed, explore the demo and start generating high-quality speech from text.