Kani TTS 2

A text-to-speech model with frame-level position encodings, optimized for real-time conversations.

Apache 2.0 License · Created by nineninesix.ai · GitHub

Interactive Demo

Available on Hugging Face Spaces.

Overview

KaniTTS-2 employs a two-stage pipeline that combines a Large Language Model (LLM) with a Finite Scalar Quantization (FSQ) audio codec. This architecture enables high-quality speech synthesis suitable for real-time applications.

Core Architecture

  • Two-stage pipeline design
  • LLM-based token generation
  • FSQ audio codec integration (see the FSQ sketch after this list)
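
For intuition, the core idea behind FSQ can be sketched in a few lines: each latent channel is bounded and rounded to one of a small, fixed number of levels, with a straight-through estimator keeping the operation differentiable. This is an illustrative sketch of the general technique, not the actual codec implementation, and the level counts are made up.

import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    # Bound each channel to [-(L-1)/2, (L-1)/2], then round to the nearest
    # integer level; the straight-through trick keeps gradients flowing.
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()

# Illustrative only: 4 latent channels with hypothetical level counts
codes = fsq_quantize(torch.randn(2, 4), levels=[8, 8, 8, 5])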

Key Specifications

  • Model Size: 400M parameters
  • Sample Rate: 22 kHz
  • Languages: English

Audio Examples & Speakers

Speakers by region:

  • 🇺🇸 Boston: Frank
  • 🇺🇸 Oakland: Jermaine
  • 🏴 Glasgow: Rory
  • 🏴 Liverpool: Baddy
  • 🇺🇸 New York: Chelsea
  • 🇺🇸 San Francisco: Andrew

Installation & Usage

Install via pip

pip install kani-tts-2
pip install -U "transformers==4.56.0"

Quick Generation

from kani_tts import KaniTTS

# Initialize model
model = KaniTTS('repo/model')

# Generate speech
audio, text = model("Hello, world!")

# Save to file
model.save_audio(audio, "output.wav")

What's New in KaniTTS-2?

Speaker Embeddings

True voice control through learned speaker representations. Users can clone any voice with a reference audio sample, removing the need for fine-tuning per speaker.

Frame-Level Position Encoding

Audio tokens are organized into frames (4 tokens per frame by default), using Kani's frame-level audio-step encoding. This provides precise temporal control and alignment.
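
A rough illustration of the idea (a sketch, not the repository's code; TOKENS_PER_FRAME mirrors the 4-token default mentioned above):

import torch

TOKENS_PER_FRAME = 4  # default frame size

def frame_position_ids(num_audio_tokens: int) -> torch.Tensor:
    # Every token receives the index of the frame it belongs to, so the
    # 4 tokens of one frame share a single temporal position.
    return torch.arange(num_audio_tokens) // TOKENS_PER_FRAME

print(frame_position_ids(12))  # tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])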

Learnable RoPE Theta

Implements per-layer frequency scaling for Rotary Position Embeddings. Each layer learns its own alpha parameter, improving the model's ability to handle dependencies across the network depth.
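
One common way to parameterize this is a per-layer scalar that rescales the RoPE base theta before the inverse frequencies are computed. The sketch below shows that general pattern; the exact parameterization in KaniTTS-2 may differ, and the names here are illustrative.

import torch
import torch.nn as nn

class LearnableRoPETheta(nn.Module):
    # Each transformer layer owns one of these, so each layer learns its
    # own frequency scaling for rotary position embeddings.
    def __init__(self, head_dim: int, base_theta: float = 10000.0):
        super().__init__()
        self.head_dim = head_dim
        self.base_theta = base_theta
        self.alpha = nn.Parameter(torch.zeros(1))  # learned scaling, one per layer

    def forward(self, positions: torch.Tensor):
        theta = self.base_theta * torch.exp(self.alpha)  # alpha > 0 stretches, < 0 compresses
        exponents = torch.arange(0, self.head_dim, 2, dtype=torch.float32) / self.head_dim
        inv_freq = 1.0 / (theta ** exponents)
        freqs = torch.outer(positions.float(), inv_freq)
        return freqs.cos(), freqs.sin()  # rotated into queries/keys as usual

# Example: per-layer frequencies for the first 16 positions
rope = LearnableRoPETheta(head_dim=64)
cos, sin = rope(torch.arange(16))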

Extended Generation

Capable of generating up to 40 seconds of continuous high-quality audio.

Voice Cloning

KaniTTS-2 allows for voice cloning by extracting speaker characteristics from a reference audio file.

from kani_tts import KaniTTS, SpeakerEmbedder

# Initialize models
model = KaniTTS('repo/model')
embedder = SpeakerEmbedder()

# Extract speaker embedding from reference audio
speaker_embedding = embedder.embed_audio_file("reference_voice.wav")

# Generate speech with cloned voice
audio, text = model(
    "This is a cloned voice speaking!",
    speaker_emb=speaker_embedding
)
model.save_audio(audio, "cloned_voice.wav")

💡 Tips for Best Results

  • Use 10-20 seconds of clean reference audio.
  • Avoid background noise or music.
  • Reference audio can be any sample rate; the system automatically resamples it to 16 kHz (see the optional snippet below).
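
The library handles this resampling automatically; if you prefer to inspect or clean up the reference yourself first, one way to produce a 16 kHz mono copy with librosa and soundfile (file names are placeholders) is:

import librosa
import soundfile as sf

# Load the reference audio and resample it to 16 kHz mono
audio, sr = librosa.load("reference_voice.wav", sr=16000, mono=True)
sf.write("reference_voice_16k.wav", audio, sr)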

Technical Architecture

Generation Process

  1. Input: text + optional language tag + speaker embedding
  2. Tokenization: text tokens processed with special control tokens
  3. LLaMA-based Causal LM: applies learnable RoPE and frame-level position encoding to generate audio token sequences
  4. NeMo NanoCodec Decoder: converts the 4-token frames into continuous waveforms
  5. Output: 22 kHz high-fidelity audio

Performance Metrics

Inference benchmarks measured on Nvidia RTX 5080 hardware.

  • Real-Time Factor (RTF): ~0.2
  • VRAM Usage: 3 GB
  • Training Hours: ~10k
  • Training Time: 6h on 8x H100
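
RTF is the ratio of synthesis time to the duration of the generated audio, so ~0.2 corresponds to roughly 5x faster than real time. A minimal way to measure it on your own hardware, assuming the quickstart API above and that `audio` is a 1-D sample array at 22,050 Hz (both are assumptions, not verified against the package):

import time
from kani_tts import KaniTTS

model = KaniTTS('repo/model')  # same initialization as the quickstart above

start = time.perf_counter()
audio, text = model("A sentence long enough to benchmark synthesis speed.")
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 22050   # assumes a 1-D sample array at 22,050 Hz
rtf = elapsed / audio_seconds        # lower is faster; ~0.2 means about 5x real time
print(f"RTF: {rtf:.2f}")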

Use Cases & Limitations

Ideal for

  • Conversational AI and Chatbots requiring low latency.
  • Research into voice, accent, or emotion cloning.
  • Real-time interactive applications.

Limitations

  • Performance may degrade for generations longer than 40 seconds.
  • Prosody or pronunciation may reflect biases in the training data.
  • Optimized primarily for English.