Kani TTS 2

A text-to-speech model with frame-level position encodings, optimized for real-time conversations.

Apache 2.0 License · Created by nineninesix.ai · GitHub

Interactive Demo

Available on Hugging Face Spaces.

Overview

KaniTTS-2 employs a two-stage pipeline that combines a Large Language Model (LLM) with a Finite Scalar Quantization (FSQ) audio codec. This architecture enables high-quality speech synthesis suitable for real-time applications.

Core Architecture

  • Two-stage pipeline design
  • LLM-based token generation
  • FSQ audio codec integration (see the FSQ sketch after this list)
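
For intuition, the core idea behind FSQ can be sketched in a few lines: each latent channel is bounded and rounded to one of a small, fixed number of levels, with a straight-through estimator keeping the operation differentiable. This is an illustrative sketch of the general technique, not the actual codec implementation, and the level counts are made up.

import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    # Bound each channel to [-(L-1)/2, (L-1)/2], then round to the nearest
    # integer level; the straight-through trick keeps gradients flowing.
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()

# Illustrative only: 4 latent channels with hypothetical level counts
codes = fsq_quantize(torch.randn(2, 4), levels=[8, 8, 8, 5])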

Key Specifications

  • Model Size: 400M parameters
  • Sample Rate: 22 kHz
  • Languages: English

Audio Examples & Speakers

Speakers by region:

  • 🇺🇸 Boston: Frank
  • 🇺🇸 Oakland: Jermaine
  • 🏴 Glasgow: Rory
  • 🏴 Liverpool: Baddy
  • 🇺🇸 New York: Chelsea
  • 🇺🇸 San Francisco: Andrew

Installation & Usage

Install via pip

pip install kani-tts-2
pip install -U "transformers==4.56.0"

Quick Generation

from kani_tts import KaniTTS

# Initialize model
model = KaniTTS('repo/model')

# Generate speech
audio, text = model("Hello, world!")

# Save to file
model.save_audio(audio, "output.wav")

What's New in KaniTTS-2?

Speaker Embeddings

True voice control through learned speaker representations. Users can clone any voice with a reference audio sample, removing the need for fine-tuning per speaker.

Frame-Level Position Encoding

Audio tokens are organized into frames (4 tokens per frame by default), using Kani's frame-level audio-step encoding. This provides precise temporal control and alignment.
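
A rough illustration of the idea (a sketch, not the repository's code; TOKENS_PER_FRAME mirrors the 4-token default mentioned above):

import torch

TOKENS_PER_FRAME = 4  # default frame size

def frame_position_ids(num_audio_tokens: int) -> torch.Tensor:
    # Every token receives the index of the frame it belongs to, so the
    # 4 tokens of one frame share a single temporal position.
    return torch.arange(num_audio_tokens) // TOKENS_PER_FRAME

print(frame_position_ids(12))  # tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])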

Learnable RoPE Theta

Implements per-layer frequency scaling for Rotary Position Embeddings. Each layer learns its own alpha parameter, improving the model's ability to handle dependencies across the network depth.
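
One common way to parameterize this is a per-layer scalar that rescales the RoPE base theta before the inverse frequencies are computed. The sketch below shows that general pattern; the exact parameterization in KaniTTS-2 may differ, and the names here are illustrative.

import torch
import torch.nn as nn

class LearnableRoPETheta(nn.Module):
    # Each transformer layer owns one of these, so each layer learns its
    # own frequency scaling for rotary position embeddings.
    def __init__(self, head_dim: int, base_theta: float = 10000.0):
        super().__init__()
        self.head_dim = head_dim
        self.base_theta = base_theta
        self.alpha = nn.Parameter(torch.zeros(1))  # learned scaling, one per layer

    def forward(self, positions: torch.Tensor):
        theta = self.base_theta * torch.exp(self.alpha)  # alpha > 0 stretches, < 0 compresses
        exponents = torch.arange(0, self.head_dim, 2, dtype=torch.float32) / self.head_dim
        inv_freq = 1.0 / (theta ** exponents)
        freqs = torch.outer(positions.float(), inv_freq)
        return freqs.cos(), freqs.sin()  # rotated into queries/keys as usual

# Example: per-layer frequencies for the first 16 positions
rope = LearnableRoPETheta(head_dim=64)
cos, sin = rope(torch.arange(16))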

Extended Generation

Capable of generating up to 40 seconds of continuous high-quality audio.

Voice Cloning

KaniTTS-2 allows for voice cloning by extracting speaker characteristics from a reference audio file.

from kani_tts import KaniTTS, SpeakerEmbedder

# Initialize models
model = KaniTTS('repo/model')
embedder = SpeakerEmbedder()

# Extract speaker embedding from reference audio
speaker_embedding = embedder.embed_audio_file("reference_voice.wav")

# Generate speech with cloned voice
audio, text = model(
    "This is a cloned voice speaking!",
    speaker_emb=speaker_embedding
)
model.save_audio(audio, "cloned_voice.wav")

💡 Tips for Best Results

  • Use 10-20 seconds of clean reference audio.
  • Avoid background noise or music.
  • Reference audio can be any sample rate; the system automatically resamples it to 16 kHz (see the optional snippet below).
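
The library handles this resampling automatically; if you prefer to inspect or clean up the reference yourself first, one way to produce a 16 kHz mono copy with librosa and soundfile (file names are placeholders) is:

import librosa
import soundfile as sf

# Load the reference audio and resample it to 16 kHz mono
audio, sr = librosa.load("reference_voice.wav", sr=16000, mono=True)
sf.write("reference_voice_16k.wav", audio, sr)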

Technical Architecture

Generation Process

  1. Input: text + optional language tag + speaker embedding
  2. Tokenization: text tokens processed with special control tokens
  3. LLaMA-based Causal LM: applies learnable RoPE and frame-level position encoding to generate audio token sequences
  4. NeMo NanoCodec Decoder: converts the 4-token frames into continuous waveforms
  5. Output: 22 kHz high-fidelity audio

Performance Metrics

Inference benchmarks measured on Nvidia RTX 5080 hardware.

  • Real-Time Factor (RTF): ~0.2
  • VRAM Usage: 3 GB
  • Training Hours: ~10k
  • Training Time: 6h on 8x H100
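
RTF is the ratio of synthesis time to the duration of the generated audio, so ~0.2 corresponds to roughly 5x faster than real time. A minimal way to measure it on your own hardware, assuming the quickstart API above and that `audio` is a 1-D sample array at 22,050 Hz (both are assumptions, not verified against the package):

import time
from kani_tts import KaniTTS

model = KaniTTS('repo/model')  # same initialization as the quickstart above

start = time.perf_counter()
audio, text = model("A sentence long enough to benchmark synthesis speed.")
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / 22050   # assumes a 1-D sample array at 22,050 Hz
rtf = elapsed / audio_seconds        # lower is faster; ~0.2 means about 5x real time
print(f"RTF: {rtf:.2f}")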

Use Cases & Limitations

Ideal for

  • Conversational AI and Chatbots requiring low latency.
  • Research into voice, accent, or emotion cloning.
  • Real-time interactive applications.

Limitations

  • Performance may degrade for generations longer than 40 seconds.
  • Prosody or pronunciation may reflect biases in the training data.
  • Optimized primarily for English.