Text-to-Speech Model with Frame-level Position Encodings, optimized for Realtime Conversations.
KaniTTS2 employs a two-stage pipeline that combines a Large Language Model (LLM) with a Finite Scalar Quantization (FSQ) audio codec. This architecture enables high-quality speech synthesis suitable for real-time applications.
| Region | Speaker Name |
|---|---|
| 🇺🇸 Boston | Frank |
| 🇺🇸 Oakland | Jermaine |
| 🏴󠁧󠁢󠁳󠁣󠁴󠁿 Glasgow | Rory |
| 🏴󠁧󠁢󠁥󠁮󠁧󠁿 Liverpool | Baddy |
| 🇺🇸 New York | Chelsea |
| 🇺🇸 San Francisco | Andrew |
Install the package and a compatible `transformers` version, then generate speech:

```bash
pip install kani-tts-2
pip install -U "transformers==4.56.0"
```

```python
from kani_tts import KaniTTS

# Initialize model
model = KaniTTS('repo/model')

# Generate speech
audio, text = model("Hello, world!")

# Save to file
model.save_audio(audio, "output.wav")
```

Voice control comes from learned speaker representations: users can clone any voice from a reference audio sample, with no per-speaker fine-tuning required.
Audio tokens are organized into frames (four tokens per frame by default) using Kani's audio step encoding. This provides precise temporal control and alignment.
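As an illustration of this layout, the sketch below groups a flat token stream into 4-token frames and derives a frame index and within-frame slot for each token; the token values and the indexing scheme are illustrative assumptions, not the model's actual audio step encoding.

```python
TOKENS_PER_FRAME = 4

def to_frames(audio_tokens, tokens_per_frame=TOKENS_PER_FRAME):
    # Group a flat audio-token stream into fixed-size frames
    assert len(audio_tokens) % tokens_per_frame == 0, "stream must be frame-aligned"
    return [
        audio_tokens[i:i + tokens_per_frame]
        for i in range(0, len(audio_tokens), tokens_per_frame)
    ]

def frame_positions(num_tokens, tokens_per_frame=TOKENS_PER_FRAME):
    # (frame_index, slot_within_frame) for every token in the stream
    return [(t // tokens_per_frame, t % tokens_per_frame) for t in range(num_tokens)]

stream = [101, 7, 52, 900, 33, 481, 12, 6]   # 8 tokens -> 2 frames
print(to_frames(stream))                      # [[101, 7, 52, 900], [33, 481, 12, 6]]
print(frame_positions(len(stream)))           # [(0, 0), (0, 1), ..., (1, 3)]
```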
The model implements per-layer frequency scaling for Rotary Position Embeddings (RoPE): each layer learns its own alpha parameter, improving its ability to model dependencies across network depth.
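A minimal sketch of this idea, assuming a single learnable scalar `alpha` per layer that rescales the RoPE inverse frequencies; the class name, initialization, and rotation layout below are illustrative, not KaniTTS2's actual implementation.

```python
import torch
import torch.nn as nn

class LearnableRoPE(nn.Module):
    """Rotary position embedding with a learnable per-layer frequency scale."""

    def __init__(self, head_dim, base=10000.0):
        super().__init__()
        # Standard RoPE inverse frequencies for (even, odd) channel pairs
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq)
        # One learnable frequency scale per layer (each layer owns its own module)
        self.alpha = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        # x: (batch, seq_len, num_heads, head_dim)
        seq_len = x.shape[1]
        positions = torch.arange(seq_len, device=x.device, dtype=torch.float32)
        angles = torch.outer(positions, self.alpha * self.inv_freq)  # (seq, head_dim/2)
        cos = angles.cos()[None, :, None, :]
        sin = angles.sin()[None, :, None, :]
        x1, x2 = x[..., 0::2], x[..., 1::2]   # rotate (even, odd) channel pairs
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

rope = LearnableRoPE(head_dim=64)
q = torch.randn(2, 16, 8, 64)    # (batch, seq, heads, head_dim)
k = torch.randn(2, 16, 8, 64)
q_rot, k_rot = rope(q), rope(k)  # same module for q and k keeps rotations consistent
```

Because `alpha` is a parameter of each layer's module, training can shrink or stretch the effective position resolution independently per layer.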
The model can generate up to 40 seconds of continuous, high-quality audio.
KaniTTS-2 allows for voice cloning by extracting speaker characteristics from a reference audio file.
```python
from kani_tts import KaniTTS, SpeakerEmbedder

# Initialize models
model = KaniTTS('repo/model')
embedder = SpeakerEmbedder()

# Extract speaker embedding from reference audio
speaker_embedding = embedder.embed_audio_file("reference_voice.wav")

# Generate speech with cloned voice
audio, text = model(
    "This is a cloned voice speaking!",
    speaker_emb=speaker_embedding
)
model.save_audio(audio, "cloned_voice.wav")
```

The synthesis pipeline runs as follows:

1. Input: text, an optional language tag, and a speaker embedding
2. Text tokens are processed with special control tokens
3. The LLM applies learnable RoPE and frame-level position encodings to generate the audio token sequence
4. The FSQ codec converts the 4-token frames into continuous waveforms (see the quantizer sketch after this list)
5. Output: 22 kHz high-fidelity audio
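To make the codec stage concrete, here is a minimal sketch of Finite Scalar Quantization in general: each latent dimension is bounded and then rounded to one of a small, fixed number of levels, and the per-dimension codes combine into a single discrete token. The level counts, dimensionality, and function names are illustrative assumptions, not the actual KaniTTS2 codec.

```python
import torch

def fsq_quantize(z, levels=(5, 5, 5, 5)):
    # Odd level counts keep the rounding grid symmetric; even counts need an
    # extra half-step offset, omitted here for brevity.
    levels = torch.tensor(levels, dtype=z.dtype)
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                 # squash each dim into (-half, half)
    codes = torch.round(bounded)                   # snap to the nearest level
    # Straight-through estimator: gradients flow as if rounding were the identity
    quantized = bounded + (codes - bounded).detach()
    return quantized, (codes + half).long()        # integer codes in [0, L-1]

z = torch.randn(1, 4)                              # one 4-dimensional latent
quantized, idx = fsq_quantize(z)
# Combine per-dimension codes into one token id (mixed radix, base 5)
token_id = (idx * torch.tensor([1, 5, 25, 125])).sum(dim=-1)
print(quantized, idx, token_id)
```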
Benchmarks are based on NVIDIA RTX 5080 hardware.
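The benchmark figures themselves are not reproduced here, but a rough latency measurement can be made with the API shown above by timing generation and computing a real-time factor (an RTF below 1 means faster than real time). The snippet assumes the returned audio is a flat sample array at 22,050 Hz; adjust if the API returns a different structure.

```python
import time
from kani_tts import KaniTTS

model = KaniTTS('repo/model')
SAMPLE_RATE = 22050  # assumed from the 22 kHz output above

start = time.perf_counter()
audio, text = model("Benchmarking a single utterance of moderate length.")
elapsed = time.perf_counter() - start

audio_seconds = len(audio) / SAMPLE_RATE  # assumes a flat array of samples
print(f"generation: {elapsed:.2f}s, audio: {audio_seconds:.2f}s, "
      f"RTF: {elapsed / audio_seconds:.2f}")
```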