kyutai: open-science AI lab

48.856649723289

2.3522238328

OPEN-SCIENCE

AI LAB

// WELCOME TO KYUTAI

OUR MISSION IS TO
BUILD AND DEMOCRATIZE
ARTIFICIAL GENERAL INTELLIGENCE
THROUGH OPEN SCIENCE

AI RESEARCH LAB BASED IN PARIS

// LATEST NEWS

Invincible Voice online demo released2026-02-24An AI dialogue assistant designed to help people living with ALS communicate more effectively.

Hibiki-Zero: Simultaneous Speech-to-Speech Translation Without Aligned Data2026-02-12Real-time speech translation from four languages.

Pocket TTS: a high-quality TTS with voice cloning that runs on CPU2026-01-13A tiny TTS with any voice you like.

Cascaded Voice AI

One of our core research areas is making modular building blocks for voice AI.

Pocket TTS is our 100M-parameter text-to-speech model that matches 10x larger state-of-the-art models in quality, and supports voice cloning. It runs in real-time on CPU.

Unmute allows any LLM to listen and speak. At its core are our low-latency streaming text-to-speech and speech-to-text models, optimized for real time usage. Every component of the Unmute pipeline is open-source and freely available to use in your own projects.

Our first spin-off, Gradium, is committed to further turn this open research into production-ready systems to support all voice applications and enable real-time voice interactions at scale.

Unmute

Text-to-Speech

Speech-to-Text

Speech-Native Models

Moshi is the first speech-native dialogue system, unveiled during our first keynote. Moshi processes speech directly rather than converting to text and back, which means it has minimal latency, and can understand emotions, and other non-verbal aspects of communication.

Moshi's multi-stream paradigm also enabled us to create Hibiki-Zero, an end-to-end model that translates speech in real time from four languages to English.

Finally, Moshi extends seamlessly to multimodal inputs: we showcase this with MoshiVis, a Moshi that you can talk to about images.

Moshi

MoshiVis

Hibiki-Zero

Efficient Vision Integration

In vision, we are investigating how to efficiently fuse visual inputs with text or speech for real-time interaction with long-horizon visual streams.

Starting with MoshiVis, we extend Moshi to handle image inputs while preserving its real-time latency and natural conversation abilities. MoshiVis relies on cross-attention to efficiently integrate visual signals.

Going further, we propose CASA, a simple yet powerful cross-attention framework, which allows us to adapt pretrained token-insertion VLMs to enjoy the practical benefits of cross-attention.

CASA

MoshiVis

Neural Audio Codecs

Encoding and decoding signals in a compressed yet accurate manner is a cornerstone of modern AI systems. Our streaming neural audio codec Mimi can efficiently model both semantic and acoustic information while achieving real-time latency. Originally developed for Moshi, Mimi is now a key component of all our audio projects.

If you want to dive deeper, check out our tutorial on neural audio codecs. It builds from the basics all the way to modern codecs like Mimi, with plenty of examples and animations along the way.

Mimi Codec

Neural Codec Tutorial

Compact Language Models

We are working on turning language models from monoliths into modular systems. Using the same model for everything is wasteful. What if you could select the knowledge, abilities and languages that you want your LLM to have, and get a specialized model 10x smaller than an equally smart generic LLM?

The first step is Helium 1, our modular and multilingual 2B-parameter LLM. In the spirit of open science, we are also releasing the dactorycodebase and tools required to reproduce our training dataset.

We further release ARC-Encoder, a method to compress large contexts for LLMs, and neutral residues, an improvement over LoRA for adapting LLMs to new domains.

Helium 1

ARC-Encoder

Neutral Residues

Cascaded Voice AI

Speech-Native Models

Efficient Vision Integration

Neural Audio Codecs

Compact Language Models

Learn more about what we do at Kyutai