General Introduction
RealtimeVoiceChat is an open source project focused on real-time, natural voice conversations with artificial intelligence. The user speaks into the microphone, the browser captures the audio, the system quickly transcribes it to text, a large language model (LLM) generates a reply, and the reply is converted back to speech, all in near real time. The project uses a client-server architecture that emphasizes low latency, with WebSocket streaming and dynamic conversation management. It provides Docker deployment, is recommended to run on Linux with an NVIDIA GPU, and integrates RealtimeSTT, RealtimeTTS, Ollama, and other technologies, making it well suited for developers building voice interaction applications.
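Conceptually, each conversational turn flows through a capture → transcribe → generate → synthesize loop. The minimal Python sketch below illustrates that flow with placeholder functions; all names here are hypothetical, and the real project streams each stage incrementally over WebSocket rather than running them one after another.
```python
# Hypothetical outline of one conversational turn; the actual project
# streams every stage incrementally instead of running them in sequence.

def capture_audio() -> bytes:
    """Placeholder: the browser records microphone audio via the Web Audio API."""
    return b""

def transcribe(audio: bytes) -> str:
    """Placeholder: RealtimeSTT (Whisper) converts speech to text."""
    return ""

def generate_reply(text: str) -> str:
    """Placeholder: the LLM (Ollama or OpenAI) produces a response."""
    return ""

def synthesize(text: str) -> bytes:
    """Placeholder: RealtimeTTS renders the reply as audio for browser playback."""
    return b""

user_text = transcribe(capture_audio())  # speech -> text
reply = generate_reply(user_text)        # text -> reply
speech = synthesize(reply)               # reply -> audio, streamed back to the browser
```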
Function List
- Real-time voice interaction: Users speak through the browser microphone, and the system transcribes and generates a voice response in real time.
- Low-latency processing: Streams audio over WebSocket to keep speech-to-text and text-to-speech latency around 0.5-1 seconds.
- Speech-to-text (STT): Quickly converts speech to text using RealtimeSTT (based on Whisper), with support for dynamic transcription.
- Text-to-speech (TTS): Generates natural speech via RealtimeTTS (supports Coqui, Kokoro, Orpheus), with a choice of voice styles.
- Intelligent dialog management: Integrates language models from Ollama or OpenAI, supporting flexible dialog generation and interrupt handling.
- Dynamic speech detection: `turndetect.py` enables intelligent silence detection and adapts to the rhythm of the conversation (a simplified sketch follows this list).
- Web interface: Provides a clean browser interface built with Vanilla JS and the Web Audio API, with support for real-time feedback.
- Docker deployment: Simplified installation with Docker Compose, with support for GPU acceleration and model management.
- Model customization: Supports switching between STT, TTS, and LLM models and adjusting speech and dialog parameters.
- Open source and extensible: The code is publicly available, and developers are free to modify or extend the functionality.
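To make the silence-detection idea concrete, here is a simplified Python sketch of end-of-turn detection. It illustrates the concept only, not the actual logic in `turndetect.py`, and the `LEVEL_THRESHOLD` value is an arbitrary assumption:
```python
# Simplified end-of-turn detection: consider the user's turn finished once
# the audio level stays below a threshold for silence_limit_seconds.
# turndetect.py's real logic is richer (it adapts to conversation pacing).

SILENCE_LIMIT_SECONDS = 0.2  # the documented default
LEVEL_THRESHOLD = 0.01       # normalized RMS level treated as "silence" (assumed)

def turn_finished(levels, chunk_seconds=0.02):
    """levels: iterable of per-chunk RMS values from the audio stream."""
    silent_for = 0.0
    for level in levels:
        if level < LEVEL_THRESHOLD:
            silent_for += chunk_seconds
            if silent_for >= SILENCE_LIMIT_SECONDS:
                return True   # speaker stopped; hand the turn to the LLM
        else:
            silent_for = 0.0  # speech resumed, reset the silence timer
    return False

# Example: 15 consecutive silent chunks (0.3 s) after speech ends the turn.
print(turn_finished([0.2, 0.3] + [0.0] * 15))  # True
```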
Usage Guide
Installation Process
RealtimeVoiceChat supports both Docker deployment (recommended) and manual installation. Docker suits Linux systems, especially with NVIDIA GPUs; manual installation suits Windows or scenarios that need more control. The detailed steps are below:
Docker Deployment (recommended)
Requires Docker Engine, Docker Compose v2+, and the NVIDIA Container Toolkit (for GPU users). Linux is recommended for optimal GPU support.
- Clone the repository:
git clone https://github.com/KoljaB/RealtimeVoiceChat.git
cd RealtimeVoiceChat
- Build the Docker image:
docker compose build
This step downloads the base image, installs Python and machine learning dependencies, and pre-downloads the default Whisper STT model (`base.en`). It can take a while, so make sure your network connection is stable.
- Start the services:
docker compose up -d
This starts the application and the Ollama service, with the containers running in the background. Wait about 1-2 minutes for initialization to complete.
- Pull the Ollama model:
docker compose exec ollama ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
This command pulls the default language model. To use a different model, modify `LLM_START_MODEL` in `code/server.py`.
- Verify the model is available (a Python sanity check is sketched after this list):
docker compose exec ollama ollama list
Make sure the model is loaded correctly.
- Stop or restart the services:
docker compose down  # stop the services
docker compose up -d  # restart the services
- View logs:
docker compose logs -f app  # application logs
docker compose logs -f ollama  # Ollama logs
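Beyond `ollama list`, you can sanity-check that the pulled model actually responds. The sketch below calls Ollama's standard REST API on its default port 11434; whether this project's `docker-compose.yml` publishes that port to the host is an assumption you should verify:
```python
import json
import urllib.request

# One-off completion request against Ollama's REST API (default port 11434).
MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"

payload = json.dumps({
    "model": MODEL,
    "prompt": "Say hello in one short sentence.",
    "stream": False,  # return a single JSON object instead of a stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```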
Manual installation (Windows/Linux/macOS)
Manual installation requires Python 3.9+, CUDA 12.1 (for GPU users), and FFmpeg. Windows users can use the provided `install.bat` script to simplify the process.
- Install base dependencies:
  - Make sure Python 3.9+ is installed.
  - GPU users: install NVIDIA CUDA Toolkit 12.1 and cuDNN.
  - Install FFmpeg:
sudo apt update && sudo apt install ffmpeg  # Ubuntu/Debian
choco install ffmpeg  # Windows (using Chocolatey)
- Clone the repository and create a virtual environment:
git clone https://github.com/KoljaB/RealtimeVoiceChat.git
cd RealtimeVoiceChat
python -m venv venv
source venv/bin/activate  # Linux/macOS
.\venv\Scripts\activate  # Windows
- Install PyTorch (matched to your hardware; a quick verification script follows these steps):
  - GPU (CUDA 12.1):
pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
  - CPU (slower performance):
pip install torch torchaudio torchvision
- Install additional dependencies:
cd code
pip install -r requirements.txt
Note: DeepSpeed can be complicated to install; Windows users can let `install.bat` handle it automatically.
- Install Ollama (non-Docker users):
  - Refer to the official Ollama documentation for installation.
  - Pull the model:
ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
- Run the application:
python server.py
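Before starting the server, it is worth confirming that PyTorch can actually see the GPU, since a CUDA/driver mismatch silently falls back to the much slower CPU path:
```python
import torch

# Verify the PyTorch build matches the hardware before launching server.py.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```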
Caveats
- Hardware requirements: An NVIDIA GPU (at least 8 GB of VRAM) is recommended to ensure low latency; running on a CPU causes significant performance degradation.
- Docker configuration: After modifying `code/*.py` or `docker-compose.yml`, re-run `docker compose build` for the changes to take effect.
- License compliance: TTS engines (e.g. Coqui XTTSv2) and LLM models carry their own licenses and are subject to their terms.
Workflow
- Access the web interface:
  - Open your browser and visit `http://localhost:8000` (or the remote server's IP).
  - Grant microphone permission and click "Start" to begin the conversation.
- Voice interaction:
  - Speak into the microphone; the system captures the audio via the Web Audio API.
  - Audio is sent to the backend over WebSocket, where RealtimeSTT converts it to text, Ollama/OpenAI generates a reply, and RealtimeTTS converts the reply to speech that plays back through the browser.
  - Conversation latency is typically 0.5-1 seconds, with support for interrupting and resuming at any time.
- Real-time feedback:
  - The interface displays partial transcriptions and AI responses, making it easy to follow the conversation.
  - Click "Stop" to end the dialog and "Reset" to clear the history.
- Configuration adjustments (a consolidated sketch of these settings follows this workflow list):
  - TTS engine: Set `START_ENGINE` in `code/server.py` (e.g. `coqui`, `kokoro`, `orpheus`) to adjust the voice style.
  - LLM model: Modify `LLM_START_PROVIDER` and `LLM_START_MODEL`; both Ollama and OpenAI are supported.
  - STT parameters: Adjust the Whisper model, language, or silence thresholds in `code/transcribe.py`.
  - Silence detection: Modify `silence_limit_seconds` in `code/turndetect.py` (default 0.2 seconds) to tune the conversation pacing.
- Debugging and optimization:
  - View logs: `docker compose logs -f` (Docker) or check the `server.py` output directly.
  - Performance issues: Ensure the CUDA version matches, reduce `realtime_batch_size`, or use a lighter-weight model.
  - Network configuration: If HTTPS is required, set `USE_SSL = True` and provide the certificate paths (refer to the official SSL configuration).
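Pulling the knobs above together, the settings might look roughly like this in `code/server.py`. The constant names are the ones this guide references; the values and exact file layout are illustrative, so consult the actual source:
```python
# code/server.py -- illustrative values only; check the real file.
START_ENGINE = "coqui"         # TTS engine: "coqui", "kokoro", or "orpheus"
LLM_START_PROVIDER = "ollama"  # "ollama" or "openai"
LLM_START_MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"
USE_SSL = False                # True requires certificate paths for HTTPS
```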
Feature Highlights
- Low-latency streaming: Audio is chunked over WebSocket and handled by RealtimeSTT and RealtimeTTS, with latency as low as 0.5 seconds, so users can hold smooth conversations without waiting. (A minimal streaming-client sketch follows this list.)
- Dynamic dialog management: `turndetect.py` intelligently detects the end of speech and supports natural interruptions. For example, a user can interrupt at any time, and the system pauses generation to process the new input.
- Web interface interaction: The browser interface uses Vanilla JS and the Web Audio API to display transcription and responses in real time. Users control the dialog with the "Start/Stop/Reset" buttons.
- Model flexibility: Supports switching TTS engines (Coqui/Kokoro/Orpheus) and LLM backends (Ollama/OpenAI). For example, to switch the TTS engine:
START_ENGINE = "kokoro"  # modify in code/server.py
- Docker management: Services are managed through Docker Compose; updating the model only requires:
docker compose exec ollama ollama pull <new_model>
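To make the streaming path concrete, here is a minimal client sketch using the `websockets` library. The `/ws` endpoint path and raw-binary framing are assumptions for illustration, not the project's documented protocol:
```python
import asyncio
import websockets  # pip install websockets

async def stream_microphone(chunks):
    """Send audio chunks to the server and print the first message it returns.

    chunks: any iterable of raw audio byte buffers. The /ws path and the
    message framing here are assumed, not taken from the project docs.
    """
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        for chunk in chunks:
            await ws.send(chunk)  # audio flows up as binary frames
        reply = await ws.recv()   # transcription/TTS events flow back down
        print("server replied:", reply)

# Example: stream two dummy chunks of silence (16-bit mono @ 16 kHz, 0.1 s each).
asyncio.run(stream_microphone([b"\x00" * 3200, b"\x00" * 3200]))
```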
Application Scenarios
- AI voice interaction research
Developers can test how RealtimeSTT, RealtimeTTS, and the LLM integrate, and explore low-latency voice interaction optimizations. The open source code supports customized parameters and suits academic research.
- Intelligent customer service prototypes
Enterprises can build voice customer service systems on top of the project: users ask questions by voice, and the system answers common questions in real time, such as technical support or product inquiries.
- Language learning tools
Educational institutions can use the multilingual TTS functionality to build voice conversation practice tools: students talk to the AI to practice pronunciation and conversation, and the system provides real-time feedback.
- Personal voice assistants
Technology enthusiasts can deploy the project to experience natural voice interaction with AI, simulating an intelligent assistant for personal entertainment or small projects.
FAQ
- What hardware is required?
A Linux system with an NVIDIA GPU (at least 8 GB of VRAM) and CUDA 12.1 is recommended. Running on a CPU is feasible but high-latency. Minimum requirements: Python 3.9+ and 8 GB of RAM.
- How do I troubleshoot Docker deployment problems?
Ensure that Docker, Docker Compose, and the NVIDIA Container Toolkit are installed correctly. Check the GPU configuration in `docker-compose.yml` and view the logs with `docker compose logs -f`.
- How do I switch voices or models?
Modify `START_ENGINE` (TTS) or `LLM_START_MODEL` (LLM) in `code/server.py`. Docker users need to re-pull the model: `docker compose exec ollama ollama pull <model>`.
- What languages are supported?
RealtimeTTS supports multiple languages (e.g. English, Chinese, and Japanese); specify the language and voice model in `code/audio_module.py`.