General Introduction
RealtimeVoiceChat is an open source project focused on real-time, natural voice conversations with artificial intelligence. The user speaks into the microphone, the browser captures the audio, the system quickly transcribes it to text, a large language model (LLM) generates a reply, and the reply is converted back to speech, all in near real time. The project uses a client-server architecture that emphasizes low latency, with WebSocket streaming and dynamic conversation management. It provides Docker deployment, is recommended to run on Linux with an NVIDIA GPU, and integrates RealtimeSTT, RealtimeTTS, Ollama, and other technologies, making it well suited for developers building voice interaction applications.
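Conceptually, each conversational turn flows through a capture → transcribe → generate → synthesize loop. The minimal Python sketch below illustrates that flow with placeholder functions; all names here are hypothetical, and the real project streams each stage incrementally over WebSocket rather than running them one after another.
```python
# Hypothetical outline of one conversational turn; the actual project
# streams every stage incrementally instead of running them in sequence.

def capture_audio() -> bytes:
    """Placeholder: the browser records microphone audio via the Web Audio API."""
    return b""

def transcribe(audio: bytes) -> str:
    """Placeholder: RealtimeSTT (Whisper) converts speech to text."""
    return ""

def generate_reply(text: str) -> str:
    """Placeholder: the LLM (Ollama or OpenAI) produces a response."""
    return ""

def synthesize(text: str) -> bytes:
    """Placeholder: RealtimeTTS renders the reply as audio for browser playback."""
    return b""

user_text = transcribe(capture_audio())  # speech -> text
reply = generate_reply(user_text)        # text -> reply
speech = synthesize(reply)               # reply -> audio, streamed back to the browser
```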
Function List
- Real-time voice interaction: Users speak through the browser microphone, and the system transcribes and generates a voice response in real time.
- Low-latency processing: Streams audio over WebSocket to keep speech-to-text and text-to-speech latency around 0.5-1 seconds.
- Speech-to-text (STT): Quickly converts speech to text using RealtimeSTT (based on Whisper), with support for dynamic transcription.
- Text-to-speech (TTS): Generates natural speech via RealtimeTTS (supports Coqui, Kokoro, Orpheus), with a choice of voice styles.
- Intelligent dialog management: Integrates language models from Ollama or OpenAI, supporting flexible dialog generation and interrupt handling.
- Dynamic speech detection: `turndetect.py` enables intelligent silence detection and adapts to the rhythm of the conversation (a simplified sketch follows this list).
- Web interface: Provides a clean browser interface built with Vanilla JS and the Web Audio API, with support for real-time feedback.
- Docker deployment: Simplified installation with Docker Compose, with support for GPU acceleration and model management.
- Model customization: Supports switching between STT, TTS, and LLM models and adjusting speech and dialog parameters.
- Open source and extensible: The code is publicly available, and developers are free to modify or extend the functionality.
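To make the silence-detection idea concrete, here is a simplified Python sketch of end-of-turn detection. It illustrates the concept only, not the actual logic in `turndetect.py`, and the `LEVEL_THRESHOLD` value is an arbitrary assumption:
```python
# Simplified end-of-turn detection: consider the user's turn finished once
# the audio level stays below a threshold for silence_limit_seconds.
# turndetect.py's real logic is richer (it adapts to conversation pacing).

SILENCE_LIMIT_SECONDS = 0.2  # the documented default
LEVEL_THRESHOLD = 0.01       # normalized RMS level treated as "silence" (assumed)

def turn_finished(levels, chunk_seconds=0.02):
    """levels: iterable of per-chunk RMS values from the audio stream."""
    silent_for = 0.0
    for level in levels:
        if level < LEVEL_THRESHOLD:
            silent_for += chunk_seconds
            if silent_for >= SILENCE_LIMIT_SECONDS:
                return True   # speaker stopped; hand the turn to the LLM
        else:
            silent_for = 0.0  # speech resumed, reset the silence timer
    return False

# Example: 15 consecutive silent chunks (0.3 s) after speech ends the turn.
print(turn_finished([0.2, 0.3] + [0.0] * 15))  # True
```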
Usage Guide
Installation Process
RealtimeVoiceChat supports both Docker deployment (recommended) and manual installation. Docker suits Linux systems, especially with NVIDIA GPUs; manual installation suits Windows or scenarios that need more control. The detailed steps are below:
Docker Deployment (recommended)
Requires Docker Engine, Docker Compose v2+, and the NVIDIA Container Toolkit (for GPU users). Linux is recommended for optimal GPU support.
- Clone the repository:
git clone https://github.com/KoljaB/RealtimeVoiceChat.git
cd RealtimeVoiceChat
- Build the Docker image:
docker compose build
This step downloads the base image, installs Python and machine learning dependencies, and pre-downloads the default Whisper STT model (`base.en`). It can take a while, so make sure your network connection is stable.
- Start the services:
docker compose up -d
This starts the application and the Ollama service, with the containers running in the background. Wait about 1-2 minutes for initialization to complete.
- Pull the Ollama model:
docker compose exec ollama ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
This command pulls the default language model. To use a different model, modify `LLM_START_MODEL` in `code/server.py`.
- Verify the model is available (a Python sanity check is sketched after this list):
docker compose exec ollama ollama list
Make sure the model is loaded correctly.
- Stop or restart the services:
docker compose down  # stop the services
docker compose up -d  # restart the services
- View logs:
docker compose logs -f app  # application logs
docker compose logs -f ollama  # Ollama logs
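Beyond `ollama list`, you can sanity-check that the pulled model actually responds. The sketch below calls Ollama's standard REST API on its default port 11434; whether this project's `docker-compose.yml` publishes that port to the host is an assumption you should verify:
```python
import json
import urllib.request

# One-off completion request against Ollama's REST API (default port 11434).
MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"

payload = json.dumps({
    "model": MODEL,
    "prompt": "Say hello in one short sentence.",
    "stream": False,  # return a single JSON object instead of a stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```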
Manual installation (Windows/Linux/macOS)
Manual installation requires Python 3.9+, CUDA 12.1 (for GPU users), and FFmpeg. Windows users can use the provided `install.bat` script to simplify the process.
- Install base dependencies:
  - Make sure Python 3.9+ is installed.
  - GPU users: install NVIDIA CUDA Toolkit 12.1 and cuDNN.
  - Install FFmpeg:
sudo apt update && sudo apt install ffmpeg  # Ubuntu/Debian
choco install ffmpeg  # Windows (using Chocolatey)
- Clone the repository and create a virtual environment:
git clone https://github.com/KoljaB/RealtimeVoiceChat.git
cd RealtimeVoiceChat
python -m venv venv
source venv/bin/activate  # Linux/macOS
.\venv\Scripts\activate  # Windows
- Install PyTorch (matched to your hardware; a quick verification script follows these steps):
  - GPU (CUDA 12.1):
pip install torch==2.5.1+cu121 torchaudio==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121
  - CPU (slower performance):
pip install torch torchaudio torchvision
- Install additional dependencies:
cd code
pip install -r requirements.txt
Note: DeepSpeed can be complicated to install; Windows users can let `install.bat` handle it automatically.
- Install Ollama (non-Docker users):
  - Refer to the official Ollama documentation for installation.
  - Pull the model:
ollama pull hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M
- Run the application:
python server.py
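Before starting the server, it is worth confirming that PyTorch can actually see the GPU, since a CUDA/driver mismatch silently falls back to the much slower CPU path:
```python
import torch

# Verify the PyTorch build matches the hardware before launching server.py.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```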
Caveats
- Hardware requirements: An NVIDIA GPU (at least 8 GB of VRAM) is recommended to ensure low latency; running on a CPU causes significant performance degradation.
- Docker configuration: After modifying `code/*.py` or `docker-compose.yml`, re-run `docker compose build` for the changes to take effect.
- License compliance: TTS engines (e.g. Coqui XTTSv2) and LLM models carry their own licenses and are subject to their terms.
Workflow
- Access the web interface:
  - Open your browser and visit `http://localhost:8000` (or the remote server's IP).
  - Grant microphone permission and click "Start" to begin the conversation.
- Voice interaction:
  - Speak into the microphone; the system captures the audio via the Web Audio API.
  - Audio is sent to the backend over WebSocket, where RealtimeSTT converts it to text, Ollama/OpenAI generates a reply, and RealtimeTTS converts the reply to speech that plays back through the browser.
  - Conversation latency is typically 0.5-1 seconds, with support for interrupting and resuming at any time.
- Real-time feedback:
  - The interface displays partial transcriptions and AI responses, making it easy to follow the conversation.
  - Click "Stop" to end the dialog and "Reset" to clear the history.
- Configuration adjustments (a consolidated sketch of these settings follows this workflow list):
  - TTS engine: Set `START_ENGINE` in `code/server.py` (e.g. `coqui`, `kokoro`, `orpheus`) to adjust the voice style.
  - LLM model: Modify `LLM_START_PROVIDER` and `LLM_START_MODEL`; both Ollama and OpenAI are supported.
  - STT parameters: Adjust the Whisper model, language, or silence thresholds in `code/transcribe.py`.
  - Silence detection: Modify `silence_limit_seconds` in `code/turndetect.py` (default 0.2 seconds) to tune the conversation pacing.
- Debugging and optimization:
  - View logs: `docker compose logs -f` (Docker) or check the `server.py` output directly.
  - Performance issues: Ensure the CUDA version matches, reduce `realtime_batch_size`, or use a lighter-weight model.
  - Network configuration: If HTTPS is required, set `USE_SSL = True` and provide the certificate paths (refer to the official SSL configuration).
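Pulling the knobs above together, the settings might look roughly like this in `code/server.py`. The constant names are the ones this guide references; the values and exact file layout are illustrative, so consult the actual source:
```python
# code/server.py -- illustrative values only; check the real file.
START_ENGINE = "coqui"         # TTS engine: "coqui", "kokoro", or "orpheus"
LLM_START_PROVIDER = "ollama"  # "ollama" or "openai"
LLM_START_MODEL = "hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M"
USE_SSL = False                # True requires certificate paths for HTTPS
```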
Feature Highlights
- Low-latency streaming: Audio is chunked over WebSocket and handled by RealtimeSTT and RealtimeTTS, with latency as low as 0.5 seconds, so users can hold smooth conversations without waiting. (A minimal streaming-client sketch follows this list.)
- Dynamic dialog management: `turndetect.py` intelligently detects the end of speech and supports natural interruptions. For example, a user can interrupt at any time, and the system pauses generation to process the new input.
- Web interface interaction: The browser interface uses Vanilla JS and the Web Audio API to display transcription and responses in real time. Users control the dialog with the "Start/Stop/Reset" buttons.
- Model flexibility: Supports switching TTS engines (Coqui/Kokoro/Orpheus) and LLM backends (Ollama/OpenAI). For example, to switch the TTS engine:
START_ENGINE = "kokoro"  # modify in code/server.py
- Docker management: Services are managed through Docker Compose; updating the model only requires:
docker compose exec ollama ollama pull <new_model>
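To make the streaming path concrete, here is a minimal client sketch using the `websockets` library. The `/ws` endpoint path and raw-binary framing are assumptions for illustration, not the project's documented protocol:
```python
import asyncio
import websockets  # pip install websockets

async def stream_microphone(chunks):
    """Send audio chunks to the server and print the first message it returns.

    chunks: any iterable of raw audio byte buffers. The /ws path and the
    message framing here are assumed, not taken from the project docs.
    """
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        for chunk in chunks:
            await ws.send(chunk)  # audio flows up as binary frames
        reply = await ws.recv()   # transcription/TTS events flow back down
        print("server replied:", reply)

# Example: stream two dummy chunks of silence (16-bit mono @ 16 kHz, 0.1 s each).
asyncio.run(stream_microphone([b"\x00" * 3200, b"\x00" * 3200]))
```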
Application Scenarios
- AI voice interaction research
Developers can test how RealtimeSTT, RealtimeTTS, and the LLM integrate, and explore low-latency voice interaction optimizations. The open source code supports customized parameters and suits academic research.
- Intelligent customer service prototypes
Enterprises can build voice customer service systems on top of the project: users ask questions by voice, and the system answers common questions in real time, such as technical support or product inquiries.
- Language learning tools
Educational institutions can use the multilingual TTS functionality to build voice conversation practice tools: students talk to the AI to practice pronunciation and conversation, and the system provides real-time feedback.
- Personal voice assistants
Technology enthusiasts can deploy the project to experience natural voice interaction with AI, simulating an intelligent assistant for personal entertainment or small projects.
FAQ
- What hardware is required?
A Linux system with an NVIDIA GPU (at least 8 GB of VRAM) and CUDA 12.1 is recommended. Running on a CPU is feasible but high-latency. Minimum requirements: Python 3.9+ and 8 GB of RAM.
- How do I troubleshoot Docker deployment problems?
Ensure that Docker, Docker Compose, and the NVIDIA Container Toolkit are installed correctly. Check the GPU configuration in `docker-compose.yml` and view the logs with `docker compose logs -f`.
- How do I switch voices or models?
Modify `START_ENGINE` (TTS) or `LLM_START_MODEL` (LLM) in `code/server.py`. Docker users need to re-pull the model: `docker compose exec ollama ollama pull <model>`.
- What languages are supported?
RealtimeTTS supports multiple languages (e.g. English, Chinese, and Japanese); specify the language and voice model in `code/audio_module.py`.