Blog
AI/Voice
December 10, 2025 · 8 min

Building a Real-time Voice Pipeline: From Raw Audio to AI Response

The heart of TAMSIV is voice. Not a gimmick or an add-on: voice IS the primary interface. Building a real-time voice pipeline solo means entering a world where every millisecond counts.

The Pipeline Architecture

Audio PCM 16kHz mono → WebSocket (JWT) → Deepgram Live STT (VAD) → OpenRouter LLM → Function calling → OpenAI TTS → Audio response

Audio must be PCM 16-bit, 16kHz, mono. The phone sends raw binary chunks via WebSocket.
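Capture APIs typically hand back Float32 samples, so the client has to convert before sending. A minimal sketch of that conversion (the helper name is illustrative, not from the actual codebase):

```typescript
// Convert Float32 samples (as produced by most audio capture APIs)
// into PCM 16-bit little-endian, ready to send as a binary
// WebSocket frame. Assumes the capture is already 16kHz mono.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  return buffer;
}
```

Each chunk produced this way can be sent directly with `socket.send(buffer)` — no framing or base64 needed for binary WebSocket messages.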

Authenticated WebSocket

The Supabase JWT travels in the query string: ws://backend:3001?token=eyJhbG.... It is validated when the connection is established. If the token expires mid-conversation, the client reconnects automatically with a fresh token.
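The server-side handshake can be sketched like this (function names are illustrative; real validation would verify the signature against Supabase's secret or JWKS, not just decode the payload):

```typescript
import { URL } from "node:url";

// Pull the JWT out of the upgrade request's query string,
// e.g. "/?token=eyJhbG...".
function extractToken(requestUrl: string): string | null {
  const url = new URL(requestUrl, "ws://localhost");
  return url.searchParams.get("token");
}

// Decode the JWT payload and check its `exp` claim. This is only the
// expiry check — signature verification is assumed to happen elsewhere.
function isExpired(jwt: string, nowSec = Math.floor(Date.now() / 1000)): boolean {
  const payload = JSON.parse(
    Buffer.from(jwt.split(".")[1], "base64url").toString("utf8"),
  );
  return typeof payload.exp === "number" && payload.exp <= nowSec;
}
```

Rejecting at upgrade time keeps unauthenticated clients from ever holding a socket open.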

Deepgram Live STT and VAD

Deepgram's VAD (Voice Activity Detection) detects when the user has finished speaking. Without VAD, a client-side silence timeout is needed — too short and it cuts off, too long and it lags. Deepgram handles this with precision.
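For context, the client-side fallback that Deepgram's VAD replaces would look roughly like this — an energy-based endpoint detector where the thresholds are exactly the knobs that are hard to tune (all values here are illustrative):

```typescript
// Naive client-side endpointing: fire a callback after `silenceMs`
// of frames whose RMS energy stays below `threshold`. Too short a
// window cuts speakers off; too long adds lag — the tradeoff
// server-side VAD avoids.
class SilenceDetector {
  private silentMs = 0;

  constructor(
    private readonly threshold: number,
    private readonly silenceMs: number,
    private readonly onEndOfSpeech: () => void,
  ) {}

  // Feed one frame of samples; frameMs is the frame's duration.
  push(frame: Float32Array, frameMs: number): void {
    let sum = 0;
    for (const s of frame) sum += s * s;
    const rms = Math.sqrt(sum / frame.length);
    if (rms < this.threshold) {
      this.silentMs += frameMs;
      if (this.silentMs >= this.silenceMs) {
        this.silentMs = 0;
        this.onEndOfSpeech();
      }
    } else {
      this.silentMs = 0; // speech resets the silence window
    }
  }
}
```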

The challenge: managing interim results (is_final: false) vs. final results (is_final: true). Interim results are only a live preview of the current phrase; final segments must be accumulated to build the complete transcription.

LLM Orchestration

The transcription goes to OpenRouter with function calling enabled. The model is configurable, with automatic fallback if it fails. Total round-trip latency: STT finalization (~200ms) + LLM (~800ms-2s) + TTS (~500ms), so between 1.5 and 3 seconds.
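The fallback idea can be sketched as a simple chain — try each configured model in order, falling through on failure (the LLMCall signature and model chain are assumptions for illustration, not the actual OpenRouter client):

```typescript
type LLMCall = (model: string, prompt: string) => Promise<string>;

// Try each model in order; rethrow the last error only if every
// model in the chain has failed.
async function completeWithFallback(
  models: string[],
  prompt: string,
  call: LLMCall,
): Promise<string> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await call(model, prompt);
    } catch (err) {
      lastError = err; // fall through to the next model
    }
  }
  throw lastError;
}
```

Because OpenRouter exposes many models behind one API, the fallback chain is just a list of model identifiers in config.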

OpenAI TTS

OpenAI TTS with the nova voice. Audio is streamed back via the same WebSocket, and the frontend starts playing as soon as the first chunks arrive, without waiting for the full response.
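On the frontend, that play-as-you-receive behavior is essentially a drain queue; a sketch, where the `play` callback stands in for whatever audio sink is actually used (Web Audio, a native player, etc.):

```typescript
// Playback starts on the first TTS chunk instead of waiting for the
// full response; later chunks queue up behind it in order.
class StreamingPlayer {
  private queue: ArrayBuffer[] = [];
  private playing = false;

  constructor(private readonly play: (chunk: ArrayBuffer) => Promise<void>) {}

  // Called for each binary WebSocket message carrying TTS audio.
  enqueue(chunk: ArrayBuffer): void {
    this.queue.push(chunk);
    if (!this.playing) void this.drain();
  }

  private async drain(): Promise<void> {
    this.playing = true;
    while (this.queue.length > 0) {
      await this.play(this.queue.shift()!);
    }
    this.playing = false;
  }
}
```

The queue absorbs network jitter: chunks can arrive faster than they play without being dropped or reordered.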

Lessons from Real-time

Anything can fail at any time. Smart retries, circuit breakers, and fallbacks at every step. The pipeline is robust today, but every line of error handling represents a bug that was actually encountered.
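To make the circuit-breaker idea concrete, a minimal sketch (thresholds and the injectable clock are illustrative): after a run of consecutive failures the circuit opens and calls fail fast until a cooldown has passed, protecting an already-struggling upstream service.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures: number,
    private readonly cooldownMs: number,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    // Open circuit: fail fast without touching the upstream service.
    if (
      this.failures >= this.maxFailures &&
      this.now() - this.openedAt < this.cooldownMs
    ) {
      throw new Error("circuit open");
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures === this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Wrapping each external call (Deepgram, OpenRouter, OpenAI TTS) in its own breaker keeps one flapping dependency from dragging down the rest of the pipeline.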