Overview
API Reference¶
Comprehensive REST API and WebSocket documentation for the Real-Time Voice Agent backend built on Python 3.11 + FastAPI.
Quick Start¶
The API provides comprehensive Azure integrations for voice-enabled applications:
- Azure Communication Services - Call automation and bidirectional media streaming
- Azure Speech Services - Neural text-to-speech and speech recognition
- Azure OpenAI - Conversational AI and language processing
API Endpoints¶
The V1 API provides REST and WebSocket endpoints for real-time voice processing:
REST Endpoints¶
/api/v1/calls/
- Phone call management (initiate, answer, callbacks)/api/v1/health/
- Service health monitoring and validation
WebSocket Endpoints¶
/api/v1/media/stream
- ACS media streaming and session management/api/v1/realtime/conversation
- Browser-based voice conversations
Interactive API Documentation¶
👉 Complete API Reference - Interactive OpenAPI documentation with all REST endpoints, WebSocket details, authentication, and configuration.
Key Features¶
- Call Management - Phone call lifecycle through Azure Communication Services
- Media Streaming - Real-time audio processing for ACS calls
- Real-time Communication - Browser-based voice conversations
- Health Monitoring - Service validation and diagnostics
WebSocket Protocol¶
Real-time bidirectional audio streaming following Azure Communication Services WebSocket specifications:
- Audio Format: PCM 16kHz mono (ACS) / PCM 24kHz mono (Azure OpenAI Realtime)
- Transport: WebSocket over TCP with full-duplex communication
- Latency: Sub-50ms for voice activity detection and response generation
� WebSocket Details - Complete protocol documentation
Observability¶
OpenTelemetry Tracing - Built-in distributed tracing for production monitoring with Azure Monitor integration:
- Session-level spans for complete request lifecycle
- Service dependency mapping (Speech, Communication Services, Redis, OpenAI)
- Audio processing latency and error rate monitoring
Streaming Modes¶
The API supports multiple streaming modes configured via ACS_STREAMING_MODE
:
- MEDIA Mode (Default) - Traditional STT/TTS with orchestrator processing
- VOICE_LIVE Mode - Azure OpenAI Realtime API integration
- TRANSCRIPTION Mode - Real-time transcription without AI responses
👉 Detailed Configuration - Complete streaming mode documentation
Architecture¶
Three-Thread Design - Optimized for real-time conversational AI with sub-10ms barge-in detection following Azure Speech SDK best practices.
� Architecture Details - Complete three-thread architecture documentation
Reliability¶
Graceful Degradation - Following Azure Communication Services reliability patterns:
- Connection pooling and retry logic with exponential backoff
- Headless environment support with memory-only audio synthesis
- Managed identity authentication with automatic token refresh
Related Documentation¶
- API Reference - Complete OpenAPI specification with interactive testing
- Speech Synthesis - Comprehensive TTS implementation guide
- Speech Recognition - Advanced STT capabilities and configuration
- Streaming Modes - Audio processing pipeline configuration
- Utilities - Supporting services and infrastructure components
- Architecture Overview - System architecture and deployment patterns