# Speech Synthesis API
The Real-Time Voice Agent provides enterprise-grade text-to-speech through the `SpeechSynthesizer` class, built on Azure Speech Services.
## Key Features
- Multiple authentication methods: API Key and Azure Entra ID (Default Credentials)
- Real-time synthesis: Base64 frame streaming for WebSocket clients
- Local speaker playback: Intelligent headless environment detection
- OpenTelemetry tracing: Integration for Application Insights monitoring
- Concurrent synthesis limiting: Prevents service overload
- Advanced voice control: Neural styles, prosody, multilingual support
## SpeechSynthesizer Class
Located in `src/speech/text_to_speech.py`, the `SpeechSynthesizer` class provides comprehensive text-to-speech functionality with Azure integration.
### Authentication Methods

#### Azure Entra ID (Recommended for Production)

```python
from src.speech.text_to_speech import SpeechSynthesizer

# Uses DefaultAzureCredential - no API key required
synthesizer = SpeechSynthesizer(
    region="eastus",
    voice="en-US-JennyMultilingualNeural",
    enable_tracing=True,
)
```
#### API Key (Development/Testing)

```python
# Traditional API key authentication
synthesizer = SpeechSynthesizer(
    key="your-speech-key",
    region="eastus",
    voice="en-US-AriaNeural",
)
```
## Basic Usage Examples

### Simple Text-to-Speech

```python
# Synthesize to memory
audio_data = synthesizer.synthesize_speech(
    "Hello! Welcome to our voice application.",
    style="chat",
    rate="+10%",
)

# Save to file
with open("output.wav", "wb") as f:
    f.write(audio_data)
```
### Real-time Streaming for WebSocket

```python
# Generate base64-encoded frames for streaming
frames = synthesizer.synthesize_to_base64_frames(
    "This is real-time streaming audio",
    sample_rate=16000,
)

# Send frames to a WebSocket client
for frame in frames:
    websocket.send(frame)
```
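On the server side, a handler along these lines could stream frames back to a connected client. This is a minimal sketch, assuming the third-party `websockets` package, the synthesizer constructed above, and that `synthesize_to_base64_frames` returns an iterable of base64-encoded strings:

```python
import asyncio

import websockets  # third-party: pip install websockets

async def tts_handler(websocket):
    # Each incoming text message is synthesized and streamed back as
    # base64-encoded frames; synthesis runs in a worker thread so the
    # event loop is not blocked by the synchronous call.
    async for message in websocket:
        frames = await asyncio.to_thread(
            synthesizer.synthesize_to_base64_frames, message, sample_rate=16000
        )
        for frame in frames:
            await websocket.send(frame)

async def main():
    async with websockets.serve(tts_handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
```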
### Local Speaker Playback

```python
import time

# Play audio through system speakers (if available)
synthesizer = SpeechSynthesizer(
    key="your-key",
    region="eastus",
    playback="auto",  # Automatic hardware detection
)

# Speak text directly
synthesizer.start_speaking_text(
    "This will play through your speakers!",
    voice="en-US-AriaNeural",
    style="excited",
)

# Let it play briefly, then stop playback
time.sleep(3)
synthesizer.stop_speaking()
```
## Advanced Configuration

### Production Setup with Managed Identity

```python
import os

from src.speech.text_to_speech import SpeechSynthesizer

# Production configuration
synthesizer = SpeechSynthesizer(
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-JennyMultilingualNeural",
    playback="never",                     # Headless deployment
    enable_tracing=True,                  # OpenTelemetry monitoring
    call_connection_id="session-abc123",  # Correlation tracking
)

# Validate configuration before serving traffic
if synthesizer.validate_configuration():
    print("✅ Speech synthesizer ready for production")
else:
    print("❌ Configuration validation failed")
```
### Voice Styles and Prosody Control

```python
# Advanced voice styling
audio = synthesizer.synthesize_speech(
    "Production-ready voice synthesis",
    voice="en-US-AriaNeural",
    style="news",     # Neural styles such as chat, cheerful, sad, angry
    rate="+5%",       # Speaking-rate adjustment
    pitch="+2Hz",     # Pitch offset
    volume="+10dB",   # Volume adjustment
)
```
## Environment Configuration

Required environment variables for production deployment:

```bash
# Azure Speech Services
AZURE_SPEECH_REGION=eastus
AZURE_SPEECH_RESOURCE_ID=/subscriptions/.../resourceGroups/.../providers/Microsoft.CognitiveServices/accounts/...

# Optional: Custom endpoint
AZURE_SPEECH_ENDPOINT=https://your-custom-endpoint.cognitiveservices.azure.com

# Optional: Audio playback control (set to false for headless environments)
TTS_ENABLE_LOCAL_PLAYBACK=false
```
## Error Handling and Validation

### Configuration Validation

```python
# Test the configuration before use
if synthesizer.validate_configuration():
    print("✅ Configuration is valid")

    # Test basic synthesis
    audio_data = synthesizer.synthesize_speech("Hello, world!")
    print(f"✅ Generated {len(audio_data)} bytes of audio")
else:
    print("❌ Configuration validation failed")
```
### Common Issues

**Authentication Errors**

With Entra ID, authentication failures usually mean `DefaultAzureCredential` could not resolve a credential (no `az login` locally, or no managed identity assigned in Azure), or that the identity lacks the Cognitive Services Speech User role on the Speech resource.
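A quick way to check that a credential can actually be acquired, assuming the `azure-identity` package:

```python
from azure.identity import DefaultAzureCredential

# Raises a descriptive error if no credential source is available
credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default")
print(f"Token acquired; expires at {token.expires_on}")
```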
**Audio Hardware Issues**

```python
# Check headless environment detection
from src.speech.text_to_speech import _is_headless

print(f"Headless environment: {_is_headless()}")
```
**Import Errors**

```bash
# Ensure dependencies are installed
pip install azure-cognitiveservices-speech
python -c "import src.speech.text_to_speech; print('✅ Import successful')"
```
## OpenTelemetry Integration

The `SpeechSynthesizer` includes built-in tracing for production monitoring:

```python
# Enable comprehensive tracing
synthesizer = SpeechSynthesizer(
    region="eastus",
    enable_tracing=True,
    call_connection_id="acs-call-123",  # Correlation ID
)

# All operations are automatically traced with:
# - Session-level spans for the complete request lifecycle
# - Service dependency mapping for Azure Monitor App Map
# - Call correlation across distributed components
```
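The spans are emitted through the standard OpenTelemetry API, so exporting them to Application Insights only requires a configured tracer provider before the synthesizer is constructed. A minimal sketch, assuming the `azure-monitor-opentelemetry` package:

```python
import os

from azure.monitor.opentelemetry import configure_azure_monitor

from src.speech.text_to_speech import SpeechSynthesizer

# Route all OpenTelemetry spans to Application Insights
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

synthesizer = SpeechSynthesizer(region="eastus", enable_tracing=True)
```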
## Performance Considerations

- Concurrent synthesis limiting: Default cap of 4 simultaneous synthesis operations (see the sketch after this list)
- Memory efficiency: Streaming operations with automatic resource cleanup
- Lazy initialization: Audio components initialized only when needed
- Headless detection: Automatic fallback for containerized environments
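The limiter behaves like a counting semaphore around synthesis calls. The sketch below illustrates the pattern with hypothetical names; it is not the project's internal implementation:

```python
import threading

# Cap in-flight synthesis operations at 4, mirroring the documented default
_synthesis_slots = threading.Semaphore(4)

def synthesize_with_limit(synthesizer, text, **kwargs):
    """Block until a slot is free, then run the synthesis call."""
    with _synthesis_slots:
        return synthesizer.synthesize_speech(text, **kwargs)
```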
## Integration with Container Apps

For Azure Container Apps deployment, ensure proper configuration:

```dockerfile
# Dockerfile example
FROM python:3.11-slim

# Set environment for headless operation
ENV TTS_ENABLE_LOCAL_PLAYBACK=false
ENV AZURE_SPEECH_REGION=eastus

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application code
COPY src/ ./src/

CMD ["python", "-m", "your_app"]
```
## API Integration

The speech synthesis functionality integrates with the main API endpoints; see the API Reference for complete endpoint documentation:

- Call Management: TTS for outbound call prompts and conversation responses
- Media Streaming: Real-time TTS synthesis for ACS call conversations
- Health Monitoring: TTS service validation and voice testing

For complete API documentation, see the API Overview.