Voice Processing
Voice Processing Architecture¶
The voice module handles all real-time voice interactions including Speech-to-Text (STT), Language Model orchestration, and Text-to-Speech (TTS) for both web browsers and telephone calls.
Overview¶
The voice system supports two transport types: - Browser: WebRTC audio at 48kHz for web applications - ACS (Azure Communication Services): Telephony audio at 16kHz for phone calls
Directory Structure¶
voice/
├── handler.py Main VoiceHandler (recommended)
├── tts/ Text-to-Speech module
│ └── playback.py TTS audio playback
├── speech_cascade/ Speech processing pipeline
│ ├── handler.py 3-thread speech handler (STT, Turn, Barge-in)
│ ├── orchestrator.py LLM routing and turn management
│ ├── tts_processor.py Text utilities (markdown cleanup, sentence detection)
│ └── metrics.py Telemetry and metrics
├── shared/ Shared utilities
│ ├── context.py Session context for dependency injection
│ └── config_resolver.py Orchestrator configuration
└── messaging/ WebSocket helpers
├── transcripts.py Transcript message formatting
└── barge_in.py Interrupt handling
Key Components¶
VoiceHandler¶
The main entry point for voice interactions. Manages the complete lifecycle of a voice session.
Location: apps/artagent/backend/voice/handler.py
from apps.artagent.backend.voice import VoiceHandler, VoiceSessionContext
# Create session context
context = VoiceSessionContext(
session_id=session_id,
websocket=ws,
transport=TransportType.BROWSER, # or TransportType.ACS
cancel_event=asyncio.Event(),
current_agent=agent
)
# Create and start handler
handler = await VoiceHandler.create(config, app_state)
await handler.start()
# Cleanup when done
await handler.stop()
Text-to-Speech (TTS)¶
Converts text responses into audio and streams them to the user.
Location: apps/artagent/backend/voice/tts/playback.py
from apps.artagent.backend.voice.tts import TTSPlayback
# Create TTS instance
tts = TTSPlayback(context, app_state)
# Speak to user (automatically routes to browser or ACS)
await tts.speak("Hello! How can I help you?")
Key Features: - Automatic transport routing (browser vs telephony) - Voice customization per agent - Cancellation support for barge-in - Streaming for low latency
Speech Cascade¶
Processes user speech through a multi-threaded pipeline:
- STT Thread: Converts audio to text using Azure Speech Services
- Turn Thread: Routes conversation turns to appropriate LLM/agent
- Barge-in Thread: Handles user interruptions during agent responses
Location: apps/artagent/backend/voice/speech_cascade/
The cascade orchestrator manages this flow and coordinates between threads.
VoiceSessionContext¶
A shared context object that holds all session state, enabling clean dependency injection.
from apps.artagent.backend.voice.shared import VoiceSessionContext, TransportType
context = VoiceSessionContext(
session_id="unique-session-id",
websocket=websocket_connection,
transport=TransportType.BROWSER,
cancel_event=asyncio.Event(),
current_agent=agent_instance
)
Audio Specifications¶
Browser Transport (48kHz)¶
- Sample Rate: 48,000 Hz
- Format: PCM16 mono
- Chunk Size: 9,600 bytes (100ms per chunk)
- Encoding: Base64-encoded JSON over WebSocket
ACS Telephony Transport (16kHz)¶
- Sample Rate: 16,000 Hz
- Format: PCM16 mono
- Chunk Size: 1,280 bytes (40ms per chunk)
- Pacing: 40ms delay between chunks
- Encoding: Base64-encoded ACS AudioData messages
Common Usage Patterns¶
Starting a Voice Session¶
from apps.artagent.backend.voice import VoiceHandler, VoiceSessionContext, TransportType
# 1. Create session context with all needed state
context = VoiceSessionContext(
session_id=f"session-{uuid.uuid4()}",
websocket=websocket,
transport=TransportType.BROWSER,
cancel_event=asyncio.Event(),
current_agent=initial_agent
)
# 2. Initialize voice handler
handler = await VoiceHandler.create(agent_config, app_state)
await handler.start()
# Handler automatically manages STT, orchestration, and TTS
Playing Audio to User¶
from apps.artagent.backend.voice.tts import TTSPlayback
# Create TTS instance
tts = TTSPlayback(context, app_state)
# Simple speak (uses agent's default voice)
await tts.speak("Your account balance is $1,234.56")
# Custom voice override
await tts.speak(
"Welcome to our bank!",
voice_name="en-US-JennyNeural",
style="friendly"
)
Processing Text for TTS¶
Some LLM responses contain markdown or formatting that needs cleaning:
from apps.artagent.backend.voice.speech_cascade.tts_processor import TTSTextProcessor
# Remove markdown formatting
clean_text = TTSTextProcessor.sanitize_tts_text(
"**Bold text** and _italic text_"
)
# Result: "Bold text and italic text"
# Split streaming text into complete sentences
sentences, buffer = TTSTextProcessor.process_streaming_text(
llm_chunk,
previous_buffer
)
Best Practices¶
DO ✅¶
-
Use VoiceSessionContext for passing session state
-
Let TTS auto-route transports
-
Handle cancellation for barge-in
DON'T ❌¶
-
Don't manually manage transports
-
Don't confuse TTS playback with text processing
-
Don't forget to clean up handlers
Telemetry¶
The voice system emits OpenTelemetry spans for monitoring:
tts.synthesize
├── attributes: voice, style, rate, sample_rate, text_length
└── duration: synthesis time
tts.stream.browser | tts.stream.acs
├── attributes: transport, audio_bytes, chunks_sent
└── duration: streaming time
cascade.process_turn
├── invoke_agent {agent_name}
└── tts.synthesize + tts.stream
Custom metrics tracked:
- tts.synthesis.duration_ms - TTS generation time
- tts.stream.duration_ms - Audio streaming time
- tts.stream.chunks_sent - Number of audio chunks
- tts.stream.audio_bytes - Total audio size
Troubleshooting¶
Audio sounds slow or distorted (ACS only)¶
Check: Verify chunk size is 1,280 bytes (not 640)
Location: apps/artagent/backend/voice/tts/playback.py:490
No audio playing¶
- Check WebSocket connection is active
- Verify TTS synthesis succeeded (check logs for "Synthesis complete")
- Confirm transport type matches endpoint (browser vs ACS)
- Check for cancellation events interrupting playback
High latency¶
Typical TTS latencies: - Synthesis: 100-500ms (depends on text length) - First chunk: +50-100ms (encoding + network) - Total TTFB: 150-600ms
If higher, check:
- Azure region proximity
- Network connectivity
- TTS pool health: /api/v1/tts/health
Testing¶
Prerequisites¶
Install test dependencies first:
Unit Tests¶
Run the voice handler component tests:
# Test all voice handler components (context, TTS playback, agent voice resolution)
pytest tests/test_voice_handler_components.py -v
# Test voice handler compatibility layer
pytest tests/test_voice_handler_compat.py -v
# Test cascade orchestrator functionality
pytest tests/test_cascade_orchestrator_entry_points.py -v
pytest tests/test_cascade_llm_processing.py -v
# Run all voice-related tests
pytest tests/ -k "voice or cascade" -v
Optional: Run with coverage reporting (requires pytest-cov from dev dependencies):
pytest tests/test_voice_handler_components.py --cov=apps.artagent.backend.voice --cov-report=term-missing
Integration Testing¶
Test the full orchestrator pipeline interactively:
# Run interactive orchestrator test (browser mode)
./devops/scripts/misc/quick_test.sh
# Or run directly with Python
python devops/scripts/misc/test_orchestrator.py --interactive
This will test the complete flow: speech recognition → LLM orchestration → TTS playback.
Manual Testing¶
Browser:
1. Start the backend server: uvicorn apps.artagent.backend.main:app --reload
2. Open web UI at http://localhost:8000
3. Start voice conversation
4. Verify clear audio (not slow/choppy)
5. Test barge-in by speaking during response
ACS Telephony:
1. Make test call via /api/v1/call/outbound
2. Verify natural audio playback
3. Test interruptions (DTMF or speech)
4. Check logs for correct chunk sizes (1,280 bytes for 16kHz)
Related Documentation¶
- Orchestration System - How agents are selected and managed
- Speech Recognition - STT implementation details
- Session Management - Session state and lifecycle
- Telemetry - Monitoring and observability