
Orchestration Architecture

This section describes the dual orchestration architecture that powers the ART Voice Agent Accelerator. The system supports two distinct orchestration modes, each optimized for different use cases and performance characteristics.


Overview

The accelerator provides two orchestrator implementations that share the same Unified Agent Framework but differ in how they process audio, execute tools, and manage agent handoffs:

| Mode | Orchestrator | Audio Processing | Best For |
|------|--------------|------------------|----------|
| SpeechCascade | CascadeOrchestratorAdapter | Azure Speech SDK (STT/TTS) | Fine-grained control, custom VAD |
| VoiceLive | LiveOrchestrator | OpenAI Realtime API | Lowest latency, managed audio |

Both orchestrators:

  • Use the same UnifiedAgent configurations from apps/artagent/backend/agents/
  • Share the centralized tool registry
  • Support multi-agent handoffs via scenario-driven routing
  • Use the unified HandoffService for consistent handoff behavior
  • Integrate with MemoManager for session state
  • Emit OpenTelemetry spans for observability

Scenario-Based Orchestration

Handoff routing is defined at the scenario level, not embedded in agents. This enables the same agents to behave differently in different use cases.

flowchart LR
    subgraph Scenario["Scenario Configuration"]
        S[scenario.yaml]
    end
    subgraph Service["Handoff Resolution"]
        HS[HandoffService]
    end
    subgraph Orchestrators["Orchestrators"]
        CO[CascadeOrchestrator]
        LO[LiveOrchestrator]
    end
    subgraph Agents["Agent Layer"]
        A1[Agent A]
        A2[Agent B]
    end
    S --> HS
    HS --> CO
    HS --> LO
    CO --> A1
    CO --> A2
    LO --> A1
    LO --> A2

Key Benefits

| Benefit | Description |
|---------|-------------|
| Modularity | Agents focus on capabilities; scenarios handle orchestration |
| Reusability | The same agent behaves differently in banking vs. insurance |
| Contextual Behavior | Handoffs can be "announced" or "discrete" per scenario |
| Session Scenarios | Scenario Builder creates session-scoped scenarios at runtime |

For full details, see Scenario-Based Orchestration.


Architecture Diagram

flowchart LR
    subgraph Input
        WS[WebSocket]
    end
    subgraph Orchestration
        WS --> Mode{Streaming Mode}
        Mode -->|MEDIA / TRANSCRIPTION| Cascade[SpeechCascade]
        Mode -->|VOICE_LIVE| Live[VoiceLive]
    end
    subgraph CascadePath["Cascade Path"]
        Cascade --> STT[Azure Speech STT]
        STT --> LLM[Azure OpenAI]
        LLM --> TTS[Azure Speech TTS]
    end
    subgraph VoiceLivePath["VoiceLive Path"]
        Live --> RT[OpenAI Realtime API]
    end
    subgraph Shared
        Agents[(Agents)]
        Tools[(Tools)]
        State[(State)]
    end
    Cascade -.-> Shared
    Live -.-> Shared

Both orchestrators share:

  • Agents — Unified agent registry from apps/artagent/backend/agents/
  • Tools — Centralized tool registry with handoff support
  • State — MemoManager for session persistence


Mode Selection

The orchestration mode is selected via the ACS_STREAMING_MODE environment variable:

# SpeechCascade modes (Azure Speech SDK)
export ACS_STREAMING_MODE=MEDIA          # Raw audio with local VAD
export ACS_STREAMING_MODE=TRANSCRIPTION  # ACS-provided transcriptions

# VoiceLive mode (OpenAI Realtime API)
export ACS_STREAMING_MODE=VOICE_LIVE
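
A minimal sketch of how application code might branch on this variable; the StreamingMode enum and select_orchestrator helper here are illustrative, not the accelerator's actual API:

import os
from enum import Enum

class StreamingMode(str, Enum):
    MEDIA = "MEDIA"
    TRANSCRIPTION = "TRANSCRIPTION"
    VOICE_LIVE = "VOICE_LIVE"

def select_orchestrator(mode: StreamingMode) -> str:
    # Illustrative dispatch; the real wiring lives in the backend app setup.
    if mode is StreamingMode.VOICE_LIVE:
        return "LiveOrchestrator"           # OpenAI Realtime API path
    return "CascadeOrchestratorAdapter"     # Azure Speech SDK path

mode = StreamingMode(os.environ.get("ACS_STREAMING_MODE", "MEDIA"))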

Decision Matrix

| Requirement | SpeechCascade | VoiceLive |
|-------------|---------------|-----------|
| Lowest possible latency | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Custom VAD/segmentation | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Sentence-level TTS control | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Azure Speech voices | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| Phrase list customization | ⭐⭐⭐⭐⭐ |  |
| Simplicity of setup | ⭐⭐ | ⭐⭐⭐⭐ |
| Audio quality control | ⭐⭐⭐⭐ | ⭐⭐⭐ |

Shared Abstractions

Both orchestrators use common data structures defined in apps/artagent/backend/voice/shared/base.py:

OrchestratorContext

Input context for turn processing:

from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# MemoManager is the accelerator's session-state manager
# (see MemoManager Integration below).

@dataclass
class OrchestratorContext:
    """Context for orchestrator turn processing."""
    user_text: str                    # Transcribed user input
    conversation_history: List[Dict]  # Prior messages
    metadata: Dict[str, Any]          # Session metadata
    memo_manager: Optional[MemoManager] = None

OrchestratorResult

Output from turn processing:

@dataclass
class OrchestratorResult:
    """Result from orchestrator turn processing."""
    response_text: str               # Agent response
    tool_calls: List[Dict]           # Tools executed
    handoff_occurred: bool           # Whether agent switched
    new_agent: Optional[str]         # Target agent if handoff
    metadata: Dict[str, Any]         # Telemetry data
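
As a hedged illustration of how these structures fit together, a turn handler consumes an OrchestratorContext and returns an OrchestratorResult; the _run_agent helper below is hypothetical:

async def process_turn(ctx: OrchestratorContext) -> OrchestratorResult:
    # Illustrative only: a real orchestrator renders prompts, streams tokens,
    # and executes tools before assembling the result.
    reply, calls = await _run_agent(ctx.user_text, ctx.conversation_history)
    return OrchestratorResult(
        response_text=reply,
        tool_calls=calls,
        handoff_occurred=False,
        new_agent=None,
        metadata={"turn_ms": 0},  # placeholder telemetry
    )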

Turn Processing Patterns

SpeechCascade: Synchronous Turns

SpeechCascade processes turns synchronously — one complete user utterance triggers one agent response:

User Speech → STT → Transcript → LLM (streaming) → Sentences → TTS → Audio
                                   ├─ Tool Execution
                                   └─ Handoff Check

The CascadeOrchestratorAdapter.process_turn() method:

  1. Receives complete transcript
  2. Renders agent prompt with context
  3. Calls Azure OpenAI with tools
  4. Streams response sentence-by-sentence to TTS
  5. Executes any tool calls
  6. Handles handoffs via state update
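
A condensed sketch of those six steps; method names other than process_turn are simplified for illustration:

async def process_turn(self, ctx: OrchestratorContext) -> OrchestratorResult:
    prompt = self._render_prompt(ctx)                # steps 1-2: transcript + prompt
    async for sentence in self._stream_llm(prompt):  # step 3: Azure OpenAI with tools
        await self._tts.speak(sentence)              # step 4: sentence-level TTS
    for call in self._collected_tool_calls:
        await self._execute_tool(call)               # step 5: tool execution
    if self._pending_handoff:
        await self._execute_handoff()                # step 6: state-based handoff
    return self._build_result()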

VoiceLive: Event-Driven

VoiceLive is event-driven — the orchestrator reacts to events from the OpenAI Realtime API:

flowchart LR
    subgraph Audio["Bidirectional Audio"]
        A[Audio Stream]
    end
    subgraph Realtime["OpenAI Realtime API"]
        R[Realtime API]
        E1[Events]
    end
    subgraph Orchestrator["LiveOrchestrator"]
        L[handle_event]
        H[Event Handlers]
        T[Tool Execution]
        S[Handoff + Session Update]
    end
    A <--> R
    R --> E1
    E1 -->|transcription / audio delta / tool call| H
    H --> T
    T --> S

The LiveOrchestrator.handle_event() method routes events:

  • SESSION_UPDATED → Apply agent configuration
  • INPUT_AUDIO_BUFFER_SPEECH_STARTED → Barge-in handling
  • RESPONSE_AUDIO_DELTA → Queue audio for playback
  • RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE → Execute tool
  • RESPONSE_DONE → Finalize turn
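
A simplified dispatch sketch; the handler methods are illustrative, and the dictionary keys are the Realtime API's wire-level event type strings corresponding to the constants above:

async def handle_event(self, event) -> None:
    handlers = {
        "session.updated": self._apply_agent_config,
        "input_audio_buffer.speech_started": self._handle_barge_in,
        "response.audio.delta": self._queue_audio,
        "response.function_call_arguments.done": self._execute_tool_call,
        "response.done": self._finalize_turn,
    }
    handler = handlers.get(event.type)
    if handler is not None:
        await handler(event)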

Handoff Strategies

Both orchestrators now use the unified HandoffService for consistent handoff behavior. The service:

  • Resolves handoff targets from scenario configuration
  • Applies scenario-defined handoff types (announced vs discrete)
  • Builds consistent system_vars for agent context
  • Selects appropriate greetings based on handoff mode

Unified Resolution Flow

# Both orchestrators use the same pattern
resolution = handoff_service.resolve_handoff(
    tool_name="handoff_fraud_agent",
    tool_args=args,
    source_agent=current_agent,
    current_system_vars=system_vars,
)

if resolution.success:
    await self._switch_to(resolution.target_agent, resolution.system_vars)

    if resolution.greet_on_switch:
        greeting = handoff_service.select_greeting(agent, ...)

Handoff Types

| Type | Behavior | Scenario Config |
|------|----------|-----------------|
| announced | Target agent greets the user | type: announced |
| discrete | Target agent continues naturally | type: discrete |

For detailed HandoffService documentation, see Handoff Service.

State-Based (SpeechCascade)

Handoffs are executed by updating MemoManager state:

# In tool execution
if handoff_service.is_handoff(tool_name):
    resolution = handoff_service.resolve_handoff(...)
    if resolution.success:
        self._pending_handoff = resolution
        memo_manager.set_corememory("pending_handoff", resolution.target_agent)

# End of turn
if self._pending_handoff:
    await self._execute_handoff()
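
Deferring the switch to the end of the turn lets the current agent finish streaming its response before control transfers; VoiceLive, by contrast, switches as soon as the handoff tool completes, as shown next.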

Tool-Based (VoiceLive)

Handoffs are immediate upon tool call completion:

async def _execute_tool_call(self, call_id, name, args_json):
    if handoff_service.is_handoff(name):
        resolution = handoff_service.resolve_handoff(
            tool_name=name,
            tool_args=json.loads(args_json),
            source_agent=self._active_agent_name,
            current_system_vars=self._system_vars,
        )

        if resolution.success:
            await self._switch_to(
                resolution.target_agent,
                resolution.system_vars,
            )
            return

    # Execute non-handoff tool
    result = await execute_tool(name, json.loads(args_json))

MemoManager Integration

Both orchestrators sync state with MemoManager for session continuity:

State Keys

class StateKeys:
    ACTIVE_AGENT = "active_agent"
    PENDING_HANDOFF = "pending_handoff"
    HANDOFF_CONTEXT = "handoff_context"
    PREVIOUS_AGENT = "previous_agent"
    VISITED_AGENTS = "visited_agents"

Sync Patterns

# Restore state at turn start
def _sync_from_memo_manager(self):
    self.active = memo.get_corememory("active_agent")
    self.visited_agents = set(memo.get_corememory("visited_agents"))
    self._system_vars["session_profile"] = memo.get_corememory("session_profile")

# Persist state at turn end
def _sync_to_memo_manager(self):
    memo.set_corememory("active_agent", self.active)
    memo.set_corememory("visited_agents", list(self.visited_agents))

Telemetry

Both orchestrators emit OpenTelemetry spans following GenAI semantic conventions:

Span Hierarchy

flowchart TD
    IA["invoke_agent (per agent session)"]
    LR["llm_request (LLM call)"]
    TE1["tool_execution (each tool)"]
    TE2["tool_execution"]
    AS["agent_switch (if handoff)"]
    IA --> LR
    LR --> TE1
    LR --> TE2
    IA --> AS

Key Attributes

| Attribute | Description |
|-----------|-------------|
| gen_ai.operation.name | invoke_agent |
| gen_ai.agent.name | Current agent name |
| gen_ai.usage.input_tokens | Tokens consumed |
| gen_ai.usage.output_tokens | Tokens generated |
| voicelive.llm_ttft_ms | Time to first token, in milliseconds |
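
A hedged sketch of emitting such a span with the OpenTelemetry Python API; the attribute values are illustrative:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("gen_ai.operation.name", "invoke_agent")
    span.set_attribute("gen_ai.agent.name", "fraud_agent")   # illustrative value
    # ... run the turn, then record usage from the LLM response
    span.set_attribute("gen_ai.usage.input_tokens", 412)     # illustrative value
    span.set_attribute("gen_ai.usage.output_tokens", 96)     # illustrative value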

Deep Dive Documentation