
# Production Deployment Guide

## Demo vs Production

This accelerator provides a simplified infrastructure designed for learning and demonstration purposes. Production deployments require significant additional hardening, security controls, and architectural considerations outlined in this guide.


## Current State

The accelerator implements a "flat" architecture optimized for rapid deployment and experimentation:

| Component | Demo Implementation | Production Requirement |
| --- | --- | --- |
| Networking | Public endpoints with authentication | Private endpoints, VNet integration |
| Authentication | Basic scaffolding | Full identity management, MFA |
| API Gateway | Direct container access | API Management with rate limiting |
| Message Queuing | In-process handling | Managed messaging service |
| Observability | Application Insights | Full monitoring stack with alerting |

## Security Hardening

### Network Perimeter

Production deployments should isolate all services within a private network:

```mermaid
graph TB
    subgraph Internet
        Users[Callers]
    end
    subgraph Azure["Azure Network Perimeter"]
        subgraph PublicZone["DMZ"]
            AppGW[Application Gateway]
        end
        subgraph PrivateZone["Private VNet"]
            Apps[Container Apps]
            ACS[ACS]
            Redis[Redis Cache]
            Cosmos[Cosmos DB]
            APIM[API Management]
            Speech[Speech Services]
            AOAI[Azure OpenAI]
        end
    end
    Users --> AppGW
    AppGW --> Apps
    Apps --> ACS
    Apps --> Redis
    Apps --> Cosmos
    Apps --> APIM
    APIM --> Speech
    APIM --> AOAI
```

### Application Gateway for SSL Termination

Use Azure Application Gateway as the ingress point for all HTTP/WebSocket traffic:

- Centralized SSL/TLS termination with managed certificates
- Web Application Firewall (WAF) for threat protection
- Path-based routing and load balancing
- Health probes for backend availability

### API Management as AI Gateway

Azure API Management provides critical controls for AI workloads:

| Capability | Purpose |
| --- | --- |
| Rate Limiting | Prevent resource exhaustion and cost overruns |
| Token Counting | Track and limit Azure OpenAI token consumption |
| Request Transformation | Normalize requests across different AI models |
| Caching | Cache common responses to reduce latency and cost |
| Circuit Breaking | Graceful degradation when backends fail |
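
In production these limits belong in API Management policies (for example, its Azure OpenAI token-limit policy) rather than application code, but the counting idea is easy to sketch. A minimal illustration, assuming the `tiktoken` package and a made-up per-session budget:

```python
# Illustrative only: in production, APIM policies enforce token limits at the
# gateway. This sketch shows the counting idea. The budget is a made-up number.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
SESSION_TOKEN_BUDGET = 50_000  # hypothetical per-call-session cap


def charge_tokens(prompt: str, tokens_used: int) -> int:
    """Count prompt tokens and reject the request once the session budget is spent."""
    prompt_tokens = len(encoding.encode(prompt))
    if tokens_used + prompt_tokens > SESSION_TOKEN_BUDGET:
        raise RuntimeError("Token budget exhausted for this session")
    return tokens_used + prompt_tokens
```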
### Private Endpoints

All Azure services should be accessed via private endpoints:

| Service | Documentation |
| --- | --- |
| Azure Communication Services | ACS Private Link |
| Azure Cache for Redis | Redis Private Link |
| Cosmos DB | Cosmos DB Private Link |
| Azure OpenAI | AOAI Private Link |
| Speech Services | Speech Private Endpoints |
| Container Apps | VNet Integration |

## Authentication Architecture

### Current State

The accelerator provides scaffolding for authentication flows (see Authentication Guide), but production implementations require building out the full identity management layer.

Production authentication should include:

1. **Azure Entra ID** for operator/admin access
2. **Managed Identity** for all service-to-service communication (see the sketch below)
3. **Call Authentication** via DTMF, SIP headers, or external IdP
4. **Session Validation** with secure token management
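
As a minimal sketch of item 2: `DefaultAzureCredential` resolves to the app's managed identity when running in Azure (and to developer credentials locally), so no connection strings are stored. The vault URL and secret name below are placeholders:

```python
# Minimal sketch: service-to-service auth via Managed Identity, no secrets in code.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # managed identity in Azure, dev creds locally
secrets = SecretClient(
    vault_url="https://<your-vault>.vault.azure.net",  # placeholder
    credential=credential,
)
acs_key = secrets.get_secret("acs-access-key").value  # hypothetical secret name
```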

## Scalability Patterns

### Managed Messaging for Backpressure

The demo implementation handles message flow in-process, which limits scalability and resilience. Production deployments should introduce managed messaging services:

| Service | Use Case | Benefits |
| --- | --- | --- |
| Azure Service Bus | Command/event queuing | Dead-letter support, transactions, ordering |
| Azure SignalR Service | WebSocket management | Connection offloading, auto-scaling |
| Azure Event Hubs | High-throughput streaming | Partitioned ingestion, replay capability |

These services address:

- **Connection Management** - Handle thousands of concurrent WebSocket connections
- **Backpressure Handling** - Queue-based buffering prevents overload
- **Failure Isolation** - Failed processing doesn't block healthy flows
- **Horizontal Scaling** - Stateless workers scale independently
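
A sketch of the queue-based pattern with the `azure-servicebus` SDK (namespace and queue name are placeholders): producers enqueue events instead of processing them in-process, and workers drain the queue at their own pace, so bursts buffer in Service Bus rather than overwhelming the app.

```python
# Backpressure sketch with Azure Service Bus: enqueue on ingest, process in a
# worker loop. Namespace and queue name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.servicebus import ServiceBusClient, ServiceBusMessage

NAMESPACE = "<your-namespace>.servicebus.windows.net"  # placeholder
QUEUE = "call-events"  # hypothetical queue name


def handle(message) -> None:
    print("processing", str(message))  # stand-in for real event handling


client = ServiceBusClient(NAMESPACE, credential=DefaultAzureCredential())

# Producer: enqueue instead of handling in-process.
with client.get_queue_sender(QUEUE) as sender:
    sender.send_messages(ServiceBusMessage('{"event": "call.connected"}'))

# Worker: pull small batches; unprocessed messages simply wait in the queue.
with client.get_queue_receiver(QUEUE) as receiver:
    for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
        handle(msg)
        receiver.complete_message(msg)
```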

### Container Apps Configuration

Scale on both HTTP concurrency and CPU utilization, keeping a minimum of two replicas for availability:

```yaml
scale:
  minReplicas: 2        # keep two replicas for availability
  maxReplicas: 20
  rules:
    - name: http-requests
      http:
        metadata:
          concurrentRequests: "50"   # add a replica per 50 concurrent requests
    - name: cpu-utilization
      custom:
        type: cpu
        metadata:
          type: Utilization          # scale on average CPU utilization
          value: "70"
```

## Observability

### Monitoring Stack

| Tool | Purpose |
| --- | --- |
| Application Insights | Distributed tracing, request logging |
| Azure Monitor | Infrastructure metrics, alerting |
| Log Analytics | Centralized log aggregation |
| Dashboards | Real-time operational visibility |

### Key Metrics

| Category | Metric | Target | Alert Threshold |
| --- | --- | --- | --- |
| Latency | End-to-end response | < 2.5s | > 4s |
| Latency | STT processing | < 500ms | > 1s |
| Latency | TTS generation | < 1s | > 2s |
| Availability | Service uptime | 99.9% | < 99.5% |
| Quality | Call drop rate | < 1% | > 2% |
| Resources | Container CPU | < 70% | > 85% |
| Resources | Redis memory | < 70% | > 80% |

### Correlation

Ensure all logs include correlation identifiers:

- `callConnectionId` - ACS call identifier
- `sessionId` - Application session
- `requestId` - Individual request tracking
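
One lightweight way to thread these IDs through every log line is a logging filter; a stdlib-only sketch (the logger name and example IDs are placeholders):

```python
# Attach correlation IDs to every log record via a filter so downstream sinks
# (e.g. Application Insights) can index and join on them.
import logging


class CorrelationFilter(logging.Filter):
    def __init__(self, call_connection_id: str, session_id: str):
        super().__init__()
        self.call_connection_id = call_connection_id
        self.session_id = session_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.callConnectionId = self.call_connection_id
        record.sessionId = self.session_id
        return True


logging.basicConfig(
    format="%(asctime)s %(levelname)s callConnectionId=%(callConnectionId)s "
           "sessionId=%(sessionId)s %(message)s"
)
log = logging.getLogger("voice-app")  # hypothetical logger name
log.addFilter(CorrelationFilter("acs-123", "sess-456"))  # placeholder IDs
log.warning("TTS generation slow")
```

A per-request `requestId` can be attached the same way, typically generated by middleware for each inbound request.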

## Deployment Checklist

### Pre-Production

- [ ] Private endpoints configured for all services
- [ ] Application Gateway with WAF deployed
- [ ] API Management configured with rate limiting
- [ ] Managed Identity enabled (no connection strings)
- [ ] Key Vault integration for all secrets
- [ ] Monitoring dashboards and alerts configured
- [ ] Load testing completed at target scale
- [ ] Security assessment performed
- [ ] Disaster recovery procedures documented

### Go-Live

- [ ] Production environment validated
- [ ] Rollback procedures tested
- [ ] Support team trained and on standby
- [ ] Incident response playbooks ready
- [ ] Performance baselines established

### Post-Launch

- [ ] Metrics reviewed against targets
- [ ] Cost optimization opportunities identified
- [ ] Continuous improvement roadmap updated

## Cost Optimization

| Strategy | Implementation |
| --- | --- |
| Right-sizing | Start with minimum SKUs, scale based on usage |
| Reserved Capacity | 1-year reservations for predictable workloads |
| Auto-scaling | Scale down during off-peak hours |
| Caching | Cache TTS responses and common AI completions |
| Idle Timeout | End idle sessions after 60 seconds |
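
The idle-timeout row maps to a simple per-session watchdog: reset a deadline on every client event and end the session once it passes. A stdlib `asyncio` sketch (the 60-second value mirrors the table):

```python
# Per-session idle watchdog: end the session after 60s without a client event.
import asyncio

IDLE_TIMEOUT_S = 60


async def session_loop(events: asyncio.Queue) -> None:
    while True:
        try:
            event = await asyncio.wait_for(events.get(), timeout=IDLE_TIMEOUT_S)
        except asyncio.TimeoutError:
            print("idle for 60s; ending session to stop paying for idle resources")
            return
        print("handling event:", event)  # stand-in for real event handling
```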

## Cascade Architecture with Fine-Tuned SLMs

For production workloads, consider implementing a cascade architecture that routes requests through progressively more capable models:

```mermaid
graph LR
    Request[Incoming Request] --> Router[Intent Router]
    Router --> SLM[Fine-Tuned SLM]
    Router --> LLM[GPT-4o]
    SLM --> Response[Response]
    LLM --> Response
```

| Tier | Model | Use Case | Cost Profile |
| --- | --- | --- | --- |
| Tier 1 | Fine-tuned Phi-3/SLM | Common intents, FAQ responses | Low |
| Tier 2 | GPT-4o-mini | Moderate complexity, structured outputs | Medium |
| Tier 3 | GPT-4o | Complex reasoning, multi-turn context | High |

Benefits of cascade routing:

- 80-90% cost reduction by handling common queries with SLMs
- Lower latency for simple requests (SLMs respond faster)
- Graceful fallback when SLM confidence is low
- Domain specialization through fine-tuning on your conversation data
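
A minimal sketch of confidence-gated routing with the `openai` SDK against Azure OpenAI. The endpoint, deployment names, FAQ set, and keyword-based router are all illustrative assumptions (real routers use a classifier or the SLM's own confidence signal):

```python
# Illustrative cascade: send common intents to the cheap fine-tuned SLM and
# everything else to GPT-4o. Expects AZURE_OPENAI_API_KEY in the environment;
# endpoint, deployment names, and the toy router are assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_version="2024-06-01",
)

COMMON_INTENTS = {"hours", "balance", "reset password"}  # hypothetical FAQ set


def route(user_text: str) -> str:
    is_common = any(intent in user_text.lower() for intent in COMMON_INTENTS)
    deployment = "phi-3-finetuned" if is_common else "gpt-4o"  # assumed names
    response = client.chat.completions.create(
        model=deployment,  # in Azure OpenAI, "model" is the deployment name
        messages=[{"role": "user", "content": user_text}],
    )
    return response.choices[0].message.content
```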

## Context Management

### Structured Memory and Context Optimization

Large language models have finite context windows. Production deployments should implement structured memory patterns to optimize token usage and maintain conversation coherence:

| Pattern | Description | Benefit |
| --- | --- | --- |
| Hierarchical Summarization | Summarize older turns, keep recent turns verbatim | Reduces tokens while preserving context |
| Semantic Chunking | Store conversation segments by topic/intent | Enables selective retrieval |
| Memory Tiering | Hot (Redis) → Warm (Cosmos) → Cold (Blob) | Cost-effective storage with fast access |
| Entity Extraction | Extract and persist key entities separately | Compact context representation |
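
A sketch of the hierarchical-summarization row: keep the last few turns verbatim and fold everything older into one rolling summary entry (`summarize` is a stand-in for an LLM summarization call, and the turn count is an assumption):

```python
# Hierarchical summarization sketch: recent turns stay verbatim, older turns
# collapse into a single summary entry. summarize() stands in for an LLM call.
KEEP_VERBATIM = 6  # assumed: keep the last 6 turns word-for-word


def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, call a summarization model/prompt here.
    return "Summary of earlier conversation: " + " | ".join(t[:40] for t in turns)


def compact_context(turns: list[str]) -> list[str]:
    if len(turns) <= KEEP_VERBATIM:
        return turns
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return [summarize(older)] + recent
```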

Consider exploring:

- Azure AI Foundry for managed memory and agent orchestration
- Semantic Kernel for structured memory plugins
- Custom embeddings for conversation similarity search

## Evaluations

### Quality Assurance for AI Agents

Production AI agents require systematic evaluation to ensure quality, safety, and alignment with business objectives.

### Current State

The accelerator includes evaluation scaffolding in `samples/labs/dev/evaluation_playground.ipynb`. This provides a starting point for building comprehensive evaluation pipelines.

### Evaluation Dimensions

| Dimension | Metrics | Tools |
| --- | --- | --- |
| Fluency | Coherence, grammar, naturalness | Azure AI Evaluation SDK |
| Groundedness | Factual accuracy, hallucination detection | Promptflow evaluators |
| Relevance | Response appropriateness, intent alignment | Custom evaluators |
| Safety | Content filtering, PII detection | Azure AI Content Safety |
| Latency | End-to-end response time, P95 targets | Application Insights |

A production evaluation workflow typically includes:

1. **Baseline Metrics** - Establish performance benchmarks before changes
2. **Golden Dataset** - Curate representative conversations with expected outputs
3. **Automated Pipelines** - Run evaluations on every deployment
4. **Human-in-the-Loop** - Periodic manual review of edge cases
5. **A/B Testing** - Compare model/prompt variations in production
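
As a sketch of items 2 and 3, a golden-dataset gate that can run in CI. The dataset path, `agent_reply`, the similarity metric, and the threshold are all illustrative stand-ins (real pipelines typically use LLM-as-judge or embedding similarity):

```python
# Illustrative golden-dataset gate: replay curated prompts and fail the build
# if mean similarity to expected outputs drops below a threshold.
import json
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumed pass bar


def agent_reply(prompt: str) -> str:
    raise NotImplementedError("call your agent here")  # stand-in


def score(expected: str, actual: str) -> float:
    # Toy similarity; swap in LLM-as-judge or embedding similarity in practice.
    return SequenceMatcher(None, expected, actual).ratio()


def run_eval(path: str = "golden_dataset.jsonl") -> None:  # hypothetical file
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    scores = [score(c["expected"], agent_reply(c["prompt"])) for c in cases]
    mean = sum(scores) / len(scores)
    assert mean >= THRESHOLD, f"eval regression: mean {mean:.2f} < {THRESHOLD}"
```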

## Resources