# Production Deployment Guide
> **Demo vs Production**
>
> This accelerator provides a simplified infrastructure designed for learning and demonstration purposes. Production deployments require the significant additional hardening, security controls, and architectural changes outlined in this guide.
## Current State
The accelerator implements a "flat" architecture optimized for rapid deployment and experimentation:
| Component | Demo Implementation | Production Requirement |
|---|---|---|
| Networking | Public endpoints with authentication | Private endpoints, VNet integration |
| Authentication | Basic scaffolding | Full identity management, MFA |
| API Gateway | Direct container access | API Management with rate limiting |
| Message Queuing | In-process handling | Managed messaging service |
| Observability | Application Insights | Full monitoring stack with alerting |
## Security Hardening

### Network Perimeter

Production deployments should isolate all services within a private network, using the following controls.
#### Application Gateway for SSL Termination
Use Azure Application Gateway as the ingress point for all HTTP/WebSocket traffic:
- Centralized SSL/TLS termination with managed certificates
- Web Application Firewall (WAF) for threat protection
- Path-based routing and load balancing
- Health probes for backend availability
#### API Management as AI Gateway
Azure API Management provides critical controls for AI workloads:
| Capability | Purpose |
|---|---|
| Rate Limiting | Prevent resource exhaustion and cost overruns |
| Token Counting | Track and limit Azure OpenAI token consumption |
| Request Transformation | Normalize requests across different AI models |
| Caching | Cache common responses to reduce latency and cost |
| Circuit Breaking | Graceful degradation when backends fail |
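As a consumer-side illustration, here is a minimal sketch of calling through the gateway while honoring APIM's rate-limit responses. The endpoint URL and subscription key are placeholders, not values from this accelerator:

```python
import time
import requests

APIM_ENDPOINT = "https://example-apim.azure-api.net/openai/chat"  # hypothetical gateway URL

def call_gateway(payload: dict, max_retries: int = 3) -> dict:
    """Call the AI gateway, backing off when APIM rate limits trip."""
    for attempt in range(max_retries):
        resp = requests.post(
            APIM_ENDPOINT,
            json=payload,
            headers={"Ocp-Apim-Subscription-Key": "<subscription-key>"},
            timeout=30,
        )
        if resp.status_code == 429:
            # APIM rate-limit policies typically return a Retry-After header.
            delay = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("Gateway rate limit not cleared after retries")
```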
#### Private Endpoints
All Azure services should be accessed via private endpoints:
| Service | Documentation |
|---|---|
| Azure Communication Services | ACS Private Link |
| Azure Cache for Redis | Redis Private Link |
| Cosmos DB | Cosmos DB Private Link |
| Azure OpenAI | AOAI Private Link |
| Speech Services | Speech Private Endpoints |
| Container Apps | VNet Integration |
### Authentication Architecture
> **Current State**
>
> The accelerator provides scaffolding for authentication flows (see the Authentication Guide), but production implementations require building out the full identity management layer.
Production authentication should include:
- **Azure Entra ID** for operator/admin access
- **Managed Identity** for all service-to-service communication (see the sketch below)
- **Call Authentication** via DTMF, SIP headers, or an external IdP
- **Session Validation** with secure token management
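For service-to-service calls, `DefaultAzureCredential` picks up the Container App's Managed Identity at runtime and falls back to developer credentials locally. A minimal sketch:

```python
# pip install azure-identity
from azure.identity import DefaultAzureCredential

# Resolves to Managed Identity inside Container Apps, and to
# developer credentials (e.g. `az login`) on a local workstation.
credential = DefaultAzureCredential()

# Acquire a token for Azure OpenAI / Cognitive Services instead of
# storing an API key or connection string.
token = credential.get_token("https://cognitiveservices.azure.com/.default")
print(token.expires_on)
```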
## Scalability Patterns

### Managed Messaging for Backpressure
The demo implementation handles message flow in-process, which limits scalability and resilience. Production deployments should introduce managed messaging services:
| Service | Use Case | Benefits |
|---|---|---|
| Azure Service Bus | Command/event queuing | Dead-letter support, transactions, ordering |
| Azure SignalR Service | WebSocket management | Connection offloading, auto-scaling |
| Azure Event Hubs | High-throughput streaming | Partitioned ingestion, replay capability |
These services address:
- **Connection Management** - Handle thousands of concurrent WebSocket connections
- **Backpressure Handling** - Queue-based buffering prevents overload (see the Service Bus sketch below)
- **Failure Isolation** - Failed processing doesn't block healthy flows
- **Horizontal Scaling** - Stateless workers scale independently
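A minimal producer/consumer sketch with the `azure-servicebus` SDK; the namespace and queue names are hypothetical:

```python
# pip install azure-servicebus azure-identity
from azure.identity import DefaultAzureCredential
from azure.servicebus import ServiceBusClient, ServiceBusMessage

NAMESPACE = "example-namespace.servicebus.windows.net"  # hypothetical
QUEUE = "call-commands"                                 # hypothetical

client = ServiceBusClient(NAMESPACE, credential=DefaultAzureCredential())

# Producer: enqueue instead of processing in-process, so bursts are buffered.
with client.get_queue_sender(QUEUE) as sender:
    sender.send_messages(ServiceBusMessage('{"callId": "abc", "action": "transfer"}'))

# Consumer: a stateless worker pulls at its own pace (backpressure),
# and a failed message can dead-letter without blocking healthy flows.
with client.get_queue_receiver(QUEUE, max_wait_time=5) as receiver:
    for msg in receiver:
        print(str(msg))
        receiver.complete_message(msg)
```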
### Container Apps Configuration
A scale configuration along these lines keeps a warm minimum of two replicas and scales out on request concurrency and CPU:

```yaml
scale:
  minReplicas: 2
  maxReplicas: 20
  rules:
    - name: http-requests
      http:
        metadata:
          concurrentRequests: "50"
    - name: cpu-utilization
      custom:
        type: cpu
        metadata:
          value: "70"
```
## Observability

### Monitoring Stack
| Tool | Purpose |
|---|---|
| Application Insights | Distributed tracing, request logging |
| Azure Monitor | Infrastructure metrics, alerting |
| Log Analytics | Centralized log aggregation |
| Dashboards | Real-time operational visibility |
### Key Metrics
| Category | Metric | Target | Alert Threshold |
|---|---|---|---|
| Latency | End-to-end response | < 2.5s | > 4s |
| Latency | STT processing | < 500ms | > 1s |
| Latency | TTS generation | < 1s | > 2s |
| Availability | Service uptime | 99.9% | < 99.5% |
| Quality | Call drop rate | < 1% | > 2% |
| Resources | Container CPU | < 70% | > 85% |
| Resources | Redis memory | < 70% | > 80% |
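One way to emit these as custom metrics is the Azure Monitor OpenTelemetry distro. A minimal sketch; the metric name and attributes are illustrative choices, not ones the accelerator defines:

```python
# pip install azure-monitor-opentelemetry
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

# Reads APPLICATIONINSIGHTS_CONNECTION_STRING from the environment.
configure_azure_monitor()

meter = metrics.get_meter("voice-agent")
e2e_latency = meter.create_histogram(
    "e2e_response_seconds", unit="s", description="End-to-end response latency"
)

# Record once per conversational turn; alert in Azure Monitor
# when the P95 of this histogram exceeds the 4s threshold above.
e2e_latency.record(1.8, {"stage": "full_turn"})
```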
### Correlation
Ensure all logs include correlation identifiers:
- `callConnectionId` - ACS call identifier
- `sessionId` - Application session
- `requestId` - Individual request tracking
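A minimal sketch of propagating these identifiers with Python's `contextvars` and a logging filter; the field names match the list above, while the wiring itself is an illustrative choice:

```python
import contextvars
import logging

# Context-local correlation IDs, set once per call/session.
call_connection_id = contextvars.ContextVar("callConnectionId", default="-")
session_id = contextvars.ContextVar("sessionId", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every record with the active correlation identifiers."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.callConnectionId = call_connection_id.get()
        record.sessionId = session_id.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s call=%(callConnectionId)s "
    "session=%(sessionId)s %(message)s"
))
handler.addFilter(CorrelationFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

call_connection_id.set("acs-call-123")
session_id.set("sess-456")
logging.info("Transfer initiated")  # carries both IDs automatically
```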
## Deployment Checklist

### Pre-Production
- [ ] Private endpoints configured for all services
- [ ] Application Gateway with WAF deployed
- [ ] API Management configured with rate limiting
- [ ] Managed Identity enabled (no connection strings)
- [ ] Key Vault integration for all secrets
- [ ] Monitoring dashboards and alerts configured
- [ ] Load testing completed at target scale
- [ ] Security assessment performed
- [ ] Disaster recovery procedures documented
### Go-Live
- [ ] Production environment validated
- [ ] Rollback procedures tested
- [ ] Support team trained and on standby
- [ ] Incident response playbooks ready
- [ ] Performance baselines established
### Post-Launch
- [ ] Metrics reviewed against targets
- [ ] Cost optimization opportunities identified
- [ ] Continuous improvement roadmap updated
## Cost Optimization
| Strategy | Implementation |
|---|---|
| Right-sizing | Start with minimum SKUs, scale based on usage |
| Reserved Capacity | 1-year reservations for predictable workloads |
| Auto-scaling | Scale down during off-peak hours |
| Caching | Cache TTS responses and common AI completions |
| Idle Timeout | End idle sessions after 60 seconds |
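As one way to implement the caching row for TTS, here is a sketch that keys synthesized audio by voice and text in Redis. The key scheme, TTL, and the `synthesize` callable are assumptions for illustration, not the accelerator's implementation:

```python
# pip install redis
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_tts(text: str, voice: str, synthesize) -> bytes:
    """Return cached audio for (text, voice), synthesizing only on a miss."""
    key = "tts:" + hashlib.sha256(f"{voice}|{text}".encode()).hexdigest()
    audio = cache.get(key)
    if audio is None:
        audio = synthesize(text, voice)  # e.g. a Speech SDK call
        cache.setex(key, 86400, audio)   # 24h TTL bounds staleness and memory
    return audio
```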
### Cascade Architecture with Fine-Tuned SLMs
For production workloads, consider implementing a cascade architecture that routes requests through progressively more capable models:
| Tier | Model | Use Case | Cost Profile |
|---|---|---|---|
| Tier 1 | Fine-tuned Phi-3/SLM | Common intents, FAQ responses | Low |
| Tier 2 | GPT-4o-mini | Moderate complexity, structured outputs | Medium |
| Tier 3 | GPT-4o | Complex reasoning, multi-turn context | High |
Benefits of cascade routing (a minimal routing sketch follows the list):
- 80-90% cost reduction by handling common queries with SLMs
- Lower latency for simple requests (SLMs respond faster)
- Graceful fallback when SLM confidence is low
- Domain specialization through fine-tuning on your conversation data
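A minimal routing sketch, assuming each tier can report a confidence score for its own answer; how that score is produced (a classifier, a log-probability heuristic, etc.) is left abstract:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    generate: Callable[[str], tuple[str, float]]  # returns (answer, confidence)
    threshold: float  # minimum confidence to accept this tier's answer

def cascade(query: str, tiers: list[Tier]) -> str:
    """Try cheap models first; escalate while confidence is below threshold."""
    for tier in tiers[:-1]:
        answer, confidence = tier.generate(query)
        if confidence >= tier.threshold:
            return answer  # common intents stop at the cheap tier
    # The final tier is the catch-all; accept its answer unconditionally.
    answer, _ = tiers[-1].generate(query)
    return answer

# Hypothetical wiring: Tier 1 fine-tuned SLM, Tier 2 GPT-4o-mini, Tier 3 GPT-4o.
# tiers = [Tier("phi-3-ft", slm_fn, 0.85), Tier("gpt-4o-mini", mini_fn, 0.7),
#          Tier("gpt-4o", full_fn, 0.0)]
```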
## Context Management

### Structured Memory and Context Optimization
Large language models have finite context windows. Production deployments should implement structured memory patterns to optimize token usage and maintain conversation coherence:
| Pattern | Description | Benefit |
|---|---|---|
| Hierarchical Summarization | Summarize older turns, keep recent turns verbatim | Reduces tokens while preserving context |
| Semantic Chunking | Store conversation segments by topic/intent | Enables selective retrieval |
| Memory Tiering | Hot (Redis) → Warm (Cosmos) → Cold (Blob) | Cost-effective storage with fast access |
| Entity Extraction | Extract and persist key entities separately | Compact context representation |
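A sketch of the hot-to-warm lookup from the memory-tiering row, using the `redis` and `azure-cosmos` SDKs; the account, database, and container names are hypothetical:

```python
# pip install redis azure-cosmos
import redis
from azure.cosmos import CosmosClient, exceptions

hot = redis.Redis(host="localhost", port=6379)
warm = (
    CosmosClient("https://example-account.documents.azure.com", credential="<key>")
    .get_database_client("memory")
    .get_container_client("sessions")
)

def load_session(session_id: str):
    """Check the hot tier first; fall back to the warm tier on a miss."""
    state = hot.get(f"session:{session_id}")
    if state is not None:
        return state
    try:
        # This sketch partitions sessions by their own id.
        return warm.read_item(item=session_id, partition_key=session_id)
    except exceptions.CosmosResourceNotFoundError:
        return None  # a cold tier (Blob) or a fresh session would go here
```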
Consider exploring:
- Azure AI Foundry for managed memory and agent orchestration
- Semantic Kernel for structured memory plugins
- Custom embeddings for conversation similarity search
## Evaluations

### Quality Assurance for AI Agents
Production AI agents require systematic evaluation to ensure quality, safety, and alignment with business objectives.
> **Current State**
>
> The accelerator includes evaluation scaffolding in `samples/labs/dev/evaluation_playground.ipynb`. This provides a starting point for building comprehensive evaluation pipelines.
### Evaluation Dimensions
| Dimension | Metrics | Tools |
|---|---|---|
| Fluency | Coherence, grammar, naturalness | Azure AI Evaluation SDK |
| Groundedness | Factual accuracy, hallucination detection | Promptflow evaluators |
| Relevance | Response appropriateness, intent alignment | Custom evaluators |
| Safety | Content filtering, PII detection | Azure AI Content Safety |
| Latency | End-to-end response time, P95 targets | Application Insights |
### Recommended Approach
1. **Baseline Metrics** - Establish performance benchmarks before changes
2. **Golden Dataset** - Curate representative conversations with expected outputs
3. **Automated Pipelines** - Run evaluations on every deployment (see the sketch below)
4. **Human-in-the-Loop** - Periodic manual review of edge cases
5. **A/B Testing** - Compare model/prompt variations in production
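A starting point with the Azure AI Evaluation SDK; treat the signatures as a sketch and verify them against the `azure-ai-evaluation` version you install. The endpoint, deployment, and dataset names are placeholders:

```python
# pip install azure-ai-evaluation
from azure.ai.evaluation import GroundednessEvaluator, evaluate

model_config = {
    "azure_endpoint": "https://example-aoai.openai.azure.com",  # hypothetical
    "api_key": "<key>",
    "azure_deployment": "gpt-4o",
}

groundedness = GroundednessEvaluator(model_config)

# Single-turn spot check against one golden example.
result = groundedness(
    query="What are your support hours?",
    response="We are open 9am to 5pm, Monday through Friday.",
    context="Support hours: 9:00-17:00, Mon-Fri.",
)
print(result)

# Batch run over a golden dataset (JSONL with query/response/context
# columns), suitable for wiring into a deployment pipeline.
evaluate(data="golden_dataset.jsonl", evaluators={"groundedness": groundedness})
```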