# Telemetry & Observability Plan for Voice-to-Voice Agent
Status: DRAFT | Created: 2025 | Audience: Engineering Team
This document outlines a structured approach to instrumentation, metrics, and logging for our real-time voice agent application. The goal is to provide actionable observability without overwhelming noise, aligned with OpenTelemetry GenAI semantic conventions and optimized for Azure Application Insights Application Map.
> **Official Guidance:** This implementation follows Microsoft's recommended patterns for tracing AI agents in production and Azure Monitor OpenTelemetry integration.
## Goals
- Application Map Visualization - Show end-to-end topology with component → dependency relationships
- Measure Latency Per Turn - Track time-to-first-byte (TTFB) at each integration point
- Instrument LLM Interactions - Follow OpenTelemetry GenAI semantic conventions
- Monitor Speech Services - STT/TTS latencies and error rates
- Support Debugging - Correlate logs/traces across call sessions
- Avoid Noise - Filter out high-frequency WebSocket frame logs
## Application Map Design
The Application Map shows components (your code) and dependencies (external services). Per Microsoft's documentation, proper visualization requires correct resource attributes and span kinds.
> **Reference:** Application Map: Triage Distributed Applications
### Target Application Map Topology
*(Diagram: the browser client reaches the voice, ACS, and events endpoints; all three feed the media handler; the media handler calls Azure OpenAI (`azure.ai.openai`), Azure Speech (`azure.speech`), and Azure Communication Services (`azure.communication`); Azure OpenAI uses Azure Redis (`redis`), and Azure Speech persists to Azure Cosmos DB (`cosmosdb`).)*
### Critical Application Map Requirements

| Requirement | How It's Achieved | App Map Impact |
|---|---|---|
| Cloud Role Name | `service.name` resource attribute | Creates node on map |
| Cloud Role Instance | `service.instance.id` resource attribute | Drill-down for load balancing |
| Dependencies | Spans with `kind=CLIENT` + `peer.service` | Creates edges to external services |
| Requests | Spans with `kind=SERVER` | Shows inbound traffic |
| Correlation | W3C `traceparent` header propagation | Connects distributed traces |
### Resource Attributes (Set at Startup)
Per Azure Monitor OpenTelemetry configuration, `service.name` maps to Cloud Role Name and `service.instance.id` maps to Cloud Role Instance:
```python
# In telemetry_config.py
import os
import socket

from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "artagent-api",          # → Cloud Role Name
    "service.namespace": "voice-agent",      # → Groups related services
    "service.instance.id": os.getenv("HOSTNAME", socket.gethostname()),  # → Cloud Role Instance
    "service.version": os.getenv("APP_VERSION", "1.0.0"),
    "deployment.environment": os.getenv("ENVIRONMENT", "development"),
})
```
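If the Azure Monitor distro is used for export, the resource can be passed straight through. A minimal sketch, assuming the `azure-monitor-opentelemetry` package and its `resource` keyword:

```python
import os

from azure.monitor.opentelemetry import configure_azure_monitor

# Sketch: export to Application Insights with the Cloud Role attributes
# defined above (assumes `resource` from the previous snippet is in scope).
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    resource=resource,
)
```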
## Architecture Layers & Instrumentation Points
*(Diagram: a browser dashboard connects via WebSocket to the API layer. API endpoints (SERVER spans): the browser voice WebSocket, `/media/acs` (ACS calls), and `/api/events` (webhooks). These feed the HANDLERS layer: `SpeechCascadeHandler` (INTERNAL spans) with `_on_user_transcript()`, `_on_partial_transcript()`, and `_on_vad_event()`. Handlers feed the ORCHESTRATION layer: `ArtAgentFlow`, `ToolExecution`, and `ResponseOrchestrator` (all INTERNAL). Orchestration calls the EXTERNAL services as CLIENT spans: Azure OpenAI (`peer.service=azure.ai.openai`), Azure Speech (`peer.service=azure.speech`), Azure Communication Services (`peer.service=azure.communication`), Azure Redis (`peer.service=redis`), and Azure Cosmos DB (`peer.service=cosmosdb`).)*
**Legend:**

| Span Kind | Description | App Insights |
|---|---|---|
| SERVER | `SpanKind.SERVER` - inbound request | Creates "request" |
| CLIENT | `SpanKind.CLIENT` - outbound call | Creates "dependency" |
| INTERNAL | `SpanKind.INTERNAL` - internal processing | Shows in trace details |
## Key Metrics to Capture
### 1. Per-Turn Metrics (Conversation Flow)
| Metric | Description | Collection Point |
|---|---|---|
| `turn.user_speech_duration` | Time user was speaking | VAD → end-of-speech |
| `turn.stt_latency` | STT final result latency | `_on_user_transcript()` |
| `turn.llm_ttfb` | Time to first LLM token | `ArtAgentFlow.run()` |
| `turn.llm_total` | Total LLM response time | `ArtAgentFlow.run()` |
| `turn.tts_ttfb` | Time to first TTS audio | `speech_synthesizer` |
| `turn.tts_total` | Total TTS synthesis time | `speech_synthesizer` |
| `turn.total_latency` | Full turn round-trip | VAD start → audio playback begins |
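A minimal sketch of how these stage timings could be captured and stamped onto the active span; the `TurnTimer` helper and stage names here are illustrative, not the project's actual API:

```python
import time

from opentelemetry import trace

class TurnTimer:
    """Illustrative helper: records per-stage timestamps for one turn."""

    def __init__(self) -> None:
        self._marks: dict[str, float] = {}

    def mark(self, stage: str) -> None:
        self._marks[stage] = time.perf_counter()

    def elapsed_ms(self, start: str, end: str) -> float:
        return (self._marks[end] - self._marks[start]) * 1000.0

# Usage inside a turn:
timer = TurnTimer()
timer.mark("vad_start")
# ... user speaks, STT finalizes, LLM streams, TTS playback begins ...
timer.mark("playback_start")
span = trace.get_current_span()
span.set_attribute("turn.total_latency_ms", timer.elapsed_ms("vad_start", "playback_start"))
```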
### 2. LLM Metrics (OpenTelemetry GenAI Conventions)
These attributes follow the OpenTelemetry Semantic Conventions for Generative AI, which define standardized telemetry for LLM operations:
| Attribute | OTel Attribute Name | Example |
|---|---|---|
| Provider | `gen_ai.provider.name` | `azure.ai.openai` |
| Operation | `gen_ai.operation.name` | `chat` |
| Model Requested | `gen_ai.request.model` | `gpt-4o` |
| Model Used | `gen_ai.response.model` | `gpt-4o-2024-05-13` |
| Input Tokens | `gen_ai.usage.input_tokens` | `150` |
| Output Tokens | `gen_ai.usage.output_tokens` | `75` |
| Finish Reason | `gen_ai.response.finish_reasons` | `["stop"]` |
| Duration | `gen_ai.client.operation.duration` | `0.823s` |
| TTFB | `gen_ai.server.time_to_first_token` | `0.142s` |
### 3. Speech Services Metrics

| Metric | Attribute | Unit |
|---|---|---|
| STT Recognition Time | `speech.stt.recognition_duration` | seconds |
| STT Confidence | `speech.stt.confidence` | 0.0-1.0 |
| TTS Synthesis Time | `speech.tts.synthesis_duration` | seconds |
| TTS Audio Size | `speech.tts.audio_size_bytes` | bytes |
| TTS Voice | `speech.tts.voice` | string |
### 4. Session/Call Metrics

| Metric | Description |
|---|---|
| `session.turn_count` | Total turns in session |
| `session.total_duration` | Session length |
| `session.avg_turn_latency` | Average turn latency |
| `call.connection_id` | ACS call correlation ID |
| `transport.type` | ACS or BROWSER |
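For aggregate views (dashboards, SLO charts), these can also be emitted as OpenTelemetry metrics rather than only span attributes. A sketch using the standard metrics API; the instrument names are illustrative:

```python
from opentelemetry import metrics

meter = metrics.get_meter("voice_agent")

# Histogram for per-turn latency; Azure Monitor surfaces these as customMetrics.
turn_latency = meter.create_histogram(
    name="turn.total_latency",
    unit="ms",
    description="End-to-end turn latency from end of user speech to first audio",
)
session_turns = meter.create_counter(
    name="session.turn_count",
    unit="{turn}",
    description="Turns completed per session",
)

turn_latency.record(742.0, attributes={"transport.type": "BROWSER"})
session_turns.add(1, attributes={"session.id": "abc123"})
```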
## Span Hierarchy (Trace Structure)
Following OpenTelemetry GenAI semantic conventions with proper SpanKind for Application Map. The span hierarchy below aligns with Azure AI Foundry tracing patterns:
```
[ROOT] voice_session (SERVER)                       ← Shows as REQUEST in App Insights
 ├── call.connection_id, session.id, transport.type
 │
 ├─► [CHILD] conversation_turn (INTERNAL)           ← Shows in trace timeline
 │    ├── turn.number, turn.user_intent_preview
 │    │
 │    ├─► [CHILD] stt.recognition (CLIENT)          ← Shows as DEPENDENCY to "azure.speech"
 │    │    ├── peer.service="azure.speech"
 │    │    ├── server.address="<region>.api.cognitive.microsoft.com"
 │    │    └── speech.stt.*, gen_ai.provider.name="azure.speech"
 │    │
 │    ├─► [CHILD] chat {model} (CLIENT)             ← Shows as DEPENDENCY to "azure.ai.openai"
 │    │    ├── peer.service="azure.ai.openai"
 │    │    ├── server.address="<resource>.openai.azure.com"
 │    │    ├── gen_ai.operation.name="chat"
 │    │    ├── gen_ai.provider.name="azure.ai.openai"
 │    │    ├── gen_ai.request.model, gen_ai.response.model
 │    │    ├── gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
 │    │    ├── [EVENT] gen_ai.content.prompt (opt-in)
 │    │    └── [EVENT] gen_ai.content.completion (opt-in)
 │    │
 │    ├─► [CHILD] execute_tool {tool_name} (INTERNAL)  ← if function calling
 │    │    ├── gen_ai.operation.name="execute_tool"
 │    │    ├── gen_ai.tool.name, gen_ai.tool.call.id
 │    │    └── gen_ai.tool.call.result (opt-in)
 │    │
 │    └─► [CHILD] tts.synthesis (CLIENT)            ← Shows as DEPENDENCY to "azure.speech"
 │         ├── peer.service="azure.speech"
 │         ├── server.address="<region>.api.cognitive.microsoft.com"
 │         └── speech.tts.*, gen_ai.provider.name="azure.speech"
 │
 ├─► [CHILD] redis.operation (CLIENT)               ← Shows as DEPENDENCY to "redis"
 │    ├── peer.service="redis"
 │    ├── db.system="redis"
 │    └── db.operation="SET/GET/HSET"
 │
 └─► [CHILD] cosmosdb.operation (CLIENT)            ← Shows as DEPENDENCY to "cosmosdb"
      ├── peer.service="cosmosdb"
      ├── db.system="cosmosdb"
      └── db.operation="query/upsert"
```
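This nesting falls out of `start_as_current_span` naturally. A minimal sketch; the session/turn span names match the hierarchy above, while the handler body is illustrative:

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

def handle_session(session_id: str, call_connection_id: str) -> None:
    # ROOT: one SERVER span per voice session → a "request" in App Insights.
    with tracer.start_as_current_span("voice_session", kind=SpanKind.SERVER) as session:
        session.set_attribute("session.id", session_id)
        session.set_attribute("call.connection_id", call_connection_id)
        for turn_number in range(1, 3):  # illustrative: two turns
            with tracer.start_as_current_span("conversation_turn", kind=SpanKind.INTERNAL) as turn:
                turn.set_attribute("turn.number", turn_number)
                # CLIENT child spans (stt.recognition, chat {model}, tts.synthesis)
                # opened inside this block inherit the turn as parent automatically.
```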
## Dependency Tracking for Application Map
For each external service call, create a CLIENT span with these attributes:
### Azure OpenAI (LLM)
```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

# `model` and `resource_name` come from your deployment configuration.
with tracer.start_as_current_span(
    name=f"chat {model}",  # Span name format: "{operation} {target}"
    kind=SpanKind.CLIENT,
) as span:
    # Required for Application Map edge
    span.set_attribute("peer.service", "azure.ai.openai")
    span.set_attribute("server.address", f"{resource_name}.openai.azure.com")
    span.set_attribute("server.port", 443)
    # GenAI semantic conventions
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.provider.name", "azure.ai.openai")
    span.set_attribute("gen_ai.request.model", model)
    # After response
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute("gen_ai.response.finish_reasons", [choice.finish_reason])
```
### Azure Speech (STT/TTS)

```python
with tracer.start_as_current_span(
    name="stt.recognize_once",  # or "tts.synthesize" for synthesis spans
    kind=SpanKind.CLIENT,
) as span:
    # Required for Application Map edge
    span.set_attribute("peer.service", "azure.speech")
    span.set_attribute("server.address", f"{region}.api.cognitive.microsoft.com")
    span.set_attribute("server.port", 443)
    # Speech-specific attributes: set the STT attributes on recognition spans
    # and the TTS attributes on synthesis spans.
    span.set_attribute("speech.stt.language", "en-US")
    span.set_attribute("speech.tts.voice", voice_name)
    span.set_attribute("speech.tts.output_format", "audio-24khz-48kbitrate-mono-mp3")
```
### Azure Communication Services

```python
with tracer.start_as_current_span(
    name="acs.answer_call",  # or "acs.play_media", "acs.stop_media"
    kind=SpanKind.CLIENT,
) as span:
    span.set_attribute("peer.service", "azure.communication")
    span.set_attribute("server.address", f"{resource_name}.communication.azure.com")
    span.set_attribute("acs.call_connection_id", call_connection_id)
    span.set_attribute("acs.operation", "answer_call")
```
### Redis

```python
with tracer.start_as_current_span(
    name="redis.hset",
    kind=SpanKind.CLIENT,
) as span:
    span.set_attribute("peer.service", "redis")
    span.set_attribute("db.system", "redis")
    span.set_attribute("db.operation", "HSET")
    span.set_attribute("server.address", redis_host)
    span.set_attribute("server.port", 6379)
```
### Cosmos DB

```python
with tracer.start_as_current_span(
    name="cosmosdb.query_items",
    kind=SpanKind.CLIENT,
) as span:
    span.set_attribute("peer.service", "cosmosdb")
    span.set_attribute("db.system", "cosmosdb")
    span.set_attribute("db.operation", "query")
    span.set_attribute("db.cosmosdb.container", container_name)
    span.set_attribute("server.address", f"{account_name}.documents.azure.com")
```
## Noise Reduction Strategy
### What to FILTER OUT (too noisy)

| Source | Reason | Implementation |
|---|---|---|
| Individual WebSocket `send()`/`recv()` | High frequency, no signal | `NoisySpanFilterSampler` in `telemetry_config.py` |
| Per-audio-frame logs | Creates 50+ log entries per second | Sampler drops spans matching patterns |
| Azure credential retry logs | Noise during auth fallback | Logger level set to WARNING |
| Health check pings | `/health`, `/ready` endpoints | Can add to sampler patterns |
### Span Filtering Patterns (Implemented)
The `NoisySpanFilterSampler` drops spans matching these patterns:
```python
NOISY_SPAN_PATTERNS = [
    r".*websocket\s*(receive|send).*",      # WebSocket frame operations
    r".*ws[._](receive|send).*",            # Alternative WS naming
    r"HTTP.*websocket.*",                   # HTTP spans for WS endpoints
    r"^(GET|POST)\s+.*(websocket|/ws/).*",  # Method + WebSocket path
]

NOISY_URL_PATTERNS = [
    "/api/v1/browser/conversation",  # Browser WebSocket endpoint
    "/api/v1/acs/media",             # ACS media streaming endpoint
    "/ws/",                          # Generic WebSocket paths
]
```
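A minimal sketch of what such a sampler could look like. The real `NoisySpanFilterSampler` lives in `telemetry_config.py`; this version is illustrative:

```python
import re

from opentelemetry.sdk.trace.sampling import (
    ALWAYS_ON,
    Decision,
    Sampler,
    SamplingResult,
)

class NoisySpanFilterSampler(Sampler):
    """Illustrative sampler: drops spans whose names match noisy patterns,
    delegating everything else to a wrapped sampler."""

    def __init__(self, delegate: Sampler, patterns: list) -> None:
        self._delegate = delegate
        self._patterns = [re.compile(p, re.IGNORECASE) for p in patterns]

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        if any(p.match(name or "") for p in self._patterns):
            return SamplingResult(Decision.DROP)
        return self._delegate.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "NoisySpanFilterSampler"

# e.g. wrap the default: NoisySpanFilterSampler(ALWAYS_ON, NOISY_SPAN_PATTERNS)
```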
### What to SAMPLE (reduce volume)
| Source | Sampling Rate | Reason |
|---|---|---|
| Partial STT transcripts | 10% | Still need visibility |
| VAD frame events | 1% | Only need aggregate |
| WebSocket keepalive | 0% | No value |
### Logger Suppression (Implemented)
```python
# In telemetry_config.py - suppressed at module import
import logging

NOISY_LOGGERS = [
    "azure.identity",
    "azure.core.pipeline",
    "websockets.protocol",
    "websockets.client",
    "aiohttp.access",
    "httpx", "httpcore",
    "redis.asyncio.connection",
    "opentelemetry.sdk.trace",
]

for name in NOISY_LOGGERS:
    logging.getLogger(name).setLevel(logging.WARNING)
```
## Structured Log Format & Session Context
### Automatic Correlation with `session_context`
The project uses `contextvars`-based session context for automatic correlation propagation. Set the context once at the connection level, and all nested logs/spans inherit the correlation IDs:
```python
from utils.session_context import session_context

# At WebSocket entry point - set ONCE:
async with session_context(
    call_connection_id=call_connection_id,
    session_id=session_id,
    transport_type="BROWSER",  # or "ACS"
):
    # ALL logs and spans within this block automatically get correlation
    await handler.run()
```
Inside nested functions, no extra parameters are needed:

```python
# In speech_cascade_handler.py, media_handler.py, etc.
logger.info("Processing speech")  # Automatically includes session_id, call_connection_id

# Spans also get correlation automatically via SessionContextSpanProcessor
with tracer.start_as_current_span("my_operation"):
    pass  # Span has session.id, call.connection.id attributes
```
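A minimal sketch of the `contextvars` mechanics; the project's real implementation is in `utils/session_context.py`, and the field names follow the `SessionCorrelation` dataclass described below:

```python
from contextlib import asynccontextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionCorrelation:
    call_connection_id: str
    session_id: str
    transport_type: str
    agent_name: Optional[str] = None

# One ContextVar holds the active correlation; async tasks inherit it.
_current: ContextVar[Optional[SessionCorrelation]] = ContextVar(
    "session_correlation", default=None
)

@asynccontextmanager
async def session_context(**kwargs):
    token = _current.set(SessionCorrelation(**kwargs))
    try:
        yield
    finally:
        _current.reset(token)
```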
### Architecture
*(Diagram: `session_context` wraps the `MediaHandler`; STT callbacks (logs auto-correlated), the orchestrator (spans auto-correlated), and all Redis/Cosmos DB spans inside it inherit the session correlation automatically.)*
### How It Works
1. The `SessionCorrelation` dataclass holds `call_connection_id`, `session_id`, `transport_type`, and `agent_name`
2. The `session_context` async context manager sets the `contextvars.ContextVar`
3. `TraceLogFilter` in `ml_logging.py` reads from the context and adds the fields to log records
4. `SessionContextSpanProcessor` in `telemetry_config.py` injects the attributes into all spans
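An illustrative sketch of steps 3 and 4, reusing the `_current` ContextVar from the sketch above; the real classes live in `ml_logging.py` and `telemetry_config.py`:

```python
import logging

from opentelemetry.sdk.trace import SpanProcessor

class TraceLogFilter(logging.Filter):
    """Copies the active SessionCorrelation onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        corr = _current.get()
        if corr is not None:
            record.session_id = corr.session_id
            record.call_connection_id = corr.call_connection_id
        return True

class SessionContextSpanProcessor(SpanProcessor):
    """Stamps correlation attributes onto every span at start."""

    def on_start(self, span, parent_context=None):
        corr = _current.get()
        if corr is not None:
            span.set_attribute("session.id", corr.session_id)
            span.set_attribute("call.connection.id", corr.call_connection_id)
```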
### Legacy Explicit Logging (Still Supported)
For cases outside a session context, an explicit `extra` dict still works:
```python
logger.info(
    "Turn completed",
    extra={
        "call_connection_id": call_connection_id,
        "session_id": session_id,
        "turn_number": turn_number,
        "turn_latency_ms": turn_latency_ms,
    }
)
```
### Log Levels by Purpose

| Level | Use Case |
|---|---|
| `DEBUG` | Frame-level, internal state (disabled in prod) |
| `INFO` | Turn boundaries, session lifecycle, latency summaries |
| `WARNING` | Retry logic, degraded performance |
| `ERROR` | Failed operations, exceptions |
## Storage Strategy
### 1. Real-Time Dashboard (Redis)
Store in `CoreMemory["latency"]` via the existing `LatencyTool`:
```python
# Current implementation in latency_helpers.py
corememory["latency"] = {
    "current_run_id": "abc123",
    "runs": {
        "abc123": {
            "samples": [
                {"stage": "llm_ttfb", "dur": 0.142, "meta": {...}},
                {"stage": "tts_ttfb", "dur": 0.089, "meta": {...}},
            ]
        }
    }
}
```
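An illustrative way samples might be appended to that structure; the `record_stage` helper is assumed here, not the project's actual `LatencyTool` API:

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def record_stage(corememory: dict, stage: str, **meta):
    """Illustrative: time a stage and append a sample to the active run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency = corememory.setdefault(
            "latency", {"current_run_id": uuid.uuid4().hex, "runs": {}}
        )
        run = latency["runs"].setdefault(latency["current_run_id"], {"samples": []})
        run["samples"].append(
            {"stage": stage, "dur": time.perf_counter() - start, "meta": meta}
        )

# Usage:
# with record_stage(corememory, "llm_ttfb", model="gpt-4o"):
#     await first_token()
```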
### 2. Historical Analysis (Application Insights)
Export via OpenTelemetry → Azure Monitor:
```python
# Already configured in telemetry_config.py
from azure.monitor.opentelemetry import configure_azure_monitor

configure_azure_monitor(
    connection_string=APPLICATIONINSIGHTS_CONNECTION_STRING,
    instrumentation_options={
        "azure_sdk": {"enabled": True},
        "fastapi": {"enabled": True},
    },
)
```
### 3. Per-Session Summary (Redis → Cosmos DB)
At session end, persist aggregated metrics:
```python
session_summary = latency_tool.session_summary()
# Returns: {"llm_ttfb": {"avg": 0.15, "min": 0.12, "max": 0.21, "count": 5}}
```
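A sketch of the aggregation that summary implies, assuming the sample structure shown above:

```python
from collections import defaultdict

def session_summary(samples: list) -> dict:
    """Illustrative: aggregate per-stage samples into avg/min/max/count."""
    by_stage = defaultdict(list)
    for s in samples:
        by_stage[s["stage"]].append(s["dur"])
    return {
        stage: {
            "avg": sum(durs) / len(durs),
            "min": min(durs),
            "max": max(durs),
            "count": len(durs),
        }
        for stage, durs in by_stage.items()
    }
```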
## Service Level Objectives (SLOs)
### Voice Agent SLO Definitions
| Metric | Target | Warning | Critical | Measurement |
|---|---|---|---|---|
| Turn Latency (P95) | < 2,000 ms | > 2,500 ms | > 4,000 ms | End-to-end from user speech end to agent speech start |
| Turn Latency (P50) | < 800 ms | > 1,200 ms | > 2,000 ms | Median response time |
| Azure OpenAI Latency (P95) | < 1,500 ms | > 2,000 ms | > 3,000 ms | LLM inference time per call |
| STT Latency (P95) | < 500 ms | > 800 ms | > 1,200 ms | Speech recognition final result |
| TTS Latency (P95) | < 600 ms | > 1,000 ms | > 1,500 ms | Time to first audio byte |
| Error Rate | < 1% | > 2% | > 5% | Failed requests / total requests |
| Availability | 99.9% | < 99.5% | < 99% | Successful health checks |
### SLO Monitoring KQL Queries
```kusto
// Real-Time SLO Dashboard - Turn Latency
dependencies
| where timestamp > ago(1h)
| where isnotempty(customDimensions["turn.total_latency_ms"])
| extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
| summarize
    p50 = percentile(turn_latency_ms, 50),
    p95 = percentile(turn_latency_ms, 95),
    p99 = percentile(turn_latency_ms, 99),
    total = count()
    by bin(timestamp, 5m)
| extend
    p95_slo_met = p95 < 2000,
    p50_slo_met = p50 < 800
| project timestamp, p50, p95, p99, p95_slo_met, p50_slo_met, total
```

```kusto
// SLO Compliance Summary (Last 24h)
dependencies
| where timestamp > ago(24h)
| where isnotempty(customDimensions["turn.total_latency_ms"])
| extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
| summarize
    total_turns = count(),
    turns_under_2s = countif(turn_latency_ms < 2000),
    turns_under_800ms = countif(turn_latency_ms < 800),
    p95_latency = percentile(turn_latency_ms, 95)
| extend
    p95_slo_compliance = round(100.0 * turns_under_2s / total_turns, 2),
    p50_slo_compliance = round(100.0 * turns_under_800ms / total_turns, 2)
| project
    total_turns,
    p95_latency,
    p95_slo_compliance,
    p50_slo_compliance,
    slo_status = iff(p95_latency < 2000, "✅ Met", "❌ Breached")
```
## Alert Configuration
### Azure Monitor Alert Rules
Create these alert rules in Azure Portal → Application Insights → Alerts:
#### 1. Turn Latency P95 Breach (Critical)

```kusto
// Alert when P95 turn latency exceeds 4 seconds (Critical threshold)
dependencies
| where timestamp > ago(15m)
| where isnotempty(customDimensions["turn.total_latency_ms"])
| extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
| summarize p95_latency = percentile(turn_latency_ms, 95)
| where p95_latency > 4000
```

#### 2. Turn Latency P95 Warning

```kusto
// Alert when P95 turn latency exceeds 2.5 seconds (Warning threshold)
dependencies
| where timestamp > ago(15m)
| where isnotempty(customDimensions["turn.total_latency_ms"])
| extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
| summarize p95_latency = percentile(turn_latency_ms, 95)
| where p95_latency > 2500 and p95_latency <= 4000
```

#### 3. Azure OpenAI High Latency

```kusto
// Alert when OpenAI response time exceeds 3 seconds
dependencies
| where timestamp > ago(15m)
| where target contains "openai" or name startswith "chat"
| summarize
    p95_duration = percentile(duration, 95),
    call_count = count()
| where p95_duration > 3000 and call_count > 5
```

#### 4. High Error Rate

```kusto
// Alert when error rate exceeds 5%
dependencies
| where timestamp > ago(15m)
| summarize
    total = count(),
    failed = countif(success == false)
| extend error_rate = round(100.0 * failed / total, 2)
| where error_rate > 5 and total > 10
```

#### 5. Service Health Check Failure

```kusto
// Alert when /api/v1/readiness returns non-200
requests
| where timestamp > ago(10m)
| where name contains "readiness"
| summarize
    total = count(),
    failures = countif(success == false)
| where failures > 3
```
### Alert Rule Bicep Template
Deploy alerts via Infrastructure as Code:
```bicep
// infra/bicep/modules/alerts.bicep
param appInsightsName string
param actionGroupId string
param location string = resourceGroup().location

resource appInsights 'Microsoft.Insights/components@2020-02-02' existing = {
  name: appInsightsName
}

resource turnLatencyAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'Turn-Latency-P95-Critical'
  location: location
  properties: {
    displayName: 'Voice Agent Turn Latency P95 > 4s'
    severity: 1
    enabled: true
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    scopes: [appInsights.id]
    criteria: {
      allOf: [
        {
          query: '''
            dependencies
            | where isnotempty(customDimensions["turn.total_latency_ms"])
            | extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
            | summarize p95_latency = percentile(turn_latency_ms, 95)
            | where p95_latency > 4000
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
          failingPeriods: {
            minFailingPeriodsToAlert: 1
            numberOfEvaluationPeriods: 1
          }
        }
      ]
    }
    actions: {
      actionGroups: [actionGroupId]
    }
  }
}
```
## Intelligent View (Smart Detection)
Application Insights Smart Detection automatically identifies anomalies in your application telemetry using machine learning algorithms.
### Enabling Smart Detection
- Navigate to Application Insights → Smart Detection in the Azure Portal
- Enable the following rules:
| Rule | Purpose | Recommended Setting |
|---|---|---|
| Failure Anomalies | Detect unusual spike in failed requests | ✅ Enabled |
| Performance Anomalies | Detect response time degradation | ✅ Enabled |
| Memory Leak | Detect gradual memory increase | ✅ Enabled |
| Dependency Duration | Detect slow external calls | ✅ Enabled |
### Custom Anomaly Detection Query
```kusto
// Detect latency anomalies using dynamic thresholds
let baseline = dependencies
| where timestamp between(ago(7d) .. ago(1d))
| where target contains "openai"
| summarize avg_duration = avg(duration), stdev_duration = stdev(duration);
dependencies
| where timestamp > ago(1h)
| where target contains "openai"
| summarize current_avg = avg(duration) by bin(timestamp, 5m)
| extend threshold = toscalar(baseline | project avg_duration + 2 * stdev_duration)
| where current_avg > threshold
| project timestamp, current_avg, threshold, anomaly = true
```
## Health Check Endpoints
The application provides comprehensive health monitoring via REST endpoints:
### Liveness Probe: `GET /api/v1/health`
Returns `200 OK` if the server process is running. Used by Kubernetes/load balancers for liveness checks.
Response includes:
- Basic service status
- Active session count
- WebSocket connection metrics
### Readiness Probe: `GET /api/v1/readiness`
Returns `200 OK` only if all critical dependencies are healthy. Returns `503 Service Unavailable` if any are unhealthy.
Dependencies checked (with a 1s timeout each):
- ✅ Redis - Connectivity and ping response
- ✅ Azure OpenAI - Client initialization
- ✅ Speech Services - STT/TTS pool readiness
- ✅ ACS Caller - Phone number configuration
- ✅ RT Agents - All agents initialized
- ✅ Auth Configuration - GUID validation (when enabled)
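A minimal sketch of such a probe, with FastAPI assumed from the project's instrumentation options; the check helpers here are illustrative stand-ins for the real Redis/OpenAI/Speech/ACS checks:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

async def check_redis() -> None:
    ...  # e.g. await redis_client.ping()

async def check_openai() -> None:
    ...  # e.g. verify the client is initialized

async def _probe(name: str, check) -> tuple:
    try:
        await asyncio.wait_for(check(), timeout=1.0)  # 1s budget per dependency
        return name, True
    except Exception:
        return name, False

@app.get("/api/v1/readiness")
async def readiness() -> JSONResponse:
    # Run all dependency checks concurrently, each under its own timeout.
    results = dict(await asyncio.gather(
        _probe("redis", check_redis),
        _probe("openai", check_openai),
    ))
    healthy = all(results.values())
    return JSONResponse(status_code=200 if healthy else 503,
                        content={"healthy": healthy, "checks": results})
```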
### Health Check Integration
Health probes follow Azure Container Apps health probe configuration and Kubernetes probe patterns.
Kubernetes Deployment:
```yaml
livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/v1/readiness
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 2
```
Azure Container Apps:
```bicep
probes: [
  {
    type: 'Liveness'
    httpGet: {
      path: '/api/v1/health'
      port: 8000
    }
    periodSeconds: 10
  }
  {
    type: 'Readiness'
    httpGet: {
      path: '/api/v1/readiness'
      port: 8000
    }
    periodSeconds: 15
  }
]
```
## Application Insights Queries (KQL)
> **Note:** These queries use the classic Application Insights table names (`dependencies`, `traces`, `requests`). For Log Analytics workspaces, use `AppDependencies`, `AppTraces`, `AppRequests` instead.
### Application Map Dependencies Overview

```kusto
// See all dependencies grouped by target (peer.service)
// Validated against Azure Monitor documentation 2024
dependencies
| where timestamp > ago(24h)
| summarize
    call_count = count(),
    avg_duration_ms = avg(duration),
    failure_rate = round(100.0 * countif(success == false) / count(), 2)
    by target, type, cloud_RoleName
| order by call_count desc
```
### GenAI (LLM) Performance by Model

```kusto
// Track Azure OpenAI performance with GenAI semantic conventions
dependencies
| where timestamp > ago(24h)
| where target contains "openai" or name startswith "chat"
| extend model = tostring(customDimensions["gen_ai.request.model"])
| extend input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"])
| extend output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| where isnotempty(model)
| summarize
    calls = count(),
    avg_duration_ms = avg(duration),
    p50_duration = percentile(duration, 50),
    p95_duration = percentile(duration, 95),
    p99_duration = percentile(duration, 99),
    total_input_tokens = sum(input_tokens),
    total_output_tokens = sum(output_tokens),
    failure_rate = round(100.0 * countif(success == false) / count(), 2)
    by model, bin(timestamp, 1h)
| order by timestamp desc
```
### GenAI Token Usage Over Time (Cost Tracking)

```kusto
// Track token consumption for cost analysis
dependencies
| where timestamp > ago(7d)
| where target contains "openai"
| extend model = tostring(customDimensions["gen_ai.request.model"])
| extend input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"])
| extend output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| where input_tokens > 0 or output_tokens > 0
| summarize
    total_input = sum(input_tokens),
    total_output = sum(output_tokens),
    total_tokens = sum(input_tokens) + sum(output_tokens),
    request_count = count()
    by bin(timestamp, 1d), model
| order by timestamp desc
| render columnchart
```
### Speech Services Latency (STT + TTS)

```kusto
// Monitor Azure Speech service performance
dependencies
| where timestamp > ago(24h)
| where target contains "speech" or name startswith "stt" or name startswith "tts"
| extend operation = case(
    name contains "stt" or name contains "recognition", "STT",
    name contains "tts" or name contains "synthesis", "TTS",
    "Other"
)
| summarize
    calls = count(),
    avg_duration_ms = avg(duration),
    p95_duration = percentile(duration, 95),
    failure_rate = round(100.0 * countif(success == false) / count(), 2)
    by operation, bin(timestamp, 1h)
| render timechart
```
### Turn Latency Distribution

```kusto
// Analyze conversation turn latency from span attributes
// Note: Turn metrics are stored in span customDimensions
dependencies
| where timestamp > ago(24h)
| where isnotempty(customDimensions["turn.total_latency_ms"])
| extend turn_latency_ms = todouble(customDimensions["turn.total_latency_ms"])
| extend session_id = tostring(customDimensions["session.id"])
| summarize
    avg_latency = avg(turn_latency_ms),
    p50 = percentile(turn_latency_ms, 50),
    p95 = percentile(turn_latency_ms, 95),
    p99 = percentile(turn_latency_ms, 99),
    turn_count = count()
    by bin(timestamp, 1h)
| render timechart
```
### Token Usage by Session

```kusto
// Aggregate token usage per conversation session
dependencies
| where timestamp > ago(24h)
| where isnotempty(customDimensions["gen_ai.usage.input_tokens"])
| extend
    session_id = tostring(customDimensions["session.id"]),
    input_tokens = toint(customDimensions["gen_ai.usage.input_tokens"]),
    output_tokens = toint(customDimensions["gen_ai.usage.output_tokens"])
| summarize
    total_input = sum(input_tokens),
    total_output = sum(output_tokens),
    turns = count()
    by session_id
| extend total_tokens = total_input + total_output
| order by total_tokens desc
| take 50
```
### End-to-End Trace Correlation

```kusto
// Find all telemetry for a specific call/session
// Replace <your-session-id> with actual session ID
let target_session = "<your-session-id>";
union requests, dependencies, traces
| where timestamp > ago(24h)
| where customDimensions["session.id"] == target_session
    or customDimensions["call.connection_id"] == target_session
    or operation_Id == target_session
| project
    timestamp,
    itemType,
    name,
    duration,
    success,
    operation_Id,
    target = coalesce(target, ""),
    message = coalesce(message, "")
| order by timestamp asc
```
### Application Map Health Check

```kusto
// Verify all expected service dependencies are reporting
dependencies
| where timestamp > ago(1h)
| summarize
    last_seen = max(timestamp),
    call_count = count(),
    avg_duration = avg(duration),
    error_count = countif(success == false)
    by target, cloud_RoleName
| extend minutes_since_last = datetime_diff('minute', now(), last_seen)
| extend health_status = case(
    minutes_since_last > 30, "⚠️ Stale",
    error_count > call_count * 0.1, "🔴 High Errors",
    avg_duration > 5000, "🟡 Slow",
    "🟢 Healthy"
)
| project target, cloud_RoleName, call_count, avg_duration, error_count, last_seen, health_status
| order by call_count desc
```
### Error Analysis by Service

```kusto
// Identify failing dependencies and error patterns
dependencies
| where timestamp > ago(24h)
| where success == false
| extend error_code = tostring(resultCode)
| summarize
    error_count = count(),
    first_seen = min(timestamp),
    last_seen = max(timestamp)
    by target, name, error_code
| order by error_count desc
| take 20
```
## OpenAI Client Auto-Instrumentation
The project uses the `opentelemetry-instrumentation-openai-v2` package for automatic tracing of OpenAI API calls with GenAI semantic conventions. This follows Microsoft's recommended approach for tracing generative AI applications.
> **Reference:** Enable tracing for Azure AI Agents SDK
### What Gets Instrumented Automatically
When enabled, the `OpenAIInstrumentor` creates spans for:

| Operation | Span Name Pattern | Attributes |
|---|---|---|
| Chat Completions | `chat {model}` | `gen_ai.usage.*`, `gen_ai.request.model` |
| Streaming | `chat {model}` | Token streaming with usage tracking |
| Tool Calls | Child of chat span | `gen_ai.tool.name`, arguments |
### How It's Configured
Enabled automatically in `telemetry_config.py`:
```python
from opentelemetry import trace
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# Called during setup_azure_monitor() after the TracerProvider is set
tracer_provider = trace.get_tracer_provider()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```
### Content Recording (Prompt/Completion Capture)
To capture `gen_ai.request.messages` and `gen_ai.response.choices` in traces:

```bash
# Environment variable (.env or deployment config)
AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED=true
```
> **Warning:** This captures full prompt and completion text, which may contain PII. Only enable in development or with proper data handling.
### Verifying Instrumentation
Check if instrumentation is active:

```python
from utils.telemetry_config import is_openai_instrumented

if is_openai_instrumented():
    print("OpenAI client auto-instrumentation enabled")
```
### Installation
The package is included in `requirements.txt`:

```
opentelemetry-instrumentation-openai-v2
```
### GenAI Semantic Conventions
The instrumentor follows OpenTelemetry GenAI semantic conventions.
Attributes captured:
- `gen_ai.request.model` - Model deployment ID
- `gen_ai.request.max_tokens` - Max tokens requested
- `gen_ai.request.temperature` - Sampling temperature
- `gen_ai.usage.input_tokens` - Prompt tokens used
- `gen_ai.usage.output_tokens` - Completion tokens generated
- `gen_ai.response.finish_reasons` - Why generation stopped
## References
### Azure AI & Agents
| Topic | Documentation |
|---|---|
| Tracing AI Agents | Enable tracing for Azure AI Agents SDK |
| Production Tracing | Tracing in production with the Azure AI SDK |
| Visualize Traces | Visualize your traces in Azure AI Foundry |
### Azure Monitor & Application Insights
| Topic | Documentation |
|---|---|
| Application Map | Application Map: Triage Distributed Applications |
| OpenTelemetry Setup | Enable Azure Monitor OpenTelemetry |
| Cloud Role Configuration | Set Cloud Role Name and Instance |
| Add/Modify Telemetry | Add and modify OpenTelemetry |
| Smart Detection | Proactive Diagnostics |
| Log-based Alerts | Create log alerts |
### OpenTelemetry Standards
| Topic | Documentation |
|---|---|
| GenAI Semantic Conventions | Generative AI Spans |
| GenAI Metrics | Generative AI Metrics |
| Span Kinds | Span Kind |
| Context Propagation | Context and Propagation |
### Azure Services
| Topic | Documentation |
|---|---|
| Azure Speech Telemetry | Speech SDK logging |
| Azure OpenAI Monitoring | Monitor Azure OpenAI |
| Container Apps Health Probes | Health probes in Azure Container Apps |
| Redis Monitoring | Monitor Azure Cache for Redis |
| Cosmos DB Monitoring | Monitor Azure Cosmos DB |
### Project Implementation
- Telemetry Configuration: `utils/telemetry_config.py`
- Latency Tracking Tool: `src/tools/latency_tool.py`
- Session Context: `utils/session_context.py`
- Logging Configuration: `utils/ml_logging.py`