# Customer runbook — day-2 operations
Walkthrough version: Deliver to a customer → 7. Operate (Day 2) is the partner-facing summary that points back to this runbook. This page is the canonical day-2 reference handed to the customer's ops team.
This runbook is for the customer's ops team after the partner has deployed the accelerator into the customer's Azure and handed off. It covers the operations the customer owns: monitoring, killswitch, evals re-run, model swap, incident response, cost tracking, and scaling.
The partner's engagement-specific handover packet (endpoint URLs,
HITL approver wiring, customer-specific alert rules, rollback path,
SLA details, deviations from shipped defaults) complements this
runbook. Template lives at
docs/handover/handover-packet-template.md.
When the two conflict, the partner's packet wins — it describes the
customer-specific wiring this generic runbook cannot.
Audience: customer SRE / platform / AI-ops on-call. Assumes access to the Azure subscription hosting the deployment, App Insights, Azure AI Foundry, and (if the partner used the forked-template pattern) the customer's GitHub fork.
Acronyms used in this runbook:
- HITL — human-in-the-loop (a partner-authored approver webhook gates side-effect tool calls)
- MI — Managed Identity (the user-assigned identity that holds RBAC on Foundry, Search, Key Vault)
- RBAC — role-based access control (Azure's auth model — no keys; identity + role assignments)
- CCoE — Cloud Center of Excellence (the customer team that owns hub networking, private DNS, and Azure policy in Tier 3 deployments)
Not in scope here:
- Commercial / SOW / SLA questions → partner delivery lead
- Redesigning the scenario or adding a new one → new engagement, back to `/discover-scenario` + `/scaffold-from-brief`
- Source-code changes to the accelerator template itself → upstream Issues; the customer's fork is the day-2 change surface
## On-call now (do these 3 things)
If you've been paged for a P1 (bad output, unsafe tool behavior, model outage), do these in order. Detail and other incident types are in Section 9 — Incident playbook.
- Flip the killswitch — halts every side-effect tool call; read-only retrieval and inference keep running so in-flight sessions don't error.

  ```bash
  az containerapp update \
    --name <api-container-app-name> \
    --resource-group <rg> \
    --set-env-vars KILLSWITCH_TOOLS=on
  ```

  Detail: Section 3 — Killswitch. Re-enable by setting the var back to empty or `off`.
- Check App Insights — query the `traces` table (the canonical event surface; see Section 2 — Monitoring for why) for `message == 'response.returned'` with `tostring(customDimensions.ok) == 'false'` and for `message == 'tool.hitl_misconfigured'` over the incident window. Correlate to a specific agent / tool; a starting query is sketched below. Workbook panels are in Section 2 — Monitoring.
- Page the partner approver / delivery lead — the partner's handover packet lists the named contact and SLA. HITL approver reachability and the customer-specific rollback path live there, not here.
Disengage the killswitch only after evals pass (Section 5 — Re-running evals).
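A starting KQL sketch for that triage, run against the App Insights `traces` table. The `agent` attribute name in `customDimensions` is an assumption — verify against the emitters in `src/accelerator_baseline/telemetry.py`:

```kusto
traces
| where timestamp > ago(1h)        // widen to the incident window
| where message == 'tool.hitl_misconfigured'
     or (message == 'response.returned'
         and tostring(customDimensions.ok) == 'false')
| summarize hits = count()
        by message, agent = tostring(customDimensions.agent)  // attribute name assumed
| order by hits desc
```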
## Daily ops

- Open the Azure Monitor workbook built from `infra/dashboards/roi-kpis.json` (deploy once via App Insights → Workbooks → New → Advanced editor; see Section 2 — Dashboard).
- Triage three signals: error rate (`response.returned` with `ok=false`), HITL misconfiguration (`tool.hitl_misconfigured` should be 0), and P95 latency vs the threshold in `accelerator.yaml.acceptance.p95_latency_ms` (a latency query is sketched below).
- Confirm the HITL approver rota is current — the partner's handover packet lists the on-call rotation. Stale rota = blocked side-effect tool calls.
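A hedged sketch of the latency check, assuming the Container App emits standard request telemetry with a cloud role name ending in `api` (the same filter the workbook's latency panel uses):

```kusto
requests
| where timestamp > ago(1d)
| where cloud_RoleName endswith 'api'
| summarize p95_ms = percentile(duration, 95)
// compare p95_ms against accelerator.yaml.acceptance.p95_latency_ms
```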
## Weekly ops
- Re-run the eval suites the partner shipped (quality + redteam). Detail: Section 5 — Re-running evals. Investigate any regression before the next prompt or model change ships.
- Review the cost trend panel (Section 4 — Cost). Investigate any week-over-week jump > 20% — usually a prompt regression inflating output tokens or a usage-pattern shift.
- Confirm killswitch and secret-rotation drills are still in muscle memory — run the drill once per quarter at minimum.
## Handover acceptance checklist
Before accepting handover from the partner, confirm:
- Alerts wired — error-rate, P95 latency, and HITL-misconfigured alert rules exist in Azure Monitor and route to the customer on-call channel (Section 2 — Alerts).
- Approver rota current — the HITL approver service responds, and the partner packet lists the named on-call rotation with an SLA.
- Killswitch tested — you have flipped `KILLSWITCH_TOOLS=on` against the deployed Container App and confirmed side-effect tools halt.
- Runbook walked — this document plus the partner's handover packet have been read by the named on-call team; questions raised during the walkthrough are resolved.
- Eval URL set — you know how to trigger the eval suites and where the results land (workflow URL or local `results.jsonl` path).
- Acceptance signed — partner delivery lead and customer ops lead both sign the handover packet's acceptance section.
If any of the above is missing, refuse handover — day-2 ops without these is unsupported.
## What you inherited
At handover, the customer owns a deployed environment provisioned by
azd up against infra/main.bicep (resource-group scope). The
resource-group name is set by the partner during provisioning — see
the partner's handover notes for the exact value. Inside the group:
- Foundry (AIServices) account + project — hosts agent definitions and model deployment(s). Model deployment names come from `accelerator.yaml.models[]` via Bicep `loadYamlContent` at compile time.
- Azure AI Search — index(es) declared in `accelerator.yaml → scenario.retrieval.indexes[]`. Seeded by `src/bootstrap.py` at FastAPI startup (replaces the previous postprovision azd hook).
- Key Vault — present for partner-added secrets, accessed via RBAC + Managed Identity.
- Container App (API) — runs the scenario workflow and exposes the endpoint. Env vars (`APPLICATIONINSIGHTS_CONNECTION_STRING`, `AZURE_AI_FOUNDRY_ENDPOINT`, `AZURE_AI_FOUNDRY_MODEL`, `AZURE_AI_SEARCH_ENDPOINT`) are set from Bicep outputs as plain env values — there are no Key Vault secret references wired by default.
- User-assigned Managed Identity — shared principal that holds RBAC on Foundry, Search, and Key Vault.
- Application Insights + Log Analytics — telemetry sink.
Tier 3 (`landing_zone.mode: alz-integrated`) additions:
- Private endpoints on Foundry / AI Search / Key Vault are created by the workload modules (`infra/modules/{foundry,ai-search,key-vault}.bicep`) when `enablePrivateLink = true`.
- `infra/alz-overlay/` is a separate subscription-scope deploy the platform team runs once to provision the spoke (vNet, workload subnet, peering to the hub). It consumes existing hub private DNS zone resource IDs supplied by the customer's CCoE and optionally creates vNet links from those zones to the spoke (controlled by `createDnsZoneLinks`). It does not create the hub zones themselves — those are hub-owned. Zone IDs and the subnet ID flow into `infra/main.bicep` via azd env vars. `tier3InputGuard` in `infra/main.bicep` fails provision fast if any of those env vars are missing; it does not monitor for post-deploy drift.
Everything above is redeployed idempotently by azd up against a
given commit. Rollback is azd down --purge + azd up at a prior
commit.
## 2. Monitoring

### Telemetry plane
Every workflow, worker, tool, and HITL checkpoint emits typed
OpenTelemetry events to Application Insights via the
APPLICATIONINSIGHTS_CONNECTION_STRING env var on the Container App.
Event definitions live in src/accelerator_baseline/telemetry.py.
Events emitted today:
| Event | Emitted from | Fires when |
|---|---|---|
| `request.received` | `src/scenarios/sales_research/workflow.py` | a request enters the scenario workflow |
| `supervisor.routed` | `src/workflow/supervisor.py` | supervisor chose which workers to run |
| `worker.completed` | `src/workflow/supervisor.py` | a worker returned successfully |
| `worker.skipped` | `src/workflow/supervisor.py` | a worker skipped (e.g., dependency failed) |
| `tool.executed` | `src/tools/*.py` | any tool (read-only or side-effect) executed |
| `tool.hitl_approved` | `src/accelerator_baseline/hitl.py` | human reviewer approved a tool call |
| `tool.hitl_rejected` | `src/accelerator_baseline/hitl.py` | human reviewer rejected a tool call |
| `tool.hitl_misconfigured` | `src/accelerator_baseline/hitl.py` | production had no `HITL_APPROVER_ENDPOINT` |
| `retrieval.returned` | `src/retrieval/ai_search.py` + `src/scenarios/sales_research/workflow.py` | AI Search call completed; workflow also emits this with `ok=False` on exception |
| `response.returned` | `src/main.py` + `src/scenarios/sales_research/workflow.py` | workflow returned to caller (with `ok: true\|false`) |
The registry also declares tool.hitl_skipped (emitted when a tool
call passes policy="never" into checkpoint()), but both
side-effect tools shipped with the flagship (crm_write_contact,
send_email) use HITL_POLICY = "always", so this event does not
fire in the out-of-the-box flagship. A scenario whose tool code
explicitly passes policy="never" (typically aligned with
accelerator.yaml.solution.hitl = none for reversible actions) would
produce it.
Registered but not emitted by the flagship scenario today:
- `aggregator.composed` — reserved for scenarios that compose worker outputs in an aggregator stage. The flagship doesn't use that pattern; a scenario that does must `emit_event` manually.
- `cost.call` — the emitter `record_call_cost(agent, UsageSample(…))` lives in `src/accelerator_baseline/cost.py` but is not called from the flagship workflow by default. Partners wire it into their Foundry call path when cost-per-call reporting is in scope.
- `eval.result` — emitted only if the partner wired the eval runner to push each case's result into App Insights (the shipped runners write to `results.jsonl` only).
The partner's handover packet should list which of these additional events they wired, if any.
### Dashboard
infra/dashboards/roi-kpis.json is an Azure Monitor workbook template
(ARM JSON). It is not auto-deployed. To use it: Application Insights
→ Workbooks → New → Advanced editor → paste the JSON → Save.
It ships 5 panels. Which ones show data depends on what your partner wired:
| Panel | Works out of the box? | Depends on |
|---|---|---|
| Successful responses per day | Yes | `response.returned` (always emitted) |
| HITL approval rate | Yes, when HITL is in use | `tool.hitl_approved`, `tool.hitl_rejected` |
| P95 request latency | Yes | Container App request telemetry (`cloud_RoleName endswith 'api'`) |
| $ per call (estimated) | Only if partner wired `cost.call` | `cost.call` events |
| Groundedness eval score trend | Only if partner wired `eval.result` | `eval.result` events |
Only the latency panel filters by cloud_RoleName today. The event
panels query traces across the whole App Insights resource (events emitted by
src/accelerator_baseline/telemetry.py::emit_event land there with
message == event.name and attributes in customDimensions) —
if the resource is shared with other workloads, add a
cloud_RoleName filter before operationalizing.
### Alerts

The accelerator does not ship Azure Monitor alert rules — you wire them. Starting points (copy the KQL from the workbook):

- Error-rate alert — `response.returned` with `customDimensions.ok == 'false'` exceeding N% over 15 minutes (a starting query is sketched below)
- P95 latency alert — P95 over 1h above the threshold in `accelerator.yaml.acceptance.p95_latency_ms`
- HITL misconfiguration alert — any `tool.hitl_misconfigured` event (this indicates production is running without an approver)
- HITL rejection spike — `tool.hitl_rejected` > N / hour
Thresholds are customer-owned. The accelerator does not opine.
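A minimal sketch of the error-rate query behind the first alert; the 15-minute window matches the bullet above, and the 5% threshold is a placeholder, not a recommendation:

```kusto
traces
| where timestamp > ago(15m)
| where message == 'response.returned'
| summarize total = count(),
            errors = countif(tostring(customDimensions.ok) == 'false')
| extend error_rate_pct = 100.0 * errors / todouble(total)
| where error_rate_pct > 5   // placeholder threshold — customer-owned
```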
## 3. Operational dials at runtime

### Killswitch
src/accelerator_baseline/killswitch.py::assert_enabled(scope) raises
KillSwitchEngaged when the env var KILLSWITCH_<SCOPE> is
"1" | "true" | "on" (case-insensitive). Default scope is tools.
Flip the switch on the Container App:

```bash
az containerapp update \
  --name <api-container-app-name> \
  --resource-group <rg> \
  --set-env-vars KILLSWITCH_TOOLS=on
```
This halts every side-effect tool call (read-only retrieval + agent
inference keep working). Re-enable by setting the var back to empty
or off.
Important: env vars set via az containerapp update will be
overwritten the next time azd provision runs (the Bicep template
is the source of truth). For anything longer than an incident
mitigation, add the variable to infra/modules/container-app.bicep
in the customer's fork and redeploy. For partners who need a
portal-style toggle, the killswitch docstring points at Azure App
Configuration feature flags as an extension pattern.
### HITL approver

HITL is a partner-authored webhook, not a UI this accelerator ships. The contract (from `src/accelerator_baseline/hitl.py`):

- `HITL_APPROVER_ENDPOINT` = URL of the partner's approver service. `checkpoint()` posts the action + args and blocks until a decision returns.
- `HITL_DEV_MODE=1` = local development auto-approve with a loud warning. Never set this in production.
- Neither set = `checkpoint()` raises `HITLMisconfigured` and emits `tool.hitl_misconfigured` — a deliberately loud failure so a misconfigured production does not side-effect without a human.
The partner's handover packet lists the actual approver URL, the approvers, the SLA, and the escalation path. Day-2 ops owns reachability of that endpoint; the partner owns its logic.
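For orientation only, a hypothetical stub of the approver side — field names and response shape here are illustrative, not the contract in `src/accelerator_baseline/hitl.py` (the handover packet documents the real one):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ApprovalRequest(BaseModel):  # illustrative field names
    action: str
    args: dict

@app.post("/approve")
async def approve(req: ApprovalRequest) -> dict:
    # A real approver pages a human and returns their decision;
    # auto-rejecting side-effects is the safe default for a stub.
    return {"approved": False, "reason": "stub approver — human rota not wired"}
```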
### Content filter
Azure AI content filter is bound to the model deployment in
infra/modules/foundry.bicep via IaC. Severity thresholds live in
the Bicep — changing them requires a Bicep edit and azd provision.
Editing the filter in the Foundry portal is not supported — the in-app FastAPI startup bootstrap (`src/bootstrap.py`) will still verify
a filter is bound, but portal drift from the IaC is undefined
behavior. Do not disable the filter; the RAI pattern doc
(docs/patterns/rai/README.md) calls this out as out-of-support.
## 4. Cost

### Where the numbers come from
MODEL_PRICE_USD_PER_1K_TOKENS in src/accelerator_baseline/cost.py
ships rough defaults for gpt-5.2 and gpt-5-mini. The shipped
template default deployment is gpt-5-mini (from
infra/main.parameters.json), which is not in that price table.
Crucially, cost.call telemetry is not emitted by the shipped
flagship — the emitter record_call_cost(agent,
UsageSample(model, input_tokens, output_tokens)) is exported from
src/accelerator_baseline/cost.py but is not invoked from the
flagship workflow. To get cost signal in production a partner
must do both:
- Wire `record_call_cost(...)` into the Foundry call site so the event actually emits, AND
- Populate `MODEL_PRICE_USD_PER_1K_TOKENS` with the deployed model's pricing so `estimate_call_cost()` returns a non-zero value.
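A minimal sketch of that wiring, assuming `src/` is the import root and using the `record_call_cost(agent, UsageSample(model, input_tokens, output_tokens))` signature quoted above — the actual Foundry call site and the shape of the price-table entries are engagement-specific:

```python
from accelerator_baseline.cost import (  # import root assumed to be src/
    MODEL_PRICE_USD_PER_1K_TOKENS,
    UsageSample,
    record_call_cost,
)

# 1) Give the deployed model a price entry so estimate_call_cost() is non-zero.
#    Entry shape (flat USD-per-1K float) is an assumption — check cost.py.
MODEL_PRICE_USD_PER_1K_TOKENS.setdefault("gpt-5-mini", 0.0)  # use real pricing

def on_foundry_response(agent: str, model: str, usage) -> None:
    # 2) Call this from the Foundry call site after each model response
    #    so cost.call actually emits.
    record_call_cost(agent, UsageSample(model, usage.input_tokens, usage.output_tokens))
```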
Confirm with the partner whether they wired both. If they did not,
cost.call events will not appear in App Insights, the cost-per-call
dashboard panel will be empty, and the cost_per_call_usd acceptance
gate will trip a loud failure on every eval run (see Section 5).
### Azure-side cost monitoring
infra/main.bicep tags every resource with azd-env-name=<envName>
and workload=<scenarioId>-accelerator. Use those tags as filters
in Azure Cost Management views. Additional chargeback tags
(costcenter, businessunit) are a partner-scope Bicep edit.
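A quick way to sanity-check that the tags landed (Cost Management filtering itself is done in the portal); `az resource list` accepts a single `--tag` filter:

```bash
# List everything carrying the workload tag set by infra/main.bicep
az resource list \
  --tag workload=<scenarioId>-accelerator \
  --query "[].{name:name, type:type}" -o table
```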
## 5. Re-running evals against the deployed environment

### Quality evals

The quality eval runner hits the deployed endpoint for each case in `evals/quality/golden_cases.jsonl`, writes `evals/quality/results.jsonl`, and returns non-zero if individual cases fail their per-case check.
### Redteam evals

Same shape — cases in `evals/redteam/cases.jsonl`, output in `evals/redteam/results.jsonl`.
### Acceptance enforcement

Threshold enforcement (from `accelerator.yaml.acceptance`) is a second step. After both runners finish, `src/accelerator_baseline/evals.py::evaluate_acceptance` reads both `results.jsonl` files and enforces the following (sketched in code below):

- average quality score ≥ `quality_threshold`
- average groundedness ≥ `groundedness_threshold`
- P95 latency ≤ `p95_latency_ms`
- average `cost_usd` ≤ `cost_per_call_usd` (fails if no `cost_usd` values were recorded — a silent inert gate is treated as a failure)
- if `redteam_must_pass: true`, zero redteam cases may fail
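For intuition, a condensed sketch of what that gate does; the per-case field names (`quality_score`, `groundedness`, `latency_ms`, `cost_usd`, `passed`) are hypothetical — the shipped `evaluate_acceptance` is the authority:

```python
import json
from pathlib import Path
from statistics import mean, quantiles

def load(path: str) -> list[dict]:
    # results.jsonl: one JSON object per case
    return [json.loads(l) for l in Path(path).read_text().splitlines() if l.strip()]

def evaluate_acceptance(acceptance: dict) -> list[str]:
    quality = load("evals/quality/results.jsonl")
    redteam = load("evals/redteam/results.jsonl")
    failures = []
    if mean(c["quality_score"] for c in quality) < acceptance["quality_threshold"]:
        failures.append("average quality below threshold")
    if mean(c["groundedness"] for c in quality) < acceptance["groundedness_threshold"]:
        failures.append("average groundedness below threshold")
    latencies = sorted(c["latency_ms"] for c in quality)
    if quantiles(latencies, n=20)[18] > acceptance["p95_latency_ms"]:  # ~P95
        failures.append("P95 latency above threshold")
    costs = [c["cost_usd"] for c in quality if c.get("cost_usd") is not None]
    if not costs:  # an inert cost gate is a failure, not a skip
        failures.append("no cost_usd recorded — cost gate is inert")
    elif mean(costs) > acceptance["cost_per_call_usd"]:
        failures.append("average cost per call above threshold")
    if acceptance.get("redteam_must_pass") and any(not c["passed"] for c in redteam):
        failures.append("a redteam case failed")
    return failures
```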
The same two-step runs automatically in CI:
- `.github/workflows/evals.yml` (PR gate) — requires the repo variable `EVALS_API_URL` pointing at the long-lived deployed environment. See the "Required GitHub secrets and variables" section of `docs/getting-started/setup-and-prereqs.md`.
- `.github/workflows/deploy.yml` (post-deploy check) — passes `needs.azd-up.outputs.api_url` (the URL just deployed) directly to the eval runners; no `EVALS_API_URL` variable is required for this workflow.
### When to run manually
- After a model swap
- After a prompt edit (see Section 7)
- Before a monthly value review
- On any production incident where output quality is suspected
## 6. Model swap
The authoritative source of truth for which model(s) are deployed
is accelerator.yaml.models[], not infra/main.bicep param defaults.
On every `azd provision` or `azd up`, `infra/main.bicep` parses the manifest at compile time via `loadYamlContent`, and the preprovision hook rewrites the managed model env vars (`AZURE_AI_FOUNDRY_MODEL_NAME`, `_MODEL_VERSION`, `_MODEL`, `_MODEL_CAPACITY`, `_EXTRA_DEPLOYMENTS_JSON`) from the manifest. Raw `azd env set AZURE_AI_FOUNDRY_MODEL_NAME=…` overrides will be clobbered on the next provision.
### Swap procedure

- In the customer's fork, edit the `default: true` entry in `accelerator.yaml → models[]`:
  - `model` — OpenAI model name
  - `version` — model version string
  - `deployment_name` — Foundry deployment resource name
  - `capacity` — TPM in thousands
- Confirm the target region has quota for the new model (Azure portal → Foundry account → Quotas).
- (Optional) Update `MODEL_PRICE_USD_PER_1K_TOKENS` in `cost.py` so the cost gate isn't inert for the new model.
- `azd provision` — preprovision syncs env vars, Bicep creates the new deployment; on the Container App's next startup `src/bootstrap.py` re-verifies the agents against the new model.
- Re-run quality + redteam evals (Section 5) and `enforce-acceptance.py`.
- If acceptance holds, merge; if not, revert the manifest PR.
To run two models side-by-side (canary), add a second entry (not
default: true) and set its slug — scenario agents can point at it
via the scenario.agents[].model field.
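A hedged sketch of what a canary entry could look like, using only the fields this runbook names; the exact schema (including the `slug` key) is owned by `accelerator.yaml`:

```yaml
models:
  - model: gpt-5-mini             # the default: true entry (primary)
    version: "2025-01-01"         # placeholder version string
    deployment_name: chat-primary
    capacity: 50                  # TPM in thousands
    default: true
  - model: gpt-5.2                # canary — note: no default: true
    version: "2025-01-01"
    deployment_name: chat-canary
    capacity: 10
    slug: canary                  # the slug scenario.agents[].model points at — key name assumed
```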
Do not swap models in production without re-running evals. A model that benchmarks equivalently often shifts quality / cost / latency in the specific scenario.
## 7. Prompt / agent-instruction rollback
Agent instructions are stored in Foundry, but their repo-side source
of truth is docs/agent-specs/<foundry_name>.md. Every azd
up triggers a Container App revision restart that runs src/bootstrap.py as the FastAPI startup
hook, which overwrites each agent's portal-side instructions from the
matching spec file. Consequences:
- Direct Foundry portal edits to an agent's instructions are transient — they will be reverted on the next `azd provision`. The portal is the runtime source of truth between provisions, but not a durable authoring surface.
- The supported rollback path: revert the spec file in the customer's fork → `azd provision` → re-run evals.
If the partner's handover packet documents a different authoring
workflow (e.g., "prompts are portal-managed for this engagement and
src/bootstrap.py is disabled" via BOOTSTRAP_SKIP=1), follow the packet.
## 8. Secret rotation

### Key Vault
The shipped accelerator does not read secrets from Key Vault at
runtime. src/config/settings.py loads configuration from env vars
only (AZURE_AI_FOUNDRY_ENDPOINT, AZURE_AI_SEARCH_ENDPOINT,
APPLICATIONINSIGHTS_CONNECTION_STRING, HITL_APPROVER_ENDPOINT,
etc.); Bicep sets those values on the Container App as plain env
values from provisioning-time outputs. The Key Vault is provisioned
and MI-accessible, but no code path fetches secrets from it by
default.
If the partner wired a Key Vault-backed secret (either via Container
Apps secret references in infra/modules/container-app.bicep, or via
SecretClient / DefaultAzureCredential added to the scenario
code), rotation follows the standard Azure pattern for whichever path
they used. The partner's handover packet describes this if
applicable. Without such wiring, there are no runtime secrets to
rotate on the accelerator itself — rotation applies to partner-added
integrations.
### Managed Identity
Managed identities do not rotate — assignments are stable. Rotating
the MI itself is a full re-provision (azd down --purge + azd
up); new MI principal ID flows through Bicep role assignments.
### Partner-provided external secrets
If the partner wired an HITL approver with signed webhooks, a Bing Grounding resource, or any other external service, secret rotation for those follows the partner's runbook, not this one.
## 9. Incident playbook

### P1 — bad output or unsafe tool behavior in production
- Flip the killswitch (Section 3). Side-effect tools halt immediately; read-only paths keep working so in-flight sessions don't error.
- In App Insights, query `traces` for `message in ('tool.executed','tool.hitl_approved','tool.hitl_rejected','tool.hitl_misconfigured','response.returned')` and `tostring(customDimensions.ok) == 'false'` over the incident window. Correlate to a specific agent / tool (a starting query is sketched below).
- If a prompt regression is suspected, do not trust Foundry portal history as a durable record — check `docs/agent-specs/<foundry_name>.md` in the fork's git history. Revert the spec file + `azd provision` to roll back.
- If a code regression is suspected, revert the offending commit in the fork + `azd deploy`.
- Re-run evals (Section 5). Disengage the killswitch only when they pass.
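A starting query for that correlation; the `agent` / `tool` attribute names in `customDimensions` are assumptions — verify against the emitters in `src/accelerator_baseline/telemetry.py`:

```kusto
traces
| where timestamp > ago(2h)        // adjust to the incident window
| where message in ('tool.executed', 'tool.hitl_approved', 'tool.hitl_rejected',
                    'tool.hitl_misconfigured', 'response.returned')
| where message == 'tool.hitl_misconfigured'
     or tostring(customDimensions.ok) == 'false'
| summarize hits = count()
        by message,
           agent = tostring(customDimensions.agent),  // attribute name assumed
           tool  = tostring(customDimensions.tool)    // attribute name assumed
| order by hits desc
```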
### P1 — model outage
Manifests as elevated response.returned with ok == 'false' and
error strings mentioning Foundry. Confirm via Azure Service Health.
Mitigation: swap to a backup model (Section 6), re-run evals, deploy. Revert when the primary region recovers. This is a partner-coordinated change if they own the fork's release process.
### P1 — DNS / private-endpoint failure (Tier 3 only)
Tier 3 resolves Foundry / Search / Key Vault through hub-provided
private DNS zones. If the hub team removes or re-keys a zone link,
the Container App hits connection or DNS errors. Symptoms: sudden
burst of response.returned errors with transport-level messages.
Mitigation: the platform / networking team that owns the hub restores
the zone link. tier3InputGuard only fires at provision time — it
does not detect post-deploy drift, so this is a shared-ownership
incident.
### P2 — cost regression

Cost alerts fire (Section 2). Likely causes:

- Model swap without refreshing `MODEL_PRICE_USD_PER_1K_TOKENS`
- Prompt regression inflating output tokens
- Usage-pattern shift that increases retrieval frequency
Remediate by rolling back whichever change correlates. Do not
"fix" a cost regression by relaxing cost_per_call_usd.
### Security incidents
Follow SECURITY.md — vulnerabilities go to MSRC; customer-side
security incidents follow the customer's own IR process.
## 10. Scaling

### Container App
infra/modules/container-app.bicep ships minReplicas: 1,
maxReplicas: 3 on the Consumption profile. Tune those values in
the fork and azd provision. Changing autoscale rules (CPU-based,
queue-based, cron-based) is a Bicep edit.
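For an incident-time scale change, the same caveat as the killswitch applies — `az containerapp update` edits are clobbered on the next `azd provision`, so the durable change is the Bicep edit:

```bash
# Temporary scale change (reverted by the next azd provision —
# make the durable change in infra/modules/container-app.bicep)
az containerapp update \
  --name <api-container-app-name> \
  --resource-group <rg> \
  --min-replicas 1 \
  --max-replicas 5
```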
### Model capacity
Capacity is a regional TPM quota on the Foundry account.
Increase via accelerator.yaml.models[].capacity (Section 6). If the
region is at quota, request an increase in Azure portal → Foundry →
Quotas, or add a second deployment in another region (requires
partner-authored routing — not shipped in the flagship).
### AI Search

Change the SKU in `infra/modules/ai-search.bicep` (and `replicaCount` / `partitionCount` if tuned in the fork) then `azd provision`. Some SKU transitions require re-indexing; confirm against `src/bootstrap.py` (which seeds the index at startup) before going live.
## 11. Re-provisioning and rollback

### azd provision

Idempotent. Re-applies `infra/main.bicep`, then runs the preprovision hook and the in-app FastAPI startup bootstrap (`src/bootstrap.py`). It does touch:

- Foundry agent instructions (overwritten from `docs/agent-specs/`)
- AI Search index schema and, if the index is empty or the seed script is written to re-seed, the index contents
Plan azd provision windows accordingly — it is not purely an
infra-plane operation.
### azd deploy
Pushes a new Container App image only. Does not touch Foundry, Search, or Key Vault.
### azd down --purge
Destructive — tears down every resource in the environment. Use only when decommissioning or recovering from a corrupted environment.
### Rolling back code
git revert in the fork + azd deploy. Prompt/spec changes
require azd provision (see Section 7).
## 12. Monthly value review
Pull numbers from App Insights directly. accelerator.yaml.kpis[]
enumerates the KPIs the engagement committed to in discovery; each
entry has {name, type, baseline, target} only — it does not
auto-wire telemetry events. Partners wire specific events per KPI
when they implement the scenario. Confirm the mapping with the
partner and reuse the workbook panels (Section 2) plus custom KQL for
scenario-specific KPIs.
At the review, compare 30-day trends to the baseline and target in
docs/discovery/solution-brief.md. If a KPI drifts from target,
trigger a partner engagement for prompt / retrieval / tool tuning —
that work is out of day-2 scope.
## 13. Out of scope for this runbook
- Editing the upstream accelerator template itself
- Scaffolding a new scenario
- Partner-authored custom integrations (the partner's runbook owns those)
- Changes to the customer's landing zone, hub networking, or Azure policy outside this engagement's resource group
- End-user training material
For anything not covered: escalate to the partner delivery team, or file a GitHub Issue on the template repo for upstream fixes.
← Back to the partner walkthrough
This page is the deep day-2 runbook. The walkthrough version with the most common loops lives at 10. Operate (Day 2). The engagement-specific handover packet supersedes both for any one customer.