10. Operate (Day 2)
Step 10 of 10 · Deliver to a customer
Step at a glance
Goal: Monthly KPI review, alert tuning, drift checks, and regression evals against main. The accelerator runs in production; this step is what keeps it healthy.
Prerequisite: 9. UAT & handover complete; customer ops has the packet and production is live.
Where you'll work: App Insights + GitHub Actions (scheduled evals) + the customer's PR review surface.
Done when: First monthly KPI review held; first alert tuned; first regression-eval run green on main. After that, this is a recurring loop, not a one-shot step.
This page is the generic Day-2 reference. The engagement-specific handover packet supersedes it for any customer that has one (see 9. UAT & handover).
What runs on its own
After `azd up`, the accelerator emits telemetry and enforces its gates without partner intervention:
- Telemetry: every typed event declared in `src/accelerator_baseline/telemetry.py` flows into App Insights via OpenTelemetry. KPI events are dashboard-wired.
- Content filters: the Bicep-attached `accelerator-default-policy` blocks Medium+ on Hate / Sexual / Violence / Self-harm. Drift in the portal is overwritten on the next `azd provision`.
- Post-deploy regression evals: `.github/workflows/post-deploy-eval.yml` runs the quality + redteam suites on the deployed environment after every merge to `main`.
- HITL gates: every side-effect tool routes through `checkpoint(...)`. Failure to reach `HITL_APPROVER_ENDPOINT` is fail-closed.
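
The fail-closed behaviour of the HITL gate can be illustrated with a minimal sketch. This is not the accelerator's `checkpoint(...)` implementation; `checkpoint_sketch`, `Decision`, and the `ask_approver` callable are hypothetical names introduced only to show the idea that an unreachable approver results in a denial rather than a silent pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str


def checkpoint_sketch(action: str, ask_approver: Callable[[str], bool]) -> Decision:
    """Hypothetical stand-in for checkpoint(...): deny unless an approver says yes."""
    try:
        approved = ask_approver(action)
    except Exception as exc:  # approver unreachable, timed out, or errored
        # Fail closed: no answer means the side effect does not run.
        return Decision(False, f"approver unreachable: {exc}")
    if approved is not True:
        return Decision(False, "approver rejected")
    return Decision(True, "approver approved")


def unreachable_approver(action: str) -> bool:
    # Simulates HITL_APPROVER_ENDPOINT being down (assumed failure mode).
    raise ConnectionError("HITL_APPROVER_ENDPOINT did not respond")


if __name__ == "__main__":
    print(checkpoint_sketch("send_refund", unreachable_approver))
    print(checkpoint_sketch("send_refund", lambda action: True))
```

The design point is that every path other than an explicit approval returns a denial, so an outage of the approver cannot turn into an unreviewed side effect.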
What customer ops owns
| Loop | Cadence | What |
|---|---|---|
| KPI review | Monthly | Pull dashboard panels declared in accelerator.yaml -> kpis. Compare against the brief's hypothesis numbers and the prior month. Flag drift to the partner team. |
| Alert tuning | As needed | Latency, error rate, eval-suite drift. Adjust thresholds based on observed baselines after the first 30 days. |
| Regression evals | Per release + nightly | Confirm evals/quality/ and evals/redteam/ are green on main against the production API URL. |
| Secret rotation | Per partner-practice schedule | AZURE_CLIENT_ID federated credential (Entra); HITL_APPROVER_ENDPOINT if the approver moves. |
| Model swap | When a new model is qualified | Edit accelerator.yaml -> models[], re-run azd up. The lint rules models_block_shape + agent_model_refs_exist block malformed manifests at PR time. |
| Killswitch drills | Quarterly | Practice flipping KILLSWITCH=1 in a non-prod environment; confirm the API returns 503 cleanly and the alert fires. |
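
The KPI review and alert-tuning loops in the table come down to two small calculations, sketched below under stated assumptions. The function names, the 10% tolerance, the 95th percentile, and the 1.2 headroom factor are all illustrative choices, not values taken from the accelerator; the KPI numbers would come from the dashboard panels and the customer's brief.

```python
def flag_kpi_drift(kpis, tolerance=0.10):
    """Flag KPIs whose current value moved more than `tolerance` (relative)
    from the brief's hypothesis number or from the prior month.

    kpis: {name: {"hypothesis": float, "prior": float, "current": float}}
    """
    flagged = []
    for name, values in kpis.items():
        for baseline in ("hypothesis", "prior"):
            reference = values[baseline]
            if reference == 0:
                continue  # relative change is undefined against zero
            change = (values["current"] - reference) / abs(reference)
            if abs(change) > tolerance:
                flagged.append((name, baseline, round(change, 3)))
    return flagged


def suggest_threshold(samples, percentile=95, headroom=1.2):
    """Suggest an alert threshold from an observed baseline (e.g. 30 days of
    latency samples): nearest-rank percentile times a headroom factor."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of percentile * n / 100 (nearest-rank method).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1] * headroom


if __name__ == "__main__":
    month = {"deflection_rate": {"hypothesis": 0.40, "prior": 0.38, "current": 0.30}}
    print(flag_kpi_drift(month))
    print(suggest_threshold(list(range(1, 101))))
```

Anything `flag_kpi_drift` returns is a candidate to raise with the partner team; the suggested threshold is only a starting point to adjust once the first 30 days of baselines are in.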
When something breaks
- Open App Insights. Filter on `severityLevel >= 3` for the failing time window.
- Find the trace. Each end-to-end request emits a trace with the supervisor decision record + every worker invocation + every tool call (with HITL outcome).
- Check the lint + eval status on `main`. If the post-deploy regression suite is red, that's where the regression entered.
- Roll back if needed. Run `azd deploy` against a tagged commit; document it in the packet's rollback section.
- File a PR with the fix. PR-gated CI (lint + quality evals + redteam) blocks merge until green.
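
The first two triage steps can also be run offline against rows exported from App Insights. The sketch below assumes each row is a dict carrying the `timestamp`, `severityLevel`, `operation_Id`, and `message` columns of the traces table, with timestamps as ISO 8601 strings with an explicit offset; the function name and the sample rows are hypothetical. In App Insights, severity level 3 is Error and 4 is Critical.

```python
from datetime import datetime, timezone


def failing_operations(rows, start, end, min_severity=3):
    """Group messages at or above `min_severity` inside [start, end]
    by operation_Id, so each failing end-to-end request can be opened."""
    operations = {}
    for row in rows:
        timestamp = datetime.fromisoformat(row["timestamp"])
        if start <= timestamp <= end and row["severityLevel"] >= min_severity:
            operations.setdefault(row["operation_Id"], []).append(row["message"])
    return operations


if __name__ == "__main__":
    # Hypothetical exported rows, for illustration only.
    sample_rows = [
        {"timestamp": "2024-05-01T10:00:00+00:00", "severityLevel": 1,
         "operation_Id": "op-1", "message": "request started"},
        {"timestamp": "2024-05-01T10:05:00+00:00", "severityLevel": 3,
         "operation_Id": "op-2", "message": "tool call failed"},
        {"timestamp": "2024-05-01T12:00:00+00:00", "severityLevel": 4,
         "operation_Id": "op-3", "message": "outside the window"},
    ]
    window_start = datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc)
    window_end = datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc)
    print(failing_operations(sample_rows, window_start, window_end))
```

Each key in the result is an operation to look up as a full trace, which is where the supervisor decision record and the tool calls for that request appear.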
Looping back
When the customer asks for a new capability:
- Small additions (a new tool, a new worker, a model swap): back to 8. Iterate & evaluate.
- A new business scenario: back to 5. Discover with the customer for that scenario, then through scaffold → provision → iterate → UAT → handover. Multiple scenarios coexist under `src/scenarios/<id>/`.
The detailed runbook content (model swap procedures, secret rotation walkthroughs, killswitch drills, full alert reference) lives in the legacy customer runbook, which remains the deep reference under Reference → Delivery context.
End of walkthrough. For the next engagement, Track 1 (Get ready) stays done; return directly to 4. Clone for the customer and run Track 2 with the new customer's short-name.