8. Iterate & evaluate¶
Step 8 of 10 · Deliver to a customer
Step at a glance
🎯 Goal – Customise prompts, tools, and retrieval; grow the eval suite; ship through PR-gated CI until acceptance thresholds from accelerator.yaml are green and KPI events are emitting in App Insights.
📋 Prerequisite – 7. Provision the customer's Azure complete: /healthz returns 200; API URL captured.
💻 Where you'll work – VS Code (Copilot Chat for the agent edits, integrated terminal for git push); GitHub web (PRs + Actions runs).
✅ Done when – Quality evals ≥ acceptance thresholds in accelerator.yaml; redteam green; lint green; KPI events emitting in App Insights against real traffic.
Chatmodes used here
/add-tool · /add-worker-agent · /explain-change · /switch-to-variant
Full reference: Chatmodes overview.
What success looks like
python scripts/enforce-acceptance.py against the customer environment finishes with:
```text
✅ All acceptance thresholds met for env=<customer>-dev
quality        0.92 ≥ 0.85
groundedness   0.96 ≥ 0.90
safety         1.00 ≥ 1.00
latency_p95    2.1s ≤ 3.0s
cost_per_call  $0.018 ≤ $0.025
```
Your PR's GitHub Actions tab shows four green checks: accelerator-lint · evals/quality · evals/redteam · build.
App Insights → Workbooks → ROI KPIs panels are populated against real /research/stream traffic (no empty cards).
Establish the acceptance baseline first¶
Before any custom changes, run the acceptance chain once against the freshly deployed flagship. Those numbers are the engagement's known-good starting point: every subsequent PR has to clear this same bar.
```shell
# Replace <api-url> with the URL azd up printed in step 7
python evals/quality/run.py --api-url <api-url>
python evals/redteam/run.py --api-url <api-url>
python scripts/enforce-acceptance.py
```
enforce-acceptance.py reports pass/fail against every threshold in accelerator.yaml.acceptance (quality, groundedness, safety, P50/P95 latency, cost per call). If a threshold fails on the unmodified flagship, fix the deploy first (quotas, model region, or grounding seed are the usual culprits) before authoring scenario-specific changes.
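For orientation, the acceptance block those checks read could look like the following sketch. The key names here are assumptions, not the real schema; the accelerator.yaml in the repo is authoritative. The values mirror the thresholds in the sample output above.

```yaml
# Hypothetical shape of the acceptance block in accelerator.yaml.
# Key names are illustrative; the repo's manifest defines the real schema.
acceptance:
  quality: 0.85
  groundedness: 0.90
  safety: 1.00
  latency_p95_seconds: 3.0
  cost_per_call_usd: 0.025
```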
Capture the output (a screenshot, or redirect it with > baseline.txt in the customer fork) so the team has a reference when a later PR moves a number.
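With a saved baseline, spotting regressions in later runs can be reduced to a per-metric diff. The helper below is a hypothetical sketch, not part of the accelerator; it assumes each threshold line of the enforce-acceptance output starts with `<metric> <value>`, as in the sample output above.

```python
# Hypothetical helper: diff two enforce-acceptance style reports metric by metric.
# Assumes each threshold line starts with "<metric> <value>", e.g. "quality 0.92 >= 0.85".
import re

METRIC_LINE = re.compile(r"^\s*([a-z_]+)\s+\$?([\d.]+)")

def parse_metrics(report: str) -> dict[str, float]:
    """Extract {metric: value} pairs from an enforce-acceptance style report."""
    metrics = {}
    for line in report.splitlines():
        m = METRIC_LINE.match(line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

def diff_metrics(baseline: str, current: str) -> dict[str, float]:
    """Return per-metric deltas (current - baseline) for metrics present in both."""
    base, cur = parse_metrics(baseline), parse_metrics(current)
    return {k: round(cur[k] - base[k], 4) for k in base if k in cur}

baseline = "quality 0.92 >= 0.85\ngroundedness 0.96 >= 0.90\n"
current = "quality 0.89 >= 0.85\ngroundedness 0.97 >= 0.90\n"
print(diff_metrics(baseline, current))  # quality dropped, groundedness nudged up
```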
Iterate with Copilot¶
In VS Code, just talk to Copilot:
"Add a tool to create a ticket in ServiceNow; it should require HITL for anything with priority high."
Copilot follows .github/copilot-instructions.md: it creates src/tools/servicenow_ticket.py with HITL scaffolding, registers it on the right worker agent, adds a unit test, and adds a redteam case.
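What Copilot generates is project-specific, but the shape of such a tool might resemble this sketch. The function name, `ToolResult` contract, and fields here are illustrations, not the accelerator's real API.

```python
# Hypothetical sketch of an HITL-gated tool, loosely matching the prompt above.
# Names and the return contract are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolResult:
    hitl_required: bool  # True => route to the human approval surface first
    payload: dict

def create_servicenow_ticket(summary: str, priority: str) -> ToolResult:
    """Draft a ServiceNow ticket; high-priority tickets wait for human approval."""
    ticket = {"summary": summary, "priority": priority, "status": "draft"}
    # Per the requirement in the prompt: priority "high" is HITL-gated.
    return ToolResult(hitl_required=(priority == "high"), payload=ticket)

print(create_servicenow_ticket("VPN outage", "high").hitl_required)  # True
```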
For agent edits, edit the spec markdown:
…then azd provision (or the next azd up) syncs the spec to Foundry.
Never edit instructions in the Foundry portal
bootstrap.py overwrites portal drift on next start. Edit docs/agent-specs/<agent>.md and re-provision instead.
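Concretely, a spec edit happens entirely in the markdown file. The agent id and headings below are hypothetical; mirror the structure of the existing files in docs/agent-specs/.

```markdown
<!-- docs/agent-specs/research_summarizer.md (hypothetical example) -->
# research_summarizer

## Instructions
Summarise retrieved sources in under 200 words.
Cite the source document id for every claim.
```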
For new specialist workers, use the scaffolder:
```shell
python scripts/scaffold-agent.py <agent_id> --scenario <scenario-id> \
  --capability "<one-sentence capability>" [--depends-on a,b]
```
The scaffolder appends to the declarative WORKERS registry in src/scenarios/<id>/workflow.py, creates the three-layer files (prompt.py, transform.py, validate.py), and writes a Foundry agent spec stub. It is transactional and re-run safe.
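The registry entry the scaffolder appends might look roughly like this. It is a sketch under assumptions: the field names are invented for illustration, and the real WORKERS schema in src/scenarios/<id>/workflow.py is authoritative.

```python
# Hypothetical shape of a declarative WORKERS registry entry, roughly as
# scripts/scaffold-agent.py might append it. Field names are assumptions.
WORKERS = {
    "market_sizing": {
        "capability": "Estimate total addressable market for a product line.",
        "depends_on": ["data_collector"],  # workers that must run first
        "layers": {  # the three-layer files the scaffolder creates
            "prompt": "market_sizing/prompt.py",
            "transform": "market_sizing/transform.py",
            "validate": "market_sizing/validate.py",
        },
    },
}

print(sorted(WORKERS))  # -> ['market_sizing']
```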
Ship through CI¶
The PR triggers four gates:
- scripts/accelerator-lint.py – 30 deterministic rules.
- evals/quality/ – must clear thresholds in accelerator.yaml -> acceptance.
- evals/redteam/ – XPIA + jailbreak must pass; new tools trigger new cases.
- build + type check – ruff + pyright.
Any red light blocks merge. Green = azd deploy against the customer environment via the GitHub Environment registered in step 7.
Watch the dashboard¶
Open Azure portal → the customer's resource group → Application Insights → Workbooks. The KPI events declared in accelerator.yaml -> kpis are pre-wired to dashboard panels (infra/dashboards/roi-kpis.json). Send real traffic against /research/stream (or your scenario's endpoint) and confirm the panels light up.
If a panel stays empty, check that src/accelerator_baseline/telemetry.py actually emits the event name declared in the manifest. The lint rule kpis_emitted_in_code catches missing emitters at PR time.
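The emitter check is easy to reason about with a simplified version of that rule in mind. The sketch below is a stand-in, not the accelerator's actual lint implementation:

```python
# Simplified sketch of a kpis_emitted_in_code-style check: every KPI event
# declared in the manifest must appear somewhere in the telemetry source.
def missing_emitters(declared_kpis: list[str], telemetry_source: str) -> list[str]:
    """Return KPI event names that are never referenced in the code."""
    return [k for k in declared_kpis if k not in telemetry_source]

telemetry_py = 'track_event("research_completed", props)\n'
print(missing_emitters(["research_completed", "tickets_created"], telemetry_py))
# -> ['tickets_created']: a panel for this KPI would stay empty
```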
Optional β ship a UI for UAT demos¶
The shipped API is SSE-only. Many partner teams stand up a quick reference UI for UAT walkthroughs:
```shell
cd patterns/sales-research-frontend
npm install
npm run dev
# or `swa deploy` to Azure Static Web Apps
```
The starter is a minimal React + Vite + TypeScript SSE consumer – reference material, not a finished product. The customer's real UX is the partner's value-add. Before customer-facing UI ships, also wire end-user auth (Easy Auth / App Gateway / Front Door), state persistence (Cosmos / Postgres / Redis), and the HITL approval surface (whichever of Logic Apps / Teams / ServiceNow HITL_APPROVER_ENDPOINT resolves to). The full ownership boundary lives in Reference → Delivery context → Partner playbook.
→ Reference → Frontend starter
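Since the API is SSE-only, any client (the starter UI included) ultimately splits `data:` lines off the stream. A minimal Python sketch of that parsing step follows; the event payloads are hypothetical, so match them to what /research/stream actually emits.

```python
# Minimal SSE frame parser: collects the data lines of each event.
# Payloads are illustrative; check the API's real stream format.
def parse_sse(stream_text: str) -> list[str]:
    """Split a raw text/event-stream body into one data payload per event."""
    events, buffer = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            buffer.append(line[5:].strip())
        elif line == "" and buffer:  # blank line terminates an event
            events.append("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        events.append("\n".join(buffer))
    return events

raw = 'data: {"step": "search"}\n\ndata: {"step": "summarise"}\n\n'
print(parse_sse(raw))  # -> ['{"step": "search"}', '{"step": "summarise"}']
```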
Need a different shape?¶
The variants are manual re-authoring walkthroughs (documented in patterns/<variant>/README.md), not drop-in packages:
…in Copilot Chat – pick single-agent (no supervisor) or chat-with-actioning (conversational front-end). For a different business scenario, see Reference → Reference scenarios → Customer service actioning or RFP response for full walkthroughs.
Continue → 9. UAT & handover