8. Iterate & evaluate¶
Step 8 of 10 · Deliver to a customer
Step at a glance
🎯 Goal – Customise prompts, tools, and retrieval; grow the eval suite; ship through PR-gated CI until acceptance thresholds from accelerator.yaml are green and KPI events are emitting in App Insights.
📋 Prerequisite – 7. Provision the customer's Azure complete: /healthz returns 200; API URL captured.
💻 Where you'll work – VS Code (Copilot Chat for the agent edits, integrated terminal for git push); GitHub web (PRs + Actions runs).
✅ Done when – Quality evals ≥ acceptance thresholds in accelerator.yaml; redteam green; lint green; KPI events emitting in App Insights against real traffic.
Chatmodes used here
/add-tool · /add-worker-agent · /explain-change · /switch-to-variant
Full reference: Chatmodes overview.
What success looks like
python scripts/enforce-acceptance.py against the customer environment finishes with:
```text
✅ All acceptance thresholds met for env=<customer>-dev
quality        0.92 ≥ 0.85
groundedness   0.96 ≥ 0.90
safety         1.00 ≥ 1.00
latency_p95    2.1s ≤ 3.0s
cost_per_call  $0.018 ≤ $0.025
```
Your PR's GitHub Actions tab shows four green checks: accelerator-lint · evals/quality · evals/redteam · build.
App Insights → Workbooks → ROI KPIs panels are populated against real /research/stream traffic (no empty cards).
Establish the acceptance baseline first¶
Before any custom changes, run the acceptance chain once against the freshly deployed flagship. Those numbers are the engagement's known-good starting point: every subsequent PR has to clear this same bar.
```shell
# Replace <api-url> with the URL azd up printed in step 7
python evals/quality/run.py --api-url <api-url>
python evals/redteam/run.py --api-url <api-url>
python scripts/enforce-acceptance.py
```
enforce-acceptance.py reports pass/fail against every threshold in accelerator.yaml.acceptance (quality, groundedness, safety, P50/P95 latency, cost per call). If a threshold fails on the unmodified flagship, fix the deploy first (quotas, model region, or grounding seed are the usual culprits) before authoring scenario-specific changes.
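For orientation, the acceptance block those checks read could look like the following sketch. The key names here are assumptions, not the real schema; the accelerator.yaml in the repo is authoritative. The values mirror the thresholds in the sample output above.

```yaml
# Hypothetical shape of the acceptance block in accelerator.yaml.
# Key names are illustrative; the repo's manifest defines the real schema.
acceptance:
  quality: 0.85
  groundedness: 0.90
  safety: 1.00
  latency_p95_seconds: 3.0
  cost_per_call_usd: 0.025
```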
Capture the output (a screenshot, or redirect it with > baseline.txt in the customer fork) so the team has a reference when a later PR moves a number.
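With a saved baseline, spotting regressions in later runs can be reduced to a per-metric diff. The helper below is a hypothetical sketch, not part of the accelerator; it assumes each threshold line of the enforce-acceptance output starts with `<metric> <value>`, as in the sample output above.

```python
# Hypothetical helper: diff two enforce-acceptance style reports metric by metric.
# Assumes each threshold line starts with "<metric> <value>", e.g. "quality 0.92 >= 0.85".
import re

METRIC_LINE = re.compile(r"^\s*([a-z_]+)\s+\$?([\d.]+)")

def parse_metrics(report: str) -> dict[str, float]:
    """Extract {metric: value} pairs from an enforce-acceptance style report."""
    metrics = {}
    for line in report.splitlines():
        m = METRIC_LINE.match(line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

def diff_metrics(baseline: str, current: str) -> dict[str, float]:
    """Return per-metric deltas (current - baseline) for metrics present in both."""
    base, cur = parse_metrics(baseline), parse_metrics(current)
    return {k: round(cur[k] - base[k], 4) for k in base if k in cur}

baseline = "quality 0.92 >= 0.85\ngroundedness 0.96 >= 0.90\n"
current = "quality 0.89 >= 0.85\ngroundedness 0.97 >= 0.90\n"
print(diff_metrics(baseline, current))  # quality dropped, groundedness nudged up
```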
Iterate with Copilot¶
In VS Code, just talk to Copilot:
"Add a tool to create a ticket in ServiceNow; it should require HITL for anything with priority high."
Copilot follows .github/copilot-instructions.md: it creates src/tools/servicenow_ticket.py with HITL scaffolding, registers it on the right worker agent, adds a unit test, and adds a redteam case.
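What Copilot generates is project-specific, but the shape of such a tool might resemble this sketch. The function name, `ToolResult` contract, and fields here are illustrations, not the accelerator's real API.

```python
# Hypothetical sketch of an HITL-gated tool, loosely matching the prompt above.
# Names and the return contract are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolResult:
    hitl_required: bool  # True => route to the human approval surface first
    payload: dict

def create_servicenow_ticket(summary: str, priority: str) -> ToolResult:
    """Draft a ServiceNow ticket; high-priority tickets wait for human approval."""
    ticket = {"summary": summary, "priority": priority, "status": "draft"}
    # Per the requirement in the prompt: priority "high" is HITL-gated.
    return ToolResult(hitl_required=(priority == "high"), payload=ticket)

print(create_servicenow_ticket("VPN outage", "high").hitl_required)  # True
```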
For agent edits, edit the spec markdown:
…then azd provision (or the next azd up) syncs the spec to Foundry.
Never edit instructions in the Foundry portal
bootstrap.py overwrites portal drift on next start. Edit docs/agent-specs/<agent>.md and re-provision instead.
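Concretely, a spec edit happens entirely in the markdown file. The agent id and headings below are hypothetical; mirror the structure of the existing files in docs/agent-specs/.

```markdown
<!-- docs/agent-specs/research_summarizer.md (hypothetical example) -->
# research_summarizer

## Instructions
Summarise retrieved sources in under 200 words.
Cite the source document id for every claim.
```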
For new specialist workers, use the scaffolder:
```shell
python scripts/scaffold-agent.py <agent_id> --scenario <scenario-id> \
  --capability "<one-sentence capability>" [--depends-on a,b]
```
The scaffolder appends to the declarative WORKERS registry in src/scenarios/<id>/workflow.py, creates the three-layer files (prompt.py, transform.py, validate.py), and writes a Foundry agent spec stub. It is transactional and re-run safe.
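The registry entry the scaffolder appends might look roughly like this. It is a sketch under assumptions: the field names are invented for illustration, and the real WORKERS schema in src/scenarios/<id>/workflow.py is authoritative.

```python
# Hypothetical shape of a declarative WORKERS registry entry, roughly as
# scripts/scaffold-agent.py might append it. Field names are assumptions.
WORKERS = {
    "market_sizing": {
        "capability": "Estimate total addressable market for a product line.",
        "depends_on": ["data_collector"],  # workers that must run first
        "layers": {  # the three-layer files the scaffolder creates
            "prompt": "market_sizing/prompt.py",
            "transform": "market_sizing/transform.py",
            "validate": "market_sizing/validate.py",
        },
    },
}

print(sorted(WORKERS))  # -> ['market_sizing']
```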
Ship through CI¶
The PR triggers four gates:
- scripts/accelerator-lint.py – 30 deterministic rules.
- evals/quality/ – must clear thresholds in accelerator.yaml -> acceptance.
- evals/redteam/ – XPIA + jailbreak must pass; new tools trigger new cases.
- build + type check – ruff + pyright.
Any red light blocks merge. Green = azd deploy against the customer environment via the GitHub Environment registered in step 7.
Watch the dashboard¶
Open Azure portal → the customer's resource group → Application Insights → Workbooks. The KPI events declared in accelerator.yaml -> kpis are pre-wired to dashboard panels (infra/dashboards/roi-kpis.json). Send real traffic against /research/stream (or your scenario's endpoint) and confirm the panels light up.
If a panel stays empty, check that src/accelerator_baseline/telemetry.py actually emits the event name declared in the manifest. The lint rule kpis_emitted_in_code catches missing emitters at PR time.
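The emitter check is easy to reason about with a simplified version of that rule in mind. The sketch below is a stand-in, not the accelerator's actual lint implementation:

```python
# Simplified sketch of a kpis_emitted_in_code-style check: every KPI event
# declared in the manifest must appear somewhere in the telemetry source.
def missing_emitters(declared_kpis: list[str], telemetry_source: str) -> list[str]:
    """Return KPI event names that are never referenced in the code."""
    return [k for k in declared_kpis if k not in telemetry_source]

telemetry_py = 'track_event("research_completed", props)\n'
print(missing_emitters(["research_completed", "tickets_created"], telemetry_py))
# -> ['tickets_created']: a panel for this KPI would stay empty
```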
Optional β ship a UI for UAT demos¶
The shipped API is SSE-only. Many partner teams stand up a quick reference UI for UAT walkthroughs:
```shell
cd patterns/sales-research-frontend
npm install
npm run dev
# or `swa deploy` to Azure Static Web Apps
```
The starter is a minimal React + Vite + TypeScript SSE consumer – reference material, not a finished product. The customer's real UX is the partner's value-add. Before customer-facing UI ships, also wire end-user auth (Easy Auth / App Gateway / Front Door), state persistence (Cosmos / Postgres / Redis), and the HITL approval surface (whichever of Logic Apps / Teams / ServiceNow HITL_APPROVER_ENDPOINT resolves to). The full ownership boundary lives in Reference → Delivery context → Partner playbook.
→ Reference → Frontend starter
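Since the API is SSE-only, any client (the starter UI included) ultimately splits `data:` lines off the stream. A minimal Python sketch of that parsing step follows; the event payloads are hypothetical, so match them to what /research/stream actually emits.

```python
# Minimal SSE frame parser: collects the data lines of each event.
# Payloads are illustrative; check the API's real stream format.
def parse_sse(stream_text: str) -> list[str]:
    """Split a raw text/event-stream body into one data payload per event."""
    events, buffer = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            buffer.append(line[5:].strip())
        elif line == "" and buffer:  # blank line terminates an event
            events.append("\n".join(buffer))
            buffer = []
    if buffer:  # stream ended without a trailing blank line
        events.append("\n".join(buffer))
    return events

raw = 'data: {"step": "search"}\n\ndata: {"step": "summarise"}\n\n'
print(parse_sse(raw))  # -> ['{"step": "search"}', '{"step": "summarise"}']
```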
Need a different shape?¶
The variants are manual re-authoring walkthroughs (documented in patterns/<variant>/README.md), not drop-in packages:
…in Copilot Chat – pick single-agent (no supervisor) or chat-with-actioning (conversational front-end). For a different business scenario, see Reference → Reference scenarios → Customer service actioning or RFP response for full walkthroughs.
Continue → 9. UAT & handover