📄️ Create Azure resources
Here we will create the Azure resources that we will use throughout this section.
📄️ Control cost and performance with token quotas and limits
Once you bring an LLM to production and expose it as an API endpoint, you need to consider how you "manage" such an API. There are many considerations, everything from caching and scaling to error management, rate limiting, monitoring, and more.
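As a concrete illustration, Azure API Management provides a token-limit policy for Azure OpenAI endpoints. The sketch below shows a minimal inbound policy enforcing a per-subscription tokens-per-minute quota; the limit value and header name here are illustrative, not prescriptive:

```xml
<policies>
    <inbound>
        <base />
        <!-- Cap each subscription at 5000 tokens per minute
             (the number and the header name below are illustrative) -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="true"
            remaining-tokens-header-name="x-remaining-tokens" />
    </inbound>
</policies>
```

Callers that exceed the quota receive a 429 response, and the remaining-tokens header lets well-behaved clients back off before hitting the limit.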
📄️ Keep visibility into AI consumption with model monitoring
In this lesson we will use Azure API Management and show how, by adding one of its policies to an LLM endpoint, you can monitor token usage.
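For instance, API Management's emit-token-metric policy records prompt, completion, and total token counts as custom metrics. A minimal sketch, where the namespace and the choice of dimensions are illustrative:

```xml
<policies>
    <inbound>
        <base />
        <!-- Emit prompt/completion/total token counts as custom metrics,
             split by subscription and API (dimension choices are illustrative) -->
        <azure-openai-emit-token-metric namespace="llm-metrics">
            <dimension name="Subscription ID" value="@(context.Subscription.Id)" />
            <dimension name="API ID" value="@(context.Api.Id)" />
        </azure-openai-emit-token-metric>
    </inbound>
</policies>
```

Splitting by subscription ID is what makes per-consumer cost attribution possible: each team's token consumption shows up as its own metric series.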
📄️ Ensure resiliency and optimized resource consumption with load balancer & circuit breaker
When the number of users increases to the point that a single region or server has trouble responding to requests in a reasonable time, the app starts to feel slow. To avoid this poor user experience, load balancing can be used. With "load balancing" you set up multiple endpoints capable of serving requests and additionally configure a scheme for how the "balancing" should happen. A circuit breaker adds resiliency on top of this by temporarily taking an unhealthy endpoint out of rotation so traffic flows to the remaining healthy ones.
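A rough sketch of the policy side of such a setup in API Management: requests are routed to a load-balanced backend pool, with a retry so a throttled response can be re-served by another pool member. The pool name `openai-backend-pool` is assumed for illustration; the pool members and any circuit-breaker rules are defined on the backend resource itself, not in this policy:

```xml
<policies>
    <inbound>
        <base />
        <!-- Route to a load-balanced pool of LLM backends
             ("openai-backend-pool" is an assumed name) -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
    <backend>
        <!-- If one pool member throttles (429), retry once so the
             request can land on a different backend -->
        <retry condition="@(context.Response.StatusCode == 429)"
               count="1" interval="1" first-fast-retry="true">
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
</policies>
```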