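Testing Your AI Apps with Microsoft.Extensions.AI.Evaluation

AI is changing how we build software, but it also brings a new challenge: how do you know whether your AI app is giving correct answers and producing the right results? Evaluations (often shortened to "evals") provide a structured way to measure quality so that you can trust your results.

In this post, we introduce the concept of AI evaluations, show how the Microsoft.Extensions.AI.Evaluation libraries support building intelligent .NET apps in Visual Studio, and walk through a simple example of using them in your own project.

Why evaluations?

As developers, we rely on tests to verify that code is correct before it ships. As intelligent apps that include AI-generated output (responses from an LLM) become more common, we also need a way to make sure the AI behaves as expected.

That is where evaluations come in. Think of evaluations as unit tests for your AI: they measure model output against criteria such as correctness, relevance, safety, and user intent, or even custom, domain-specific criteria that you define for your own scenario. By integrating regular evaluations into your workflow, you can catch problems before they affect users, benchmark different models or prompts, and keep improving quality from release to release, bringing the rigor you already expect from software testing to your AI features.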

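Why the Microsoft.Extensions.AI.Evaluation libraries?

The Microsoft.Extensions.AI.Evaluation libraries provide the foundational building blocks you need to orchestrate evaluations when building intelligent .NET apps.

Seamless integration with your workflow

The libraries plug naturally into your existing .NET projects, so you can use familiar syntax and test infrastructure to run (offline) evaluations of your AI apps. You can evaluate your app with your preferred test framework (MSTest, xUnit, NUnit) and your usual test tooling and workflows (Test Explorer, dotnet test, or a CI/CD pipeline). You can also use the libraries to perform (online) evaluations inside a deployed production app and upload the evaluation scores to a telemetry dashboard for real-time monitoring.

Rich, research-backed metrics

Developed in collaboration with researchers at Microsoft and GitHub, and tested on real GitHub Copilot experiences, the libraries include built-in evaluators for measuring and improving your AI apps. The following NuGet packages provide built-in evaluators:

Content Safety: A set of evaluators built on top of the Azure AI Foundry Evaluation service that you can use to evaluate the content safety of AI responses in your project, covering protected material, hate and unfairness, violence, code vulnerabilities, and more. By incorporating these evaluations, you can embed responsible AI practices directly into your development workflow, identify harmful or risky output early, and make sure your app is both high quality and safe for end users. All safety evaluators rely on the Azure AI Foundry Evaluation service (and the fine-tuned models hosted behind it) to perform evaluations.
Quality: Evaluators for the quality of AI responses in your project, including relevance, coherence, completeness, and more. We also provide evaluators for AI agent quality that measure how well an agent handles tasks, resolves intent, and uses tools correctly. These quality evaluations help ensure that your AI app produces responses that are reliable, useful, and in line with user expectations. All quality evaluators require an LLM connection to perform evaluations.
NLP (natural language processing): A set of evaluators that implement common algorithms for evaluating machine translation and natural language processing tasks. These evaluators currently include the older, classic text-similarity metrics (BLEU, GLEU, F1) and do not depend on an LLM to perform evaluations.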
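
Because the NLP evaluators do not call a model, they can run without any LLM connection. The following is a minimal sketch of scoring a candidate text against a reference with BLEU. It assumes the Microsoft.Extensions.AI.Evaluation.NLP package is referenced and that its BLEUEvaluator and BLEUEvaluatorContext types have the shape used here; the BleuSketch class and its ScoreAsync method are hypothetical names introduced for illustration, so check the package's API reference before relying on it.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.NLP;

// Hypothetical helper, not part of the library.
public static class BleuSketch
{
    // Scores a candidate text against a reference text with BLEU. No LLM is involved.
    public static async Task<double?> ScoreAsync(string candidate, string reference)
    {
        IEvaluator evaluator = new BLEUEvaluator();

        // Wrap the candidate text in a ChatResponse, as if a model had produced it.
        var response = new ChatResponse(new ChatMessage(ChatRole.Assistant, candidate));

        // Assumption: the reference text is supplied to the evaluator as additional context.
        List<EvaluationContext> context = [new BLEUEvaluatorContext(reference)];

        EvaluationResult result = await evaluator.EvaluateAsync(
            messages: new List<ChatMessage>(),
            modelResponse: response,
            chatConfiguration: null, // Not needed because no model is called.
            additionalContext: context);

        NumericMetric bleu = result.Get<NumericMetric>(BLEUEvaluator.BLEUMetricName);
        return bleu.Value;
    }
}
```

A score closer to 1 indicates more n-gram overlap with the reference. Since no model call is made, a check like this is cheap enough to run on every build.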

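Although the libraries include all of these built-in evaluators and reporting features, if your scenario needs something specific, you can also build what you need from the core abstractions, building blocks, and extension points in the following NuGet packages. For example, you can define and plug in your own custom, domain-specific evaluators and metrics, or a custom storage provider for evaluation results.

Evaluation: Defines the core abstractions and building blocks for evaluation (such as IEvaluator and EvaluationMetric).
Reporting: Contains support for caching LLM responses, storing evaluation results, and generating reports from that data.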
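
To make the extension point concrete, here is a minimal sketch of a custom evaluator built on the IEvaluator abstraction. The WordCountEvaluator class, its metric name, and the 100-word default limit are hypothetical choices made for illustration; the sketch assumes the Microsoft.Extensions.AI.Evaluation package is referenced and a C# 12 or later compiler.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Hypothetical custom evaluator: reports the number of words in a response and
// marks the metric as failed when the response exceeds a word limit.
public sealed class WordCountEvaluator(int maxWords = 100) : IEvaluator
{
    public const string WordCountMetricName = "Words";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count whitespace-separated words (a null separator splits on any whitespace).
        // No LLM call is needed for this metric.
        int wordCount = modelResponse.Text
            .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries)
            .Length;

        var metric = new NumericMetric(
            WordCountMetricName,
            value: wordCount,
            reason: $"The response contains {wordCount} words.");

        // Attach an interpretation so that reports and tests can see pass or fail.
        bool tooLong = wordCount > maxWords;
        metric.Interpretation = new EvaluationMetricInterpretation(
            rating: tooLong ? EvaluationRating.Unacceptable : EvaluationRating.Good,
            failed: tooLong,
            reason: tooLong ? $"The response exceeds {maxWords} words." : null);

        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```

An evaluator like this can be listed next to the built-in evaluators when you create a ReportingConfiguration, so its metric is produced together with theirs.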

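Interactive reports out of the box

The evaluation libraries also include a CLI tool (dotnet aieval) that you can use to generate a detailed HTML report showing how your AI is performing. The report renders scenario results hierarchically, which makes it easy to drill down from high-level metrics to the details of an individual evaluation. You can filter scenarios by search or by tag to focus on specific areas of interest, and you can track historical trends to see how quality changes over time.

Built-in reporting means you can start making sense of your evaluation data without setting up your own dashboard. You can also use the building blocks in the libraries to implement custom reports or dashboards.

Faster, more cost-effective testing

Thanks to the response caching support built into the evaluation libraries, repeated calls to the same AI model with the same inputs do not need a new LLM request on every evaluation run. Cached responses are reused instead, which saves development time and avoids the cost of redundant model calls.

Response caching makes evaluations incremental. When you run them in a CI pipeline, responses for unchanged prompts are served directly from the cache, so they complete quickly. Only prompts that have changed since the last run trigger new model calls, which keeps evaluation runs efficient without sacrificing accuracy.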
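
The idea behind incremental runs can be illustrated with a small, self-contained sketch. This is not the library's implementation: CachedModelCaller is a hypothetical in-memory stand-in that only shows why an unchanged model-and-prompt pair never reaches the model a second time, while a changed prompt does.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

// Hypothetical in-memory illustration of response caching. The real libraries
// persist cached responses and handle cache expiry for you.
public sealed class CachedModelCaller(Func<string, Task<string>> callModel)
{
    private readonly Dictionary<string, string> _cache = new();

    // Number of requests that actually reached the model.
    public int ModelCalls { get; private set; }

    public async Task<string> GetResponseAsync(string modelName, string prompt)
    {
        // The cache key covers the inputs that can change the response.
        string key = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes($"{modelName}\n{prompt}")));

        if (_cache.TryGetValue(key, out string? cached))
        {
            return cached; // Unchanged input: served from the cache.
        }

        ModelCalls++; // New or modified input: call the model and remember the result.
        string response = await callModel(prompt);
        _cache[key] = response;
        return response;
    }
}
```

Calling GetResponseAsync twice with the same model name and prompt increments ModelCalls once; editing the prompt produces a different key and therefore a fresh model call.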

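In the screenshot below, you can see some sample diagnostic data from a generated HTML report. It shows that all chat responses for this evaluation were served from the cache, which reduces latency and the associated token cost.

Scalable storage with Azure Blob integration

The libraries include a built-in provider for Azure Blob Storage that lets you persist evaluation data, including results and cached model responses, in the cloud. Storing results in Azure gives your team a central place to track evaluation history, share results across environments, and support long-term monitoring or compliance workflows, all without changing your evaluation code. See the dotnet/ai-samples GitHub repository for an example of how to configure the Azure Storage integration.

The libraries also include a provider for persisting evaluation data to local disk. If neither Azure Blob Storage nor local disk fits, the system is extensible: you can define your own provider to store evaluation data in the backend of your choice.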
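
As a rough sketch of what switching from disk to Azure storage might look like, the snippet below builds a ReportingConfiguration backed by an Azure storage directory. It rests on several assumptions that you should verify against the dotnet/ai-samples repository: that the Microsoft.Extensions.AI.Evaluation.Reporting.Azure and Azure.Storage.Files.DataLake packages are referenced, that AzureStorageReportingConfiguration.Create accepts a DataLakeDirectoryClient followed by the evaluators and chat configuration, and that the storage account supports the Data Lake endpoint. The AzureReportingSketch class and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Azure.Identity;
using Azure.Storage.Files.DataLake;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

// Hypothetical helper, not part of the library.
public static class AzureReportingSketch
{
    // storageDirectoryUri is assumed to point at a directory inside a storage container,
    // for example https://<account>.dfs.core.windows.net/<container>/<directory>.
    public static ReportingConfiguration Create(
        Uri storageDirectoryUri,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration)
    {
        // Authenticate the same way as the rest of this post: DefaultAzureCredential.
        var directoryClient =
            new DataLakeDirectoryClient(storageDirectoryUri, new DefaultAzureCredential());

        // Assumption: results and cached responses are both stored under this directory.
        return AzureStorageReportingConfiguration.Create(
            directoryClient,
            evaluators,
            chatConfiguration,
            enableResponseCaching: true);
    }
}
```

The evaluation code that consumes the ReportingConfiguration (creating scenario runs and calling EvaluateAsync) stays the same whichever storage provider produced it.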

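Bringing AI evaluations to your Azure DevOps pipelines

The Microsoft.Extensions.AI.Evaluation libraries can also be integrated with Azure DevOps CI/CD pipelines. You can add steps to your pipeline that run evaluations (just as you would run unit tests), making AI quality checks part of your CI/CD process. With this setup, evaluators run automatically on every build and publish detailed reports that can be displayed directly in the pipeline by using a plugin available in the Marketplace. By treating AI evaluations as first-class checks in Azure DevOps, you can make sure your intelligent app meets your quality bar before it reaches users.

Flexible and extensible by design

The libraries are modular, so you use only the parts you need. If caching does not suit your scenario, you can skip it, or you can customize the reporting layer to work with your team's tools. You can also extend the system with your own custom evaluators and domain-specific metrics, which gives you control over how evaluations fit your project.

As mentioned above, you can also define a custom storage provider for evaluation data. Because evaluation data is JSON-serializable, you are free to build your own dashboards or reporting pipelines on top of it if the built-in reports do not meet your needs.

The comprehensive samples in the dotnet/ai-samples GitHub repository demonstrate how to use some of these extension points and customizations.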

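How the pieces fit together

Importantly, the Microsoft.Extensions.AI.Evaluation libraries are built on top of the Microsoft.Extensions.AI (MEAI) abstractions, the same APIs and building blocks you use to create AI apps. This makes it straightforward to pass your app's output to the evaluation libraries, and it keeps the evaluators compatible with the broader .NET ecosystem you already rely on.

Here is how the pieces fit together:

Core building blocks: The foundation is a set of abstractions such as IEvaluator and EvaluationMetric (from the Microsoft.Extensions.AI.Evaluation package). They let you build your own custom evaluators, run them alongside the evaluators included in the libraries, and generate reports, all through a consistent API.
Ready-to-use evaluators: On top of this foundation, Microsoft provides out-of-the-box evaluators for quality, safety, and NLP. These let you quickly check whether an agent follows instructions, avoids unsafe output, or produces text that matches expectations.
Reporting and integration: Once you have run evaluations, the reporting libraries (Microsoft.Extensions.AI.Evaluation.Reporting and Microsoft.Extensions.AI.Evaluation.Reporting.Azure), the dotnet aieval CLI tool, and the Azure DevOps plugin help you store results locally or in the cloud, save time and cost through cached responses, and generate reports that surface evaluation results where they matter.

The result is a familiar workflow: write code in Visual Studio, run evaluations the same way you run your other tests, and use the reporting features to keep your team aligned on quality. Whether you are experimenting locally or automating checks in a CI/CD pipeline, evaluations fit closely with the practices you already use to ship reliable software.

The following diagram shows the end-to-end flow of an (offline) evaluation run.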

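Getting started

As an example, this section shows how to set up and run the Microsoft.Extensions.AI.Evaluation.Quality evaluators using a simple code sample.

Set up an LLM connection

The quality evaluators used in the code sample below require an LLM to perform evaluations. The sample shows how to create an IChatClient that connects to a model deployed on Azure OpenAI. For instructions on deploying an OpenAI model in Azure AI Foundry Models, see: Create and deploy an Azure OpenAI in Azure AI Foundry Models resource.

Note: When running the following sample, we recommend a GPT-4o or newer series model (for example GPT-4.1 or GPT-5), preferably a full version of the model rather than a "mini" version. Although the Microsoft.Extensions.AI.Evaluation libraries and the underlying core abstractions in Microsoft.Extensions.AI support a wide variety of models and LLM providers, the evaluation prompts used internally by the evaluators in the Microsoft.Extensions.AI.Evaluation.Quality package have been tuned and tested against OpenAI models such as GPT-4o. You can use other models by supplying an IChatClient that can connect to the model of your choice. However, the performance of those models against the evaluation prompts may vary, and may be especially poor for smaller or local models.

First, open a Developer Command Prompt and set the following required environment variables. You need the endpoint of your Azure OpenAI resource and the deployment name of your deployed model. You can copy these values from the Azure portal and paste them into the environment variables below.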

SET EVAL_SAMPLE_AZURE_OPENAI_ENDPOINT=https://<your azure openai resource name>.openai.azure.com/
SET EVAL_SAMPLE_AZURE_OPENAI_MODEL=<your model deployment name (e.g., gpt-4o)>

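The sample uses DefaultAzureCredential for authentication. You can sign in to Azure using developer tooling such as Visual Studio or the Azure CLI.

Set up a test project to run the sample code

Next, let's create a new test project to demonstrate the evaluators.

Open Visual Studio from the Developer Command Prompt where you set the environment variables above. You can do this by running devenv from the command prompt.
Select File > New > Project…
Search for and select MSTest Test Project
Choose a name and location, then click Create

After creating the project, use the package manager to find and add the latest versions of the following NuGet packages:

Azure.AI.OpenAI
Azure.Identity
Microsoft.Extensions.AI.Evaluation
Microsoft.Extensions.AI.Evaluation.Quality
Microsoft.Extensions.AI.Evaluation.Reporting
Microsoft.Extensions.AI.OpenAI (select the latest pre-release version)

Next, copy the following code into the project (inside Test1.cs). The sample demonstrates how to evaluate the quality of LLM responses in two separate unit tests defined in the same test class.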

using Azure.AI.OpenAI; 
using Azure.Identity; 
using Microsoft.Extensions.AI; 
using Microsoft.Extensions.AI.Evaluation; 
using Microsoft.Extensions.AI.Evaluation.Quality; 
using Microsoft.Extensions.AI.Evaluation.Reporting; 
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage; 
 
namespace EvaluationTests; 
 
[TestClass] 
public class Test1 
{ 
    // Create a ReportingConfiguration. The ReportingConfiguration includes the IChatClient used to interact with the 
    // model, the set of evaluators to be used to assess the quality of the model's responses, and other configuration 
    // related to persisting the evaluation results and generating reports based on these results. 
    private static readonly ReportingConfiguration s_reportingConfig = CreateReportingConfiguration(); 
 
    [TestMethod] 
    public async Task Test1_DistanceBetweenEarthAndVenus() 
    { 
        // Create a ScenarioRun using the ReportingConfiguration created above. 
        await using ScenarioRun scenarioRun = await s_reportingConfig.CreateScenarioRunAsync("Scenarios.Venus"); 
 
        // Get a conversation that includes a query related to astronomy and the model's response to this query. 
        (IList<ChatMessage> messages, ChatResponse response) = 
            await GetAstronomyConversationAsync( 
                chatClient: scenarioRun.ChatConfiguration!.ChatClient, 
                query: "How far is the planet Venus from the Earth at its closest and furthest points?"); 
 
        // CoherenceEvaluator and FluencyEvaluator do not require any additional context beyond the messages and 
        // response to perform their assessments. GroundednessEvaluator on the other hand requires additional context 
        // to perform its assessment - it assesses how well the model's response is grounded in the grounding context 
        // provided below. 
        List<EvaluationContext> additionalContext = 
            [ 
                new GroundednessEvaluatorContext( 
                    """ 
                    Distance between Venus and Earth at inferior conjunction: Between 23 and 25 million miles approximately. 
                    Distance between Venus and Earth at superior conjunction: Between 160 and 164 million miles approximately. 
                    The exact distances can vary due to the specific orbital positions of the planets at any given time. 
                    """) 
            ]; 
 
        // Run the evaluators included in the ReportingConfiguration to assess the quality of the above response. 
        EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response, additionalContext); 
 
        // Retrieve one of the metrics in the EvaluationResult (example: Groundedness). 
        NumericMetric groundedness = result.Get<NumericMetric>(GroundednessEvaluator.GroundednessMetricName); 
        Assert.IsFalse(groundedness.Interpretation!.Failed); 
 
        // Results are persisted to disk under the storageRootPath specified below once the scenarioRun is disposed. 
    } 
 
    [TestMethod] 
    public async Task Test2_DistanceBetweenEarthAndMars() 
    { 
        // Create another ScenarioRun using the same ReportingConfiguration created above. 
        await using ScenarioRun scenarioRun = await s_reportingConfig.CreateScenarioRunAsync("Scenarios.Mars"); 
 
        // Get another conversation for a different astronomy-related query. 
        (IList<ChatMessage> messages, ChatResponse response) = 
            await GetAstronomyConversationAsync( 
                chatClient: scenarioRun.ChatConfiguration!.ChatClient, 
                query: "How far is the planet Mars from the Earth at its closest and furthest points?"); 
 
        List<EvaluationContext> additionalContext = 
            [ 
                new GroundednessEvaluatorContext( 
                    """ 
                    Distance between Mars and Earth at opposition: Between 33.9 and 62.1 million miles approximately.
                    Distance between Mars and Earth at conjunction: Between 249 and 250 million miles approximately.
                    The exact distances can vary due to the specific orbital positions of the planets at any given time. 
                    """) 
            ]; 
 
        EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response, additionalContext); 
 
        // Retrieve one of the metrics in the EvaluationResult (example: Coherence). 
        NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName); 
        Assert.IsFalse(coherence.Interpretation!.Failed); 
    } 
 
    private static ReportingConfiguration CreateReportingConfiguration() 
    { 
        // Create an IChatClient to interact with a model deployed on Azure OpenAI. 
        string endpoint = Environment.GetEnvironmentVariable("EVAL_SAMPLE_AZURE_OPENAI_ENDPOINT")!; 
        string model = Environment.GetEnvironmentVariable("EVAL_SAMPLE_AZURE_OPENAI_MODEL")!; 
        var client = new AzureOpenAIClient(new Uri(endpoint), new DefaultAzureCredential()); 
        IChatClient chatClient = client.GetChatClient(deploymentName: model).AsIChatClient(); 
 
        // Create a ReportingConfiguration for evaluating the quality of supplied responses. 
        return DiskBasedReportingConfiguration.Create( 
            storageRootPath: "./eval-results", // The evaluation results will be persisted to disk under this folder. 
            evaluators: [new CoherenceEvaluator(), new FluencyEvaluator(), new GroundednessEvaluator()], 
            chatConfiguration: new ChatConfiguration(chatClient), 
            enableResponseCaching: true); 
 
        // Since response caching is enabled above, all LLM responses produced via the chatClient above will also be 
        // cached under the storageRootPath so long as the inputs being evaluated stay unchanged, and so long as the 
        // cache entries do not expire (cache expiry is set at 14 days by default). 
    } 
 
    private static async Task<(IList<ChatMessage> Messages, ChatResponse ModelResponse)> GetAstronomyConversationAsync( 
        IChatClient chatClient, 
        string query) 
    { 
        const string SystemPrompt = 
            """ 
            You are an AI assistant that can answer questions related to astronomy. 
            Keep your responses concise staying under 100 words as much as possible. 
            Use the imperial measurement system for all measurements in your response. 
            """; 
 
        List<ChatMessage> messages = 
            [ 
                new ChatMessage(ChatRole.System, SystemPrompt), 
                new ChatMessage(ChatRole.User, query) 
            ]; 
 
        var chatOptions = 
            new ChatOptions 
            { 
                Temperature = 0.0f, 
                ResponseFormat = ChatResponseFormat.Text 
            }; 
 
        ChatResponse response = await chatClient.GetResponseAsync(messages, chatOptions); 
        return (messages, response); 
    } 
}
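
Each test above asserts on a single metric, but every EvaluationResult holds one metric per configured evaluator. As a small sketch of checking all of them at once, the hypothetical EvaluationResultChecks helper below (not part of the library) collects every metric whose interpretation is marked as failed. It assumes only the Microsoft.Extensions.AI.Evaluation package.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI.Evaluation;

// Hypothetical helper, not part of the library.
public static class EvaluationResultChecks
{
    // Returns one "name: reason" line for every metric whose interpretation is marked as failed.
    // Metrics without an interpretation are not treated as failures.
    public static IReadOnlyList<string> GetFailures(EvaluationResult result) =>
        result.Metrics.Values
            .Where(metric => metric.Interpretation is { Failed: true })
            .Select(metric =>
                $"{metric.Name}: {metric.Interpretation!.Reason ?? metric.Reason ?? "no reason given"}")
            .ToList();
}
```

In an MSTest test, the returned list could be used as Assert.AreEqual(0, failures.Count, string.Join(Environment.NewLine, failures)), so that a failing test names every metric that fell short rather than only the first one checked.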

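Run the tests and generate an evaluation report

Next, let's run the unit tests above. You can use Visual Studio's Test Explorer to run them. Note that these evaluation unit tests take longer the first time they run, because every LLM call in the code results in a new request to the configured AI model. Subsequent runs with the same AI model and inputs are significantly faster because cached responses are reused.

After running the tests, you can use the dotnet aieval tool to generate an HTML report containing the results of both scenarios in the sample above.

First, install the tool locally under the project folder: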

dotnet tool install Microsoft.Extensions.AI.Evaluation.Console --create-manifest-if-needed

Then generate and open the report:

dotnet aieval report -p <path to 'eval-results' folder under the build output directory for the above project> -o .\report.html --open

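The --open flag automatically opens the generated report in your default browser so that you can explore the evaluation results interactively. Here is a preview of the generated report. This screenshot shows the details displayed when you click the "Groundedness" metric for the first scenario in the report.

Learn more

This post gave a high-level introduction to evaluations and how they fit into your Visual Studio workflow. Also check out the following earlier posts on the .NET Blog, which go into more detail on the specific evaluators and features available in the Microsoft.Extensions.AI.Evaluation libraries:

Exploring new Agent Quality and NLP evaluators for .NET AI applications
Evaluating content safety in your .NET AI applications
Unlock new possibilities for AI Evaluations for .NET
Evaluate the quality of your AI applications with ease

The Microsoft.Extensions.AI.Evaluation libraries themselves are open source and ship from the dotnet/extensions repository on GitHub. We have articles on Microsoft Learn that provide background on what these libraries offer for .NET, along with several tutorials and quickstart guides.

For the most comprehensive examples, which demonstrate the various API concepts, features, best practices, and common usage patterns of the Microsoft.Extensions.AI.Evaluation libraries, check out the API usage samples in the dotnet/ai-samples repository. The samples are structured as a series of unit tests, where each unit test demonstrates a specific concept that later samples build on.

Together, these resources show how the Microsoft.Extensions.AI.Evaluation libraries can be applied to different scenarios, giving you a flexible toolkit for building reliable AI apps. We encourage you to try these evaluators in your own AI apps and share your feedback. If you run into problems or have suggestions for improvement, please report them on GitHub. Your feedback helps us keep improving the evaluation libraries and build better tools for the Visual Studio and .NET AI development community.

Happy evaluating!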