LLM Response Evaluation with Spring AI: Building LLM-as-a-Judge with Recursive Advisors

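Evaluating Large Language Model (LLM) outputs is a critical challenge for notoriously non-deterministic AI applications, particularly as they move into production. Traditional metrics such as ROUGE and BLEU fall short when assessing the nuanced, context-dependent responses that modern LLMs produce. Human evaluation, while accurate, is expensive, slow, and does not scale.

Enter LLM-as-a-Judge, a powerful technique that uses LLMs themselves to evaluate the quality of AI-generated content. Research shows that sophisticated judge models can reach 85% agreement with human judgment, which is actually higher than the agreement between humans (81%).

In this article, we'll explore how Spring AI's Recursive Advisors provide an elegant framework for implementing LLM-as-a-Judge patterns, enabling you to build self-improving AI systems with automated quality control.

💡 Demo: Find the complete example implementation in the evaluation-recursive-advisor-demo.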

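Understanding LLM-as-a-Judge

LLM-as-a-Judge is an evaluation method in which a Large Language Model assesses the quality of outputs generated by other models or by itself. Instead of relying solely on human evaluators or traditional automated metrics, LLM-as-a-Judge uses an LLM to score, classify, or compare responses against predefined criteria.

Why does this work? Evaluation is fundamentally easier than generation. When you use an LLM as a judge, you ask it to perform a simpler, more focused task (assessing specific properties of existing text) rather than the complex task of creating original content while balancing multiple constraints.

A good analogy: it is easier to critique than to create. Detecting problems is simpler than preventing them.

There are two primary LLM-as-a-judge evaluation patterns:

Direct Assessment (Point-wise Scoring): the judge evaluates an individual response and provides feedback that can be used to refine the prompt through self-refinement
Pairwise Comparison: the judge selects the better of two candidate responses (common in A/B testing)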
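
The two patterns above can be sketched in plain Java. This is a minimal illustration, not Spring AI code: the Llm interface, the prompts, and the method names are all assumptions made for the example, and the "judges" in main return canned replies so the sketch runs without a model.

```java
import java.util.function.Function;

public class JudgeModes {

    /** Hypothetical stand-in for an LLM call: takes a prompt, returns the model's text. */
    interface Llm extends Function<String, String> {}

    /** Direct assessment: score a single answer on an integer scale. */
    static int pointwise(Llm judge, String question, String answer) {
        String prompt = """
            Rate how well the answer addresses the question on a scale of 1 to 4.
            Reply with the number only.
            Question: %s
            Answer: %s
            """.formatted(question, answer);
        return Integer.parseInt(judge.apply(prompt).trim());
    }

    /** Pairwise comparison: return the better of two candidate answers. */
    static String pairwise(Llm judge, String question, String a, String b) {
        String prompt = """
            Which answer addresses the question better? Reply with A or B only.
            Question: %s
            A: %s
            B: %s
            """.formatted(question, a, b);
        return judge.apply(prompt).trim().equals("A") ? a : b;
    }

    public static void main(String[] args) {
        // Toy judges with canned replies; a real judge would call a model here.
        Llm scoreJudge = prompt -> "3";
        Llm pickJudge = prompt -> "B";
        System.out.println(pointwise(scoreJudge, "What is 2+2?", "4"));
        System.out.println(pairwise(pickJudge, "What is 2+2?", "5", "4"));
    }
}
```

Direct assessment yields a score that can be compared against a threshold, which is what a retry loop needs; pairwise comparison only ranks candidates, so it suits A/B testing better than automated retries.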

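LLM judges evaluate quality dimensions such as relevance, factual accuracy, faithfulness to sources, instruction adherence, and overall coherence and clarity, across domains like healthcare, finance, RAG systems, and dialogue.

Choosing the Right Judge Model

While general-purpose models like GPT-4 and Claude can serve as effective judges, dedicated LLM-as-a-Judge models consistently outperform them in evaluation tasks. The Judge Arena Leaderboard tracks the performance of various models specifically on judging tasks.

Spring AI: The Perfect Foundation

Spring AI's ChatClient provides a fluent API that is well suited to implementing LLM-as-a-Judge patterns. Its Advisors system lets you intercept, modify, and enhance AI interactions in a modular, reusable way.

The recently introduced Recursive Advisors take this further by enabling looping patterns, which fit self-refining evaluation workflows well: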

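import org.springframework.ai.chat.client.ChatClientRequest;
import org.springframework.ai.chat.client.ChatClientResponse;
import org.springframework.ai.chat.client.advisor.api.CallAdvisor;
import org.springframework.ai.chat.client.advisor.api.CallAdvisorChain;

public class MyRecursiveAdvisor implements CallAdvisor {

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {

        // Call the chain initially
        ChatClientResponse response = chain.nextCall(request);

        // Check whether a retry is needed based on the evaluation
        // (evaluationPasses and addEvaluationFeedback are placeholders; getName/getOrder are omitted)
        while (!evaluationPasses(response)) {

            // Modify the request based on the evaluation feedback
            ChatClientRequest modifiedRequest = addEvaluationFeedback(request, response);

            // Create a sub-chain and recurse
            response = chain.copy(this).nextCall(modifiedRequest);
        }

        return response;
    }
}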

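We'll implement a SelfRefineEvaluationAdvisor that embodies the LLM-as-a-Judge pattern using Spring AI's Recursive Advisors.

The advisor automatically evaluates AI responses and retries failed attempts with feedback-driven improvement: generate a response → evaluate its quality → retry with feedback if needed → repeat until the quality threshold is met or the retry limit is reached.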
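
Stripped of the Spring AI types, the generate → evaluate → retry cycle might look like the sketch below. Everything here is an assumption made for illustration: the Verdict record, the refine method, and the toy generator and judge are hypothetical stand-ins for the model calls, not part of Spring AI.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

public class SelfRefineLoop {

    /** Hypothetical verdict type: an integer rating plus feedback text. */
    record Verdict(int rating, String feedback) {}

    static String refine(Function<String, String> generator,
                         BiFunction<String, String, Verdict> judge,
                         String question, int successRating, int maxRepeatAttempts) {
        String prompt = question;
        String answer = null;
        // One initial attempt plus at most maxRepeatAttempts retries.
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
            answer = generator.apply(prompt);
            Verdict verdict = judge.apply(question, answer);
            if (verdict.rating() >= successRating) {
                return answer; // quality threshold met
            }
            // Rebuild the prompt from the original question so feedback does not pile up.
            prompt = question + "\nPrevious response evaluation failed with feedback: "
                    + verdict.feedback();
        }
        return answer; // retry limit reached: return the last answer anyway
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Toy generator: implausible on the first call, plausible afterwards.
        Function<String, String> generator = p -> calls[0]++ == 0 ? "-255 C" : "15 C";
        // Toy judge: accepts only the plausible reading.
        BiFunction<String, String, Verdict> judge = (q, a) -> a.equals("15 C")
                ? new Verdict(4, "ok")
                : new Verdict(1, "The temperature is physically impossible.");
        System.out.println(refine(generator, judge, "What is the current weather in Paris?", 4, 3));
    }
}
```

The loop always terminates: it either returns a passing answer or gives up after the configured number of retries and returns the last answer it has.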

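Let's examine the implementation, which demonstrates advanced evaluation patterns:

The SelfRefineEvaluationAdvisor Implementation

This implementation demonstrates the Direct Assessment evaluation pattern, in which a judge model evaluates individual responses using a point-wise scoring system (a 1-4 scale). It combines this with a self-refinement strategy that automatically retries failed evaluations by incorporating specific feedback into subsequent attempts, creating an iterative improvement loop.

The advisor embodies two key LLM-as-a-Judge concepts:

Point-wise Evaluation: each response receives an individual quality score based on predefined criteria
Self-Refinement: failed responses trigger retry attempts with constructive feedback to guide improvement

(Based on the article: Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation)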

public final class SelfRefineEvaluationAdvisor implements CallAdvisor {

    private static final PromptTemplate DEFAULT_EVALUATION_PROMPT_TEMPLATE = new PromptTemplate(
        """
        You will be given a user_question and assistant_answer couple.
        Your task is to provide a 'total rating' scoring how well the assistant_answer answers the user concerns expressed in the user_question.
        Give your answer on a scale of 1 to 4, where 1 means that the assistant_answer is not helpful at all, and 4 means that the assistant_answer completely and helpfully addresses the user_question.

        Here is the scale you should use to build your answer:
        1: The assistant_answer is terrible: completely irrelevant to the question asked, or very partial
        2: The assistant_answer is mostly not helpful: misses some key aspects of the question
        3: The assistant_answer is mostly helpful: provides support, but still could be improved
        4: The assistant_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

        Provide your feedback as follows:

        \\{
            "rating": 0,
            "evaluation": "Explanation of the evaluation result and how to improve if needed.",
            "feedback": "Constructive and specific feedback on the assistant_answer."
        \\}

        Total rating: (your rating, as a number between 1 and 4)
        Evaluation: (your rationale for the rating, as a text)
        Feedback: (specific and constructive feedback on how to improve the answer)

        You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

        Now here are the question and answer.

        Question: {question}
        Answer: {answer}

        Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.

        Evaluation:
        """);

    @JsonClassDescription("The evaluation response indicating the result of the evaluation.")
    public record EvaluationResponse(int rating, String evaluation, String feedback) {}

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
        var request = chatClientRequest;
        ChatClientResponse response;

        // Run the initial attempt plus up to maxRepeatAttempts retries
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {

            // Make the inner call down the rest of the advisor chain (this produces the candidate answer)
            response = callAdvisorChain.copy(this).nextCall(request);

            // Perform the evaluation
            EvaluationResponse evaluation = this.evaluate(chatClientRequest, response);

            // If the evaluation passes, return the response
            if (evaluation.rating() >= this.successRating) {
                logger.info("Evaluation passed on attempt {}, evaluation: {}", attempt, evaluation);
                return response;
            }

            // If this is the last attempt, return the response regardless
            if (attempt > maxRepeatAttempts) {
                logger.warn(
                    "Maximum attempts ({}) reached. Returning last response despite failed evaluation. Use the following feedback to improve: {}",
                    maxRepeatAttempts, evaluation.feedback());
                return response;
            }

            // Retry with the evaluation feedback
            logger.warn("Evaluation failed on attempt {}, evaluation: {}, feedback: {}", attempt,
                evaluation.evaluation(), evaluation.feedback());

            request = this.addEvaluationFeedback(chatClientRequest, evaluation);
        }

        // This should never be reached because of the loop logic above
        throw new IllegalStateException("Unexpected loop exit in adviseCall");
    }

    /**
     * Performs the evaluation using LLM-as-a-Judge and returns the result.
     */
    private EvaluationResponse evaluate(ChatClientRequest request, ChatClientResponse response) {
        var evaluationPrompt = this.evaluationPromptTemplate.render(
            Map.of("question", this.getPromptQuestion(request), "answer", this.getAssistantAnswer(response)));

        // Use a separate ChatClient for evaluation to avoid narcissistic (self-preference) bias
        return chatClient.prompt(evaluationPrompt).call().entity(EvaluationResponse.class);
    }

    /**
     * Creates a new request that carries the evaluation feedback for the retry.
     */
    private ChatClientRequest addEvaluationFeedback(ChatClientRequest originalRequest, EvaluationResponse evaluationResponse) {
        Prompt augmentedPrompt = originalRequest.prompt()
            .augmentUserMessage(userMessage -> userMessage.mutate().text(String.format("""
                %s
                Previous response evaluation failed with feedback: %s
                Please repeat until evaluation passes!
                """, userMessage.getText(), evaluationResponse.feedback())).build());

        return originalRequest.mutate().prompt(augmentedPrompt).build();
    }
}

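Key Implementation Features

Recursive Pattern Implementation

The advisor uses callAdvisorChain.copy(this).nextCall(request) to create a sub-chain for each recursive call, enabling multiple evaluation rounds while maintaining proper advisor ordering.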
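
To see why a copy of the chain is taken on every round, the following toy model may help. It is a deliberately simplified stand-in and not Spring AI's implementation: the Advisor, Chain, and RetryTwice types are hypothetical, requests are plain strings, and the only behavior assumed from the framework is that copy(advisor) yields a fresh chain containing the advisors that come after the given one.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyAdvisorChain {

    /** Toy advisor: receives a request and the chain it is running in. */
    interface Advisor {
        String advise(String request, Chain chain);
    }

    /** Toy chain: advances through its advisors and ends at a fake model call. */
    static final class Chain {
        private final List<Advisor> advisors;
        private int position = 0;

        Chain(List<Advisor> advisors) {
            this.advisors = advisors;
        }

        /** Hands the request to the next advisor, or to the "model" at the end. */
        String nextCall(String request) {
            if (position == advisors.size()) {
                return "model(" + request + ")";
            }
            return advisors.get(position++).advise(request, this);
        }

        /** Returns a fresh chain holding only the advisors after the given one. */
        Chain copy(Advisor after) {
            int index = advisors.indexOf(after);
            return new Chain(new ArrayList<>(advisors.subList(index + 1, advisors.size())));
        }
    }

    /** Toy recursive advisor: runs the downstream chain twice via fresh sub-chains. */
    static final class RetryTwice implements Advisor {
        @Override
        public String advise(String request, Chain chain) {
            chain.copy(this).nextCall(request + "#1");
            return chain.copy(this).nextCall(request + "#2");
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // Downstream advisor that records every request passing through it.
        Advisor logging = (request, next) -> {
            seen.add(request);
            return next.nextCall(request);
        };
        Chain root = new Chain(List.of(new RetryTwice(), logging));
        System.out.println(root.nextCall("hi") + " " + seen);
    }
}
```

In this model the chain is consumed as it advances, so calling nextCall twice on the same chain would skip the downstream advisor on the second call. Taking a fresh sub-chain for each round means every downstream advisor sees every iteration, in the original order.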

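Structured Evaluation Output

Using Spring AI's structured output capabilities, the evaluation result is parsed into an EvaluationResponse record containing a rating (1-4), the evaluation rationale, and specific feedback for improvement.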
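
Because the rating comes from a model, it may be worth guarding the pass check against values outside the 1-4 scale or a missing result. The sketch below is one possible guard, not part of the advisor shown above: the EvaluationGuard class and its passes method are hypothetical, and the record is redeclared locally so the example is self-contained.

```java
public class EvaluationGuard {

    /** Local copy of the record shape used by the advisor, for a self-contained example. */
    record EvaluationResponse(int rating, String evaluation, String feedback) {}

    static final int MIN_RATING = 1;
    static final int MAX_RATING = 4;

    /** Passes only when a result exists, lies on the 1-4 scale, and meets the threshold. */
    static boolean passes(EvaluationResponse evaluation, int successRating) {
        if (evaluation == null) {
            return false; // no parsed result: treat as a failed evaluation
        }
        int rating = evaluation.rating();
        boolean inRange = rating >= MIN_RATING && rating <= MAX_RATING;
        return inRange && rating >= successRating;
    }

    public static void main(String[] args) {
        System.out.println(passes(new EvaluationResponse(4, "Excellent", ""), 4));
        System.out.println(passes(new EvaluationResponse(7, "Off the scale", ""), 4));
        System.out.println(passes(null, 4));
    }
}
```

Without the range check, a plain rating >= successRating comparison would accept an off-scale value such as 7 as a pass.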

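Separate Evaluation Model

The advisor uses a specialized LLM-as-a-Judge model (avcodes/flowaicom-flow-judge:q4) through a different ChatClient instance to mitigate model bias.

Set spring.ai.chat.client.enabled=false to disable the auto-configured ChatClient.Builder, which makes it possible to work with multiple chat models.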
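
A configuration along these lines might go into application.properties. Treat it as a sketch: the judge model name comes from this article, while the temperature setting and the ANTHROPIC_API_KEY environment variable name are assumptions, and the demo project's actual file may differ.

```properties
# Disable the auto-configured ChatClient.Builder so one client per model can be built by hand
spring.ai.chat.client.enabled=false

# Judge model served by Ollama (model name taken from the article)
spring.ai.ollama.chat.options.model=avcodes/flowaicom-flow-judge:q4
# Assumption: a low temperature keeps the judge's ratings more repeatable
spring.ai.ollama.chat.options.temperature=0.0

# Generation model; the environment variable name is an assumption
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
```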

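Feedback-Driven Improvement

Failed evaluations include specific feedback that is incorporated into the retry attempt, so the next attempt can address the problems the judge identified.

Configurable Retry Logic

The advisor supports a configurable maximum number of attempts, with graceful degradation when the limit is reached.

Complete Integration

Here's how to integrate the SelfRefineEvaluationAdvisor into a complete Spring AI application: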

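import java.util.Random;

import org.springframework.ai.anthropic.AnthropicChatModel;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication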
public class EvaluationAdvisorDemoApplication {

    @Bean
    CommandLineRunner commandLineRunner(AnthropicChatModel anthropicChatModel, OllamaChatModel ollamaChatModel) {
        return args -> {
            
            ChatClient chatClient = ChatClient.builder(anthropicChatModel) // @formatter:off
                    .defaultTools(new MyTools())
                    .defaultAdvisors(
                        
                        SelfRefineEvaluationAdvisor.builder()
                            .chatClientBuilder(ChatClient.builder(ollamaChatModel)) // separate model for evaluation
                            .maxRepeatAttempts(15)
                            .successRating(4)
                            .order(0)
                            .build(),
                        
                        new MyLoggingAdvisor(2))
                .build(); 
                
            var answer = chatClient
                .prompt("What is current weather in Paris?")
                .call()
                .content();

            System.out.println(answer);
        };
    }

    static class MyTools {
        final int[] temperatures = {-125, 15, -255};
        private final Random random = new Random();
        
        @Tool(description = "Get the current weather for a given location")
        public String weather(String location) {
            int temperature = temperatures[random.nextInt(temperatures.length)];
            System.out.println(">>> Tool Call responseTemp: " + temperature);
            return "The current weather in " + location + " is sunny with a temperature of " + temperature + "°C.";
        }
    }
}

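This configuration uses Anthropic Claude for generation and Ollama for evaluation (avoiding bias), requires a rating of 4, and allows up to 15 retry attempts.

It includes a weather tool that generates randomized responses to trigger evaluations.

The weather tool returns an invalid value in 2 out of 3 cases.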
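
A quick back-of-the-envelope calculation shows how the retry limit interacts with that failure rate. The sketch assumes a simplified model that the article does not state: each attempt makes exactly one independent tool call, and the judge rejects every invalid reading and accepts every valid one. The RetryOdds class is hypothetical.

```java
public class RetryOdds {

    /** Probability that every one of the given number of independent attempts fails. */
    static double allFail(double failureRate, int attempts) {
        return Math.pow(failureRate, attempts);
    }

    public static void main(String[] args) {
        double failureRate = 2.0 / 3.0; // two of the three canned temperatures are invalid
        int attempts = 15 + 1;          // 15 retries plus the initial attempt
        System.out.println("P(all attempts fail) = " + allFail(failureRate, attempts));
        System.out.println("Expected attempts until a valid reading = " + 1 / (1 - failureRate));
    }
}
```

Under those assumptions the chance of exhausting all 16 attempts is (2/3)^16, roughly 0.15%, and a valid reading arrives after about three attempts on average, so the limit of 15 retries is rarely reached in this demo.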

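The SelfRefineEvaluationAdvisor (order 0) evaluates response quality and retries with feedback if needed, followed by the MyLoggingAdvisor (order 2), which logs the final request/response for observability.

When run, you will see output like this: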

REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]

>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
 
>>> Tool Call responseTemp: 15  
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data

RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.

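🚀 Try It Yourself: The complete runnable demo, with configuration examples covering different model combinations and evaluation scenarios, is available in the evaluation-recursive-advisor-demo project.

Conclusion

Spring AI's Recursive Advisors make implementing LLM-as-a-Judge patterns both elegant and production-ready.

The SelfRefineEvaluationAdvisor demonstrates how to build self-improving AI systems that automatically assess response quality, retry with feedback, and scale evaluation without human intervention.

Key benefits include automated quality control, bias mitigation through a separate judge model, and seamless integration with existing Spring AI applications.

This approach provides a foundation for reliable, scalable quality assurance across chatbots, content generation, and complex AI workflows.

Key success factors when implementing the LLM-as-a-Judge technique include:

Use dedicated judge models for better performance (see the Judge Arena Leaderboard)
Mitigate bias by using separate models for generation and evaluation
Aim for deterministic results (temperature = 0)
Design prompts with integer scales and few-shot examples
Maintain human oversight for high-stakes decisions

⚠️ Important Notes

Recursive Advisors are a new experimental feature in Spring AI 1.1.0-M4+. Currently they support non-streaming calls only, require careful advisor ordering, and can increase costs because of the multiple LLM calls.

Be especially careful with inner advisors that maintain external state: they may need extra attention to remain correct across iterations.

Always set termination conditions and retry limits to prevent infinite loops.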

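Resources

Spring AI Documentation

Spring AI Recursive Advisors Documentation
LLM-as-a-judge: a complete guide to using LLMs for evaluations
Spring AI Advisors API Guide
ChatClient API Documentation
EvaluationAdvisor Demo Project

LLM-as-a-Judge Research

Judge Arena Leaderboard: current rankings of the best-performing judge models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena: the foundational paper that introduced the LLM-as-a-Judge paradigm
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement: introduces a two-step benchmark that evaluates 54 LLMs as judges by testing their correlation with human judgment and their agreement patterns, finding that 27 models reach top-tier performance through human-like or super-consistent judging behavior, regardless of model size
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge (2024): a survey covering the full LLM-as-a-Judge landscape, including a systematic taxonomy and recent challenges
LLM-as-a-Judge Resource Hub: a central repository of paper lists, tools, and ongoing research
Preference Leakage: A Contamination Problem in LLM-as-a-judge: recent research on bias in judge models
Who's Your Judge? On the Detectability of LLM-Generated Judgments: emerging research on judgment detection and transparency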
