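There is no "silver bullet" for LLM evaluation, but the method you choose determines what you are able to see. This article breaks down the mainstream evaluation paradigms and explains the logic and limitations behind each.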

Understanding the Four Main Approaches to LLM Evaluation

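How should a large language model (LLM) be evaluated rigorously? Whether you are selecting a model, interpreting reported results, or measuring progress on a fine-tuned or in-house model, the choice of evaluation method matters.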

Mainstream LLM evaluation methods fall into two broad categories: benchmark-based evaluation and judgment-based evaluation. The four common methods are:

  1. Multiple choice
  2. Verifier
  3. Leaderboard
  4. LLM judge

Method 1: Multiple-Choice Accuracy

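Multiple-choice questions are the most common form of benchmark, and they mainly test a model's ability to recall knowledge. MMLU (Massive Multitask Language Understanding) is a typical example: it covers 57 subjects with about 16,000 multiple-choice questions, and the metric is accuracy.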

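Code example: loading the model for evaluation

The following code shows how to load the Qwen3 0.6B model so that it can be evaluated on multiple-choice questions: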

from pathlib import Path

import torch
from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small, Qwen3Tokenizer,
    Qwen3Model, QWEN_CONFIG_06_B
)

device = get_device()
torch.set_float32_matmul_precision("high")

# Choose "base" for the base model or "reasoning" for the reasoning variant
WHICH_MODEL = "base"

if WHICH_MODEL == "base":
    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
elif WHICH_MODEL == "reasoning":
    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    # The reasoning variant uses the chat template with thinking enabled
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )
else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")

# Build the model, load the downloaded weights, and move it to the device
model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))
model.to(device)

# Optional: compile the model (disabled by default)
USE_COMPILE = False
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)

Formatting the multiple-choice prompt

def format_prompt(example):
    return (
        f"{example['question']}\n"
        f"A. {example['choices'][0]}\n"
        f"B. {example['choices'][1]}\n"
        f"C. {example['choices'][2]}\n"
        f"D. {example['choices'][3]}\n"
        "Answer: "
    )

Predicting the answer and comparing it with the correct one

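# Note: generate_text_basic_stream_cache is not imported in the code shown here;
# it is assumed to come from the reasoning_from_scratch package used above.
def predict_choice(model, tokenizer, prompt_fmt, max_new_tokens=8):
    pred = None
    # Stream tokens one at a time and stop at the first A-D letter
    for t in generate_text_basic_stream_cache(
        model=model,
        token_ids=prompt_fmt,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    ):
        answer = tokenizer.decode(t.squeeze(0).tolist())
        for letter in answer:
            letter = letter.upper()
            if letter in "ABCD":
                pred = letter
                break
        if pred:
            break
    # None means no answer letter appeared within max_new_tokens
    return pred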

Result in practice:

Generated letter: C
Correct? False

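Multiple-choice evaluation is simple and intuitive, and it suits fast, large-scale comparison. However, it measures only knowledge recall and does not reflect reasoning ability or real-world performance.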
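To make the accuracy metric concrete, here is a minimal sketch of how per-question predictions could be aggregated into a single score. The `accuracy` helper and the toy data are hypothetical and not part of the original code; the sketch assumes that a prediction of `None` (no A-D letter found in the output) is counted as wrong.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted letter matches the gold letter.

    Assumption: a prediction of None (no A-D letter found) counts as wrong.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        pred is not None and pred == gold
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data, not real MMLU results
preds = ["C", "A", None, "D"]
golds = ["B", "A", "C", "D"]
print(f"Accuracy: {accuracy(preds, golds):.2f}")  # prints "Accuracy: 0.50"
```

With four answer options, random guessing lands near 25% accuracy, so scores are usually read relative to that baseline.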

Method 2: Automatic Scoring with a Verifier

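The verifier approach lets the model generate a free-form answer and then uses an external tool (such as a code interpreter or a calculator) to check automatically whether the answer is correct. It fits domains where correctness can be verified mechanically, such as math and code.

Because problems can be generated automatically in large numbers, this approach is well suited to evaluating reasoning. It is limited to automatically verifiable domains, however, and its results are only as reliable as the external tool.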
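As an illustration of the idea, the following is a minimal sketch of a verifier for math answers. It is not the implementation from any particular benchmark: the helper names `extract_final_number` and `verify` are hypothetical, and the sketch assumes the final answer is the last number that appears in the model's free-form output.

```python
import re
from fractions import Fraction


def extract_final_number(text):
    """Return the last number in the text as a Fraction, or None if absent.

    Assumption: the model's final answer is the last number it writes.
    Handles integers, decimals, and simple fractions such as 3/4.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", text)
    if not matches:
        return None
    try:
        return Fraction(matches[-1])
    except (ValueError, ZeroDivisionError):
        return None


def verify(model_answer, reference):
    """Check whether the extracted answer equals the reference exactly."""
    predicted = extract_final_number(model_answer)
    return predicted is not None and predicted == Fraction(reference)


print(verify("So the total is 3/4.", "0.75"))  # True: 3/4 equals 0.75
print(verify("I think it is 12", "13"))        # False: wrong value
print(verify("no idea", "1"))                  # False: nothing to extract
```

Comparing exact `Fraction` values rather than strings means that equivalent forms such as 3/4 and 0.75 are treated as the same answer. A practical verifier would need more robust answer extraction than this.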

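Method 3: Leaderboards and Preference Voting

The leaderboard approach collects preference votes from users or LLMs on model outputs and aggregates them into a measure of how well each model is liked. LM Arena is a typical example: users compare the outputs of two models and vote for the better one, and the votes are turned into a ranking.

Code example: an Elo rating implementation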

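def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Every model that appears in a vote starts at the same rating
    ratings = {
        model: initial_rating
        for pair in vote_pairs
        for model in pair
    }

    # Each vote is a (winner, loser) pair, processed in order
    for winner, loser in vote_pairs:
        # Probability that the winner was expected to win
        expected_winner = 1.0 / (
            1.0 + 10 ** (
                (ratings[loser] - ratings[winner]) / 400.0
            )
        )

        # The winner gains exactly what the loser gives up
        ratings[winner] = (
            ratings[winner] + k_factor * (1 - expected_winner)
        )
        ratings[loser] = (
            ratings[loser] + k_factor * (0 - (1 - expected_winner))
        )

    return ratings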

Example output:

GPT-5 : 1043.7
Claude-3 : 1015.2
Llama-4 : 1000.7
Llama-3 : 940.4

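Leaderboards reflect how well a model is liked in real-world use, but they are strongly affected by who the voters are and what they prefer, and they are a poor measure of whether an answer is correct.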
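The update rule in the Elo code above can be written out as follows. The symbols are notation introduced here for illustration: $R_w$ and $R_l$ are the current ratings of the winner and the loser, $K$ is `k_factor`, and $E_w$ is `expected_winner`.

```latex
E_w = \frac{1}{1 + 10^{(R_l - R_w)/400}}, \qquad
R_w' = R_w + K\,(1 - E_w), \qquad
R_l' = R_l - K\,(1 - E_w)

% Worked example: first vote between two unrated models
% R_w = R_l = 1000,\ K = 32
E_w = \frac{1}{1 + 10^{0}} = 0.5, \qquad
R_w' = 1000 + 32 \cdot 0.5 = 1016, \qquad
R_l' = 1000 - 32 \cdot 0.5 = 984
```

The winner gains exactly what the loser gives up, so the sum of all ratings stays constant. An upset (a win by the lower-rated model) gives a smaller $E_w$ and therefore a larger rating change.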

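Method 4: LLM Judge

The LLM-judge approach uses a strong LLM (such as GPT-5) as a judge that scores model outputs automatically against a rubric. It is both scalable and consistent.

Code example: automatic scoring through the Ollama API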

import json
import urllib.request


def query_model(
    prompt,
    model="gpt-oss:20b",
    url="http://localhost:11434/api/chat"
):
    # Fixed seed and zero temperature make the judge's output more repeatable
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    payload = json.dumps(data).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # The response is streamed as one JSON object per line;
    # concatenate the message content from each chunk
    response_data = ""
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

The rubric prompt

def rubric_prompt(instruction, reference_answer, model_answer):
    # Scoring scale from 1 (worst) to 5 (best)
    rubric = (
        "You are a fair judge assistant. You will be "
        "given an instruction, a reference answer, and "
        "a candidate answer to evaluate, according to "
        "the following rubric:\n\n"
        "1: The response fails to address the "
        "instruction, providing irrelevant, incorrect, "
        "or excessively verbose content.\n"
        "2: The response partially addresses the "
        "instruction but contains major errors, "
        "omissions, or irrelevant details.\n"
        "3: The response addresses the instruction to "
        "some degree but is incomplete, partially "
        "correct, or unclear in places.\n"
        "4: The response mostly adheres to the "
        "instruction, with only minor errors, "
        "omissions, or lack of clarity.\n"
        "5: The response fully adheres to the "
        "instruction, providing a clear, accurate, and "
        "relevant answer in a concise and efficient "
        "manner.\n\n"
        "Now here is the instruction, the reference "
        "answer, and the response.\n"
    )

    # Append the concrete instruction, reference, and candidate answer
    prompt = (
        f"{rubric}\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference Answer:\n{reference_answer}\n\n"
        f"Answer:\n{model_answer}\n\n"
        "Evaluation: "
    )
    return prompt

Example scoring result:

Score: 5

The candidate answer directly addresses the question,
correctly applies the given premises, and concisely
states that a penguin would be able to fly. It is
accurate, relevant, and clear.

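The LLM-judge approach suits large-scale automated evaluation and is both flexible and consistent. Its results depend on the judge model and the rubric, however, so some subjectivity remains.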
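To aggregate judge outputs across many examples, the numeric score has to be pulled out of the free-text response. The following is a minimal sketch with a hypothetical `parse_score` helper. It assumes the judge writes a line of the form "Score: N", as in the example above. The rubric prompt does not enforce that format, so unparseable or out-of-range responses are returned as `None`.

```python
import re


def parse_score(judge_response, min_score=1, max_score=5):
    """Extract the integer after 'Score:' from a judge response.

    Returns None when no score is found or when it falls outside the
    rubric's range (assumed here to be 1 to 5).
    """
    match = re.search(r"Score:\s*(\d+)", judge_response, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    if not min_score <= score <= max_score:
        return None
    return score


response = "Score: 5\n\nThe candidate answer directly addresses the question."
print(parse_score(response))             # 5
print(parse_score("Score: 9"))           # None: outside the 1-5 rubric
print(parse_score("Looks fine to me."))  # None: no score given
```

Returning `None` instead of a default value keeps failed parses visible, so they can be counted or retried rather than silently skewing the average.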

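Method Comparison and Recommendations

| Method | Strengths | Weaknesses |
|---|---|---|
| Multiple choice | Fast, standardized, reproducible | Tests only knowledge recall; does not reflect real-world ability |
| Verifier | Automated, can assess reasoning, supports free-form generation | Limited to verifiable domains; depends on external tools |
| Leaderboard | Reflects real user preferences; covers style and safety | Influenced by the voter population; a poor measure of correctness |
| LLM judge | Scalable, consistent, supports many task types | Depends on the judge model and rubric; some subjectivity |

Each evaluation method has its own strengths and weaknesses. In practice, combine several of them to get a rounded picture of a model's capabilities.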

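Summary

This article walked through the four main approaches to LLM evaluation, each accompanied by a from-scratch code example. Every approach has its own use cases and limitations, so in practice it is best to combine several of them and to tailor the evaluation data and workflow to your business goals. That is the way to get a comprehensive and objective view of a model's real capabilities and of where it can improve.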
