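Traditional observability asks "is the system running normally?" AI observability must also ask "is the model producing correct results?" From Metrics, Logs, and Traces to Evals, AI-native observability is redrawing the boundary of what "observable" means.

Why do AI systems need a new kind of observability?

Traditional observability rests on three pillars: Metrics, Logs, and Traces. This model works well for microservices, but AI systems introduce new challenges: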

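Traditional microservices	AI systems
Explicit errors: 4xx/5xx status codes	Ambiguous errors: output looks plausible but is wrong (hallucination)
Performance metrics: latency, throughput, error rate	Quality metrics: accuracy, relevance, safety
Stateless: each request is independent	Stateful: context and conversation history affect the output
Predictable cost: billed by instance-hours	Variable cost: billed per token, hard to predict
Deterministic behavior: same input → same output	Stochastic behavior: same input → possibly different output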

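Core questions:

How do you know whether the model's output is correct? (Not "did it return an error" but "is it accurate".)
How do you track token consumption and cost? (Per-request cost can vary widely.)
How do you detect model drift? (Quality degrading over time.)
How do you debug prompt issues? (The same prompt can produce different results.)
How do you evaluate RAG quality? (Is retrieval accurate, and is the generated answer relevant?)

These questions gave rise to AI observability, which adds a fourth pillar to the traditional three: Evals (evaluations).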

The four pillars of AI observability

Pillar 1: Metrics

Metrics for AI systems fall into three categories:

System metrics

These are the same as for traditional microservices and focus on infrastructure health:

# Infrastructure metrics
- cpu_usage_percent
- memory_usage_bytes
- gpu_utilization_percent
- gpu_memory_used_mb
- network_io_bytes
- disk_io_bytes

# Service metrics
- request_count
- request_duration_seconds
- error_rate
- active_connections

Model metrics

Metrics specific to AI systems:

# Token metrics
- tokens_input_total
- tokens_output_total
- tokens_per_request_avg
- tokens_cost_usd

# Performance metrics
- time_to_first_token_seconds  # latency to the first token (streaming responses)
- inference_duration_seconds   # total inference time
- tokens_per_second            # generation speed

# Quality metrics (computed by Evals)
- accuracy_score
- relevance_score
- hallucination_rate
- toxicity_score
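
The three performance metrics above can all be derived from the arrival times of streamed tokens. The following is a minimal sketch under that assumption; `StreamMetrics` and `compute_stream_metrics` are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    time_to_first_token_seconds: float
    inference_duration_seconds: float
    tokens_per_second: float


def compute_stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive latency metrics from the arrival time of each streamed token."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    total = token_times[-1] - request_start
    # Generation speed over the whole request; guard against a zero duration.
    tps = len(token_times) / total if total > 0 else 0.0
    return StreamMetrics(ttft, total, tps)


# Hypothetical timestamps (seconds since the request was sent) for five tokens.
metrics = compute_stream_metrics(0.0, [0.45, 0.60, 0.75, 0.90, 1.20])
print(metrics)
```

In a real service the timestamps would come from a monotonic clock such as `time.monotonic()`, recorded as each chunk arrives.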

Business metrics

Metrics tied to business goals:

# User satisfaction
- user_rating_avg
- thumbs_up_rate
- conversation_completion_rate

# Cost
- cost_per_request_avg
- cost_per_user_month
- cost_total_month

# Efficiency
- cache_hit_rate  # semantic cache hit rate
- fallback_rate   # rate of falling back after a failure

Prometheus configuration example:

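# AI model metrics (recording rules; these entries go under groups[].rules in a rules file)
- record: ai:tokens:input:total
  expr: sum(ai_tokens_input_total)
  
- record: ai:tokens:output:total
  expr: sum(ai_tokens_output_total)
  
# The multiplication only matches series whose label sets are identical,
# so the price gauges must carry the same labels as the token counters.
- record: ai:cost:usd:total
  expr: |
    sum(ai_tokens_input_total * ai_token_input_price_usd) +
    sum(ai_tokens_output_total * ai_token_output_price_usd)
  
# histogram_quantile expects per-bucket rates aggregated by the le label
- record: ai:latency:first_token:p50
  expr: histogram_quantile(0.5, sum by (le) (rate(ai_time_to_first_token_seconds_bucket[5m])))
  
- record: ai:latency:first_token:p95
  expr: histogram_quantile(0.95, sum by (le) (rate(ai_time_to_first_token_seconds_bucket[5m])))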
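
To make the p50/p95 rules less of a black box, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach `histogram_quantile` takes for classic histograms. The bucket boundaries and counts are made-up values, and the function is an illustration rather than Prometheus's actual implementation.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    The lowest bucket is assumed to start at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Quantile falls in the +Inf bucket (or an empty one):
                # report the highest finite bound seen so far.
                return lower_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical time-to-first-token buckets: (le in seconds, cumulative count).
BUCKETS = [(0.25, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.5, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
print(p50, p95)
```

Because the estimate interpolates within a bucket, its precision depends on how finely the bucket boundaries are chosen around the latencies that matter.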

Pillar 2: Logs

Logs for AI systems need to capture more information than traditional request logs:

Request log

{
  "timestamp": "2025-12-01T10:30:00Z",
  "request_id": "req-abc123",
  "trace_id": "trace-xyz789",
  "user_id": "user-456",
  "session_id": "session-def",
  
  "model": "gpt-4",
  "prompt": "What is the capital of France?",
  "response": "The capital of France is Paris.",
  
  "tokens": {
    "input": 8,
    "output": 10,
    "total": 18
  },
  
  "cost": {
    "input_usd": 0.00024,
    "output_usd": 0.0006,
    "total_usd": 0.00084
  },
  
  "latency": {
    "first_token_ms": 450,
    "total_ms": 1200
  },
  
  "metadata": {
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 1.0
  },
  
  "evals": {
    "relevance": 0.95,
    "accuracy": 1.0,
    "toxicity": 0.0
  }
}
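
The `cost` fields in a log entry like the one above follow directly from the token counts and per-token prices. The sketch below reproduces the numbers in the example, assuming prices of $0.03 per 1K input tokens and $0.06 per 1K output tokens; those prices and the function name `token_cost_usd` are assumptions chosen for illustration.

```python
from typing import Dict


def token_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> Dict[str, float]:
    """Compute the per-request cost breakdown from token counts."""
    input_usd = input_tokens / 1000 * input_price_per_1k
    output_usd = output_tokens / 1000 * output_price_per_1k
    # Round to 8 decimals to keep float noise out of the log record.
    return {
        "input_usd": round(input_usd, 8),
        "output_usd": round(output_usd, 8),
        "total_usd": round(input_usd + output_usd, 8),
    }


# 8 input tokens and 10 output tokens, as in the request log example.
cost = token_cost_usd(8, 10, input_price_per_1k=0.03, output_price_per_1k=0.06)
print(cost)
```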

Conversation log (multi-turn)

{
  "session_id": "session-def",
  "turn": 3,
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is its population?"},
    {"role": "assistant", "content": "Paris has a population of about 2.1 million."},
    {"role": "user", "content": "When was it founded?"},
    {"role": "assistant", "content": "Paris was founded around the 3rd century BC."}
  ],
  "total_tokens": 150,
  "total_cost_usd": 0.0045,
  "duration_seconds": 45
}

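Logging best practices:

Structured logs: use JSON so logs are easy to query and analyze
Sensitive data masking: run PII detection and masking on user input
Log sampling: in high-traffic scenarios, record only a sample (for example 10%)
Retention policy: keep raw logs for 7 days and aggregated data for 90 days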
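
Two of these practices, PII masking and sampling, are straightforward to sketch. The example below uses simple regular expressions for emails and phone numbers and hashes the request ID so that the sampling decision is deterministic. The patterns are deliberately naive and the function names are illustrative; a production system would more likely rely on a dedicated PII detection service.

```python
import hashlib
import re

# Naive patterns for illustration only; real PII detection needs far more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholders before logging."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def should_log(request_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministic sampling: hash the request ID into [0, 1) and compare."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


masked = mask_pii("Contact me at jane.doe@example.com or +1 415-555-0100.")
print(masked)
print(should_log("req-abc123"))
```

Hashing the request ID instead of drawing a random number means every service that sees the same request makes the same keep-or-drop decision, so sampled logs stay joinable across components.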

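Pillar 3: Traces

Tracing an AI system is more complex than tracing traditional microservices because more components are involved:

Example trace path

User request
  └─ AI Gateway (routing, rate limiting, caching)
      ├─ [cache hit] return directly
      └─ [cache miss] call the LLM
          ├─ Prompt construction (RAG retrieval, context injection)
          │   ├─ Vector database query
          │   └─ Document retrieval and ranking
          ├─ LLM inference
          │   ├─ Model loading (on cold start)
          │   └─ Token generation (streaming)
          ├─ Post-processing (content filtering, PII masking)
          └─ Return response

OpenTelemetry tracing example: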

from opentelemetry import trace

tracer = trace.get_tracer("ai-app")

def handle_user_query(query: str, user_id: str):
    with tracer.start_as_current_span("user-query") as root_span:
        root_span.set_attribute("user.id", user_id)
        root_span.set_attribute("query.text", query)
        
        # 1. RAG retrieval
        with tracer.start_as_current_span("rag-retrieval") as rag_span:
            documents = retrieve_documents(query)
            rag_span.set_attribute("rag.documents.count", len(documents))
            rag_span.set_attribute("rag.documents.relevance", avg_relevance(documents))
        
        # 2. Prompt construction
        with tracer.start_as_current_span("prompt-construction") as prompt_span:
            prompt = build_prompt(query, documents)
            prompt_span.set_attribute("prompt.tokens", count_tokens(prompt))
        
        # 3. LLM call
        with tracer.start_as_current_span("llm-inference") as llm_span:
            response = call_llm(prompt)
            llm_span.set_attribute("llm.model", "gpt-4")
            llm_span.set_attribute("llm.tokens.input", count_tokens(prompt))
            llm_span.set_attribute("llm.tokens.output", count_tokens(response))
            llm_span.set_attribute("llm.latency.first_token_ms", response.first_token_latency)
            llm_span.set_attribute("llm.latency.total_ms", response.total_latency)
        
        # 4. Post-processing
        with tracer.start_as_current_span("post-processing") as post_span:
            filtered_response = filter_and_mask(response)
            post_span.set_attribute("post.pii_detected", filtered_response.pii_count)
        
        return filtered_response

Trace visualization (Jaeger):

user-query (1200ms)
  ├─ rag-retrieval (150ms)
  │   ├─ vector-db-query (100ms)
  │   └─ document-ranking (50ms)
  ├─ prompt-construction (10ms)
  ├─ llm-inference (1000ms)
  │   ├─ model-loading (200ms)  # cold start
  │   └─ token-generation (800ms)
  └─ post-processing (40ms)
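
A trace like this can also be analyzed programmatically. The sketch below takes the span names and durations from the trace above, computes each top-level stage's share of the total latency, and finds the slowest leaf span. The tuple layout and function names are assumptions made for illustration, not a Jaeger or OpenTelemetry API.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[str, Optional[str], float]  # (name, parent name, duration in ms)

SPANS: List[Span] = [
    ("user-query", None, 1200),
    ("rag-retrieval", "user-query", 150),
    ("vector-db-query", "rag-retrieval", 100),
    ("document-ranking", "rag-retrieval", 50),
    ("prompt-construction", "user-query", 10),
    ("llm-inference", "user-query", 1000),
    ("model-loading", "llm-inference", 200),
    ("token-generation", "llm-inference", 800),
    ("post-processing", "user-query", 40),
]


def top_level_shares(spans: List[Span]) -> Dict[str, float]:
    """Fraction of the root span's duration spent in each direct child."""
    root_name, root_ms = next((n, d) for n, p, d in spans if p is None)
    return {n: d / root_ms for n, p, d in spans if p == root_name}


def slowest_leaf(spans: List[Span]) -> Tuple[str, float]:
    """The longest span that has no children, a first guess at the bottleneck."""
    parents = {p for _, p, _ in spans if p is not None}
    leaves = [(n, d) for n, _, d in spans if n not in parents]
    return max(leaves, key=lambda item: item[1])


shares = top_level_shares(SPANS)
bottleneck = slowest_leaf(SPANS)
print(shares, bottleneck)
```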

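Pillar 4: Evals

Evals are the core innovation of AI observability: they quantify the quality of model output.

Types of Evals

Eval type	Description	How it is implemented
Accuracy	Whether the output is correct	Comparison with a reference answer
Relevance	Whether the output is relevant to the question	Semantic similarity
Completeness	Whether the output fully answers the question	LLM scoring
Safety	Whether the output contains harmful content	Classifier-based detection
Hallucination rate	Whether the output contains fabricated information	Fact checking
Latency	Response time	Direct measurement

Eval implementation examples

1. Accuracy evaluation (comparison with a reference answer, scored by an LLM judge)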

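def evaluate_accuracy(response: str, ground_truth: str) -> float:
    """Use an LLM judge to rate accuracy against the ground truth."""
    prompt = f"""
    Compare the following response to the ground truth and rate accuracy from 0 to 1.
    Reply with the number only.
    
    Ground Truth: {ground_truth}
    Response: {response}
    
    Score (0-1):
    """
    # call_llm is an application-defined helper that returns the judge's text reply
    score = call_llm(prompt)
    return float(score.strip())  # raises ValueError if the reply is not a bare number

# Example
ground_truth = "Paris is the capital of France."
response = "The capital of France is Paris."
accuracy = evaluate_accuracy(response, ground_truth)  # expected to be close to 1.0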
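
When a reference answer is available, a deterministic comparison can complement the LLM judge, since it costs nothing and always returns the same score for the same input. The sketch below computes a token-overlap F1 score, in the style commonly used for question-answering benchmarks. It is an alternative technique added for illustration, and `token_f1` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 between a response and a reference answer."""
    pred, gold = normalize(response), normalize(ground_truth)
    if not pred or not gold:
        return float(pred == gold)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


same_meaning = token_f1("The capital of France is Paris.", "Paris is the capital of France.")
partial = token_f1("Paris", "Paris is the capital of France.")
wrong = token_f1("Lyon", "Paris is the capital of France.")
print(same_meaning, partial, wrong)
```

Token overlap ignores word order and paraphrase, so it works best for short factual answers; longer free-form answers still call for a semantic or LLM-based judge.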

2. Relevance evaluation (semantic similarity)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

def evaluate_relevance(query: str, response: str) -> float:
    """Compute the semantic similarity between the query and the response."""
    query_embedding = model.encode([query])
    response_embedding = model.encode([response])
    similarity = cosine_similarity(query_embedding, response_embedding)
    return float(similarity[0][0])  # convert the NumPy scalar to a plain float

# Example
query = "What is the capital of France?"
response = "The capital of France is Paris."
relevance = evaluate_relevance(query, response)  # illustrative value: around 0.85

3. Hallucination detection (fact checking)

def detect_hallucination(response: str, context: str) -> float:
    """Check whether the response contains information not supported by the context."""
    prompt = f"""
    Check if the response contains information not supported by the context.
    Rate hallucination from 0 (no hallucination) to 1 (complete hallucination).
    Reply with the number only.
    
    Context: {context}
    Response: {response}
    
    Hallucination Score (0-1):
    """
    score = call_llm(prompt)
    return float(score.strip())  # raises ValueError if the reply is not a bare number

# Example
context = "Paris is the capital of France. It has a population of 2.1 million."
response = "Paris is the capital of France. It was founded in 1500 AD."
hallucination = detect_hallucination(response, context)  # illustrative value: 0.5 (partial hallucination)
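
Converting a judge's reply with a plain `float(...)` call fails as soon as the model adds any text around the number. A small, defensive parser makes LLM-based Evals more robust. This is a minimal sketch; `parse_score` is a hypothetical helper, and it simply takes the first number found in the reply.

```python
import re
from typing import Optional

_NUMBER_RE = re.compile(r"[-+]?\d*\.?\d+")


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [0, 1].

    Returns None when the reply contains no number, so the caller can
    retry or exclude the sample instead of crashing the pipeline.
    """
    match = _NUMBER_RE.search(reply)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group())))


print(parse_score("0.5"))
print(parse_score("Hallucination Score: 0.25 (one unsupported claim)"))
print(parse_score("I cannot rate this response."))
```

Because only the first number is used, the judge prompt should still ask for a bare number; otherwise a reply such as "on a 0-1 scale, 0.3" would be parsed as 0.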

4. Safety evaluation (toxicity detection)

from transformers import pipeline

toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def evaluate_safety(response: str) -> float:
    """Detect toxic content in the response."""
    result = toxicity_classifier(response)
    toxicity_score = result[0]['score'] if result[0]['label'] == 'toxic' else 0.0
    return 1.0 - toxicity_score  # safety = 1 - toxicity

# Example
response = "This is a perfectly safe response."
safety = evaluate_safety(response)  # expected to be close to 1.0 for benign text

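Automated Eval pipeline

import time

import schedule  # third-party package: pip install schedule

# Run Eval tests on a schedule
def run_eval_pipeline():
    # 1. Load the test dataset (load_dataset is an application-defined helper)
    test_dataset = load_dataset("test_questions.json")
    
    # 2. Run the model and collect the results
    results = []
    for item in test_dataset:
        response = call_llm(item["question"])
        results.append({
            "question": item["question"],
            "ground_truth": item["answer"],
            "response": response
        })
    
    # 3. Compute Eval metrics (avg is an application-defined mean helper;
    #    the ground truth doubles as the context for hallucination checks)
    metrics = {
        "accuracy": avg([evaluate_accuracy(r["response"], r["ground_truth"]) for r in results]),
        "relevance": avg([evaluate_relevance(r["question"], r["response"]) for r in results]),
        "hallucination": avg([detect_hallucination(r["response"], r["ground_truth"]) for r in results]),
        "safety": avg([evaluate_safety(r["response"]) for r in results])
    }
    
    # 4. Record the metrics in the monitoring system
    log_metrics(metrics)
    
    # 5. Alert if a metric has dropped
    if metrics["accuracy"] < 0.8:
        alert("Model accuracy dropped below 80%")

# Run once a day
schedule.every().day.at("02:00").do(run_eval_pipeline)

# schedule only runs jobs when it is polled
while True:
    schedule.run_pending()
    time.sleep(60)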
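
The alerting step in the pipeline checks only accuracy. A slightly more general check needs to know the direction of each metric, because accuracy should stay high while hallucination should stay low. The sketch below shows one way to express that. Apart from the 0.8 accuracy limit used in the pipeline, the threshold values are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# (direction, limit): "min" means the metric must stay at or above the limit,
# "max" means it must stay at or below it. Only the accuracy limit comes from
# the pipeline above; the others are illustrative assumptions.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "accuracy": ("min", 0.80),
    "relevance": ("min", 0.70),
    "hallucination": ("max", 0.10),
    "safety": ("min", 0.95),
}


def check_thresholds(
    metrics: Dict[str, float],
    thresholds: Dict[str, Tuple[str, float]] = THRESHOLDS,
) -> List[str]:
    """Return a human-readable message for every metric that violates its limit."""
    violations = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif direction == "min" and value < limit:
            violations.append(f"{name}: {value:.2f} is below {limit:.2f}")
        elif direction == "max" and value > limit:
            violations.append(f"{name}: {value:.2f} is above {limit:.2f}")
    return violations


violations = check_thresholds(
    {"accuracy": 0.76, "relevance": 0.82, "hallucination": 0.14, "safety": 0.99}
)
print(violations)
```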

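AI observability tool stack

Open-source tools

Tool	Core features	Best for
Langfuse	LLM tracing, Evals, cost analysis	General LLM applications
Helicone	LLM logging, cost tracking	Getting started quickly with minimal code changes
OpenLLMetry	OpenTelemetry extension for LLM tracing	Teams with existing OpenTelemetry infrastructure
Arize Phoenix	Model evaluation, drift detection	ML model monitoring
Prometheus + Grafana	Metrics collection and visualization	Infrastructure monitoring

Commercial tools

Tool	Core features	Best for
Datadog LLM Observability	Full AI observability platform	Enterprises already on Datadog
New Relic AI Monitoring	AI application performance monitoring	Teams already on New Relic
Weights & Biases	ML experiment tracking, model evaluation	ML teams
LangSmith	LangChain's official tracing tool	Teams using LangChain

Tool selection advice

Small teams / quick validation:

Helicone: up and running in about 5 minutes with minimal code changes
Langfuse (open-source edition): full-featured, with an active community

Mid-size and large enterprises / production:

Langfuse (commercial edition) + Prometheus + Grafana: a complete observability stack
Datadog LLM Observability: integrates seamlessly if you already use Datadog

ML teams / model development:

Weights & Biases: experiment tracking and model evaluation
Arize Phoenix: model drift detection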

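Case study: building a complete AI observability platform

Scenario

A company has deployed a RAG application and needs to:

Track token consumption and cost for every request
Monitor model output quality (accuracy, relevance, hallucination rate)
Detect model drift and performance degradation
Produce detailed observability reports

Architecture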

┌─────────────────────────────────────────────────────┐
│                  AI Application                     │
│  (RAG: vector retrieval + LLM generation)           │
└──────────────┬──────────────────────────────────────┘
               │
               │ OpenTelemetry + Langfuse SDK
               │
┌──────────────▼──────────────────────────────────────┐
│              AI Gateway (Envoy)                     │
│  - Token metering                                   │
│  - Request tracing                                  │
│  - Content filtering                                │
└──────────────┬──────────────────────────────────────┘
               │
       ┌───────┼───────────────┐
       │       │               │
┌──────▼───┐ ┌─▼──────────┐ ┌──▼────────────┐
│ Langfuse │ │ Prometheus │ │ Elasticsearch │
│ (traces) │ │ (metrics)  │ │ (logs)        │
└──────────┘ └────────────┘ └───────────────┘

Implementation steps

1. Integrate the Langfuse SDK

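# Uses the low-level client API (langfuse.trace / trace.span / trace.score);
# newer Langfuse SDK versions may expose a different interface.
# retrieve_documents, call_llm, and evaluate_relevance are application-defined helpers.
from langfuse import Langfuse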

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"
)

def handle_query(query: str, user_id: str):
    # Create a trace
    trace = langfuse.trace(
        name="rag-query",
        user_id=user_id,
        metadata={"query": query}
    )
    
    # 1. Vector retrieval
    retrieval_span = trace.span(name="vector-retrieval")
    documents = retrieve_documents(query)
    retrieval_span.end(metadata={"documents_count": len(documents)})
    
    # 2. LLM call
    llm_span = trace.span(name="llm-inference")
    response = call_llm(query, documents)
    llm_span.end(
        metadata={
            "model": "gpt-4",
            "tokens_input": response.usage.prompt_tokens,
            "tokens_output": response.usage.completion_tokens
        }
    )
    
    # 3. Record the evaluation score
    trace.score(
        name="relevance",
        value=evaluate_relevance(query, response.content)
    )
    
    return response

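2. Configure Prometheus metrics

# prometheus.yml
scrape_configs:
- job_name: 'ai-gateway'
  metrics_path: '/metrics'
  static_configs:
  - targets: ['ai-gateway:8080']

rule_files:
- 'ai_alerts.yml'  # hypothetical file name

# ai_alerts.yml: alerting rules live in a separate rules file, inside a group
groups:
- name: ai-alerts
  rules:
  - alert: HighTokenCost
    # ai_cost_usd_total is a counter, so compare its hourly increase, not its raw value
    expr: increase(ai_cost_usd_total[1h]) > 1000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Token cost exceeds $1000/hour"

  - alert: ModelAccuracyDrop
    expr: ai_eval_accuracy < 0.8
    for: 1d
    labels:
      severity: critical
    annotations:
      summary: "Model accuracy dropped below 80%"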

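3. Build Grafana dashboards

Dashboard panels:

Token consumption trend
- Chart: time series
- Metrics: ai_tokens_input_total, ai_tokens_output_total
Cost analysis
- Chart: time series
- Metrics: ai_cost_usd_total
- Group by: model, user
Model quality
- Chart: time series
- Metrics: ai_eval_accuracy, ai_eval_relevance, ai_eval_hallucination
Latency distribution
- Chart: histogram
- Metrics: ai_time_to_first_token_seconds, ai_inference_duration_seconds
Request volume
- Chart: time series
- Metrics: ai_request_count
- Group by: status code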

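Results

Cost visibility:

Token consumption and cost are tracked in real time
Detailed bills are generated per user and per department
Cost anomalies trigger alerts automatically

Quality monitoring:

Accuracy: improved from 85% to 92% (Evals surfaced problems, which were fixed by refining prompts)
Hallucination rate: reduced from 15% to 5% (by adding RAG context verification)
Safety: 100% of requests pass through content filtering

Troubleshooting:

Mean time to locate a fault: reduced from 2 hours to 10 minutes
Root cause analysis: traces make it quick to pinpoint bottlenecks (such as vector retrieval latency)

Model drift detection:

Drops in accuracy are detected automatically and trigger retraining
After each model update, Evals verify quality automatically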
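
Automatic detection of an accuracy drop can be as simple as comparing a recent window of daily Eval scores with an earlier baseline window. The sketch below illustrates that idea. The window sizes, the 0.05 tolerance, and the scores are assumptions chosen for illustration, and a production system might prefer a proper statistical test.

```python
from statistics import mean
from typing import Sequence


def detect_drift(
    scores: Sequence[float],
    baseline_size: int = 7,
    recent_size: int = 3,
    max_drop: float = 0.05,
) -> bool:
    """Flag drift when the recent mean falls more than max_drop below the baseline mean.

    scores is a chronological series, for example one accuracy value per daily Eval run.
    """
    if len(scores) < baseline_size + recent_size:
        return False  # not enough history to compare
    baseline = mean(scores[:baseline_size])
    recent = mean(scores[-recent_size:])
    return baseline - recent > max_drop


# Hypothetical daily accuracy: a stable week followed by three weaker days.
daily_accuracy = [0.92, 0.91, 0.93, 0.92, 0.92, 0.91, 0.93, 0.88, 0.85, 0.84]
drifted = detect_drift(daily_accuracy)
print(drifted)
```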

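Best practices and recommendations

1. Roll out in phases

Phase 1: basic tracing (months 0-1)

Integrate Langfuse or Helicone to record requests and responses
Collect token consumption and cost data
Build basic dashboards

Phase 2: quality evaluation (months 1-3)

Implement an Eval pipeline and evaluate model quality regularly
Monitor accuracy, relevance, and hallucination rate
Set up quality alerts

Phase 3: deep observability (months 3-6)

Integrate OpenTelemetry for end-to-end tracing
Build detailed Grafana dashboards
Implement automatic drift detection and retraining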

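2. Key metric priorities

P0 (must monitor):

Token consumption and cost
Request latency (time to first token, total latency)
Error rate

P1 (strongly recommended):

Accuracy (Eval)
Hallucination rate (Eval)
Cache hit rate

P2 (optional):

User satisfaction
Conversation completion rate
Model drift indicators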

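3. Common pitfalls

Pitfall 1: monitoring only system metrics and ignoring quality metrics

Problem: the system runs normally, but the model's output is wrong
Solution: implement an Eval pipeline and evaluate quality regularly

Pitfall 2: running Evals too infrequently

Problem: model drift goes unnoticed
Solution: run Evals at least once a day, and evaluate critical scenarios in real time

Pitfall 3: too many logs, high storage cost

Problem: logging every request makes storage costs explode
Solution: sample logs (for example 10%) and keep only aggregated data long term

Pitfall 4: too many alerts, leading to alert fatigue

Problem: frequent false positives cause operators to ignore alerts
Solution: start with loose thresholds, tighten them gradually, and track the false-positive rate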

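4. Eval best practices

1. Build a benchmark test set

# The benchmark test set should cover:
test_dataset = {
    "simple_questions": [...],      # simple factual questions
    "complex_questions": [...],     # complex reasoning questions
    "edge_cases": [...],            # edge cases
    "adversarial_examples": [...]   # adversarial examples (prompt injection, etc.)
}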

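2. Multi-model evaluation (LLM-as-Judge)

# Use several LLM judges to reduce single-model bias
def evaluate_with_multiple_judges(response: str, ground_truth: str) -> float:
    judges = ["gpt-4", "claude-3", "llama-3"]
    scores = []
    for judge in judges:
        # Here call_llm is assumed to take the judge model name as its first argument
        reply = call_llm(judge, f"Rate accuracy from 0 to 1, number only: {response} vs {ground_truth}")
        scores.append(float(reply))
    return sum(scores) / len(scores)  # take the mean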
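
A plain mean lets one outlier judge pull the result, and it hides disagreement between judges. As a variation on the averaging approach, the sketch below aggregates with the median and flags samples where the judges disagree strongly, so they can be routed to human review. The judge names, scores, and the 0.3 spread limit are assumptions chosen for illustration.

```python
from statistics import median
from typing import Dict, Tuple


def aggregate_judges(scores: Dict[str, float], max_spread: float = 0.3) -> Tuple[float, bool]:
    """Return the median judge score and whether the judges disagree too much.

    scores maps a judge name to its score in [0, 1].
    """
    values = list(scores.values())
    if not values:
        raise ValueError("at least one judge score is required")
    spread = max(values) - min(values)
    return median(values), spread > max_spread


# Hypothetical scores from three judges for the same response.
score, needs_review = aggregate_judges({"judge-a": 0.9, "judge-b": 0.8, "judge-c": 0.2})
print(score, needs_review)
```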

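3. Combine human and automated evaluation

# Automated evaluation: fast, cheap, broad coverage
# Human evaluation: accurate, expensive, small samples

# Recommendations:
# - Automated evaluation: run daily and cover all requests
# - Human evaluation: sample 100 cases per week to validate the automated evaluation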
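
One way to use the weekly human sample is to measure how often the automated verdict matches the human one. The sketch below computes a simple agreement rate. The 0.5 pass threshold and the sample data are assumptions for illustration, and `agreement_rate` is a hypothetical helper name.

```python
from typing import Sequence


def agreement_rate(
    auto_scores: Sequence[float],
    human_labels: Sequence[bool],
    threshold: float = 0.5,
) -> float:
    """Fraction of samples where the automated pass/fail verdict matches the human label."""
    if len(auto_scores) != len(human_labels):
        raise ValueError("auto_scores and human_labels must have the same length")
    if not auto_scores:
        raise ValueError("at least one sample is required")
    matches = sum(
        (score >= threshold) == label
        for score, label in zip(auto_scores, human_labels)
    )
    return matches / len(auto_scores)


# Hypothetical sample: automated scores and whether a human judged the answer acceptable.
auto = [0.9, 0.8, 0.3, 0.6, 0.2]
human = [True, True, False, False, False]
rate = agreement_rate(auto, human)
print(rate)
```

A low agreement rate suggests the automated Eval, not the model, is what needs fixing first.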

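Summary

AI observability adds a fourth pillar, Evals, to the traditional three pillars of Metrics, Logs, and Traces.

The four pillars:

Metrics: system metrics + model metrics + business metrics
Logs: structured logs that record requests, responses, tokens, and cost
Traces: end-to-end tracing across RAG, LLM, and post-processing
Evals: quality evaluation that quantifies accuracy, relevance, and hallucination rate

Core value:

Cost visibility: track token consumption and cost in real time
Quality monitoring: automatically detect incorrect output and hallucinations
Troubleshooting: use traces to pinpoint bottlenecks quickly
Drift detection: automatically detect declining model performance and trigger retraining

Implementation advice:

Proceed in phases: from basic tracing to deep observability
Tool selection: small teams can use Helicone or Langfuse; large enterprises should run a complete stack
Evals first: build a quality evaluation pipeline from day one
Keep optimizing: monitor metrics, adjust thresholds, refine prompts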

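An AI system without observability is like an airplane without an instrument panel: you may be flying, but you don't know where you are headed or when you might crash.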

Observability for AI-Native Infrastructure: From Metrics to Evals