Chapter 09

Reasoning Cost Control: Budget Forcing and Caching

A reasoning model's token cost can be 10-50× that of a standard model. Systematic cost-control strategies are what make reasoning models viable in production.

Cost Breakdown

Where a reasoning model's costs come from:

- Input tokens (cheaper): system prompt + user message + conversation history
- Thinking tokens (expensive): the <thinking> content, typically 1,000-20,000 tokens, usually billed at the same rate as output tokens
- Output tokens (expensive): the final answer, typically 100-2,000 tokens

Example (claude-sonnet-4-6, assumed rates):

- Standard Q&A: 100 input + 500 output ≈ $0.002
- Reasoning mode: 100 input + 8,000 thinking + 500 output ≈ $0.035
- Cost multiplier: ~17.5×
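The arithmetic above can be captured in a small cost estimator. The per-token rates below are illustrative placeholders, not official pricing; with these assumed rates the multiplier comes out around 16×, in the same ballpark as the example:

```python
def request_cost(input_tokens: int, thinking_tokens: int, output_tokens: int,
                 input_rate: float = 3e-6, output_rate: float = 15e-6) -> float:
    """Estimate request cost in USD; thinking tokens bill at the output rate."""
    return (input_tokens * input_rate
            + (thinking_tokens + output_tokens) * output_rate)

plain = request_cost(100, 0, 500)         # standard Q&A
reasoning = request_cost(100, 8000, 500)  # reasoning mode
print(f"plain=${plain:.4f} reasoning=${reasoning:.4f} "
      f"multiplier={reasoning / plain:.1f}x")
```

Because thinking tokens bill at the output rate, the multiplier is dominated by the thinking budget: an 8,000-token budget against a 500-token answer is a ~16× blowup on its own.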

Budget Forcing Strategies

import anthropic

client = anthropic.Anthropic()

class ThinkingBudgetOptimizer:
    """Dynamically size the thinking budget by question difficulty."""

    BUDGET_MAP = {
        "trivial": 0,       # thinking disabled
        "easy":    2000,    # API minimum for budget_tokens is 1024
        "medium":  5000,
        "hard":    10000,
        "expert":  25000,
    }

    def classify_difficulty(self, question: str) -> str:
        # Simple rule-based classification (a small classifier model also works)
        word_count = len(question.split())  # crude: split() undercounts unspaced CJK text
        has_math = any(c in question for c in ['∫', '∑', '≥', '证明', '求解'])
        has_code = '```' in question or '代码' in question

        if word_count < 10:
            return "trivial"
        elif has_math and word_count > 50:
            return "hard"
        elif has_code:
            return "medium"
        elif word_count > 30:
            return "easy"
        else:
            return "trivial"

    def ask(self, question: str) -> str:
        difficulty = self.classify_difficulty(question)
        budget = self.BUDGET_MAP[difficulty]

        thinking_config = {"type": "enabled", "budget_tokens": budget} \
            if budget > 0 \
            else {"type": "disabled"}

        response = client.messages.create(
            model="claude-sonnet-4-6",
            # max_tokens must exceed budget_tokens; leave room for the answer
            max_tokens=budget + 4000,
            thinking=thinking_config,
            messages=[{"role": "user", "content": question}]
        )
        print(f"difficulty: {difficulty}, budget: {budget}")
        return next(b.text for b in response.content if b.type == "text")
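Two API constraints are worth enforcing in code like the above: Anthropic requires budget_tokens to be at least 1024 when thinking is enabled, and max_tokens must be strictly larger than budget_tokens. A minimal sketch of a parameter builder that guards both (the function name and the 4000-token answer reserve are our own choices):

```python
def build_request_params(budget: int, answer_tokens: int = 4000) -> dict:
    """Build max_tokens/thinking kwargs from a desired thinking budget."""
    if budget <= 0:
        # No thinking: max_tokens only needs to cover the visible answer
        return {"max_tokens": answer_tokens,
                "thinking": {"type": "disabled"}}
    budget = max(budget, 1024)  # API minimum for budget_tokens
    # max_tokens must exceed budget_tokens, so reserve room for the answer
    return {"max_tokens": budget + answer_tokens,
            "thinking": {"type": "enabled", "budget_tokens": budget}}

print(build_request_params(0))
print(build_request_params(500))   # clamped up to 1024
print(build_request_params(8000))
```

The returned dict can be splatted straight into client.messages.create(**params, ...), keeping the clamping logic out of the call site.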

Prompt Caching to Cut Input Costs

# Cache the system prompt (suited to long, fixed system prompts)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    system=[{
        "type": "text",
        "text": long_system_prompt,  # long system prompt (>1024 tokens)
        "cache_control": {"type": "ephemeral"}  # cache this block
    }],
    messages=[{"role": "user", "content": question}]
)

# Check cache usage
usage = response.usage
print(f"cache read: {usage.cache_read_input_tokens} tokens")
print(f"cache write: {usage.cache_creation_input_tokens} tokens")
# Cache-read tokens bill at roughly 10% of the normal input rate!
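To see how much caching saves over many requests, here is a back-of-the-envelope sketch. It assumes the commonly documented multipliers (cache writes at ~1.25× the base input rate, cache reads at ~0.1×) and a hypothetical base rate; check current pricing before relying on the exact numbers:

```python
def system_prompt_cost(n_requests: int, prompt_tokens: int,
                       base_rate: float = 3e-6,
                       write_mult: float = 1.25,
                       read_mult: float = 0.10) -> tuple:
    """Return (uncached, cached) USD cost of sending one system prompt n times."""
    uncached = n_requests * prompt_tokens * base_rate
    # First request writes the cache; the remaining n-1 read it
    cached = (prompt_tokens * base_rate * write_mult
              + (n_requests - 1) * prompt_tokens * base_rate * read_mult)
    return uncached, cached

uncached, cached = system_prompt_cost(100, 2000)
print(f"uncached=${uncached:.4f} cached=${cached:.4f} "
      f"savings={1 - cached / uncached:.0%}")
```

With these assumptions, a 2,000-token system prompt reused across 100 requests costs roughly a tenth of the uncached amount, since only the first request pays the (slightly higher) write rate.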

Difficulty Routing: Model Selection

class ModelRouter:
    """Route tasks to different models by type and complexity."""

    def route(self, task_type: str, complexity: str) -> tuple:
        # Returns (model, thinking_config)
        routes = {
            ("chat",   "low"):    ("claude-haiku-4-5-20251001", None),
            ("chat",   "medium"): ("claude-sonnet-4-6", None),
            ("reason", "medium"): ("claude-sonnet-4-6",
                                      {"type": "enabled", "budget_tokens": 5000}),
            ("reason", "high"):   ("claude-sonnet-4-6",
                                      {"type": "enabled", "budget_tokens": 20000}),
            ("math",   "high"):   ("claude-opus-4-6",
                                      {"type": "enabled", "budget_tokens": 30000}),
        }
        return routes.get((task_type, complexity),
                          ("claude-sonnet-4-6", None))

budget_tokens vs. Quality: The Trade-off Curve

[Chart: MATH-500 accuracy vs. budget_tokens (schematic). Accuracy climbs steeply from ~50% near zero budget, passes ~80% around 4K-8K tokens, and flattens out above 90% by 16K-32K.]

Observations:

- 0 → 2,000 tokens: rapid quality gains (the high-value range)
- 2,000 → 8,000 tokens: steady improvement
- 8,000 → 20,000 tokens: diminishing returns
- 20,000+: negligible marginal gains (helps only on the hardest problems)

Recommendation: 5,000-8,000 is the sweet spot for most scenarios.
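The diminishing-returns shape suggests a simple stopping rule: keep raising the budget only while the marginal accuracy gain per extra 1K tokens stays above a threshold. A sketch with made-up accuracy points that follow the curve's shape:

```python
def pick_budget(curve: dict, min_gain_per_1k: float = 0.01) -> int:
    """Pick the largest budget whose marginal accuracy gain is still worth it.

    curve: {budget_tokens: accuracy} — illustrative numbers, not benchmarks.
    """
    budgets = sorted(curve)
    chosen = budgets[0]
    for prev, cur in zip(budgets, budgets[1:]):
        # Accuracy gained per additional 1K thinking tokens on this segment
        gain_per_1k = (curve[cur] - curve[prev]) / ((cur - prev) / 1000)
        if gain_per_1k < min_gain_per_1k:
            break
        chosen = cur
    return chosen

curve = {0: 0.50, 2000: 0.72, 8000: 0.85, 20000: 0.90, 32000: 0.91}
print(pick_budget(curve))  # stops before the flat tail of the curve
```

On this illustrative curve the rule lands in the 8K region: the 8K→20K segment gains well under 1 accuracy point per 1K tokens, so the extra spend is not justified.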
Chapter Summary

The three pillars of cost control: difficulty tiering (different budgets for different questions), Prompt Caching (cache fixed system prompts), and model routing (small models for simple tasks). For most scenarios, budget=5000-8000 offers the best cost/quality trade-off. The next chapter builds a complete end-to-end application.