Cost Structure Analysis
Where a reasoning model's cost comes from:

- Input tokens (cheaper): system prompt + user message + conversation history
- Thinking tokens (more expensive): the <thinking> content, typically 1,000-20,000 tokens, usually billed at the same rate as output tokens
- Output tokens (more expensive): the final answer, typically 100-2,000 tokens
Example (claude-sonnet-4-6, assumed rates):

- Plain Q&A: 100 input + 500 output ≈ $0.002
- Reasoning mode: 100 input + 8,000 thinking + 500 output ≈ $0.035
- Cost multiple: ~17.5×
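The multiple above can be reproduced with a quick back-of-envelope calculation. The per-token rates below are hypothetical, chosen only to roughly match the assumed figures in the example; real rates vary by model and provider:

```python
# Hypothetical rates chosen to roughly reproduce the example figures above.
INPUT_RATE = 1.0 / 1_000_000   # $ per input token (assumed)
OUTPUT_RATE = 4.0 / 1_000_000  # $ per output/thinking token (assumed)

def request_cost(input_tokens: int, thinking_tokens: int, output_tokens: int) -> float:
    # Thinking tokens are billed at the output rate.
    return (input_tokens * INPUT_RATE
            + (thinking_tokens + output_tokens) * OUTPUT_RATE)

plain = request_cost(100, 0, 500)         # ≈ $0.0021
reasoning = request_cost(100, 8000, 500)  # ≈ $0.0341
print(f"multiple: {reasoning / plain:.1f}x")  # roughly 16x under these assumed rates
```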
Budget Forcing Strategy
````python
import anthropic

client = anthropic.Anthropic()

class ThinkingBudgetOptimizer:
    """Dynamically adjust the thinking budget based on question difficulty."""

    BUDGET_MAP = {
        "trivial": 0,      # thinking disabled
        "easy": 1024,      # API minimum for budget_tokens is 1024
        "medium": 5000,
        "hard": 10000,
        "expert": 25000,
    }

    def classify_difficulty(self, question: str) -> str:
        # Simple rule-based classification (a small model could also do this)
        word_count = len(question.split())
        has_math = any(k in question for k in ['∫', '∑', '≥', 'prove', 'solve'])
        has_code = '```' in question or 'code' in question
        if word_count < 10:
            return "trivial"
        elif has_math and word_count > 50:
            return "hard"
        elif has_code:
            return "medium"
        elif word_count > 30:
            return "easy"
        else:
            return "trivial"

    def ask(self, question: str) -> str:
        difficulty = self.classify_difficulty(question)
        budget = self.BUDGET_MAP[difficulty]
        thinking_config = ({"type": "enabled", "budget_tokens": budget}
                           if budget > 0
                           else {"type": "disabled"})
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=budget + 4000,  # max_tokens must exceed the thinking budget
            thinking=thinking_config,
            messages=[{"role": "user", "content": question}]
        )
        print(f"difficulty: {difficulty}, budget: {budget}")
        return next(b.text for b in response.content if b.type == "text")
````
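The rule-based classifier can be sanity-checked offline without an API key. This is a standalone restatement of the same rules (with the backtick-fence check simplified to the plain "code" keyword for brevity):

```python
# Standalone restatement of the rule-based difficulty classifier above.
def classify_difficulty(question: str) -> str:
    word_count = len(question.split())
    has_math = any(k in question for k in ['∫', '∑', '≥', 'prove', 'solve'])
    has_code = 'code' in question  # fenced-code-block check simplified out here
    if word_count < 10:
        return "trivial"
    elif has_math and word_count > 50:
        return "hard"
    elif has_code:
        return "medium"
    elif word_count > 30:
        return "easy"
    else:
        return "trivial"

print(classify_difficulty("What is 2+2?"))  # trivial
print(classify_difficulty("Review this code " + "x " * 20))  # medium
```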
Prompt Caching to Reduce Input Cost
```python
# Cache the system prompt (good for long, fixed system prompts)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 5000},
    system=[{
        "type": "text",
        "text": long_system_prompt,             # long system prompt (>1024 tokens)
        "cache_control": {"type": "ephemeral"}  # cache this block
    }],
    messages=[{"role": "user", "content": question}]
)

# Check cache usage
usage = response.usage
print(f"cache read:  {usage.cache_read_input_tokens} tokens")
print(f"cache write: {usage.cache_creation_input_tokens} tokens")
# Cache-hit tokens are billed at roughly 10% of the normal input rate!
```
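A rough model shows how much this saves. The 10% read rate comes from the note above; the 125% cache-write surcharge and the $3 per million input rate are assumptions for illustration only:

```python
# Back-of-envelope savings from caching a reused system prompt.
# Assumed multipliers: cache write bills at 125% of the input rate,
# cache read at 10%; rate itself is illustrative.
def uncached_input_cost(prompt_tokens: int, requests: int,
                        rate_per_token: float = 3e-6) -> float:
    return prompt_tokens * rate_per_token * requests

def cached_input_cost(prompt_tokens: int, requests: int,
                      rate_per_token: float = 3e-6) -> float:
    write = prompt_tokens * rate_per_token * 1.25                # first request writes the cache
    reads = prompt_tokens * rate_per_token * 0.10 * (requests - 1)  # the rest hit it
    return write + reads

# A 5,000-token system prompt reused across 100 requests:
print(f"uncached: ${uncached_input_cost(5000, 100):.2f}")  # $1.50
print(f"cached:   ${cached_input_cost(5000, 100):.2f}")    # ≈ $0.17
```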
Difficulty Routing: Model Selection
```python
class ModelRouter:
    """Route tasks to different models based on task type and complexity."""

    def route(self, task_type: str, complexity: str) -> tuple:
        # Returns (model, thinking_config)
        routes = {
            ("chat", "low"): ("claude-haiku-4-5-20251001", None),
            ("chat", "medium"): ("claude-sonnet-4-6", None),
            ("reason", "medium"): ("claude-sonnet-4-6",
                                   {"type": "enabled", "budget_tokens": 5000}),
            ("reason", "high"): ("claude-sonnet-4-6",
                                 {"type": "enabled", "budget_tokens": 20000}),
            ("math", "high"): ("claude-opus-4-6",
                               {"type": "enabled", "budget_tokens": 30000}),
        }
        return routes.get((task_type, complexity),
                          ("claude-sonnet-4-6", None))
```
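The key design choice is the lookup-with-fallback pattern: unknown (task, complexity) pairs degrade gracefully to a safe default instead of raising. A minimal standalone sketch, with the table trimmed to two entries for brevity:

```python
# Lookup-with-fallback routing table (trimmed to two entries for illustration).
ROUTES = {
    ("chat", "low"): ("claude-haiku-4-5-20251001", None),
    ("reason", "high"): ("claude-sonnet-4-6",
                         {"type": "enabled", "budget_tokens": 20000}),
}
DEFAULT = ("claude-sonnet-4-6", None)  # safe fallback for unknown pairs

def route(task_type: str, complexity: str) -> tuple:
    return ROUTES.get((task_type, complexity), DEFAULT)

model, thinking = route("reason", "high")
print(model, thinking["budget_tokens"])  # claude-sonnet-4-6 20000
print(route("summarize", "low"))         # falls back to DEFAULT
```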
The budget_tokens vs. Quality Trade-off
MATH-500 accuracy vs. budget_tokens (illustrative):

```
Accuracy
100% │
 90% │                         ▓▓▓▓▓▓▓▓▓▓▓▓
 80% │             ▓▓▓▓▓▓▓▓▓▓▓▓
 70% │      ▓▓▓▓▓▓▓
 60% │  ▓▓▓▓
 50% │▓▓
     └─────────────────────────────────────
      0   1K   2K   4K   8K  16K  32K   budget_tokens
```
Observations:
- 0 → 2,000 tokens: quality rises quickly (best value range)
- 2,000 → 8,000 tokens: steady improvement
- 8,000 → 20,000 tokens: diminishing returns
- 20,000+: marginal gains are tiny (only the hardest problems benefit)

Recommendation: 5,000-8,000 is the sweet spot for most scenarios.
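The sweet spot can also be picked programmatically: walk the measured accuracy curve and stop where the marginal gain per extra 1K thinking tokens falls below a threshold. A sketch; the curve values below just re-encode the illustrative chart, and the threshold is an arbitrary choice:

```python
# Illustrative (budget_tokens, accuracy) points re-encoding the chart above.
curve = [(0, 0.52), (1000, 0.62), (2000, 0.71), (4000, 0.78),
         (8000, 0.84), (16000, 0.87), (32000, 0.88)]

def sweet_spot(points, min_gain_per_1k=0.01):
    # Walk consecutive points; stop when the marginal accuracy gain per
    # extra 1K thinking tokens drops below the threshold.
    for (b0, a0), (b1, a1) in zip(points, points[1:]):
        if (a1 - a0) / ((b1 - b0) / 1000) < min_gain_per_1k:
            return b0  # the last budget that still paid for itself
    return points[-1][0]

print(sweet_spot(curve))  # 8000 under these illustrative numbers
```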
Chapter Summary

Three levers for cost control: difficulty tiering (different budgets for different questions), Prompt Caching (cache the fixed system prompt), and model routing (small models for simple tasks). For most scenarios a budget of 5,000-8,000 tokens is the best cost/quality trade-off. The next chapter walks through a complete end-to-end project.