第 9 章 · Agent · 工具调用 + ReAct + 多 Agent

一、Agent = LLM + Tools + Loop

user msg ─▶ LLM(带 tool schema) ─▶ tool calls ─▶ 执行工具 ─▶ tool 结果返回 LLM ─▶ 继续推理或最终回答

Agent 的核心差异点在"LLM 如何决定调用哪个工具":

FunctionAgent(也叫 FunctionCallingAgent):用 OpenAI/Anthropic/Gemini 的原生 tool_calls——LLM 输出结构化工具调用,准、稳、推荐
ReActAgent:LLM 输出自然语言的 Thought → Action → Observation——更通用但失败率高,适合不支持原生 tool 的小模型

二、最小 FunctionAgent

from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

async def get_weather(city: str) -> str:
    """获取某城市的当前天气。"""
    return f"{city} 今天 22 度,晴"

async def add(a: float, b: float) -> float:
    """两数相加。"""
    return a + b

agent = FunctionAgent(
    tools=[get_weather, add],
    llm=llm,
    system_prompt="你是一个能查天气和做算术的助理",
)

import asyncio
ans = asyncio.run(agent.run("北京天气怎么样?22 加 30 等于多少?"))
print(ans)

几个要点:

类型注解必须——LlamaIndex 把 Python function 转成 JSON schema 靠它
docstring 就是 tool description——LLM 靠它判断何时调用,写清楚
async 首选——并发能力大幅提升,同步函数也支持但会在 thread pool 里跑

三、FunctionTool:更精细的包装

from llama_index.core.tools import FunctionTool
from pydantic import BaseModel, Field

class WeatherInput(BaseModel):
    city: str = Field(..., description="城市名,中文")
    date: str = Field("today", description="日期 YYYY-MM-DD,默认今天")

async def get_weather(city: str, date: str = "today") -> str:
    return f"{city} {date}:晴"

weather_tool = FunctionTool.from_defaults(
    fn=get_weather,
    name="get_weather",
    description="查询城市天气,参数是城市名和日期",
    fn_schema=WeatherInput,
    return_direct=False,     # True 表示工具结果直接返回给用户,不走 LLM 合成
)

四、把 QueryEngine 变成 Tool:RAG × Agent

from llama_index.core.tools import QueryEngineTool

rag_tool = QueryEngineTool.from_defaults(
    query_engine=my_query_engine,
    name="company_kb",
    description="公司内部知识库。包含产品、政策、合同相关信息。回答任何内部问题应优先用此工具。",
)

agent = FunctionAgent(tools=[rag_tool, get_weather, add], llm=llm)
ans = await agent.run("我们的退款政策是什么?顺便告诉我北京天气。")

这就是 Agentic RAG——LLM 自己决定查知识库还是调 API,一个 agent 统一入口。

五、ReActAgent:思考-行动-观察

from llama_index.core.agent.workflow import ReActAgent

agent = ReActAgent(
    tools=[rag_tool, get_weather],
    llm=OpenAI(model="gpt-4o-mini"),
    max_iterations=10,
    verbose=True,
)

ans = await agent.run("查一下我们产品在 Acme 合同里的保修条款")

LLM 输出大概长这样:

Thought: 用户要查 Acme 合同里关于产品保修的条款,应该用 company_kb。
Action: company_kb
Action Input: {"query": "Acme 合同中的产品保修条款"}
Observation: 保修期 12 个月,覆盖...
Thought: 已经找到答案,可以回答。
Answer: 根据合同,保修期 12 个月,覆盖...

ReAct 的坑:LLM 有时胡编 Action 格式或者Observation 理解错。生产强烈推荐 FunctionAgent(原生 tool_calls),只在必须用不支持原生 tool 的本地小模型时上 ReAct。

六、Memory:让对话有记忆

from llama_index.core.memory import ChatMemoryBuffer, ChatSummaryMemoryBuffer

# A. 简单缓冲:超过 token 就丢旧消息
memory = ChatMemoryBuffer.from_defaults(token_limit=4000)

# B. 带摘要:旧消息自动 LLM 压缩成 summary
memory = ChatSummaryMemoryBuffer.from_defaults(
    token_limit=4000,
    llm=llm,
)

agent = FunctionAgent(tools=[...], llm=llm, memory=memory)

# 多轮对话
await agent.run("我叫小明")
ans = await agent.run("我叫什么?")    # "小明"

持久化 memory

from llama_index.core.memory import Memory

memory = Memory.from_defaults(
    session_id="user-42-convo-1",
    chat_store=RedisChatStore(redis_url="redis://localhost"),
    token_limit=8000,
)
# 每次 agent.run 后 memory 自动持久化,换进程能继续对话

七、流式返回 + 中间事件

handler = agent.run("帮我查一下 Q3 营收,并告诉我上海天气")

async for ev in handler.stream_events():
    from llama_index.core.agent.workflow import (
        AgentStream, ToolCall, ToolCallResult
    )
    if isinstance(ev, ToolCall):
        print(f"🔧 调用 {ev.tool_name}({ev.tool_kwargs})")
    elif isinstance(ev, ToolCallResult):
        print(f"📦 结果 {ev.tool_output}")
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

final = await handler

这些事件在前端做"AI 正在查 xxx..."提示非常有用——比干等十秒再出答案强十倍。

八、AgentWorkflow:多 Agent 协作

单 agent 工具一多,LLM 决策质量会下降(20+ 工具就开始乱调)。解决方案:拆成多个专业化 agent,一个总控 agent 路由。

from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent

research_agent = FunctionAgent(
    name="researcher",
    description="专门做信息搜集,会用内部 KB 和 Web 搜索",
    tools=[rag_tool, web_search_tool],
    system_prompt="你负责搜集信息。整理成笔记交给 writer。",
    llm=llm,
    can_handoff_to=["writer"],
)

writer_agent = FunctionAgent(
    name="writer",
    description="根据 researcher 提供的笔记写成正式答复",
    tools=[],
    system_prompt="你根据笔记写答复,引用要具体到来源。",
    llm=llm,
    can_handoff_to=["researcher"],      # 如果缺资料可以退回
)

workflow = AgentWorkflow(
    agents=[research_agent, writer_agent],
    root_agent="researcher",
    initial_state={"notes": []},
)

handler = workflow.run(user_msg="帮我写一份关于 Acme 合同要点的摘要")

研究员 + 作家这种两 agent pattern 是多 agent 里最朴素也最实用的——明确分工,交接条件清晰。

九、Context:agent 间共享状态

多 agent 工作时要共享"当前笔记""检索历史",用 Context:

from llama_index.core.workflow import Context

async def save_note(ctx: Context, note: str) -> str:
    """把一条笔记存入 agent 共享 context。"""
    async with ctx.store.edit_state() as state:
        state["notes"].append(note)
    return f"已存:{note}"

async def list_notes(ctx: Context) -> list[str]:
    """列出所有笔记。"""
    state = await ctx.store.get_state()
    return state["notes"]

# 两个 agent 都注册这俩工具,便能共用笔记池

十、常用 Tool 清单

LlamaHub 也有大量现成 Tool 包,常用的:

llama-index-tools-tavily-research / duckduckgo:Web 搜索
llama-index-tools-google:Gmail/Calendar/Drive
llama-index-tools-slack:收发 Slack 消息
llama-index-tools-database:SQL 查询
llama-index-tools-code-interpreter:跑 Python 代码
llama-index-tools-arxiv / wikipedia:学术/知识
llama-index-tools-requests:任意 HTTP
MCP Server:用 llama-index-tools-mcp 接入任何 MCP 工具

十一、人类介入(HITL)

from llama_index.core.workflow import InputRequiredEvent, HumanResponseEvent

async def refund_order(ctx: Context, order_id: str, amount: float) -> str:
    """给订单退款(需要人工审批)。"""
    response = await ctx.wait_for_event(
        HumanResponseEvent,
        waiter_event=InputRequiredEvent(
            prefix=f"即将给订单 {order_id} 退款 {amount},确认? yes/no"
        ),
    )
    if response.response.lower() == "yes":
        return "已退款"
    return "已取消"

Ch10 Workflow 会更深入讲 HITL——但 Agent 层就可以用 ctx.wait_for_event 直接等待外部事件。

十二、Chat Engine:对话式 RAG 的快捷方式

如果你只需要"带记忆的 RAG 问答",不需要完整 Agent,用 ChatEngine:

# A. Condense Question:每轮把历史 + 当前 query 压成独立问题
chat = index.as_chat_engine(chat_mode="condense_question", verbose=True)

# B. Context:每轮把检索结果塞进 system prompt,LLM 自己判断
chat = index.as_chat_engine(chat_mode="context")

# C. Best:自动组合,优先检索相关片段再合成
chat = index.as_chat_engine(chat_mode="best")

resp = chat.chat("双因素认证怎么配?")
resp = chat.chat("刚说的 TOTP 具体是啥?")   # 自动带上下文

十三、调试技巧

开 verbose=True——tool 调用、输入输出全打
接 Arize Phoenix:px.launch_app() 可以看 agent 的每一步决策树(Ch11)
LLM 老调用错 tool → description 写得不够具体,加"何时使用 / 何时不用"
工具死循环 → 设 max_iterations,ReActAgent 默认 10 轮
慢 → 看哪个 tool 慢,并发友好的 tool 用 async def,I/O 工具加缓存

十四、反模式

Tool 塞 30+ 个:LLM 注意力稀释,准确率崩。拆多 Agent。
Tool description 写一行"获取数据":LLM 瞎猜——写参数含义、举例、何时用何时不用。
Tool 返回整块 JSON:让 LLM 自己解析 vs. 工具里预处理成自然语言——后者稳得多。
没设 max_iterations:LLM 循环调工具不停,账单爆炸。
memory 不持久化:用户关闭浏览器再回来,agent 失忆。
同步函数硬塞 async agent:并发完全失效。能 async 就 async。
ReAct 用在 GPT-4o 上:放着原生 tool 不用,手写 Thought/Action 字符串靠 regex 解析——降智。
多 agent 嵌套超过 2 层:调试噩梦,消息流向不可控。大部分项目研究员+作家/路由+执行就够。

十五、本章小结

记住:
① FunctionAgent 是默认——OpenAI/Claude/Gemini 都支持原生 tool_calls,别折腾 ReAct。
② RAG + Agent 融合 = 把 QueryEngine 包成 QueryEngineTool,LLM 自己决定查不查。
③ 工具多了拆多 Agent——研究员 + 作家这种两角色就能解决大部分问题。
④ Memory 持久化、HITL、流式事件,都是生产必备——别在一次性脚本里才做这些。