多智能体系统设计：协作式 AI 的未来

引言

单个 AI Agent 的能力是有限的，而多智能体系统（Multi-Agent Systems）通过让多个专业化的 Agent 协同工作，可以解决更复杂的问题。本文将深入探讨如何设计和实现高效的多智能体系统。

为什么需要多智能体系统

单 Agent 的局限性

认知负担：单个 Agent 难以处理多领域知识
并行瓶颈：串行执行效率低下
专业深度不足：通才 Agent 在特定领域不够深入
错误累积：长链任务中错误会累积

多 Agent 的优势

单 Agent 方式：
用户 → 通用 Agent → 结果
      (需要处理所有子任务)

多 Agent 方式：
用户 → 协调 Agent
       ├→ 研究 Agent  → 数据
       ├→ 分析 Agent  → 洞察
       └→ 写作 Agent  → 报告
       (各司其职，并行执行)

Anthropic 的多智能体研究系统

Anthropic 构建了一个多智能体研究系统，其架构包括：

规划 Agent：根据用户查询制定研究计划
并行搜索 Agent：同时进行多个搜索任务
合成 Agent：整合所有搜索结果

1class MultiAgentResearchSystem:
2    def __init__(self):
3        self.planner = PlannerAgent()
4        self.searchers = [SearchAgent() for _ in range(5)]
5        self.synthesizer = SynthesizerAgent()
6
7    async def research(self, query):
8        # 1. 规划研究任务
9        plan = await self.planner.create_plan(query)
10
11        # 2. 并行搜索
12        search_tasks = [
13            searcher.search(subtask)
14            for searcher, subtask in zip(self.searchers, plan.subtasks)
15        ]
16        results = await asyncio.gather(*search_tasks)
17
18        # 3. 合成结果
19        final_report = await self.synthesizer.synthesize(results)
20
21        return final_report

多 Agent 架构模式

1. 管道模式（Pipeline）

Agent 按顺序处理，每个 Agent 的输出是下一个的输入：

1class PipelineMultiAgent:
2    def __init__(self):
3        self.agents = [
4            DataExtractionAgent(),
5            DataValidationAgent(),
6            DataTransformationAgent(),
7            DataStorageAgent()
8        ]
9
10    async def process(self, input_data):
11        result = input_data
12
13        for agent in self.agents:
14            # 记录每个阶段
15            stage_name = agent.__class__.__name__
16            print(f"阶段: {stage_name}")
17
18            # 执行当前 Agent
19            result = await agent.process(result)
20
21            # 验证输出
22            if not agent.validate_output(result):
23                raise PipelineError(f"{stage_name} 输出无效")
24
25        return result

适用场景：

数据处理管道
文档生成流程
多步骤转换任务

2. 并行模式（Parallel）

多个 Agent 同时执行，结果合并：

1class ParallelMultiAgent:
2    def __init__(self):
3        self.agents = {
4            'sentiment': SentimentAgent(),
5            'entities': EntityExtractionAgent(),
6            'topics': TopicModelingAgent(),
7            'summary': SummaryAgent()
8        }
9
10    async def analyze(self, text):
11        # 并行执行所有分析
12        tasks = {
13            name: agent.analyze(text)
14            for name, agent in self.agents.items()
15        }
16
17        # 等待所有结果
18        results = {}
19        for name, task in tasks.items():
20            try:
21                results[name] = await task
22            except Exception as e:
23                print(f"Agent {name} 失败: {e}")
24                results[name] = None
25
26        return results

适用场景：

多角度分析
独立任务批处理
资源密集型操作

3. 层级模式（Hierarchical）

管理 Agent 协调多个工作 Agent：

1class HierarchicalMultiAgent:
2    def __init__(self):
3        self.manager = ManagerAgent()
4        self.workers = {
5            'research': ResearchAgent(),
6            'code': CodingAgent(),
7            'test': TestingAgent(),
8            'document': DocumentationAgent()
9        }
10
11    async def execute_project(self, requirements):
12        # 管理者制定计划
13        plan = await self.manager.create_plan(requirements)
14
15        for task in plan.tasks:
16            # 管理者分配任务
17            assigned_agent = self.manager.assign_task(task, self.workers)
18
19            # 工作者执行任务
20            result = await assigned_agent.execute(task)
21
22            # 管理者验证结果
23            if await self.manager.validate(result, task):
24                await self.manager.mark_complete(task)
25            else:
26                # 重新分配或调整策略
27                await self.manager.handle_failure(task, result)
28
29        return await self.manager.finalize_project()

适用场景：

复杂项目管理
动态任务分配
需要决策协调的场景

4. 市场模式（Market-Based）

Agent 通过竞标获取任务：

1class MarketBasedMultiAgent:
2    def __init__(self):
3        self.agents = [
4            SpecializedAgent(specialty="web_dev"),
5            SpecializedAgent(specialty="data_science"),
6            SpecializedAgent(specialty="devops"),
7        ]
8
9    async def execute_task(self, task):
10        # 1. 发布任务
11        bids = []
12        for agent in self.agents:
13            # 每个 Agent 评估任务并出价
14            bid = await agent.bid_for_task(task)
15            if bid:
16                bids.append((agent, bid))
17
18        # 2. 选择最佳 Agent
19        if not bids:
20            raise NoAgentAvailableError("没有 Agent 能处理此任务")
21
22        # 根据能力评分和成本选择
23        best_agent, best_bid = max(bids, key=lambda x: x[1].score)
24
25        # 3. 执行任务
26        result = await best_agent.execute(task)
27
28        return result

适用场景：

资源受限环境
需要负载均衡
Agent 能力差异大

5. 团队协作模式（Collaborative）

Agent 共享信息，协同完成任务：

1class CollaborativeMultiAgent:
2    def __init__(self):
3        self.shared_memory = SharedMemory()
4        self.agents = [
5            CollaborativeAgent("researcher", self.shared_memory),
6            CollaborativeAgent("analyst", self.shared_memory),
7            CollaborativeAgent("writer", self.shared_memory)
8        ]
9
10    async def collaborate(self, task):
11        # 初始化共享状态
12        await self.shared_memory.set("task", task)
13        await self.shared_memory.set("progress", {})
14
15        # Agent 持续协作直到完成
16        while not await self.is_task_complete():
17            for agent in self.agents:
18                # Agent 读取共享信息
19                context = await self.shared_memory.get_context()
20
21                # Agent 决定是否采取行动
22                if await agent.should_act(context):
23                    # 执行行动
24                    action_result = await agent.act(context)
25
26                    # 更新共享内存
27                    await self.shared_memory.update(
28                        agent.name,
29                        action_result
30                    )
31
32        return await self.shared_memory.get("final_result")
33
34class CollaborativeAgent:
35    def __init__(self, name, shared_memory):
36        self.name = name
37        self.memory = shared_memory
38
39    async def should_act(self, context):
40        # 根据上下文判断是否需要行动
41        prompt = f"""
42        你是 {self.name} Agent。
43
44        当前任务进度：{context}
45
46        基于其他 Agent 的工作，你现在需要采取行动吗？
47        如果需要，你应该做什么？
48
49        回答格式：
50        {{"should_act": true/false, "reason": "原因"}}
51        """
52
53        response = await self.call_claude(prompt)
54        return response["should_act"]

适用场景：

需要紧密协作
信息高度关联
迭代优化任务

Agent 间通信机制

1. 消息队列

1import asyncio
2from collections import deque
3
4class MessageQueue:
5    def __init__(self):
6        self.queues = {}
7
8    def create_queue(self, agent_id):
9        self.queues[agent_id] = deque()
10
11    async def send(self, from_agent, to_agent, message):
12        if to_agent not in self.queues:
13            self.create_queue(to_agent)
14
15        self.queues[to_agent].append({
16            'from': from_agent,
17            'content': message,
18            'timestamp': time.time()
19        })
20
21    async def receive(self, agent_id, timeout=None):
22        queue = self.queues.get(agent_id, deque())
23
24        if timeout:
25            start = time.time()
26            while not queue and (time.time() - start) < timeout:
27                await asyncio.sleep(0.1)
28
29        if queue:
30            return queue.popleft()
31        return None
32
33# 使用
34message_queue = MessageQueue()
35
36# Agent A 发送消息给 Agent B
37await message_queue.send(
38    from_agent="agent_a",
39    to_agent="agent_b",
40    message={"task": "分析数据", "data": data}
41)
42
43# Agent B 接收消息
44message = await message_queue.receive("agent_b", timeout=5)

2. 共享状态存储

1class SharedState:
2    def __init__(self):
3        self.state = {}
4        self.locks = {}
5        self.history = []
6
7    async def set(self, key, value, agent_id):
8        # 记录变更历史
9        self.history.append({
10            'key': key,
11            'value': value,
12            'agent': agent_id,
13            'timestamp': time.time()
14        })
15
16        self.state[key] = value
17
18    async def get(self, key):
19        return self.state.get(key)
20
21    async def lock(self, key, agent_id):
22        """锁定资源，防止竞争"""
23        if key in self.locks:
24            raise ResourceLockedError(f"{key} 已被 {self.locks[key]} 锁定")
25
26        self.locks[key] = agent_id
27
28    async def unlock(self, key, agent_id):
29        if self.locks.get(key) == agent_id:
30            del self.locks[key]
31
32    async def get_history(self, key=None):
33        """获取变更历史"""
34        if key:
35            return [h for h in self.history if h['key'] == key]
36        return self.history

3. 事件驱动

1class EventBus:
2    def __init__(self):
3        self.subscribers = {}
4
5    def subscribe(self, event_type, agent, callback):
6        if event_type not in self.subscribers:
7            self.subscribers[event_type] = []
8
9        self.subscribers[event_type].append({
10            'agent': agent,
11            'callback': callback
12        })
13
14    async def publish(self, event_type, data):
15        if event_type not in self.subscribers:
16            return
17
18        # 通知所有订阅者
19        tasks = [
20            subscriber['callback'](data)
21            for subscriber in self.subscribers[event_type]
22        ]
23
24        await asyncio.gather(*tasks)
25
26# 使用
27event_bus = EventBus()
28
29# Agent 订阅事件
30def on_data_ready(data):
31    print(f"Agent received data: {data}")
32
33event_bus.subscribe("data_ready", "analyst_agent", on_data_ready)
34
35# 发布事件
36await event_bus.publish("data_ready", {"records": 1000})

协调策略

1. 中心化协调

一个 Agent 负责协调所有其他 Agent：

1class CentralCoordinator:
2    def __init__(self, agents):
3        self.agents = agents
4        self.task_queue = asyncio.Queue()
5        self.results = {}
6
7    async def coordinate(self, project):
8        # 分解任务
9        tasks = await self.decompose_project(project)
10
11        # 添加到队列
12        for task in tasks:
13            await self.task_queue.put(task)
14
15        # 分配给 Agent
16        workers = [
17            self.worker_loop(agent)
18            for agent in self.agents
19        ]
20
21        # 等待完成
22        await asyncio.gather(*workers)
23
24        return self.compile_results()
25
26    async def worker_loop(self, agent):
27        while True:
28            try:
29                task = await asyncio.wait_for(
30                    self.task_queue.get(),
31                    timeout=1.0
32                )
33            except asyncio.TimeoutError:
34                break
35
36            result = await agent.execute(task)
37            self.results[task.id] = result
38            self.task_queue.task_done()

2. 分布式协调

Agent 自主决策和协调：

1class DistributedAgent:
2    def __init__(self, agent_id, peers):
3        self.id = agent_id
4        self.peers = peers
5        self.local_state = {}
6
7    async def coordinate(self, task):
8        # 1. 评估自己能否处理
9        can_handle = await self.can_handle(task)
10
11        if can_handle:
12            return await self.execute(task)
13
14        # 2. 寻求 peer 帮助
15        for peer in self.peers:
16            if await peer.can_handle(task):
17                # 委托给 peer
18                result = await peer.execute(task)
19                # 更新本地状态
20                await self.learn_from(peer, task, result)
21                return result
22
23        # 3. 协作完成
24        return await self.collaborative_execute(task)
25
26    async def collaborative_execute(self, task):
27        # 分解任务
28        subtasks = await self.decompose(task)
29
30        # 分配给合适的 peer
31        assignments = {}
32        for subtask in subtasks:
33            best_peer = await self.find_best_peer(subtask)
34            assignments[subtask] = best_peer
35
36        # 并行执行
37        results = await asyncio.gather(*[
38            peer.execute(subtask)
39            for subtask, peer in assignments.items()
40        ])
41
42        # 整合结果
43        return await self.integrate_results(results)

错误处理和容错

1. Agent 故障恢复

1class FaultTolerantMultiAgent:
2    def __init__(self):
3        self.agents = []
4        self.backup_agents = []
5        self.checkpoints = {}
6
7    async def execute_with_recovery(self, task):
8        for attempt in range(3):
9            try:
10                # 保存检查点
11                checkpoint = await self.save_checkpoint()
12
13                # 执行任务
14                result = await self.execute(task)
15
16                # 验证结果
17                if await self.validate_result(result):
18                    return result
19                else:
20                    raise ValidationError("结果验证失败")
21
22            except Exception as e:
23                print(f"尝试 {attempt + 1} 失败: {e}")
24
25                # 恢复到检查点
26                await self.restore_checkpoint(checkpoint)
27
28                # 切换到备用 Agent
29                if attempt < 2:
30                    await self.switch_to_backup()
31
32        raise MaxRetriesError("任务执行失败")
33
34    async def switch_to_backup(self):
35        """切换到备用 Agent"""
36        if self.backup_agents:
37            backup = self.backup_agents.pop(0)
38            self.agents.insert(0, backup)

2. 部分失败处理

1async def partial_failure_handling(agents, task):
2    """处理部分 Agent 失败的情况"""
3
4    results = []
5    failures = []
6
7    for agent in agents:
8        try:
9            result = await agent.execute(task)
10            results.append(result)
11        except Exception as e:
12            failures.append({
13                'agent': agent,
14                'error': str(e)
15            })
16
17    # 如果大多数成功，继续
18    if len(results) > len(agents) / 2:
19        # 使用多数投票或平均
20        final_result = aggregate_results(results)
21        return final_result
22    else:
23        # 失败太多，重试或报错
24        raise PartialFailureError(f"只有 {len(results)}/{len(agents)} 成功")

性能优化

1. Agent 池管理

1class AgentPool:
2    def __init__(self, agent_class, pool_size=5):
3        self.agent_class = agent_class
4        self.pool_size = pool_size
5        self.available = asyncio.Queue()
6        self.in_use = set()
7
8        # 初始化池
9        for _ in range(pool_size):
10            agent = agent_class()
11            self.available.put_nowait(agent)
12
13    async def acquire(self, timeout=None):
14        """获取可用 Agent"""
15        agent = await asyncio.wait_for(
16            self.available.get(),
17            timeout=timeout
18        )
19        self.in_use.add(agent)
20        return agent
21
22    async def release(self, agent):
23        """释放 Agent 回池"""
24        if agent in self.in_use:
25            self.in_use.remove(agent)
26            await agent.reset()  # 重置状态
27            await self.available.put(agent)
28
29    async def execute_with_pool(self, tasks):
30        """使用池执行多个任务"""
31        async def execute_task(task):
32            agent = await self.acquire()
33            try:
34                return await agent.execute(task)
35            finally:
36                await self.release(agent)
37
38        return await asyncio.gather(*[
39            execute_task(task) for task in tasks
40        ])

2. 负载均衡

1class LoadBalancer:
2    def __init__(self, agents):
3        self.agents = agents
4        self.load_stats = {agent: 0 for agent in agents}
5
6    async def assign_task(self, task):
7        # 找到负载最低的 Agent
8        least_loaded = min(
9            self.agents,
10            key=lambda a: self.load_stats[a]
11        )
12
13        # 增加负载计数
14        self.load_stats[least_loaded] += 1
15
16        try:
17            result = await least_loaded.execute(task)
18            return result
19        finally:
20            # 减少负载计数
21            self.load_stats[least_loaded] -= 1

3. 结果缓存

1class CachedMultiAgent:
2    def __init__(self):
3        self.cache = {}
4        self.cache_hits = 0
5        self.cache_misses = 0
6
7    async def execute_with_cache(self, task):
8        # 生成缓存键
9        cache_key = self.generate_cache_key(task)
10
11        # 检查缓存
12        if cache_key in self.cache:
13            self.cache_hits += 1
14            return self.cache[cache_key]
15
16        # 缓存未命中，执行任务
17        self.cache_misses += 1
18        result = await self.execute(task)
19
20        # 存入缓存
21        self.cache[cache_key] = result
22
23        return result
24
25    def get_cache_stats(self):
26        total = self.cache_hits + self.cache_misses
27        hit_rate = self.cache_hits / total if total > 0 else 0
28        return {
29            'hit_rate': hit_rate,
30            'hits': self.cache_hits,
31            'misses': self.cache_misses
32        }

评估多 Agent 系统

关键指标

1class MultiAgentMetrics:
2    def __init__(self):
3        self.metrics = {
4            'throughput': [],      # 吞吐量
5            'latency': [],         # 延迟
6            'success_rate': [],    # 成功率
7            'agent_utilization': {},  # Agent 利用率
8            'communication_overhead': [],  # 通信开销
9            'coordination_time': []  # 协调时间
10        }
11
12    async def measure_execution(self, multi_agent_system, task):
13        start_time = time.time()
14
15        # 执行任务
16        try:
17            result = await multi_agent_system.execute(task)
18            success = True
19        except Exception as e:
20            result = None
21            success = False
22
23        # 记录指标
24        latency = time.time() - start_time
25        self.metrics['latency'].append(latency)
26        self.metrics['success_rate'].append(1 if success else 0)
27
28        # 收集 Agent 利用率
29        for agent in multi_agent_system.agents:
30            agent_id = agent.id
31            if agent_id not in self.metrics['agent_utilization']:
32                self.metrics['agent_utilization'][agent_id] = []
33
34            utilization = agent.get_utilization()
35            self.metrics['agent_utilization'][agent_id].append(utilization)
36
37        return result
38
39    def get_summary(self):
40        return {
41            'avg_latency': np.mean(self.metrics['latency']),
42            'success_rate': np.mean(self.metrics['success_rate']),
43            'avg_agent_utilization': {
44                agent_id: np.mean(utils)
45                for agent_id, utils in self.metrics['agent_utilization'].items()
46            }
47        }

实战案例：自动化软件开发系统

1class AutomatedDevSystem:
2    """多 Agent 协作的自动化开发系统"""
3
4    def __init__(self):
5        # 初始化各个专业 Agent
6        self.product_manager = ProductManagerAgent()
7        self.architect = ArchitectAgent()
8        self.developers = [DeveloperAgent(f"dev_{i}") for i in range(3)]
9        self.tester = TestingAgent()
10        self.reviewer = CodeReviewAgent()
11
12        # 共享工作空间
13        self.workspace = SharedWorkspace()
14
15    async def develop_feature(self, feature_request):
16        # 1. 产品经理分析需求
17        requirements = await self.product_manager.analyze(
18            feature_request,
19            self.workspace
20        )
21
22        # 2. 架构师设计方案
23        design = await self.architect.design(
24            requirements,
25            self.workspace
26        )
27
28        # 3. 开发者并行实现
29        implementation_tasks = design.split_into_tasks()
30        implementations = await asyncio.gather(*[
31            developer.implement(task, self.workspace)
32            for developer, task in zip(self.developers, implementation_tasks)
33        ])
34
35        # 4. 代码审查
36        review_results = await asyncio.gather(*[
37            self.reviewer.review(impl, self.workspace)
38            for impl in implementations
39        ])
40
41        # 5. 修改代码（如果需要）
42        for impl, review in zip(implementations, review_results):
43            if review.needs_changes:
44                await impl.developer.fix_issues(review.issues, self.workspace)
45
46        # 6. 测试
47        test_results = await self.tester.test_feature(
48            implementations,
49            self.workspace
50        )
51
52        # 7. 最终验证
53        if test_results.all_passed:
54            return await self.finalize_feature(implementations)
55        else:
56            # 返回开发者修复
57            return await self.fix_and_retry(test_results.failures)
58
59class ProductManagerAgent:
60    async def analyze(self, feature_request, workspace):
61        """分析需求，生成用户故事和验收标准"""
62        prompt = f"""
63        作为产品经理，请分析以下功能请求：
64
65        {feature_request}
66
67        请提供：
68        1. 用户故事
69        2. 验收标准
70        3. 技术约束
71        4. 优先级评估
72        """
73
74        response = await self.call_claude(prompt)
75        requirements = self.parse_requirements(response)
76
77        # 保存到共享工作空间
78        await workspace.set("requirements", requirements)
79
80        return requirements

最佳实践总结

明确职责分工：每个 Agent 有清晰的专长
选择合适架构：根据任务特点选择协作模式
设计通信机制：确保 Agent 间高效通信
实现容错机制：处理 Agent 故障和部分失败
性能优化：Agent 池、负载均衡、缓存
监控和评估：追踪关键指标，持续优化
测试协作：验证 Agent 间协作的正确性

挑战与解决方案

挑战 1：复合错误

多个 Agent 协作时，错误会累积和传播。

解决方案：

在关键节点设置检查点
实现错误隔离机制
提供降级方案

挑战 2：通信开销

Agent 间频繁通信可能影响性能。

解决方案：

批量处理消息
使用异步通信
缓存共享状态

挑战 3：协调复杂度

多 Agent 协调逻辑复杂。

解决方案：

使用成熟的协调模式
实现清晰的协议
提供可视化工具

未来展望

多智能体系统的发展方向：

自适应协作：Agent 动态调整协作策略
学习型系统：从历史协作中学习改进
大规模部署：支持数百个 Agent 协同
跨组织协作：不同系统的 Agent 互操作

总结

多智能体系统通过专业化分工和协同合作，能够处理单个 Agent 无法完成的复杂任务。关键是选择合适的架构模式、设计高效的通信机制、实现可靠的容错策略，并持续监控和优化系统性能。

引言

为什么需要多智能体系统

单 Agent 的局限性

多 Agent 的优势

Anthropic 的多智能体研究系统

多 Agent 架构模式

1. 管道模式（Pipeline）

2. 并行模式（Parallel）

3. 层级模式（Hierarchical）

4. 市场模式（Market-Based）

5. 团队协作模式（Collaborative）

Agent 间通信机制

1. 消息队列

2. 共享状态存储

3. 事件驱动

协调策略

1. 中心化协调

2. 分布式协调

错误处理和容错

1. Agent 故障恢复

2. 部分失败处理

性能优化

1. Agent 池管理

2. 负载均衡

3. 结果缓存

评估多 Agent 系统

关键指标

实战案例：自动化软件开发系统

最佳实践总结

挑战与解决方案

挑战 1：复合错误

挑战 2：通信开销

挑战 3：协调复杂度

未来展望

总结

参考资源

相关文章

Agent 设计依然很难

Agent Skills 和工具使用：构建专业化 AI 能力

Claude 4 时代的 Prompt Engineering：从入门到精通