Context Engineering: From Prompts to Memory Management
The paradigm shift proposed by Andrej Karpathy: treating the large language model as the core concept of a computing system
In 2024, Andrej Karpathy proposed a striking analogy: LLM = CPU, Context Window = RAM, Engineer = OS. This framing has transformed how we think about model interaction. Instead of concentrating on carefully crafted prompts (prompt engineering), attention shifts to context engineering: the systems problem of dynamically marshalling information within limited "memory." The context window is a finite resource, just like a computer's RAM, and the engineer's job is to manage it the way an operating system would, ensuring that at every moment the most important information stays within the model's "view."
A complete context system is composed of several independent "layers," each with a clear purpose and lifecycle:
- Task Description: the system-level statement of the goal, usually static. It defines the basic frame of what the model should do.
- Few-shot Examples: concrete examples demonstrating the expected output format and reasoning patterns. Research suggests 3-5 high-quality examples can lift performance significantly.
- RAG Documents: background knowledge retrieved dynamically based on the user's query, including document fragments, API documentation, and knowledge-base entries.
- Tool Specifications: complete definitions of the available tools (function signatures, parameters, return values, error handling).
- State/History: conversation history, intermediate results, the previous step's output. This is the core of the "memory."
- Multimodal Context: non-text information such as images, tables, and code blocks.
A traditional static system prompt cannot adapt to changing task demands. The modern approach is conditional dynamic loading: assembling the system prompt on the fly according to each request's priorities:
// Pseudocode: dynamic system-prompt assembly
class DynamicPromptBuilder {
  constructor(tokenBudget = 8000) {
    this.tokenBudget = tokenBudget;
    this.layers = [];
  }

  addLayer(name, content, priority, condition = null) {
    // condition is a predicate deciding whether this layer loads
    this.layers.push({
      name, content, priority,
      condition, tokens: countTokens(content)
    });
  }

  build(context = {}) {
    // 1. Evaluate every layer's condition
    let active = this.layers
      .filter(l => !l.condition || l.condition(context))
      .sort((a, b) => b.priority - a.priority);

    // 2. Pack by priority until the token budget is reached
    let result = [];
    let used = 0;
    for (const layer of active) {
      if (used + layer.tokens <= this.tokenBudget) {
        result.push(layer.content);
        used += layer.tokens;
      }
    }
    return result.join('\n---\n');
  }
}
// Usage example
const builder = new DynamicPromptBuilder(8000);
builder.addLayer(
  'base_instructions',
  'You are a code review agent...',
  100 // highest priority
);
builder.addLayer(
  'framework_context',
  'This is a React project. Framework-specific patterns...',
  80,
  (ctx) => ctx.projectType === 'react'
);
builder.addLayer(
  'rag_documents',
  retrievedDocs,
  60,
  (ctx) => ctx.queryType === 'knowledge'
);
const systemPrompt = builder.build({ projectType: 'react', queryType: 'knowledge' });

The token budget is a hard constraint. Managing it efficiently takes a combination of strategies:
Models make mistakes during long-running tasks, especially at decision points. Event-Driven Reminders are a technique for automatically injecting guidance at critical moments:
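The mechanics can be sketched in a few lines; the rule names, triggers, and messages below are invented for illustration and do not come from any particular framework:

```python
# Minimal sketch of event-driven reminders: rules watch the agent's state and
# inject guidance text into the context when their trigger fires.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reminder:
    name: str
    trigger: Callable[[dict], bool]   # predicate over the agent state
    message: str                      # guidance injected when triggered

REMINDERS = [
    Reminder("tool_loop", lambda s: s.get("same_tool_calls", 0) >= 3,
             "You have called the same tool 3 times. Re-read the task and change strategy."),
    Reminder("budget_low", lambda s: s.get("tokens_left", 10**9) < 1000,
             "Token budget is nearly exhausted. Summarize and finish."),
]

def fire_reminders(state: dict) -> list[str]:
    """Return the reminder messages whose triggers match the current state."""
    return [r.message for r in REMINDERS if r.trigger(state)]

msgs = fire_reminders({"same_tool_calls": 3, "tokens_left": 5000})
```

The reminder text lands in the next context build just like any other layer, so it costs tokens only when a trigger actually fires.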
Given a fixed context window, compressing information is unavoidable. Traditional methods (such as naive summarization) lose detail. Modern approaches apply an entropy-reduction principle: preferentially keep the high-information-density parts and drop the redundant ones.
A key insight: not all tokens are equal. A code line containing a critical variable assignment (5 tokens) can be worth more than a paragraph of explanatory prose (20 tokens). Information-theoretic measures such as mutual information can estimate each token's contribution to the final task, and compression then follows that score.
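As a toy illustration of scoring-based compression, the sketch below uses a crude unique-token density proxy in place of mutual information (which would require a task model):

```python
# Score each line by an information-density proxy (unique tokens per token)
# and keep the highest-scoring lines until a token budget is met. A real
# system would score tokens by mutual information with the task; this
# heuristic only illustrates that "not all tokens are equal".

def density(line: str) -> float:
    toks = line.split()
    return len(set(toks)) / len(toks) if toks else 0.0

def compress(text: str, budget: int) -> str:
    scored = sorted(text.splitlines(), key=density, reverse=True)
    kept, used = [], 0
    for line in scored:
        cost = len(line.split())
        if used + cost <= budget:
            kept.append(line)
            used += cost
    # restore the original order for readability
    return "\n".join(l for l in text.splitlines() if l in kept)

doc = "x = load_config(path)\nthe the the the the\nreturn x"
out = compress(doc, budget=4)
```

Here the dense code line survives while the repetitive filler is dropped, even though both fit individually.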
When a system has 100 available tools, putting every tool specification into the context creates enormous overhead. A progressive-disclosure strategy introduces tools gradually as the conversation advances:
- Initial stage: list only simplified descriptions of the 5 most commonly used tools.
- Middle stage: when the user mentions a tool, load that tool's full specification.
- Advanced stage: surface composition patterns and best practices for advanced tools.
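The staged policy above can be sketched as a lookup that escalates from summary to full spec; the tool names and specs are hypothetical:

```python
# Progressive tool disclosure: show short descriptions of a small core set,
# and swap in a tool's full specification only once it has been mentioned.

TOOLS = {
    "search":  {"summary": "search(query) -> results",
                "full_spec": "search(query: str, top_k: int = 10) -> list[dict] ..."},
    "deploy":  {"summary": "deploy(service) -> status",
                "full_spec": "deploy(service: str, env: str, rollback: bool = True) ..."},
    "profile": {"summary": "profile(fn) -> report",
                "full_spec": "profile(fn: str, duration_s: int = 30) ..."},
}
CORE = ["search"]  # shown from the start

def visible_tools(mentioned: set[str]) -> dict[str, str]:
    out = {}
    for name, spec in TOOLS.items():
        if name in mentioned:
            out[name] = spec["full_spec"]   # escalate to the full spec
        elif name in CORE:
            out[name] = spec["summary"]     # simplified description only
    return out

view = visible_tools({"deploy"})
```

Tools that are neither core nor mentioned cost zero context tokens until the conversation needs them.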
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | Optimizing the wording of a single prompt | Designing and optimizing the whole context ecosystem |
| Perspective | Static, one-shot | Dynamic, adaptive, multi-layered |
| Time horizon | A single request | Long-term strategy across turns and sessions |
| Resource management | Best-effort, no hard budget | Precise token budgets and priority management |
| Adaptivity | Manual tuning, usually trial-and-error | Automated adjustment from conditions and context signals |
| Scalability | Hard to extend to complex multi-step tasks | Built for long-chain tasks and multi-agent systems |
// TypeScript: a complete context-management framework
import { EventEmitter } from 'events';

interface ContextLayer {
  id: string;
  content: string;
  priority: number;
  tokens: number;
  condition?: (state: AppState) => boolean;
  refreshInterval?: number; // ms, for dynamic refresh
  version: number;
}

interface TokenBudget {
  total: number;
  reserved: Map<string, number>; // tokens reserved for specific layers
  used: number;
}

class ContextManager {
  private layers: Map<string, ContextLayer> = new Map();
  private budget: TokenBudget;
  private eventEmitter = new EventEmitter();

  constructor(totalTokens: number) {
    this.budget = {
      total: totalTokens,
      reserved: new Map(),
      used: 0
    };
  }

  registerLayer(layer: ContextLayer): void {
    this.layers.set(layer.id, layer);
    if (layer.refreshInterval) {
      setInterval(() => this.refreshLayer(layer.id), layer.refreshInterval);
    }
  }

  private refreshLayer(layerId: string): void {
    const layer = this.layers.get(layerId);
    if (layer) {
      layer.version++;
      this.eventEmitter.emit('layer-updated', { layerId, version: layer.version });
    }
  }

  buildContext(state: AppState): {
    context: string;
    tokenUsage: number;
    excluded: string[];
  } {
    // 1. Evaluate every layer's condition
    const activeLayers = Array.from(this.layers.values())
      .filter(l => !l.condition || l.condition(state))
      .sort((a, b) => b.priority - a.priority);

    // 2. Allocate tokens by priority
    const context: string[] = [];
    let tokenUsage = 0;
    const excluded: string[] = [];
    const reserved = this.budget.reserved;

    for (const layer of activeLayers) {
      const reservedTokens = reserved.get(layer.id) || 0;
      const available = this.budget.total - tokenUsage;
      if (reservedTokens > 0 && available >= layer.tokens) {
        // This layer has a reservation: include it whenever it fits
        context.push(`[${layer.id}]`, layer.content);
        tokenUsage += layer.tokens;
      } else if (reservedTokens === 0 && available >= layer.tokens) {
        // No reservation: include it if capacity allows
        context.push(`[${layer.id}]`, layer.content);
        tokenUsage += layer.tokens;
      } else {
        excluded.push(layer.id);
      }
    }

    this.budget.used = tokenUsage; // keep utilization reporting accurate

    return {
      context: context.join('\n---\n'),
      tokenUsage,
      excluded
    };
  }

  getUtilization(): number {
    return (this.budget.used / this.budget.total) * 100;
  }

  setReservedTokens(layerId: string, tokens: number): void {
    this.budget.reserved.set(layerId, tokens);
  }

  getExcludedLayers(): string[] {
    // IDs of layers that would not fit on top of the last build's usage
    return Array.from(this.layers.values())
      .filter(l => this.budget.used + l.tokens > this.budget.total)
      .map(l => l.id);
  }
}
// Usage example
const manager = new ContextManager(8192);

manager.registerLayer({
  id: 'system_instructions',
  content: 'You are a code generation assistant...',
  priority: 100,
  tokens: 150,
  version: 1
});

manager.registerLayer({
  id: 'user_project_context',
  content: 'This project uses React 18, TypeScript, Vite...',
  priority: 90,
  tokens: 300,
  condition: (state) => state.projectType === 'web',
  version: 1
});

manager.registerLayer({
  id: 'relevant_docs',
  content: retrievedDocumentation,
  priority: 70,
  tokens: 2000,
  refreshInterval: 30000, // refresh every 30 seconds
  version: 1
});

manager.registerLayer({
  id: 'conversation_history',
  content: lastNMessages(10),
  priority: 85,
  tokens: 1500,
  refreshInterval: 5000, // refresh every 5 seconds
  version: 1
});

const result = manager.buildContext(currentState);
console.log(`Context utilization: ${manager.getUtilization().toFixed(2)}%`);
console.log(`Excluded layers: ${result.excluded.join(', ')}`);
Q: Does the context window really work like RAM? A: In many ways, yes. Like RAM, a larger context window enables more complex tasks but costs more, and as with RAM management, the engineer must decide what stays in "memory" and what gets swapped out. The difference is that RAM is random-access while the context window is sequential: earlier tokens exert stronger influence on later ones.
Q: How should I decide which layers to load first? A: Use a heuristic: information density divided by token cost. If a 100-token document snippet can answer the user's question while a 1000-token detailed guide might not, the snippet gets the higher priority. Also, always give the task definition and few-shot examples top priority, because they shape the entire reasoning process.
Q: Does dynamic loading add latency? A: Yes, but it is usually acceptable. The latency of assembling dynamic context (typically 50-200ms) is generally outweighed by the resulting gains in inference quality. The key is balancing offline precomputation (such as document indexing) against online decisions (such as condition evaluation). Caching is your friend: cache evaluated conditions and frequently accessed documents.
Q: What is the next frontier of context engineering? A: Learned, adaptive context allocation: using reinforcement learning to automatically optimize layer priorities and token allocation from task-outcome feedback. Another direction is long-term memory management across sessions, which requires solving both forgetting policies (dropping stale information) and memory retrieval (finding relevant old information).
Test-Time Compute Scaling: A Paradigm Shift from Parameters to Inference
Spending more compute at inference time in exchange for better performance: breaking through the ceiling of parameter scaling
For the past decade, progress in deep learning followed a simple pattern: make the model bigger. Scaling laws show a power-law relationship between performance and parameter count. But this approach hit practical limits in 2024: training ever-larger models demands more compute, energy, and data, while the returns steadily diminish.
OpenAI's o1 and o3 models, along with other state-of-the-art systems, represent a fundamental shift: doing heavy computation at inference time, not only at training time. Computer science has long understood the underlying principle: for compute-intensive tasks, you can trade among more powerful hardware, more time, and smarter algorithms. LLMs are now given the ability to explore more possibilities, try multiple paths, and think more deeply at inference time.
The Chinchilla scaling laws (from DeepMind) established an empirical relationship: for a given compute budget, there is a specific ratio between the optimal model size and the number of training tokens. It is usually stated as:
C ≈ 6 · N · D   (total training FLOPs)
N ≈ C / (6D),   where N = parameters, D = training tokens
Optimum: N and D should be scaled up in roughly equal proportion
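Plugging numbers into the relation above makes it concrete (a sketch; the constant 6 is the standard FLOPs-per-parameter-per-token approximation, and the budget below is illustrative):

```python
# Worked example of the Chinchilla relation C ≈ 6·N·D: given a compute budget
# C (FLOPs) and a training-token count D, the implied parameter count is
# N = C / (6·D).

def implied_params(C: float, D: float) -> float:
    return C / (6 * D)

# A 1e23-FLOP budget spread over 1e12 training tokens:
N = implied_params(1e23, 1e12)  # on the order of 1.7e10 parameters (~17B)
```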
The Chinchilla law optimizes the allocation of training-time compute, and it assumes all compute happens at training time. Inference-optimal scaling rethinks this: if we have a fixed inference budget (say, a user willing to wait five seconds), how should we spend it?
The answer: spend it on generating more tokens, not on a bigger model. A smaller model given more "thinking time" (more generated tokens) can often beat a larger model's single-pass inference. This reflects a deeper truth: reasoning depth is sometimes worth more than parameter width.
A key innovation is the concept of thinking tokens. Unlike other tokens, thinking tokens are "hidden": the model can reason at length, even circuitously, inside them without being penalized for verbosity. Traditional models discourage this, because users pay for every output token.
In an extended-thinking model, the architecture looks like this:
Input → [Hidden Thinking Tokens] → [Final Answer Tokens] → Output
Total compute = input + hidden thinking + answer
User cost = input + answer (hidden thinking is usually billed at a lower rate, or not at all)
This changes the optimization dynamics: the model is incentivized to think more, since thinking does not directly raise the user's cost. For o1, the average thinking-to-answer ratio is roughly 5:1 to 10:1, meaning the model invests heavily in thought to produce a concise final answer.
Extended-thinking models carry out this internal reasoning inside <thinking> tags, which suits tasks requiring complex multi-step reasoning best.

Inference-time compute is a three-way trade-off among cost, latency, and accuracy: you cannot maximize all three at once.
        Accuracy
       ↙        ↘
  Cost  ↔  Latency
- High accuracy + low cost → high latency. A single o1 high-thinking run may take 30 seconds but produces an extremely accurate answer.
- High accuracy + low latency → high cost. Run several o1 high-thinking instances in parallel, then vote for the best answer.
- Low cost + low latency → low accuracy. A quick single pass through a small model.
Application design must pick a point inside this triangle based on concrete needs: a real-time chat application may choose the low-latency corner, while scientific paper generation may choose the high-accuracy corner.
| Scenario | Test-time compute (inference scaling) | Model scaling (training-time scaling) | Recommended |
|---|---|---|---|
| Complex reasoning (math, coding) | ✓✓✓ Excellent: big ROI on thinking space | ✓✓ Good, but diminishing returns | Test-time compute |
| Factual knowledge | ✓ Limited: more thinking adds no facts | ✓✓✓ Excellent: bigger models know more | Model scaling |
| Creative writing | ✓✓ Moderate: thinking improves structure, but creativity needs diversity | ✓✓ Moderate: bigger models are more creative, limited ROI | Combined approach |
| Low-latency interaction | ✗ Unsuitable: thinking adds latency | ✓✓✓ Required: must be optimized up front | Model scaling |
| Cost-sensitive applications | ✗✗ Expensive: N× inference cost | ✓✓ Good: single-pass inference | Model scaling |
| Long-chain tasks (verification, multi-step) | ✓✓✓ Excellent: reflection and self-correction have high ROI | ✓ Limited: long chains still fail easily | Test-time compute |
Let L(N, T) be the loss of a model with N parameters after generating T additional reasoning tokens. The scaling law is usually written as:
L(N, T) = A * N^(-α) + B * T^(-β) + ε
where N = parameters, T = generated reasoning tokens
α and β are power-law exponents (typically between 0.07 and 0.1)
The formula says that performance gains come from two independent sources: a bigger model (N) and more inference (T). The key insight is that the marginal returns are often similar, so for a given compute budget you may get comparable gains by spending more reasoning tokens instead of scaling up the model.
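A quick numeric check of this claim, with constants chosen arbitrarily for the demonstration (they are not fitted values):

```python
# Numeric illustration of L(N, T) = A·N^(-alpha) + B·T^(-beta): with similar
# exponents, doubling reasoning tokens T can buy a loss reduction comparable
# to (here, larger than) doubling parameters N, because the T term starts
# from a much smaller base.

def loss(N: float, T: float, A=10.0, B=10.0, alpha=0.08, beta=0.08) -> float:
    return A * N**-alpha + B * T**-beta

base = loss(N=7e9, T=1e3)
more_params = loss(N=14e9, T=1e3)    # double the model size
more_thinking = loss(N=7e9, T=2e3)   # double the reasoning tokens

gain_params = base - more_params
gain_thinking = base - more_thinking
```

With equal exponents, the relative improvement per doubling is the same for both terms, so whichever term currently dominates the loss offers the bigger absolute gain.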
# Python: inference-time compute via Best-of-N sampling
import anthropic
from typing import Any

class InferenceTimeComputeOrchestrator:
    def __init__(self, model: str, n_samples: int = 4):
        self.client = anthropic.Anthropic()
        self.model = model
        self.n_samples = n_samples

    def best_of_n_sampling(
        self,
        prompt: str,
        scoring_fn=None,
        temperature: float = 1.0
    ) -> dict[str, Any]:
        """
        Generate N candidate answers and select the best via a scoring function.

        Args:
            prompt: the user prompt
            scoring_fn: f(answer) -> float quality score; if None, a structural
                heuristic is used that favors longer, well-structured answers
            temperature: sampling temperature (higher = more diversity)

        Returns:
            {
                'best_answer': str,
                'score': float,
                'all_candidates': list[str],
                'total_tokens': int,
                'inference_cost_multiplier': float
            }
        """
        candidates = []
        token_counts = []
        print(f"Generating {self.n_samples} candidates...")
        for i in range(self.n_samples):
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                temperature=temperature,
                messages=[{"role": "user", "content": prompt}]
            )
            candidates.append(response.content[0].text)
            token_counts.append(
                response.usage.input_tokens + response.usage.output_tokens
            )
            print(f"  candidate {i + 1}/{self.n_samples} done")

        # Scoring
        if scoring_fn is None:
            # Default heuristic: favor answers with a visible logical breakdown
            def default_scorer(text: str) -> float:
                length_score = len(text.split()) / 500  # normalized length
                structure_score = (
                    text.count('\n') +
                    text.count('1.') +
                    text.count('2.') +
                    text.count('**')
                ) / 10
                return length_score * 0.3 + structure_score * 0.7
            scoring_fn = default_scorer

        scores = [scoring_fn(cand) for cand in candidates]
        best_idx = scores.index(max(scores))

        return {
            'best_answer': candidates[best_idx],
            'score': scores[best_idx],
            'all_candidates': candidates,
            'candidate_scores': scores,
            'total_tokens': sum(token_counts),
            'inference_cost_multiplier': self.n_samples,
            'best_candidate_index': best_idx
        }

    def tree_search_simplified(
        self,
        prompt: str,
        max_depth: int = 3,
        branching_factor: int = 2
    ) -> dict[str, Any]:
        """
        Simplified tree search: at each step, generate several possible
        continuations and prune unpromising branches with a heuristic.

        Args:
            prompt: the initial prompt
            max_depth: maximum depth of the search tree
            branching_factor: branches per node

        Returns:
            the most promising path and its score
        """
        def get_branching_continuations(text: str, k: int) -> list[str]:
            """Generate k possible continuations of text."""
            response = self.client.messages.create(
                model=self.model,
                max_tokens=200,
                temperature=1.2,  # higher temperature to encourage diversity
                messages=[{"role": "user", "content": text}]
            )
            # Simplification: use a single response and fan it out to simulate
            # branches; a real implementation would make k separate calls
            main_response = response.content[0].text
            return [main_response + f" [branch_{i}]" for i in range(k)]

        def heuristic_value(text: str) -> float:
            """Score the quality of a partial solution."""
            # Heuristic: text with concrete steps and logical connectives wins
            step_count = text.count('step') + text.count('Step')
            logic_count = text.count('because') + text.count('therefore')
            return (step_count * 0.6 + logic_count * 0.4) / max(1, len(text.split()) / 100)

        # Simplified depth-first search with pruning
        best_solution = None
        best_value = -float('inf')
        total_tokens = 0
        nodes_explored = 0

        def dfs(current_text: str, depth: int):
            nonlocal best_solution, best_value, total_tokens, nodes_explored
            if depth >= max_depth:
                value = heuristic_value(current_text)
                if value > best_value:
                    best_value = value
                    best_solution = current_text
                return
            branches = get_branching_continuations(current_text, branching_factor)
            nodes_explored += 1
            total_tokens += branching_factor * 200  # approximation

            # Prune: keep only the most promising branches
            scored_branches = [
                (branch, heuristic_value(branch)) for branch in branches
            ]
            scored_branches.sort(key=lambda x: x[1], reverse=True)

            # Expand only the top 50% of branches
            for branch, score in scored_branches[:max(1, branching_factor // 2)]:
                dfs(branch, depth + 1)

        dfs(prompt, 0)
        return {
            'best_path': best_solution,
            'value_score': best_value,
            'nodes_explored': nodes_explored,
            'total_tokens': total_tokens,
            'inference_cost_multiplier': max_depth * branching_factor
        }

# Usage example
orchestrator = InferenceTimeComputeOrchestrator(
    model="claude-opus-4-1",
    n_samples=4
)

# Best-of-N sampling
result_bon = orchestrator.best_of_n_sampling(
    prompt="Solve this: What is 17 * 23 + 45?",
    temperature=0.8
)
print(f"Best answer: {result_bon['best_answer']}")
print(f"Cost multiplier: {result_bon['inference_cost_multiplier']}x")
print(f"Total tokens used: {result_bon['total_tokens']}")

# Tree search
result_tree = orchestrator.tree_search_simplified(
    prompt="Design a simple algorithm to find the median of two sorted arrays",
    max_depth=2,
    branching_factor=2
)
print(f"Best solution: {result_tree['best_path']}")
print(f"Value: {result_tree['value_score']:.3f}")
Q: How much more expensive is Best-of-N sampling than a single pass? A: The exact cost depends on model pricing, but roughly it multiplies by N: with N=4, inference costs 4x. This is usually compensated by higher accuracy, and in many cases 4x cost with Best-of-4 accuracy is more economical than running a single "bigger" model.
Q: Can I run the samples in parallel to reduce latency? A: Absolutely. With enough concurrent capacity you can generate all N candidates in parallel and then score them; latency becomes that of a single generation rather than N. This is also a perfect use case for server-side caching: cache the candidates so identical future queries benefit.
Q: When is test-time compute not worth it? A: When your task is insensitive to marginal accuracy. For example, when generating natural conversation in a chat app, majority voting or Best-of-N may yield no user-visible improvement. Likewise, for factual-retrieval questions (where the model simply looks up information), more sampling adds no new factual knowledge.
AI Coding Agents: The Evolution from Completion to Autonomous Execution
The 2025 wave: code completion → code generation → autonomous task execution with verified reasoning
Five years ago, an "AI programming assistant" meant code completion: you typed, and the model predicted the next tokens. Today the concept has evolved into fully autonomous code agents that understand a task specification, design a solution, write code, run tests, debug failures, and iterate until they succeed. This is not just better completion; it is a fundamentally different architectural paradigm.
The key milestone was the arrival and broad adoption of SWE-bench, the first standardized benchmark that lets AI systems be evaluated on real software engineering tasks: fixing real GitHub issues, implementing features, passing unit tests. It turned AI programming from a "demo" into a scientific field with verifiable metrics.
CLI agents · rapid iteration · sandboxed environments · terminal-driven
All modern code agents follow a common loop pattern, though the details differ:
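One way to sketch that shared loop (plan, generate, execute, evaluate) in plain Python, with stub functions standing in for the model and sandbox calls:

```python
# The common agent skeleton: plan once, then generate -> execute -> evaluate
# until the tests pass or the iteration budget runs out.

def agent_loop(task, plan, generate, execute, max_iters=5):
    p = plan(task)
    feedback = None
    for i in range(1, max_iters + 1):
        code = generate(task, p, feedback)     # model call in a real agent
        passed, feedback = execute(code)       # sandboxed run + tests
        if passed:
            return {"done": True, "iterations": i, "code": code}
    return {"done": False, "iterations": max_iters, "code": code}

# Toy run: "execution" succeeds once the generated code contains a fix marker,
# which only happens after the first round of failure feedback.
result = agent_loop(
    task="fix bug",
    plan=lambda t: "patch the function",
    generate=lambda t, p, fb: "code+fix" if fb else "code",
    execute=lambda c: (("fix" in c), "test failed: missing fix"),
)
```

The feedback argument is the essential difference from single-pass generation: each retry sees the previous failure.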
SWE-bench (Software Engineering Benchmark), created by researchers at Princeton, was the first large-scale AI programming benchmark. It contains 2,294 real-world software engineering problems drawn from actual GitHub repositories.
How it works: each problem consists of a problem statement (extracted from a GitHub issue) and a reference patch (the solution). The agent is given the problem statement and the codebase, and must produce code changes that resolve the issue. Evaluation is automatic: the agent's patch is applied to the codebase and the test suite is run; success requires that all the targeted tests pass without introducing regressions (breaking other tests).
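That success criterion can be sketched as follows; `run_tests` and the fail-to-pass / pass-to-pass split are schematic stand-ins for the real harness:

```python
# Sketch of the SWE-bench acceptance rule: a candidate patch succeeds only if
# the tests that verify the fix now pass AND no previously passing test
# regresses.

def evaluate_patch(run_tests, fail_to_pass, pass_to_pass):
    """run_tests(test_id) -> bool, evaluated after the patch is applied."""
    fixed = all(run_tests(t) for t in fail_to_pass)       # issue resolved
    no_regress = all(run_tests(t) for t in pass_to_pass)  # nothing broken
    return fixed and no_regress

# The fix works, but an existing test broke -> overall failure.
results = {"test_bug_123": True, "test_existing_a": True, "test_existing_b": False}
ok = evaluate_patch(results.get, ["test_bug_123"], ["test_existing_a", "test_existing_b"])
```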
| System | SWE-bench pass rate | Notes |
|---|---|---|
| Claude Code (Claude 3.5 Sonnet) | 72% | Top performance. Strong context understanding and multi-step reasoning. |
| Devin (Cognition) | ~70% (self-reported) | Fully sandboxed. High autonomy, but fewer public data points. |
| GPT-4 Turbo | ~50% | Baseline. Single pass, no feedback loop. |
| Claude 3 Opus | ~48% | Previous generation. Still strong, but behind Sonnet. |
| Cursor (GPT-4o-based) | ~45-50%* | *Unofficial estimate. User-driven iteration may raise real-world performance. |
Note that a 72% "pass rate" does not mean the agent solved every problem fully autonomously; many cases involve multiple iterations, failed first attempts, and error recovery. It is still enormously useful: even an agent that resolves only 50% of issues without human intervention speeds up development significantly.
// TypeScript + pseudocode: one iteration loop of a code agent
interface TaskDefinition {
  problem: string;
  context: CodeContext; // codebase, tests, etc.
  maxIterations: number;
}

interface AgentState {
  currentPlan: string;
  generatedCode: string;
  testResults: TestResult[];
  executionErrors: string[];
  iterationCount: number;
  isComplete: boolean;
}

class CodeAgent {
  async solveTask(task: TaskDefinition): Promise<AgentState> {
    const state: AgentState = {
      currentPlan: '',
      generatedCode: '',
      testResults: [],
      executionErrors: [],
      iterationCount: 0,
      isComplete: false
    };

    // Step 1: planning
    state.currentPlan = await this.planTask(task);
    console.log(`Plan:\n${state.currentPlan}\n`);

    // Steps 2-5: the iteration loop
    while (state.iterationCount < task.maxIterations && !state.isComplete) {
      state.iterationCount++;
      console.log(`\n=== Iteration ${state.iterationCount} ===`);

      // Step 2: code generation
      const previousContext = state.iterationCount > 1
        ? {
            previousAttempt: state.generatedCode,
            errors: state.executionErrors,
            testFailures: state.testResults
              .filter(r => !r.passed)
              .map(r => r.message)
          }
        : null;

      state.generatedCode = await this.generateCode(
        task,
        state.currentPlan,
        previousContext
      );
      console.log(`Generated code (${state.generatedCode.split('\n').length} lines)`);

      // Step 3: execution and testing
      const execution = await this.executeCode(
        state.generatedCode,
        task.context
      );
      state.testResults = execution.testResults;
      state.executionErrors = execution.errors;

      const passedTests = state.testResults.filter(r => r.passed).length;
      const totalTests = state.testResults.length;
      console.log(`Test results: ${passedTests}/${totalTests} passed`);
      if (state.executionErrors.length > 0) {
        console.log(`Execution errors:`);
        state.executionErrors.forEach(err => console.log(`  - ${err}`));
      }

      // Check the completion condition
      if (passedTests === totalTests && state.executionErrors.length === 0) {
        state.isComplete = true;
        console.log('✓ All tests passed. Task complete.');
      } else if (state.iterationCount >= task.maxIterations) {
        console.log(`✗ Reached the iteration limit (${task.maxIterations}).`);
      }
    }
    return state;
  }

  private async planTask(task: TaskDefinition): Promise<string> {
    const systemPrompt = `You are a software engineer solving coding tasks.
First, analyze the problem and create a step-by-step plan.`;
    const userMessage = `
Problem: ${task.problem}

Codebase context:
${task.context.summary}

Create a plan for solving this problem.`;
    const response = await this.modelCall(systemPrompt, userMessage, {
      thinkingBudget: 'high' // use extended thinking for planning
    });
    return response.text;
  }

  private async generateCode(
    task: TaskDefinition,
    plan: string,
    previousContext: any = null
  ): Promise<string> {
    let userMessage = `
Plan: ${plan}

Now generate code that solves the problem based on this plan.
Keep the code consistent with the existing style.
Include error handling.`;
    if (previousContext) {
      userMessage += `
The previous attempt failed with these errors:
${previousContext.errors.join('\n')}

Test failures:
${previousContext.testFailures.join('\n')}

Analyze these errors and improve your solution.`;
    }
    const response = await this.modelCall(
      'You are a code generation AI. Generate high-quality, production-ready code.',
      userMessage,
      { temperature: 0.7 }
    );
    // Extract the code block from the response
    const codeMatch = response.text.match(/```[\w]*\n([\s\S]*?)\n```/);
    return codeMatch ? codeMatch[1] : response.text;
  }

  private async executeCode(
    code: string,
    context: CodeContext
  ): Promise<{ testResults: TestResult[]; errors: string[] }> {
    try {
      // Simulated execution environment (use a real sandbox in production)
      const sandbox = new CodeSandbox(context);
      const result = await sandbox.execute(code);
      const testResults: TestResult[] = result.tests.map(t => ({
        name: t.name,
        passed: t.passed,
        message: t.message
      }));
      return {
        testResults,
        errors: result.runtimeErrors || []
      };
    } catch (error) {
      return {
        testResults: [],
        errors: [`Execution failed: ${error.message}`]
      };
    }
  }

  private async modelCall(
    systemPrompt: string,
    userMessage: string,
    options: any = {}
  ): Promise<{ text: string }> {
    // The actual call to the Claude API would go here (pseudocode)
    return {
      text: '// generated code'
    };
  }
}

// Usage example
const agent = new CodeAgent();
const result = await agent.solveTask({
  problem: 'Implement a debounce function in utils.ts that returns a delayed-execution wrapper',
  context: new CodeContext('./my-project'),
  maxIterations: 5
});
console.log(`\nFinal state:`);
console.log(`Complete: ${result.isComplete}`);
console.log(`Iterations: ${result.iterationCount}`);
console.log(`Tests passed: ${result.testResults.filter(r => r.passed).length}/${result.testResults.length}`);
| Dimension | Traditional autocomplete | Agentic coding |
|---|---|---|
| Scope | Predicts the next line or function | Solves an entire task or issue |
| Feedback | No real feedback; based on statistical patterns | Executes code; sees test output and actual errors |
| Iteration | None; the user corrects manually | Debugs and improves automatically until success |
| Context understanding | Local (around the current line) | Global (entire codebase and architecture) |
| Verification | The user must test | Runs tests automatically; verifies pass/fail |
| Typical accuracy | ~60-70% (first token) | ~70-75% (whole task solved) |
Using code agents raises an interesting economic question. A single API call is cheap, but an agent may run many iterations (and thus many calls). What matters, though, is the comparison against developer time.
- The cost view: an agent that costs 2-5x in inference but eliminates human debugging time is still extremely economical.
- The latency view: a simple task ("write a login form") may take 2-3 iterations, 10-20 seconds total; a medium task 30-60 seconds; a complex one (refactoring, architectural changes) 2-5 minutes, still faster than coding by hand.
Despite the progress, code agents still have important limitations:
- Hallucinated, unrealistic proposals: the model can produce code that looks plausible but does not work. Test feedback usually catches this, but the tests themselves are sometimes incomplete.
- Long-horizon reasoning: on large refactors that must stay consistent across many files and steps, agents tend to drift. Context engineering helps, but this remains a challenge.
- Error accumulation: when an agent errs in an early step, later iterations may double down on the mistake instead of fixing the root cause.
- Dependence on test coverage: an agent's effectiveness is strictly bounded by the quality and coverage of the available tests. Without tests to verify behavior, it may produce code that seems to work but carries subtle defects.
Q: When should I use a code agent versus a human developer in production? A: Agents work best on well-specified problems with existing tests: bug fixes, features with clear specifications, writing consistent code. They are weaker at architectural decisions, cross-system design, and creative solutions. Best practice is a hybrid approach: agents for well-defined tasks, experienced engineers for complex design.
Q: What does a 72% SWE-bench pass rate mean in practice? A: That on real GitHub issues, the agent can autonomously resolve roughly three quarters without human intervention. But mind the benchmark's limitations: most issues are relatively small, and many have very clear tests. Larger refactors or architectural changes likely have much lower autonomous pass rates.
Q: How many iterations should an agent attempt? A: A typical cap is 5-10. Beyond that, if the agent is still failing, the problem is probably too complex and needs human intervention. Some frameworks use "reachability analysis" to detect when an agent is stuck in a loop (repeating the same mistake) and stop early.
Q: What comes next for code agents? A: Better long-horizon planning (specialized training with extended thinking), multi-agent collaboration (different specialists handling different subtasks), and cross-repo reasoning (understanding how multiple projects interact). There is also the question of tooling integration: agents need access to more tools (deployment, monitoring, debugging profilers) to handle real engineering workflows.
Structured Outputs and Constrained Decoding: The Foundation of Reliability
Forcing LLM output to follow a strict structure at generation time, eliminating parsing ambiguity and hallucinated fields
LLMs are probabilistic systems. They generate tokens one at a time, each sampled from a probability distribution. That means:
- Inconsistent output formats: you ask for JSON; sometimes you get JSON, sometimes Markdown, sometimes plain text.
- Wrong data types: you expect a number; the model returns a string, or a malformed number.
- Invalid field values: you define a field with specific enum values ("status": "active" or "inactive"); the model invents a new one ("pending").
- Missing fields: a structure that needs five fields comes back with only three, forcing downstream code to handle the gaps.
The traditional remedy is post-hoc parsing: the model generates arbitrary text, and the application tries to extract structure with regexes or custom parsers. This is brittle, error-prone, and adds latency (you must generate the full output before parsing it).
Constrained decoding is a technique that enforces the constraints at every step of token generation. Instead of letting the model sample freely over all ~50,000 possible tokens, invalid tokens are dynamically masked based on the prefix generated so far and the target structure.
For example, if {"name": "John", "age": has been generated and the schema requires age to be an integer, the decoder permits only the tokens `0` through `9` (and `.` for floats). On reaching the end of the object, the decoder forces a `}` token, guaranteeing valid JSON.
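A toy version of that masking step, with an invented mini-vocabulary (real decoders mask over the model's full vocabulary and track the schema with an automaton):

```python
# When the schema expects an integer at the current position, every token that
# is not a digit (or a closing brace once at least one digit exists) gets
# probability zero, and the rest are renormalized before sampling.

def mask_for_integer(probs: dict[str, float], have_digit: bool) -> dict[str, float]:
    allowed = set("0123456789") | ({"}"} if have_digit else set())
    masked = {t: (p if t in allowed else 0.0) for t, p in probs.items()}
    total = sum(masked.values())
    return {t: p / total for t, p in masked.items()}  # renormalize

vocab_probs = {"3": 0.2, "0": 0.2, "abc": 0.5, "}": 0.1}
step1 = mask_for_integer(vocab_probs, have_digit=False)
```

Even though the invalid token "abc" originally had the highest probability, it can never be sampled after masking.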
Raw token probability distribution (50K tokens)
    ↓ [apply constraints]
Masked distribution (only valid tokens; everything else at -∞)
    ↓ [sample]
A next token guaranteed to be valid
A key advantage of constrained decoding is zero-to-minimal performance overhead. Early implementations (2023) added 30-50% latency per token; modern libraries have optimized this to near zero:
- Outlines (FSM): ~5-10μs of overhead per token (negligible)
- llguidance (CFG): ~50μs per token (still acceptable; only ~200ms added to a 4K-token generation)
- OpenAI Structured Outputs: no measured overhead; amortized into model inference time.
The key insight is that constraint checking can be heavily parallelized on GPU or CPU without blocking the generation loop.
To systematically evaluate the accuracy, performance, and robustness of constrained-decoding frameworks, the community created JsonSchemaBench, a benchmark of 10,000 real-world JSON schemas with corresponding test cases. It measures:
- Accuracy: given a schema, does the framework produce valid JSON?
- Completeness: can arbitrary JSON schemas be expressed with the framework?
- Performance: how much throughput does constraint checking cost?
- Format coverage: does the framework handle arrays, nested objects, union types, and so on?
| Framework | Approach | Supported formats | Performance | Ease of use |
|---|---|---|---|---|
| Outlines | FSM | JSON, regex, Pydantic | ~5-10μs/token | Excellent. Integrates with the HF ecosystem. |
| llguidance | CFG | Arbitrary CFGs | ~50μs/token | Good. Steep learning curve. |
| XGrammar | Hybrid (FSM + CFG) | JSON, XML, custom | ~15-25μs/token | Good. Multi-language support. |
| llama.cpp | Token masking | GBNF (an EBNF variant) | ~2-5μs/token | Excellent. Lightweight. |
| OpenAI Structured Outputs | Model-level | JSON Schema | No overhead | Excellent. Built into the API. |
| Anthropic tool use | Model-level | JSON Schema (tools) | No overhead | Excellent. Native integration. |
| Gemini API | Model-level | JSON Schema | No overhead | Excellent. Google integration. |
# Python: constrained decoding with Outlines, lm-format-enforcer, and provider APIs

# Using Outlines (FSM approach); API shown as of Outlines 0.x
import outlines
from pydantic import BaseModel
from typing import List

# Define a Pydantic model as the schema
class Person(BaseModel):
    name: str
    age: int
    email: str
    skills: List[str]

# Load the model and build a schema-constrained generator
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, Person)

# Generate constrained output
prompt = "Generate a person with the following details:"
result = generator(prompt, max_tokens=200)
print(result)
# Guaranteed valid JSON; parses directly into Person
---
# Using a token-mask enforcer (the JsonSchemaParser follows the
# lm-format-enforcer library; the decoding loop below is schematic pseudocode)
from lmformatenforcer import JsonSchemaParser
import json

class Product(BaseModel):
    product_id: int
    name: str
    price: float
    in_stock: bool
    tags: List[str]

# Build the parser from a JSON schema
schema = {
    "type": "object",
    "properties": {
        "product_id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "tags": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["product_id", "name", "price", "in_stock", "tags"]
}
parser = JsonSchemaParser(schema)

# Apply the constraints inside the decoding loop (schematic)
prompt = "Generate a product listing:"
for token_id in model.generate_token_ids(prompt):
    # Ask the parser which tokens are currently allowed
    valid_tokens = parser.get_allowed_tokens(decoder_state)
    # Mask everything else
    if token_id not in valid_tokens:
        token_id = valid_tokens[0]  # fall back to the first valid token
    # Advance the parser state
    parser.update(token_id)
    yield token_id
    if parser.is_complete():
        break

# The output is guaranteed valid JSON and deserializes directly
output = parser.result()
product = Product(**output)
---
# Using OpenAI Structured Outputs (enforced at the API level)
from openai import OpenAI
from pydantic import BaseModel
import json

client = OpenAI()

class Event(BaseModel):
    event_type: str   # "meeting", "deadline", "reminder"
    date: str         # ISO 8601 format
    title: str
    description: str
    attendees: list[str]

# The API enforces the schema
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # structured outputs require a supporting model
    messages=[
        {
            "role": "user",
            "content": "Extract the calendar event from this text: ..."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "CalendarEvent",
            "schema": Event.model_json_schema(),
            "strict": True  # enforce strict schema adherence
        }
    }
)

# The message content is guaranteed to be valid JSON
event = Event(**json.loads(response.choices[0].message.content))
---
# Using Anthropic tool use for structured output
from anthropic import Anthropic

client = Anthropic()

# Define the tool schema
tools = [
    {
        "name": "record_user_feedback",
        "description": "Record user feedback about a feature",
        "input_schema": {
            "type": "object",
            "properties": {
                "feature": {
                    "type": "string",
                    "description": "The feature the feedback is about"
                },
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"],
                    "description": "The sentiment of the feedback"
                },
                "rating": {
                    "type": "integer",
                    "minimum": 1,
                    "maximum": 5,
                    "description": "A rating from 1 to 5"
                },
                "comments": {
                    "type": "string",
                    "description": "Detailed comments (optional)"
                }
            },
            "required": ["feature", "sentiment", "rating"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": "The user said: 'I love the new search feature, but it's sometimes slow. I'd give it 4/5.'"
        }
    ]
)

# Handle the tool-call response
for content_block in response.content:
    if content_block.type == "tool_use":
        tool_input = content_block.input
        # tool_input is guaranteed to conform to the schema,
        # so it can be validated and stored directly
        record_feedback(
            feature=tool_input["feature"],
            sentiment=tool_input["sentiment"],
            rating=tool_input["rating"],
            comments=tool_input.get("comments", "")
        )
---
# Performance comparison: constrained vs. unconstrained decoding
import time
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "Generate a JSON object with person details:"

# Unconstrained (risky: output may not be valid JSON)
unconstrained_gen = outlines.generate.text(model)
start = time.time()
unconstrained = unconstrained_gen(prompt, max_tokens=200)
unconstrained_time = time.time() - start

# Constrained (Outlines, FSM compiled from the schema)
schema = """{
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"}
    }
}"""
constrained_gen = outlines.generate.json(model, schema)
start = time.time()
constrained = constrained_gen(prompt, max_tokens=200)
constrained_time = time.time() - start

print(f"Unconstrained: {unconstrained_time:.3f}s")
print(f"Constrained: {constrained_time:.3f}s")
print(f"Overhead: {((constrained_time - unconstrained_time) / unconstrained_time * 100):.1f}%")
# Typical result: overhead < 5%
| Scenario | Constrained decoding | Post-hoc parsing | Recommended |
|---|---|---|---|
| Simple JSON APIs | ✓✓✓ Perfect: near-zero overhead, 100% valid | ✓ Works, but fails easily | Constrained decoding |
| Structured data extraction | ✓✓✓ Excellent: validity guaranteed | ✓✓ Feasible, but needs retries | Constrained decoding |
| Free-form text | ✗ Unsuitable: constraints curb creativity | ✓✓✓ Good: the model has freedom | Post-hoc parsing |
| Complex recursive structures | ✓✓ Good: CFG support (but slower) | ✗ Hard: edge cases abound | Constrained decoding |
| Low-cost, high-throughput | ✓✓ Good: minimizes retries | ✗ Poor: failures force regeneration | Constrained decoding |
| Experimentation/exploration | ✓ Limited: constraints may be too strict | ✓✓✓ Good: maximum flexibility | Post-hoc parsing |
Even though constrained decoding can guarantee valid JSON, semantic errors can still occur.
# Common problems and diagnostics

# Problem 1: constraints too strict; the model frequently fails
# Symptom: generation often "stalls" and output is incomplete
Symptom:
{
  "name": "John",
  "age": [STUCK - an integer is expected, but the model wants to say "30 or 35"]
}

Diagnosis:
- Check whether the schema is over-constrained
- Confirm the model understands the schema (show examples in few-shot)
- Consider looser types (a plain string instead of a strict enum)

Fix:
# Before: strict enum
"status": {
    "type": "string",
    "enum": ["active", "inactive"]  # the model sometimes wants "pending"
}
# After: allow any string, but log unexpected values
"status": {
    "type": "string"
}
# ...and validate in post-processing:
if status not in ["active", "inactive"]:
    log_unexpected_value(status)
    status = "unknown"
---
# Problem 2: syntactically valid but semantically wrong output
# Symptom: the JSON is valid, but the content makes no sense
Example:
{
  "product": "laptop",
  "quantity": -5,    # a valid integer, but negative makes no sense
  "price": "apple"   # a valid string, but not a valid price
}

Diagnosis:
- The schema enforces only types and formats, not business logic
- Post-generation validation is required

Fix:
class Product(BaseModel):
    name: str
    quantity: int = Field(gt=0)  # quantity > 0
    price: float = Field(gt=0)   # price > 0

# Pydantic validates the business rules automatically
try:
    product = Product(**json_output)
except ValidationError as e:
    log_error(f"Semantic error: {e}")
    retry_with_feedback(original_prompt)
---
# Problem 3: incomplete data in recursive or nested structures
Expected:
{
  "items": [
    {"id": 1, "name": "Item 1"},
    {"id": 2, "name": "Item 2"}
  ]
}
Actual (with some frameworks):
{
  "items": [
    {"id": 1, "name": "Item 1"}
    # second object never generated
  ]
}

Causes:
- Some constraint libraries have bugs in array handling
- A token limit may truncate the output

Fixes:
- Use libraries known to handle arrays well (Outlines, OpenAI)
- Increase max_tokens
- Show complete array examples in few-shot
---
# Problem 4: performance regression
Symptom: constrained decoding is much slower than expected
Causes:
- Complex CFG constraints (O(n²) worst case)
- Excessive recompilation/recomputation per token
- A mismatched library (some are optimized for specific models)

Fix:
# Profile the constraint overhead
import time
start = time.time()
for i, token in enumerate(constrained_generator):
    if i % 10 == 0 and i > 0:
        elapsed = time.time() - start
        tokens_per_second = i / elapsed
        if tokens_per_second < 50:  # typical range: 100-500 tps
            logger.warning(f"Slow generation: {tokens_per_second} tps")
# If generation is slow, try:
# 1. Simplify the schema (FSM is faster than CFG)
# 2. Switch libraries
# 3. Apply constraints after model inference (trading reliability for speed)
Q: At which layer should I implement constrained decoding? A: The best places are the model level (OpenAI Structured Outputs, Anthropic tool use) or the token-generation loop (Outlines, llguidance). Stay away from post-hoc parsing. If you are using an API, choose a provider with native constrained-decoding support.
Q: What should I keep in mind when designing a JSON schema for constrained decoding? A: (1) Use enums rather than free-form strings to enforce finite value sets. (2) Put minimum/maximum bounds on numbers. (3) Mark fields required to avoid missing values. (4) Show valid outputs in few-shot examples. (5) Avoid over-constraining (models fail more often under too many constraints).
Q: What is the cost-benefit analysis of constrained decoding? A: Suppose unconstrained decoding succeeds 70% of the time. Reaching 95% requires retries (up to a 2.5x cost multiplier in the worst case). Constrained decoding succeeds nearly 100% of the time with <5% overhead, so economically it wins, before even counting error-handling costs: invalid JSON causes database errors, log noise, and so on.
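The retry arithmetic behind that answer, made explicit (numbers mirror the text: 70% per-attempt success, 5% overhead):

```python
# With per-attempt success probability p and retries until success, the
# expected number of generations is 1/p (geometric distribution); constrained
# decoding instead pays a fixed per-token overhead once.

def expected_cost_unconstrained(cost_per_call: float, p_success: float) -> float:
    return cost_per_call / p_success       # expected retries: 1/p

def expected_cost_constrained(cost_per_call: float, overhead: float) -> float:
    return cost_per_call * (1 + overhead)

uncon = expected_cost_unconstrained(1.0, 0.7)  # ~1.43x the base cost
con = expected_cost_constrained(1.0, 0.05)     # 1.05x the base cost
```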
Q: Does constrained decoding work with multimodal models (image inputs)? A: Yes: token-level constraints are independent of the input modality. JSON constraints work the same whether the input is text or includes images, and frameworks (Outlines, the OpenAI API) support this out of the box.
Synthetic Data Generation: From Scarcity to Abundance
How frontier models create training data at scale, and the critical failure modes that threaten generalization.
Why Synthetic Data Is So Critical
Real-world labeled data is expensive, slow, and fragmented. Collecting high-quality examples for specialized domains—medical imaging, legal documents, scientific papers—requires expert annotation costing thousands per sample. Meanwhile, privacy regulations (GDPR, HIPAA) restrict reuse of sensitive data. Synthetic data sidesteps these constraints: a capable model can generate unlimited task-specific examples, domain-adapted to your exact needs, without privacy leaks. This is why every frontier lab now runs synthetic data pipelines.
Core Generation Techniques
Prompt-Based Generation
The simplest approach: use a capable model (GPT-4, Claude) to generate examples given a prompt template. E.g., "Generate 10 customer support questions about billing" → the model outputs diverse questions. Quick and cheap, but quality is bounded by the precision of the prompt and the model's instruction-following ability.
Retrieval-Augmented Synthesis (RAS)
Ground generation in real documents. Retrieve relevant passages, then instruct the model: "Based on this document, generate a Q&A pair." This anchors synthesis to domain facts, reducing hallucination. Essential for technical domains where factual grounding matters.
Iterative Self-Refinement
Multiple passes improve quality. Generate → score → filter → regenerate weak examples. Each iteration tightens distribution toward desired quality threshold. Works especially well with weak supervision signals (e.g., reward models, classifier feedback).
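That generate-score-filter-regenerate loop can be sketched with stubs for the model and scorer (regeneration here just yields a higher-scoring variant, which is enough to show how each pass tightens the batch toward the threshold):

```python
# Iterative self-refinement: keep examples above the quality threshold,
# regenerate the weak ones, and repeat for a bounded number of rounds.

def refine(batch, score, regenerate, threshold, max_rounds=3):
    for _ in range(max_rounds):
        weak = [x for x in batch if score(x) < threshold]
        if not weak:
            break
        batch = [x if score(x) >= threshold else regenerate(x) for x in batch]
    return batch

# Toy scorer: "b" is weak, its regenerated variant "b+" clears the bar.
scores = {"a": 0.9, "b": 0.4, "b+": 0.8}
final = refine(
    batch=["a", "b"],
    score=scores.get,
    regenerate=lambda x: x + "+",  # stand-in for "regenerate a better example"
    threshold=0.7,
)
```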
Evol-Instruct (WizardLM, 2023)
Evolve instructions into progressively harder versions. Start with simple prompt → model generates response → "Make this 20% harder" → recurse. This creates curriculum-like synthetic data where complexity gradates naturally. WizardLM achieved competitive results with >90% synthetic training data.
Self-Instruct (Alpaca, Stanford 2023)
Bootstrap from a seed set of hand-written examples. Iteratively: (1) sample seed instructions, (2) prompt model to generate new instructions and outputs, (3) filter low-quality pairs, (4) add to training set. Alpaca's 52K examples cost ~$500 to generate using GPT-3.5, competitive with supervised fine-tuning.
RAFT: Retrieval-Augmented Fine-Tuning
RAFT (2024) marries retrieval-augmented generation with fine-tuning: (1) retrieve relevant documents, (2) use RAG to generate synthetic examples grounded in those documents, (3) fine-tune on synthetic + real data. Result: models that answer questions in-domain, citing sources, resistant to distribution shift. Shows 5-15% accuracy gains over vanilla fine-tuning on specialized tasks.
Model Collapse: The Fatal Trap of Synthetic Data
The Critical Risk: Train a model on synthetic data → use that model to generate more synthetic data → train the next model on that → repeat. Each generation, the distribution narrows: outliers are trimmed, variance collapses, modes concentrate. Mathematically:
Var(p_k) ≈ ρ^k · Var(p_0),  where ρ ∈ (0,1) is the contraction rate per generation
After k generations, the support of the distribution is drastically reduced. Specific failure modes:
- Loss of diversity: Models repeat top-k common patterns, losing tail behaviors.
- Language degradation: Repeated paraphrasing introduces grammatical artifacts (e.g., increased use of "undoubtedly," "in conclusion").
- Fact erosion: Hallucinations compound; false patterns get reinforced as "facts."
- Catastrophic forgetting: Minority classes vanish; marginal domains become extinct in data.
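A tiny simulation of this contraction (the filtering rule and constants are invented; it only demonstrates the geometric variance decay described above):

```python
# Each "generation" resamples from the previous sample but keeps only values
# near the mode, mimicking quality filtering, so variance shrinks
# geometrically across generations.
import random
import statistics

def next_generation(sample, keep_fraction=0.6, rng=None):
    rng = rng or random
    mean = statistics.fmean(sample)
    # keep the points closest to the mean, then resample with replacement
    kept = sorted(sample, key=lambda x: abs(x - mean))[: int(len(sample) * keep_fraction)]
    return [rng.choice(kept) for _ in range(len(sample))]

rng = random.Random(0)
pop = [rng.gauss(0, 1) for _ in range(2000)]
variances = [statistics.pvariance(pop)]
for _ in range(4):
    pop = next_generation(pop, rng=rng)
    variances.append(statistics.pvariance(pop))
# variances falls sharply with each generation: the synthetic distribution
# concentrates around its mode and the tails vanish.
```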
Mitigation Strategies:
- Ground every generation step in real data: retrieval from external documents, or a fixed fraction of real examples mixed into each training cycle.
- Track distribution entropy across generations and halt for an audit when it drops sharply.
- Apply strict quality gates (scoring, filtering) before synthetic examples enter the training set.
- Cap the number of model-to-model retraining cycles; collapse risk compounds with each generation.
Scaling Laws for Synthetic Data
Early research (2023-2024) suggested synthetic data obeys different scaling laws than real data. Recent findings:
- Quality Threshold: Synthetic data below ~80% human-equivalent quality hurts more than it helps. Beyond 90%, gains plateau.
- Task Dependency: Simple tasks (classification) tolerate 100% synthetic; complex reasoning needs 40%+ real examples.
- Scale Curve: Synthetic data enables 2-4x larger effective dataset size before hitting diminishing returns, vs. real-only baseline.
- Diversity Cost: To match coverage of N real examples, need 2-3x more synthetic examples due to mode concentration.
| Scenario | Synthetic Helps? | Key Insight |
|---|---|---|
| Data scarce domain (<10K real) | ✓ Strong Win | Synthetic fills the gap; quality over quantity. |
| Abundant real data (>1M examples) | ✗ Marginal/Negative | Real diversity already sufficient; synthetic adds noise. |
| Domain shift problem | ✓ Moderate Win | Generate target-domain examples; reduces distribution mismatch. |
| Long-tail minority classes | ✓ Strong Win | Oversample rare classes synthetically without real-world cost. |
| Iterative retraining (>3 cycles) | ✗ Risky | Model collapse risk grows exponentially; requires strict quality gates. |
Llama 3 & Frontier Labs: Synthetic in Practice
Meta disclosed that Llama 3 (the 405B-parameter model) mixed:
- ~60% public web data (original real)
- ~25% synthetic code/reasoning examples (generated by internal models)
- ~15% high-quality curated real data (research papers, books, forums)
GPT-4o similarly uses synthetic data for chain-of-thought reasoning and multi-modal alignment. OpenAI has hinted at iterative refinement loops where weak synthetic examples are fixed by human feedback, then reused.
Cost-Benefit Analysis
Cost per 1M tokens of synthetic data: ~$100-500 using a capable model (GPT-4 API). Comparison:
- Human annotation: $2000-5000 per 1M tokens (slower, higher quality).
- Weak supervision: $500-1000 per 1M tokens (mixed quality, requires validation).
- Synthetic (high-quality filter): $100-300 per 1M tokens (fast, needs model investment).
Q: How do you prevent model collapse in production?
"We enforce a hard rule: never chain models without real-data grounding. Every generation step includes retrieval from external documents. We also track distribution entropy across generations; if it drops >5%, we halt and audit the data."
Q: When should a team NOT use synthetic data?
"If your domain is adversarial (security, finance), synthetic can hurt by teaching patterns an attacker can exploit. Also if you have abundant high-quality real data—synthetic adds variance without signal."
Q: What's the biggest surprise in scaling synthetic data?
"Quality filtering is harder than generation. We spend 2-3x more compute on validation than creation. And diversity is brutal: you need clever sampling strategies to avoid mode collapse."
Q: Is synthetic data a moat for frontier labs?
"Temporarily, yes. The best synthetic pipelines are proprietary (OpenAI, DeepMind). But the techniques are publishable. Within 18 months, open-source tools (Argilla, HuggingFace datasets) will commoditize it."
Model Merging: 无需重训的模型融合
Combining task-specific fine-tuned models through parameter manipulation, without retraining. A breakthrough for scalable multi-task serving.
问题:孤立的任务特定模型
Fine-tune a base model on Task A → get a specialized model with SOTA performance. Fine-tune the same base on Task B → another SOTA model. But you have two separate models. Want both in one? Naive averaging of weights produces catastrophic interference: performance on both tasks drops 20-40%. You need either (a) ensemble all models at inference (expensive), or (b) retrain from scratch with multi-task loss (slow, needs careful tuning).
Model merging solves this: combine task-specific checkpoints into a single unified model that retains performance on all tasks, without retraining. This unlocked the "model marketplace" where community members fine-tune and merge models freely (e.g., Open LLM Leaderboard merged models achieving SOTA with zero additional training).
核心融合技术
Model Soup (Wortsman et al., 2022)
The simplest merge: uniform averaging of fine-tuned checkpoints. Given base model θ₀ and n task-specific models θ₁, θ₂, ..., θₙ, the soup is:
θ_soup = (1/n) · (θ₁ + θ₂ + ... + θₙ)
Why it works: Fine-tuning on different tasks explores different regions of parameter space, but many regions are "compatible"—averaging them sums the task-specific improvements. Works best when tasks are diverse and base model is strong.
- Pros: Dead simple; no hyperparameters; embarrassingly parallel.
- Cons: Naive averaging doesn't resolve task interference (conflicting gradients); performance lags single-task models by 5-15%.
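A minimal sketch of the uniform soup, with small dicts of arrays standing in for full model state dicts:

```python
import numpy as np

def model_soup(checkpoints):
    """Uniform soup: per-parameter average across fine-tuned checkpoints.
    Each checkpoint is a dict mapping parameter name -> weight array."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```

The real procedure is identical, just applied tensor-by-tensor to checkpoints that share a base architecture.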
Task Arithmetic (Ilharco et al., 2023)
Instead of merging raw weights, merge task vectors: the difference between the fine-tuned and base models. Given base θ₀ and task-specific θᵢ, the task vector is τᵢ = θᵢ - θ₀. Then:
θ_merged = θ₀ + Σᵢ λᵢ · τᵢ
where λᵢ are scalar weights controlling each task's contribution. Intuition: task vectors isolate the task-specific signal from the shared base weights. Merging them preserves shared knowledge (θ₀) while adding task-specific deltas.
- Pros: More interpretable; task vectors are sparse (few non-zero elements), so interference is reduced.
- Cons: Still naive—conflicting deltas cause cancellation or amplification; needs manual tuning of λᵢ.
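Task arithmetic in code, again over toy state dicts:

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    """tau_i = theta_i - theta_0: isolate the task-specific delta."""
    return {k: theta_ft[k] - theta_base[k] for k in theta_base}

def merge_task_vectors(theta_base, task_vectors, lambdas):
    """theta_merged = theta_0 + sum_i lambda_i * tau_i."""
    merged = {k: v.copy() for k, v in theta_base.items()}
    for tau, lam in zip(task_vectors, lambdas):
        for k in merged:
            merged[k] = merged[k] + lam * tau[k]
    return merged
```

The λᵢ are the manual knobs mentioned above: in practice they are tuned on a held-out validation set.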
TIES-Merging (Trim, Elect Sign, Merge) - Yadav et al., 2023
A principled solution to task interference via a three-step algorithm:
- Trim: zero out all but the top-k% largest-magnitude entries of each task vector (most deltas are redundant).
- Elect Sign: for each parameter, choose the sign (+/-) backed by the larger total magnitude across task vectors.
- Merge: average only the surviving deltas whose sign agrees with the elected sign; conflicting deltas are dropped rather than cancelled.
Impact: TIES-merging reduced interference dramatically. On 8-task merges, TIES achieved 96% of average single-task performance vs. 78% for naive soup. This became the de-facto standard for community model merging.
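The trim / sign-election / masked-merge steps can be sketched on raw delta arrays (a simplification: real implementations work per tensor and add a scaling factor):

```python
import numpy as np

def ties_merge(task_vectors, trim_frac=0.2):
    """TIES on a stack of task vectors with shape (n_tasks, n_params):
    (1) Trim: keep only the top trim_frac of each vector by magnitude;
    (2) Elect sign: per parameter, keep the sign with larger total mass;
    (3) Merge: average only the deltas that agree with the elected sign."""
    tv = np.array(task_vectors, dtype=float)
    # 1. Trim: zero everything below the per-vector magnitude cutoff
    for row in tv:
        cutoff = np.quantile(np.abs(row), 1 - trim_frac)
        row[np.abs(row) < cutoff] = 0.0
    # 2. Elect sign by comparing total positive vs. negative magnitude
    pos_mass = np.where(tv > 0, tv, 0).sum(0)
    neg_mass = np.where(tv < 0, -tv, 0).sum(0)
    elected = np.sign(pos_mass - neg_mass)
    # 3. Masked mean over sign-agreeing, non-zero entries
    agree = (np.sign(tv) == elected) & (tv != 0)
    counts = np.maximum(agree.sum(0), 1)
    return (tv * agree).sum(0) / counts
```

Note that parameters with exactly balanced positive and negative mass elect sign 0 and contribute nothing, which is the desired "drop the conflict" behavior.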
DARE: Drop And REscale (Yu et al., 2023)
A probabilistic complement to TIES. Instead of keeping the top-k%, randomly drop each delta parameter with probability p, then rescale the survivors by 1/(1-p):
τ̃ᵢ = (mᵢ ⊙ τᵢ) / (1 - p),  mᵢ ~ Bernoulli(1 - p)
Why rescale? To maintain the expected value: dropping 90% of params requires scaling the survivors 10x to compensate, preventing collapse.
- Key finding: Works surprisingly well even at p=0.9-0.99 (dropping 90-99% of deltas!). Most task-specific info concentrates in a sparse subset.
- Intuition: Task vectors are redundant; many params are noise. Stochastic sparsity discovers the signal.
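The drop-and-rescale step is tiny in code; a sketch over a raw delta array:

```python
import numpy as np

rng = np.random.default_rng(0)

def dare(tau, p=0.9):
    """DARE: drop each delta with probability p, rescale survivors by
    1/(1-p) so the expected value of the task vector is preserved."""
    mask = rng.random(tau.shape) >= p        # keep each entry with prob 1-p
    return (tau * mask) / (1.0 - p)
```

Averaging several independent DARE runs reduces the variance introduced by the random mask, which is why community merges often sample multiple masks.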
DARE-TIES Combination (2024)
Recent work combines DARE's sparsity with TIES' sign-election. (1) Apply DARE to sparsify task vectors, (2) apply TIES to resolve conflicts on surviving params. Result: better generalization across diverse task sets, ~3-5% improvement over either alone.
Differentiable DARE-TIES (2024-2025)
Rather than fixed sparsity/sign rules, learn optimal merge via gradient descent. Treat dropout probabilities {p_i} and combination weights {λᵢ} as learnable parameters. Optimize on a held-out validation set (e.g., MMLU subset). Result: task-aware, adaptive merges that outperform manual tuning by 2-3%. Compute cost: ~1-2 GPU hours per merge task, still trivial vs. retraining.
任务干扰:为什么朴素平均失败
When fine-tuning on Task A, the optimizer updates weights to reduce Task A loss. Task B fine-tuning updates those same weights for Task B. If the updates conflict (e.g., increase weight w for A, decrease for B), naïve averaging cancels both changes, leaving w near its base value. Neither task benefits.
Mathematical formulation: suppose Task A loss ∂L_A/∂w > 0 (increase w) and Task B loss ∂L_B/∂w < 0 (decrease w). Merging both models gives a w between them—suboptimal for both. TIES resolves this by dropping conflicts (keeping only params where signs agree), and DARE by discovering sparsity (assuming signal concentrates in non-conflicting regions).
| Method | Mechanism | Pros | Cons | Compute Cost |
|---|---|---|---|---|
| Soup | Uniform average | Trivial to implement | High interference; 15-20% perf drop | Minutes |
| Task Arith | Merge deltas w/ scalars | Interpretable; sparse ops | Still naive; manual λ tuning | Hours |
| TIES | Trim + Sign Elect + Merge | Resolves conflicts; SOTA baseline | Hyperparams (trim %, vote threshold) | Hours |
| DARE | Stochastic sparsity + rescale | Robust; works at extreme sparsity | Randomness; benefits from averaging runs | Minutes |
| Diff DARE-TIES | Learned sparsity & weights | Optimal for specific task set | Needs validation data; slow | 1-2 GPU hours |
实际应用:多任务服务与社区融合
Scenario 1: LoRA Merging — Fine-tune a base model with 5 different LoRA adapters (one per task). Each LoRA adds ~0.1% to base param count. Rather than load all LoRAs at inference, merge them into base via TIES. Single inference path, negligible overhead, near-lossless performance on all tasks.
Scenario 2: Open LLM Leaderboard — Community merges fine-tuned models (Mistral, Llama) into "super-models" (e.g., Mistral-7B-Merge-v1). These merged models sometimes outperform their constituent models on averaged benchmarks. Model merging democratized "research-grade" model optimization—any practitioner can now combine models without access to training infrastructure.
Scenario 3: Cost Reduction for Multi-Task APIs — Instead of running 10 separate models (10x VRAM, 10x latency), merge into 1. Trade-off: single-task performance drops 2-5%, but total cost drops 10x. Favorable for SaaS providers.
When Merging Helps vs. Hurts
| Scenario | Merging Outcome | Recommendation |
|---|---|---|
| 2-3 related tasks (e.g., QA variants) | ✓ Win (5-10% perf) | Use TIES; tasks share structure. |
| 5-10 diverse tasks | ~ Neutral (2-3% drop) | Use DARE or Diff-TIES; interference unavoidable. |
| >15 orthogonal tasks | ✗ Loss (>10% drop) | Avoid merging; use ensemble or multi-task training. |
| Tasks with conflicting labels (e.g., Q-A vs. reverse) | ✗ Catastrophic | Do not merge; models disagree fundamentally. |
| Serving with latency/memory constraints | ✓ Essential | Merge aggressively; cost savings dominate. |
Q: Is merging better than multi-task training?
"For post-hoc combination of existing models, merging is superior because you don't need the original data or training compute. But if you're designing from scratch, multi-task training is still better—it leverages shared structure intentionally."
Q: Why does DARE work at 99% dropout?
"Task vectors are heavily redundant. Most params carry little info; a tiny fraction (1-5%) drives the task-specific improvement. DARE discovers this sparsity stochastically. The 1% surviving params often suffice."
Q: Can you merge models from different architectures?
"Not directly—param correspondence breaks. But you can merge different sizes if one is quantized/pruned to match the other. This is active research (DistilBERT + BERT merging)."
Q: What's the merge count ceiling?
"Empirically, 5-10 tasks is comfortable. Beyond 20, interference explodes; merging fails. With Diff-TIES (learned weights), we've pushed to ~15 stable merges, but performance drops noticeably."
Diffusion Transformers: U-Net的终结
Why Vision Transformers replaced U-Net in diffusion models, and how Sora, Flux, and Stable Diffusion 3 define the new generation.
背景:扩散模型的传统架构
For the first ~5 years of diffusion models (2020-2023), the standard architecture was U-Net: encoder-decoder with skip connections, learned via DDPM and refined via DDIM. U-Net works—it powers Stable Diffusion 1.x, DALL-E 2, and countless community models.
But U-Net has fundamental limitations: (1) Fixed compute graph: depth and width are hardcoded, so scaling is awkward. (2) CNN-biased design: prioritizes local, shift-invariant patterns, so long-range dependencies are harder to capture. (3) Ecosystem mismatch: optimizations built for Transformers (FlashAttention, distributed training, quantization) don't transfer.
In 2023, researchers realized that the denoising network in a diffusion model is just a sequence model over visual tokens. If Transformers scale better for language, why not for vision? Enter Diffusion Transformers (DiT).
为什么DiT超越U-Net
DiT Architecture详解
Input Pipeline:
- Take noisy image latent z_t (from VAE encoder, ~8x compressed).
- Patchify: split into patches (e.g., 2×2 stride), linearize → sequence.
- Embed: learnable patch embedding → token dimension (e.g., 768).
- Add positional embeddings: absolute position IDs or rotary embeddings (RoPE).
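The patchify step above is a pure reshape; a numpy sketch on a single latent (the embedding and positional encoding would follow as learned layers):

```python
import numpy as np

def patchify(latent, patch=2):
    """Split a (C, H, W) latent into non-overlapping patch tokens.
    Returns (num_tokens, patch*patch*C): the sequence a DiT consumes."""
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 2, 4, 0)           # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)  # flatten to a token sequence
```

A 32×32 latent with 2×2 patches yields a 256-token sequence, which is why DiT compute scales with (H·W)/p² rather than raw pixel count.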
Conditioning: Timestep + Class via Adaptive Layer Norm (adaLN-Zero)
Classical approach: concatenate timestep & class labels to the sequence. Problem: breaks Transformer symmetry; adds extra tokens.
Better approach (DiT): Adaptive Layer Norm (adaLN): use the timestep/class embeddings to compute the layer norm affine parameters (γ, β). For each block:
h ← γ ⊙ LayerNorm(h) + β
where γ, β = MLP(timestep_embedding, class_embedding)
This modulates the representation without adding tokens. The adaLN-Zero variant zero-initializes the modulation applied to each residual branch, so every block starts as the identity function, which markedly stabilizes training at scale.
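The modulation itself is a one-liner; in this sketch γ and β are assumed to come from an MLP over the conditioning embedding (not shown):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Per-token LayerNorm without learned affine parameters."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, gamma, beta):
    """adaLN modulation: the conditioning signal enters through (gamma,
    beta) rather than through extra tokens in the sequence. With
    gamma = beta = 0 (the adaLN-Zero initialization) the output vanishes,
    so a residual block built on it starts as the identity."""
    return gamma * layer_norm(x) + beta
```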
Core: Standard Transformer Blocks
Repeat N times (e.g., N=28 for DiT-XL):
- Multi-head self-attention (e.g., 8 heads, head_dim=64).
- Residual connection + LayerNorm.
- Feed-forward (MLP with hidden_dim=4×embed_dim).
- Residual connection + LayerNorm.
Output: Noise or Velocity
Final layer projects each patch token to per-pixel predictions (e.g., 4 channels for 2×2 patch = 16 values). Predict either:
- ε (noise): the added Gaussian noise → standard DDPM loss.
- v (velocity): interpolation velocity between x_0 and x_T → often converges faster.
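Both targets at a single timestep, using the standard variance-preserving parameterization (v as defined by Salimans & Ho, 2022):

```python
import numpy as np

def ddpm_targets(x0, eps, alpha_t, sigma_t):
    """Forward-noised sample plus the two common prediction targets:
    x_t = alpha_t * x0 + sigma_t * eps  (noised input)
    v   = alpha_t * eps - sigma_t * x0  (velocity target)"""
    x_t = alpha_t * x0 + sigma_t * eps
    v = alpha_t * eps - sigma_t * x0
    return x_t, v
```

The network sees x_t (plus the timestep) and regresses either ε directly or v; v-prediction interpolates between ε-prediction and x₀-prediction across the noise schedule, which is why it often converges faster.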
DiffiT: NVIDIA的时间敏感注意力
NVIDIA's DiffiT (2024) extends DiT with Time-dependent Multihead Self Attention (TMSA): rather than using a fixed positional embedding, each diffusion timestep gets its own positional encoding. The intuition: early denoising steps focus on structure (low freq), late steps on details (high freq). TMSA adapts receptive fields per timestep.
Results: on ImageNet 256×256, DiffiT achieves FID 1.73 (SOTA at the time). Simple change, big gains. Shows that conditioning information can be baked directly into attention geometry.
Scaling: DiT-XL/2与缩放规律
DiT scaling experiments (from the original paper):
- DiT-S/2: 33M params.
- DiT-B/2: 130M params.
- DiT-L/2: 458M params.
- DiT-XL/2: 675M params, reaching FID 2.27 on class-conditional ImageNet 256×256.
FID improves smoothly and predictably with model size and training compute, mirroring language-model scaling laws.
Sora的背后:视频生成的DiT
OpenAI's Sora (2024) applies DiT to video generation. Key insight: video is just image sequences. Extend patchification to 3D:
- Spacetime patches: (t, h, w) → 1D sequence (e.g., 2×2 spatial, 1 frame chunks).
- Positional embeddings: now encode frame number + spatial location.
- Self-attention: over all spacetime tokens, enabling temporal coherence.
Sora can generate videos of variable duration, resolution, and aspect ratio, because the attention operates on spacetime patch sequences with no fixed image-size constraint. This is a major capability leap over earlier video diffusion models.
Flux: 开源DiT的成功
Black Forest Labs' Flux (2024) is a DiT-based open-source image model that commands ~40% of the image generation market (by inference volume on Replicate). Key features:
- 12B params, trained on public data.
- DiT-style rectified-flow architecture with joint text-image attention.
- FID ~11 on standard benchmarks, competitive with Stable Diffusion 3.
- Fast inference: ~1 second for 1024×1024 image on H100.
Flux's success validated two ideas: (1) DiT is the right architecture choice. (2) Open-source diffusion models can compete with commercial ones if scaled properly.
Stable Diffusion 3: 多模态DiT
Stability AI's Stable Diffusion 3 (2024) introduces MM-DiT (Multimodal Diffusion Transformer): a single Transformer handles both image tokens and text tokens. Instead of separate encoders, the Transformer attends across all modalities:
- Text tokens from a frozen language model (e.g., T5).
- Image patch tokens.
- Joint attention: a single self-attention over the concatenated text and image token sequences, with each modality keeping its own projection weights.
Benefit: native multimodal alignment—the Transformer learns how image structure relates to text semantics directly, not through separate alignment losses. SD3 shows improved text rendering and concept consistency vs. Stable Diffusion 2.
动态DiT变体
D2iT (Dynamic Diffusion Transformer): Adaptively compute attention based on input complexity. Simple images → fewer layers/heads. Complex images → full model. Reduces latency by 20-30% with minimal quality loss.
DyDiT (Dynamic Depth-Wise DiT): Prune layers based on timestep. Early denoising steps (high noise) don't need deep networks; later steps do. Skip layers early → faster inference without sacrificing quality.
Video Diffusion & Temporal Attention
Extending DiT to video requires capturing temporal dynamics. Standard approaches:
- Spacetime attention: single attention over (T, H, W) tokens (Sora approach, expensive).
- Separable attention: spatial attention per frame, then temporal attention across frames (cheaper, less coherent).
- 3D patches: group (t, h, w) voxels into tokens (reduces sequence length, better scaling).
Recent models (Runway, Stability) opt for hybrid: 3D patches + some cross-frame attention. Provides good coherence without prohibitive compute.
| Architecture | Scaling | Simplicity | Long-Range | SOTA (2024) |
|---|---|---|---|---|
| U-Net (SD 1.x) | Ad-hoc | Complex | Poor | FID ~20 |
| U-Net + Attention (SD 2.x) | Marginal | Fragile | Better | FID ~15 |
| DiT (SOTA) | Predictable | Elegant | Excellent | FID ~2-3 |
Q: Why did U-Net dominate for so long?
"Inertia. SD 1.5 worked well; the research community optimized it heavily. DiT required reimplementation of training pipelines, new infrastructure. The first DiT models (2023) were research prototypes. By 2024, enough evidence accumulated that companies were willing to retrain."
Q: Will DiT scale to trillion-parameter models?
"Probably, yes. Transformers have shown consistent scaling to 1T+ params (GPT-4, Grok). No architectural barrier for vision. The bottleneck is data diversity (image-text pairs) and compute. With enough investment, we'll see trillion-scale vision models within 2-3 years."
Q: What's the biggest remaining challenge for DiT?
"Latency at inference. Self-attention is O(N²); for high-resolution video (8K, 60fps), the sequence length becomes prohibitive. Linear attention and approximations are areas of active research."
Q: Can DiT replace all vision architectures (detection, segmentation)?
"For generation, yes. For discriminative tasks, Transformers are already standard (ViT, DINO). DiT is a natural fit for generative modeling because diffusion is fundamentally about iterative refinement—Transformer's sequential nature is a feature, not a bug."
世界模型与具体化AI:学习环境动力学
Internal environment simulators that enable prediction, planning, and counterfactual reasoning in embodied systems.
什么是世界模型
A world model is a learned, compact representation of how an environment evolves. Given current state + action, predict the next state. Given observations, infer hidden state. Given a plan, imagine future trajectories.
Unlike end-to-end RL (state → action mapping), world models decouple understanding (what will happen next) from planning (which action is best). This decomposition is powerful: a model trained on observation video can enable planning without action labels, or transfer to new tasks unseen during training.
Three canonical representations, each detailed below: latent state-space models (RSSM), joint-embedding predictive architectures (JEPA), and autoregressive token predictors.
核心管道:观察→编码→预测→规划→行动
Observe: Camera/sensor input (e.g., image, point cloud).
Encode: Compress into latent representation (VAE bottleneck, embedding layer).
Predict: RNN or Transformer forecasts k steps ahead.
Plan: Optimize action sequence under learned model (CEM, MPPI, gradient-based).
Act: Execute highest-value action; observe result; loop.
Key advantage: planning happens in latent space (10-100 dims), not pixel space (millions of dims). This makes imagining long trajectories tractable.
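The plan step can be sketched as random shooting in latent space. Here `dynamics(z, a) -> z'` and `reward(z) -> float` are stand-ins for the learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_random_shooting(z0, dynamics, reward,
                         horizon=5, n_candidates=64, action_dim=2):
    """Random-shooting planner: sample candidate action sequences, roll
    each out under the learned dynamics, score by summed predicted reward,
    and return the first action of the best-scoring sequence."""
    best_score, best_first = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        z, score = z0, 0.0
        for a in actions:
            z = dynamics(z, a)      # imagine one step ahead in latent space
            score += reward(z)
        if score > best_score:
            best_score, best_first = score, actions[0]
    return best_first
```

CEM and MPPI refine the same idea by iteratively reweighting the sampling distribution toward high-reward sequences instead of sampling uniformly.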
应用领域
Autonomous Driving: Predict pedestrian trajectories, vehicle behavior, lane evolution 5+ seconds ahead. World models enable risk-aware planning: if pedestrian crosses, steer; otherwise, maintain speed.
Robotics: Pre-train world models on unlabeled video (robot reaching, manipulation in diverse scenes). Fine-tune with small labeled action data. Examples: diffusion models for robot arm trajectory planning (Diffusion Policy), video prediction for pick-and-place.
Game AI: MuZero (DeepMind, 2020) learns a world model of Atari games without knowing rules. Plans by rolling out the learned model; achieves superhuman play. Key insight: you don't need to understand the game rules to plan in the learned latent dynamics.
Video Generation & Understanding: World models naturally extend to unconditional generation (sample trajectories) or video understanding (infer hidden causes from observations).
三种架构范式详解
RSSM-Based (DreamerV3, 2023)
Architecture: VAE encoder → latent z_t → RNN → predict z_{t+1}. During training, use latent loss + reconstruction loss. During inference, plan by sampling action sequences and scoring under the learned model.
- Pros: Efficient (small latent space); interpretable (z_t is a bottleneck).
- Cons: Lossy compression; hallucinations compound over long horizons.
JEPA-Based (Yann LeCun's Vision, adopted by Meta AI)
Joint-Embedding Predictive Architecture: learn an encoder f (for the current observation) and a target encoder g (for the future observation), with a predictor mapping f(x_t) toward g(x_{t+1}). Minimize ||pred(f(x_t)) - g(x_{t+1})||², applying a stop-gradient on the target branch to prevent collapse.
- Pros: Non-contrastive (no negatives needed); learns high-level structure; invariant to pixel details.
- Cons: Doesn't directly predict observations; harder to visualize what model learns.
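A deliberately minimal sketch of the objective (real JEPA implementations add a learned predictor head and an EMA target encoder):

```python
import numpy as np

def jepa_loss(pred_embedding, target_embedding):
    """Predictive loss in embedding space. The target side is treated as
    a constant (stop-gradient), so training cannot collapse by dragging
    both embeddings toward a trivial shared point."""
    target = np.asarray(target_embedding)    # constant target: no gradient
    diff = np.asarray(pred_embedding) - target
    return float((diff * diff).mean())
```

Because the loss lives in embedding space, the model never pays for pixel-level detail, which is exactly the "invariant to pixel details" property noted above.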
Transformer-Based Token Prediction
Tokenize observations (VQ-VAE, image-to-token models like VQGAN). Train a Transformer to predict next tokens: p(x_{t+1} | x_1,...,x_t). Enables scaling to billions of parameters; benefits from all Transformer optimizations.
Example: Genie (DeepMind, 2024) uses a Transformer to predict interactive environment tokens from video. Given a text prompt ("move left"), Genie generates plausible next frames. Works on diverse games without action labels.
MuZero: 无规则的游戏规划
MuZero (2020) is a landmark result: an RL algorithm that learns a compact world model—not of the full environment, but of the "value-relevant" dynamics. The model predicts:
- s_{t+1} (next abstract state)
- r_t (expected immediate reward)
- v_t (expected future return)
Why effective: By predicting only reward-relevant features, the model is smaller, faster to compute, and generalizes better than predicting full observables. MuZero achieved SOTA on Atari, Go, and chess with a single algorithm.
The learned model is then used for planning via Monte Carlo Tree Search (MCTS).
Genie: DeepMind 2024的交互生成
Genie (2024) is a generative interactive environment model: given a video clip from any game, Genie learns to simulate it interactively. User provides text prompts or keyboard input; Genie generates next frames.
Architecture: Transformer-based world model (token predictor). Training:
- Tokenize video frames into a sequence of codes (VQ tokens).
- Train Transformer to predict next tokens conditioned on action tokens (learned embeddings of up/down/left/right).
- Inference: iteratively sample tokens, decode back to pixels, repeat.
4D Embodied World Models (ICCV 2025)
Emerging frontier: 4D models that capture spatial layout + temporal evolution + camera motion. Instead of predicting 2D pixels, predict a 4D representation (3D voxels over time). Enables:
- View synthesis from novel viewpoints (camera motion prediction).
- Simulation from multiple simultaneous viewpoints.
- Better robotics transfer (3D understanding generalizes across camera heights/angles).
Early results show 4D models are more data-efficient and robust than 2D token models. This likely becomes standard for embodied AI in 2025-2026.
Mixture of World Models (MoWM)
Recent work (2024-2025): combine multiple world models (one for each "mode" of environment behavior). Mixture-of-Experts-style: given observations, learn which model best explains current state, then use that for planning.
Benefit: handles multimodal futures gracefully. Standard models average over modes (blurry predictions). MoWM can maintain multiple hypotheses. Essential for long-horizon planning where uncertainty compounds.
世界模型与强化学习
Model-based RL: use world model to generate synthetic trajectories ("dream" into the future). Train policy on these dream-generated samples, then deploy on real environment.
- Sample efficiency: generate unlimited synthetic experience; real environment interactions are minimized.
- Off-policy learning: world model enables learning from past data; no need to collect new trajectories for each policy version.
- Transfer: policy trained in latent space can transfer across tasks if world model is task-agnostic.
DreamerV3 (DeepMind, 2023) unified world models + policy learning: single loss optimizes both model accuracy and policy reward. Achieves competitive RL performance on Atari and control benchmarks with <10% of environment interactions compared to model-free RL.
开放问题与挑战
Sim-to-Real Transfer: World models trained in simulation often fail on real robots due to domain gap. Techniques: domain randomization, adversarial training, but still imperfect. A key research frontier.
Long-Horizon Prediction Accuracy: Errors compound exponentially. Predicting 100 frames accurately is hard; predicting 1000 frames nearly impossible. Active research: ensemble methods, uncertainty quantification, latent diffusion models.
Multimodal Futures: In stochastic environments (humans in room), many futures are plausible. Model must maintain multi-hypothesis beliefs. Standard single-mode prediction fails.
Computational Cost: Planning via world models is expensive: ~1000 forward passes per action. Techniques: distillation (student policy learns from teacher world model planning), but still nascent.
Q: When is a world model better than end-to-end learning?
"When you have limited action-labeled data but abundant observation data (video). World models exploit this asymmetry. If you have abundant action-labeled RL data, end-to-end might be simpler. But for vision-based robotics, world models are currently the way."
Q: Will world models replace reinforcement learning?
"Not replace—complement. Humans combine both: we have world models (predict consequences) and cached policies (habits). Hybrid is likely the future: world model for novel situations, policy for familiar ones."
Q: How do you handle long-horizon planning errors?
"Uncertainty quantification is key. Instead of point predictions, predict distributions. Then plan to maximize expected reward under worst-case scenarios (robust control). This is still an open problem."
Q: Can world models learn causality?
"Partially. If you train on interventional data (agent causes changes), models can learn causal structure. But purely from observational video, causal discovery is hard. This is a frontier for next-generation world models."
Q: What's the smallest world model that's useful?
"Surprisingly small—10M-100M params can capture essential dynamics for simple domains. Scaling doesn't always help if the task is simple. Efficiency (small models, low latency) is underexplored in world modeling."