Diagnosing Agent Failures with the Strands Evals Detector in the AWS China Regions When Bedrock Is Unavailable


zhaojiew10 · Published 2026-06-28 11:36:02


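What the Detector is

Imagine you run an AI agent service. Users report that the agent is performing badly today, and the dashboard shows the success rate has dropped from 95% to 78%. Monitoring tells you that the score fell, but not why. To dig deeper you open the trace logs: several thousand traces per day, dozens of spans per trace, with spans covering every round of agent reasoning, every tool call, and every LLM exchange. Reading them one by one is not feasible.

The Strands Evals Detector is built for this scenario. It works like a hospital pathology lab: you hand it an agent execution session (the span tree of one complete conversation), and it tells you exactly what is broken, what kind of failure it is, and how to fix it.

What is a session? Picture one complete conversation between the agent and a user. The agent first interprets the user's request (span 1), decides to call the weather tool (span 2, a tool call), the LLM generates a reply from the weather result (span 3), and the reply is returned to the user (span 4). Ordered by time, these four operations form a tree. That tree is the session, and it is what the Detector analyzes.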
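As a rough illustration of the span tree described above, the sketch below models the four steps with a hypothetical `Span` dataclass (it is not the Strands Evals `Session` type). The field names, the span kinds, and the extra root span that holds the four steps together are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for one traced operation."""
    span_id: str
    kind: str
    parent_id: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> List[Span]:
    """Link each span to its parent and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is not None:
            parent.children.append(s)
        else:
            roots.append(s)
    return roots


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


spans = [
    Span("root", "agent_invocation"),            # assumed root span
    Span("1", "understand_request", parent_id="root"),
    Span("2", "tool_call:weather", parent_id="root"),
    Span("3", "generate_reply", parent_id="root"),
    Span("4", "return_reply", parent_id="root"),
]
roots = build_tree(spans)
for depth, span in walk(roots[0]):
    print("  " * depth + f"{span.kind} ({span.span_id})")
```

A detector that inspects spans one by one only needs the flat list, while root cause analysis benefits from the parent links, which is why the tree shape matters.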

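The Detector runs a two-phase pipeline.

Phase 1, detect_failures, inspects each span and labels it against predefined failure categories. For example, if the tool call in span 2 times out, it is labeled execution-error-category-timeout; if the LLM in span 3 invents a method that does not exist, it is labeled hallucination-category-hall-capabilities.
Phase 2, analyze_root_cause, links these failures into a causal chain: the tool timeout is the root cause (PRIMARY), which forces the agent to make up a reply (SECONDARY), which leaves the user dissatisfied (TERTIARY). Fix recommendations are layered the same way: add timeout retries at the root-cause layer, and improve the fallback strategy at the knock-on layer.

The entry point diagnose_session(session, model=...) chains the two phases. session is the span tree and model is the LLM that performs the analysis. The output is a structured diagnosis: which span failed, in which category, with what confidence, based on what evidence, what the root cause is, and how to fix it.

In the global regions this works out of the box, provided Amazon Bedrock is available. The Detector's LLM inference goes through Bedrock by default: if the model argument of diagnose_session is a string (for example "claude-3-sonnet"), it is wrapped in a BedrockModel. The default trace source in the global regions is the Bedrock AgentCore runtime, read from log groups under /aws/bedrock-agentcore/runtimes/ in the body.input/output format.

The China regions (cn-north-1 / cn-northwest-1) have no Bedrock, so neither that LLM provider nor that trace source exists, and all three layers of dependency break. This article records how we validated a replacement on cn-north-1: Zhipu GLM-4.7 stands in for Bedrock inference, and a self-built format bridge stands in for the AgentCore conversion layer.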

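Adapting the Bedrock dependencies for the China regions

The pipeline entry point diagnose_session(session, model=...) takes two key parameters.

Where does session come from? The agent runtime writes spans to CloudWatch Logs through OpenTelemetry; CloudWatchProvider later reads them back with a Logs Insights query, and a mapper converts them into a Session object.
How is model resolved? The internal _resolve_model function decides.

Bedrock hides in both steps. A closer look shows that it reaches into three layers.

Model layer

_resolve_model turns the model argument into an instance of the Strands SDK Model class. Model is the SDK's model abstraction, implemented by BedrockModel, OpenAIModel, and others. The logic contains two Bedrock fallback paths: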

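def _resolve_model(model: Model | str | None) -> Model:
    if model is None:
        return BedrockModel(model_id=DEFAULT_DETECTOR_MODEL)  # fallback 1: default Bedrock model
    if isinstance(model, str):
        return BedrockModel(model_id=model)  # fallback 2: any string becomes a Bedrock model ID
    return model  # only a Model instance bypasses Bedrock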

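Running strands-evals diagnose --model gpt-4o passes a string, which is wrapped in a BedrockModel and fails silently in the China regions. You must pass a Model instance (such as OpenAIModel) instead.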
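To make the fallback concrete, here is a self-contained sketch that mirrors the branching of the snippet above with stand-in classes. The `Model`, `BedrockModel`, and `OpenAIModel` classes below are simplified placeholders, not the Strands SDK classes, and the default model ID and base URL are made-up values. The sketch also adds a hypothetical guard, `require_model_instance`, that a China-region wrapper could use to reject strings early instead of letting them fall through to Bedrock.

```python
from typing import Union

DEFAULT_DETECTOR_MODEL = "placeholder-default-model-id"  # assumed value


class Model:
    """Placeholder for the SDK's model abstraction."""


class BedrockModel(Model):
    def __init__(self, model_id: str):
        self.model_id = model_id


class OpenAIModel(Model):
    def __init__(self, model_id: str, base_url: str):
        self.model_id = model_id
        self.base_url = base_url


def resolve_model(model: Union[Model, str, None]) -> Model:
    """Same branching as the snippet above, using the placeholder classes."""
    if model is None:
        return BedrockModel(model_id=DEFAULT_DETECTOR_MODEL)
    if isinstance(model, str):
        return BedrockModel(model_id=model)
    return model


def require_model_instance(model: Union[Model, str, None]) -> Model:
    """Hypothetical guard: refuse anything that would end up on Bedrock."""
    if not isinstance(model, Model) or isinstance(model, BedrockModel):
        raise TypeError("pass a non-Bedrock Model instance, not a string or None")
    return model


glm = OpenAIModel(model_id="glm-4.7", base_url="https://example.invalid/v1")
print(type(resolve_model("gpt-4o")).__name__)  # BedrockModel
print(type(resolve_model(glm)).__name__)       # OpenAIModel
```

Failing fast with a `TypeError` is a design choice for this sketch: it turns a misconfiguration that would otherwise surface as a remote call failure into an immediate local error.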

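Data layer

The OTel spans produced by the agent runtime follow the gen_ai.* semantic conventions, the standard attribute prefix that the OpenTelemetry community defines for GenAI workloads. Message content is embedded in span events such as gen_ai.user.message and gen_ai.choice.

When CloudWatchProvider pulls data from CloudWatch Logs (CWLog), however, the actual flow is: CloudWatchProvider queries the raw logs with Logs Insights → CloudWatchLogsParser parses and normalizes them → a mapper converts them into a Session. The provider expects the body.input/output format of the Bedrock Converse API:

// Produced by Strands (gen_ai.* standard)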
{
  "attributes": {
    "gen_ai.operation.name": "chat"
  },
  "events": [
    {
      "name": "gen_ai.user.message",
      "attributes": {
        "content": "..."
      }
    }
  ]
}

// Expected by CloudWatchProvider (Bedrock-specific)
{
  "body": {
    "input": {
      "messages": [
        {
          "role": "user",
          "content": {
            "text": "..."
          }
        }
      ]
    },
    "output": {
      "messages": [...]
    }
  }
}

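The two formats share no structure. In the global regions, where the Bedrock AgentCore runtime is used, CloudWatchProvider reads the body.input/output format from log groups under /aws/bedrock-agentcore/runtimes/. The China regions have no Bedrock, so that data source is unavailable.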
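To make the gap between the two shapes concrete, the sketch below converts the `gen_ai.*` sample above into the `body.input/output` shape. It only illustrates the mismatch and is not the bridge used later in this article. The function name `gen_ai_to_body` and the role mapping are assumptions, and the event attribute key `content` is taken from the sample above rather than from any library contract.

```python
def gen_ai_to_body(span: dict) -> dict:
    """Map gen_ai.* span events onto a body.input/output structure (illustrative)."""
    input_messages, output_messages = [], []
    for event in span.get("events", []):
        name = event.get("name", "")
        content = (event.get("attributes") or {}).get("content", "")
        message_text = {"text": content}
        if name == "gen_ai.choice":
            # Model output goes under body.output
            output_messages.append({"role": "assistant", "content": message_text})
        elif name.startswith("gen_ai.") and name.endswith(".message"):
            # "gen_ai.user.message" -> role "user"
            role = name[len("gen_ai."):-len(".message")]
            input_messages.append({"role": role, "content": message_text})
    return {
        "body": {
            "input": {"messages": input_messages},
            "output": {"messages": output_messages},
        }
    }


sample_span = {
    "attributes": {"gen_ai.operation.name": "chat"},
    "events": [
        {"name": "gen_ai.user.message", "attributes": {"content": "What's the weather?"}},
        {"name": "gen_ai.choice", "attributes": {"content": "Let me check."}},
    ],
}
print(gen_ai_to_body(sample_span))
```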

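Type layer

Even once the data is in CWLog and has been read back, a mapper still has to convert it into a Session. detect_otel_mapper picks a mapper automatically from the span's scope.name: when it sees "strands.telemetry.tracer", it selects StrandsInMemorySessionMapper.

But that mapper's signature is map_to_session(data: list[ReadableSpan]). ReadableSpan is an in-memory object from the OpenTelemetry SDK, and the mapper accesses attributes such as span.context.trace_id and span.events directly. What comes back from CWLog is a JSON dict without those attributes, so the access raises AttributeError.

The format bridging section solves this with the adapter pattern: each dict is wrapped in an object that duck-types as ReadableSpan.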
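The failure mode is easy to reproduce without OpenTelemetry installed. In the sketch below, `read_trace_id` is a hypothetical stand-in for mapper code that uses attribute access. A plain dict fails, while a thin wrapper built from `types.SimpleNamespace` satisfies the same access pattern.

```python
from types import SimpleNamespace


def read_trace_id(span) -> str:
    """Stand-in for mapper code that expects ReadableSpan-style attributes."""
    return format(span.context.trace_id, "032x")


record = {"trace_id": "a1b2c3d4e5f678901234567890123456"}

try:
    read_trace_id(record)
except AttributeError as exc:
    print(f"dict fails: {exc}")

# Wrap the dict so that attribute access works; hex string -> int as the caller expects
wrapped = SimpleNamespace(
    context=SimpleNamespace(trace_id=int(record["trace_id"], 16))
)
print(read_trace_id(wrapped))
```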

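The Detector's 10 categories and 33 subcategories

What can the Detector recognize? That depends on the failure categories defined in its prompt. The prompt lives in src/strands_evals/detectors/prompt_templates/failure_detection/failure_detection_v0.py and defines 10 parent categories and 33 subcategories, covering the full spectrum of AI agent failures.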

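Execution errors (execution-error)

The most common category, covering failures at the tool-call level:

authentication: credential or permission problems (401/403). The agent calls the weather API and gets "Invalid API key".
resource-not-found: the resource does not exist (404). The agent queries the weather for a city the API does not cover, such as an overseas city when the API only supports cities in China.
service-errors: upstream service failure (500). The weather service itself is down.
rate-limiting: request rate exceeded (429). 100 calls within one minute get throttled.
formatting: malformed output. The JSON returned by the LLM is missing a closing brace.
timeout: the operation timed out. No response arrives within 30 seconds.
resource-exhaustion: resources ran out. Processing fails because of insufficient memory.
environment: missing environment configuration. OPENAI_API_KEY is not set.
tool-schema: tool input or output does not match the schema. A required field is missing.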

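Hallucination (hallucination)

Fabrication behavior specific to LLMs:

hall-capabilities: claiming a capability that does not exist. "I drew a picture with the AI image generator", when no such tool exists.
hall-usage: claiming to have used a tool that was not used. "I have checked the weather", when no API call was made.
hall-history: citing conversation history that does not exist. "You said earlier that you want to go to Paris", when the user never mentioned it.
hall-params: using parameters that conflict with the context. The user says "fly to Beijing" but the agent queries "Shanghai".
fabricate-tool-outputs: forging tool output. The agent reports "the weather is sunny" although the API returned an error.
hall-misunderstand: misreading a tool response. An error code is handled as if it were a success.

Orchestration errors (orchestration-related-errors)

Failures at the task-planning level:

reasoning-mismatch: reasoning and execution diverge. The plan was to verify identity, but the agent skips that and replies directly.
goal-deviation: drifting from the original goal. A simple lookup turns into an elaborate analysis and wanders further and further off.
premature-termination: ending too early. The conversation ends before the payment is confirmed.
unaware-termination: failing to recognize the completion signal. The task is done but the agent keeps asking "What else do you need?"

Incorrect actions (incorrect-actions)

Mistakes in tool-use strategy:

tool-selection: wrong tool chosen. Using general_search() to look up a flight when check_flight_status() should be used.
poor-information-retrieval: poor use of a tool. The search keywords are too broad and return irrelevant results.
clarification: proceeding with incomplete information. Booking a hotel without confirming the dates.
inappropriate-info-request: asking the user for something the agent can get itself. Asking "What is the exchange rate?" when a currency_api tool is available.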

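Repetitive behavior (repetitive-behavior)

Pointless loops:

repetition-tool: calling the same tool repeatedly. Three consecutive calls to check_flight_status(flight_id="123").
repetition-info: asking again for information the user has already given. Asking "Where are you going?" after the user has already said.
step-repetition: repeating workflow steps. The verification step runs three times.

Task instruction (task-instruction)

Instruction-following problems:

non-compliance: instructions not followed. The system prompt says "authenticate before acting", but the agent acts directly.
problem-id: wrong path chosen. The agent knows it must fix SQL indentation but opens the wrong file.

Context handling errors (context-handling-error)

Memory gaps:

context-handling-failures: context suddenly lost. In the middle of a booking, the agent asks "Hello, how can I help you?"

LLM output problems (llm-output)

Abnormal output quality:

nonsensical: meaningless output. The reply contains "According to my system prompt..." or a [MASK] placeholder.

Configuration mismatch (configuration-mismatch)

System configuration errors:

tool-definition: the tool does not match its definition. It is declared as a calculator but actually performs search.

Coding-specific failures (coding-use-case-specific-failure-types)

Specific to code-generation scenarios:

edge-case-oversights: boundary conditions not handled
dependency-issues: dependency handling failed

These 33 categories can overlap. A single span may carry several failures at once, for example an agent that fabricates tool output (hallucination) and does so three times (repetitive-behavior). The Detector labels all of them, each with its own confidence and evidence.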
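The labels quoted earlier (for example `execution-error-category-timeout`) appear to follow a `<parent>-category-<subcategory>` pattern. Assuming that pattern holds, the sketch below splits labels and groups several findings that land on the same span. The `Finding` dataclass, its fields, and the confidence values are hypothetical, not the Detector's output schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Finding:
    """Hypothetical shape of one detected failure."""
    span_id: str
    label: str
    confidence: float


def split_label(label: str) -> Tuple[str, str]:
    """Split '<parent>-category-<sub>' into (parent, sub)."""
    parent, separator, sub = label.partition("-category-")
    if not separator or not parent or not sub:
        raise ValueError(f"unexpected label format: {label!r}")
    return parent, sub


def group_by_span(findings: List[Finding]) -> Dict[str, List[Tuple[str, str, float]]]:
    """Collect every (parent, sub, confidence) triple per span."""
    grouped = defaultdict(list)
    for finding in findings:
        parent, sub = split_label(finding.label)
        grouped[finding.span_id].append((parent, sub, finding.confidence))
    return dict(grouped)


findings = [
    Finding("span_3", "hallucination-category-fabricate-tool-outputs", 0.8),  # illustrative values
    Finding("span_3", "repetitive-behavior-category-repetition-tool", 0.7),
]
for span_id, items in group_by_span(findings).items():
    print(span_id, items)
```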

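Core validation

The first step is to confirm that the core path runs on a non-Bedrock model. We wrote a demo agent that fails on purpose: its weather tool raises ConnectionError, so the agent's tool call fails. Traces stay on the pure in-memory path, and the detector uses an OpenAIModel instance pointed at GLM-4.7.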

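from opentelemetry import baggage, context

# build_agent() and build_model() are helpers not shown in this article
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
ctx = baggage.set_baggage("session.id", session_id)
token = context.attach(ctx)  # make the session.id baggage current while the agent runs
try:
    agent = build_agent()  # OpenAIModel(GLM-4.7)
    result = agent("What's the weather in Beijing?")
finally:
    context.detach(token)  # always restore the previous context

spans = list(telemetry.in_memory_exporter.get_finished_spans())
session = StrandsInMemorySessionMapper().map_to_session(spans, session_id)
diagnosis = diagnose_session(session, model=build_model())  # a Model instance, not a string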

Output:

6 spans captured, 4 mapped to Session
detect_failures → execution-error-category-service-errors, confidence=0.90
  Evidence: ConnectionError - Weather service unavailable
analyze_root_cause → PRIMARY_FAILURE
  Fix: Implement retry logic with exponential backoff + fallback data sources
→ China-region core path: VALIDATED ✓

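_resolve_model receives the OpenAIModel instance and passes it straight through without touching BedrockModel. Both the streaming text-mode detection in Phase 1 and the structured-output root cause analysis in Phase 2 work on GLM-4.7. The failure is correctly identified as execution-error-category-service-errors, a subcategory of the execution-error parent category, with a confidence of 0.90.

Format bridging

Core validation bypassed CWLog: spans went straight through memory and never touched CloudWatch. Production, however, needs durable storage and after-the-fact analysis, so the Agent → CWLog → Detector chain has to work end to end.

The underlying problem: three breaks

Connecting the chain exposed three break points, each of which needs a bridge:

Break 1: format conversion

The Strands SDK emits the gen_ai.* OTel standard format (gen_ai.user.message, gen_ai.choice), while CloudWatchProvider expects the Bedrock body.input/output format. The global regions read the latter from /aws/bedrock-agentcore/runtimes/; the China regions have no such source, so we have to write the conversion ourselves.

Break 2: events overwritten

CloudWatchLogsParser overwrites a span's embedded events with the standalone event records:

span["span_events"] = events_by_span_id.get(span_id, [])  # overwrites, does not merge

If the events are embedded in the span record and no standalone event records exist, this line replaces them with an empty list and the messages are lost.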
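The effect can be reproduced with plain dicts. The helper names `attach_events_overwrite` and `attach_events_merge` are made up for this example; only the assignment in the first helper mirrors the parser line quoted above. The merge variant is shown for contrast and is not something the parser does.

```python
import copy


def attach_events_overwrite(spans, events_by_span_id):
    """Mirror of the quoted behavior: embedded events are replaced."""
    for span in spans:
        span["span_events"] = events_by_span_id.get(span["span_id"], [])
    return spans


def attach_events_merge(spans, events_by_span_id):
    """Contrast case: keep embedded events and append standalone ones."""
    for span in spans:
        embedded = span.get("span_events") or []
        span["span_events"] = embedded + events_by_span_id.get(span["span_id"], [])
    return spans


spans = [{"span_id": "s1", "span_events": [{"name": "gen_ai.user.message"}]}]
no_standalone_events = {}  # nothing was written as a separate event record

overwritten = attach_events_overwrite(copy.deepcopy(spans), no_standalone_events)
merged = attach_events_merge(copy.deepcopy(spans), no_standalone_events)
print(len(overwritten[0]["span_events"]))  # 0: the embedded message is gone
print(len(merged[0]["span_events"]))       # 1: the embedded message survives
```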

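Break 3: type mismatch

StrandsInMemorySessionMapper expects list[ReadableSpan], in-memory objects from the OpenTelemetry SDK, and accesses attributes such as span.context.trace_id and span.events directly. CWLog returns JSON dicts without those attributes, which raises AttributeError.

Bridge: the write side

This addresses breaks 1 and 2 by writing spans to CWLog in the right format: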

def spans_to_cwlog_records(spans, session_id=""):
    records = []
    for span in spans:
        attrs = dict(span.attributes) if span.attributes else {}
        if session_id:
            attrs["session.id"] = session_id          # flat key, looked up by the mapper
            attrs["session"] = {"id": session_id}     # nested form, needed by the Logs Insights filter

        # Span record (has startTimeUnixNano → _is_span() = True)
        records.append({
            "traceId": format(span.context.trace_id, "032x"),
            "spanId": format(span.context.span_id, "016x"),
            "startTimeUnixNano": str(span.start_time),  # written as strings; converted back with int() on read
            "endTimeUnixNano": str(span.end_time),
            "attributes": attrs,
            "scope": {"name": span.instrumentation_scope.name}
        })

        # Event record (has event.name → _is_event() = True; linked to the span by spanId)
        for event in span.events or []:
            event_attrs = {"event.name": event.name}
            event_attrs.update(dict(event.attributes or {}))
            if session_id:
                event_attrs["session"] = {"id": session_id}
            records.append({
                "spanId": format(span.context.span_id, "016x"),
                "attributes": event_attrs
            })
    return records

Example of the records actually written to CWLog (one InferenceSpan plus two events):

// Span record (_is_span() = True because it has startTimeUnixNano)
{
  "traceId": "a1b2c3d4e5f678901234567890123456",
  "spanId": "1234567890abcdef",
  "startTimeUnixNano": "1751073200000000000",
  "endTimeUnixNano": "1751073201000000000",
  "attributes": {
    "gen_ai.system": "openai",
    "gen_ai.operation.name": "chat",
    "session.id": "e2e-20250627-001",
    "session": {"id": "e2e-20250627-001"}
  },
  "scope": {"name": "strands.telemetry.tracer"}
}

// Event record 1 (_is_event() = True because it has event.name)
{
  "spanId": "1234567890abcdef",
  "attributes": {
    "event.name": "gen_ai.user.message",
    "gen_ai.user.message": "What's the weather in Beijing?",
    "session": {"id": "e2e-20250627-001"}
  }
}

// Event record 2
{
  "spanId": "1234567890abcdef",
  "attributes": {
    "event.name": "gen_ai.choice",
    "gen_ai.choice": "I'll check the weather for you.",
    "session": {"id": "e2e-20250627-001"}
  }
}

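Key design points:

Span records and event records are written separately and linked by spanId, which avoids the overwrite
The flat session.id key is for the mapper's lookup; the nested session dict is for the Logs Insights query (filter attributes.session.id)
Timestamps are written as strings to satisfy CWLog JSON serialization, and _DictSpanAdapter converts them back with int() when reading

All records of one session go to the same log stream (session-{session_id}), for example session-e2e-20250627-001. Writing the records and then querying every span and event of that session in CloudWatch Logs looks like this:

# The actual write call
client.put_log_events(
    logGroupName="/aws/agents/cn-eval-test",
    logStreamName="session-e2e-20250627-001",
    logEvents=[
        {"timestamp": 1751073200000, "message": json.dumps(span_record)},
        {"timestamp": 1751073200001, "message": json.dumps(event_record)},
        # ... 6 records in total
    ]
)

query = "filter attributes.session.id like 'e2e-20250627-001'"
# Returns all matching records, which all live in that log stream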
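A minimal sketch of how the `logEvents` payload could be assembled from the span and event records. `build_log_events` is a hypothetical helper. It relies on two documented `PutLogEvents` constraints: timestamps are milliseconds since the epoch, and the events in a batch must be in chronological order. Assigning each record the base timestamp plus its index, as in the call above, is a simplification assumed here.

```python
import json
from typing import Dict, List


def build_log_events(records: List[dict], base_timestamp_ms: int) -> List[Dict[str, object]]:
    """Serialize records into PutLogEvents entries with increasing timestamps."""
    return [
        {"timestamp": base_timestamp_ms + offset, "message": json.dumps(record)}
        for offset, record in enumerate(records)
    ]


records = [
    {"spanId": "1234567890abcdef", "startTimeUnixNano": "1751073200000000000"},
    {"spanId": "1234567890abcdef", "attributes": {"event.name": "gen_ai.user.message"}},
]
log_events = build_log_events(records, base_timestamp_ms=1751073200000)
for entry in log_events:
    print(entry["timestamp"], entry["message"])
```

The resulting list can be passed as the `logEvents` argument of the call above. Real batches also have size limits, which this sketch does not handle.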

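Bridge: the read side

This mainly addresses the type-layer break (CWLog returns JSON dicts, but StrandsInMemorySessionMapper expects ReadableSpan objects). The adapter pattern wraps each dict in an object that duck-types as ReadableSpan, so the mapper can read CWLog data without modification: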

from types import SimpleNamespace

class _DictSpanAdapter:
    """Wrap a dict so it duck-types as ReadableSpan and the parent class's 460 lines of logic can be reused."""
    def __init__(self, data: dict):
        self.name = data.get("name", "")
        # startTimeUnixNano from CWLog is a string; the parent computes start_time / 1e9 and needs an int.
        # "or 0" is applied before int() so that a missing or None value does not raise.
        self.start_time = int(data.get("start_time") or 0)
        self.end_time = int(data.get("end_time") or 0)
        self.attributes = data.get("attributes") or {}
        # hex string → int; the parent calls format(trace_id, "032x") and needs an int
        self.context = SimpleNamespace(
            trace_id=int(data.get("trace_id") or "0", 16),
            span_id=int(data.get("span_id") or "0", 16))
        self.parent = SimpleNamespace(
            span_id=int(data["parent_span_id"], 16)) \
            if data.get("parent_span_id") else None
        # _DictEventAdapter (not shown here) wraps each event dict in the same way
        self.events = [_DictEventAdapter(e) for e in data.get("span_events") or []]

class StrandsCloudWatchSessionMapper(StrandsInMemorySessionMapper):
    def map_to_session(self, data: list[dict], session_id: str) -> Session:
        adapted = [_DictSpanAdapter(d) for d in data]
        return super().map_to_session(adapted, session_id)

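Bridging also ran into details such as converting timestamp strings and parsing dot-separated nested keys; the solutions are noted in the code comments.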
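The article does not show that code, so the following is only an assumption about what such helpers might look like. `nanos_to_seconds` accepts the string form of a nanosecond timestamp, and `unflatten` expands dot-separated keys (such as `attributes.session.id`) into nested dicts. Both names are hypothetical.

```python
def nanos_to_seconds(value) -> float:
    """Convert a nanosecond timestamp given as str, int, or None into seconds."""
    return int(value or 0) / 1e9


def unflatten(flat: dict) -> dict:
    """Expand dot-separated keys into nested dicts."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        for part in parts[:-1]:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create the intermediate level (or replace a scalar that is in the way)
                child = {}
                node[part] = child
            node = child
        node[parts[-1]] = value
    return nested


row = {
    "spanId": "1234567890abcdef",
    "startTimeUnixNano": "1751073200000000000",
    "attributes.session.id": "e2e-20250627-001",
    "attributes.gen_ai.system": "openai",
}
print(nanos_to_seconds(row["startTimeUnixNano"]))
print(unflatten(row)["attributes"])
```

Note that full unflattening turns `session.id` into a nested `session` dict, which is why the write side above stores both a flat key and a nested form.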

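Validation results

This validates the full Agent → CWLog → Detector chain. The agent is configured with a weather tool that fails on purpose (ConnectionError), the trace is written to CWLog in cn-north-1, and the Detector reads it back from CWLog and diagnoses it.

Agent execution log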

[2026-06-27T16:45:12] Agent started | session_id=e2e-20250627-001
[2026-06-27T16:45:13] Calling tool: weather_api(city="Beijing")
[2026-06-27T16:45:13] Tool error: ConnectionError - Weather service unavailable
[2026-06-27T16:45:14] LLM generating response with error context
[2026-06-27T16:45:15] Agent completed | 4 spans, 6 events

→ Writing to CWLog: /aws/agents/cn-eval-test
→ 6 records written (2 span + 4 event)
→ Waiting for Logs Insights indexing...

CloudWatch query log

attempt=<1> | CWLog indexing not ready, retrying in 5s...
attempt=<2> | CWLog indexing not ready, retrying in 5s...
attempt=<3> | trace_count=<1>, span_count=<4> | read from CWLog ✓

Query: filter attributes.session.id like 'e2e-20250627-001'
Results: 6 records (2 spans + 4 events)
- span: InferenceSpan (gen_ai.system=openai, gen_ai.operation.name=chat)
- span: ToolExecutionSpan (tool.name=weather_api)
- span: InferenceSpan (error handling)
- span: AgentInvocationSpan (completion)
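The retry behavior visible in the query log above can be sketched as a generic polling helper. `poll_until_ready` and its parameters are hypothetical; `query` stands for any callable that runs the Logs Insights query and returns a possibly empty list of records. The five-second delay matches the log, while the attempt limit is an assumption.

```python
import time
from typing import Callable, List


def poll_until_ready(
    query: Callable[[], List[dict]],
    max_attempts: int = 6,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call query() until it returns records or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        records = query()
        if records:
            return records
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait for indexing to catch up
    raise TimeoutError(f"no records after {max_attempts} attempts")


# Demo with a fake query that becomes ready on the third call
state = {"calls": 0}


def fake_query() -> List[dict]:
    state["calls"] += 1
    return [] if state["calls"] < 3 else [{"spanId": "1234567890abcdef"}]


print(poll_until_ready(fake_query, sleep=lambda _seconds: None))
```

Injecting `sleep` keeps the helper testable without real waiting.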

Detector diagnosis

=== FAILURE DETECTION ===
Location: span_2 (ToolExecutionSpan)
Category: execution-error-category-service-errors
Confidence: 0.90
Evidence: ConnectionError - Weather service unavailable

Location: span_3 (InferenceSpan)
Category: task-instruction-category-non-compliance
Confidence: 0.75
Evidence: Agent did not retry or use fallback after tool failure

=== ROOT CAUSE ANALYSIS ===
Primary Failure: Tool execution error (ConnectionError)
Secondary Effect: Agent proceeded without proper error handling
Tertiary Effect: User received incomplete response

Fix Recommendation:
  Type: system_prompt
  Action: Add retry logic with exponential backoff and fallback data sources
  Code: Implement @retry decorator on weather_api tool

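Mapper unit test comparison

The bridge rests on one assumption: once a dict is wrapped as a duck-typed ReadableSpan, the mapper should produce exactly the same result as the native path. If that does not hold, the whole format bridge is unreliable.

Comparison design:

Path A (native): InMemorySpanExporter → StrandsInMemorySessionMapper
Path B (bridged): CwLogSpanExporter → CloudWatchLogsParser → _DictSpanAdapter → StrandsInMemorySessionMapper

Both paths feed the same mapper and the outputs are compared. The result: _DictSpanAdapter neither drops nor alters data, and the bridged data flow is equivalent to the native one, so it is safe to use.

Path A (native ReadableSpan): 4 spans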
  - InferenceSpan: prompt="What's the weather?", response=ToolCall(weather_api)
  - ToolExecutionSpan: error=ConnectionError
  - InferenceSpan: error_context="Service unavailable"  
  - AgentInvocationSpan: completion="Sorry..."

Path B (CWLog → dict → adapter): 4 spans
  - InferenceSpan: prompt="What's the weather?", response=ToolCall(weather_api)
  - ToolExecutionSpan: error=ConnectionError
  - InferenceSpan: error_context="Service unavailable"
  - AgentInvocationSpan: completion="Sorry..."

✓ Session equivalence: PASS
✓ Span count: 4 == 4
✓ Types match: InferenceSpan, ToolExecutionSpan, InferenceSpan, AgentInvocationSpan
✓ Messages extracted: 6 user messages, 4 assistant messages

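All three layers pass. The core path yields one failure (execution-error-category-service-errors, 0.90) and one root cause (PRIMARY plus a fix recommendation). In the mapper unit test, the two paths produce identical Sessions (4 == 4 spans, with types, IDs, prompts, and responses all matching).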
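An equivalence check of this kind can be expressed as a small diff helper. This is a sketch under assumptions: spans are represented as plain dicts, and the compared field names (`type`, `span_id`, `prompt`, `response`) are placeholders for whatever the real Session objects expose.

```python
from typing import List, Sequence


def diff_sessions(
    path_a: List[dict],
    path_b: List[dict],
    fields: Sequence[str] = ("type", "span_id", "prompt", "response"),
) -> List[str]:
    """Return human-readable differences; an empty list means equivalent."""
    differences = []
    if len(path_a) != len(path_b):
        differences.append(f"span count: {len(path_a)} != {len(path_b)}")
    for index, (a, b) in enumerate(zip(path_a, path_b)):
        for name in fields:
            if a.get(name) != b.get(name):
                differences.append(
                    f"span {index} field {name!r}: {a.get(name)!r} != {b.get(name)!r}"
                )
    return differences


native = [
    {"type": "InferenceSpan", "span_id": "01", "prompt": "What's the weather?"},
    {"type": "ToolExecutionSpan", "span_id": "02"},
]
bridged = [dict(span) for span in native]  # stand-in for the CWLog round trip

print(diff_sessions(native, bridged) or "Session equivalence: PASS")
```

Returning the list of differences rather than a boolean makes a failing comparison easier to debug.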

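Conclusion

At this point, the three gaps identified at the start are all closed.

The model-layer problem was the simplest: bypass the Bedrock fallback in _resolve_model by handing an OpenAIModel instance straight to diagnose_session, and GLM-4.7 produces both failure detection and root cause analysis.
The data-layer problem was the most tedious: the CloudWatch Logs format, indexing, and dotted-key parsing all had to line up in every detail.
The type-layer problem had the neatest fix: _DictSpanAdapter relies on Python duck typing to let a JSON dict pass as a ReadableSpan, reusing the existing mapper instead of rewriting it.

The China regions have no Bedrock, but the Detector runs as long as an OpenAI-compatible endpoint is available. There is no AgentCore format conversion either, but a thin bridge on the write side and another on the read side is enough to get the data flowing.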
