Unified Chat / Agent Design

Status: Draft, revised after first review round
Date: 2026-05-26
Author: mg (with Claude)

Summary

Replace the current dual-path chat architecture (regular chat vs. agent_mode → forced planning) with a single unified loop: an LLM-with-tools agent that decides for itself whether to answer directly, make a single tool call, or decompose work into recursive subtasks. Planning becomes a primitive (run_subtask), not a hardcoded mode.

The UI surface collapses to one chat composer with no mode selector and no tool picker. The websocket runner exposes one chat-agent route (alongside the existing workflow-run and media-generation routes). The CLI deprecates its agent-mode flag. Toolbelt composition is context- and policy-driven on the server, not configuration-driven from the client.

This document is the destination. The implementation will be staged across multiple PRs (see Migration below).

Motivation

Today the chat UI forces a choice between two architecturally different paths:

This is overkill for trivial questions and forces users to predict ahead of time which mode a task needs. It also hard-codes a control flow around the model — the opposite of what we want per Sutton’s bitter lesson. The model should decide whether to decompose.

The fix: give the model primitives, let it pick the shape.

Design choices

The design was settled through brainstorming. The key forks and the chosen options:

| Fork | Choice | Why |
| --- | --- | --- |
| Granularity of “planning” | Primitives (not an atomic plan_and_execute) | Bitter-lesson: give the model real building blocks, not a renamed mode. |
| Recursion | Recursive with max_depth = 3 | Honest tree-of-thought at any level; bounded by depth + global budgets. |
| Execution model | Sync only. Parallelism via the LLM’s native parallel tool calls. | No id juggling; runtime fans out parallel calls. |
| UX rendering | Nested, collapsed-by-default tool-call cards. | Recursive UI mirrors recursive execution. Main thread stays clean. |
| Workflow editor | build_workflow is gated on an explicit chat_context: "editor" wire flag, not on NodeRegistry presence. | Registry availability is a server capability; editor context is a UI fact. They are not the same thing. |
| Toolbelt policy | Server-side permission classes, not client-driven tool lists. | The UI sends no tool list, but trivial chat still can’t get destructive tools unless the session policy allows it. |
| Migration | Deprecate, don’t delete. agent_mode and --agent become no-ops first; removal is a later PR. | Honest staging; doesn’t break scripts/docs on day 1. |

Architecture

Routing

handleChatMessage keeps its precedence but the third branch is now the unified chat-agent path, and the editor context is read from a new explicit field:

1. workflow_target === "workflow"  →  handleWorkflowMessage  (run saved workflow)
2. media_generation.mode !== "chat" →  handleMediaGenerationMessage
3. ALL OTHER CHAT                   →  runChatAgent (NEW)
   ├─ chat_context.kind === "editor"  → toolbelt includes build_workflow + ui_*
   ├─ chat_context.kind === "thread"  → standard chat toolbelt
   └─ chat_context omitted            → defaults to "thread"

Important: today the route fires on workflow_id alone, which conflates “run this saved workflow” with “I’m chatting from inside the editor of this workflow.” This spec splits them:

The web composer in the editor sets chat_context.kind = "editor". Outside the editor (Global Chat, mobile), it’s "thread" or omitted.
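The precedence above can be sketched as a pure function. This is a minimal illustration, not the actual handleChatMessage: the IncomingChatMessage shape, the Route type, and the selectRoute name are hypothetical, and it assumes a message with no media_generation field is treated as plain chat.

```typescript
// Hypothetical message shape: only the fields the routing rules mention.
type ChatContextKind = "thread" | "editor";

interface IncomingChatMessage {
  workflow_target?: string;
  media_generation?: { mode: string };
  chat_context?: { kind: ChatContextKind; workflowId?: string };
}

type Route =
  | { handler: "handleWorkflowMessage" }
  | { handler: "handleMediaGenerationMessage" }
  | { handler: "runChatAgent"; contextKind: ChatContextKind };

function selectRoute(data: IncomingChatMessage): Route {
  // 1. Running a saved workflow takes precedence over everything else.
  if (data.workflow_target === "workflow") {
    return { handler: "handleWorkflowMessage" };
  }
  // 2. Media generation, when a non-chat mode is requested.
  //    Assumption: an absent media_generation field means plain chat.
  if (data.media_generation !== undefined && data.media_generation.mode !== "chat") {
    return { handler: "handleMediaGenerationMessage" };
  }
  // 3. Everything else goes to the unified agent; omitted context means "thread".
  return { handler: "runChatAgent", contextKind: data.chat_context?.kind ?? "thread" };
}
```

Note that chat_context never influences branches 1 and 2; it only selects the toolbelt once the message has already landed on the chat-agent path.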

Components

ChatAgentRunner (new — packages/websocket/src/agent/chat-agent-runner.ts)

Replaces both the regular-chat branch and handleAgentMessage. One entry point:

async function runChatAgent(args: {
  threadId: string;
  userId: string;
  provider: BaseProvider;
  model: string;
  chatContext: ChatContext;            // { kind: "thread" | "editor", workflowId? }
  registry: NodeRegistry | null;        // server capability; required by editor context
  content: MessageContent;
  collections: string[];
  memoryEnabled: boolean;
  requestSeq: number;
  sendMessage: (m: Message) => Promise<void>;
}): Promise<void>;

Responsibilities:

runUnifiedAgentLoop (new — packages/agents/src/unified-loop.ts)

The single agent loop, used at every depth. Pseudocode:

async function* runUnifiedAgentLoop(opts: {
  provider: BaseProvider;
  model: string;
  systemPrompt: string;
  history: ProviderMessage[];
  tools: Tool[];                      // pre-filtered to allowed set
  ctx: ProcessingContext;             // shared AgentMemory + TurnBudget
  depth: number;                      // 0 at root; 0..MAX_DEPTH
  maxIterations: number;
  parentSubtaskId: string | null;     // for event nesting
  signal: AbortSignal;
}): AsyncIterable<ProcessingMessage>;

Loop body (one iteration):

  1. Check budgets (ctx.budget). If exhausted, terminate with budget error.
  2. Stream a model turn (provider.generateMessages). Yield chunks with parent_id = parentSubtaskId.
  3. If the assistant turn has no tool calls → loop ends. Last assistant text becomes this level’s result.
  4. Else collect all tool calls. Partition them by safety class (Tool.parallelSafe):
  5. For each tool call:
  6. Append all tool results to history. Yield tool_call_updates with status: "end" for each.
  7. After maxIterations without a clean exit → terminate with iteration-limit error.
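The control flow of one level can be sketched with a synchronous stand-in. This is a simplification under stated assumptions: the real loop is an async generator that streams chunks, and it partitions calls by parallelSafe, whereas this sketch runs calls one after another. All names here (runLoopSketch, ModelTurn, HistoryEntry, LoopResult) are hypothetical.

```typescript
// Simplified, synchronous stand-ins for the real streaming types (all hypothetical).
interface ToolCall { id: string; name: string; args: unknown }
interface ModelTurn { text: string; toolCalls: ToolCall[] }
type HistoryEntry =
  | { role: "assistant"; text: string; toolCalls: ToolCall[] }
  | { role: "tool"; toolCallId: string; result: string };

type LoopResult =
  | { ok: true; text: string }
  | { ok: false; error: "budget" | "iteration_limit" };

function runLoopSketch(opts: {
  model: (history: HistoryEntry[]) => ModelTurn;
  tools: Record<string, (args: unknown) => string>;
  budgetExhausted: () => boolean;
  maxIterations: number;
}): LoopResult {
  const history: HistoryEntry[] = [];
  for (let i = 0; i < opts.maxIterations; i++) {
    // Step 1: check budgets before doing any work.
    if (opts.budgetExhausted()) return { ok: false, error: "budget" };
    // Step 2: one model turn (the real loop streams chunks here).
    const turn = opts.model(history);
    history.push({ role: "assistant", text: turn.text, toolCalls: turn.toolCalls });
    // Step 3: no tool calls means this level is done.
    if (turn.toolCalls.length === 0) return { ok: true, text: turn.text };
    // Steps 4 to 6: run every call (sequentially in this sketch) and record results.
    for (const call of turn.toolCalls) {
      const tool = opts.tools[call.name];
      const result = tool ? tool(call.args) : `unknown tool: ${call.name}`;
      history.push({ role: "tool", toolCallId: call.id, result });
    }
  }
  // Step 7: no clean exit within maxIterations.
  return { ok: false, error: "iteration_limit" };
}
```

Because the same function would run at every depth, a subtask is just another call to it with a fresh history and a narrower tool map.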

run_subtask tool (new — packages/agents/src/tools/run-subtask-tool.ts)

Input schema:

{
  title: string;                  // short label for the UI card
  instructions: string;           // what the subtask should accomplish
  tools?: string[];               // optional: restrict to a subset of parent's toolbelt
  output_schema?: JSONSchema;     // optional: JSON schema for structured result
}

Structured output (when output_schema is set):

Unstructured output (no output_schema):

Execution is intercepted in the unified loop because it needs runner state (depth, ctx, sendMessage). The Tool instance exposes only the schema; calls to it do not go through tool.process. The memory_* tools, which also need the runner context, follow the same pattern: the loop keeps an explicit list of intercepted tools.
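What the interceptor has to decide before spawning a child loop can be sketched as follows. This is an illustration only: planSubtask and SubtaskPlan are hypothetical names, and silently dropping requested tool names that the parent does not hold is an assumption, not something this spec states.

```typescript
const MAX_DEPTH = 3; // default from the budgets table

interface RunSubtaskInput {
  title: string;
  instructions: string;
  tools?: string[];
}

type SubtaskPlan =
  | { ok: true; childDepth: number; toolNames: string[] }
  | { ok: false; error: string };

function planSubtask(
  input: RunSubtaskInput,
  parentDepth: number,
  parentToolNames: string[],
): SubtaskPlan {
  // Depth runs 0..MAX_DEPTH, so a child of a MAX_DEPTH loop is refused.
  const childDepth = parentDepth + 1;
  if (childDepth > MAX_DEPTH) {
    return { ok: false, error: `depth limit reached (MAX_DEPTH = ${MAX_DEPTH})` };
  }
  // The tools argument can only narrow the parent's toolbelt, never widen it.
  const requested = input.tools ?? parentToolNames;
  const allowed = new Set(parentToolNames);
  const toolNames = requested.filter((name) => allowed.has(name));
  return { ok: true, childDepth, toolNames };
}
```

Returning the depth-limit failure as a tool result, rather than throwing, lets the parent model see the error and carry on without the subtask.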

build_workflow tool (new wrapper — editor-context only)

Only included in the toolbelt when:

chatContext.kind === "editor" && registry !== null

If chatContext.kind === "editor" but registry === null, log a server warning and omit the tool (the editor can’t really work without a registry; this is a misconfiguration).
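The gating rule and its misconfiguration case can be sketched as a small predicate. The function name, the object stand-in for NodeRegistry, and the injected warn callback are hypothetical; the warning text is illustrative.

```typescript
type ChatContext = { kind: "thread" | "editor"; workflowId?: string };

// registry is typed as a plain object here; the real type is NodeRegistry.
function shouldOfferBuildWorkflow(
  chatContext: ChatContext,
  registry: object | null,
  warn: (message: string) => void,
): boolean {
  // Outside the editor the tool is never offered, registry or not.
  if (chatContext.kind !== "editor") return false;
  // Editor context without a registry is a misconfiguration: warn and omit.
  if (registry === null) {
    warn("editor chat_context without a NodeRegistry; omitting build_workflow");
    return false;
  }
  return true;
}
```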

Wraps the existing GraphPlanner so the agent can call it as a single primitive when the user asks for a workflow. Streams planning_update / task_update events upward through the same channel as other tool events.

Input: { objective: string }. Output: the built Graph reference. Applying the graph to the live editor remains the job of the existing ui_* proxies (the model can call ui_paste etc. after build_workflow if it wants the graph rendered).

composeToolbelt (new — packages/websocket/src/agent/toolbelt.ts)

The toolbelt is the intersection of three sets: (server-registered tools) ∩ (tools the session’s policy allows) ∩ (tools the context unlocks).

function composeToolbelt(args: {
  chatContext: ChatContext;
  registry: NodeRegistry | null;
  providers: Record<string, BaseProvider>;
  memoryEnabled: boolean;
  clientToolsManifest: Record<string, UIToolManifest>;
  toolBridge: ToolBridge;
  sendMessage: SendFn;
  policy: ToolPolicy;                 // server-resolved per session
}): Tool[];

Tool permission classes

Every Tool instance declares its class:

abstract class Tool {
  abstract readonly permissionClass:
    | "safe"            // pure read, no network, no fs write
    | "knowledge"       // memory_*, search_nodes, etc.
    | "network"         // browser, google_search, MCP read
    | "workspace_write" // read_file (within workspace), write_file (within workspace)
    | "editor_mutate"   // build_workflow, ui_add_node, ui_connect_nodes, ...
    | "subagent"        // run_subtask, finish_subtask
    | "secrets"         // anything that reads/writes user secrets
  ;
  readonly parallelSafe: boolean = true;       // default true; override for mutating tools
  readonly requiresExclusiveLock?: string;     // optional named lock (e.g. "workspace_fs")
}

Session ToolPolicy

A ToolPolicy is built once per session by the runner:

type ToolPolicy = {
  classes: Set<PermissionClass>;   // which classes are enabled
  workspaceRoot: string | null;    // workspace_write tools are confined to this prefix
  allowedMcpServers: string[];     // explicit allowlist (no implicit access)
};

Defaults:

| Class | chat_context: thread | chat_context: editor |
| --- | --- | --- |
| safe, knowledge, subagent | always on | always on |
| network | on | on |
| workspace_write | on, workspace-scoped | on, workspace-scoped |
| editor_mutate | off | on |
| secrets | off (opt-in only) | off (opt-in only) |
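The defaults table can be expressed directly as a function that builds the class set. This is a sketch: defaultPolicyClasses, filterByPolicy, and the secretsOptIn flag are hypothetical names, the last one standing in for however opt-in ends up being recorded.

```typescript
type PermissionClass =
  | "safe" | "knowledge" | "network" | "workspace_write"
  | "editor_mutate" | "subagent" | "secrets";

// Default class set per chat context, following the defaults table.
function defaultPolicyClasses(
  kind: "thread" | "editor",
  secretsOptIn = false, // hypothetical flag; secrets stay off unless opted in
): Set<PermissionClass> {
  const classes = new Set<PermissionClass>([
    "safe", "knowledge", "subagent", "network", "workspace_write",
  ]);
  if (kind === "editor") classes.add("editor_mutate");
  if (secretsOptIn) classes.add("secrets");
  return classes;
}

// Keep only the tools whose class the policy enables.
function filterByPolicy<T extends { permissionClass: PermissionClass }>(
  tools: T[],
  classes: Set<PermissionClass>,
): T[] {
  return tools.filter((t) => classes.has(t.permissionClass));
}
```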

workspace_write tools must enforce their workspaceRoot prefix at runtime (read_file, write_file) — a tool that ignores this is a bug. The policy is the policy; tools are the enforcement points.
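A minimal sketch of the runtime check a workspace_write tool could run before touching the filesystem. It assumes POSIX-style paths and does its own segment normalisation to stay dependency-free; a real implementation would more likely build on the platform's path utilities and also resolve symlinks, which this sketch does not attempt. Both function names are hypothetical.

```typescript
// Resolve "." and ".." segments on top of a base segment list.
// Returns null if ".." would climb above the filesystem root.
function normalizeSegments(base: string[], requested: string): string[] | null {
  const out = [...base];
  for (const seg of requested.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") {
      if (out.length === 0) return null;
      out.pop();
    } else {
      out.push(seg);
    }
  }
  return out;
}

// Returns the absolute path if it stays under workspaceRoot, otherwise throws.
function resolveInWorkspace(workspaceRoot: string | null, requested: string): string {
  if (workspaceRoot === null) throw new Error("no workspaceRoot in policy");
  const root = normalizeSegments([], workspaceRoot);
  if (root === null) throw new Error("invalid workspaceRoot");
  // Absolute requests resolve from "/", relative ones from the workspace root.
  const start = requested.startsWith("/") ? [] : root;
  const target = normalizeSegments(start, requested);
  // Compare whole segments so "/ws-other" is not mistaken for "/ws".
  if (target === null || !root.every((seg, i) => target[i] === seg)) {
    throw new Error(`path escapes workspace root: ${requested}`);
  }
  return "/" + target.join("/");
}
```

Comparing segment by segment, rather than using a string prefix test, avoids treating a sibling directory with a similar name as part of the workspace.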

Composition

const all: Tool[] = [
  new RunSubtaskTool(),
  new MemoryListTool(), new MemoryReadTool(), new MemoryWriteTool(),
  new ReadFileTool(), new WriteFileTool(),
  new BrowserTool(), new GoogleSearchTool(),
  ...getAllMcpTools({ registry, providers, allowedServers: policy.allowedMcpServers }),
];

if (chatContext.kind === "editor" && registry !== null) {
  all.push(new BuildWorkflowTool({ registry, providers }));
  for (const m of Object.values(clientToolsManifest)) {
    all.push(new UIToolProxy(m, toolBridge, sendMessage));
  }
}

return all.filter((t) => policy.classes.has(t.permissionClass));

The client sends no tool list. data.tools is ignored.

Wire protocol changes

This is a real protocol bump, not a tweak. The following types in packages/protocol/src/messages.ts get new optional fields:

// ToolCallUpdate
interface ToolCallUpdate {
  type: "tool_call_update";
  tool_call_id: string;
  name: string;
  args?: unknown;
  result?: unknown;
  is_error?: boolean;
  // NEW:
  parent_id?: string | null;      // null at root; tool_call_id of enclosing run_subtask otherwise
  depth?: number;                 // 0 at root
  status?: "start" | "end";       // start when args are known; end with result
}

// Chunk
interface Chunk {
  type: "chunk";
  content: string;
  done?: boolean;
  // NEW:
  parent_id?: string | null;      // null at root; tool_call_id of enclosing run_subtask
  depth?: number;                 // 0 at root
}

All new fields are optional. Old clients (mobile, CLI) ignore them safely — chunks render in the main thread and tool calls render at the top level (no nesting). That gives us a backward-compatible rollout.
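How a nesting-aware client might group live events, while still accepting events that lack the new fields, can be sketched as below. The childrenByParent helper is hypothetical; the interface repeats only the fields the sketch reads.

```typescript
interface ToolCallUpdate {
  type: "tool_call_update";
  tool_call_id: string;
  name: string;
  parent_id?: string | null;
  depth?: number;
  status?: "start" | "end";
}

// Map each parent id (null = root) to the ids of its direct children,
// in first-seen order.
function childrenByParent(updates: ToolCallUpdate[]): Map<string | null, string[]> {
  const seen = new Set<string>();
  const grouped = new Map<string | null, string[]>();
  for (const update of updates) {
    // The "end" update repeats the id already registered by "start".
    if (seen.has(update.tool_call_id)) continue;
    seen.add(update.tool_call_id);
    // A missing parent_id is treated as root, so events without the new
    // fields render at the top level.
    const parent = update.parent_id ?? null;
    const siblings = grouped.get(parent) ?? [];
    siblings.push(update.tool_call_id);
    grouped.set(parent, siblings);
  }
  return grouped;
}
```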

New event for budget exhaustion (rare; client can show an inline notice):

interface BudgetExceeded {
  type: "budget_exceeded";
  reason: "subtasks" | "llm_calls" | "tool_calls" | "wall_clock" | "tokens" | "bytes";
  limit: number;
  observed: number;
}

Persistence of execution

Streamed subtask events are transient by default, but a compact tree is persisted with the assistant message so thread reload doesn’t lose context.

On Message (assistant role), add an optional execution_tree JSON column:

type ExecutionTreeSnapshot = {
  version: 1;
  nodes: ExecutionTreeNode[];     // flat list
};

type ExecutionTreeNode = {
  id: string;                      // tool_call_id
  parent_id: string | null;
  name: string;                    // tool name; "run_subtask" nodes are container nodes
  title?: string;                  // for run_subtask: the title arg
  args_preview?: string;           // truncated args (≤ 500 chars)
  result_preview?: string;         // truncated result (≤ 500 chars)
  is_error?: boolean;
  duration_ms: number;
  // NB: streamed chunks inside run_subtask are NOT persisted.
};

The renderer reconstructs the nested cards from this tree. Streamed chunk content within a subtask card is not restored on reload — only the title/status/result preview. Trade-off: bounded DB row size vs. full replay. The “full replay” version is a follow-up if users demand it.
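A sketch of how a finished tool call could be reduced to a snapshot node with bounded previews. The FinishedCall input shape and both helper names are hypothetical. The sketch counts UTF-16 code units rather than bytes or grapheme clusters, which is an assumption about what "≤ 500 chars" means.

```typescript
const PREVIEW_LIMIT = 500;

type ExecutionTreeNode = {
  id: string;
  parent_id: string | null;
  name: string;
  title?: string;
  args_preview?: string;
  result_preview?: string;
  is_error?: boolean;
  duration_ms: number;
};

// Hypothetical in-memory record the runner keeps while a call is running.
interface FinishedCall {
  id: string;
  parentId: string | null;
  name: string;
  title?: string;
  args: unknown;
  result: unknown;
  isError?: boolean;
  durationMs: number;
}

// Render a value as text and cap it at `limit` characters, marking the cut.
function previewOf(value: unknown, limit: number = PREVIEW_LIMIT): string {
  const text = typeof value === "string" ? value : (JSON.stringify(value) ?? "");
  return text.length <= limit ? text : text.slice(0, limit - 1) + "…";
}

function toSnapshotNode(call: FinishedCall): ExecutionTreeNode {
  return {
    id: call.id,
    parent_id: call.parentId,
    name: call.name,
    title: call.title,
    args_preview: previewOf(call.args),
    result_preview: previewOf(call.result),
    is_error: call.isError,
    duration_ms: call.durationMs,
  };
}
```

Capping both previews keeps each node's size roughly constant, so the row size grows with the number of tool calls rather than with their payloads.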

UI: nested cards

Two render contexts:

Existing ToolCallCard becomes recursive. New props: parentId, depth, children. Cards default to collapsed when depth > 0.

Mobile: ignore the new protocol fields; render tool calls flat. Acceptable for v1.

Global budgets

TurnBudget lives on the ProcessingContext for the duration of one root turn. It’s checked at three places: subtask spawn, LLM call, tool call.

| Budget | Default | Where checked |
| --- | --- | --- |
| MAX_DEPTH | 3 | run_subtask boundary |
| MAX_ITERATIONS_PER_LEVEL | 20 | loop counter |
| MAX_PARALLEL_PER_TURN | 8 | fan-out batch size |
| MAX_TOTAL_SUBTASKS | 32 | subtask spawn |
| MAX_TOTAL_LLM_CALLS | 60 | before every provider.generateMessages |
| MAX_TOTAL_TOOL_CALLS | 200 | before every tool dispatch |
| MAX_WALL_CLOCK_MS | 180_000 (3 min) | Date.now() check on each iteration |
| MAX_TOOL_RESULT_BYTES | 50_000 per result | tool dispatcher; truncates above |
| MAX_HISTORY_TOKENS | 128_000 | provider-level pruning (existing) |

When any budget is exceeded, the runner emits a budget_exceeded event, terminates the in-progress turn, and surfaces a clear error to the user. The partial assistant text + persisted execution tree are still committed.
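A sketch of a TurnBudget that covers the three counted budgets plus the wall clock. The class shape, the charge method, and the Limits type are hypothetical. Returning the event instead of throwing is a design assumption that leaves emission and turn termination to the runner.

```typescript
type BudgetReason = "subtasks" | "llm_calls" | "tool_calls" | "wall_clock" | "tokens" | "bytes";

interface BudgetExceeded {
  type: "budget_exceeded";
  reason: BudgetReason;
  limit: number;
  observed: number;
}

interface Limits { subtasks: number; llm_calls: number; tool_calls: number; wall_clock: number }

// Defaults taken from the budgets table.
const DEFAULT_LIMITS: Limits = { subtasks: 32, llm_calls: 60, tool_calls: 200, wall_clock: 180_000 };

class TurnBudget {
  private readonly counts = { subtasks: 0, llm_calls: 0, tool_calls: 0 };
  private readonly limits: Limits;
  private readonly now: () => number;
  private readonly startedAt: number;

  // `now` is injectable so tests can control the clock.
  constructor(limits: Limits = DEFAULT_LIMITS, now: () => number = () => Date.now()) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Call before a subtask spawn, LLM call, or tool dispatch.
  // Returns null if the action may proceed, otherwise the event to emit.
  charge(kind: "subtasks" | "llm_calls" | "tool_calls"): BudgetExceeded | null {
    const elapsed = this.now() - this.startedAt;
    if (elapsed > this.limits.wall_clock) {
      return { type: "budget_exceeded", reason: "wall_clock", limit: this.limits.wall_clock, observed: elapsed };
    }
    const observed = this.counts[kind] + 1;
    if (observed > this.limits[kind]) {
      return { type: "budget_exceeded", reason: kind, limit: this.limits[kind], observed };
    }
    this.counts[kind] = observed;
    return null;
  }
}
```

Because one instance lives on the shared ProcessingContext, every depth charges the same counters, which is what makes the totals global rather than per-level.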

Constants live in packages/agents/src/constants.ts. No UI configuration; env overrides only if needed later.

Memory & history

System prompts

Root chat prompt (new — ROOT_CHAT_SYSTEM_PROMPT)

Shape (final wording deferred to implementation):

Subtask prompt

Re-use StepExecutor.buildSystemPrompt() as a base (memory tools, structured output discipline). Extend with one line referencing run_subtask when depth < MAX_DEPTH; omit otherwise.

When output_schema is set, the subtask prompt explicitly requires terminating via finish_subtask (matches the existing finish_step pattern).
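The two conditional additions can be sketched as a small prompt builder. The function name and the instruction wording are placeholders, since final prompt wording is deferred to implementation; base stands in for whatever StepExecutor.buildSystemPrompt() returns.

```typescript
const MAX_DEPTH = 3; // default from the budgets table

function buildSubtaskPrompt(base: string, depth: number, hasOutputSchema: boolean): string {
  const lines = [base];
  // Only mention run_subtask while another level of recursion is still allowed.
  if (depth < MAX_DEPTH) {
    lines.push("You may call run_subtask to delegate a self-contained piece of work.");
  }
  // Structured subtasks must terminate through finish_subtask.
  if (hasOutputSchema) {
    lines.push("When done, call finish_subtask with a result that matches the output schema.");
  }
  return lines.join("\n");
}
```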

Provider compatibility

Not every provider supports tool calling identically. The unified loop’s contract with BaseProvider:

Cost & token accounting

The existing CostCalculator aggregates per-provider calls. The unified loop must:

Tool-result truncation (50 KB cap) keeps individual tool returns from poisoning the next LLM turn. Browser/search-style tools should also do their own pre-truncation in case the cap isn’t enough.
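A sketch of byte-based truncation that does not cut through a multi-byte UTF-8 character. The function name and the "[truncated]" marker are hypothetical. Note that in this sketch the marker is appended on top of the cap, so the returned string can exceed maxBytes by the marker's length.

```typescript
const MAX_TOOL_RESULT_BYTES = 50_000;

function truncateToolResult(text: string, maxBytes: number = MAX_TOOL_RESULT_BYTES): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  // Move the cut point back while it sits on a UTF-8 continuation byte
  // (10xxxxxx), so the kept prefix always ends on a character boundary.
  let end = maxBytes;
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end)) + "\n[truncated]";
}
```

Measuring bytes rather than string length matters for non-ASCII results: a page of CJK text is roughly three bytes per character in UTF-8, so a character-count cap would let through about three times the intended payload.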

Error handling

Migration

Staged, not big-bang. Each step is independently shippable.

Stage 1 — Plumbing & guardrails (no behaviour change for users)

  1. Extract the existing regular-chat tool loop into a reusable runToolLoop under packages/agents/src/. Behaviour-preserving.
  2. Add permissionClass + parallelSafe to Tool base class; annotate all existing tools. Default policy mirrors current behaviour exactly.
  3. Add the protocol fields (parent_id, depth, status) and the execution_tree column on Message. All optional; old code keeps working.
  4. Add TurnBudget to ProcessingContext. Existing paths get a no-op budget.

Stage 2 — run_subtask behind agent_mode

  1. Implement run_subtask (and finish_subtask). Add to the current MultiModeAgent-driven agent_mode path only.
  2. Replace the forced TaskPlanner with the unified loop inside agent mode. Plain chat is still plain chat.
  3. Add the nested-card renderer in the UI; agent mode now produces nested cards instead of a flat plan tree.

Stage 3 — Routing & editor split

  1. Introduce chat_context on the wire. Default to "thread" if omitted.
  2. Split workflow-editor chat from workflow-run routing. Editor sends chat_context: { kind: "editor", workflowId }; saved-workflow runs keep using workflow_target.
  3. Add build_workflow as a tool in the editor toolbelt.

Stage 4 — Unification

  1. Delete the agent_mode branch in the runner; route all non-workflow, non-media chat through runChatAgent.
  2. The UI selector is hidden (still in code, behind a feature flag) so we can revert quickly.
  3. agent_mode on the wire is ignored (logged at debug for one release).
  4. CLI --agent becomes a no-op with a deprecation warning.

Stage 5 — Cleanup

  1. Remove the AgentModeSelector component, agent_mode / agent_planner store fields, and the wire field.
  2. Remove the CLI --agent flag.
  3. Remove MultiModeAgent from public-API surface (kept as an internal building block where genuinely needed — currently nothing depends on it externally once Stage 4 lands).

Each stage ships its own implementation plan.

Testing

Unit tests (packages/agents/tests/):

Integration tests (packages/websocket/tests/):

E2E (web):

Out of scope

Open questions (deferred to implementation)

Acceptance criteria

Acceptance is checked with fake providers for behaviour and against real providers for smoke tests. We don’t make assertions about real-model decisions.

  1. The AgentModeSelector is hidden (Stage 4) and then removed (Stage 5); no chat path requires a mode toggle.
  2. With a fake provider that emits zero tool calls, the runner returns one assistant text message and no nested events.
  3. With a fake provider that emits parallel run_subtask calls, the runner spawns the expected number of child loops, persists an execution_tree, and the UI renders nested cards (live + after reload).
  4. With a fake provider attempting recursion past MAX_DEPTH, the run_subtask tool returns a depth-limit error.
  5. Each global budget, when violated by a deterministic fake provider, produces the expected budget_exceeded event and terminates the turn.
  6. With chat_context: { kind: "editor", workflowId } and a wired registry, build_workflow appears in the toolbelt; without editor context, it does not appear; without a registry (misconfig), it is omitted with a warning.
  7. workspace_write tools refuse paths outside policy.workspaceRoot.
  8. agent_mode: true on the wire (Stage 3) behaves identically to today (MultiModeAgent plan mode). At Stage 4 it’s a no-op; at Stage 5 the field is gone.
  9. The CLI --agent flag (Stage 4) emits a deprecation warning and behaves as the unified loop. At Stage 5 it is removed.
  10. All existing chat-related tests pass at every stage; new tests above pass.