Concurrent Same-Workflow Runs — Design

Date: 2026-06-03
Status: Design approved; implementation pending

Goal

Let the same workflow run multiple times concurrently. The editor canvas shows one run’s execution at a time — a “lens” the user selects from the queue overlay — while node results and outputs accumulate across all runs so they can be compared.

Today the backend forces same-workflow runs to be sequential (hasActiveJobForWorkflow), and the frontend keys all node-level state by workflowId:nodeId, so two runs of one workflow would clobber each other. This redesign lifts the serialization and splits node-level state into two categories with different lifetimes.

A prerequisite bug fix already landed: WorkflowRunner.run() now returns the id of the run it initiated (queued or fresh) instead of void, and the sketch/timeline/mini-app hooks use that id rather than re-reading the stale runnerStore.job_id. Without it, a second concurrent run subscribed to the previous run’s job and stranded its updates.
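
To make the fix concrete, here is a minimal sketch of the difference between using the id returned by run() and re-reading a shared store field. SketchRunner, storeJobId, and acknowledge are hypothetical stand-ins, not the real WorkflowRunner API, and the assumption that the store field catches up only after a later event is ours. The sketch is synchronous for brevity.

```typescript
// Hypothetical, simplified runner; in-memory and synchronous for illustration.
class SketchRunner {
  // Stand-in for runnerStore.job_id: a single shared "current job" field.
  storeJobId: string | null = null;
  private counter = 0;

  // Returns the id of the run this call initiated.
  run(workflowId: string): string {
    this.counter += 1;
    return `${workflowId}-run-${this.counter}`;
  }

  // Assumption: the shared field is updated only after some later event.
  acknowledge(jobId: string): void {
    this.storeJobId = jobId;
  }
}

const sketchRunner = new SketchRunner();

const firstRunId = sketchRunner.run("wf1");
sketchRunner.acknowledge(firstRunId);

// A second run starts while the shared field still points at the first run.
const secondRunId = sketchRunner.run("wf1");

// Using the returned id subscribes to the run that was actually started.
const subscribeTarget = secondRunId;

// Re-reading the shared field here would pick up the previous run's id.
const staleRead = sketchRunner.storeJobId;
```

In this sketch subscribeTarget and staleRead differ, which is the situation the returned id is meant to rule out.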

The model: telemetry follows the lens, products accumulate

Every backend message already carries both workflow_id and job_id (stamped in unified-websocket-runner.ts streamJobMessages). We use job_id to split node-level state into two categories:

focusedJobId : workflowId → jobId        // which run the canvas is "watching"

Architecture

Backend — allow concurrency

UnifiedWebSocketRunner.runJob (packages/websocket/src/unified-websocket-runner.ts) currently queues a run when hasActiveJobForWorkflow(req.workflow_id) is true. Drop that condition from the gate so same-workflow runs start immediately, bounded only by the global MAX_CONCURRENT_JOBS cap; runs beyond the cap still queue FIFO and drain as slots free (drainQueue). drainQueue’s hasActiveJobForWorkflow filter is likewise removed.
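
The gate described above can be sketched roughly as follows. This is not the UnifiedWebSocketRunner code: AdmissionGate, RunRequest, submit, and finish are hypothetical names, and the only behavior modelled is "start if a global slot is free, otherwise queue FIFO and drain as slots free", with no per-workflow check anywhere.

```typescript
// Assumed request shape; the real runner's types are not shown in this document.
interface RunRequest {
  jobId: string;
  workflowId: string;
}

class AdmissionGate {
  private readonly active = new Map<string, RunRequest>(); // jobId -> request
  private readonly queue: RunRequest[] = []; // FIFO backlog

  constructor(private readonly maxConcurrentJobs: number) {}

  // Start immediately when a global slot is free; otherwise enqueue.
  // Note there is no check on req.workflowId.
  submit(req: RunRequest): "started" | "queued" {
    if (this.active.size < this.maxConcurrentJobs) {
      this.active.set(req.jobId, req);
      return "started";
    }
    this.queue.push(req);
    return "queued";
  }

  // Called when a job ends; frees its slot and returns the runs started by the drain.
  finish(jobId: string): RunRequest[] {
    this.active.delete(jobId);
    return this.drainQueue();
  }

  // Starts queued runs in arrival order while slots remain.
  private drainQueue(): RunRequest[] {
    const started: RunRequest[] = [];
    while (this.active.size < this.maxConcurrentJobs) {
      const next = this.queue.shift();
      if (next === undefined) break;
      this.active.set(next.jobId, next);
      started.push(next);
    }
    return started;
  }

  activeJobIds(): string[] {
    return Array.from(this.active.keys());
  }
}

// Three runs of the same workflow against a cap of 2.
const gate = new AdmissionGate(2);
const outcomes = [
  gate.submit({ jobId: "a", workflowId: "wf1" }),
  gate.submit({ jobId: "b", workflowId: "wf1" }),
  gate.submit({ jobId: "c", workflowId: "wf1" }),
];
const drainedAfterA = gate.finish("a").map((r) => r.jobId);
```

Constructing the gate with a cap of 1 makes every second submission queue, which corresponds to the fully serialized behavior.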

No message-shape changes: jobs already have isolated ProcessingContexts and every outbound message is stamped with job_id + workflow_id. Persistence, cancel, and reconnect are already per-job_id.

With the cap set to 1, behavior is identical to today (full serialization) — a safe operational fallback.

Frontend — split the stores

| State | Source message | Store today | New keying |
| --- | --- | --- | --- |
| Node status | node_update.status, prediction | StatusStore wf:node | wf:job:node |
| Edge activity | edge_update | ResultsStore.edges wf:edge | wf:job:edge |
| Progress | node_progress | ResultsStore.progress wf:node | wf:job:node |
| Execution time | derived from node_update | ExecutionTimeStore wf:node | wf:job:node |
| Node error | node_update.error | ErrorStore wf:node | wf:job:node |
| Chunks / tasks / tool calls / planning | streaming updates | ResultsStore.* wf:node | wf:job:node |
| Provider cost | node_update.provider_cost | ResultsStore wf:node | wf:job:node |
| Result / outputs (media) | node_update.result, output_update (auto-saved) | ResultsStore wf:node (live) | DB Asset by (wf, node, job); in-memory = live mirror |
| Result / outputs (non-media) | same | ResultsStore wf:node | wf:node → { job: value } (in-memory, ephemeral) |

Execution-telemetry maps gain a job segment in their key; canvas selectors resolve it through focusedJobId. Durable-product maps keep the node key but hold a per-job collection, surfaced across runs.
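
A minimal sketch of the two keying schemes, assuming plain Maps rather than the real stores. LensState, telemetryKey, productKey, and the record/select helpers are hypothetical names introduced for illustration; the sketch also assumes ids never contain ":".

```typescript
type WorkflowId = string;
type JobId = string;
type NodeId = string;

// Assumes ids never contain ":"; a real key builder might need escaping.
const telemetryKey = (wf: WorkflowId, job: JobId, node: NodeId): string =>
  `${wf}:${job}:${node}`;
const productKey = (wf: WorkflowId, node: NodeId): string => `${wf}:${node}`;

interface LensState {
  focusedJobId: Map<WorkflowId, JobId>; // which run the canvas is watching
  status: Map<string, string>; // wf:job:node -> status (telemetry)
  products: Map<string, Map<JobId, unknown>>; // wf:node -> { job: value }
}

function createLensState(): LensState {
  return { focusedJobId: new Map(), status: new Map(), products: new Map() };
}

function recordStatus(s: LensState, wf: WorkflowId, job: JobId, node: NodeId, status: string): void {
  s.status.set(telemetryKey(wf, job, node), status);
}

function recordProduct(s: LensState, wf: WorkflowId, job: JobId, node: NodeId, value: unknown): void {
  const key = productKey(wf, node);
  const perJob = s.products.get(key) ?? new Map<JobId, unknown>();
  perJob.set(job, value);
  s.products.set(key, perJob);
}

// Telemetry selector: resolves the job segment through the focused run.
function selectStatus(s: LensState, wf: WorkflowId, node: NodeId): string | undefined {
  const job = s.focusedJobId.get(wf);
  return job === undefined ? undefined : s.status.get(telemetryKey(wf, job, node));
}

// Product selector: ignores focus and returns every run's value for the node.
function selectProducts(s: LensState, wf: WorkflowId, node: NodeId): Array<[JobId, unknown]> {
  return Array.from(s.products.get(productKey(wf, node))?.entries() ?? []);
}

const lens = createLensState();
recordStatus(lens, "wf1", "jobA", "n1", "completed");
recordStatus(lens, "wf1", "jobB", "n1", "running");
recordProduct(lens, "wf1", "jobA", "n1", "hello");
recordProduct(lens, "wf1", "jobB", "n1", "world");
lens.focusedJobId.set("wf1", "jobB");
```

With focus on jobB, selectStatus reports only jobB's status for n1, while selectProducts returns the values from both runs. Changing the focusedJobId entry switches the telemetry view without touching the stored data.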

Runs registry + focus

A new per-workflow registry tracks the live set of runs and the focus:

handleUpdate rework

web/src/stores/workflowUpdates.ts currently routes everything through a single per-workflow runnerStore and uses isRunnerJob to decide whether an update may drive runner state. Under concurrency:

Clearing semantics

WorkflowRunner.run() today clears the whole workflow’s node state on start (clearStatuses(wf), clearResults(wf), …). Under concurrency a new run must not wipe siblings or the accumulated gallery. A fresh jobId has an empty execution slice, so the broad clears are removed; per-job execution state simply starts empty. PropertyValidationStore.clearWorkflow (pre-flight highlights) stays workflow-level.
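
The reasoning can be illustrated with a small sketch over a job-keyed map. TelemetryStore, clearWorkflow, and readStatus are hypothetical helpers, not the real store API; clearWorkflow models what a workflow-wide clear would do if it were kept under concurrency.

```typescript
type TelemetryStore = Map<string, string>; // "wf:job:node" -> status

// Models a broad, workflow-level clear: removes every run's entries for wf.
function clearWorkflow(store: TelemetryStore, wf: string): void {
  for (const key of Array.from(store.keys())) {
    if (key.startsWith(`${wf}:`)) store.delete(key);
  }
}

function readStatus(store: TelemetryStore, wf: string, job: string, node: string): string | undefined {
  return store.get(`${wf}:${job}:${node}`);
}

// jobA is already running when jobB starts.
const telemetry: TelemetryStore = new Map([["wf1:jobA:n1", "running"]]);

// Without any clear, the new run's slice is simply empty...
const freshSlice = readStatus(telemetry, "wf1", "jobB", "n1");
// ...and the sibling's state is untouched.
const siblingState = readStatus(telemetry, "wf1", "jobA", "n1");

// For contrast: a workflow-wide clear on start would wipe the sibling too.
const clearedCopy: TelemetryStore = new Map(telemetry);
clearWorkflow(clearedCopy, "wf1");
const siblingAfterBroadClear = readStatus(clearedCopy, "wf1", "jobA", "n1");
```

Because the job segment already isolates each run's keys, no explicit clear is needed to give a new run a clean slate.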

Canvas rendering

Queue overlay = the run selector

web/src/components/panels/QueueOverlay.tsx is reused as the focus selector. It keeps its jobs.list source (useRunningJobs), collapse/expand, and Running/Enqueued/Cancelled sections, and gains:

Cancel / reconnect

Lifecycle decisions

What stays the same

Testing

Out of scope

Affected files