Kernel & WorkflowRunner review — bugs, issues, design flaws
Review of packages/kernel/ (runner, actor, inbox, graph, io, correlation
analysis, triggers, durable inbox) as of 2026-07-04. Findings are ordered by
severity within each section. Line references are against the source at the
time of review (commit 21230c7).
Fix status (this PR): findings #1, #2, #3, #4, #6, #9, #13, #16, #17 are
fixed; #7, #11, #12 now emit warnings instead of failing silently (see the
follow-up commits on this branch). Still open, pending a design decision:
#5 (control-response attribution needs an event id echoed through the
actor), #8 (RunResult.outputs accumulation semantics), #10 (backpressure
defaults), #14, #15, and the smaller issues.
Bugs
1. Edge status regresses from completed back to active at run end
_incrementEdgeCounter throttles edge_update emissions and marks throttled
edges dirty (runner.ts:1623-1625). _sendEOS emits the terminal
status: "completed" update (runner.ts:1473-1479) but never clears the
edge from _edgeCounterDirty. At run end, _drainActiveEdges calls
_flushEdgeCounters (runner.ts:1643-1654), which re-emits
status: "active" for every dirty edge — including edges already reported
completed. Any edge that received a message within the last
EDGE_UPDATE_MIN_INTERVAL_MS before its source finished ends the run with
completed followed by active as its final message. Fix: clear the edge
from _edgeCounterDirty (and stamp _edgeCounterLastEmitMs) inside
_sendEOS, or have _flushEdgeCounters skip edges whose EOS was already
emitted (_eosSentEdges is available).
2. Behavior-flag hydration is OR-based, so stale saved true flags can never be corrected
Graph.loadFromDict computes
is_streaming_input: descriptorDefaults.is_streaming_input || node.is_streaming_input || false
(graph.ts:437-457), same for is_streaming_output, is_controlled,
is_join_node. The comments claim the registry is the source of truth and
saved JSON is “a cache that gets overwritten”, and input_mode /
output_correlation really are overwritten — but the boolean flags are ORed.
If a node type migrates from streaming to buffered (or stops being a join
node), every saved workflow keeps the stale true and runs the wrong
execution mode, with no way to fix it short of re-saving the graph. The
actor trusts these flags to pick run() vs process() vs _runControlled,
so this silently changes execution semantics. Fix: treat the registry value
as authoritative when the resolver returned a descriptor
(descriptorDefaults.is_streaming_input ?? node.is_streaming_input ?? false
is still wrong; it should be
descriptorDefaults.is_streaming_input ?? false when the registry resolved,
falling back to the saved value only for unresolved/dynamic types).
3. pushInputValue bypasses the input node’s process()
_dispatchInputs runs the input node’s executor so transforming inputs
(StringInput max_length, DocumentFileInput building a DocumentRef,
MessageDeconstructor splitting) emit their declared per-handle outputs
(runner.ts:938-951). pushInputValue (runner.ts:337-367) puts the raw
value straight onto outgoing edges. The same input node behaves differently
depending on whether the value arrived as a start param or was streamed in —
streamed values skip validation/transformation entirely and leak the raw
value to every edge regardless of sourceHandle semantics. It also stamps a
different source_edge_id (external:... vs the real edge id), which will
not match the per-edge lineage-done/closed bookkeeping keyed by real edge
ids.
4. Controlled-node wait can deadlock the whole run
_runControlled refuses to process any control event until every data
handle has produced at least one value (actor.ts:1301-1303,
_waitForDataInputs at actor.ts:1341-1377). The wait loop only exits when
all handles are satisfied or the inbox is fully drained — and “fully
drained” includes __control__, which stays open while the controller is
alive. Two realistic shapes hang forever:
- An upstream node completes without emitting on the controlled node’s data
handle (filter/conditional output). The handle closes but
pendingnever empties, control events are held aside indefinitely, and a controller awaitingsendControlEventnever resolves →_processGraph’sPromise.allnever settles. No timeout exists anywhere on this path. - An agent both controls the node and (directly or transitively) produces its data input: the agent awaits the control response before emitting data; the controlled node awaits data before answering.
The runner already rejects pending control responses when the controlled
actor finishes, but here the controlled actor never finishes. At minimum the
wait should give up when all data handles are closed (drop pending handles
whose upstream count reached zero), mirroring how _runCorrelatedImpl
treats closed empty-scope handles as “use defaults”.
5. sendControlEvent responses are matched to the wrong output when event sources mix
_sendMessages resolves the oldest pending sendControlEvent waiter on
any output emission from the target node (runner.ts:1357-1364). The FIFO
assumption holds only when sendControlEvent is the sole source of control
events. If the same node is also driven by a control edge
(__control_output__ routing) or by dispatchControlEvent, an edge-driven
firing’s outputs resolve a sendControlEvent waiter that belongs to a
different event. The agent then receives another invocation’s result. The
response should be correlated to the event (e.g. stamp an event id on the
ControlEvent and echo it through the actor), not inferred from global output
order.
6. _dispatchInputs falls back to leaking the raw value on missing handles
handleValue = nodeOutputs[edge.sourceHandle] !== undefined ? ... : value
(runner.ts:958-961). The fallback exists for test doubles returning {},
but it applies per-edge: a real input node with several declared outputs
that legitimately omits one (e.g. “no attachments”) gets the raw input value
delivered on that edge instead of nothing. Downstream receives a value of
the wrong type/shape. The fallback should trigger only when process()
returned an entirely empty record, not per-handle.
7. Streaming-input node without run() silently ignores all its inputs
The legacy fallback (actor.ts:388-399) calls process() once with only
the node’s own properties. Data queued on its inbox is never read; it is
reported only as a post-run “pending inbox work” warning. A misdeclared node
(registry says is_streaming_input, implementation has no run) drops all
upstream data while the run still reports completed. This should be a node
error, not a silent fallback.
8. RunResult.outputs keeps only the last firing of multi-fire terminal nodes
Output collection stores result.outputs, which is the actor’s
_latestResult — the last invocation’s record (runner.ts:1099-1110,
actor.ts:473). A terminal node fired once per streamed item contributes
only its final values to RunResult.outputs; earlier firings survive only
as output_update messages. The Record<string, unknown[]> shape suggests
accumulation was intended. Also, Object.values(result.outputs) flattens
multiple output handles into one array, losing handle names, and
_isOutputNode is “no outgoing data edges” (runner.ts:1717-1721), so a
controller agent with only a control edge out also lands in workflow
outputs, and two terminal nodes sharing a name merge into one bucket.
9. TriggerWakeupService defeats DurableInbox’s append serialization
DurableInbox.append chains appends per instance because
getMaxSeq→save is a read-modify-write (durable-inbox.ts:196-222). But
deliverTriggerInput constructs a fresh DurableInbox per call over the
shared store (trigger-wakeup.ts:88-89), so concurrent deliveries for the
same (run, node) each get their own no-op chain, read the same maxSeq, and
persist duplicate seq values — corrupting the ordering that findPending
sorts by. The service should cache one DurableInbox per (runId, nodeId).
Design flaws / risks
10. No backpressure in production, and the correlated scheduler can’t be backpressured at all
No production caller passes bufferLimit, so every inbox is unbounded
(runner.ts:771). Independently, _runCorrelatedImpl eagerly drains the
inbox into actor-local buckets (maxBuckets, sticky maps), so even with a
bufferLimit, buffered nodes release inbox backpressure immediately and
buffering just moves into the actor. The only guards are
maxPendingKeys/maxPendingMessagesPerKey (10k × 10k envelopes). A fast
producer feeding a slow consumer grows the heap without bound; audio-rate
streams make this practical, not theoretical.
11. Unmigrated streaming producers silently collapse to last-value-wins downstream
The analyzer defaults missing output_correlation to single with
repeats: false (correlation-analysis.ts:480). A custom/legacy node that
emits many values on one slot (without declaring chunk/iteration) feeds
a downstream buffered handle classified empty-scope sticky
(actor.ts:568-579), where each arrival overwrites the previous
(actor.ts:813-816) and the node fires once at close with only the last
value. Pre-correlation semantics fired per message. Correctness of every
third-party node now hinges on correlation metadata being declared; the
failure mode is silent data loss. There is no warning when multiple
envelopes overwrite an empty-scope sticky slot — emitting one would make the
migration gap visible.
12. Non-driver duplicate keys are silently dropped inside the actor
In _runCorrelatedImpl with no repeating driver, a key fires once
(fired set, actor.ts:727-738); a second envelope arriving for the same
key on a max-scope handle stays in the actor-local bucket forever. Unlike
inbox-stranded data, _checkPendingInboxWork cannot see it (the envelope
was already popped), so the loss is completely silent.
13. Graph mutates caller-owned node descriptors
_detectControlledNodes writes is_controlled = true onto the node objects
passed in (graph.ts:515-525), and the runner passes the caller’s
graphData.nodes through (bypass rewrite preserves object identity). A
WorkflowRunner.run() therefore mutates the caller’s graph data despite the
readonly typing. Harmless today, but it leaks state across runs and
violates the descriptor contract.
14. Control context exposes every property value to the LLM
_buildControlContext embeds all non-underscore property values of
controlled nodes as tool-schema defaults (runner.ts:1810-1900). If a
controlled node carries a sensitive string property (token, endpoint,
prompt containing PII), it is shipped into the controller LLM’s tool
definitions. Similarly, generation_complete rides all scalar resolved
inputs into persisted asset metadata (actor.ts:149-164). Both need a
notion of secret/sensitive properties to redact.
15. Trigger watchdog restarts forever with no backoff
_watchdogCheck restarts failed/completed jobs every tick
(trigger-manager.ts:337-374). A workflow that fails at startup restarts
every 30 s indefinitely — no backoff, no retry cap, no circuit breaker.
getInstance also silently ignores differing opts on subsequent calls,
and only the latest watchdog check is awaited by shutdown
(_watchdogCheckInFlight is overwritten per tick), so an older,
still-running check can theoretically revive a job after shutdown’s await.
16. Two job ids per run
edge_update messages use this.jobId from the constructor while
job_update uses request.job_id (runner.ts:1629-1635 vs 473-478).
Callers that pass different values get edge updates that clients cannot
associate with the job. One of the two should win (or the constructor id
should be dropped).
17. Runner instance lifecycle races
_resetRunState() runs at the top of _runImpl, so cancel() called
between construction and run() is silently lost (_cancelled reset,
fresh AbortController). A second run() on the same instance while the
first is in flight clobbers all shared state. Cheap fix: latch a
“cancelled before start” flag and reject concurrent run().
Smaller issues
- Inbox arrival queue is O(n²) —
_arrivalis a plain array;_removeFromArrivaldoesindexOf+spliceper consumed envelope andprependdoesunshift(inbox.ts:113,579-585). With deep backlogs (audio-rate streams, list aggregation) this degrades quadratically. A per-handle counter or linked list removes the scans. _notifyWaiters/_notifyPutWaiterswake every waiter while their doc comments say “wake one” (inbox.ts:552-566) — thundering herd on hot streams; eachputwakes all consumers and all blocked producers.put()on a closed inbox silently drops data and returns normally (inbox.ts:202); fine for cancellation, but producers cannot distinguish delivered from dropped._emitNodeStatus("running")fires at actor start, before any input arrives (actor.ts:303), so every node in the graph shows “running” immediately; there is no “waiting for inputs” state.getInputSchema/getOutputSchemaclassify nodes bytype.includes("Input")(graph.ts:772-783) while the runner usestype.startsWith("nodetool.input.")— the two predicates disagree for types like...InputSanitizer, and every schema property is markedrequiredeven when a default exists.- A node that is both
is_streaming_inputandis_controlledtakes the streaming branch and its control events are never consumed (actor.ts:310-401branch order); combinations of behavior flags are not validated. _runControlledignores_listInputHandlesand lineage — multi-edge list inputs to controlled nodes keep only the latest value (actor.ts:1384-1395).- Duplicate
__control_output__routing is possible by edge shape: the runner routes__control_output__to all control edges, then the edge loop routesoutputs[edge.sourceHandle]again (runner.ts:1154-1227). Convention (sourceHandle: "__control__") prevents overlap today, but an edge saved withsourceHandle: "__control_output__"would deliver the event twice; nothing validates control-edge source handles. MemoryDurableInboxStoreis O(n) per operation over one flat array — acceptable for tests, but it is also the default store used byTriggerWakeupServicein production paths, andTriggerInputrecords accumulate until an explicitcleanupProcessedcall.
What is solid
Cycle handling is sound end-to-end (data cycles rejected by the
correlation topo sort, self-loops by validateEdgeEndpoints, control
cycles by DFS). The double-EOS guard (_eosSentEdges), per-unique-controller
__control__ counting, the FIFO queue for control responses replacing the
old single-slot resolver, the held-control-events fix in
_waitForDataInputs, and the pending-start dedup in
TriggerWorkflowManager all show earlier races were found and fixed with
regression tests. The message-retention cap and audio-chunk exclusion in
_emit are well reasoned. The Stryker annotations indicate real mutation
coverage.
Suggested priority
- Flag hydration OR-logic (#2) — silent wrong execution mode on saved graphs.
- Controlled-node deadlock (#4) and control-response attribution (#5) — hangs and wrong agent results at runtime.
- Streaming-migration last-value collapse (#11) + duplicate-key silent drop (#12) — silent data loss; at minimum emit warnings.
pushInputValuetransformation bypass (#3) and dispatch fallback leak (#6).- Edge status regression (#1) — small, contained UI fix.
- Backpressure story (#10) — decide on a default
bufferLimitand a bucket-level cap that actually applies to buffered nodes.