A self-contained prompt for an engineer or coding agent to build the first end-to-end slice of NodeTool Studio. Scope is deliberately thin: prove the thesis (script → voiced, captioned short-form video, edited as a document), nothing more. See
docs/STUDIO_PRD.mdfor the full vision.
Build the first MVP of NodeTool Studio: a transcript-driven editor for
short-form video, docked inside the existing TimelineEditor
(web/src/components/timeline/). The MVP must prove one thing end-to-end —
you write a script, Studio turns each line into a voiced + captioned beat on the
timeline, and editing the script edits the video — then export an MP4.
Do not build the assembly agent, research, b-roll generation, brand kits, auto-reframe, localization, or platform publishing yet. Those are out of scope for this MVP.
I open Studio, paste or type a short script (a handful of lines). Each line becomes a beat with generated voiceover and word-level captions, laid out on the timeline in order. I press play and see/hear it. I delete a line and its clips disappear and the timeline re-flows. I reword a line and its voiceover regenerates. I export an MP4 that matches the preview.
Transcript data model. Add a transcriptLine type to
packages/protocol/src/api-schemas/timeline.ts:
{ id, text, beatStartMs, clipIds[] }, persisted inside the existing
timelineDocument so autosave and export inherit it for free.
web/src/stores/timeline/TimelineTranscriptStore.ts,
beside TimelineStore / TimelineGenerationStore. Holds line↔clipId bindings
and these ops, each expressed as existing TimelineStore mutations:
startMs;startMs across bound clips;dependencyHash
machinery) and trigger regenerate via the existing TimelineGenerationStore
path.text-to-audio clip generation;{ word, startMs, endMs }[]) sourced from the existing transcription
nodes. A single fixed caption style is fine for the MVP.
The voiceover’s duration determines the beat length; captions and startMs
follow from it.Caption rendering. Wire a caption layer into
web/src/components/timeline/preview/sceneModel.ts computeActiveLayers so
captions render identically in live preview and the WebCodecs MP4 export. One
code path, no export-specific rendering.
web/src/components/timeline/TranscriptPanel.tsx,
docked like Inspector:
beatStartMs and select its clips;ClipVersion history).
Use ui_primitives only — no raw MUI imports (repo policy).width/height, so don’t hardcode — just default to vertical.any, functional components, Zustand selectors (never
whole-store), shallow for multi-value selects.npm run check is green (typecheck + lint + test), with tests for the store ops
(add/edit/delete/reorder/reword→stale) and the caption layer in
computeActiveLayers.The script→beat generation step is the one place that touches the assembly
question from the PRD. For the MVP, keep it deterministic — a fixed per-line
pipeline (line text → text-to-audio → caption timing), not an agent. Confirm
this deterministic approach and surface any place the existing generation/staleness
APIs don’t fit the per-beat binding before building. List the specific
existing functions/types you’ll call (in TimelineStore, TimelineGenerationStore,
the text-to-audio node, transcription nodes, and computeActiveLayers) so the
integration points are explicit.