NodeTool ships a lightweight ingestion pipeline for semantic search and retrieval-augmented generation (RAG) tasks. The indexing logic is split across @nodetool/vectorstore (store and embedding) and @nodetool/deploy (collection routes).
Overview
- Collection metadata (CollectionResponse in @nodetool/protocol, packages/protocol/src/api-types.ts) stores ingest configuration, including an optional workflow ID.
- Vector store – the default backend is SQLite-vec (@nodetool/vectorstore, packages/vectorstore/src/sqlite-vec-store.ts), with a Chroma-compatible chunking helper in packages/vectorstore/src/chroma-client.ts.
- Indexing route – indexFileToCollection() (@nodetool/deploy, packages/deploy/src/collection-routes.ts) orchestrates ingestion based on collection metadata.
Default Flow
- indexFileToCollection() resolves the target collection via getCollection() (@nodetool/vectorstore, packages/vectorstore/src/index.ts).
- If the collection specifies a custom workflow ID, the service executes it by constructing a RunJobRequest (@nodetool/protocol, packages/protocol/src/api-types.ts) with CollectionInput and FileInput nodes populated.
- Otherwise, it falls back to the default ingestion path, which splits the document with splitDocument() (@nodetool/vectorstore, packages/vectorstore/src/chroma-client.ts), embeds it, and stores embeddings in SQLite-vec.
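The branching described above can be sketched as follows. This is a minimal illustration, not the code in collection-routes.ts: the CollectionMeta and IngestDeps shapes, the workflowId field name, and the injected helper signatures are all assumptions made so the example is self-contained.

```typescript
// Hypothetical shapes; the real types live in @nodetool/protocol and may differ.
interface CollectionMeta {
  name: string;
  workflowId?: string; // optional custom ingestion workflow (field name assumed)
}

// Dependencies are injected so the sketch does not rely on real package APIs.
interface IngestDeps {
  getCollection(name: string): Promise<CollectionMeta | undefined>;
  runWorkflow(workflowId: string, collection: string, filePath: string): Promise<void>;
  readFile(path: string): Promise<string>;
  splitDocument(text: string): string[];
  embed(chunks: string[]): Promise<number[][]>;
  store(collection: string, chunks: string[], vectors: number[][]): Promise<void>;
}

type IngestPath = "workflow" | "default";

async function indexFileSketch(
  deps: IngestDeps,
  collectionName: string,
  filePath: string,
): Promise<IngestPath> {
  // Step 1: resolve the target collection.
  const collection = await deps.getCollection(collectionName);
  if (!collection) {
    throw new Error(`Collection not found: ${collectionName}`);
  }

  // Step 2: a custom workflow ID takes precedence over the default path.
  if (collection.workflowId) {
    await deps.runWorkflow(collection.workflowId, collection.name, filePath);
    return "workflow";
  }

  // Step 3: default path: split, embed, store.
  const text = await deps.readFile(filePath);
  const chunks = deps.splitDocument(text);
  const vectors = await deps.embed(chunks);
  await deps.store(collection.name, chunks, vectors);
  return "default";
}
```

Injecting the helpers keeps the control flow visible and makes each branch easy to exercise with stubs.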
Messages & Progress
While custom workflows run, the service streams JobUpdate, NodeUpdate, and progress messages (from @nodetool/protocol, packages/protocol/src/messages.ts). Tests under packages/deploy/tests/collection-routes.test.ts cover expected message sequences.
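A consumer of that stream might dispatch on a discriminator field. The sketch below assumes a `type` tag and the field names shown; the actual message definitions in packages/protocol/src/messages.ts may use different tags and carry more fields.

```typescript
// Hypothetical message shapes, for illustration only.
type IngestMessage =
  | { type: "job_update"; status: string }
  | { type: "node_update"; nodeId: string; status: string }
  | { type: "progress"; current: number; total: number };

// Turn one streamed message into a log-friendly line.
function describeMessage(msg: IngestMessage): string {
  switch (msg.type) {
    case "job_update":
      return `job: ${msg.status}`;
    case "node_update":
      return `node ${msg.nodeId}: ${msg.status}`;
    case "progress": {
      // Guard against a zero total so the percentage is always finite.
      const pct = msg.total > 0 ? Math.round((msg.current / msg.total) * 100) : 0;
      return `progress: ${pct}%`;
    }
  }
}
```

Modeling the messages as a discriminated union lets the compiler flag any message kind the switch does not handle.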
Configuring Chroma
Environment variables:
| Variable | Description | Default |
|---|---|---|
| CHROMA_URL | Remote Chroma server URL | None (use local DB) |
| CHROMA_PATH | Local data directory | ~/.local/share/nodetool/chroma |
The SQLite-vec store uses local storage by default. For remote Chroma (legacy), set CHROMA_TOKEN if authentication is required.
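The precedence implied by these variables (a set CHROMA_URL selects remote mode, otherwise the local path applies) can be sketched as below. The ChromaConfig type and resolveChromaConfig function are hypothetical names for illustration and are not part of NodeTool; the sketch also leaves `~` unexpanded.

```typescript
// Hypothetical config shape: either a remote server or a local directory.
type ChromaConfig =
  | { mode: "remote"; url: string; token?: string }
  | { mode: "local"; path: string };

// Default taken from the table above; "~" expansion is not handled here.
const DEFAULT_CHROMA_PATH = "~/.local/share/nodetool/chroma";

function resolveChromaConfig(env: Record<string, string | undefined>): ChromaConfig {
  const url = env["CHROMA_URL"]?.trim();
  if (url) {
    // CHROMA_TOKEN is only relevant when a remote server is configured.
    const token = env["CHROMA_TOKEN"]?.trim();
    return token ? { mode: "remote", url, token } : { mode: "remote", url };
  }
  // An empty or unset CHROMA_URL falls back to local mode.
  const path = env["CHROMA_PATH"]?.trim() || DEFAULT_CHROMA_PATH;
  return { mode: "local", path };
}
```

In a Node.js process the function would typically be called with `process.env`. Clearing CHROMA_URL switches the result back to local mode.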
Custom Ingestion Workflows
Collections can reference bespoke workflows to process files before embedding. The workflow should expect:
- A CollectionInput node receiving Collection(name=…).
- A FileInput node receiving FilePath(path=…).
Return values can include summaries, metadata, or alternate embeddings. Review packages/deploy/tests/collection-routes.test.ts for a template.
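As a rough illustration of the two inputs, the sketch below builds a request-like object for a custom workflow. The RunJobRequestSketch shape, its field names, and buildIngestRequest are assumptions; the real RunJobRequest in packages/protocol/src/api-types.ts may be structured differently, so treat the test file mentioned above as the authoritative template.

```typescript
// Hypothetical request shape, for illustration only.
interface RunJobRequestSketch {
  workflowId: string;
  params: {
    CollectionInput: { name: string }; // stands in for Collection(name=…)
    FileInput: { path: string }; // stands in for FilePath(path=…)
  };
}

function buildIngestRequest(
  workflowId: string,
  collectionName: string,
  filePath: string,
): RunJobRequestSketch {
  if (!workflowId) {
    throw new Error("A workflow ID is required for custom ingestion");
  }
  return {
    workflowId,
    params: {
      CollectionInput: { name: collectionName },
      FileInput: { path: filePath },
    },
  };
}
```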
CLI & API Integration
- POST /collections/{name}/index (see @nodetool/websocket, packages/websocket/src/collection-api.ts) triggers ingestion via HTTP.
- The MCP server (@nodetool/websocket, packages/websocket/src/mcp-server.ts) exposes commands for IDE plug-ins to index assets.
- Admin routes under @nodetool/deploy (packages/deploy/src/admin-routes.ts) provide remote ingestion endpoints for deployed servers.
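For the HTTP route, a client first needs the endpoint URL for a given collection. The helper below is a hypothetical sketch: buildIndexUrl is not a NodeTool function, and the base URL is a placeholder. The request body expected by the route is not described here, so it is left out.

```typescript
// Build the indexing endpoint URL for a collection.
// The collection name is percent-encoded so spaces and slashes stay in one path segment.
function buildIndexUrl(baseUrl: string, collectionName: string): string {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${trimmedBase}/collections/${encodeURIComponent(collectionName)}/index`;
}

// Example with a placeholder host (an assumption, not a documented default):
const exampleUrl = buildIndexUrl("http://localhost:8000/", "my docs");
```

The resulting URL would then be sent with the POST method, using whatever payload format collection-api.ts expects.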
Troubleshooting
- Missing collection metadata – ensure the collection exists and includes the required workflow entry when using custom workflows.
- Chroma connection errors – verify CHROMA_URL / CHROMA_TOKEN and network reachability; fall back to local mode by clearing the URL.
- Large files – increase the CHROMA_PATH disk quota or configure cloud storage; the default ingestion workflow streams chunks to reduce memory usage.
Related Documentation
- Providers – selecting embedding models for ingestion nodes.
- Workflow API – details on RunJobRequest.
- Storage Guide – configuring persistent storage for uploaded documents.