Interaction flow
GriMoire does not have one unified ingress pipeline. It has two — voice and text — that converge only at the tool runtime and HIE layers.
The convergence rule
- Ingress differs by channel — voice uses WebRTC, text uses HTTP streaming
- Execution converges at tool dispatch — both paths call handleFunctionCall
- Context returns through HIE — block tracking, visual injection, interaction normalization
Both paths share the same system prompt, tool registry, runtime handlers, block store, and HIE singleton. But they do not share routing logic, history ownership, or orchestration depth.
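The convergence rule can be sketched as a single dispatch function that both channels call. The registry contents and handler shape below are illustrative assumptions, not GriMoire's real API:

```typescript
// Hypothetical sketch: both ingress channels funnel into one dispatch point.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolHandler = (args: Record<string, unknown>) => string;

// Shared tool registry (illustrative entry).
const toolRegistry = new Map<string, ToolHandler>();
toolRegistry.set("search_sharepoint", (args) => `searched: ${args.query}`);

// The convergence point: routing differed upstream, execution is identical here.
function handleFunctionCall(call: ToolCall, channel: "voice" | "text"): string {
  const handler = toolRegistry.get(call.name);
  if (!handler) return `unknown tool: ${call.name}`;
  return handler(call.args);
}
```

The `channel` parameter only exists upstream of this point; by design it does not change what a tool does.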
Text path
The text path has more client-side orchestration than the voice path because the HTTP completions API is stateless — the client must handle routing, history, and multi-step planning that the realtime model manages server-side. It runs when voice is not connected.
Turn creation and grounding
When the user types a message:
- The system creates a new turn and decides whether to inherit the current thread or start a new root turn — based on visible blocks, reference titles, reset phrases, and contextual follow-up patterns.
- The HIE injects the current visual state as a reminder before the model is queried.
This means typed requests are explicitly grounded in current UI state before any model call happens.
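A minimal sketch of the inherit-vs-new-root decision, assuming hypothetical signal names for the cues listed above (visible blocks, reference titles, reset phrases, follow-up patterns):

```typescript
// Hypothetical signals extracted from the typed message and current UI state.
interface TurnSignals {
  visibleBlocks: number;        // how many blocks are currently on screen
  referencesVisibleTitle: boolean; // message names a visible block's title
  isResetPhrase: boolean;       // e.g. "start over", "new topic"
  looksLikeFollowUp: boolean;   // e.g. "the first one", "summarize it"
}

function shouldInheritThread(s: TurnSignals): boolean {
  if (s.isResetPhrase) return false;          // explicit reset always wins
  if (s.referencesVisibleTitle) return true;  // grounded in visible UI
  if (s.looksLikeFollowUp && s.visibleBlocks > 0) return true;
  return false;                               // default: start a new root turn
}
```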
Pre-model checks
Before the model sees the message, the text path runs three sequential checks. If any check handles the request, later checks are skipped.
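The short-circuit behavior can be sketched as a simple check chain; the `PreModelCheck` shape is a hypothetical stand-in, not the real interface:

```typescript
// Each check returns true if it fully handled the request.
type PreModelCheck = (message: string) => boolean;

// Returns the index of the check that handled the message, or -1 if none
// did and the normal tool loop should run. Later checks are never invoked
// once one succeeds.
function runPreModelChecks(message: string, checks: PreModelCheck[]): number {
  for (let i = 0; i < checks.length; i++) {
    if (checks[i](message)) return i;
  }
  return -1;
}
```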
Compound workflow planning
The fast model analyzes the message for multi-step intents expressed in a single sentence. If a compound plan is accepted, the normal tool loop is skipped entirely and the workflow steps run directly. This is text-path-only — voice does not pre-plan compound workflows.
Nine workflow families are supported:
| Workflow | Steps |
|---|---|
| Search + recap | Search → recap visible results |
| Search + recap + email | Search → recap → compose email |
| Search + recap + Teams chat | Search → recap → share to Teams chat |
| Search + recap + Teams channel | Search → recap → post to Teams channel |
| Search + summarize document | Search → read and summarize top result |
| Search + summarize + email | Search → summarize → compose email |
| Find person + email | Find person → compose email to them |
| Search emails + summarize | Search mailbox → summarize best match |
| Recap + reply all | Recap visible content → reply all to mail discussion |
Each workflow executes step by step with a progress tracker block showing real-time status. The system auto-selects the best result based on hints (first, top, best, latest) and stops gracefully when user choice is needed or results are empty.
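The hint-based auto-selection might look roughly like this; treating every hint as "take the top-ranked result" is a simplifying assumption of the sketch:

```typescript
// Hint words the text above lists for auto-selecting a result.
const HINTS = ["first", "top", "best", "latest"] as const;

// Returns the chosen result, or null when the workflow should stop
// gracefully (empty results or a choice the user must make).
function pickResult<T>(results: T[], message: string): T | null {
  if (results.length === 0) return null;   // empty: stop gracefully
  if (results.length === 1) return results[0];
  const lower = message.toLowerCase();
  // In this sketch every hint resolves to the top-ranked result;
  // a real ranker could treat "latest" differently from "first".
  if (HINTS.some((h) => lower.includes(h))) return results[0];
  return null; // ambiguous: hand control back to the user
}
```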
Forced contextual overrides
If no compound workflow was selected, the text path may override the first tool call using visible UI context:
| User says | Override tool | Condition |
|---|---|---|
| "summarize document 3" | read_file_content | Numbered result visible |
| "open the first one" | show_file_details | Result block visible |
| "send this by email" | show_compose_form | Actionable block visible |
| "browse this library" | browse_document_library | Site/library context visible |
This override layer is built from currently visible blocks and HIE-compatible references.
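The override table can be modeled as data plus a first-match lookup. The regexes and visibility keys below are illustrative stand-ins for the real block and HIE references:

```typescript
interface OverrideRule {
  pattern: RegExp;                         // matches the user's phrasing
  tool: string;                            // tool to force as the first call
  condition: (visible: Set<string>) => boolean; // visibility precondition
}

// Rows lifted from the table; keys like "numbered-results" are hypothetical.
const rules: OverrideRule[] = [
  { pattern: /summari[sz]e document \d+/i, tool: "read_file_content",
    condition: (v) => v.has("numbered-results") },
  { pattern: /open the first one/i, tool: "show_file_details",
    condition: (v) => v.has("result-block") },
  { pattern: /send this by email/i, tool: "show_compose_form",
    condition: (v) => v.has("actionable-block") },
];

function resolveOverride(msg: string, visible: Set<string>): string | null {
  const rule = rules.find((r) => r.pattern.test(msg) && r.condition(visible));
  return rule ? rule.tool : null;
}
```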
First-turn routing
If the conversation is on its first real user turn — no prior user/assistant/tool history — a classifier may force a specific tool such as search_sharepoint, search_people, search_emails, or research_public_web. The fast path uses the fast model; the fallback uses heuristics.
Once the conversation has history, this check is skipped entirely.
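A sketch of the first-turn gate, with a hypothetical `classify` callback standing in for the fast-model classifier and its heuristic fallback:

```typescript
interface HistoryEntry { role: "user" | "assistant" | "tool"; }

// Returns a forced tool name (e.g. "search_sharepoint") or null to let the
// model route normally. Skipped entirely once any real history exists.
function maybeForceFirstTurnTool(
  history: HistoryEntry[],
  classify: (msg: string) => string | null,
  message: string,
): string | null {
  if (history.length > 0) return null;
  return classify(message);
}
```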
Tool loop
If no pre-model check handled the request, the tool loop starts:
- Send message history to chat completions with SSE streaming
- Stream assistant text to the UI
- Capture tool calls from the stream
- Execute each tool call via the shared tool dispatch layer (awaited)
- Append tool results back to history
- Repeat until the model stops calling tools or a guard trips
| Guard | Value |
|---|---|
| Max tool iterations | 10 |
| Max no-progress iterations | 3 |
| Tool timeout | 60s |
| Chat 429 retries | 3 |
The text runtime also blocks duplicate failed tool calls, suppresses redundant display tools when a data tool already rendered the UI, and truncates tool results before sending them back.
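The loop and its first two guards can be sketched with the values from the table; the `step` callback is a stand-in for one streamed completion round (timeouts and 429 retries would live inside it):

```typescript
// Guard values from the table above.
const MAX_ITERATIONS = 10;
const MAX_NO_PROGRESS = 3;

interface StepResult {
  toolCalls: string[];   // tool calls captured from this round's stream
  madeProgress: boolean; // did this round change state meaningfully?
}

// Runs rounds until the model stops calling tools or a guard trips.
// Returns the number of completed rounds.
function runToolLoop(step: (iteration: number) => StepResult): number {
  let noProgress = 0;
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const result = step(i);
    if (result.toolCalls.length === 0) return i;   // natural stop
    noProgress = result.madeProgress ? 0 : noProgress + 1;
    if (noProgress >= MAX_NO_PROGRESS) return i;   // no-progress guard
  }
  return MAX_ITERATIONS;                           // iteration guard
}
```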
Voice path
The voice path is simpler in routing but trickier in timing.
Key differences from text
- No compound workflow planner — the realtime model handles multi-step reasoning itself
- No forced tool overrides — the model receives HIE context and chooses tools naturally
- Server-managed conversation state — history lives on the realtime server, not locally
- Synchronous function outputs — tool callbacks must return immediately
The deferred response pattern
Voice tool callbacks must return to the realtime session immediately. For async tools like search, MCP calls, and content reads, the handler returns a placeholder like "results will appear shortly".
The critical mechanism: the voice runtime does not immediately ask the model to respond when the output looks like a placeholder. Instead, it waits for the actual block update and HIE context injection. This prevents the model from speaking before the UI results exist.
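The deferred-response decision might be sketched like this, assuming placeholder outputs are recognizable by their wording (the regex is an assumption, not the real detector):

```typescript
// Hypothetical placeholder detector for async tool outputs.
const PLACEHOLDER = /will appear shortly/i;

// true  -> ask the realtime model to respond now (synchronous output)
// false -> wait for the real block update and HIE context injection
//          before triggering a response, so the model cannot narrate
//          results that do not exist yet.
function shouldRequestResponseNow(functionOutput: string): boolean {
  return !PLACEHOLDER.test(functionOutput);
}
```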
Text-over-voice
When a user types while voice is connected, the message is sent via the WebRTC data channel. From that point, it follows the voice path — the realtime model handles it with the same tool dispatch and deferred response mechanism.
Shared tool layer
Both voice and text converge at the shared tool dispatch layer.
Every tool run inherits source context from HIE:
- source block and source artifact
- source task kind
- turn lineage
- target context for form prefill
That captured source context is the main bridge between previous UI state and the next tool action.
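The inherited source context might be shaped roughly like this; the field names are assumptions derived from the list above:

```typescript
// Hypothetical shape of the context a tool run inherits from HIE.
interface SourceContext {
  sourceBlockId: string | null;
  sourceArtifactId: string | null;
  sourceTaskKind: string | null;          // e.g. "search", "recap"
  turnLineage: string[];                  // parent turn ids, root first
  prefillTarget: Record<string, string>;  // form prefill values
}

function captureSourceContext(
  focusedBlock: { id: string; artifactId: string; taskKind: string } | null,
  lineage: string[],
): SourceContext {
  return {
    sourceBlockId: focusedBlock?.id ?? null,
    sourceArtifactId: focusedBlock?.artifactId ?? null,
    sourceTaskKind: focusedBlock?.taskKind ?? null,
    turnLineage: [...lineage], // defensive copy: lineage outlives the tool run
    prefillTarget: {},
  };
}
```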
Runtime handlers update the UI through store helpers that atomically update the store, notify HIE, and sometimes emit explicit artifact events.
Action panel feedback
The action panel is not passive. It sends state back into HIE.
| User action | HIE event | Triggers response? |
|---|---|---|
| Click a result item | block.interaction.click-result | Sometimes |
| Select items | block.interaction.select | No |
| Focus on a block | task.focused | No |
| Click recap button | task.recap.requested | Yes |
| Submit a form | form.submitted | Yes |
| Cancel a form | form.cancelled | Yes |
| Dismiss a block | block.removed | No |
Response-triggering interactions can create a new assistant turn without the user typing or speaking again. This is why GriMoire often behaves like a mixed UI-plus-chat runtime rather than a pure chatbot.
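One way to sketch the table as code, modeling "Sometimes" as a predicate. The event names come from the table; the click condition is hypothetical:

```typescript
type Trigger = "yes" | "no" | "sometimes";

// Rows of the action-panel table above.
const interactionTriggers: Record<string, Trigger> = {
  "block.interaction.click-result": "sometimes",
  "block.interaction.select": "no",
  "task.focused": "no",
  "task.recap.requested": "yes",
  "form.submitted": "yes",
  "form.cancelled": "yes",
  "block.removed": "no",
};

// clickOpensDetail is a hypothetical stand-in for whatever condition
// decides the "Sometimes" case for result clicks.
function triggersAssistantTurn(event: string, clickOpensDetail = false): boolean {
  const t = interactionTriggers[event];
  if (t === "sometimes") return clickOpensDetail;
  return t === "yes";
}
```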
Request analysis: text vs voice
The same utterance can take a more guided path in text mode than in voice mode. The analysis order for each channel:
- Text: compound workflow planning → forced contextual override → first-turn routing → tool loop
- Voice: realtime model tool planning, grounded by the HIE context that was already injected
Text mode has three extra pre-model routing stages (compound, override, first-turn) that do not exist in voice mode.
End-to-end traces
Trace 1: Typed "find documents about SPFx"
After this, follow-up typed messages can reference results by number. The text path's contextual override layer makes "summarize document 3" resolve directly to the right file.
Trace 2: Typed follow-up "summarize document 3"
The contextual override bypassed the model entirely for tool selection. The model still generates the summary text, but it did not need to figure out which tool to call or which file to read.
Trace 3: Voice follow-up "send it by email"
Voice had no compound planner or contextual override. The realtime model chose the right tool because HIE had already injected the summary card's context, including its content and artifact lineage.