
Interaction flow

GriMoire does not have one unified ingress pipeline. It has two — voice and text — that converge only at the tool runtime and HIE layers.

User input (voice or text)
  ↓
Ingress (channel-specific):
  • Voice path: WebRTC realtime session
  • Text path: HTTP SSE completions
  ↓
Shared tool dispatch (both paths converge here)
  ↓
Store + blocks (Zustand state)
  ↓
HIE feedback (context → model)

The convergence rule

  • Ingress differs by channel — voice uses WebRTC, text uses HTTP streaming
  • Execution converges at tool dispatch — both paths call handleFunctionCall
  • Context returns through HIE — block tracking, visual injection, interaction normalization

Both paths share the same system prompt, tool registry, runtime handlers, block store, and HIE singleton. But they do not share routing logic, history ownership, or orchestration depth.
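The convergence point can be sketched as a single dispatch function keyed by tool name. This is a minimal sketch under assumptions: only the handleFunctionCall name comes from this document; the ToolResult shape and registry are invented for illustration.

```typescript
// Hypothetical sketch of the shared dispatch entry point.
// The real handleFunctionCall signature is not shown in this document.
type ToolResult = { ok: boolean; summary: string };

type ToolHandler = (args: Record<string, unknown>) => Promise<ToolResult>;

// Shared registry: both voice and text ingress paths resolve tools here.
const toolRegistry = new Map<string, ToolHandler>();

// Both ingress paths call this with the model's tool call.
async function handleFunctionCall(
  name: string,
  args: Record<string, unknown>,
): Promise<ToolResult> {
  const handler = toolRegistry.get(name);
  if (!handler) return { ok: false, summary: `unknown tool: ${name}` };
  return handler(args);
}
```

Because the registry is shared, registering a tool once makes it available to both channels.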

Text path

The text path has more client-side orchestration than the voice path because the HTTP completions API is stateless — the client must handle routing, history, and multi-step planning that the realtime model manages server-side. It runs when voice is not connected.

User types message
  ↓
Turn creation + HIE grounding (visual state injected before model call)
  ↓
History sanitized (prepare conversation for model)
  ↓
Pre-model routing (sequential):
  • Compound workflow? (multi-step plan via fast model)
  • Forced override? (from visible blocks)
  • First-turn routing? (intent classifier)
  ↓
Tool loop (SSE stream → tool calls → shared dispatch)
  ↓
Block push + HIE feedback

Turn creation and grounding

When the user types a message:

  1. The system creates a new turn and decides whether to inherit the current thread or start a new root turn — based on visible blocks, reference titles, reset phrases, and contextual follow-up patterns.
  2. The HIE injects the current visual state as a reminder before the model is queried.

This means typed requests are explicitly grounded in current UI state before any model call happens.
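The grounding step can be sketched as a formatter that turns visible blocks into the reminder string prepended before the model call. A minimal sketch, assuming a VisibleBlock shape and the bracketed format shown in the traces later in this document; the real HIE types are not shown here.

```typescript
// Assumed shape for a block visible in the action panel.
interface VisibleBlock {
  index: number;
  title: string;
}

// Build the "[Visual context: ...]" reminder injected before the model call.
// Returns null when nothing is on screen, so no reminder is added.
function buildVisualContext(blocks: VisibleBlock[]): string | null {
  if (blocks.length === 0) return null;
  const items = blocks.map((b) => `${b.index}) ${b.title}`).join(" ");
  return `[Visual context: ${items}]`;
}
```

The numbered format is what lets follow-up messages like "summarize document 3" resolve by index.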

Pre-model checks

Before the model sees the message, the text path runs three sequential checks. If any check handles the request, later checks are skipped.
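The short-circuit behavior can be sketched as a check chain: each check either handles the message or passes it on. The RouteDecision and Check types are assumptions for illustration, not the real routing types.

```typescript
// A check either handles the message (with a chosen route) or declines.
type RouteDecision = { handled: boolean; route?: string };
type Check = (msg: string) => RouteDecision;

// Run checks in order; once one fires, later checks are skipped.
function runPreModelChecks(msg: string, checks: Check[]): RouteDecision {
  for (const check of checks) {
    const decision = check(msg);
    if (decision.handled) return decision; // short-circuit
  }
  return { handled: false }; // fall through to the normal tool loop
}
```

In this document's ordering, the chain would be: compound workflow planner, forced override, first-turn classifier.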

Compound workflow planning

The fast model analyzes the message for multi-step intents expressed in a single sentence. If a compound plan is accepted, the normal tool loop is skipped entirely and the workflow steps run directly. This is text-path-only — voice does not pre-plan compound workflows.

Nine workflow families are supported:

  • Search + recap: search → recap visible results
  • Search + recap + email: search → recap → compose email
  • Search + recap + Teams chat: search → recap → share to Teams chat
  • Search + recap + Teams channel: search → recap → post to Teams channel
  • Search + summarize document: search → read and summarize top result
  • Search + summarize + email: search → summarize → compose email
  • Find person + email: find person → compose email to them
  • Search emails + summarize: search mailbox → summarize best match
  • Recap + reply all: recap visible content → reply all to mail discussion

Each workflow executes step by step with a progress tracker block showing real-time status. The system auto-selects the best result based on hints (first, top, best, latest) and stops gracefully when user choice is needed or results are empty.
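The hint-based auto-selection can be sketched as a small picker over ranked results. The Result shape and the exact ranking rules are assumptions; only the hint words (first, top, best, latest) come from this document.

```typescript
// Assumed result shape: a relevance score plus a modified timestamp.
interface Result {
  title: string;
  modified: number; // epoch millis
  score: number;    // relevance
}

// Pick one result based on the user's hint word, or null when empty
// (the workflow stops gracefully in that case).
function autoSelect(results: Result[], hint: string): Result | null {
  if (results.length === 0) return null;
  if (hint === "latest")
    return [...results].sort((a, b) => b.modified - a.modified)[0];
  if (hint === "best" || hint === "top")
    return [...results].sort((a, b) => b.score - a.score)[0];
  return results[0]; // "first" and the default case
}
```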

Forced contextual overrides

If no compound workflow was selected, the text path may override the first tool call using visible UI context:

  • "summarize document 3" → read_file_content (numbered result visible)
  • "open the first one" → show_file_details (result block visible)
  • "send this by email" → show_compose_form (actionable block visible)
  • "browse this library" → browse_document_library (site/library context visible)

This override layer is built from currently visible blocks and HIE-compatible references.
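The override layer can be sketched as rules pairing a phrase pattern with a required visible-block kind. The tool names come from the table above; the regex patterns and block-kind strings are invented for illustration.

```typescript
// One override rule: the phrase must match AND the block kind must be visible.
interface OverrideRule {
  pattern: RegExp;
  requiresVisible: string; // block kind that must be on screen (assumed names)
  tool: string;
}

const overrideRules: OverrideRule[] = [
  { pattern: /summarize document \d+/i, requiresVisible: "numbered-results", tool: "read_file_content" },
  { pattern: /open the first one/i,     requiresVisible: "result-block",     tool: "show_file_details" },
  { pattern: /send this by email/i,     requiresVisible: "actionable-block", tool: "show_compose_form" },
  { pattern: /browse this library/i,    requiresVisible: "site-library",     tool: "browse_document_library" },
];

// Returns the forced first tool, or null when no rule applies.
function resolveOverride(msg: string, visibleKinds: Set<string>): string | null {
  for (const rule of overrideRules) {
    if (rule.pattern.test(msg) && visibleKinds.has(rule.requiresVisible)) {
      return rule.tool;
    }
  }
  return null;
}
```

Note that both conditions must hold: without the matching visible block, the message falls through to normal model planning.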

First-turn routing

If the conversation is on its first real user turn — no prior user/assistant/tool history — a classifier may force a specific tool such as search_sharepoint, search_people, search_emails, or research_public_web. The fast path uses the fast model; the fallback uses heuristics.

Once the conversation has history, this check is skipped entirely.
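The heuristic fallback can be sketched as keyword routing over the four tools named above. The keyword rules here are assumptions; the real fast-model classifier and its prompt are not shown in this document.

```typescript
// Hypothetical keyword heuristics for first-turn routing.
// Only the tool names come from the document; the patterns are invented.
function heuristicFirstTurnRoute(msg: string): string | null {
  const m = msg.toLowerCase();
  if (/\bemails?\b|\bmailbox\b/.test(m)) return "search_emails";
  if (/\bwho is\b|\bperson\b|\bcontact\b/.test(m)) return "search_people";
  if (/\bweb\b|\bnews\b|\bonline\b/.test(m)) return "research_public_web";
  if (/\bfind\b|\bsearch\b|\bdocuments?\b/.test(m)) return "search_sharepoint";
  return null; // fall through to normal model planning
}
```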

Tool loop

If no pre-model check handled the request, the tool loop starts:

  1. Send message history to chat completions with SSE streaming
  2. Stream assistant text to the UI
  3. Capture tool calls from the stream
  4. Execute each tool call via the shared tool dispatch layer (awaited)
  5. Append tool results back to history
  6. Repeat until the model stops calling tools or a guard trips

Guards:

  • Max tool iterations: 10
  • Max no-progress iterations: 3
  • Tool timeout: 60s
  • Chat 429 retries: 3
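The loop and its iteration guards can be sketched as follows. The guard constants come from the table above; the StreamTurn shape, the "no progress" definition (an iteration whose tool calls all return empty results), and the callback signatures are assumptions.

```typescript
const MAX_TOOL_ITERATIONS = 10; // from the guard table
const MAX_NO_PROGRESS = 3;      // from the guard table

// Assumed shape of one SSE round-trip: streamed text plus captured tool calls.
interface StreamTurn {
  toolCalls: string[];
  text: string;
}

// streamOnce performs one completions round-trip; dispatch is the shared
// tool dispatch layer. History append is assumed to happen in the callbacks.
async function runToolLoop(
  streamOnce: () => Promise<StreamTurn>,
  dispatch: (call: string) => Promise<string>,
): Promise<string> {
  let noProgress = 0;
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const turn = await streamOnce();
    // Model stopped calling tools: the loop is done.
    if (turn.toolCalls.length === 0) return turn.text;
    let progressed = false;
    for (const call of turn.toolCalls) {
      const result = await dispatch(call); // awaited, result goes back to history
      if (result !== "") progressed = true;
    }
    noProgress = progressed ? 0 : noProgress + 1;
    if (noProgress >= MAX_NO_PROGRESS) break; // no-progress guard tripped
  }
  return ""; // iteration guard tripped
}
```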

The text runtime also blocks duplicate failed tool calls, suppresses redundant display tools when a data tool already rendered the UI, and truncates tool results before sending them back.

Voice path

The voice path is simpler in routing but trickier in timing.

User speaks
  ↓
Server VAD (speech start/stop detection)
  ↓
Transcript arrives (turn created via HIE policy)
  ↓
Model plans tool call (server-side reasoning)
  ↓
Sync function output (placeholder for async tools)
  ↓
Block update + HIE context
  ↓
Deferred response.create (model speaks after results exist)

Key differences from text

  • No compound workflow planner — the realtime model handles multi-step reasoning itself
  • No forced tool overrides — the model receives HIE context and chooses tools naturally
  • Server-managed conversation state — history lives on the realtime server, not locally
  • Synchronous function outputs — tool callbacks must return immediately

The deferred response pattern

Voice tool callbacks must return to the realtime session immediately. For async tools like search, MCP calls, and content reads, the handler returns a placeholder like "results will appear shortly".

The critical mechanism: the voice runtime does not immediately ask the model to respond when the output looks like a placeholder. Instead, it waits for the actual block update and HIE context injection. This prevents the model from speaking before the UI results exist.

Text-over-voice

When a user types while voice is connected, the message is sent via the WebRTC data channel. From that point, it follows the voice path — the realtime model handles it with the same tool dispatch and deferred response mechanism.

Shared tool layer

Both voice and text converge at the shared tool dispatch layer.

Shared tool dispatch
  ↓
Capture source context (from HIE state)
  ↓
Dispatch by category: Search · Content · MCP · UI & Personal
  ↓
Store + HIE notify

Every tool run inherits source context from HIE:

  • source block and source artifact
  • source task kind
  • turn lineage
  • target context for form prefill

That captured source context is the main bridge between previous UI state and the next tool action.

Runtime handlers update the UI through store helpers that atomically update the store, notify HIE, and sometimes emit explicit artifact events.
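The "update atomically, then notify" helper can be sketched as a store that writes the block and emits the HIE event in a single call. The store shape, event name, and listener API are assumptions, not the real Zustand slice.

```typescript
// Assumed block and listener shapes for illustration.
interface Block {
  id: string;
  kind: string;
}
type Listener = (event: string, block: Block) => void;

class BlockStore {
  private blocks = new Map<string, Block>();
  private listeners: Listener[] = [];

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
  }

  // Atomic helper: write the block, then notify HIE before returning,
  // so listeners never observe a pushed block without its event.
  pushBlock(block: Block): void {
    this.blocks.set(block.id, block);
    for (const l of this.listeners) l("block.pushed", block);
  }

  get(id: string): Block | undefined {
    return this.blocks.get(id);
  }
}
```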

Action panel feedback

The action panel is not passive. It sends state back into HIE.

  • Click a result item → block.interaction.click-result (triggers a response sometimes)
  • Select items → block.interaction.select (no response)
  • Focus on a block → task.focused (no response)
  • Click recap button → task.recap.requested (triggers a response)
  • Submit a form → form.submitted (triggers a response)
  • Cancel a form → form.cancelled (triggers a response)
  • Dismiss a block → block.removed (no response)

Response-triggering interactions can create a new assistant turn without the user typing or speaking again. This is why GriMoire often behaves like a mixed UI-plus-chat runtime rather than a pure chatbot.
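The interaction table can be sketched as a lookup, with the "sometimes" case for result clicks modeled as a caller-supplied condition (an assumption; the document does not say what decides it).

```typescript
// Events that always create a new assistant turn, per the table above.
const alwaysTriggers = new Set([
  "task.recap.requested",
  "form.submitted",
  "form.cancelled",
]);

// hasActionableTarget stands in for whatever makes a result click
// "sometimes" trigger a response; the real condition is not documented.
function triggersResponse(event: string, hasActionableTarget = false): boolean {
  if (alwaysTriggers.has(event)) return true;
  if (event === "block.interaction.click-result") return hasActionableTarget;
  return false; // select, focus, dismiss, and everything else
}
```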

Request analysis: text vs voice

The same utterance can take a more guided path in text mode than in voice mode. Here is the analysis order for each channel:

Text mode (6 steps):

  1. HIE turn-start policy
  2. Compound workflow planner
  3. Contextual forced override
  4. First-turn routing classifier
  5. Normal model tool planning
  6. HIE feedback after blocks render

Voice mode (4 steps):

  1. HIE turn-start policy
  2. Heuristic first-turn observation
  3. Normal realtime model tool planning
  4. HIE feedback after blocks render

Text mode has three extra pre-model routing stages (compound, override, first-turn) that do not exist in voice mode. Voice relies more heavily on the realtime model's own tool planning, grounded by the HIE context that was already injected.

End-to-end traces

Trace 1: Typed "find documents about SPFx"

Turn created (new root, no prior context)
  ↓
Visual state injected (no blocks yet)
  ↓
First-turn routing (classifies as search_sharepoint)
  ↓
search_sharepoint executes (loading block appears)
  ↓
Results update block (fused results rendered)
  ↓
HIE injects numbered context: [Visual context: 1) Architecture Guide ...]

After this, follow-up typed messages can reference results by number. The text path's contextual override layer makes "summarize document 3" resolve directly to the right file.

Trace 2: Typed follow-up "summarize document 3"

Turn inherited (same root as search turn)
  ↓
Visual state injected (search results block)
  ↓
Contextual override fires (resolves to read_file_content)
  ↓
Summary card created (info-card block)
  ↓
HIE emits artifact.result.ready (artifact kind: summary)

The contextual override bypassed the model entirely for tool selection. The model still generates the summary text, but it did not need to figure out which tool to call or which file to read.

Trace 3: Voice follow-up "send it by email"

Transcript arrives (inherited turn)
  ↓
Model chooses show_compose_form (based on HIE context)
  ↓
Form block appears (HIE emits form.opened)
  ↓
User edits and submits (HIE records form.submitted; artifact status: submitted)

Voice had no compound planner or contextual override. The realtime model chose the right tool because HIE had already injected the summary card's context, including its content and artifact lineage.