Interaction flow
GriMoire does not have one unified ingress pipeline. It has two — voice and text — that converge only at the tool runtime and HIE layers.
The convergence rule
- Ingress differs by channel — voice uses WebRTC, text uses HTTP streaming
- Execution converges at tool dispatch — both paths call handleFunctionCall
- Context returns through HIE — block tracking, visual injection, interaction normalization
Both paths share the same system prompt, tool registry, runtime handlers, block store, and HIE singleton. But they do not share routing logic, history ownership, or orchestration depth.
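The convergence rule can be sketched as a single dispatch function that both channels call. The registry contents and handler shape below are illustrative assumptions, not GriMoire's real API:

```typescript
// Hypothetical sketch: both ingress channels funnel into one dispatch point.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolHandler = (args: Record<string, unknown>) => string;

// Shared tool registry (illustrative entry).
const toolRegistry = new Map<string, ToolHandler>();
toolRegistry.set("search_sharepoint", (args) => `searched: ${args.query}`);

// The convergence point: routing differed upstream, execution is identical here.
function handleFunctionCall(call: ToolCall, channel: "voice" | "text"): string {
  const handler = toolRegistry.get(call.name);
  if (!handler) return `unknown tool: ${call.name}`;
  return handler(call.args);
}
```

The `channel` parameter only exists upstream of this point; by design it does not change what a tool does.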
Text path
The text path has more client-side orchestration than the voice path because the HTTP completions API is stateless — the client must handle routing, history, and multi-step planning that the realtime model manages server-side. It runs when voice is not connected.
Turn creation and grounding
When the user types a message:
- The system creates a new turn and decides whether to inherit the current thread or start a new root turn — based on visible blocks, reference titles, reset phrases, and contextual follow-up patterns.
- The HIE injects the current visual state as a reminder before the model is queried.
This means typed requests are explicitly grounded in current UI state before any model call happens.
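A minimal sketch of the inherit-vs-new-root decision, assuming hypothetical signal names for the cues listed above (visible blocks, reference titles, reset phrases, follow-up patterns):

```typescript
// Hypothetical signals extracted from the typed message and current UI state.
interface TurnSignals {
  visibleBlocks: number;        // how many blocks are currently on screen
  referencesVisibleTitle: boolean; // message names a visible block's title
  isResetPhrase: boolean;       // e.g. "start over", "new topic"
  looksLikeFollowUp: boolean;   // e.g. "the first one", "summarize it"
}

function shouldInheritThread(s: TurnSignals): boolean {
  if (s.isResetPhrase) return false;          // explicit reset always wins
  if (s.referencesVisibleTitle) return true;  // grounded in visible UI
  if (s.looksLikeFollowUp && s.visibleBlocks > 0) return true;
  return false;                               // default: start a new root turn
}
```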
Pre-model checks
Before the model sees the message, the text path runs three sequential checks. If any check handles the request, later checks are skipped.
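The short-circuit behavior can be sketched as a simple check chain; the `PreModelCheck` shape is a hypothetical stand-in, not the real interface:

```typescript
// Each check returns true if it fully handled the request.
type PreModelCheck = (message: string) => boolean;

// Returns the index of the check that handled the message, or -1 if none
// did and the normal tool loop should run. Later checks are never invoked
// once one succeeds.
function runPreModelChecks(message: string, checks: PreModelCheck[]): number {
  for (let i = 0; i < checks.length; i++) {
    if (checks[i](message)) return i;
  }
  return -1;
}
```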
Compound workflow planning
The fast model analyzes the message for multi-step intents expressed in a single sentence. If a compound plan is accepted, the normal tool loop is skipped entirely and the workflow steps run directly. This is text-path-only — voice does not pre-plan compound workflows.
Nine workflow families are supported:
| Workflow | Steps |
|---|---|
| Search + recap | Search → recap visible results |
| Search + recap + email | Search → recap → compose email |
| Search + recap + Teams chat | Search → recap → share to Teams chat |
| Search + recap + Teams channel | Search → recap → post to Teams channel |
| Search + summarize document | Search → read and summarize top result |
| Search + summarize + email | Search → summarize → compose email |
| Find person + email | Find person → compose email to them |
| Search emails + summarize | Search mailbox → summarize best match |
| Recap + reply all | Recap visible content → reply all to mail discussion |
Each workflow executes step by step with a progress tracker block showing real-time status. The system auto-selects the best result based on hints (first, top, best, latest) and stops gracefully when user choice is needed or results are empty.
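The hint-based auto-selection might look roughly like this; treating every hint as "take the top-ranked result" is a simplifying assumption of the sketch:

```typescript
// Hint words the text above lists for auto-selecting a result.
const HINTS = ["first", "top", "best", "latest"] as const;

// Returns the chosen result, or null when the workflow should stop
// gracefully (empty results or a choice the user must make).
function pickResult<T>(results: T[], message: string): T | null {
  if (results.length === 0) return null;   // empty: stop gracefully
  if (results.length === 1) return results[0];
  const lower = message.toLowerCase();
  // In this sketch every hint resolves to the top-ranked result;
  // a real ranker could treat "latest" differently from "first".
  if (HINTS.some((h) => lower.includes(h))) return results[0];
  return null; // ambiguous: hand control back to the user
}
```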
Forced contextual overrides
If no compound workflow was selected, the text path may override the first tool call using visible UI context:
| User says | Override tool | Condition |
|---|---|---|
| "summarize document 3" | read_file_content | Numbered result visible |
| "open the first one" | show_file_details | Result block visible |
| "send this by email" | show_compose_form | Actionable block visible |
| "browse this library" | browse_document_library | Site/library context visible |
This override layer is built from currently visible blocks and HIE-compatible references.
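The override table can be modeled as data plus a first-match lookup. The regexes and visibility keys below are illustrative stand-ins for the real block and HIE references:

```typescript
interface OverrideRule {
  pattern: RegExp;                         // matches the user's phrasing
  tool: string;                            // tool to force as the first call
  condition: (visible: Set<string>) => boolean; // visibility precondition
}

// Rows lifted from the table; keys like "numbered-results" are hypothetical.
const rules: OverrideRule[] = [
  { pattern: /summari[sz]e document \d+/i, tool: "read_file_content",
    condition: (v) => v.has("numbered-results") },
  { pattern: /open the first one/i, tool: "show_file_details",
    condition: (v) => v.has("result-block") },
  { pattern: /send this by email/i, tool: "show_compose_form",
    condition: (v) => v.has("actionable-block") },
];

function resolveOverride(msg: string, visible: Set<string>): string | null {
  const rule = rules.find((r) => r.pattern.test(msg) && r.condition(visible));
  return rule ? rule.tool : null;
}
```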
First-turn routing
If the conversation is on its first real user turn — no prior user/assistant/tool history — a classifier may force a specific tool such as search_sharepoint, search_people, search_emails, or research_public_web. The fast path uses the fast model; the fallback uses heuristics.
Once the conversation has history, this check is skipped entirely.
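A sketch of the first-turn gate, with a hypothetical `classify` callback standing in for the fast-model classifier and its heuristic fallback:

```typescript
interface HistoryEntry { role: "user" | "assistant" | "tool"; }

// Returns a forced tool name (e.g. "search_sharepoint") or null to let the
// model route normally. Skipped entirely once any real history exists.
function maybeForceFirstTurnTool(
  history: HistoryEntry[],
  classify: (msg: string) => string | null,
  message: string,
): string | null {
  if (history.length > 0) return null;
  return classify(message);
}
```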
Tool loop
If no pre-model check handled the request, the tool loop starts:
- Send message history to chat completions with SSE streaming
- Stream assistant text to the UI
- Capture tool calls from the stream
- Execute each tool call via the shared tool dispatch layer (awaited)
- Append tool results back to history
- Repeat until the model stops calling tools or a guard trips
| Guard | Value |
|---|---|
| Max tool iterations | 10 |
| Max no-progress iterations | 3 |
| Tool timeout | 60s |
| Chat 429 retries | 3 |
The text runtime also blocks duplicate failed tool calls, suppresses redundant display tools when a data tool already rendered the UI, and truncates tool results before sending them back.
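The loop and its first two guards can be sketched with the values from the table; the `step` callback is a stand-in for one streamed completion round (timeouts and 429 retries would live inside it):

```typescript
// Guard values from the table above.
const MAX_ITERATIONS = 10;
const MAX_NO_PROGRESS = 3;

interface StepResult {
  toolCalls: string[];   // tool calls captured from this round's stream
  madeProgress: boolean; // did this round change state meaningfully?
}

// Runs rounds until the model stops calling tools or a guard trips.
// Returns the number of completed rounds.
function runToolLoop(step: (iteration: number) => StepResult): number {
  let noProgress = 0;
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const result = step(i);
    if (result.toolCalls.length === 0) return i;   // natural stop
    noProgress = result.madeProgress ? 0 : noProgress + 1;
    if (noProgress >= MAX_NO_PROGRESS) return i;   // no-progress guard
  }
  return MAX_ITERATIONS;                           // iteration guard
}
```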
Voice path
The voice path is simpler in routing but trickier in timing.
Key differences from text
- No compound workflow planner — the realtime model handles multi-step reasoning itself
- No forced tool overrides — the model receives HIE context and chooses tools naturally
- Server-managed conversation state — history lives on the realtime server, not locally
- Synchronous function outputs — tool callbacks must return immediately
The deferred response pattern
Voice tool callbacks must return to the realtime session immediately. For async tools like search, MCP calls, and content reads, the handler returns a placeholder like "results will appear shortly".
The critical mechanism: the voice runtime does not immediately ask the model to respond when the output looks like a placeholder. Instead, it waits for the actual block update and HIE context injection. This prevents the model from speaking before the UI results exist.
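The deferred-response decision might be sketched like this, assuming placeholder outputs are recognizable by their wording (the regex is an assumption, not the real detector):

```typescript
// Hypothetical placeholder detector for async tool outputs.
const PLACEHOLDER = /will appear shortly/i;

// true  -> ask the realtime model to respond now (synchronous output)
// false -> wait for the real block update and HIE context injection
//          before triggering a response, so the model cannot narrate
//          results that do not exist yet.
function shouldRequestResponseNow(functionOutput: string): boolean {
  return !PLACEHOLDER.test(functionOutput);
}
```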
Text-over-voice
When a user types while voice is connected, the message is sent via the WebRTC data channel. From that point, it follows the voice path — the realtime model handles it with the same tool dispatch and deferred response mechanism.
Shared tool layer
Both voice and text converge at the shared tool dispatch layer.
Every tool run inherits source context from HIE:
- source block and source artifact
- source task kind
- turn lineage
- target context for form prefill
That captured source context is the main bridge between previous UI state and the next tool action.
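The inherited source context might be shaped roughly like this; the field names are assumptions derived from the list above:

```typescript
// Hypothetical shape of the context a tool run inherits from HIE.
interface SourceContext {
  sourceBlockId: string | null;
  sourceArtifactId: string | null;
  sourceTaskKind: string | null;          // e.g. "search", "recap"
  turnLineage: string[];                  // parent turn ids, root first
  prefillTarget: Record<string, string>;  // form prefill values
}

function captureSourceContext(
  focusedBlock: { id: string; artifactId: string; taskKind: string } | null,
  lineage: string[],
): SourceContext {
  return {
    sourceBlockId: focusedBlock?.id ?? null,
    sourceArtifactId: focusedBlock?.artifactId ?? null,
    sourceTaskKind: focusedBlock?.taskKind ?? null,
    turnLineage: [...lineage], // defensive copy: lineage outlives the tool run
    prefillTarget: {},
  };
}
```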
Runtime handlers update the UI through store helpers that atomically update the store, notify HIE, and sometimes emit explicit artifact events.
Action panel feedback
The action panel is not passive. It sends state back into HIE.
| User action | HIE event | Triggers response? |
|---|---|---|
| Click a result item | block.interaction.click-result | Sometimes |
| Select items | block.interaction.select | No |
| Focus on a block | task.focused | No |
| Click recap button | task.recap.requested | Yes |
| Submit a form | form.submitted | Yes |
| Cancel a form | form.cancelled | Yes |
| Dismiss a block | block.removed | No |
Response-triggering interactions can create a new assistant turn without the user typing or speaking again. This is why GriMoire often behaves like a mixed UI-plus-chat runtime rather than a pure chatbot.
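One way to sketch the table as code, modeling "Sometimes" as a predicate. The event names come from the table; the click condition is hypothetical:

```typescript
type Trigger = "yes" | "no" | "sometimes";

// Rows of the action-panel table above.
const interactionTriggers: Record<string, Trigger> = {
  "block.interaction.click-result": "sometimes",
  "block.interaction.select": "no",
  "task.focused": "no",
  "task.recap.requested": "yes",
  "form.submitted": "yes",
  "form.cancelled": "yes",
  "block.removed": "no",
};

// clickOpensDetail is a hypothetical stand-in for whatever condition
// decides the "Sometimes" case for result clicks.
function triggersAssistantTurn(event: string, clickOpensDetail = false): boolean {
  const t = interactionTriggers[event];
  if (t === "sometimes") return clickOpensDetail;
  return t === "yes";
}
```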
Request analysis: text vs voice
The same utterance can take a more guided path in text mode than in voice mode. The analysis order for each channel:
- Text: compound workflow planning → forced contextual override → first-turn routing → tool loop
- Voice: realtime model tool planning, grounded by the HIE context that was already injected
Text mode has three extra pre-model routing stages (compound, override, first-turn) that do not exist in voice mode.
End-to-end traces
Trace 1: Typed "find documents about SPFx"
After this, follow-up typed messages can reference results by number. The text path's contextual override layer makes "summarize document 3" resolve directly to the right file.
Trace 2: Typed follow-up "summarize document 3"
The contextual override bypassed the model entirely for tool selection. The model still generates the summary text, but it did not need to figure out which tool to call or which file to read.
Trace 3: Voice follow-up "send it by email"
Voice had no compound planner or contextual override. The realtime model chose the right tool because HIE had already injected the summary card's context, including its content and artifact lineage.