Case Study
Voice-Native OS
Close the loop: speak intent into a microphone, watch a multi-agent pipeline create a ticket, dispatch it to an isolated executor, verify the output, and merge to main — without touching a keyboard.
Demo video coming — see the demo script for the exact recording plan.
The gap this closes
The Master OS already had a complete agentic pipeline: tickets, a finite-state machine (FSM) governing ticket status, isolated git worktrees, subagent executors, verifier subagents, a pre-merge gate, and fast-forward (FF) merge to main. The one missing piece: every interaction required a keyboard and a terminal window.
The voice layer closes that gap. The goal was not "voice assistant as novelty" —
it was voice as a first-class dispatch channel that routes into the same
production pipeline as a typed /start command. No special-case paths,
no toy demo environment.
Architecture
Voice pipeline
Dispatch pipeline
Design decision: self-hosted, zero cloud audio
All audio processing happens on the Mac running the OS. LiveKit is self-hosted in Docker. No audio bytes leave the machine. This satisfies the operator-lens-in-house doctrine: operational audio of a running AI system is treated the same as trading signals — it does not leave the silo.
The practical cost: near-zero. LiveKit's Docker image is Apache-2.0. Deepgram
free tier covers 12,000 minutes per month (far beyond personal use). TTS falls back
to the built-in macOS say command at zero marginal cost.
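The TTS fallback can be sketched as below. This is a minimal illustration, not the project's code: the CARTESIA_API_KEY variable name and all three function names are assumptions, and only the macOS say command itself comes from the text.

```python
import os
import shutil
import subprocess


def choose_tts_backend(env=None):
    """Pick cloud TTS when a key is configured, otherwise the local `say`."""
    env = os.environ if env is None else env
    # CARTESIA_API_KEY is a hypothetical variable name used for illustration.
    if env.get("CARTESIA_API_KEY"):
        return "cartesia"
    return "say"


def say_command(text):
    """Build the argv for the built-in macOS `say` command."""
    return ["say", text]


def speak_locally(text):
    """Speak text with `say`; return False when the binary is unavailable."""
    if shutil.which("say") is None:  # not on macOS, or `say` not on PATH
        return False
    subprocess.run(say_command(text), check=True)
    return True
```

Because the local path needs no credentials or network, it also works as the default whenever the cloud backend is not configured.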
Design decision: LLM function-calling, not intent parsing
The agent does not use a custom NLP parser or keyword matching to route voice commands.
It uses Claude Haiku's native function-calling: the system prompt declares three
tools — create_ticket, dispatch_ticket, and
query_queue — and the model routes each utterance to the right tool.
This means the routing inherits the model's language understanding out of the box.
"File a ticket," "create a new task," and "add this to the queue" all route to
create_ticket without explicit pattern matching. The model handles
linguistic variation; the function call handles execution.
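A rough sketch of what the three tool declarations and the routing step might look like. The declarations follow the name / description / input_schema shape used for tool definitions in the Anthropic Messages API; the parameter names (title, ticket_id), the descriptions, and the stub handlers are assumptions made for illustration, not the project's actual definitions.

```python
# Tool declarations in name / description / input_schema form.
# Parameter names and descriptions are illustrative assumptions.
TOOLS = [
    {
        "name": "create_ticket",
        "description": "Create a new ticket from the spoken request.",
        "input_schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
    {
        "name": "dispatch_ticket",
        "description": "Dispatch an existing ticket to an executor.",
        "input_schema": {
            "type": "object",
            "properties": {"ticket_id": {"type": "integer"}},
            "required": ["ticket_id"],
        },
    },
    {
        "name": "query_queue",
        "description": "Report the current state of the ticket queue.",
        "input_schema": {"type": "object", "properties": {}},
    },
]


def route_tool_call(name, tool_input, handlers):
    """Execute the handler for the tool the model selected."""
    if name not in handlers:
        raise ValueError(f"model requested unknown tool: {name}")
    return handlers[name](**tool_input)


# Stub handlers standing in for the real ticket operations.
HANDLERS = {
    "create_ticket": lambda title: f"created: {title}",
    "dispatch_ticket": lambda ticket_id: f"dispatched: {ticket_id}",
    "query_queue": lambda: "queue: empty",
}
```

The agent code never inspects the wording of the utterance; it only sees the tool name and arguments the model chose, so adding a phrasing requires no code change.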
Design decision: existing pipeline, not voice-specific execution
The voice agent creates tickets in the same tickets.db as every
other intake path. There is no "voice branch" of the pipeline. Once a ticket
exists in status=ready, the rest of the system is identical
regardless of whether it was created by voice, by a typed /ticket
command, or by an automated backlog miner.
This is the correct pattern for adding a new intake channel to an existing system: the new channel terminates into the existing abstraction (the ticket FSM) rather than creating a parallel execution path. Parallel paths accumulate technical debt and create divergent behavior under edge cases.
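To make the pattern concrete, here is a minimal sketch in which every intake channel calls one create function and one FSM guards status changes. The table layout, the source column, and the states beyond ready and in_progress are assumptions for illustration; the real tickets.db schema is not described here.

```python
import sqlite3

# Allowed status transitions. Only `ready` and `in_progress` appear in the
# text; the remaining states are illustrative assumptions.
TRANSITIONS = {
    "ready": {"in_progress"},
    "in_progress": {"verifying"},
    "verifying": {"merged", "ready"},
    "merged": set(),
}


def open_db(path=":memory:"):
    """Open the ticket store and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tickets ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, "
        "source TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def create_ticket(conn, title, source):
    """Single entry point for every intake channel; `source` is metadata only."""
    cur = conn.execute(
        "INSERT INTO tickets (title, source, status) VALUES (?, ?, 'ready')",
        (title, source),
    )
    conn.commit()
    return cur.lastrowid


def transition(conn, ticket_id, new_status):
    """Move a ticket to `new_status` only if the FSM allows it."""
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()
    if row is None:
        raise KeyError(ticket_id)
    if new_status not in TRANSITIONS[row[0]]:
        raise ValueError(f"illegal transition {row[0]} -> {new_status}")
    conn.execute(
        "UPDATE tickets SET status = ? WHERE id = ?", (new_status, ticket_id)
    )
    conn.commit()
```

A voice-created ticket and a typed one differ only in the source value; nothing downstream of create_ticket branches on it.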
What the demo shows
The 60-second demo follows this sequence:
- Speak intent (10 sec): "Create a ticket for Sandbox to write the concept density doctrine validation pass."
- Watch transcription + tool call (5 sec): LiveKit playground shows real-time STT output. The agent terminal shows the LLM routing to create_ticket and the subprocess call to scripts/tickets.py create.
- Ticket appears in queue (3 sec): The ticket watcher panel refreshes. New row in status=ready.
- Speak "dispatch it" (5 sec): Agent calls dispatch_ticket. Status changes to in_progress.
- Zoom out (10 sec): What just happened enters a production pipeline: FSM-gated, worktree-isolated, verifier-reviewed, pre-merge-gate enforced. All from a 10-second voice command.
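The subprocess call in the second step might look roughly like this. It is a sketch under assumptions: the --title flag and both function names are hypothetical, since the actual command-line interface of scripts/tickets.py is not shown.

```python
import subprocess
import sys


def build_create_command(title, script="scripts/tickets.py"):
    """Build the argv for `tickets.py create`.

    The `--title` flag is a hypothetical argument; the script's real CLI
    is not described in the text.
    """
    return [sys.executable, script, "create", "--title", title]


def create_ticket_via_cli(title):
    """Run the ticket CLI and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(
        build_create_command(title),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the transcript as a list element rather than interpolating it into a shell string means quotes or punctuation in the spoken text cannot alter the command.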
Latency breakdown
- Speech-to-text: ~300–400ms for a 10-word utterance. Deepgram Nova-2 streams word-by-word, so the agent doesn't wait for end-of-utterance to start processing.
- LLM routing: ~700–900ms for the function-call decision. Haiku was chosen specifically for speed: it doesn't need to reason about the task, only route it. This is the largest single stage, roughly half of total latency.
- Ticket creation: ~50–80ms. Python subprocess call to tickets.py create with an SQLite write. Deterministic and fast.
- TTS confirmation: ~200ms for macOS say. Cartesia cloud TTS takes longer but has significantly better voice quality.
Total end-to-end: ~1.5–2s from end-of-utterance to ticket confirmed. The LLM call dominates the latency budget; swap in a faster model or local inference to push below 1s.
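The budget can be checked with a few lines of arithmetic. The ranges below are the per-stage figures quoted above; the stage labels are assumptions inferred from each description.

```python
# Per-stage latency ranges in milliseconds, taken from the figures above.
# The stage labels are assumptions inferred from each description.
STAGES_MS = {
    "speech_to_text": (300, 400),
    "llm_routing": (700, 900),
    "ticket_create": (50, 80),
    "tts_confirm": (200, 200),
}


def total_range(stages):
    """Sum the low and high bounds across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def share_of_total(stages, name):
    """Fraction of the summed latency spent in one stage (low, high case)."""
    low, high = total_range(stages)
    return stages[name][0] / low, stages[name][1] / high


low, high = total_range(STAGES_MS)
print(f"sum of stages: {low}-{high} ms")  # sum of stages: 1250-1580 ms
```

The itemized stages sum to 1250–1580 ms, with the LLM call as the largest contributor. The gap to the quoted ~1.5–2s total presumably reflects overhead that is not itemized, though that reading is an assumption.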
What comes next
This spike established the voice-to-dispatch channel. The production roadmap has three follow-on phases:
Phase 2: Always-on routing
Replace the "click to speak" trigger with a wake-word detection loop so the agent is always listening during work sessions. The architecture is ready — this is a change to the LiveKit room configuration and a wake-word model integration.
Phase 3: Telegram voice notes
Telegram's voice notes already arrive as OGG files in the inbox. The intake pipeline can route those directly to the same transcription layer, extending voice dispatch to mobile without a separate app. The phone becomes the OS microphone.
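A sketch of how the inbox routing could work, assuming the voice notes land in a local directory. The transcribe and handle_utterance callables are hypothetical stand-ins for the existing transcription layer and the agent's routing step; marking notes as processed is omitted.

```python
from pathlib import Path


def pending_voice_notes(inbox):
    """Return OGG files in the inbox, oldest first by modification time."""
    files = [p for p in Path(inbox).iterdir() if p.suffix.lower() == ".ogg"]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def route_voice_notes(inbox, transcribe, handle_utterance):
    """Feed each voice note through the same transcription and routing path.

    `transcribe` and `handle_utterance` are hypothetical callables standing in
    for the STT layer and the voice agent's routing step.
    """
    results = []
    for note in pending_voice_notes(inbox):
        text = transcribe(note)
        results.append(handle_utterance(text))
    return results
```

The only Telegram-specific part is where the files come from; after transcription, a voice note is handled exactly like a live utterance.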
Phase 4: Bidirectional briefing
The agent currently confirms actions. The next step is proactive briefing: the OS speaks the morning report — "You have 4 tickets ready, 2 blocked, 1 pending Sean approval" — as an audio brief. The same agent infrastructure supports both intake and output.
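Generating the spoken brief could be as simple as the sketch below. The function name and the idea of passing counts as a label-to-number mapping are assumptions; the sentence format follows the example quoted above.

```python
def morning_brief(counts):
    """Render ticket counts as one sentence suitable for TTS.

    `counts` maps a spoken label to a number; zero-count labels are skipped.
    """
    items = [(label, n) for label, n in counts.items() if n > 0]
    if not items:
        return "You have no open tickets."
    first_label, first_n = items[0]
    noun = "ticket" if first_n == 1 else "tickets"
    # The noun is attached to the first item only, matching the example.
    phrases = [f"{first_n} {noun} {first_label}"]
    phrases += [f"{n} {label}" for label, n in items[1:]]
    return "You have " + ", ".join(phrases) + "."


print(morning_brief({"ready": 4, "blocked": 2, "pending Sean approval": 1}))
# You have 4 tickets ready, 2 blocked, 1 pending Sean approval.
```

In practice the counts would come from a query over the ticket store, and the resulting string would go to whichever TTS backend is active.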