Case Study

Voice-Native OS

Close the loop: speak intent into a microphone, watch a multi-agent pipeline create a ticket, dispatch it to an isolated executor, verify the output, and merge to main — without touching a keyboard.

System Master OS — Voice Layer

Shipped June 2, 2026

Latency ~1.5s end-to-end (STT → ticket created)

Demo video coming — see the demo script for the exact recording plan.

The gap this closes

The Master OS had a complete agentic pipeline: tickets, an FSM, isolated git worktrees, subagent executors, verifier subagents, a pre-merge gate, and a fast-forward (FF) merge to main. The one missing piece: every interaction required a keyboard and a terminal window.

The voice layer closes that gap. The goal was not "voice assistant as novelty" — it was voice as a first-class dispatch channel that routes into the same production pipeline as a typed /start command. No special-case paths, no toy demo environment.

Architecture

Microphone / Browser → LiveKit room (ws://localhost:7880) → LiveKit Agents Worker (scripts/voice/livekit_agent.py)

Voice pipeline: STT (Deepgram Nova-2) → LLM (Claude Haiku, function-calling) → TTS (macOS say)

Dispatch pipeline: scripts/tickets.py create → tickets.db (FSM) → Ticket executor → merge

Design decision: self-hosted, zero cloud audio

All audio processing happens on the Mac running the OS. LiveKit is self-hosted in Docker. No audio bytes leave the machine. This satisfies the operator-lens-in-house doctrine: operational audio of a running AI system is treated the same as trading signals — it does not leave the silo.

The practical cost: near-zero. LiveKit's Docker image is Apache-2.0. Deepgram free tier covers 12,000 minutes per month (far beyond personal use). TTS falls back to the built-in macOS say command at zero marginal cost.

Design decision: LLM function-calling, not intent parsing

The agent does not use a custom NLP parser or keyword matching to route voice commands. It uses Claude Haiku's native function-calling: the system prompt declares three tools — create_ticket, dispatch_ticket, and query_queue — and the model routes each utterance to the right tool.

This means the routing inherits the model's language understanding out of the box. "File a ticket," "create a new task," and "add this to the queue" all route to create_ticket without explicit pattern matching. The model handles linguistic variation; the function call handles execution.
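A minimal sketch of what the three tool declarations and the local dispatch step could look like. The tool names come from the text above; the parameter names (`title`, `ticket_id`), the descriptions, and the stub handlers are assumptions for illustration, not the project's actual schema. The declarations use the `name` / `description` / `input_schema` shape that Anthropic's tool-use API expects, but no API call is made here.

```python
from typing import Any, Callable

# Tool declarations in JSON-schema form. Names are from the case study;
# every parameter and description below is a hypothetical placeholder.
TOOLS: list[dict[str, Any]] = [
    {
        "name": "create_ticket",
        "description": "File a new ticket in the queue.",
        "input_schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
    {
        "name": "dispatch_ticket",
        "description": "Dispatch an existing ticket to an executor.",
        "input_schema": {
            "type": "object",
            "properties": {"ticket_id": {"type": "integer"}},
            "required": ["ticket_id"],
        },
    },
    {
        "name": "query_queue",
        "description": "Summarize the current ticket queue.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

Handler = Callable[..., str]


def handle_tool_call(name: str, args: dict[str, Any], handlers: dict[str, Handler]) -> str:
    """Execute the tool the model selected; reject anything undeclared."""
    if name not in handlers:
        raise ValueError(f"unknown tool: {name}")
    return handlers[name](**args)


# Stub handlers standing in for the real ticket operations.
HANDLERS: dict[str, Handler] = {
    "create_ticket": lambda title: f"created: {title}",
    "dispatch_ticket": lambda ticket_id: f"dispatched: {ticket_id}",
    "query_queue": lambda: "queue: empty",
}

print(handle_tool_call("create_ticket", {"title": "validation pass"}, HANDLERS))
```

The model only ever chooses a tool name and arguments; the handler table is where execution happens, so an unexpected tool name fails loudly instead of running anything.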

Design decision: existing pipeline, not voice-specific execution

The voice agent creates tickets in the same tickets.db as every other intake path. There is no "voice branch" of the pipeline. Once a ticket exists in status=ready, the rest of the system is identical regardless of whether it was created by voice, by a typed /ticket command, or by an automated backlog miner.

This is the correct pattern for adding a new intake channel to an existing system: the new channel terminates into the existing abstraction (the ticket FSM) rather than creating a parallel execution path. Parallel paths accumulate technical debt and create divergent behavior under edge cases.
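The pattern can be sketched as a single creation function that every intake channel calls, plus one transition function that enforces the FSM. This is an illustrative sketch only: the table layout, the `source` column, and the transition map are assumptions (the `ready`, `in_progress`, and `blocked` statuses appear in this write-up; `done` is a placeholder), not the real tickets.db schema.

```python
import sqlite3

# Assumed subset of the ticket FSM: status -> statuses it may move to.
TRANSITIONS: dict[str, set[str]] = {
    "ready": {"in_progress"},
    "in_progress": {"done", "blocked"},
    "blocked": {"ready"},
}


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) tickets table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tickets ("
        "id INTEGER PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "source TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'ready')"
    )
    return conn


def create_ticket(conn: sqlite3.Connection, title: str, source: str) -> int:
    """Single entry point for all channels; source is recorded, never branched on."""
    cur = conn.execute(
        "INSERT INTO tickets (title, source) VALUES (?, ?)", (title, source)
    )
    conn.commit()
    return cur.lastrowid


def transition(conn: sqlite3.Connection, ticket_id: int, new_status: str) -> None:
    """Move a ticket to new_status only if the FSM allows it."""
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()
    if row is None:
        raise KeyError(ticket_id)
    if new_status not in TRANSITIONS.get(row[0], set()):
        raise ValueError(f"illegal transition: {row[0]} -> {new_status}")
    conn.execute("UPDATE tickets SET status = ? WHERE id = ?", (new_status, ticket_id))
    conn.commit()


conn = open_db()
voice_id = create_ticket(conn, "Doctrine validation pass", source="voice")
typed_id = create_ticket(conn, "Refactor watcher panel", source="typed")
transition(conn, voice_id, "in_progress")  # same call path for either ticket
```

Because `transition` never reads `source`, a voice-created ticket cannot behave differently from a typed one once it is in the table.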

What the demo shows

The 60-second demo follows this sequence:

Speak intent (10 sec): "Create a ticket for Sandbox to write the concept density doctrine validation pass."
Watch transcription + tool call (5 sec): LiveKit playground shows real-time STT output. The agent terminal shows the LLM routing to create_ticket and the subprocess call to scripts/tickets.py create.
Ticket appears in queue (3 sec): The ticket watcher panel refreshes. New row in status=ready.
Speak "dispatch it" (5 sec): Agent calls dispatch_ticket. Status changes to in_progress.
Zoom out (10 sec): The ticket that was just spoken into existence is now in a production pipeline: FSM-gated, worktree-isolated, verifier-reviewed, and pre-merge-gate enforced. All from a 10-second voice command.

Latency breakdown

STT (Deepgram)

~300–400ms for a 10-word utterance. Deepgram Nova-2 streams word-by-word — the agent doesn't wait for end-of-utterance to start processing.

LLM routing (Haiku)

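~700–900ms for the function-call decision. Haiku was chosen specifically for speed: it doesn't need to reason about the task, only route it. At 700–900ms out of roughly 1.25–1.6s of summed stage latency, this is a little over half of the total and the largest single contributor.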

Ticket creation

~50–80ms. Python subprocess call to tickets.py create with an SQLite write. Deterministic and fast.
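A hedged sketch of how the agent might shell out to the ticket script and time the call. The `create` subcommand is named in the text; the `--title` flag and the helper names are hypothetical, since the script's real arguments are not shown here.

```python
import subprocess
import sys
import time


def build_create_command(title: str, script: str = "scripts/tickets.py") -> list[str]:
    """Build the argv for ticket creation. '--title' is an assumed flag name."""
    # Passing a list (no shell) means a spoken title with quotes or
    # semicolons is treated as data, not as shell syntax.
    return [sys.executable, script, "create", "--title", title]


def run_timed(cmd: list[str]) -> tuple[subprocess.CompletedProcess, float]:
    """Run a command and return its result plus wall-clock time in ms."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


print(build_create_command("write the doctrine validation pass"))
```

Wrapping the call in `run_timed` is one way to check a figure like the one above against a real machine, since interpreter startup often accounts for much of a short subprocess call.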

TTS confirmation

~200ms for macOS say. Cartesia cloud TTS takes longer but offers significantly better voice quality.

Total end-to-end: ~1.5–2s from end-of-utterance to ticket confirmed. The LLM call dominates the latency budget; swapping in a faster model or local inference is the most direct way to push below 1s.
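The budget can be checked with a few lines of arithmetic over the per-stage figures quoted in this section. The sketch assumes the stages run strictly one after another, which is a simplification: streaming STT can overlap with the utterance itself. Any gap between the summed stages and the quoted end-to-end total would be overhead not itemized here.

```python
# Per-stage latency ranges in milliseconds (low, high), as quoted above.
STAGES_MS: dict[str, tuple[int, int]] = {
    "stt": (300, 400),
    "llm_routing": (700, 900),
    "ticket_creation": (50, 80),
    "tts": (200, 200),
}


def total_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every stage (sequential assumption)."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def share(stages: dict[str, tuple[int, int]], name: str) -> tuple[float, float]:
    """Fraction of the summed budget taken by one stage, at each end."""
    low_total, high_total = total_range(stages)
    lo, hi = stages[name]
    return lo / low_total, hi / high_total


low_ms, high_ms = total_range(STAGES_MS)
llm_low, llm_high = share(STAGES_MS, "llm_routing")
rest = total_range({k: v for k, v in STAGES_MS.items() if k != "llm_routing"})

print(f"summed stages: {low_ms}-{high_ms} ms")          # 1250-1580 ms
print(f"LLM share: {llm_low:.0%}-{llm_high:.0%}")       # 56%-57%
print(f"everything except the LLM: {rest[0]}-{rest[1]} ms")  # 550-680 ms
```

With the non-LLM stages summing to 550–680ms, a sub-1s total leaves roughly 320–450ms for routing under these assumptions.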

What comes next

This spike established the voice-to-dispatch channel. The production roadmap has three follow-on phases:

Phase 2: Always-on routing

Replace the "click to speak" trigger with a wake-word detection loop so the agent is always listening during work sessions. The architecture is ready — this is a change to the LiveKit room configuration and a wake-word model integration.

Phase 3: Telegram voice notes

Telegram's voice notes already arrive as OGG files in the inbox. The intake pipeline can route those directly to the same transcription layer, extending voice dispatch to mobile without a separate app. The phone becomes the OS microphone.
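One way this routing could look, as a sketch under stated assumptions: the inbox is a plain directory, and `transcribe` and `handle_utterance` are hypothetical callables standing in for the existing transcription layer and the agent's routing step. None of these names come from the project itself.

```python
from pathlib import Path
from typing import Callable


def pending_voice_notes(inbox: Path) -> list[Path]:
    """Return OGG files in the inbox in a stable (sorted) order."""
    return sorted(p for p in inbox.glob("*.ogg") if p.is_file())


def route_voice_notes(
    inbox: Path,
    transcribe: Callable[[Path], str],
    handle_utterance: Callable[[str], str],
) -> list[str]:
    """Transcribe each voice note and pass the text to the same routing
    step a live utterance would reach."""
    results = []
    for note in pending_voice_notes(inbox):
        text = transcribe(note)
        results.append(handle_utterance(text))
    return results
```

Keeping the transcriber and the router as injected callables means the Telegram path adds a file scan and nothing else; the text reaches the same entry point as live speech.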

Phase 4: Bidirectional briefing

The agent currently confirms actions. The next step is proactive briefing: the OS speaks the morning report — "You have 4 tickets ready, 2 blocked, 1 pending Sean approval" — as an audio brief. The same agent infrastructure supports both intake and output.
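A small sketch of the output side: turning status counts into the spoken sentence and handing it to macOS `say`. The count labels mirror the example sentence above; the function names and the idea of passing counts as a dict are assumptions for illustration.

```python
import shutil
import subprocess


def format_brief(counts: dict[str, int]) -> str:
    """Build the briefing sentence from label -> count, skipping zero counts."""
    parts = [f"{n} {label}" for label, n in counts.items() if n > 0]
    if not parts:
        return "You have no open tickets."
    return "You have " + ", ".join(parts) + "."


def speak(text: str) -> bool:
    """Speak text with macOS `say` if it is available; report whether it ran."""
    if shutil.which("say") is None:
        return False  # not on macOS, or `say` is not on PATH
    subprocess.run(["say", text], check=False)
    return True


brief = format_brief({"tickets ready": 4, "blocked": 2, "pending Sean approval": 1})
print(brief)
```

In a real briefing the counts would come from a query against the ticket queue; here they are hard-coded to reproduce the example sentence.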