31.【Voice AI Design】Decomposing Voice AI with an FSM

Speaking, Silence, and Interruptions Were All States

tags: ["Voice Generation AI", "FSM", "Design Philosophy", "Control Structure", "Generative AI"]

🧩 Decomposing Voice AI with an FSM

— 🔄 Speaking, Silence, and Interruptions Were All States

In the previous article (30), we reached the following conclusion:

🎧 Voice generation AI is not a generation problem; it is a control problem

In this article, we go one level deeper and
fully decompose voice AI as an FSM (Finite State Machine).

🎯 Purpose of This Article

To understand voice AI behavior as explicit states rather than vague impressions
To expose failure points as missing state transitions
To establish a design foundation for the next article (where sound is actually produced)

👉 No sound yet.
Before making noise, we build a structure that does not break.

❌ Common Failure Pattern (No FSM)

Many voice AI systems rely on implicit, loosely defined states such as:

It’s talking
It’s probably listening
It’s sort of waiting

This is equivalent to not defining states at all.

As a result, three failures become all but inevitable:

Infinite speaking
Silent freezing
Immediate failure on interruption

🧠 Voice AI Is a Collection of States

All voice AI behavior can be reduced to these questions:

What is it doing right now?
When is it allowed to move on?
What event should interrupt it?

👉 Answering these questions is the role of an FSM.

📦 Minimal FSM Structure (Start Here)

A minimal FSM for voice AI requires only the following.

🧩 States

💤 Idle
Waiting state (doing nothing)
👂 Listening
Receiving audio input
🧠 Thinking
Preparing utterance content (LLM, etc.)
🔊 Speaking
Outputting audio
✋ Interrupted
An interruption has occurred
🚨 Error / Fallback
Exception and failure handling
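
As a minimal sketch, the six states above can be written as a Python enumeration. The member names are illustrative choices, not taken from the author's demo or repository.

```python
from enum import Enum, auto


class State(Enum):
    """The six states listed above (names are illustrative)."""
    IDLE = auto()         # waiting, doing nothing
    LISTENING = auto()    # receiving audio input
    THINKING = auto()     # preparing utterance content (LLM, etc.)
    SPEAKING = auto()     # outputting audio
    INTERRUPTED = auto()  # an interruption has occurred
    ERROR = auto()        # Error / Fallback: exception and failure handling


# The machine is in exactly one state at a time, starting from Idle.
current = State.IDLE
```

Using an enumeration rather than free-form strings or boolean flags means a vague state like "sort of waiting" simply cannot be represented.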

🔄 Overall State Transition Flow

💤 Idle
  │ (Audio input starts)
  ▼
👂 Listening
  │ (Input ends)
  ▼
🧠 Thinking
  │ (Generation complete)
  ▼
🔊 Speaking
  │ (Utterance finished)
  ▼
💤 Idle

🔊 Speaking
  │ (Human starts speaking)
  ▼
✋ Interrupted
  │ (Decision made)
  ├──▶ 👂 Listening
  └──▶ 💤 Idle

All states
  │ (Exception)
  ▼
🚨 Error / Fallback

👉 This covers almost all voice AI behavior.
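
The flow above can be sketched as a lookup table keyed by (state, event). This is one possible encoding, not the author's implementation: the event names are hypothetical, the single "Decision made" arrow is split into two hypothetical events (one per branch), and ignoring events that have no defined transition is an assumed policy.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


class Event(Enum):  # hypothetical names for the arrows in the diagram
    INPUT_STARTED = auto()       # audio input starts
    INPUT_ENDED = auto()         # input ends
    GENERATION_DONE = auto()     # generation complete
    UTTERANCE_FINISHED = auto()  # utterance finished
    HUMAN_SPOKE = auto()         # human starts speaking
    RESUME_LISTENING = auto()    # decision made: hear the human out
    DISMISS = auto()             # decision made: go back to waiting
    EXCEPTION = auto()           # any failure


TRANSITIONS = {
    (State.IDLE, Event.INPUT_STARTED): State.LISTENING,
    (State.LISTENING, Event.INPUT_ENDED): State.THINKING,
    (State.THINKING, Event.GENERATION_DONE): State.SPEAKING,
    (State.SPEAKING, Event.UTTERANCE_FINISHED): State.IDLE,
    (State.SPEAKING, Event.HUMAN_SPOKE): State.INTERRUPTED,
    (State.INTERRUPTED, Event.RESUME_LISTENING): State.LISTENING,
    (State.INTERRUPTED, Event.DISMISS): State.IDLE,
}


def step(state: State, event: Event) -> State:
    """Return the next state for an event."""
    if event is Event.EXCEPTION:
        return State.ERROR  # "All states -> Error / Fallback"
    # ASSUMPTION: an event with no defined transition leaves the state as is.
    return TRANSITIONS.get((state, event), state)


# Usage: a normal turn that the human interrupts mid-utterance.
state = State.IDLE
for event in (Event.INPUT_STARTED, Event.INPUT_ENDED,
              Event.GENERATION_DONE, Event.HUMAN_SPOKE,
              Event.RESUME_LISTENING):
    state = step(state, event)
print(state.name)  # LISTENING
```

Because every allowed move is a row in one table, a missing transition shows up as a missing row rather than as a runtime freeze.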

⚠️ Important: The FSM Lives “Outside” Audio

A common misconception:

❌ The audio engine holds the state
❌ The LLM manages the conversation state

The correct model is:

🧩 The FSM stands outside both audio and the LLM

The FSM has the final say over:

When audio input is enabled
When audio output must stop
When the LLM is allowed to run
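
One way to picture "outside" is a controller that owns the state and only issues commands to the audio and LLM components. The sketch below is an assumption-heavy illustration: the command strings (`audio_in.enable`, `audio_out.stop`, and so on) are hypothetical stand-ins for real component calls, and the commands are recorded in a list instead of touching any hardware.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


class Controller:
    """Owns the state; audio and LLM components only receive commands."""

    def __init__(self):
        self.state = State.IDLE
        self.commands = []  # stand-in for calls into real components

    def enter(self, new_state):
        # Leaving Speaking for any reason: audio output must stop.
        if self.state is State.SPEAKING and new_state is not State.SPEAKING:
            self.commands.append("audio_out.stop")
        # Leaving Thinking: the LLM loses its permission to run.
        if self.state is State.THINKING and new_state is not State.THINKING:
            self.commands.append("llm.forbid")
        # Entry actions for the new state.
        if new_state is State.LISTENING:
            self.commands.append("audio_in.enable")
        elif new_state is State.THINKING:
            self.commands.append("llm.allow")
        elif new_state is State.SPEAKING:
            self.commands.append("audio_out.start")
        self.state = new_state


# Usage: one turn that ends with the human interrupting.
ctl = Controller()
for s in (State.LISTENING, State.THINKING, State.SPEAKING, State.INTERRUPTED):
    ctl.enter(s)
print(ctl.commands)
```

Neither the audio engine nor the LLM stores a conversation state of its own here; they only react to what the controller tells them.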

🤖 States Where LLMs Are Allowed — and Not Allowed

⭕ Where LLMs Are Allowed

🧠 Thinking
→ Generating utterance content only

❌ Where LLMs Must Not Be Used

🔊 Speaking (real-time control)
👂 Listening (interruption judgment)
✋ Interrupted (immediate decisions)

👉 Placing an LLM in a time-critical state guarantees failure: its response time is too long and too variable for decisions that must land immediately.
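
The rule can be enforced at the call site with a small guard. This is a sketch under assumptions: `generate` is a hypothetical callable standing in for whatever LLM client is used, and `LLMNotAllowed` is an illustrative exception name.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


class LLMNotAllowed(RuntimeError):
    """Raised when the LLM is invoked outside the Thinking state."""


def run_llm(state, generate, prompt):
    """Call the LLM only while Thinking; refuse everywhere else.

    `generate` is a hypothetical callable (prompt -> text).
    """
    if state is not State.THINKING:
        raise LLMNotAllowed(f"LLM call refused in state {state.name}")
    return generate(prompt)


# Usage with a fake generator in place of a real model.
def fake_generate(prompt):
    return "reply to: " + prompt


print(run_llm(State.THINKING, fake_generate, "hello"))  # reply to: hello
```

A call made from Speaking, Listening, or Interrupted then fails loudly during development instead of silently adding latency to a real-time path.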

🧊 Silence Is Not a Bug

Silent freezes are common, but from an FSM perspective,
they are simply undefined states.

Did Speaking finish?
Is Thinking still running?
Should it return to Idle?

👉 The transition conditions were never defined.
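
One way to give every state a defined exit is a maximum dwell time. The article does not prescribe this mechanism; it is an assumption of this sketch, and the limits below are placeholder values, not measurements.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


# ASSUMPTION: maximum time allowed in each state, in seconds (placeholders).
# Idle has no limit because waiting is its job; recovery from
# Error / Fallback is outside the scope of this sketch.
MAX_DWELL_S = {
    State.LISTENING: 15.0,
    State.THINKING: 5.0,
    State.SPEAKING: 30.0,
    State.INTERRUPTED: 0.5,
}


def check_dwell(state, entered_at, now):
    """Return Error / Fallback if the state has overstayed, else the state."""
    limit = MAX_DWELL_S.get(state)
    if limit is not None and now - entered_at > limit:
        return State.ERROR
    return state


# Usage: Thinking entered at t=0 s and still running at t=6 s.
print(check_dwell(State.THINKING, 0.0, 6.0).name)  # ERROR
```

In a running loop, `now` would typically come from `time.monotonic()`. With a limit like this, "Is Thinking still running?" always has an answer: either it finished, or it timed out into Error / Fallback.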

🧩 The Biggest Advantage of FSM-ization

Behavior becomes predictable
There is always a safe return path
Debugging becomes possible
Design can be validated before any sound is produced

None of this depends on model performance.
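
Validation before any sound can be as simple as a graph check: from every state, is there a path back to Idle? The sketch below encodes the transitions described in this article as edges. Note one assumption: the flow shown in the article has no arrow leaving Error / Fallback, so a recovery edge back to Idle is added here to satisfy the "safe return path" claim.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


EDGES = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.THINKING},
    State.THINKING: {State.SPEAKING},
    State.SPEAKING: {State.IDLE, State.INTERRUPTED},
    State.INTERRUPTED: {State.LISTENING, State.IDLE},
    State.ERROR: {State.IDLE},  # ASSUMPTION: recovery edge, not in the flow
}
# "All states -> Error / Fallback" on exception.
for s in State:
    if s is not State.ERROR:
        EDGES[s] = EDGES[s] | {State.ERROR}


def can_reach(edges, start, goal):
    """Depth-first search: is `goal` reachable from `start`?"""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node is goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


def stuck_states(edges):
    """States from which Idle can never be reached."""
    return [s for s in State if not can_reach(edges, s, State.IDLE)]


print(stuck_states(EDGES))  # []
```

Removing the assumed recovery edge makes the check report Error / Fallback as a dead end, which is exactly the kind of design gap that can be caught without producing any audio.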

🧩 FSM Visualization Demo

A simple demo is provided to visualize
the voice AI FSM and its state transitions.

▶ Demo
https://samizo-aitl.github.io/qiita-articles/demos/audio-fsm-visual/

This demo produces no sound and focuses entirely on
understanding states and transitions.

📌 Summary

🎧 Voice AI is fundamentally a state-transition machine
🧩 Any voice AI without an explicit FSM will break
🤖 LLMs are responsible only for the Thinking state
✋ Interruptions and silence are not “exceptions”
🏗 The FSM must be completed before producing sound

🔜 Next Article (32)

🔊 Connecting minimal audio output to the FSM
🎚 A minimal implementation that abandons “naturalness”
🧪 A voice AI that prioritizes not breaking

Next time, sound will actually be produced,
but only sound that obeys the FSM.

(The GitHub-hosted Markdown is the canonical source; the Qiita version is a curated extract.)