31.【Voice AI Design】Decomposing Voice AI with an FSM

Speaking, Silence, and Interruptions Were All States

tags: ["Voice Generation AI", "FSM", "Design Philosophy", "Control Structure", "Generative AI"]


In the previous article (30), we reached the following conclusion:

🎧 Voice generation AI is not a generation problem — it is a control problem

In this article, we go one level deeper and
fully decompose voice AI as an FSM (Finite State Machine).


🎯 Purpose of This Article

👉 No sound yet.
Before making noise, we build a structure that does not break.


❌ Common Failure Pattern (No FSM)

Many voice AIs implicitly assume states like:

This is equivalent to not defining states at all.

As a result:

occur inevitably.


🧠 Voice AI Is a Collection of States

All voice AI behavior can be reduced to these questions:

👉 Answering these questions is the role of an FSM.


📦 Minimal FSM Structure (Start Here)

A minimal FSM for voice AI requires only the following.

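🧩 States

💤 Idle: waiting for audio input
👂 Listening: receiving the human's speech
🧠 Thinking: generating the response
🔊 Speaking: playing the response
✋ Interrupted: the human started speaking during playback
🚨 Error / Fallback: an exception occurred in any state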


🔄 Overall State Transition Flow

💤 Idle
  │ (Audio input starts)
  ▼
👂 Listening
  │ (Input ends)
  ▼
🧠 Thinking
  │ (Generation complete)
  ▼
🔊 Speaking
  │ (Utterance finished)
  ▼
💤 Idle

🔊 Speaking
  │ (Human starts speaking)
  ▼
✋ Interrupted
  │ (Decision made)
  ├──▶ 👂 Listening
  └──▶ 💤 Idle

All states
  │ (Exception)
  ▼
🚨 Error / Fallback

👉 This covers almost all voice AI behavior.
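
The flow above can be written down as a plain lookup table. The following is a minimal sketch, not a reference implementation. The event names are hypothetical labels for the arrows in the diagram, the single "Decision made" arrow is split into two events as an assumption, and ignoring undefined (state, event) pairs is also an assumption. The diagram does not say how to leave Error / Fallback, so the sketch does not either.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()  # "Error / Fallback" in the diagram


class Event(Enum):
    # Hypothetical names, one per arrow label in the diagram.
    AUDIO_INPUT_STARTED = auto()
    INPUT_ENDED = auto()
    GENERATION_COMPLETE = auto()
    UTTERANCE_FINISHED = auto()
    HUMAN_STARTED_SPEAKING = auto()
    DECIDED_TO_LISTEN = auto()  # "Decision made" -> Listening (assumed split)
    DECIDED_TO_STOP = auto()    # "Decision made" -> Idle (assumed split)
    EXCEPTION = auto()


# One entry per arrow in the diagram.
TRANSITIONS = {
    (State.IDLE, Event.AUDIO_INPUT_STARTED): State.LISTENING,
    (State.LISTENING, Event.INPUT_ENDED): State.THINKING,
    (State.THINKING, Event.GENERATION_COMPLETE): State.SPEAKING,
    (State.SPEAKING, Event.UTTERANCE_FINISHED): State.IDLE,
    (State.SPEAKING, Event.HUMAN_STARTED_SPEAKING): State.INTERRUPTED,
    (State.INTERRUPTED, Event.DECIDED_TO_LISTEN): State.LISTENING,
    (State.INTERRUPTED, Event.DECIDED_TO_STOP): State.IDLE,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state; undefined pairs leave the state unchanged."""
    if event is Event.EXCEPTION:
        return State.ERROR  # "All states -> Error / Fallback"
    return TRANSITIONS.get((state, event), state)
```

Because every legal transition is a row in `TRANSITIONS`, a missing behavior shows up as a missing row rather than as an implicit assumption buried in audio code.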


⚠️ Important: The FSM Lives “Outside” Audio

A common misconception:

The correct model is:

🧩 The FSM stands outside both audio and the LLM
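
One way to picture "outside" is a controller that owns the state and treats the LLM and the audio output as plain callables it decides when to invoke. The sketch below assumes hypothetical `generate` and `play` wrappers and runs synchronously for brevity. A real system would likely be asynchronous, and neither the class nor its method names come from the original article.

```python
from typing import Callable


class VoiceController:
    """Owns the state; the LLM and audio never change state themselves."""

    def __init__(self, generate: Callable[[str], str], play: Callable[[str], None]):
        self.state = "idle"
        self._generate = generate  # hypothetical LLM wrapper
        self._play = play          # hypothetical audio output wrapper

    def on_audio_input_started(self) -> None:
        if self.state == "idle":
            self.state = "listening"

    def on_input_ended(self, transcript: str) -> None:
        if self.state != "listening":
            return  # event ignored: no transition is defined from here
        self.state = "thinking"
        reply = self._generate(transcript)  # LLM is invoked only in Thinking
        self.state = "speaking"
        self._play(reply)                   # audio is invoked only in Speaking

    def on_utterance_finished(self) -> None:
        if self.state == "speaking":
            self.state = "idle"
```

A usage sketch with stand-in callables:

```python
played = []
controller = VoiceController(generate=str.upper, play=played.append)
controller.on_audio_input_started()
controller.on_input_ended("hello")
controller.on_utterance_finished()
```

The dependency points one way: the controller calls the components, and the components only report events back.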

The FSM strictly controls:


🤖 States Where LLMs Are Allowed — and Not Allowed

⭕ Where LLMs Are Allowed

❌ Where LLMs Must Not Be Used

👉 Placing an LLM in time-critical states guarantees failure.
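
This rule can be enforced in code rather than by convention. The sketch below guards every LLM call with a state check. Which states count as "allowed" is an assumption of this sketch (only Thinking, since that is where generation happens in the transition flow), and `call_llm`, `LLM_ALLOWED`, and `LLMNotAllowed` are hypothetical names.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


# Assumption: generation is permitted only while Thinking.
LLM_ALLOWED = frozenset({State.THINKING})


class LLMNotAllowed(RuntimeError):
    """Raised when code tries to call the LLM from a state that forbids it."""


def call_llm(state: State, generate: Callable[[str], str], prompt: str) -> str:
    """Invoke `generate` only if the current state permits LLM use."""
    if state not in LLM_ALLOWED:
        raise LLMNotAllowed(f"LLM call attempted in state {state.name}")
    return generate(prompt)
```

With a guard like this, an LLM call that slips into interruption handling fails loudly during development instead of adding latency in production.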


🧊 Silence Is Not a Bug

Silent freezes are common, but from an FSM perspective,
they are simply undefined states.

👉 The transition conditions were never defined.
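
One possible way to define those conditions is to give every state that could stall an explicit time limit and an explicit destination when the limit expires. The sketch below is an illustration under stated assumptions: the limits in seconds, the destinations, and the recovery path from Error / Fallback back to Idle are all hypothetical and do not come from the original article.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    ERROR = auto()


# Hypothetical limits in seconds; None means the state may last indefinitely.
TIMEOUTS = {
    State.IDLE: None,
    State.LISTENING: 10.0,
    State.THINKING: 5.0,
    State.SPEAKING: 30.0,
    State.INTERRUPTED: 1.0,
    State.ERROR: 3.0,
}

# Hypothetical destination for each state when its limit expires.
ON_TIMEOUT = {
    State.LISTENING: State.IDLE,
    State.THINKING: State.ERROR,
    State.SPEAKING: State.ERROR,
    State.INTERRUPTED: State.IDLE,
    State.ERROR: State.IDLE,
}


def check_timeout(state: State, entered_at: float, now: float) -> State:
    """Return the timeout destination if the state has overstayed, else the state."""
    limit = TIMEOUTS[state]
    if limit is not None and now - entered_at >= limit:
        return ON_TIMEOUT[state]
    return state
```

Calling `check_timeout` on a regular tick, with `entered_at` recorded at each transition, means a stalled Thinking state becomes a defined move to Error / Fallback instead of open-ended silence.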


🧩 The Biggest Advantage of FSM-ization

This has nothing to do with model performance.


🧩 FSM Visualization Demo

A simple demo is provided to visualize
the voice AI FSM and its state transitions.

▶ Demo
https://samizo-aitl.github.io/qiita-articles/demos/audio-fsm-visual/

This demo produces no sound and focuses entirely on
understanding states and transitions.


📌 Summary


🔜 Next Article (32)

Next time, sound will actually be produced,
but only sound that obeys the FSM.


(The GitHub-hosted Markdown is the canonical source; the Qiita version is a curated extract.)