33.【Voice AI Design】Interruptions Break Voice AI

What Really Happens When a Human Speaks While the AI Is Speaking

tags: [“Voice Generation AI”, “FSM”, “Interruption”, “Design Philosophy”, “Control Structure”]

✋ Interruptions Break Voice AI

— 🔥 What Happens When a Human Speaks While the AI Is Speaking

This is the moment where voice AI fails most dramatically.

🔊 The AI is speaking
👤 A human starts speaking

Most voice AI demos
deliberately avoid this moment.

The reason is simple:

They were never designed for it.

🎯 Purpose of This Article

To show that interruptions are not exceptions
To handle interruption during speech as an extension of the FSM
To define the minimum conditions for a voice AI that does not break

👉 We still don’t need conversational polish.
First, it must not break.

❌ Common Failure Patterns

① 🔊 Keeps Talking

Ignores human speech
Microphone input and speaker output collide

② 🧊 Silent Freeze

TTS stops
System state becomes unknown

③ 🤯 Dual State

Is it Speaking or Listening?
The FSM collapses

👉 All of these are caused by undefined state transitions.

🧠 Interruption Is Not an “Error”

A critical perspective:

✋ Interruption is normal behavior

In human conversation:

Talking over each other
Backchannel responses
Self-corrections

happen constantly.

Designing voice AI under the assumption
that it will never be interrupted
guarantees failure.

🧩 Adding an Interruption State to the FSM

We extend the FSM from article (32)
by adding a dedicated interruption state.

📦 Added States

✋ Interrupted
External speech detected during Speaking
👂 Listening
Accepting input after interruption

🔄 FSM with Interruption (Full View)

💤 Idle
  │
  ▼
🔊 Speaking
  │ (Human speech detected)
  ▼
✋ Interrupted
  │
  ├──▶ 👂 Listening   (Listen to the human)
  └──▶ 💤 Idle        (Abort and stop)

🔑 Key points:

Speaking → Interrupted is immediate
The Thinking state is NOT involved
The FSM forcibly takes control
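
The diagram above can be sketched as a small transition table. This is a minimal illustration, not the implementation used in this series: the event names (START_SPEAKING, SPEECH_DETECTED, LISTEN, ABORT) and the VoiceFSM class are hypothetical, and only the arrows drawn in the diagram are defined.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    LISTENING = auto()


class Event(Enum):
    # Hypothetical event names; the article only names the states.
    START_SPEAKING = auto()
    SPEECH_DETECTED = auto()
    LISTEN = auto()
    ABORT = auto()


# Only the arrows drawn in the diagram are defined here.
TRANSITIONS = {
    (State.IDLE, Event.START_SPEAKING): State.SPEAKING,
    (State.SPEAKING, Event.SPEECH_DETECTED): State.INTERRUPTED,
    (State.INTERRUPTED, Event.LISTEN): State.LISTENING,
    (State.INTERRUPTED, Event.ABORT): State.IDLE,
}


class VoiceFSM:
    def __init__(self):
        self.state = State.IDLE

    def fire(self, event):
        """Apply one event; an undefined transition is rejected, never guessed."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"undefined transition: {self.state.name} on {event.name}"
            )
        self.state = TRANSITIONS[key]
        return self.state
```

Because there is exactly one state variable and every allowed transition is listed, the system cannot be in Speaking and Listening at once, and an unexpected event fails loudly instead of leaving the state unknown.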

⏱ Do NOT Be “Smart” About Interruption Detection

A common mistake here:

❌ Using an LLM to judge interruption
❌ Trying to understand the content before stopping

The correct approach:

🎚 Decide immediately using physical signals

Examples:

Volume threshold
Microphone input level
VAD (Voice Activity Detection)

👉 Meaning comes later.
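
As a rough illustration of deciding from a physical signal, the sketch below checks the RMS level of one audio frame against a threshold. The frame format (16-bit little-endian mono PCM), the function names, and the threshold of 0.05 are all assumptions made for this example; a real system would tune the threshold or swap in a dedicated VAD.

```python
import math
import struct


def rms_level(frame: bytes) -> float:
    """RMS of 16-bit little-endian mono PCM samples, scaled to 0.0..1.0."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


def speech_detected(frame: bytes, threshold: float = 0.05) -> bool:
    """Decide from signal level alone; no transcription, no LLM call."""
    # The threshold is an assumed value for illustration only.
    return rms_level(frame) >= threshold
```

The check touches nothing but sample values, so it can run on every frame while the AI is speaking and return before any content is understood.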

🔊 What Must Happen When Speaking Is Interrupted

The moment an interruption is detected:

🔇 Immediately stop TTS
🧩 Transition FSM to Interrupted
🎤 Separate output from input
👂 Move to Listening

Changing this order causes failure.
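
The ordering above can be sketched as a single handler that runs the four steps in sequence. Everything here is hypothetical: stop_tts and detach_output stand in for whatever the audio layer actually provides, and the article does not specify how output is separated from input, so that step is modelled as an opaque callback.

```python
class InterruptController:
    """Runs the four interruption steps in a fixed order (illustrative sketch)."""

    def __init__(self, stop_tts, detach_output):
        self.state = "Speaking"
        self.stop_tts = stop_tts            # hypothetical: halts playback now
        self.detach_output = detach_output  # hypothetical: isolates output from input
        self.log = []                       # record of what happened, for debugging

    def on_speech_detected(self):
        if self.state != "Speaking":
            return  # interruption is only defined while Speaking

        self.stop_tts()                     # 1. stop TTS immediately
        self.log.append("tts_stopped")

        self.state = "Interrupted"          # 2. FSM takes control
        self.log.append("Interrupted")

        self.detach_output()                # 3. separate output from input
        self.log.append("output_detached")

        self.state = "Listening"            # 4. accept the human's input
        self.log.append("Listening")
```

Keeping all four steps in one method makes the order explicit and hard to change by accident, and the guard at the top means a second detection event while already Listening does nothing.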

🚫 What NOT to Do During Interruption Handling

❌ Wait until the AI finishes speaking
❌ Try to stop “naturally”
❌ Preserve conversational flow

👉 That is performance, not control.

🧠 The Biggest Benefit of an Interruption FSM

No overlapping speech
No silent freezes
The current state is always known
Debugging becomes possible

🎯 This alone brings the system
up to the minimum acceptable level for real-world voice AI.

✋ Interruption Demo

A demo is provided to observe
interruption behavior under FSM control.

▶ Demo
https://samizo-aitl.github.io/qiita-articles/demos/audio-fsm-interrupt/

The demo shows:

Immediate audio stop
Speaking → Interrupted transition
FSM-driven control

📌 Summary

✋ Interruption is normal behavior
🔊 Speaking must stop immediately
🧩 The FSM holds authority
🎚 Detection is based on physical signals
🤖 LLMs must NOT decide interruptions

🔜 Next Article (34)

🤖 Safely reconnecting the LLM
Why it belongs only in the Thinking state
A minimal and safe Voice × LLM architecture

Next time,
the LLM finally returns — safely.

(The GitHub-hosted Markdown is the canonical source; the Qiita version is a curated extract.)