34.【Voice AI Design】Safely Connecting an LLM
Why It Belongs Only in the Thinking State
tags: ["Voice Generation AI", "LLM", "FSM", "Control Structure", "Generative AI Design"]
Through the previous articles, the voice AI has already been transformed into a stable system:
- 🧩 Fully managed by an FSM
- 🔊 Speaking is under strict control
- ✋ Interruptions no longer cause collapse
In other words, the skeleton is complete.
Only now is it safe to
bring the LLM back into the system.
🎯 Purpose of This Article
- Reconfirm where placing an LLM causes failure
- Determine where placing an LLM does not break the system
- Present a minimal and safe Voice × LLM coupling structure
👉 We are still not building a “chat AI.”
First, we decide on a placement that never causes accidents.
❌ Places Where an LLM Must NOT Be Used (Reconfirmed)
🔊 Speaking
- Strict real-time constraints
- Immediate stopping required
- Streaming control involved
👉 LLMs are too slow
👂 Listening
- Interruption decisions must be immediate
- Physical signals are sufficient
👉 Semantic understanding is unnecessary
✋ Interrupted
- Immediate transitions are critical
- Any delay leads to failure
👉 There is no room for an LLM
⭕ The Only Place Where an LLM Is Allowed
🧠 Thinking
The Thinking state:
- Has relaxed time constraints
- Produces no sound
- Is not interruption-sensitive
- Returns results as text
👉 A perfect match for LLM characteristics.
🧩 The Correct FSM × LLM Structure
🧩 FSM
├─ 👂 Listening
├─ 🧠 Thinking ──▶ 🤖 LLM (utterance content generation)
├─ 🔊 Speaking
└─ ✋ Interrupted
Key principles:
- FSM → LLM is one-way
- The LLM never changes state
- The LLM never issues commands
👉 The LLM is a component, not a controller.
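To make this boundary concrete, here is a minimal Python sketch of the structure above. The state names mirror the diagram; `llm_generate` is a hypothetical stand-in for an actual LLM client, and the stub below only illustrates the contract (text in, text out):

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()

def llm_generate(user_text: str) -> str:
    """Hypothetical LLM client: text in, text out, nothing else.
    It never sees state names and never returns commands."""
    return f"You said: {user_text}"          # stand-in for a real model call

def next_state(state: State, user_text: str) -> tuple[State, str | None]:
    """One FSM tick. The FSM alone picks the next state;
    the LLM is consulted only inside THINKING."""
    if state == State.LISTENING:
        return State.THINKING, None          # input captured, go think
    if state == State.THINKING:
        reply = llm_generate(user_text)      # one-way call into the LLM
        return State.SPEAKING, reply         # the FSM chooses the transition
    if state == State.SPEAKING:
        return State.LISTENING, None         # utterance finished
    return State.LISTENING, None             # INTERRUPTED always recovers here
```

Note that `next_state` returns the reply alongside the state: the LLM's output travels through the FSM as plain data, never as control.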
🔌 Information You May — and Must Not — Pass to the LLM
⭕ Safe to Pass
- The latest input text
- The purpose of the utterance (e.g., reply, confirmation, rejection)
- Maximum character length
- Tone constraints (e.g., neutral, formal)
❌ Never Pass
- State transition rules
- Interruption logic
- Timing control
- Continuation decisions
👉 The moment control information is passed, the system breaks.
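One way to enforce this is to build the prompt from a fixed whitelist of exactly the four safe fields, so control information cannot leak in by accident. A minimal sketch; the field names and wording are illustrative, not a prescribed API:

```python
def build_prompt(user_text: str,
                 purpose: str = "reply",     # reply / confirmation / rejection
                 max_chars: int = 120,
                 tone: str = "neutral") -> str:
    """Only the four safe fields ever reach the LLM.
    No state names, no interruption flags, no timing hints."""
    return (
        f"Write a {tone} {purpose} to the user, "
        f"at most {max_chars} characters.\n"
        f"User said: {user_text}"
    )
```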
🧪 Pseudocode (FSM × LLM)
```python
if state == "Thinking":
    text = llm_generate(prompt)   # one-way call: the LLM returns only a string
    state = "Speaking"            # the FSM, not the LLM, picks the next state
    play_tts(text)                # hand the string to TTS; playback stays with the FSM
```
Key points:
- The LLM is called only once per Thinking state
- Its output is just a string
- The FSM alone decides the next state
🚫 Common and Dangerous Anti-Patterns
❌ Asking the LLM “What should we do next?”
→ State management collapses
❌ Asking the LLM “Should I keep speaking?”
→ Infinite speech
❌ Letting the LLM decide interruptions
→ Immediate failure
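A cheap structural defense against all three anti-patterns is to treat the LLM's output as opaque utterance text: clamp its length, hand it straight to TTS, and never parse it for instructions. A sketch, with an illustrative fallback phrase:

```python
def sanitize(reply: str, max_chars: int = 120) -> str:
    """Treat LLM output as utterance text only: limit its length
    and never interpret it as a command or a state name."""
    return reply.strip()[:max_chars] or "Sorry, could you say that again?"
```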
🧠 Why This Structure Is Stable
- The FSM controls time
- The FSM controls input and output
- The FSM absorbs exceptions
- The LLM generates meaning only
👉 Responsibility separation is complete.
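"The FSM absorbs exceptions" can also be made concrete: if the LLM call fails or times out, the FSM substitutes a canned fallback and transitions normally, so the failure never leaks into the control structure. A sketch using only the standard library and reusing the `llm_generate` stub from the earlier sketch; the timeout value and fallback text are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)       # reused across turns

FALLBACK = "Sorry, could you say that again?"   # canned utterance, no LLM involved

def think(user_text: str, timeout_s: float = 3.0) -> str:
    """Call the LLM with a hard deadline. Whatever happens,
    the FSM receives a string and transitions to Speaking as usual."""
    future = _pool.submit(llm_generate, user_text)
    try:
        return future.result(timeout=timeout_s)  # raises TimeoutError if too slow
    except Exception:                             # timeout or model error: absorb it
        return FALLBACK                           # degrade gracefully, keep control
```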
📌 Summary
- 🤖 Place the LLM only in the Thinking state
- 🧩 The FSM always retains control
- 🔊 Never put an LLM in Speaking
- ✋ Never use an LLM for interruption decisions
- 🏗 The LLM is “external intelligence,” not the core
🏁 Series Conclusion
In this series, we pinned down the structure of voice AI by establishing that:
- 🎧 Voice generation AI is a control problem, not a generation problem
- 🧩 Systems centered on FSMs do not collapse
- ✋ Interruptions are normal behavior, not exceptions
- 🤖 LLMs must be confined strictly to the Thinking state
Even so, many people will still say:
🎭 “I want it to feel more conversational”
🤝 “I want it to feel like talking to a human”
But that is not a design objective — it is a desire for performance.
The minimum requirement for a voice AI to appear intelligent
is not imitation of conversation.
It is simply this:
- States are explicit
- Transitions are predictable
- The system can always recover from failure
That is control integrity.
This series ends by
breaking the illusion of conversational AI.
Anything beyond this
can be explored later — if and when it is truly needed.
(The GitHub-hosted Markdown is the canonical source; the Qiita version is a curated extract.)