Diagnosing Agent Coordination Failures
Your multi-agent system is failing. But how do you describe what's wrong?
"The agents are... not working together" doesn't help your debugging session. You need a diagnostic vocabulary—precise terms that map symptoms to root causes to solutions.
The Problem with Ad-Hoc Diagnosis
Most teams describe agent failures vaguely:
- "It's stuck"
- "The responses conflict"
- "It's too slow"
- "Something's wrong with the handoffs"
These descriptions don't point to solutions. They're symptoms, not diagnoses.
Musical Diagnostics
We developed a vocabulary borrowed from music—where coordinating 100+ performers in real-time has been solved for centuries.
Here are the core diagnostic terms:
Timing Failures
| Symptom | Diagnosis | What's Happening |
|---|---|---|
| Deadlock / Hanging | Stuck Fermata | Agent is waiting for a signal that never comes |
| Skipping steps, hallucinating | Rushing | Agent proceeds without required context |
| Can't make decisions | Dragging | Over-deliberation, no urgency signal |
| Acts too early | False Entry | Missing barrier synchronization |
Coherence Failures
| Symptom | Diagnosis | What's Happening |
|---|---|---|
| Contradictory outputs | Harmonic Clash | No shared constraint on who "wins" |
| Repeating same error | Deaf Agent | Not receiving feedback from verifier |
| Context mismatch | Dissonance | Different agents have different reference frames |
| Wanders off-task | Improvisation Drift | Missing explicit constraints/boundaries |
From Diagnosis to Fix
Each diagnosis maps to a pattern—a named solution you can implement.
Stuck Fermata → Escalation Pattern
Problem: Agent A sends a fermata (hold) to Agent B, but B never responds.
Fix: Add timeout with automatic escalation.
from hct_mcp_signals import fermata
hold = fermata(
source="legal_agent",
reason="Compliance review required",
hold_type="human",
timeout_ms=300000, # 5 minutes
on_timeout="escalate" # Auto-route to manager
)
Harmonic Clash → Shared Progressions Pattern
Problem: Finance says "approve" while Legal says "reject".
Fix: Define a voice hierarchy in your reference frame.
# reference_frame.yaml
shared_progressions:
compliance: "Legal has final voice"
budget: "Finance has final voice"
product: "Product has final voice"
Rushing → Tempo Control Pattern
Problem: Agent hallucinates because it feels "rushed".
Fix: Use explicit urgency signals and Chain-of-Verification.
# Instead of just sending content, send with tempo
signal = cue(
source="orchestrator",
targets=["analyst"],
urgency=5, # Not urgent - take your time
tempo="andante" # Walking pace
)
The Full Library
We've documented 15 diagnostic patterns covering:
- 5 Timing failures
- 5 Coherence failures
- 5 Resource failures
Each includes:
- How to spot it in your logs
- The root cause
- Code to fix it
Explore the Patterns Library →
Naming Things Matters
The act of naming a problem is half the battle. When you can say "we have a Stuck Fermata problem," your team immediately knows:
- What symptom to look for
- What the root cause is
- Which pattern to apply
That's the power of a shared diagnostic vocabulary.
Next time your agents misbehave, don't say "it's broken." Say "it's a Harmonic Clash" and reach for the pattern.