Silence Detection Implementation
What Was Added
Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.
Problem
The new ONNX server requires manually sending a {"type": "final"} command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.
Solution
Added silence tracking in voice_receiver.py:
- Track audio timestamps: Record when the last audio chunk was sent
- Detect silence: Start a timer after each audio chunk
- Send final command: If no new audio arrives within 1.5 seconds, send {"type": "final"}
- Cancel on new audio: Reset the timer if more audio arrives
Implementation Details
New Attributes
self.last_audio_time: Dict[int, float] = {} # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {} # Silence detection tasks
self.silence_timeout = 1.5 # Seconds of silence before "final"
New Method
async def _detect_silence(self, user_id: int):
    """
    Wait for silence timeout and send 'final' command to STT.
    Called after each audio chunk.
    """
    # Cancelled by the caller if new audio arrives before the timeout elapses.
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        await stt_client.send_final()
Integration
- Called after sending each audio chunk
- Cancels previous silence task if new audio arrives
- Automatically cleaned up when stopping listening
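Concretely, the integration amounts to cancelling and restarting a per-user asyncio task every time a chunk goes out. A minimal sketch of that logic (the helper name _restart_silence_timer is made up for illustration; in the real code this sits inline in _send_audio_chunk(), which runs inside the event loop):
import asyncio
import time

def _restart_silence_timer(self, user_id: int) -> None:
    """Sketch: record the chunk time, cancel any pending countdown, start a fresh one."""
    self.last_audio_time[user_id] = time.monotonic()

    previous = self.silence_tasks.get(user_id)
    if previous and not previous.done():
        previous.cancel()  # new audio arrived, abort the old countdown

    # _detect_silence() sleeps for silence_timeout and then asks the STT server to finalize.
    self.silence_tasks[user_id] = asyncio.create_task(self._detect_silence(user_id))
Cancelling the task raises CancelledError inside asyncio.sleep(), so a cancelled countdown never reaches send_final().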
Testing
Test 1: Basic Transcription
- Join voice channel
- Run !miku listen
- Speak a sentence and wait 1.5 seconds
- Expected: Final transcript appears and is sent to LlamaCPP
Test 2: Continuous Speech
- Start listening
- Speak multiple sentences with pauses < 1.5s between them
- Expected: Partial transcripts update, final sent after last sentence
Test 3: Multiple Users
- Have 2+ users in voice channel
- Each runs !miku listen
- Both speak (taking turns or simultaneously)
- Expected: Each user's speech is transcribed independently
Configuration
Silence Timeout
Default: 1.5 seconds
To adjust, edit voice_receiver.py:
self.silence_timeout = 1.5 # Change this value
Recommendations:
- Too short (< 1.0s): May cut off during natural pauses in speech
- Too long (> 3.0s): User waits too long for response
- Sweet spot: 1.5-2.0s works well for conversational speech
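If hard-coding the value becomes tedious, one option (not currently implemented; the STT_SILENCE_TIMEOUT variable name is an assumption) is to read it from the environment in voice_receiver.py's constructor:
import os

# Hypothetical: read the timeout from the environment instead of hard-coding it.
self.silence_timeout = float(os.getenv("STT_SILENCE_TIMEOUT", "1.5"))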
Monitoring
Check Logs for Silence Detection
docker logs miku-bot 2>&1 | grep "Silence detected"
Expected output:
[DEBUG] Silence detected for user 209381657369772032, requesting final transcript
Check Final Transcripts
docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"
Check STT Processing
docker logs miku-stt 2>&1 | grep "Final transcription"
Debugging
Issue: No Final Transcript
Symptoms: Partial transcripts appear but never finalize
Debug steps:
- Check if silence detection is triggering: docker logs miku-bot 2>&1 | grep "Silence detected"
- Check if final command is being sent: docker logs miku-stt 2>&1 | grep "type.*final"
- Increase log level in stt_client.py: logger.setLevel(logging.DEBUG)
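A note on the last step: setLevel() alone has no visible effect unless some handler actually emits DEBUG records. A quick, temporary way to get verbose output (a sketch; the logger name "stt_client" is a guess, use whatever getLogger(...) call stt_client.py actually makes):
import logging

# Attach a basic stdout handler so DEBUG records are actually printed.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("stt_client").setLevel(logging.DEBUG)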
Issue: Cuts Off Mid-Sentence
Symptoms: Final transcript triggers during natural pauses
Solution: Increase silence timeout:
self.silence_timeout = 2.0 # or 2.5
Issue: Too Slow to Respond
Symptoms: Long wait after user stops speaking
Solution: Decrease silence timeout:
self.silence_timeout = 1.0 # or 1.2
Architecture
Discord Voice → voice_receiver.py
          ↓
 [Audio Chunk Received]
          ↓
┌─────────────────────┐
│    send_audio()     │
│    to STT server    │
└─────────────────────┘
          ↓
┌─────────────────────┐
│   Start silence     │
│   detection timer   │
│   (1.5s countdown)  │
└─────────────────────┘
          ↓
    ┌─────┴─────────┐
    │               │
More audio      No more audio
arrives         for 1.5s
    │               │
    ↓               ↓
Cancel timer,   ┌──────────────┐
start new one   │ send_final() │
                │   to STT     │
                └──────────────┘
                       ↓
                ┌──────────────────┐
                │ Final transcript │
                │ → LlamaCPP       │
                └──────────────────┘
Files Modified
- bot/utils/voice_receiver.py
  - Added last_audio_time tracking
  - Added silence_tasks management
  - Added _detect_silence() method
  - Integrated silence detection in _send_audio_chunk()
  - Added cleanup in stop_listening()
- bot/utils/stt_client.py (previously)
  - Added send_final() method
  - Added send_reset() method
  - Updated protocol handler
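The stop_listening() cleanup listed above matters because a dangling timer could otherwise fire and send a stray final command after listening has stopped. A minimal sketch of that cleanup (the helper name _cancel_silence_tasks is made up; in the real code this logic lives inside stop_listening()):
def _cancel_silence_tasks(self) -> None:
    """Sketch: cancel every pending silence countdown and drop the per-user state."""
    for task in self.silence_tasks.values():
        if not task.done():
            task.cancel()
    self.silence_tasks.clear()
    self.last_audio_time.clear()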
Next Steps
- Test thoroughly with different speech patterns
- Tune silence timeout based on user feedback
- Consider VAD integration for more accurate speech end detection
- Add metrics to track transcription latency
Status: ✅ READY FOR TESTING
The system now:
- ✅ Connects to ONNX STT server (port 8766)
- ✅ Uses CUDA GPU acceleration (cuDNN 9)
- ✅ Receives partial transcripts
- ✅ Automatically detects silence
- ✅ Sends final command after 1.5s silence
- ✅ Forwards final transcript to LlamaCPP
Test it now with !miku listen!