miku-discord/SILENCE_DETECTION.md


Silence Detection Implementation

What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.

Problem

The new ONNX server requires manually sending a {"type": "final"} command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.

Solution

Added silence tracking in voice_receiver.py:

  1. Track audio timestamps: Record when the last audio chunk was sent
  2. Detect silence: Start a timer after each audio chunk
  3. Send final command: If no new audio arrives within 1.5 seconds, send {"type": "final"}
  4. Cancel on new audio: Reset the timer if more audio arrives

Implementation Details

New Attributes

self.last_audio_time: Dict[int, float] = {}      # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {} # Silence detection tasks
self.silence_timeout = 1.5  # Seconds of silence before "final"

New Method

async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then request the final transcript.
    Scheduled after each audio chunk; cancelled and rescheduled
    whenever new audio arrives.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        logger.debug(f"Silence detected for user {user_id}, requesting final transcript")
        await stt_client.send_final()

Integration

  • Called after sending each audio chunk
  • Cancels previous silence task if new audio arrives
  • Automatically cleaned up when stopping listening
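The cancel/restart behavior above can be sketched as a minimal, self-contained demo. Class and method names here are illustrative stand-ins, not the actual voice_receiver.py code, and `finals_sent` replaces the real `send_final()` call:

```python
import asyncio


class SilenceDemo:
    """Minimal sketch of per-user silence timers (names hypothetical)."""

    def __init__(self, timeout: float = 1.5):
        self.silence_timeout = timeout
        self.silence_tasks: dict[int, asyncio.Task] = {}
        self.finals_sent: list[int] = []  # stand-in for stt_client.send_final()

    async def _detect_silence(self, user_id: int):
        # If this sleep completes without being cancelled, the user
        # has been quiet for the full timeout window.
        await asyncio.sleep(self.silence_timeout)
        self.finals_sent.append(user_id)

    def on_audio_chunk(self, user_id: int):
        # Cancel any pending timer for this user, then start a fresh one,
        # so "final" only fires after an uninterrupted quiet window.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.create_task(
            self._detect_silence(user_id)
        )


async def main():
    demo = SilenceDemo(timeout=0.1)
    for _ in range(3):            # rapid chunks: timer keeps resetting
        demo.on_audio_chunk(42)
        await asyncio.sleep(0.05)
    await asyncio.sleep(0.2)      # quiet window: timer fires exactly once
    print(demo.finals_sent)       # one "final" for user 42


asyncio.run(main())
```

Because each chunk cancels the previous timer, continuous speech never triggers a premature final; only a full 1.5 s gap does.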

Testing

Test 1: Basic Transcription

  1. Join voice channel
  2. Run !miku listen
  3. Speak a sentence and wait 1.5 seconds
  4. Expected: Final transcript appears and is sent to LlamaCPP

Test 2: Continuous Speech

  1. Start listening
  2. Speak multiple sentences with pauses < 1.5s between them
  3. Expected: Partial transcripts update, final sent after last sentence

Test 3: Multiple Users

  1. Have 2+ users in voice channel
  2. Each runs !miku listen
  3. Both speak (taking turns or simultaneously)
  4. Expected: Each user's speech is transcribed independently

Configuration

Silence Timeout

Default: 1.5 seconds

To adjust, edit voice_receiver.py:

self.silence_timeout = 1.5  # Change this value
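If editing the source for each deployment is inconvenient, the timeout could instead be read from an environment variable. This is a hypothetical sketch (the `MIKU_SILENCE_TIMEOUT` variable is not currently part of the bot):

```python
import os


def silence_timeout_from_env(default: float = 1.5) -> float:
    """Read the timeout from MIKU_SILENCE_TIMEOUT (hypothetical variable),
    falling back to the documented 1.5 s default on absence or bad input."""
    raw = os.getenv("MIKU_SILENCE_TIMEOUT")
    try:
        return float(raw) if raw is not None else default
    except ValueError:
        return default
```

voice_receiver.py would then set `self.silence_timeout = silence_timeout_from_env()` instead of hard-coding the value.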

Recommendations:

  • Too short (< 1.0s): May cut off during natural pauses in speech
  • Too long (> 3.0s): User waits too long for response
  • Sweet spot: 1.5-2.0s works well for conversational speech

Monitoring

Check Logs for Silence Detection

docker logs miku-bot 2>&1 | grep "Silence detected"

Expected output:

[DEBUG] Silence detected for user 209381657369772032, requesting final transcript

Check Final Transcripts

docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"

Check STT Processing

docker logs miku-stt 2>&1 | grep "Final transcription"

Debugging

Issue: No Final Transcript

Symptoms: Partial transcripts appear but never finalize

Debug steps:

  1. Check if silence detection is triggering:

    docker logs miku-bot 2>&1 | grep "Silence detected"
    
  2. Check if final command is being sent:

    docker logs miku-stt 2>&1 | grep "type.*final"
    
  3. Increase log level in stt_client.py:

    logger.setLevel(logging.DEBUG)
    

Issue: Cuts Off Mid-Sentence

Symptoms: Final transcript triggers during natural pauses

Solution: Increase silence timeout:

self.silence_timeout = 2.0  # or 2.5

Issue: Too Slow to Respond

Symptoms: Long wait after user stops speaking

Solution: Decrease silence timeout:

self.silence_timeout = 1.0  # or 1.2

Architecture

Discord Voice → voice_receiver.py
                     ↓
            [Audio Chunk Received]
                     ↓
         ┌─────────────────────┐
         │  send_audio()       │
         │  to STT server      │
         └─────────────────────┘
                     ↓
         ┌─────────────────────┐
         │  Start silence      │
         │  detection timer    │
         │  (1.5s countdown)   │
         └─────────────────────┘
                     ↓
              ┌──────┴──────┐
              │             │
        More audio    No more audio
        arrives       for 1.5s
              │             │
              ↓             ↓
         Cancel timer  ┌──────────────┐
         Start new     │ send_final() │
                       │ to STT       │
                       └──────────────┘
                             ↓
                    ┌─────────────────┐
                    │ Final transcript│
                    │ → LlamaCPP     │
                    └─────────────────┘

Files Modified

  1. bot/utils/voice_receiver.py

    • Added last_audio_time tracking
    • Added silence_tasks management
    • Added _detect_silence() method
    • Integrated silence detection in _send_audio_chunk()
    • Added cleanup in stop_listening()
  2. bot/utils/stt_client.py (modified in an earlier change)

    • Added send_final() method
    • Added send_reset() method
    • Updated protocol handler
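The control messages are plain JSON over the WebSocket. A sketch of what `send_final()` / `send_reset()` might look like follows; the `{"type": "final"}` payload is from this doc, while the `{"type": "reset"}` payload and the websocket wiring are assumptions, not the actual stt_client.py code:

```python
import json


class STTClientSketch:
    """Sketch of the STT control messages (method names from this doc;
    transport details are assumed)."""

    def __init__(self, websocket):
        # Any object with an async send(str) method, e.g. a websockets
        # connection to the ONNX STT server on port 8766.
        self.websocket = websocket

    async def send_final(self):
        # Ask the server to emit the complete transcription now.
        await self.websocket.send(json.dumps({"type": "final"}))

    async def send_reset(self):
        # Assumed payload: clear server-side audio state for a new utterance.
        await self.websocket.send(json.dumps({"type": "reset"}))
```

The `grep "type.*final"` check in the Debugging section matches the serialized form of the first message.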

Next Steps

  1. Test thoroughly with different speech patterns
  2. Tune silence timeout based on user feedback
  3. Consider VAD integration for more accurate speech end detection
  4. Add metrics to track transcription latency

Status: READY FOR TESTING

The system now:

  • Connects to ONNX STT server (port 8766)
  • Uses CUDA GPU acceleration (cuDNN 9)
  • Receives partial transcripts
  • Automatically detects silence
  • Sends final command after 1.5s silence
  • Forwards final transcript to LlamaCPP

Test it now with !miku listen!