miku-discord/SILENCE_DETECTION.md


Silence Detection Implementation

What Was Added

Implemented automatic silence detection to trigger final transcriptions in the new ONNX-based STT system.

Problem

The new ONNX server requires manually sending a {"type": "final"} command to get the complete transcription. Without this, partial transcripts would appear but never be finalized and sent to LlamaCPP.

Solution

Added silence tracking in voice_receiver.py:

  1. Track audio timestamps: Record when the last audio chunk was sent
  2. Detect silence: Start a timer after each audio chunk
  3. Send final command: If no new audio arrives within 1.5 seconds, send {"type": "final"}
  4. Cancel on new audio: Reset the timer if more audio arrives

Implementation Details

New Attributes

self.last_audio_time: Dict[int, float] = {}      # Track last audio per user
self.silence_tasks: Dict[int, asyncio.Task] = {} # Silence detection tasks
self.silence_timeout = 1.5  # Seconds of silence before "final"

New Method

async def _detect_silence(self, user_id: int):
    """
    Wait for the silence timeout, then request the final transcript.
    Scheduled after each audio chunk; cancelled and rescheduled
    whenever new audio arrives.
    """
    await asyncio.sleep(self.silence_timeout)
    stt_client = self.stt_clients.get(user_id)
    if stt_client and stt_client.is_connected():
        logger.debug(f"Silence detected for user {user_id}, requesting final transcript")
        await stt_client.send_final()

Integration

  • Called after sending each audio chunk
  • Cancels previous silence task if new audio arrives
  • Automatically cleaned up when stopping listening
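The cancel/restart behavior above can be sketched as a minimal, self-contained demo. Class and method names here are illustrative stand-ins, not the actual voice_receiver.py code, and `finals_sent` replaces the real `send_final()` call:

```python
import asyncio


class SilenceDemo:
    """Minimal sketch of per-user silence timers (names hypothetical)."""

    def __init__(self, timeout: float = 1.5):
        self.silence_timeout = timeout
        self.silence_tasks: dict[int, asyncio.Task] = {}
        self.finals_sent: list[int] = []  # stand-in for stt_client.send_final()

    async def _detect_silence(self, user_id: int):
        # If this sleep completes without being cancelled, the user
        # has been quiet for the full timeout window.
        await asyncio.sleep(self.silence_timeout)
        self.finals_sent.append(user_id)

    def on_audio_chunk(self, user_id: int):
        # Cancel any pending timer for this user, then start a fresh one,
        # so "final" only fires after an uninterrupted quiet window.
        task = self.silence_tasks.get(user_id)
        if task and not task.done():
            task.cancel()
        self.silence_tasks[user_id] = asyncio.create_task(
            self._detect_silence(user_id)
        )


async def main():
    demo = SilenceDemo(timeout=0.1)
    for _ in range(3):            # rapid chunks: timer keeps resetting
        demo.on_audio_chunk(42)
        await asyncio.sleep(0.05)
    await asyncio.sleep(0.2)      # quiet window: timer fires exactly once
    print(demo.finals_sent)       # one "final" for user 42


asyncio.run(main())
```

Because each chunk cancels the previous timer, continuous speech never triggers a premature final; only a full 1.5 s gap does.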

Testing

Test 1: Basic Transcription

  1. Join voice channel
  2. Run !miku listen
  3. Speak a sentence and wait 1.5 seconds
  4. Expected: Final transcript appears and is sent to LlamaCPP

Test 2: Continuous Speech

  1. Start listening
  2. Speak multiple sentences with pauses < 1.5s between them
  3. Expected: Partial transcripts update, final sent after last sentence

Test 3: Multiple Users

  1. Have 2+ users in voice channel
  2. Each runs !miku listen
  3. Both speak (taking turns or simultaneously)
  4. Expected: Each user's speech is transcribed independently

Configuration

Silence Timeout

Default: 1.5 seconds

To adjust, edit voice_receiver.py:

self.silence_timeout = 1.5  # Change this value
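If editing the source for each deployment is inconvenient, the timeout could instead be read from an environment variable. This is a hypothetical sketch (the `MIKU_SILENCE_TIMEOUT` variable is not currently part of the bot):

```python
import os


def silence_timeout_from_env(default: float = 1.5) -> float:
    """Read the timeout from MIKU_SILENCE_TIMEOUT (hypothetical variable),
    falling back to the documented 1.5 s default on absence or bad input."""
    raw = os.getenv("MIKU_SILENCE_TIMEOUT")
    try:
        return float(raw) if raw is not None else default
    except ValueError:
        return default
```

voice_receiver.py would then set `self.silence_timeout = silence_timeout_from_env()` instead of hard-coding the value.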

Recommendations:

  • Too short (< 1.0s): May cut off during natural pauses in speech
  • Too long (> 3.0s): User waits too long for response
  • Sweet spot: 1.5-2.0s works well for conversational speech

Monitoring

Check Logs for Silence Detection

docker logs miku-bot 2>&1 | grep "Silence detected"

Expected output:

[DEBUG] Silence detected for user 209381657369772032, requesting final transcript

Check Final Transcripts

docker logs miku-bot 2>&1 | grep "FINAL TRANSCRIPT"

Check STT Processing

docker logs miku-stt 2>&1 | grep "Final transcription"

Debugging

Issue: No Final Transcript

Symptoms: Partial transcripts appear but never finalize

Debug steps:

  1. Check if silence detection is triggering:

    docker logs miku-bot 2>&1 | grep "Silence detected"
    
  2. Check if final command is being sent:

    docker logs miku-stt 2>&1 | grep "type.*final"
    
  3. Increase log level in stt_client.py:

    logger.setLevel(logging.DEBUG)
    

Issue: Cuts Off Mid-Sentence

Symptoms: Final transcript triggers during natural pauses

Solution: Increase silence timeout:

self.silence_timeout = 2.0  # or 2.5

Issue: Too Slow to Respond

Symptoms: Long wait after user stops speaking

Solution: Decrease silence timeout:

self.silence_timeout = 1.0  # or 1.2

Architecture

Discord Voice → voice_receiver.py
                     ↓
            [Audio Chunk Received]
                     ↓
         ┌─────────────────────┐
         │  send_audio()       │
         │  to STT server      │
         └─────────────────────┘
                     ↓
         ┌─────────────────────┐
         │  Start silence      │
         │  detection timer    │
         │  (1.5s countdown)   │
         └─────────────────────┘
                     ↓
              ┌──────┴──────┐
              │             │
        More audio    No more audio
        arrives       for 1.5s
              │             │
              ↓             ↓
         Cancel timer  ┌──────────────┐
         Start new     │ send_final() │
                       │ to STT       │
                       └──────────────┘
                             ↓
                    ┌─────────────────┐
                    │ Final transcript│
                    │ → LlamaCPP     │
                    └─────────────────┘

Files Modified

  1. bot/utils/voice_receiver.py

    • Added last_audio_time tracking
    • Added silence_tasks management
    • Added _detect_silence() method
    • Integrated silence detection in _send_audio_chunk()
    • Added cleanup in stop_listening()
  2. bot/utils/stt_client.py (modified in an earlier change)

    • Added send_final() method
    • Added send_reset() method
    • Updated protocol handler
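The control messages are plain JSON over the WebSocket. A sketch of what `send_final()` / `send_reset()` might look like follows; the `{"type": "final"}` payload is from this doc, while the `{"type": "reset"}` payload and the websocket wiring are assumptions, not the actual stt_client.py code:

```python
import json


class STTClientSketch:
    """Sketch of the STT control messages (method names from this doc;
    transport details are assumed)."""

    def __init__(self, websocket):
        # Any object with an async send(str) method, e.g. a websockets
        # connection to the ONNX STT server on port 8766.
        self.websocket = websocket

    async def send_final(self):
        # Ask the server to emit the complete transcription now.
        await self.websocket.send(json.dumps({"type": "final"}))

    async def send_reset(self):
        # Assumed payload: clear server-side audio state for a new utterance.
        await self.websocket.send(json.dumps({"type": "reset"}))
```

The `grep "type.*final"` check in the Debugging section matches the serialized form of the first message.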

Next Steps

  1. Test thoroughly with different speech patterns
  2. Tune silence timeout based on user feedback
  3. Consider VAD integration for more accurate speech end detection
  4. Add metrics to track transcription latency

Status: READY FOR TESTING

The system now:

  • Connects to ONNX STT server (port 8766)
  • Uses CUDA GPU acceleration (cuDNN 9)
  • Receives partial transcripts
  • Automatically detects silence
  • Sends final command after 1.5s silence
  • Forwards final transcript to LlamaCPP

Test it now with !miku listen!