HIGH: Add Circuit Breakers for Critical Services #28

Closed
opened 2026-02-16 22:51:16 +02:00 by Koko210 · 1 comment
Owner

Critical external services (Cheshire Cat, LLM, STT) lack circuit breakers, causing cascading failures when services are down.

Where It Occurs

  • cat-plugins/cat_client.py - Cheshire Cat calls
  • bot/utils/llm.py - LLM API calls
  • bot/stt_client.py - Speech-to-text service
  • bot/utils/voice_audio.py - Voice processing

Why This Is a Problem

  1. Cascading Failures: One failing service degrades the entire bot
  2. Resource Exhaustion: Rapid retries consume bandwidth and memory
  3. Slow Recovery: Requests pile up against the failing service, delaying its recovery
  4. No Graceful Degradation: The bot shows errors instead of falling back to alternative behavior

What Can Go Wrong

Scenario 1: Cheshire Cat Downtime

  1. Cheshire Cat service goes down
  2. Bot tries memory consolidation
  3. Request times out
  4. Bot retries immediately
  5. All requests queue up
  6. Bot becomes unresponsive
  7. Users experience complete failure
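
To put rough numbers on this scenario, the sketch below compares the worst-case time spent waiting on a dead service with and without a breaker. It assumes requests are handled one at a time, every call runs to its full timeout, and the breaker's cooldown outlasts the burst. The function name and the figures are illustrative and not taken from the bot.

```python
from typing import Optional


def worst_case_wait(num_requests: int, timeout_s: float,
                    failure_threshold: Optional[int] = None) -> float:
    """Seconds spent waiting on a service that never answers.

    Without a breaker (failure_threshold=None) every request pays the full
    timeout. With one, only the calls made before the circuit opens do;
    later calls are rejected immediately.
    """
    if failure_threshold is None:
        paying_calls = num_requests
    else:
        paying_calls = min(num_requests, failure_threshold)
    return paying_calls * timeout_s


# Illustrative numbers: 20 queued requests, 10-second timeout.
print(worst_case_wait(20, 10.0))                       # 200.0
print(worst_case_wait(20, 10.0, failure_threshold=3))  # 30.0
```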

Scenario 2: LLM Service Overload

  1. LLM API gets overloaded (503/429)
  2. Multiple users send messages
  3. Each request hits API, times out
  4. Bot waits for timeouts on all requests
  5. Event loop blocked
  6. Voice chat becomes laggy/unusable
  7. Entire bot degraded
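
A breaker can only count failures that actually surface, so steps 3 and 4 depend on each call having a bounded timeout. The sketch below shows one way to do that with `asyncio.wait_for`. It assumes the LLM client is a coroutine function; `call_with_timeout` and `slow_llm_call` are hypothetical names used for illustration only.

```python
import asyncio


async def call_with_timeout(func, *args, timeout_s: float = 5.0, **kwargs):
    """Await func(*args, **kwargs), raising asyncio.TimeoutError after timeout_s."""
    return await asyncio.wait_for(func(*args, **kwargs), timeout=timeout_s)


async def slow_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for an overloaded LLM endpoint.
    await asyncio.sleep(1.0)
    return f"reply to {prompt}"


async def demo() -> str:
    try:
        return await call_with_timeout(slow_llm_call, "hi", timeout_s=0.05)
    except asyncio.TimeoutError:
        # A slow call becomes an ordinary failure that a breaker can count.
        return "timed out"


print(asyncio.run(demo()))  # timed out
```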

Proposed Fix

Implement circuit breaker pattern with fallback:

# bot/utils/circuit_breaker.py - NEW FILE
import asyncio
import time
from enum import Enum
from typing import Optional, Callable, Any
import logging

logger = logging.getLogger(__name__)

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        success_threshold: int = 2
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time: Optional[float] = None
        
        self.lock = asyncio.Lock()

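    async def call(self, func: Callable, *args, **kwargs) -> Any:
        """Run an async callable through the breaker, failing fast while OPEN."""
        async with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    # Cooldown elapsed: let trial calls through to probe the service.
                    self.state = CircuitState.HALF_OPEN
                    self.success_count = 0
                    logger.info("Circuit breaker entering HALF_OPEN state")
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is OPEN")

        # The lock is released before the call so slow requests do not serialize.
        try:
            result = await func(*args, **kwargs)
        except Exception:
            await self._on_failure()
            raise
        await self._on_success()
        return result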

    def _should_attempt_reset(self) -> bool:
        if self.last_failure_time is None:
            return False
        # Monotonic time is immune to wall-clock adjustments.
        return time.monotonic() - self.last_failure_time >= self.recovery_timeout

    async def _on_success(self):
        async with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    self.success_count = 0
                    logger.info("Circuit breaker CLOSED after successful trial calls")
            else:
                self.failure_count = 0

    async def _on_failure(self):
        async with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # Any failure during a HALF_OPEN trial reopens the circuit at once.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN
                self.success_count = 0

class CircuitBreakerOpenError(Exception):
    pass

# Create circuit breakers for critical services
cheshire_cat_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60.0)
llm_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)
stt_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)

# Example usage with fallback
async def call_cheshire_cat_with_fallback(message: str):
    try:
        result = await cheshire_cat_breaker.call(call_cheshire_cat_api, message)
        return result
    except CircuitBreakerOpenError:
        logger.warning("Cheshire Cat circuit open, using fallback")
        return await fallback_memory_lookup(message)
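
The state transitions of the proposed breaker can be checked without waiting out a real cooldown. The sketch below is a condensed variant, not the class above: `MiniBreaker`, `BreakerOpen`, and the injectable `clock` parameter are assumptions added for illustration, and the lock is omitted for brevity.

```python
import asyncio
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerOpen(Exception):
    """Raised when a call is rejected without reaching the service."""


class MiniBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    async def call(self, func, *args):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise BreakerOpen("circuit is open")
            self.state, self.successes = State.HALF_OPEN, 0
        try:
            result = await func(*args)
        except Exception:
            self.failures += 1
            if (self.state is State.HALF_OPEN
                    or self.failures >= self.failure_threshold):
                self.state, self.opened_at = State.OPEN, self.clock()
            raise
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = State.CLOSED, 0
        else:
            self.failures = 0
        return result


async def demo():
    now = [0.0]                      # fake clock, advanced by hand
    breaker = MiniBreaker(clock=lambda: now[0])
    healthy = [False]

    async def service():
        if not healthy[0]:
            raise ConnectionError("service down")
        return "ok"

    seen = []
    for _ in range(3):               # three failures trip the breaker
        try:
            await breaker.call(service)
        except ConnectionError:
            pass
    seen.append(breaker.state)
    try:
        await breaker.call(service)  # rejected without calling service
    except BreakerOpen:
        seen.append("rejected")
    now[0] += 60.0                   # cooldown elapses, service recovers
    healthy[0] = True
    await breaker.call(service)
    seen.append(breaker.state)       # first trial call succeeded
    await breaker.call(service)
    seen.append(breaker.state)       # second trial call succeeded
    return seen


print(asyncio.run(demo()))
```

Under these assumptions the recorded sequence is OPEN, a rejected call, HALF_OPEN, then CLOSED, which is the cycle the proposed class is meant to follow.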

Severity

HIGH - Lack of circuit breakers causes cascading failures and complete bot unresponsiveness.

Files Affected

cat-plugins/cat_client.py, bot/utils/llm.py, bot/stt_client.py, bot/utils/voice_audio.py, new file: bot/utils/circuit_breaker.py

Author
Owner

Closing as Already Implemented - A circuit breaker already exists for the Cheshire Cat service in bot/utils/cat_client.py lines 45-100. The CatAdapter class has full circuit breaker functionality: consecutive failure tracking (max 3 failures), 60-second cooldown period, automatic state transitions, and graceful fallback to direct LLM queries when the circuit is open. The implementation matches the pattern proposed in this issue.
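
For readers without the repository at hand, the behavior described in this comment could look roughly like the sketch below. It is a hypothetical illustration, not the actual `CatAdapter` code: `CooldownGuard`, `answer`, `query_cat`, and `query_llm` are invented names, and falling back to the LLM on each individual failure (not only while the circuit is open) is an assumption.

```python
import asyncio
import time


class CooldownGuard:
    """Hypothetical consecutive-failure guard; not the project's CatAdapter."""

    def __init__(self, max_failures=3, cooldown_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.consecutive_failures = 0
        self.open_until = None  # None means the circuit has never opened

    def is_open(self) -> bool:
        return self.open_until is not None and self.clock() < self.open_until

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            # Skip the service until the cooldown has passed.
            self.open_until = self.clock() + self.cooldown_s
            self.consecutive_failures = 0


async def answer(message, guard, query_cat, query_llm):
    """Try the Cat first; use the direct LLM path on failure or open circuit."""
    if not guard.is_open():
        try:
            reply = await query_cat(message)
            guard.record_success()
            return reply
        except Exception:
            guard.record_failure()
    return await query_llm(message)


async def demo():
    now = [0.0]  # fake clock so the demo does not sleep
    guard = CooldownGuard(clock=lambda: now[0])
    cat_calls = []

    async def query_cat(message):   # stand-in for a Cat that is down
        cat_calls.append(message)
        raise ConnectionError("cat unreachable")

    async def query_llm(message):   # stand-in for the direct LLM path
        return f"llm: {message}"

    replies = [await answer(f"m{i}", guard, query_cat, query_llm)
               for i in range(5)]
    return replies, len(cat_calls)


replies, cat_attempts = asyncio.run(demo())
print(replies[-1], cat_attempts)
```

In this sketch only the first three messages reach the failing service; the remaining two go straight to the LLM path until the 60-second cooldown passes.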


Reference: Koko210/miku-discord#28