HIGH: Add Retry Logic for External API Calls #27

Open
opened 2026-02-16 22:50:36 +02:00 by Koko210 · 0 comments
Owner

External API calls (Cheshire Cat, LLM, voice services) lack retry logic, causing failures when services are temporarily unavailable.

Where It Occurs

  • cat-plugins/cat_client.py - Cheshire Cat API calls
  • bot/utils/llm.py - LLM API calls
  • bot/bot.py - Voice service calls
  • Various HTTP/WebSocket calls

Why This Is a Problem

  1. Fragility: Temporary failures cause immediate errors
  2. Poor UX: Users see errors for transient issues
  3. Data Loss: Failed requests are not retried
  4. No Backoff: Rapid retries overwhelm services
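The "No Backoff" point is the key one: with a multiplier of 2, the wait between attempts doubles each time, so retries thin out instead of hammering a service that is trying to recover. A minimal sketch of the delay schedule (illustrative helper, not project code):

```python
# Illustrative only: the delay before each retry grows geometrically
# (exponential backoff), capped at a maximum so waits never run away.
def backoff_delays(attempts: int, initial: float = 1.0,
                   multiplier: float = 2.0, cap: float = 60.0) -> list[float]:
    """Return the sleep intervals used between retry attempts."""
    delays = []
    delay = initial
    for _ in range(attempts - 1):  # no sleep after the final attempt
        delays.append(min(delay, cap))
        delay *= multiplier
    return delays

print(backoff_delays(5))  # -> [1.0, 2.0, 4.0, 8.0]
```

With the defaults proposed below (initial 1.0s, multiplier 2.0, cap 60s), a long retry sequence plateaus at the 60-second ceiling rather than growing without bound.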

What Can Go Wrong

Scenario 1: Temporary Network Glitch

  1. User sends message requiring LLM call
  2. Network has 2-second hiccup
  3. LLM API call fails with timeout
  4. Bot shows error to user
  5. Network is fine 1 second later, but request not retried
  6. User frustrated

Scenario 2: Service Overload

  1. Cheshire Cat service overloaded (503)
  2. Memory consolidation request fails
  3. Bot shows error message
  4. Service recovers in 30 seconds
  5. User has to manually retry
  6. Poor user experience
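Both scenarios follow the same shape: a transient failure that resolves within seconds of the first attempt. A toy simulation of that pattern under a simple retry loop (illustrative only, not project code; the sleep is zeroed out so the demo runs instantly):

```python
import asyncio

# A fake service that fails twice with a transient error, then succeeds --
# the pattern both scenarios above describe.
class FlakyService:
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def request(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("temporarily unavailable")
        return "ok"

async def call_with_retry(service: FlakyService, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            return await service.request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller
            await asyncio.sleep(0)  # real code backs off here

svc = FlakyService(failures=2)
print(asyncio.run(call_with_retry(svc)))  # -> ok, on the third call
```

Without the retry loop, the first `ConnectionError` would propagate straight to the user, exactly as the scenarios describe.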

Proposed Fix

Implement retry logic with exponential backoff:

# bot/utils/retry.py - NEW FILE
import asyncio
import functools
import logging
from typing import Callable, Tuple, Type

import aiohttp  # needed for the default retryable aiohttp.ClientError

logger = logging.getLogger(__name__)

class RetryConfig:
    def __init__(
        self,
        max_attempts: int = 3,
        initial_delay: float = 1.0,
        max_delay: float = 60.0,
        backoff_multiplier: float = 2.0,
        retryable_exceptions: Tuple[Type[Exception], ...] = (
            TimeoutError,
            ConnectionError,
            aiohttp.ClientError,
        )
    ):
        self.max_attempts = max_attempts
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.backoff_multiplier = backoff_multiplier
        self.retryable_exceptions = retryable_exceptions

def with_retry(config: RetryConfig):
    """Decorator to add retry logic to async functions"""
    def decorator(func: Callable):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = config.initial_delay
            for attempt in range(config.max_attempts):
                try:
                    return await func(*args, **kwargs)
                except config.retryable_exceptions as e:
                    if attempt == config.max_attempts - 1:
                        logger.error(f"Function {func.__name__} failed after {config.max_attempts} attempts: {e}")
                        raise
                    
                    logger.warning(f"Function {func.__name__} failed (attempt {attempt + 1}/{config.max_attempts}): {e}")
                    await asyncio.sleep(min(delay, config.max_delay))
                    delay *= config.backoff_multiplier
            
            # Unreachable: the loop either returns or re-raises on the
            # final attempt, but this keeps type checkers satisfied.
            raise RuntimeError("Max retry attempts exceeded")
        return wrapper
    return decorator

# Example usage
RETRY_CONFIG = RetryConfig(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=60.0,
    backoff_multiplier=2.0,
)

@with_retry(RETRY_CONFIG)
async def call_llm_api(prompt: str):
    # LLM_URL is assumed to be defined elsewhere (e.g. module config)
    async with aiohttp.ClientSession() as session:
        async with session.post(LLM_URL, json={"prompt": prompt}) as resp:
            return await resp.json()
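One gap worth flagging in the sketch above: aiohttp does not raise for non-2xx responses by default, so the 503 in Scenario 2 would come back as a normal response and bypass the retry path entirely. Calling `resp.raise_for_status()` inside the wrapped function converts it into `aiohttp.ClientResponseError`, which is a subclass of `aiohttp.ClientError` and therefore already covered by the default `retryable_exceptions`. The status classification itself is plain logic (helper name is illustrative, not project code):

```python
# Which HTTP statuses should trigger a retry: 429 and 5xx are transient;
# other 4xx errors are the caller's fault and retrying just wastes attempts.
def is_retryable_status(status: int) -> bool:
    return status == 429 or 500 <= status < 600

print(is_retryable_status(503))  # True  (service overloaded)
print(is_retryable_status(404))  # False (retrying will not help)
```

A status check like this could gate whether `raise_for_status()` failures are re-raised as retryable, so permanent client errors fail fast instead of burning all three attempts.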

Severity

HIGH - Lack of retry logic causes frequent failures and poor UX.

Files Affected

cat-plugins/cat_client.py, bot/utils/llm.py, bot/bot.py, new file: bot/utils/retry.py


Reference: Koko210/miku-discord#27