
Cheshire Cat RAG Viability - Post-Optimization Results

Executive Summary

Status: NOW VIABLE FOR VOICE CHAT

After disabling KV cache offloading to CPU in llama-swap, Cheshire Cat's RAG approach is now competitive with direct context loading for real-time voice chat applications.

Performance Comparison

Time To First Token (TTFT) - Critical Metric for Voice Chat

| Method | Previous | Current | Improvement |
|---|---|---|---|
| 🐱 Cheshire Cat (RAG) | 1578ms | 504ms | +68% |
| 📄 Direct + Full Context | 904ms | 451ms | +50% |
| Direct + Minimal | 210ms | 145ms | +31% |
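For context, TTFT is measured as the wall-clock time from sending the request to receiving the first streamed token. A minimal sketch of that measurement, using a simulated token stream in place of a real streaming `/v1/chat/completions` response (the function names and delay are illustrative, not from the benchmark scripts):

```python
import time

def measure_ttft_ms(token_stream):
    """Time from iteration start to the first streamed token, in ms.
    `token_stream` is any iterator of tokens, e.g. parsed SSE chunks
    from a streaming chat-completions response."""
    start = time.perf_counter()
    next(token_stream)  # blocks until the first token arrives
    return (time.perf_counter() - start) * 1000

# Stand-in for a real streaming response: first token after ~50ms.
def fake_stream(delay_s=0.05):
    time.sleep(delay_s)
    yield "Hello"
    yield ", world"

ttft = measure_ttft_ms(fake_stream())
print(f"TTFT: {ttft:.0f}ms")
```

Against a real endpoint, `fake_stream()` would be replaced by the iterator over the server's streamed chunks; everything else stays the same.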

Total Generation Time

| Method | Previous | Current | Improvement |
|---|---|---|---|
| 🐱 Cheshire Cat | 10.5s | 4.2s | +60% |
| 📄 Direct + Full Context | 8.3s | 1.2s | +85% |
| Direct + Minimal | 6.4s | 0.8s | +87% |
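The Improvement column reads as the percent reduction relative to the previous measurement. A quick spot-check of the arithmetic against a few rows from the tables above:

```python
def improvement_pct(previous, current):
    # Percent reduction relative to the previous measurement.
    return round((previous - current) / previous * 100)

# TTFT rows (ms) and the Cheshire Cat total-generation row (s)
print(improvement_pct(1578, 504))  # → 68
print(improvement_pct(904, 451))   # → 50
print(improvement_pct(210, 145))   # → 31
print(improvement_pct(10.5, 4.2))  # → 60
```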

Voice Chat Viability Assessment

Before Optimization

  • Cheshire Cat: 1578ms - TOO SLOW
  • Current System: 904ms - GOOD
  • Minimal: 210ms - EXCELLENT

After Optimization

  • Cheshire Cat: 504ms - GOOD
  • Current System: 451ms - EXCELLENT
  • Minimal: 145ms - EXCELLENT

Target: <1000ms for voice chat. All methods now pass!
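The labels above can be read as rough thresholds: EXCELLENT under about 500ms, GOOD under the 1000ms target, TOO SLOW beyond it. A sketch of that mapping — the 500ms cut-off is inferred from the labels, not stated explicitly in the benchmark:

```python
VOICE_CHAT_TARGET_MS = 1000  # the <1000ms budget stated above

def ttft_verdict(ttft_ms):
    """Map a TTFT measurement to the labels used in this assessment.
    The 500ms EXCELLENT cut-off is an inference from the labels."""
    if ttft_ms < 500:
        return "EXCELLENT"
    if ttft_ms < VOICE_CHAT_TARGET_MS:
        return "GOOD"
    return "TOO SLOW"

for ms in (1578, 904, 504, 451, 210, 145):
    print(ms, ttft_verdict(ms))
```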

Key Findings

1. Cheshire Cat is Now Competitive

  • 504ms mean TTFT is excellent for voice chat
  • Only 53ms slower than current approach (10% difference)
  • Median TTFT: 393ms - even better than mean

2. All Systems Dramatically Improved

  • Current system: 904ms → 451ms (2x faster)
  • Cheshire Cat: 1578ms → 504ms (3x faster)
  • Total generation times cut by 60-87% across the board

3. KV Cache Optimization Impact

Disabling CPU offloading provided:

  • Faster token generation once model is warmed up
  • Consistent low latency across queries
  • Dramatic improvement in total response times
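For reference, a sketch of what the relevant llama-swap model entry might look like after the change. The model name and path are placeholders, and exact flags depend on the llama.cpp build; the key point is simply omitting llama.cpp's `--no-kv-offload` flag so the KV cache stays in VRAM alongside the offloaded layers:

```yaml
# Hypothetical llama-swap entry (model name and file path are placeholders).
# No `--no-kv-offload` in cmd: passing it would pin the KV cache in system
# RAM and reproduce the slow pre-optimization TTFT numbers.
models:
  "darkidol":
    cmd: >
      llama-server --port ${PORT}
      -m /models/darkidol-llama-3.1-8b.Q4_K_M.gguf
      -ngl 99
```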

Trade-offs Analysis

Cheshire Cat (RAG) Advantages

  • Scalability: Can handle much larger knowledge bases (100s of MB)
  • Dynamic Updates: Add new context without reloading the bot
  • Memory Efficiency: Only loads the relevant context, not the entire 10KB every time
  • Semantic Search: Better at finding relevant info in large datasets
  • Now Fast Enough: 504ms TTFT is excellent for voice chat

Cheshire Cat Disadvantages

  • ⚠️ Slightly slower (53ms) than direct loading
  • ⚠️ More complex infrastructure (Qdrant, embeddings)
  • ⚠️ Requires Docker container management
  • ⚠️ Learning curve for plugin development

Current System (Direct Loading) Advantages

  • Simplest approach: Load context, query LLM
  • Slightly faster: 451ms vs 504ms (10% faster)
  • No external dependencies: Just llama-swap
  • Proven and stable: Already working in production

Current System Disadvantages

  • ⚠️ Not scalable: 10KB context works, but 100KB would cause issues
  • ⚠️ Static context: Must restart the bot to update knowledge
  • ⚠️ Loads everything: Can't selectively retrieve relevant info
  • ⚠️ Token waste: Sends the full context even when only a small part is relevant

Recommendations

For Current 10KB Knowledge Base

Recommendation: Keep current system

Reasons:

  • Marginally faster (451ms vs 504ms)
  • Already working and stable
  • Simple architecture
  • Knowledge base is small enough for direct loading

For Future Growth (>50KB Knowledge Base)

Recommendation: Migrate to Cheshire Cat

Reasons:

  • RAG scales better with knowledge base size
  • 504ms TTFT is excellent and won't degrade much with more data
  • Can add new knowledge dynamically
  • Better semantic retrieval from large datasets
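Concretely, "semantic retrieval" means ranking stored chunks by vector similarity to the query embedding, which is what Qdrant does at scale over the bge embeddings in Cheshire Cat's declarative memory. A toy pure-Python version with made-up 3-d "embeddings", purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, docs, k=2):
    """Top-k documents by cosine similarity: the core operation a vector
    DB performs, here over a plain Python list."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 3-d "embeddings" for illustration only
docs = [
    ("Miku's favourite song",   [0.9, 0.1, 0.0]),
    ("Server moderation rules", [0.0, 0.2, 0.9]),
    ("Miku's friends list",     [0.8, 0.3, 0.1]),
]
print(retrieve([1.0, 0.2, 0.0], docs, k=2))
```

Unlike direct loading, only the top-k chunks reach the prompt, which is why TTFT stays flat as the knowledge base grows.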

Hybrid Approach (Advanced)

Consider using both:

  • Direct loading for core personality (small, always needed)
  • Cheshire Cat for extended knowledge (songs, friends, lore details)
  • Combine responses for best of both worlds
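The hybrid idea can be sketched as a prompt builder that always includes the small core persona (direct loading) and fills the remaining budget with RAG-retrieved chunks. The function name, persona text, and character budget are illustrative assumptions, not part of the existing setup:

```python
def build_hybrid_prompt(core_persona, retrieved_chunks, budget_chars=4000):
    """Always include the core persona; append retrieved chunks until the
    character budget is exhausted. Budget and names are illustrative."""
    parts = [core_persona]
    used = len(core_persona)
    for chunk in retrieved_chunks:
        if used + len(chunk) > budget_chars:
            break  # drop lower-ranked chunks rather than overflow the context
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)

prompt = build_hybrid_prompt(
    "You are Miku, a cheerful virtual singer.",
    ["Fact: favourite song is ...", "Fact: best friend is ..."],
    budget_chars=200,
)
```

Because the persona is always present, response style stays stable even when retrieval returns nothing useful.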

Migration Path (If Chosen)

Phase 1: Parallel Testing (1-2 weeks)

  • Run both systems side-by-side
  • Compare response quality
  • Monitor latency in production
  • Gather user feedback
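Phase 1 can be as simple as running the same queries through both backends and summarising latency. A minimal harness sketch — the backend callables are placeholders that would wrap the direct-loading path and the Cheshire Cat API in practice:

```python
import statistics
import time

def compare_backends(backends, queries):
    """Run identical queries through each backend and summarise latency.
    `backends` maps a name to any callable that answers a query string."""
    summary = {}
    for name, ask in backends.items():
        latencies = []
        for q in queries:
            start = time.perf_counter()
            ask(q)
            latencies.append((time.perf_counter() - start) * 1000)
        summary[name] = {
            "mean_ms": statistics.mean(latencies),
            "median_ms": statistics.median(latencies),
        }
    return summary

# Placeholder backends purely for illustration
result = compare_backends(
    {"direct": lambda q: q.upper(), "cheshire_cat": lambda q: q.lower()},
    ["hello", "world"],
)
```

Logging the raw responses alongside the latencies would also cover the response-quality comparison from the list above.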

Phase 2: Gradual Migration (2-4 weeks)

  • Start with non-critical features
  • Migrate DM responses first
  • Keep server responses on current system initially
  • Monitor error rates

Phase 3: Full Migration (1 week)

  • Switch all responses to Cheshire Cat
  • Decommission old context loading
  • Monitor performance

Phase 4: Optimization (Ongoing)

  • Tune RAG retrieval settings
  • Optimize embedding model
  • Add new knowledge dynamically
  • Explore GPU embeddings if needed

Technical Notes

Current Cheshire Cat Configuration

  • LLM: darkidol (llama-swap-amd)
  • Embedder: FastEmbed CPU (BAAI/bge-large-en-v1.5-quantized)
  • Vector DB: Qdrant v1.9.1
  • Knowledge: 3 files uploaded (~10KB total)
  • Plugin: Miku personality (custom)

Performance Settings

  • KV Cache: Offload to CPU DISABLED
  • Temperature: 0.8
  • Max Tokens: 150 (streaming)
  • Model: darkidol (uncensored Llama 3.1 8B)

Estimated Resource Usage

  • Cheshire Cat: ~500MB RAM, negligible CPU (moving embeddings to GPU could reduce CPU load further)
  • Qdrant: ~100MB RAM
  • Storage: ~50MB (embeddings + indices)
  • Total Overhead: ~600MB RAM, ~50MB disk

Conclusion

The KV cache optimization has transformed Cheshire Cat from unviable (1578ms) to viable (504ms) for voice chat. Both systems now perform excellently, with Cheshire Cat offering better scalability at a marginal 53ms latency cost.

  • For current needs: Stick with direct loading (simpler, proven)
  • For future growth: Cheshire Cat is now a strong option

The infrastructure is already set up and tested, so migration could happen whenever knowledge base growth demands it.


Benchmark Date: January 30, 2026
Optimization: KV cache offload to CPU disabled
Test Queries: 10 varied questions
Success Rate: 100% across all methods