# Cheshire Cat RAG Viability - Post-Optimization Results

## Executive Summary

**Status: ✅ NOW VIABLE FOR VOICE CHAT**
After disabling KV cache offloading to CPU in llama-swap, Cheshire Cat's RAG approach is now competitive with direct context loading for real-time voice chat applications.
## Performance Comparison

### Time To First Token (TTFT) - Critical Metric for Voice Chat
| Method | Previous | Current | Improvement |
|---|---|---|---|
| 🐱 Cheshire Cat (RAG) | 1578ms ❌ | 504ms ✅ | +68% |
| 📄 Direct + Full Context | 904ms ✅ | 451ms ✅ | +50% |
| ⚡ Direct + Minimal | 210ms ✅ | 145ms ✅ | +31% |
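A minimal sketch of how TTFT can be measured against a streaming endpoint (the timing logic only, with the network call stubbed out as a token generator; the real numbers above come from the streaming benchmark scripts):

```python
import time
from typing import Iterable, Iterator


def measure_ttft_ms(token_stream: Iterable[str]) -> tuple[float, str]:
    """Return (time-to-first-token in ms, full response text).

    token_stream is any iterator yielding response chunks as they
    arrive, e.g. the delta contents of an OpenAI-compatible SSE stream.
    """
    start = time.perf_counter()
    it: Iterator[str] = iter(token_stream)
    first = next(it)  # blocks until the first token arrives
    ttft_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, first + "".join(it)


# Stub stream simulating ~50 ms until the first token arrives.
def fake_stream():
    time.sleep(0.05)
    yield "Hello"
    yield ", world"


ttft, text = measure_ttft_ms(fake_stream())
```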
### Total Generation Time
| Method | Previous | Current | Improvement |
|---|---|---|---|
| 🐱 Cheshire Cat | 10.5s | 4.2s | +60% |
| 📄 Direct + Full Context | 8.3s | 1.2s | +85% |
| ⚡ Direct + Minimal | 6.4s | 0.8s | +87% |
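The improvement percentages in both tables are percent reductions relative to the previous (slower) timing. As a quick sanity check against the TTFT rows:

```python
def improvement_pct(previous_ms: float, current_ms: float) -> int:
    """Percent reduction relative to the previous timing."""
    return round((previous_ms - current_ms) / previous_ms * 100)


# TTFT rows from the table above.
assert improvement_pct(1578, 504) == 68   # Cheshire Cat (RAG)
assert improvement_pct(904, 451) == 50    # Direct + Full Context
assert improvement_pct(210, 145) == 31    # Direct + Minimal
```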
## Voice Chat Viability Assessment

### Before Optimization
- ❌ Cheshire Cat: 1578ms - TOO SLOW
- ✅ Current System: 904ms - GOOD
- ✅ Minimal: 210ms - EXCELLENT
### After Optimization
- ✅ Cheshire Cat: 504ms - GOOD
- ✅ Current System: 451ms - EXCELLENT
- ✅ Minimal: 145ms - EXCELLENT
Target: <1000ms TTFT for voice chat. ✅ All methods now pass!
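The ratings above can be expressed as a small helper. The thresholds are inferred from the labels used in this report, not stated explicitly anywhere: roughly under 500ms reads as EXCELLENT, anything under the 1000ms voice-chat target as GOOD, and anything over it as TOO SLOW.

```python
VOICE_CHAT_TARGET_MS = 1000


def rate_ttft(ttft_ms: float) -> str:
    """Classify a mean TTFT the way this report labels it.

    Thresholds are inferred from the ratings in this report,
    not from any external specification.
    """
    if ttft_ms < 500:
        return "EXCELLENT"
    if ttft_ms < VOICE_CHAT_TARGET_MS:
        return "GOOD"
    return "TOO SLOW"


assert rate_ttft(145) == "EXCELLENT"   # Direct + Minimal
assert rate_ttft(504) == "GOOD"        # Cheshire Cat, post-optimization
assert rate_ttft(1578) == "TOO SLOW"   # Cheshire Cat, pre-optimization
```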
## Key Findings

### 1. Cheshire Cat is Now Competitive
- 504ms mean TTFT is excellent for voice chat
- Only 53ms slower than current approach (10% difference)
- Median TTFT: 393ms - even better than mean
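A median (393ms) below the mean (504ms) indicates a few slow outlier queries pulling the mean up. With Python's `statistics` module the two are computed as follows; the per-query samples below are illustrative, shaped to match the reported mean and median, and are not the actual benchmark data:

```python
import statistics

# Hypothetical per-query TTFT samples (ms) -- NOT the real benchmark data,
# just shaped so a few slow outliers pull the mean above the median.
ttft_samples = [360, 370, 380, 390, 392, 394, 400, 410, 800, 1144]

mean_ttft = statistics.mean(ttft_samples)      # 504.0
median_ttft = statistics.median(ttft_samples)  # 393.0
assert median_ttft < mean_ttft  # outliers inflate the mean, not the median
```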
### 2. All Systems Dramatically Improved
- Current system: 904ms → 451ms (2x faster)
- Cheshire Cat: 1578ms → 504ms (3x faster)
- Total generation times cut by 60-87% across the board
### 3. KV Cache Optimization Impact
Disabling CPU offloading provided:
- Faster token generation once model is warmed up
- Consistent low latency across queries
- Dramatic improvement in total response times
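Assuming the backend behind llama-swap is llama.cpp's `llama-server`, the relevant switch is `--no-kv-offload` (`-nkvo`): when present, the KV cache stays in system RAM; removing it lets the cache live in VRAM alongside the offloaded layers. A sketch of the corresponding llama-swap model entry (model path and flags are illustrative, not the actual config):

```yaml
# llama-swap config.yaml sketch -- model path and flags are illustrative.
# What matters is that --no-kv-offload is NOT passed, so llama.cpp keeps
# the KV cache in VRAM instead of offloading it to system RAM.
models:
  "darkidol":
    cmd: |
      llama-server --model /models/darkidol-8b.gguf
      --port ${PORT} --n-gpu-layers 99
```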
## Trade-offs Analysis

### Cheshire Cat (RAG) Advantages

- ✅ Scalability: Can handle much larger knowledge bases (100s of MB)
- ✅ Dynamic Updates: Add new context without reloading the bot
- ✅ Memory Efficiency: Only loads relevant context (not the entire 10KB every time)
- ✅ Semantic Search: Better at finding relevant info in large datasets
- ✅ Now Fast Enough: 504ms TTFT is excellent for voice chat
### Cheshire Cat Disadvantages

- ⚠️ Slightly slower (53ms) than direct loading
- ⚠️ More complex infrastructure (Qdrant, embeddings)
- ⚠️ Requires Docker container management
- ⚠️ Learning curve for plugin development
### Current System (Direct Loading) Advantages

- ✅ Simplest approach: Load context, query LLM
- ✅ Slightly faster: 451ms vs 504ms (10% faster)
- ✅ No external dependencies: Just llama-swap
- ✅ Proven and stable: Already working in production
### Current System Disadvantages

- ⚠️ Not scalable: 10KB of context works, but 100KB would cause issues
- ⚠️ Static context: Must restart the bot to update knowledge
- ⚠️ Loads everything: Can't selectively retrieve relevant info
- ⚠️ Token waste: Sends the full context even when only a small part is relevant
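The token-waste point can be made concrete with rough arithmetic, assuming the common ~4 characters-per-token rule of thumb (chunk count and size are illustrative, not measured):

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb for English text


def approx_tokens(num_chars: int) -> int:
    """Crude token estimate from character count."""
    return num_chars // CHARS_PER_TOKEN


# Direct loading: the whole 10KB knowledge file rides along on every query.
full_context_tokens = approx_tokens(10 * 1024)   # ~2560 tokens

# RAG: only a few retrieved chunks, e.g. three ~512-char passages.
rag_chunk_tokens = approx_tokens(3 * 512)        # ~384 tokens

assert full_context_tokens > 6 * rag_chunk_tokens
```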
## Recommendations

### For Current 10KB Knowledge Base

**Recommendation: Keep current system**
Reasons:
- Marginally faster (451ms vs 504ms)
- Already working and stable
- Simple architecture
- Knowledge base is small enough for direct loading
### For Future Growth (>50KB Knowledge Base)

**Recommendation: Migrate to Cheshire Cat**
Reasons:
- RAG scales better with knowledge base size
- 504ms TTFT is excellent and won't degrade much with more data
- Can add new knowledge dynamically
- Better semantic retrieval from large datasets
### Hybrid Approach (Advanced)
Consider using both:
- Direct loading for core personality (small, always needed)
- Cheshire Cat for extended knowledge (songs, friends, lore details)
- Combine responses for best of both worlds
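A sketch of what the hybrid assembly could look like. `retrieve_extended_knowledge` is a hypothetical placeholder standing in for a Cheshire Cat memory lookup, not a real API call:

```python
# Small, always-needed core personality, loaded directly (no RAG round-trip).
CORE_PERSONALITY = "You are Miku. Stay in character: playful, musical, kind."


def retrieve_extended_knowledge(query: str) -> list[str]:
    """Hypothetical stub for a RAG lookup (e.g. Cheshire Cat memory recall).

    In the real system this would query the RAG backend; here it returns
    canned chunks so the assembly logic can be shown end to end.
    """
    return ["Knowledge chunk about songs", "Knowledge chunk about friends"]


def build_prompt(user_message: str) -> str:
    """Combine the directly loaded core with retrieved extended knowledge."""
    chunks = retrieve_extended_knowledge(user_message)
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        f"{CORE_PERSONALITY}\n\n"
        f"Relevant knowledge:\n{context}\n\n"
        f"User: {user_message}"
    )


prompt = build_prompt("What's your favourite song?")
```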
## Migration Path (If Chosen)

### Phase 1: Parallel Testing (1-2 weeks)
- Run both systems side-by-side
- Compare response quality
- Monitor latency in production
- Gather user feedback
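Phase 1 can be driven by a small harness that sends the same queries to both backends and records latency. Passing the backends in as callables keeps the harness independent of either API; this is a sketch, not one of the benchmark scripts listed above:

```python
import time
from typing import Callable


def compare_backends(
    queries: list[str],
    backends: dict[str, Callable[[str], str]],
) -> dict[str, list[float]]:
    """Send each query to every backend; record wall-clock latency in ms."""
    latencies: dict[str, list[float]] = {name: [] for name in backends}
    for query in queries:
        for name, ask in backends.items():
            start = time.perf_counter()
            ask(query)  # response quality would also be logged and compared
            latencies[name].append((time.perf_counter() - start) * 1000.0)
    return latencies


# Stub backends standing in for the two real systems.
results = compare_backends(
    ["hi", "what's new?"],
    {
        "direct": lambda q: f"direct answer to {q}",
        "cheshire_cat": lambda q: f"rag answer to {q}",
    },
)
```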
### Phase 2: Gradual Migration (2-4 weeks)
- Start with non-critical features
- Migrate DM responses first
- Keep server responses on current system initially
- Monitor error rates
### Phase 3: Full Migration (1 week)
- Switch all responses to Cheshire Cat
- Decommission old context loading
- Monitor performance
### Phase 4: Optimization (Ongoing)
- Tune RAG retrieval settings
- Optimize embedding model
- Add new knowledge dynamically
- Explore GPU embeddings if needed
## Technical Notes

### Current Cheshire Cat Configuration
- LLM: darkidol (llama-swap-amd)
- Embedder: FastEmbed CPU (BAAI/bge-large-en-v1.5-quantized)
- Vector DB: Qdrant v1.9.1
- Knowledge: 3 files uploaded (~10KB total)
- Plugin: Miku personality (custom)
### Performance Settings
- KV Cache: Offload to CPU DISABLED ✅
- Temperature: 0.8
- Max Tokens: 150 (streaming)
- Model: darkidol (uncensored Llama 3.1 8B)
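These settings map onto an OpenAI-compatible chat completion request against llama-swap. A hedged sketch of the payload; the endpoint URL is an assumption about this particular setup:

```python
# Assumed llama-swap endpoint; adjust host/port for the actual deployment.
LLAMA_SWAP_URL = "http://localhost:8080/v1/chat/completions"

# Request payload mirroring the performance settings above.
payload = {
    "model": "darkidol",
    "temperature": 0.8,
    "max_tokens": 150,
    "stream": True,  # streaming is what makes the TTFT numbers meaningful
    "messages": [{"role": "user", "content": "Say hi!"}],
}
# e.g. requests.post(LLAMA_SWAP_URL, json=payload, stream=True)
```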
### Estimated Resource Usage
- Cheshire Cat: ~500MB RAM, negligible CPU (GPU embeddings could reduce further)
- Qdrant: ~100MB RAM
- Storage: ~50MB (embeddings + indices)
- Total Overhead: ~600MB RAM, ~50MB disk
## Conclusion
The KV cache optimization has transformed Cheshire Cat from unviable (1578ms) to viable (504ms) for voice chat. Both systems now perform excellently, with Cheshire Cat offering better scalability at a marginal 53ms latency cost.
- For current needs: Stick with direct loading (simpler, proven)
- For future growth: Cheshire Cat is now a strong option
The infrastructure is already set up and tested, so migration could happen whenever knowledge base growth demands it.
- Benchmark Date: January 30, 2026
- Optimization: KV cache offload to CPU disabled
- Test Queries: 10 varied questions
- Success Rate: 100% across all methods