# Cheshire Cat Test Environment for Miku Bot

This is a standalone test environment for evaluating Cheshire Cat AI as a potential memory/context system for the Miku Discord bot.

## 🎯 Goals

1. **Test performance** - Measure latency, overhead, and real-time viability
2. **Evaluate memory** - Compare RAG-based context retrieval vs full context loading
3. **Benchmark CPU impact** - Assess performance on AMD FX-6100
4. **Make an informed decision** - A data-driven choice on whether to integrate

## 📁 Directory Structure

```
cheshire-cat/
├── cat/                    # Cat data (created on first run)
│   ├── data/              # Cat's internal data
│   ├── plugins/           # Custom plugins
│   ├── static/            # Static assets
│   └── long_term_memory/  # Qdrant vector storage
├── .env                    # Environment configuration
├── docker-compose.test.yml # Docker setup
├── test_setup.py          # Initial setup script
├── benchmark_cat.py       # Comprehensive benchmarks
├── compare_systems.py     # Compare Cat vs current system
└── TEST_README.md         # This file
```

## 🚀 Quick Start

### 1. Prerequisites

- Docker and Docker Compose installed
- Miku bot's llama-swap service running
- Python 3.8+ with the `requests` library

```bash
pip3 install requests
```

### 2. Start Cheshire Cat

```bash
# From the cheshire-cat directory
docker-compose -f docker-compose.test.yml up -d
```

Wait ~30 seconds for services to start.
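`test_setup.py` already waits for the Cat on its own. If you want to script the wait yourself instead of relying on a fixed sleep, here is a minimal sketch that polls the health-check URL (`http://localhost:1865/`, listed under Useful Endpoints) using only the standard library. The names `http_ok` and `wait_for_cat`, the 60 s timeout, and the 2 s poll interval are hypothetical choices, not part of the test scripts.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are all OSError.
        return False


def wait_for_cat(
    url: str = "http://localhost:1865/",
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds; give up after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_cat()` right after `docker-compose ... up -d` returns `True` as soon as the Cat answers, or `False` after the timeout, in which case check `docker logs miku_cheshire_cat_test`.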

### 3. Configure and Test

```bash
# Run setup script (configures LLM, uploads knowledge base)
python3 test_setup.py
```

This will:
- ✅ Wait for Cat to be ready
- ✅ Configure Cat to use llama-swap
- ✅ Upload miku_lore.txt, miku_prompt.txt, miku_lyrics.txt
- ✅ Run test queries

### 4. Run Benchmarks

```bash
# Comprehensive performance benchmark
python3 benchmark_cat.py
```

This tests:
- Simple greetings (low complexity)
- Factual queries (medium complexity)
- Memory recall (high complexity)
- Voice chat simulation (rapid-fire queries)
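The internals of `benchmark_cat.py` aren't reproduced here. As a rough illustration of what one timing pass involves, the sketch below times a list of queries against any `send` callable. `time_queries` and `send` are hypothetical names: `send` stands in for whatever call actually delivers a message to the Cat (HTTP or WebSocket), and `pause_s` is an assumed knob for imitating rapid-fire voice turns.

```python
import time
from typing import Callable, List, Tuple


def time_queries(
    send: Callable[[str], object],
    queries: List[str],
    pause_s: float = 0.0,
) -> Tuple[List[float], int]:
    """Send each query in order; return (latencies_ms, error_count)."""
    latencies_ms: List[float] = []
    errors = 0
    for q in queries:
        start = time.perf_counter()
        try:
            send(q)
        except Exception:
            # Count failures (timeouts, HTTP errors, ...) instead of aborting.
            errors += 1
        else:
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
        # Short gap between turns; 0 means back-to-back queries.
        time.sleep(pause_s)
    return latencies_ms, errors
```

Only successful queries contribute latencies, so the error count doubles as the input for the success-rate row in the decision table.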

### 5. Compare with Current System

```bash
# Side-by-side comparison
python3 compare_systems.py
```

Compares latency between:
- 🐱 Cheshire Cat (RAG-based context)
- 📦 Current system (full context loading)
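`compare_systems.py` produces the real comparison; the core arithmetic behind a side-by-side latency comparison is a difference of means. A minimal sketch with a hypothetical `compare_latency` helper, assuming you already have per-query latencies (in ms) for both systems on the same prompts:

```python
from statistics import mean
from typing import Dict, List


def compare_latency(cat_ms: List[float], current_ms: List[float]) -> Dict[str, float]:
    """Summarize mean latency for both systems and the Cat's relative overhead."""
    cat_mean = mean(cat_ms)
    cur_mean = mean(current_ms)
    return {
        "cat_mean_ms": cat_mean,
        "current_mean_ms": cur_mean,
        # Positive = Cat is slower; negative = Cat is faster.
        "delta_ms": cat_mean - cur_mean,
        "delta_pct": 100.0 * (cat_mean - cur_mean) / cur_mean,
    }
```

A positive `delta_ms` is the extra latency that RAG retrieval costs per query; weigh it against any improvement in answer quality.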

## 🔍 What to Look For

### ✅ Good Signs (Proceed with Integration)

- Mean latency < 1500ms
- P95 latency < 2000ms
- Consistent performance across query types
- RAG retrieves relevant context accurately

### ⚠️ Warning Signs (Reconsider)

- Mean latency > 2000ms
- High variance (large stdev)
- RAG misses important context
- Frequent errors or timeouts

### ❌ Stop Signs (Don't Use)

- Mean latency > 3000ms
- P95 latency > 5000ms
- RAG retrieval quality is poor
- System crashes or hangs
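Only the latency bullets in these lists can be checked mechanically; RAG quality, error frequency, and crashes still need a human look. A sketch (hypothetical `classify_latency`) that maps mean and P95 onto the three lists, with an extra `"borderline"` label for the gap the lists leave open (for example, a mean between 1500 and 2000 ms):

```python
def classify_latency(mean_ms: float, p95_ms: float) -> str:
    """Map mean/P95 latency (ms) onto the good / warning / stop signs."""
    # Stop signs take priority: either threshold alone is disqualifying.
    if mean_ms > 3000 or p95_ms > 5000:
        return "stop"
    if mean_ms > 2000:
        return "warning"
    # Good signs require both conditions.
    if mean_ms < 1500 and p95_ms < 2000:
        return "good"
    # Falls between the "good" and "warning" thresholds.
    return "borderline"
```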

## 📊 Understanding the Results

### Latency Metrics

- **Mean**: Average response time
- **Median**: Middle value (less affected by outliers)
- **P95**: 95% of queries are faster than this
- **P99**: 99% of queries are faster than this
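Percentiles have several textbook definitions, and the benchmark scripts may use a different one. The sketch below uses the nearest-rank method (the P-th percentile is the smallest sample with at least P% of samples at or below it) together with Python's `statistics` module; `percentile` and `summarize` are hypothetical helper names, and at least one sample is assumed.

```python
import math
from statistics import mean, median, stdev
from typing import Dict, List


def percentile(sorted_ms: List[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Mean, median, standard deviation, P95 and P99 of latencies in ms."""
    data = sorted(latencies_ms)
    return {
        "mean": mean(data),
        "median": median(data),
        "stdev": stdev(data) if len(data) > 1 else 0.0,
        "p95": percentile(data, 95),
        "p99": percentile(data, 99),
    }
```

With small runs (say, 20 queries), nearest-rank P99 is simply the slowest query, so treat it as a rough signal rather than a precise figure.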

### Voice Chat Target

For real-time voice chat:
- Target: < 2000ms total latency
- Acceptable: 1000-1500ms mean
- Borderline: 1500-2000ms mean
- Too slow: > 2000ms mean

### FX-6100 Considerations

The FX-6100 may add noticeable per-query overhead (rough estimates):
- Embedding generation: ~600ms
- Vector search: ~100-200ms
- Total Cat overhead: ~800ms

**With GPU embeddings**, total Cat overhead drops to ~250ms.
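To see what these estimates leave for the model itself, subtract them from the 2000 ms voice-chat target. This treats the target as covering only Cat overhead plus LLM generation, which is a simplifying assumption (any speech-to-text or text-to-speech time would come out of the same budget). The constant and function names are illustrative:

```python
# Rough estimates from the list above, in milliseconds.
EMBEDDING_MS = 600
VECTOR_SEARCH_MS = (100, 200)  # low / high estimate
GPU_TOTAL_OVERHEAD_MS = 250    # quoted total with GPU embeddings
VOICE_TARGET_MS = 2000         # total-latency target for voice chat


def llm_budget_ms(overhead_ms: float, target_ms: float = VOICE_TARGET_MS) -> float:
    """Milliseconds left for LLM generation after the Cat's retrieval overhead."""
    return target_ms - overhead_ms


cpu_overhead_ms = (EMBEDDING_MS + VECTOR_SEARCH_MS[0], EMBEDDING_MS + VECTOR_SEARCH_MS[1])
print("CPU overhead:", cpu_overhead_ms)  # (700, 800)
print("LLM budget, CPU worst case:", llm_budget_ms(cpu_overhead_ms[1]))  # 1200
print("LLM budget, GPU embeddings:", llm_budget_ms(GPU_TOTAL_OVERHEAD_MS))  # 1750
```

So on the CPU path the LLM gets roughly 1200 ms at worst, versus roughly 1750 ms with GPU embeddings.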

## 🛠️ Troubleshooting

### Cat won't start

```bash
# Check logs
docker logs miku_cheshire_cat_test

# Check whether port 1865 is already in use
sudo netstat -tlnp | grep 1865
# Or, if net-tools isn't installed
sudo ss -tlnp | grep 1865
```

### Can't connect to llama-swap

The compose file tries to connect via:
1. External network: `miku-discord_default`
2. Host network: `host.docker.internal`

If both fail, check the llama-swap URL in `test_setup.py` and adjust it.
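To find out which of the two routes actually works, probe each candidate from where those hostnames resolve, i.e. inside the Cat container (for example, start `python3` via `docker exec -it miku_cheshire_cat_test python3`, assuming the image ships `python3`, and paste the code). Several details here are assumptions to replace with real values: the port `8080`, the service hostname `llama-swap` on `miku-discord_default`, and the OpenAI-style `/v1/models` path used as a cheap health probe.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, Optional

# ASSUMPTIONS: placeholder port and hostname; copy the real values from
# docker-compose.test.yml and test_setup.py.
LLAMA_SWAP_PORT = 8080
CANDIDATE_BASES = [
    f"http://llama-swap:{LLAMA_SWAP_PORT}",            # via miku-discord_default
    f"http://host.docker.internal:{LLAMA_SWAP_PORT}",  # via the host
]


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """True if GET `url` returns a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        return False


def first_reachable(
    bases: Iterable[str], probe: Callable[[str], bool] = http_ok
) -> Optional[str]:
    """Return the first base URL whose /v1/models endpoint answers, else None."""
    for base in bases:
        # /v1/models: assumed OpenAI-compatible model listing.
        if probe(base.rstrip("/") + "/v1/models"):
            return base
    return None
```

`first_reachable(CANDIDATE_BASES)` returns the working base URL to put in `test_setup.py`, or `None` if neither route reaches llama-swap.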

### Embeddings are slow

Try enabling GPU acceleration for embeddings in `docker-compose.test.yml` (requires spare VRAM).

### Knowledge upload fails

Upload the files manually via the admin panel:
- Open http://localhost:1865/admin
- Go to the "Rabbit Hole" tab
- Drag and drop the files

## 🔗 Useful Endpoints

- **Admin Panel**: http://localhost:1865/admin
- **API Docs**: http://localhost:1865/docs
- **Qdrant Dashboard**: http://localhost:6333/dashboard
- **Health Check**: http://localhost:1865/

## 📝 Decision Criteria

After running the benchmarks, record your results:

| Metric | Target | Your Result |
|--------|--------|-------------|
| Mean latency | < 1500ms | _____ ms |
| P95 latency | < 2000ms | _____ ms |
| Success rate | > 95% | _____ % |
| RAG accuracy | Good | _____ |

**Decision:**
- ✅ All targets met → **Integrate with bot**
- ⚠️ Some targets met → **Try GPU embeddings or hybrid approach**
- ❌ Targets not met → **Stick with current system**
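The decision rule can also be written down directly. A sketch with hypothetical names (`Results`, `decide`), reading "Targets not met" as "no targets met" (one possible interpretation of the list); RAG accuracy is your own judgment, passed in as a boolean:

```python
from dataclasses import dataclass


@dataclass
class Results:
    mean_ms: float
    p95_ms: float
    success_rate: float  # percent, 0-100
    rag_good: bool       # your assessment of retrieval quality


def decide(r: Results) -> str:
    """Apply the decision table: all, some, or none of the targets met."""
    checks = [
        r.mean_ms < 1500,
        r.p95_ms < 2000,
        r.success_rate > 95,
        r.rag_good,
    ]
    met = sum(checks)
    if met == len(checks):
        return "integrate with bot"
    if met > 0:
        return "try GPU embeddings or hybrid approach"
    return "stick with current system"
```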

## 🧹 Cleanup

```bash
# Stop services
docker-compose -f docker-compose.test.yml down

# Remove volumes (deletes all data)
docker-compose -f docker-compose.test.yml down -v
```

---

**Remember**: This is a test environment. Don't integrate with the production bot until you're confident in the results!