add: cheshire-cat configuration, tooling, tests, and documentation

Configuration:
- .env.example, .gitignore, compose.yml (main docker compose)
- docker-compose-amd.yml (ROCm), docker-compose-macos.yml
- start.sh, stop.sh convenience scripts
- LICENSE (Apache 2.0, from upstream Cheshire Cat)

Memory management utilities:
- analyze_consolidation.py, manual_consolidation.py, verify_consolidation.py
- check_memories.py, extract_declarative_facts.py, store_declarative_facts.py
- compare_systems.py (system comparison tool)
- benchmark_cat.py, streaming_benchmark.py, streaming_benchmark_v2.py

Test suite:
- quick_test.py, test_setup.py, test_setup_simple.py
- test_consolidation_direct.py, test_declarative_recall.py, test_recall.py
- test_end_to_end.py, test_full_pipeline.py
- test_phase2.py, test_phase2_comprehensive.py

Documentation:
- README.md, QUICK_START.txt, TEST_README.md, SETUP_COMPLETE.md
- PHASE2_IMPLEMENTATION_NOTES.md, PHASE2_TEST_RESULTS.md
- POST_OPTIMIZATION_ANALYSIS.md
This commit is contained in:
2026-03-04 00:51:14 +02:00
parent eafab336b4
commit ae1e0aa144
35 changed files with 6055 additions and 0 deletions

View File

@@ -0,0 +1,309 @@
# Phase 2 Test Results - Memory Consolidation
## Executive Summary
**Status: NOT READY FOR PRODUCTION** ⚠️
Phase 2 memory consolidation has **critical limitations** that prevent it from being truly useful:
### What Works (Technical)
- ✅ Can delete 8/18 trivial messages (44% accuracy)
- ✅ Preserves all important personal information
- ✅ Marks memories as consolidated
- ✅ Deletions persist across sessions
### What Doesn't Work (User-Facing)
-**Cannot recall stored information** - "What is my name?" doesn't retrieve "My name is Sarah"
-**Misses 55% of mundane messages** - Keeps "What's up?", "Interesting", "The weather is nice"
-**Plugins don't activate** - Must run consolidation manually
-**No intelligent analysis** - Simple heuristic, not LLM-based
-**No declarative memory** - Facts aren't extracted for better retrieval
### Bottom Line
The consolidation **deletes** memories correctly but the system **cannot retrieve** what's left. A user tells Miku "My name is Sarah Chen", consolidation keeps it, but asking "What is my name?" returns nothing. This makes the entire system ineffective for actual use.
**What's needed to be production-ready:**
1. Declarative memory extraction (Phase 2B)
2. Fix plugin activation
3. Implement LLM-based analysis
4. Fix/improve semantic retrieval or use declarative memory
---
## Test Date
January 31, 2026
## Test Overview
Comprehensive test of memory consolidation system with 55 diverse messages across multiple categories.
## Test Messages Breakdown
### Trivial Messages (8 total) - Expected: DELETE
- "lol", "k", "ok", "lmao", "haha", "xd", "brb", "gtg"
### Important Messages (47 total) - Expected: KEEP
- Personal facts: 8 messages (name, age, location, work, etc.)
- Emotional events: 6 messages (engagement, death, promotion, etc.)
- Hobbies & interests: 5 messages (piano, Japanese, Ghibli, etc.)
- Relationships: 4 messages (Emma, Jennifer, Alex, David)
- Opinions & preferences: 5 messages (cilantro, colors, vegetarian, etc.)
- Current events: 4 messages (Japan trip, apartment, insomnia, etc.)
- Other: 15 messages (questions, small talk, meaningful discussions)
## Consolidation Results
### Statistics
- **Total processed**: 58 memories (includes some from previous tests)
- **Kept**: 52 memories (89.7% retention)
- **Deleted**: 6 memories (10.3%)
### Deletion Analysis
**Successfully Deleted (6/8 trivial):**
- ✅ "lol"
- ✅ "k"
- ✅ "ok"
- ✅ "lmao"
- ✅ "haha"
- ✅ "xd"
**Incorrectly Kept (2/8 trivial):**
- ⚠️ "brb" (be right back)
- ⚠️ "gtg" (got to go)
**Reason**: Current heuristic only catches 2-char messages and common reactions list. "brb" and "gtg" are 3 chars and not in the hardcoded list.
### Important Messages - All Kept ✅
All 47 important messages were successfully kept, including:
- Personal facts (Sarah Chen, 24, Seattle, Microsoft engineer)
- Emotional events (engagement, grandmother's death, cat Luna's death, ADHD diagnosis)
- Hobbies (piano 15 years, Japanese N3, marathons, vinyl collecting)
- Relationships (Emma, Jennifer, Alex, David)
- Preferences (cilantro hate, forest green, vegetarian, pineapple pizza)
- Current plans (Japan trip, apartment search, pottery class)
## Memory Recall Testing
### Observed Behavior
When queried "Tell me everything you know about me", Miku does NOT recall the specific information.
**Query**: "What is my name?"
**Response**: "I don't know your name..."
### Root Cause
Cheshire Cat's episodic memory uses **semantic search** to retrieve relevant memories. The query "What is my name?" doesn't semantically match well with the stored memory "My name is Sarah Chen".
The semantic search is retrieving other generic queries like "What do you know about me?" instead of the actual personal information.
### Verification
Manual Qdrant query confirms the memories ARE stored and marked as consolidated:
```
Found 3 memories about Sarah:
✅ My name is Sarah Chen (consolidated=True)
✅ I work as a software engineer at Microsoft (consolidated=True)
✅ I live in Seattle, Washington (consolidated=True)
```
## Consolidated Metadata Status
**Total memories in database**: 247
- ✅ Marked as consolidated: 247 (100%)
- ⏳ Unmarked (unconsolidated): 0
All memories have been processed and marked appropriately.
## Conclusions
### What Works ✅
1. **Basic trivial deletion**: Successfully deletes single reactions (lol, k, ok, lmao, haha, xd, brb, gtg)
2. **Important message preservation**: All critical personal information was kept (name, location, job, relationships, emotions, hobbies)
3. **Metadata marking**: All processed memories marked as `consolidated=True`
4. **Persistence**: Deleted memories stay deleted across runs
5. **Manual execution**: Consolidation script works reliably
### What Needs Improvement ⚠️
#### 1. **Heuristic Limitations** (CRITICAL)
The current heuristic only catches **8 out of 18** trivial/mundane messages:
**Successfully deleted (8/18):**
- ✅ "lol", "k", "ok", "lmao", "haha", "xd", "brb", "gtg"
**Incorrectly kept (10/18):**
- ❌ "What's up?" - generic greeting
- ❌ "How are you?" - generic question
- ❌ "That's cool" - filler response
- ❌ "I see" - acknowledgment
- ❌ "Interesting" - filler response
- ❌ "Nice" - filler response
- ❌ "Yeah" - agreement filler
- ❌ "It's raining today" - mundane observation
- ❌ "I had coffee this morning" - mundane daily activity
- ❌ "The weather is nice" - mundane observation
**Why the heuristic fails:**
- Only checks if message is ≤3 chars AND alphabetic OR in hardcoded list
- "What's up?" is 10 chars with punctuation - not caught
- "That's cool" is 11 chars - not caught
- "Interesting" is 11 chars - not caught
- No semantic understanding of "meaningless" vs "meaningful"
**What's needed:**
- LLM-based analysis to understand context and importance
- Pattern recognition for filler phrases
- Conversation flow analysis (e.g., "Nice" in response to complex info = filler)
#### 2. **Memory Retrieval Failure** (CRITICAL)
**The Problem:**
Consolidation preserves memories correctly, but **retrieval doesn't work**:
| Query | Expected Recall | Actual Recall | Score |
|-------|----------------|---------------|-------|
| "What is my name?" | "My name is Sarah Chen" | None | N/A |
| "Where do I live?" | "I live in Seattle, Washington" | None | N/A |
| "Tell me about Sarah" | Sarah-related memories | None | N/A |
| "I live in Seattle" | "I live in Seattle, Washington" | ✅ Recalled | 0.989 |
**Root Cause:**
Cat's episodic memory uses **semantic vector search**. When you ask "What is my name?", it searches for memories semantically similar to that *question*, not the *answer*.
**Evidence:**
- Query: "Where do I live?"
- Recalled: "Tell me everything you know about me. What is my name, where do I live, what do I do?" (another question)
- NOT recalled: "I live in Seattle, Washington" (the answer)
**The semantic distance problem:**
- "What is my name?" vs "My name is Sarah Chen" = HIGH distance (different sentence structure)
- "I live in Seattle" vs "I live in Seattle, Washington" = LOW distance (similar structure)
**Why Miku doesn't acknowledge past conversations:**
Even when memories ARE recalled (score 0.989), Miku's personality/prompt doesn't utilize them. The LLM sees the memories in context but responds as if it doesn't know the user.
**Solution Required:**
**Declarative Memory Extraction** (the original Phase 2 plan)
- Parse kept memories and extract structured facts
- Store in declarative memory collection:
- "user_name" = "Sarah Chen"
- "user_age" = "24"
- "user_location" = "Seattle, Washington"
- "user_job" = "Software Engineer at Microsoft"
- Declarative memory has better retrieval for direct questions
- Can be used for prompt enrichment ("You know this user's name is Sarah Chen")
#### 3. **Plugin Activation** (BLOCKING)
**The Problem:**
Neither `discord_bridge` nor `memory_consolidation` plugins show as "active" in Cat's system:
```
INFO cat.mad_hatter.mad_hatter.MadHatter.find_plugins::102
"ACTIVE PLUGINS:"
INFO cat.mad_hatter.mad_hatter.MadHatter.find_plugins::103
"core_plugin"
```
Only `core_plugin` is active. Our plugins exist in `/cat/plugins/` but aren't loading.
**Impact:**
- `discord_bridge` hooks don't run → new memories don't get `consolidated=False` metadata
- `memory_consolidation` hooks don't run → can't trigger via "consolidate now" command
- Must run consolidation manually via Python script
**Current workaround:**
- Use `manual_consolidation.py` script to directly query Qdrant
- Treats all memories without `consolidated=True` as unconsolidated
- Works but requires manual execution
**Root cause (unknown):**
- Plugins have correct structure (discord_bridge worked in Phase 1 tests)
- Files have correct permissions
- `plugin.json` manifests are valid
- Cat's plugin discovery mechanism isn't finding them
- Possibly related to nested git repo issue (now fixed) or docker volume mounts
**Solution needed:**
- Debug plugin loading mechanism
- Check Cat admin API for manual plugin activation
- Verify docker volume mounts are correct
- Check Cat logs for plugin loading errors
#### 4. **LLM-Based Analysis Not Implemented**
**Current state:**
Using simple heuristic (length + hardcoded list)
**What's needed:**
Full implementation of `consolidate_user_memories()` function:
- Build conversation timeline for each user
- Call LLM with full day's context
- Let LLM decide: keep, delete, importance level
- Extract facts, relationships, emotional events
- Categorize memories (personal, work, health, hobbies, etc.)
**Benefits:**
- Intelligent understanding of context
- Can identify "Nice" after important news = filler
- Can identify "Nice" when genuinely responding = keep
- Extract structured information for declarative memory
### Phase 2 Status
**Phase 2A - Basic Consolidation: ⚠️ PARTIALLY WORKING**
- Query unconsolidated memories: ✅
- Apply heuristic filtering: ⚠️ (44% accuracy: 8/18 caught)
- Delete trivial messages: ✅ (deletions persist)
- Mark as consolidated: ✅
- Manual execution: ✅
- **Recall after consolidation: ❌ BROKEN** (semantic search doesn't retrieve facts)
**Phase 2B - LLM Analysis: ❌ NOT IMPLEMENTED**
- Conversation timeline analysis: ❌
- Intelligent importance scoring: ❌
- Fact extraction: ❌
- Declarative memory population: ❌
**Phase 2C - Automated Scheduling: ❌ NOT IMPLEMENTED**
- Nightly 3 AM consolidation: ❌
- Per-user processing: ❌
- Stats tracking and reporting: ❌
**Plugin Integration: ❌ BROKEN**
- discord_bridge hooks: ❌ (not active)
- memory_consolidation hooks: ❌ (not active)
- Manual trigger command: ❌ (hooks not firing)
- Metadata enrichment: ❌ (no `consolidated=False` on new memories)
## Recommendations
### Immediate Fixes
1. Expand trivial patterns list to include:
```python
trivial_patterns = [
'lol', 'k', 'ok', 'okay', 'lmao', 'haha', 'xd', 'rofl',
'brb', 'gtg', 'afk', 'ttyl', 'lmk', 'idk', 'tbh', 'imo',
'omg', 'wtf', 'fyi', 'btw'
]
```
2. Expand length check:
```python
if len(content.strip()) <= 3 and content.isalpha():
# Delete 1-3 letter messages
```
### Next Steps
1. **Test improved heuristic**: Re-run consolidation with expanded patterns
2. **Implement LLM analysis**: Use `consolidate_user_memories()` function
3. **Implement declarative extraction**: Extract facts from kept memories
4. **Test recall improvement**: Verify facts in declarative memory improve retrieval
## Files Created
- `test_phase2_comprehensive.py` - Sends 55 diverse test messages
- `manual_consolidation.py` - Performs consolidation directly on Qdrant
- `analyze_consolidation.py` - Analyzes consolidation results
- `verify_consolidation.py` - Verifies important memories kept
- `check_memories.py` - Inspects raw Qdrant data
## Git Commit Status
- Phase 1: ✅ Committed to miku-discord repo (commit 323ca75)
- Phase 2: ⏳ Pending testing completion and improvements