# RVC Container Build Fixes

## Summary
Built a working RVC Docker container (63.6 GB) on ROCm 6.4 with GPU support for the AMD RX 6800.

## Critical Issues and Solutions

### 1. PyTorch Version Override
**Problem**: Installing the requirements with pip replaced the base image's torch 2.5.1+git8420923 (ROCm build) with torch 2.8.0 (CUDA build)

**Root Cause**: Base image `rocm/pytorch:rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.5.1` ships a custom torch build that is not available on PyPI

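**Solution**: Created `constraints.txt` to pin the exact torch and torchvision builds shipped in the base image. The bare `torchaudio` line carries no version and so pins nothing; torchaudio is handled separately in issue 2: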
```text
torch==2.5.1+git8420923
torchvision==0.20.1a0+04d8fc4
torchaudio
```
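Because the pinned strings must match whatever the base image actually ships, it can be less error-prone to generate the file from the installed packages than to type the local version tags by hand. The following is a minimal sketch, assuming it runs inside the base image before any `pip install`; the helper names `constraint_lines` and `write_constraints` and the default package list are illustrative and not part of the original build:

```python
from importlib import metadata


def constraint_lines(packages, lookup=metadata.version):
    """Return 'name==version' lines for every package that is installed.

    `lookup` defaults to importlib.metadata.version; it is a parameter so
    the function can be tried on a machine without torch.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this image: leave it unconstrained.
            continue
    return lines


def write_constraints(path="constraints.txt", packages=("torch", "torchvision")):
    """Write the generated pins to `path` (hypothetical defaults)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(constraint_lines(packages)) + "\n")
```

Calling `write_constraints()` in the base image should reproduce the two pinned lines shown above, provided the image reports the same version strings.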

### 2. Torchaudio Compatibility
**Problem**: The standard PyPI build of torchaudio 2.5.1 links against CUDA libraries and crashes with "libtorch_cuda.so not found"

**Root Cause**: The PyTorch wheel index has no torchaudio 2.5.1+rocm6.4 build

**Solution**: Install torchaudio 2.5.1+rocm6.2, which loaded and ran against the ROCm 6.4 torch build in this container:
```dockerfile
RUN pip install --no-cache-dir torchaudio==2.5.1+rocm6.2 --index-url https://download.pytorch.org/whl/rocm6.2
```
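One way to confirm that the ROCm build of torch survived the install is to inspect `torch.version.hip`, which is a version string on ROCm builds and `None` on CUDA or CPU builds. The sketch below takes the module as a parameter so the logic can be tried without a GPU; `backend_summary` is a hypothetical helper name, and the stand-in version string is a placeholder rather than output from the real container:

```python
from types import SimpleNamespace


def backend_summary(torch_module):
    """Describe a torch build as 'rocm <ver>', 'cuda <ver>', or 'cpu'."""
    version = torch_module.version
    if getattr(version, "hip", None):
        return f"rocm {version.hip}"
    if getattr(version, "cuda", None):
        return f"cuda {version.cuda}"
    return "cpu"


# Stand-in for a ROCm build; "6.4.0" is a made-up placeholder value.
fake_rocm = SimpleNamespace(version=SimpleNamespace(hip="6.4.0", cuda=None))
print(backend_summary(fake_rocm))  # rocm 6.4.0
```

Inside the container, passing the real `torch` module to `backend_summary` and then running `import torchaudio` should be enough as a smoke test, since the `libtorch_cuda.so` failure typically surfaces as soon as torchaudio loads its native extension.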

### 3. scipy/numpy/numba Version Conflicts
**Problem**:
- scipy 1.10.1 was installed alongside numpy 1.21.2, producing an ABI mismatch
- The installed numba required numpy <1.23, while scipy required numpy >=1.19.5
- Upgrading scipy on its own broke numba

**Root Cause**: `requirements-rvc.txt` mixed version pins that came from separate dependency resolutions and had never been resolved together

**Solution**: Force-install the version set taken from the working bare-metal installation:
```bash
pip install --no-cache-dir numpy==1.23.5 scipy==1.15.3 numba==0.56.4
```
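Since a later `pip install` can silently move any of these pins again, a small check at the end of the build can catch drift early. A minimal sketch, using only the standard library; `EXPECTED` copies the pins from the command above, and `find_mismatches` is a hypothetical helper name:

```python
from importlib import metadata

# Pins copied from the pip command above.
EXPECTED = {"numpy": "1.23.5", "scipy": "1.15.3", "numba": "0.56.4"}


def find_mismatches(expected, lookup=metadata.version):
    """Return {name: (wanted, found)} for each package that is off its pin.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Running `assert not find_mismatches(EXPECTED)` as the last build step makes the image build fail instead of producing a container that breaks at import time. `pip check` complements this by validating declared dependency ranges rather than exact pins.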

### 4. apex C++ Extension Incompatibility
**Problem**: The apex `fused_layer_norm_cuda` extension failed to import with an undefined-symbol error:
```
ImportError: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
```

**Root Cause**: apex was compiled against PyTorch 2.3, and its C++ extension is binary-incompatible with 2.5.1

**Solution**: Remove apex (not needed for inference):
```dockerfile
RUN pip uninstall -y apex || true
```

## Final Dockerfile RUN Command

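The `|| true` is parenthesized so that it covers only the apex removal. Without the grouping, a failure in an earlier `pip install` would also be swallowed, because `&&` and `||` have equal precedence in the shell and associate left to right.

```dockerfile
RUN pip install --no-cache-dir pip==24.0 && \
    pip install --no-cache-dir -c constraints.txt -r requirements-rvc.txt && \
    (pip uninstall -y apex || true) && \
    pip install --no-cache-dir torchaudio==2.5.1+rocm6.2 --index-url https://download.pytorch.org/whl/rocm6.2 && \
    pip install --no-cache-dir numpy==1.23.5 scipy==1.15.3 numba==0.56.4
```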

## Docker Compose Configuration

### GPU Passthrough (ROCm)
```yaml
rvc:
  devices:
    - /dev/kfd:/dev/kfd
    - /dev/dri:/dev/dri
  group_add:
    - "989"  # render group (numeric for container compatibility)
    - "985"  # video group
  environment:
    - HSA_OVERRIDE_GFX_VERSION=10.3.0  # RX 6800 (gfx1030)
    - HSA_FORCE_FINE_GRAIN_PCIE=1
```
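The numeric IDs in `group_add` are host-specific: `989` and `985` are the values from the machine used here and will likely differ on another distribution. A minimal sketch for looking them up on the Docker host, assuming the groups are named `render` and `video` as in the comments above; `group_gids` is a hypothetical helper name:

```python
import grp


def group_gids(names=("render", "video"), lookup=grp.getgrnam):
    """Map each group name to its numeric GID (as a string for Compose).

    Groups that do not exist on this host are skipped.
    """
    gids = {}
    for name in names:
        try:
            gids[name] = str(lookup(name).gr_gid)
        except KeyError:
            # grp.getgrnam raises KeyError for unknown groups.
            continue
    return gids
```

Run `print(group_gids())` on the host, not inside the container, and copy the resulting values into `group_add`.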

## Verification

### Successful Startup Logs
```
2026-01-15 20:07:41 | INFO | configs.config | Found GPU AMD Radeon RX 6800
2026-01-15 20:07:41 | INFO | configs.config | Half-precision floating-point: True, device: cuda:0
2026-01-15 20:07:41 | INFO | __main__ | ✓ Connected to Soprano server at tcp://soprano:5555
2026-01-15 20:07:49 | INFO | __main__ | ✓ RVC model loaded (version: v2, target SR: 48000Hz)
2026-01-15 20:07:49 | INFO | __main__ | ✓ Pipeline ready! API accepting requests on port 8765
INFO:     Uvicorn running on http://0.0.0.0:8765 (Press CTRL+C to quit)
```

### Health Check
```bash
$ curl http://localhost:8765/health
{
  "status": "healthy",
  "soprano_connected": true,
  "rvc_initialized": true,
  "pipeline_ready": true
}
```
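For scripted deployments, the same check can be polled until the pipeline reports ready. The sketch below uses only the standard library and assumes the response shape shown above; the function names, the 120-second timeout, and the 2-second interval are illustrative choices, not values from the original setup:

```python
import json
import time
import urllib.request

REQUIRED_FLAGS = ("soprano_connected", "rvc_initialized", "pipeline_ready")


def is_ready(payload):
    """True when status is 'healthy' and every readiness flag is true."""
    return payload.get("status") == "healthy" and all(
        payload.get(flag) is True for flag in REQUIRED_FLAGS
    )


def wait_until_ready(url="http://localhost:8765/health", timeout_s=120, interval_s=2):
    """Poll `url` until is_ready() passes or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (OSError, ValueError):
            # Server not up yet, or the body was not valid JSON.
            pass
        time.sleep(interval_s)
    return False
```

`wait_until_ready()` returns `True` once all three flags are set and `False` on timeout, which makes it usable as a gate before sending synthesis requests.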

## Container Stats
- **Base Image**: rocm/pytorch:rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.5.1 (~12GB)
- **Final Size**: 63.6GB
- **Python**: 3.10.x
- **pip**: 24.0
- **PyTorch**: 2.5.1+git8420923 (ROCm 6.4)
- **Torchaudio**: 2.5.1+rocm6.2
- **GPU**: AMD RX 6800 (16GB VRAM, gfx1030)
- **Status**: ✅ Healthy and working

## Build Time
- Multi-stage build: ~75 minutes
- Applying a single fix by hand in the running container: ~2 minutes

## Lessons Learned

1. **Base image PyTorch versions are sacred** - Don't let pip "upgrade" them
2. **Constraints files are essential** for complex PyTorch environments
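3. **ROCm wheel versions don't have to match exactly** - the rocm6.2 torchaudio wheel worked with the ROCm 6.4 torch build here
4. **apex breaks when PyTorch changes underneath it** - Remove it when not needed (it is not needed for inference)
5. **Numeric group IDs** are required for GPU device access in containers, since host group names such as `render` may not exist inside the image
6. **Manual fixes in a running container** (~2 minutes) can confirm a solution before committing to a ~75-minute rebuild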
7. **Multi-stage builds** don't save much space when base image is large

## Next Steps
- [ ] Test GPU performance (target: >0.9x realtime)
- [ ] Verify end-to-end synthesis pipeline
- [ ] Archive builder stage to /4TB/Docker/
- [ ] Document complete deployment process