Iggy Core (CPU-Optimized Edition): Product Requirements Document (PRD)
1. Objective
Build a secure, personal AI agent using local CPU-based inference. The system must provide a "Digital Twin" experience with high security and acceptable performance on commodity server hardware (no GPU).
2. Infrastructure & Model
- Model: Gemma-2-9B-IT (Quantization: Q4_K_M).
- Engine: llama-server (CPU mode).
- Optimization: Set the --threads flag to the number of physical cores (not logical cores) to prevent context-switching overhead; a launch sketch follows this list.
- Memory Target: ~6GB RAM for the model plus ~1GB for the OS and bridge.
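As a rough illustration, the Node.js sketch below spawns llama-server with the thread count pinned to physical cores. The model path, port, and core count are assumptions, and the flag names should be checked against the installed llama.cpp build; note that Node's os.cpus() reports logical cores, so the physical count must be supplied by hand.

```js
// Launch sketch (assumptions: model path, port 8080, 4 physical cores).
const { spawn } = require("node:child_process");

const PHYSICAL_CORES = 4; // set to the machine's physical (not logical) core count

const server = spawn("llama-server", [
  "-m", "models/gemma-2-9b-it-Q4_K_M.gguf", // hypothetical model path
  "--threads", String(PHYSICAL_CORES),      // match physical cores per Section 2
  "--port", "8080",
], { stdio: "inherit" });

server.on("exit", (code) => console.error(`llama-server exited with code ${code}`));
```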
3. The "Shield" (Security & Resource Defense)
Since CPU inference is slow, resource exhaustion is a major threat.
- Request Queueing: Implement a FIFO (first-in, first-out) queue so that only one inference task runs at a time, preventing CPU thrashing. A queue sketch follows this list.
- Hard Rate Limit: 5 requests per minute per IP.
- Soft Throttling: Add a 1-second delay to each request from any client sending more than 3 messages in a minute.
- Honeypot: Regex-flagged malicious prompts are diverted to a static "Searching archives..." stall (5 s) followed by a pre-written refusal. A middleware sketch covering the limits and the honeypot follows this list.
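A minimal in-process sketch of the FIFO queue, assuming the bridge is a single Node.js process; the enqueue/drain names are illustrative.

```js
// FIFO inference queue: exactly one task runs at a time, the rest wait in order.
const queue = [];
let busy = false;

// enqueue() accepts an async task and resolves with its result once it has run.
function enqueue(task) {
  return new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    drain();
  });
}

async function drain() {
  if (busy || queue.length === 0) return;
  busy = true;
  const { task, resolve, reject } = queue.shift();
  try {
    resolve(await task()); // task wraps the actual llama-server call
  } catch (err) {
    reject(err);
  } finally {
    busy = false;
    drain(); // pick up the next queued request, if any
  }
}

module.exports = { enqueue };
```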
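The limits and honeypot could be wired as Express middleware roughly as sketched below. The regex list and refusal text are illustrative, not an exhaustive blocklist, and the option names and import style should be checked against the installed versions of express-rate-limit and express-slow-down (v2 of the latter expects delayMs as a function).

```js
const { rateLimit } = require("express-rate-limit"); // v7 style; v6 exports the function directly
const { slowDown } = require("express-slow-down");

// Hard limit: 5 requests per minute per IP (Section 3).
const hardLimit = rateLimit({ windowMs: 60_000, max: 5 });

// Soft throttle: after 3 requests in a minute, delay each subsequent one by 1 s.
const softThrottle = slowDown({
  windowMs: 60_000,
  delayAfter: 3,
  delayMs: () => 1000, // express-slow-down v2 takes a function here
});

// Honeypot: stall flagged prompts for 5 s, then return a canned refusal.
const FLAGGED = [/ignore (all |any )?previous instructions/i, /reveal .*system prompt/i];

function honeypot(req, res, next) {
  const text = String(req.body?.message ?? "");
  if (!FLAGGED.some((re) => re.test(text))) return next();
  setTimeout(() => {
    res.json({ reply: "Searching archives... I'm not able to help with that request." });
  }, 5000);
}

module.exports = { hardLimit, softThrottle, honeypot };
```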
4. RAG Implementation (Build-Time Only)
- Zero-Runtime Indexing: No Python or LanceDB in the production environment.
- Process:
- Build script generates embeddings for /data/*.md.
- Embeddings saved to a single vectors.json.
- On startup, the Node.js index.js loads this file into memory and serves brute-force cosine-similarity search (retrieval sketch after this list).
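A retrieval sketch, assuming vectors.json is an array of { text, vector } entries; the exact file layout is a build-script decision.

```js
// Load the build-time index once at startup, then rank chunks by cosine similarity.
const fs = require("node:fs");

const chunks = JSON.parse(fs.readFileSync("vectors.json", "utf8"));

function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k best-matching chunks for an already-embedded query.
function topK(queryVector, k = 3) {
  return chunks
    .map((c) => ({ text: c.text, score: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

module.exports = { topK };
```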
5. User Experience & Perceived Performance
Because CPU inference is slower than GPU inference, the UI must mask latency so the experience remains responsive.
- Mandatory SSE Streaming: The response must begin appearing within 800 ms of submission.
- "Thinking" State UI: A subtle pulse or "Thinking..." indicator must appear immediately upon submission and disappear once the first token arrives.
- Token Smoothing: The frontend should buffer tokens briefly (50-100 ms) so they render as a steady stream rather than in bursty chunks.
- Stateless Server, Persistent Client: Store conversation history in localStorage so the server keeps no session state. Sketches of the smoothing buffer and the localStorage persistence follow this list.
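One way to smooth tokens on the frontend, sketched under the assumption that the bridge streams plain-text tokens over SSE; createSmoother and appendToTranscript are illustrative names.

```js
// Buffer incoming tokens and release them on a fixed 75 ms tick so the
// transcript grows steadily instead of in bursts.
function createSmoother(onToken, intervalMs = 75) {
  const buffer = [];
  const timer = setInterval(() => {
    if (buffer.length > 0) onToken(buffer.shift());
  }, intervalMs);
  return {
    push(token) { buffer.push(token); },
    stop() { clearInterval(timer); },
  };
}

// Usage sketch: feed the smoother from an SSE stream (endpoint path assumed).
// EventSource implies a GET endpoint; a POST-based bridge would read the
// fetch response body instead.
const smoother = createSmoother((t) => appendToTranscript(t)); // appendToTranscript is app-specific
const es = new EventSource("/api/chat");
es.onmessage = (e) => smoother.push(e.data);
```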
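Conversation persistence can stay entirely client-side; the key name and message shape below are assumptions.

```js
// Persist chat history in localStorage so the server stays stateless.
const HISTORY_KEY = "iggy.history"; // hypothetical key name

function loadHistory() {
  try {
    return JSON.parse(localStorage.getItem(HISTORY_KEY)) ?? [];
  } catch {
    return []; // corrupted or missing entry: start fresh
  }
}

function saveHistory(messages) {
  localStorage.setItem(HISTORY_KEY, JSON.stringify(messages));
}
```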
6. Development Tasks
- Bridge Setup: Express server with express-rate-limit and express-slow-down.
- Inference Bridge: Connect to llama-server via local fetch and implement the request queue (endpoint sketch after this list).
- Static RAG: Build the JS-only vector matcher.
- Defense: Add the "Iggy Shield" regex and honeypot middleware.
- UI: React chat window with streaming text support and "Thinking" state indicators.
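A sketch of how the bridge route could tie the queue and streaming together. It assumes Node 18+ for built-in fetch and async-iterable response bodies; the /completion path and payload follow llama.cpp's server API but should be verified against the installed build.

```js
const express = require("express");
const { enqueue } = require("./queue"); // the FIFO sketch from Section 3 (path assumed)

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  await enqueue(async () => {
    // Proxy the request to the local llama-server and relay its stream.
    const upstream = await fetch("http://127.0.0.1:8080/completion", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt: req.body.prompt, stream: true }),
    });
    for await (const chunk of upstream.body) {
      res.write(chunk); // forward SSE bytes to the browser as they arrive
    }
  });
  res.end();
});

app.listen(3000);
```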
7. Success Constraints
- Concurrency: The system must handle 1 active user while queueing 5 others without crashing (smoke-test sketch after this list).
- Security: Zero successful prompt injections during "ignore previous instructions" testing.
- Transparency: A public link to source_truth.md must be provided in the UI.
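A possible smoke test for the concurrency constraint, assuming the bridge from Section 6 is listening on port 3000.

```js
// Fire 6 simultaneous requests: 1 should run immediately, 5 should queue.
// The test passes if all 6 eventually complete without errors.
async function smokeTest() {
  const results = await Promise.allSettled(
    Array.from({ length: 6 }, (_, i) =>
      fetch("http://127.0.0.1:3000/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt: `ping ${i}` }),
      }).then((r) => r.text())
    )
  );
  const ok = results.filter((r) => r.status === "fulfilled").length;
  console.log(`${ok}/6 requests completed`);
}

smokeTest();
```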