Iggy Core (CPU-Optimized Edition): Product Requirements Document (PRD)
1. Objective
Build a secure, personal AI agent using local CPU-based inference. The system must provide a "Digital Twin" experience with high security and acceptable performance on commodity server hardware (no GPU).
2. Infrastructure & Model
- Model: Gemma-2-9B-IT (Quantization: Q4_K_M).
- Engine: llama-server (CPU mode).
- Optimization: Set the --threads flag to the number of physical cores (not logical cores) to prevent context-switching overhead; a launch sketch follows this list.
- Memory Target: ~6GB RAM for the model plus ~1GB for the OS and bridge.
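As a rough illustration, the Node.js sketch below spawns llama-server with the thread count pinned to physical cores. The model path, port, and core count are assumptions, and the flag names should be checked against the installed llama.cpp build; note that Node's os.cpus() reports logical cores, so the physical count must be supplied by hand.

```js
// Launch sketch (assumptions: model path, port 8080, 4 physical cores).
const { spawn } = require("node:child_process");

const PHYSICAL_CORES = 4; // set to the machine's physical (not logical) core count

const server = spawn("llama-server", [
  "-m", "models/gemma-2-9b-it-Q4_K_M.gguf", // hypothetical model path
  "--threads", String(PHYSICAL_CORES),      // match physical cores per Section 2
  "--port", "8080",
], { stdio: "inherit" });

server.on("exit", (code) => console.error(`llama-server exited with code ${code}`));
```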
3. The "Shield" (Security & Resource Defense)
Since CPU inference is slow, resource exhaustion is a major threat.
- Request Queueing: Implement a FIFO (first-in, first-out) queue so that only one inference task runs at a time, preventing CPU thrashing. A queue sketch follows this list.
- Hard Rate Limit: 5 requests per minute per IP.
- Soft Throttling: Add a 1-second delay to each request from any client sending more than 3 messages in a minute.
- Honeypot: Regex-flagged malicious prompts are diverted to a static "Searching archives..." stall (5 s) followed by a pre-written refusal. A middleware sketch covering the limits and the honeypot follows this list.
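A minimal in-process sketch of the FIFO queue, assuming the bridge is a single Node.js process; the enqueue/drain names are illustrative.

```js
// FIFO inference queue: exactly one task runs at a time, the rest wait in order.
const queue = [];
let busy = false;

// enqueue() accepts an async task and resolves with its result once it has run.
function enqueue(task) {
  return new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    drain();
  });
}

async function drain() {
  if (busy || queue.length === 0) return;
  busy = true;
  const { task, resolve, reject } = queue.shift();
  try {
    resolve(await task()); // task wraps the actual llama-server call
  } catch (err) {
    reject(err);
  } finally {
    busy = false;
    drain(); // pick up the next queued request, if any
  }
}

module.exports = { enqueue };
```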
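The limits and honeypot could be wired as Express middleware roughly as sketched below. The regex list and refusal text are illustrative, not an exhaustive blocklist, and the option names and import style should be checked against the installed versions of express-rate-limit and express-slow-down (v2 of the latter expects delayMs as a function).

```js
const { rateLimit } = require("express-rate-limit"); // v7 style; v6 exports the function directly
const { slowDown } = require("express-slow-down");

// Hard limit: 5 requests per minute per IP (Section 3).
const hardLimit = rateLimit({ windowMs: 60_000, max: 5 });

// Soft throttle: after 3 requests in a minute, delay each subsequent one by 1 s.
const softThrottle = slowDown({
  windowMs: 60_000,
  delayAfter: 3,
  delayMs: () => 1000, // express-slow-down v2 takes a function here
});

// Honeypot: stall flagged prompts for 5 s, then return a canned refusal.
const FLAGGED = [/ignore (all |any )?previous instructions/i, /reveal .*system prompt/i];

function honeypot(req, res, next) {
  const text = String(req.body?.message ?? "");
  if (!FLAGGED.some((re) => re.test(text))) return next();
  setTimeout(() => {
    res.json({ reply: "Searching archives... I'm not able to help with that request." });
  }, 5000);
}

module.exports = { hardLimit, softThrottle, honeypot };
```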
4. RAG Implementation (Build-Time Only)
- Zero-Runtime Indexing: No Python or LanceDB in the production environment.
- Process:
- Build script generates embeddings for /data/*.md.
- Embeddings saved to a single vectors.json.
- On startup, the Node.js index.js loads this file into memory and serves brute-force cosine-similarity search (retrieval sketch after this list).
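A retrieval sketch, assuming vectors.json is an array of { text, vector } entries; the exact file layout is a build-script decision.

```js
// Load the build-time index once at startup, then rank chunks by cosine similarity.
const fs = require("node:fs");

const chunks = JSON.parse(fs.readFileSync("vectors.json", "utf8"));

function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k best-matching chunks for an already-embedded query.
function topK(queryVector, k = 3) {
  return chunks
    .map((c) => ({ text: c.text, score: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

module.exports = { topK };
```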
5. User Experience & Perceived Performance
Because CPU inference is slower than GPU inference, the UI must mask latency so the experience remains responsive.
- Mandatory SSE Streaming: The response must begin appearing within 800 ms of submission.
- "Thinking" State UI: A subtle pulse or "Thinking..." indicator must appear immediately upon submission and disappear once the first token arrives.
- Token Smoothing: The frontend should buffer tokens briefly (50-100 ms) so they render as a steady stream rather than in bursty chunks.
- Stateless Server, Persistent Client: Store conversation history in localStorage so the server keeps no session state. Sketches of the smoothing buffer and the localStorage persistence follow this list.
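One way to smooth tokens on the frontend, sketched under the assumption that the bridge streams plain-text tokens over SSE; createSmoother and appendToTranscript are illustrative names.

```js
// Buffer incoming tokens and release them on a fixed 75 ms tick so the
// transcript grows steadily instead of in bursts.
function createSmoother(onToken, intervalMs = 75) {
  const buffer = [];
  const timer = setInterval(() => {
    if (buffer.length > 0) onToken(buffer.shift());
  }, intervalMs);
  return {
    push(token) { buffer.push(token); },
    stop() { clearInterval(timer); },
  };
}

// Usage sketch: feed the smoother from an SSE stream (endpoint path assumed).
// EventSource implies a GET endpoint; a POST-based bridge would read the
// fetch response body instead.
const smoother = createSmoother((t) => appendToTranscript(t)); // appendToTranscript is app-specific
const es = new EventSource("/api/chat");
es.onmessage = (e) => smoother.push(e.data);
```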
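Conversation persistence can stay entirely client-side; the key name and message shape below are assumptions.

```js
// Persist chat history in localStorage so the server stays stateless.
const HISTORY_KEY = "iggy.history"; // hypothetical key name

function loadHistory() {
  try {
    return JSON.parse(localStorage.getItem(HISTORY_KEY)) ?? [];
  } catch {
    return []; // corrupted or missing entry: start fresh
  }
}

function saveHistory(messages) {
  localStorage.setItem(HISTORY_KEY, JSON.stringify(messages));
}
```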
6. Development Tasks
- Bridge Setup: Express server with express-rate-limit and express-slow-down.
- Inference Bridge: Connect to llama-server via local fetch and implement the request queue (endpoint sketch after this list).
- Static RAG: Build the JS-only vector matcher.
- Defense: Add the "Iggy Shield" regex and honeypot middleware.
- UI: React chat window with streaming text support and "Thinking" state indicators.
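A sketch of how the bridge route could tie the queue and streaming together. It assumes Node 18+ for built-in fetch and async-iterable response bodies; the /completion path and payload follow llama.cpp's server API but should be verified against the installed build.

```js
const express = require("express");
const { enqueue } = require("./queue"); // the FIFO sketch from Section 3 (path assumed)

const app = express();
app.use(express.json());

app.post("/api/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  await enqueue(async () => {
    // Proxy the request to the local llama-server and relay its stream.
    const upstream = await fetch("http://127.0.0.1:8080/completion", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt: req.body.prompt, stream: true }),
    });
    for await (const chunk of upstream.body) {
      res.write(chunk); // forward SSE bytes to the browser as they arrive
    }
  });
  res.end();
});

app.listen(3000);
```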
7. Success Constraints
- Concurrency: The system must handle 1 active user while queueing 5 others without crashing (smoke-test sketch after this list).
- Security: Zero successful prompt injections during "ignore previous instructions" testing.
- Transparency: A public link to source_truth.md must be provided in the UI.
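A possible smoke test for the concurrency constraint, assuming the bridge from Section 6 is listening on port 3000.

```js
// Fire 6 simultaneous requests: 1 should run immediately, 5 should queue.
// The test passes if all 6 eventually complete without errors.
async function smokeTest() {
  const results = await Promise.allSettled(
    Array.from({ length: 6 }, (_, i) =>
      fetch("http://127.0.0.1:3000/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt: `ping ${i}` }),
      }).then((r) => r.text())
    )
  );
  const ok = results.filter((r) => r.status === "fulfilled").length;
  console.log(`${ok}/6 requests completed`);
}

smokeTest();
```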