FRAKTAG Evolution: From RAG Script to Modular Knowledge Engine

By Andrei Roman

Principal Architect, The Foundry

After the pivot to human-supervised ingestion (see Introducing FRAKTAG), I faced new problems. The system needed to run on three different stacks (cloud for production, local for privacy, Apple Silicon for development). Magic string prompts scattered across the codebase made this impossible to test. Local models (Ollama, MLX) crashed when fed the same context that worked fine on GPT-4.1.

Here are the architectural pivots that solved those problems.

Problem 1: LLMs Are Stochastic and Break Everything

LLMs are non-deterministic. Embedding prompt strings directly into business logic made the system brittle. Local quantized models (looking at you, 4-bit) sometimes output invalid JSON.

Solution: The LLM Nugget Architecture

Wrapped prompts into Nugget classes. Each Nugget encapsulates the prompt, input variables, and output parsing/validation logic (extractJSON, regex cleanup). Examples: GlobalMapScan, AssessNeighborhood, GenerateGist.

This enforces type safety on stochastic outputs. Treats an LLM call like a typed function. The rest of the application always receives valid objects, or fails gracefully with specific error handling, regardless of the model's mood.

Enables isolated testing. Run fkt test-nugget against different models (Qwen3-coder vs GPT-4.1-mini). Verify instruction adherence before full system integration. Prompt engineering becomes software engineering.
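
A rough sketch of the pattern (illustrative shapes; the real Nugget classes differ in detail):

```typescript
// Illustrative sketch of the Nugget pattern; names and shapes are assumptions,
// not the exact FRAKTAG classes.
type CallLLM = (prompt: string) => Promise<string>;

abstract class Nugget<TInput, TOutput> {
  constructor(protected callLLM: CallLLM) {}

  // Each concrete Nugget owns its prompt template...
  protected abstract buildPrompt(input: TInput): string;
  // ...and its output parsing/validation.
  protected abstract parse(raw: string): TOutput;

  async run(input: TInput): Promise<TOutput> {
    const raw = await this.callLLM(this.buildPrompt(input));
    return this.parse(raw); // throws a specific error on invalid output
  }
}

// Example: a gist generator that must return strict JSON.
class GenerateGist extends Nugget<{ text: string }, { gist: string }> {
  protected buildPrompt({ text }: { text: string }): string {
    return `Summarize the following as JSON {"gist": "..."}:\n${text}`;
  }
  protected parse(raw: string): { gist: string } {
    // extractJSON-style cleanup: grab the first JSON object in the raw output.
    const match = raw.match(/\{[\s\S]*\}/);
    if (!match) throw new Error("GenerateGist: no JSON object in output");
    const obj = JSON.parse(match[0]);
    if (typeof obj.gist !== "string") throw new Error("GenerateGist: missing gist");
    return { gist: obj.gist };
  }
}
```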

Problem 2: The Code Could Not Run in Two Places

The codebase needed to run in two mutually exclusive environments: local workstations (stateful, low latency, filesystem access) and AWS Lambda (stateless, high latency, S3 access). Hardcoding fs.readFile or OpenAI calls made the code non-portable.

Solution: Hexagonal Architecture (Ports and Adapters)

Extracted I/O into interfaces (IStorage, ILLMAdapter). Core engine logic (Fractalizer, Navigator) became pure TypeScript. It does not care whether it reads from a Mac SSD or an S3 bucket, or whether it talks to a local quantized model via Python or to GPT-4.1 over HTTPS.

Three production setups now work:

  • Cloud (SaaS): OpenAIAdapter + S3Storage + AWS Lambda. High concurrency (10+ parallel requests).

  • AMD Strix Halo (Local): OllamaAdapter + JsonStorage. Optimized for Linux/Windows workstations with dedicated NPUs and GPUs.

  • Apple Silicon (Local): MLXAdapter + JsonStorage. Custom Python sidecar (mlx_runner.py) unifies mlx-lm (chat) and sentence-transformers (embeddings) into single API. Bypasses Docker and Ollama on Mac.

Same core logic. LLMs are interchangeable CPUs.
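
The ports, roughly (a simplified sketch; the real IStorage/ILLMAdapter contracts carry more detail):

```typescript
// Illustrative port definitions; the real interfaces are simplified here.
interface IStorage {
  read(key: string): Promise<string>;
  write(key: string, data: string): Promise<void>;
}

interface ILLMAdapter {
  chat(prompt: string): Promise<string>;
  embed(text: string): Promise<number[]>;
}

// The core engine depends only on the ports, never on fs, S3 clients, or HTTP libraries.
class Navigator {
  constructor(private storage: IStorage, private llm: ILLMAdapter) {}

  async answer(query: string): Promise<string> {
    const map = await this.storage.read("trees/root.json");
    return this.llm.chat(`Given this map:\n${map}\n\nAnswer: ${query}`);
  }
}

// The composition root picks adapters per environment (hypothetical constructors):
// cloud:         new Navigator(new S3Storage(bucket), new OpenAIAdapter(apiKey))
// AMD Strix:     new Navigator(new JsonStorage(dir), new OllamaAdapter(host))
// Apple Silicon: new Navigator(new JsonStorage(dir), new MLXAdapter(sidecarUrl))
```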

Problem 3: Vector Search Alone Misses Structural Context

Standard RAG (vector soup) relies solely on semantic similarity. It cannot answer structural questions like "What are the main topics in the Engineering folder?" because it has no concept of hierarchy. It treats a root-level policy document the same as a footnote buried in a sub-folder.

Solution: Strict Taxonomy and Multi-Modal Retrieval

Enforced a strict schema (Folder vs Document vs Fragment), persisted as a graph (tree) structure. This enabled combining vector search (semantic match) with graph traversal (structural drill-down).
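
A simplified sketch of the node schema (field names are illustrative):

```typescript
// Illustrative node schema; field names are assumptions, not the exact FRAKTAG types.
type NodeKind = "folder" | "document" | "fragment";

interface TreeNode {
  id: string;
  kind: NodeKind;        // strict taxonomy: Folder vs Document vs Fragment
  title: string;
  gist: string;          // LLM-generated summary, usable in the Global Map Scan
  children: string[];    // folders hold documents, documents hold fragments
  embedding?: number[];  // optional vector for semantic search
}
```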

Multi-stage ensemble strategy:

Vector Neighborhood (The Scout): Fast semantic search finds deep content fragments. Embedding similarity catches exact keyword matches.

Global Map Scan (The Strategist): LLM analyzes text representation of high-level tree structure. Finds relevant branches vector search misses due to vocabulary mismatch. Example: a Risk folder that does not explicitly mention query keywords but is structurally relevant.

Precision Drill: System recursively explores candidate branches. Reads local context. Decides whether to drill deeper.

Vector search alone misses context. Graph traversal alone is slow. Ensemble combines both.
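
Sketched in TypeScript, the ensemble looks something like this (hypothetical function names, not the actual retrieval code):

```typescript
// Hypothetical sketch of the multi-stage ensemble; names are assumptions.
interface Hit { nodeId: string; score: number; }

interface RetrievalDeps {
  vectorSearch(query: string, topK: number): Promise<Hit[]>;            // the Scout
  globalMapScan(query: string, treeOutline: string): Promise<string[]>; // the Strategist: returns branch ids
  drill(branchId: string, query: string): Promise<Hit[]>;               // the Precision Drill
  renderTreeOutline(): string;                                          // text view of the high-level tree
}

async function retrieve(query: string, deps: RetrievalDeps): Promise<Hit[]> {
  // 1. Fast semantic scout over fragment embeddings.
  const scoutHits = await deps.vectorSearch(query, 20);

  // 2. The LLM reads the tree outline and nominates structurally relevant branches.
  const branches = await deps.globalMapScan(query, deps.renderTreeOutline());

  // 3. Recursively explore each candidate branch, deciding whether to go deeper.
  const drilled: Hit[] = [];
  for (const branch of branches) {
    drilled.push(...(await deps.drill(branch, query)));
  }

  // Merge and dedupe by node id, keeping the best score.
  const byId = new Map<string, Hit>();
  for (const hit of [...scoutHits, ...drilled]) {
    const prev = byId.get(hit.nodeId);
    if (!prev || hit.score > prev.score) byId.set(hit.nodeId, hit);
  }
  return [...byId.values()].sort((a, b) => b.score - a.score);
}
```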

Problem 4: Conversation History Killed Performance

Storing all conversations in a single conversations.json tree file created O(N) performance degradation. As history grew, the I/O cost of reading and writing even a simple "Hello" increased linearly, and memory usage bloated with it.

Solution: One Tree Per Session

Sharded conversations into individual files (conv-{uuid}.json). Loading a session now takes constant time regardless of total history size: O(1) performance. It also prevents context bleed, where the vector index of one massive conversation tree accidentally pollutes the results of another.

Session trees contain linkedContext metadata pointing to external reference trees (knowledge bases) they discuss. Conversations become first-class knowledge artifacts. Not throwaway logs.
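
A sketch of a sharded session file and its O(1) load (illustrative field names, assuming Node.js fs/promises):

```typescript
// Illustrative shape of a conv-{uuid}.json session tree; field names are assumptions.
import { readFile } from "node:fs/promises";

interface SessionTree {
  id: string;                 // the session uuid, also used in the filename
  messages: { role: "user" | "assistant"; text: string }[];
  linkedContext: string[];    // external knowledge-base trees this session references
}

// Loading one session touches exactly one small file: O(1) in total history size.
async function loadSession(dir: string, uuid: string): Promise<SessionTree> {
  const raw = await readFile(`${dir}/conv-${uuid}.json`, "utf8");
  return JSON.parse(raw) as SessionTree;
}
```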

Knowledge bases are self-contained directories (kb.json + content/indexes/trees). Git-versionable. Shareable via USB. Syncable to S3. No external database dependencies.

Problem 5: Local GPUs Thrash Under Parallel Load

Cloud APIs (OpenAI) scale horizontally: fire 10 requests at once and they all come back. Local hardware (Apple Silicon, AMD) scales vertically: fire 10 requests at once and the GPU thrashes VRAM, constantly swapping model weights and KV caches. That leads to timeouts, crashes, and garbage output.

Solution: Concurrency Control (Semaphores and Locks)

Implemented an async Semaphore in the adapters and an asyncio.Lock in the Python runner. Forces serialization on local devices (queue: 1) so the GPU is dedicated to one generation task at a time. Prevents quantization collapse. Still allows parallelism in the cloud configuration.

Experimental for now. Defaulting to serial calls.
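
A minimal async semaphore sketch (illustrative, not the actual adapter code); local adapters get 1 permit, cloud adapters get more:

```typescript
// Minimal async semaphore; illustrative sketch, not the actual FRAKTAG implementation.
class Semaphore {
  private queue: (() => void)[] = [];
  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) { this.permits--; return; }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next();          // hand the permit straight to the next waiter
    else this.permits++;
  }
}

// Local adapters get 1 permit (strict serialization); cloud adapters get e.g. 10.
const gpuLock = new Semaphore(1);

async function generate(callModel: () => Promise<string>): Promise<string> {
  await gpuLock.acquire();
  try {
    return await callModel();  // the GPU sees exactly one generation at a time
  } finally {
    gpuLock.release();
  }
}
```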

Problem 6: Local Models Hallucinate on Large Context

Modern models claim 128k context windows, but reasoning accuracy degrades sharply as the context fills up (the "Lost in the Middle" phenomenon), especially under quantization. Feeding a 200k-character map to a local 30B model produced gibberish.

Solution: Chunking Discipline

Reduced the Global Map Scan chunk size from 200k to 25k characters. Hard-limited context to 32k tokens in the config. Reliability over capacity: better to perform 5 small, accurate serial scans than 1 massive, hallucinated parallel scan. This respects the effective attention span of quantized local models.
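
A sketch of the chunked serial scan (hypothetical helpers; only the 25k figure comes from the config):

```typescript
// Hypothetical sketch of chunked, serial map scanning; not the actual FRAKTAG code.
const MAX_CHUNK_CHARS = 25_000; // keeps each scan inside a quantized model's reliable range

function chunkMap(mapText: string, maxChars = MAX_CHUNK_CHARS): string[] {
  // A real splitter would respect node boundaries rather than cutting mid-line.
  const chunks: string[] = [];
  for (let i = 0; i < mapText.length; i += maxChars) {
    chunks.push(mapText.slice(i, i + maxChars));
  }
  return chunks;
}

// Serial scans: several small accurate passes instead of one giant hallucinated one.
async function scanMap(
  mapText: string,
  query: string,
  scanChunk: (chunk: string, query: string) => Promise<string[]>, // e.g. a GlobalMapScan nugget
): Promise<string[]> {
  const results: string[] = [];
  for (const chunk of chunkMap(mapText)) {
    results.push(...(await scanChunk(chunk, query)));
  }
  return results;
}
```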

Standardized on 8-bit or 6-bit quantization for local models (Qwen 30B). 4-bit models suffer syntax degradation (broken JSON) during complex reasoning.

Context is a managed resource. RAM is not infinite. Budget it.

What This Achieved

FRAKTAG evolved from a RAG script into a modular knowledge engine that solves six technical problems:

Type safety on stochastic outputs. Nugget architecture treats LLM calls as typed functions.

Runtime agnosticism. Hexagonal architecture enables same code on AWS Lambda, AMD workstations, Apple Silicon.

Multi-modal retrieval. Combines vector search (fast, imprecise) with graph traversal (slow, structural). Ensemble beats either alone.

O(1) conversation performance. Sharding conversations into individual trees prevents linear degradation.

Hardware constraint handling. Semaphores prevent VRAM thrashing on local GPUs.

Reliability over capacity. Chunking discipline prevents hallucination on quantized models.

Repo: https://github.com/andreirx/FRAKTAG
