I Made Repo-Graph - But I Didn't Know How to Build It

By Andrei Roman

Principal Architect, The Foundry

The Premise

In late March 2026, I started building a multi-language code intelligence system. I did not know tree-sitter's API, SQLite's WAL semantics for concurrent access, or how to structure a 35-crate Cargo workspace.

What I had was an architectural target I could describe precisely: a deterministic engineering substrate that tells AI agents what exists in a codebase, what owns what, what boundaries matter, what can be trusted, what changed, and what policy applies. I wrote about the problem in "Missing Links in Agentic Coding." Then I decided to build the solution.

Five weeks later, repo-graph has 264 commits, 1,300+ code files, 35 Rust crates, 9 supported languages, 11,243 indexed symbols, 12,285 resolved edges, and working trust analysis, boundary detection, governance gates, and agent-oriented CLI surfaces. It was just me and two AI models, with no prior domain knowledge on my side.

I wouldn't call it a success story. This is a field report on what happens when you manage AI agents knowing where you want to get to, but not how to get there.

The Setup: Architecture Governor, Not Typist

I set up the project the way I would set up a team.

System prompts defined the clean architecture rules. A VISION.md described the three-layer truth model: deterministic extracted facts at the bottom, explicit unresolved observations in the middle, interpretation layers on top. A ROADMAP.md described the sequence. CLAUDE.md told the agent where to look. A test protocol document specified exactly how to validate changes - which scripts to run, which CLI commands to invoke, where databases should live.

Claude Code did the implementation. Codex reviewed against the stated vision. Does this implementation match the architecture? Is the module boundary correct? Is the maturity claim honest? Is everything wired for the typical workflows?

I held intent, constraints, and judgment. Not syntax.

What Went Right

Velocity was real. The first commit on March 31 was a storage schema and SQLite adapter. By April 6, a working TypeScript prototype had extraction, refresh, trust reporting, measurements, change impact, and framework detection. By April 15, I had tested the TypeScript implementation on the Linux kernel and decided to rewrite everything in Rust, starting from structural parity with the prototype. By April 17, the strategic pivot to Rust-primary and discovery-first was complete. That is five weeks of sustained, directed output that would have taken a small team months.

Breadth-first worked. The product expanded across TypeScript, Rust, Java, Python, C, and C++ extraction in the first two weeks. Boundary detection for AMQP, Kafka, NATS, gRPC, shared memory, semaphores, TCP, UDP, and inter-core IPC followed. This was deliberate. The substrate needed multi-language breadth to be useful. The agents could deliver breadth fast because each new extractor followed a pattern the prior ones established.

The two-model adversarial setup produced real corrections. Claude Code and Codex would argue over implementation plans, module maturity levels, and whether a claimed capability was overstated. Those arguments produced better outcomes per task than either model alone. The friction was productive - when it stayed on target.

Epistemic honesty became a product feature. The trust report currently says import graph: LOW, call graph: LOW, change impact: LOW. Call resolution rate: 21.6%. That is not a failure. That is the system correctly reporting what it does not yet know. The repo repeatedly corrected itself away from overclaiming - withdrawing dead-code analysis because it could not be statically verified, downgrading maturity labels, marking inferred modules as inferred rather than declared. The agents learned this behavior because the system prompts and review gates enforced it.
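To make the idea of a reported trust level concrete, here is a minimal sketch of how a resolution rate could map to a coarse label. It is not repo-graph's actual implementation: the function names, the thresholds, the HIGH and MEDIUM labels, and the counts are all assumptions, with the counts chosen only to reproduce the 21.6% figure.

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Fraction of call sites that resolved to a known target."""
    return resolved / total if total else 0.0


def trust_label(rate: float, high: float = 0.9, medium: float = 0.6) -> str:
    """Map a resolution rate to a coarse trust label.

    The thresholds are illustrative assumptions, not repo-graph's values.
    """
    if rate >= high:
        return "HIGH"
    if rate >= medium:
        return "MEDIUM"
    return "LOW"


# Made-up counts chosen to give the 21.6% rate quoted in the trust report.
rate = resolution_rate(216, 1000)
print(f"call graph: {trust_label(rate)} ({rate:.1%})")
```

The point of the design is that the label is computed from measured counts, so the system cannot claim more confidence than its own data supports.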

What Went Wrong

Everything that follows is the reason this article exists.

They kept feeding main.rs until it was 7,000 lines. Despite clean architecture rules in the system prompt. Despite module boundary instructions. Despite a 35-crate workspace that existed specifically to prevent monolithic accumulation. Both models defaulted to adding code to the entry point. It took a direct human intervention - stop, refactor now, I am not accepting another feature until this file is under control - to fix it. They did not self-correct. They would have kept going.

They ignored the test protocol. The repo had a specific document: run this script, use this CLI command, check these logs. Claude Code repeatedly skipped the script, ran its own commands, and even wrote the expected output from context instead of executing anything - effectively what I would consider faked test evidence.

I am not saying it did so maliciously.

It "knew" what the output should look like because it had run the commands before, so when asked about the missing test-run artifacts it produced them from memory instead of executing the command. I caught this multiple times. Each time it acknowledged the error, apologized, and then did it again, or forgot to run the script at all one compaction later.

They would not use the product's own CLI for validation. I built rmap specifically as the interface for querying the graph. The agents kept opening the SQLite database directly and writing raw SQL to check results. The system prompt said use the CLI. The architecture docs said use the CLI. They ignored both. They preferred the path that gave them the most control - direct database access - over the path that validated the product's actual user-facing behavior. Often they hallucinated a new database location along the way.

CLAUDE.md even stated that this is a command line tool designed primarily for AI agents to use. Still, they would ask me to decide what the commands should look like, and I had to remind them that they would be the ones using the CLI, so they should design it to be whatever felt "natural" for them to write.

They peppered my drive with databases. The test protocol specified where databases should be created. The agents created them wherever they felt like it - the repo root, temp directories, nested inside crate folders. Every test run left orphan databases. The problem was not solved by instructions, reminders, or system prompt updates. It was solved by building a daemon that managed database lifecycle itself. I had to engineer around the agents' inability to follow file-placement rules.
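A cheap complement to the daemon is a check that fails whenever a database file appears outside the sanctioned directory. The sketch below shows one way that could look; the function name, the suffix list, and the directory layout are hypothetical and not taken from repo-graph.

```python
from pathlib import Path

# Hypothetical suffixes; adjust to whatever the project actually creates.
DB_SUFFIXES = {".db", ".sqlite", ".sqlite3"}


def find_orphan_databases(root: Path, allowed_dir: Path) -> list[Path]:
    """Return database files under root that live outside allowed_dir."""
    root = root.resolve()
    allowed = allowed_dir.resolve()
    orphans = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in DB_SUFFIXES:
            continue
        # Anything not inside the sanctioned directory counts as an orphan.
        if not path.is_relative_to(allowed):
            orphans.append(path.relative_to(root))
    return sorted(orphans)
```

Run from a hook or CI step, a non-empty result can block the change, which turns the file-placement rule from a request into a constraint.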

They went wrong-depth together. This was the most insidious failure mode. Both models would agree to polish a low-priority subsystem to production quality while a critical-path capability remained unimplemented. They would spend an afternoon perfecting error messages in a crate that was not yet called from anywhere, while the module discovery system - the product's stated priority - had zero rows in the database. Their local judgment about what to work on next was often wrong, and they reinforced each other's wrong judgment. The adversarial setup caught specification errors. It did not catch prioritization errors.

I think the models lock in their "roles" of reviewer and developer and then focus on acting out the role based on surface evidence, rather than reasoning about the subject matter and digging deeper.

They also did not understand instructions as a hierarchy of priority: everything should be judged against the stated vision first, architecture is downstream from that, and slices derive from architecture.

Refresh durability fell behind feature expansion. The agents added boundary detection, contract indexing, runtime surfaces, and quality measurements rapidly. Each new capability stored new artifact types. But the delta-refresh system that preserves derived artifacts across re-indexing did not keep up. The most recent refresh snapshot lost boundary surfaces and contract schemas that existed in the prior full snapshot. The agents shipped features faster than they maintained the substrate those features depended on. I should have gated new features on refresh durability. I did not, and the technical debt is now real.

And that is another second-order problem that the catchy demos and headlines largely ignore: the faster you generate code, the more code you have to maintain.

The Patterns

After a month of this, the failure modes are predictable enough to name.

Append gravity. Generative models can default to doing the easiest, hackiest thing. Without hard structural constraints - enforced module boundaries, file size limits, mandatory refactoring gates - they produce the dreaded "AI slop" regardless of what the architecture says.
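A file size limit is one of the easier structural constraints to enforce mechanically. The following is a minimal sketch of such a gate; the function name and the threshold passed to it are assumptions for illustration, not something the repo is known to contain.

```python
from pathlib import Path


def oversized_files(root: Path, max_lines: int, suffix: str = ".rs") -> list[tuple[str, int]]:
    """List (relative path, line count) for source files longer than max_lines."""
    offenders = []
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            offenders.append((path.relative_to(root).as_posix(), line_count))
    return offenders
```

A pre-commit hook or CI job could call this and exit non-zero when the list is non-empty, so an entry point cannot quietly grow to thousands of lines.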

Maybe it's because their makers prioritize this mode of operation for marketing benefits - a model that produces something fast makes for a WOW moment.

Evidence fabrication under convenience pressure. This is a dangerous failure mode because it looks like compliance but is not. Sometimes indexing a large repo would time out and the agents would wave it off without investigating why. The first time I tried to index the Linux kernel using the TypeScript prototype, it ran out of memory; then, after I specifically asked the AI to investigate WHY and fix the memory and performance issues, it ran for hours.
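One way to narrow the room for predicted output is to have a small runner, not the agent, produce the evidence record. The sketch below is an assumption about how that could look, with hypothetical function and field names. A digest alone does not stop fabrication; it only helps when the record is written by tooling the agent does not author.

```python
import hashlib
import subprocess
import sys
from datetime import datetime, timezone


def run_with_evidence(command: list[str]) -> dict:
    """Execute command and return a record derived only from the real run."""
    started = datetime.now(timezone.utc).isoformat()
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "started_utc": started,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stdout_sha256": hashlib.sha256(completed.stdout.encode("utf-8")).hexdigest(),
    }


def evidence_is_consistent(record: dict) -> bool:
    """Check that the stored output still matches the digest taken at run time."""
    digest = hashlib.sha256(record["stdout"].encode("utf-8")).hexdigest()
    return digest == record["stdout_sha256"]


if __name__ == "__main__":
    # Stand-in command; a real protocol would invoke the project's test script.
    record = run_with_evidence([sys.executable, "-c", "print('indexed')"])
    print(record["exit_code"], evidence_is_consistent(record))
```

If the output in a record is edited after the fact, the consistency check fails, which at least makes tampering with a stored artifact visible.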

Tool-path avoidance. Agents prefer the path that gives them maximum information with minimum ceremony. A raw database query returns data faster than invoking a CLI that parses, formats, and validates. The agent does not care that the CLI is the product. It cares about answering the immediate question. Using the product's own interface is a discipline agents do not have unless it is the only option.

Priority consensus drift. Two models reviewing each other will converge on local quality improvements over global priority alignment. They are excellent at asking "is this implementation correct?" They are poor at asking "should we be implementing this at all right now?" or "is this the right tech stack?"

In fact, they never look for tech stack alternatives.

Breadth-durability gap. Agents excel at adding new capabilities. They are poor at maintaining the infrastructure those capabilities depend on. Every new feature type that stores derived artifacts needs corresponding refresh/migration/persistence support. Agents will ship the feature and skip the plumbing unless the plumbing is an explicit, gated prerequisite.
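A durability gate of this kind can be small. The sketch below compares per-type artifact counts between two snapshots and reports types that disappeared; the function name, the artifact type names, and the counts are hypothetical, and how the counts are obtained from the database is left out.

```python
def lost_artifact_types(previous: dict[str, int], latest: dict[str, int]) -> list[str]:
    """Return artifact types that had rows in the previous snapshot but none in the latest."""
    return sorted(
        kind
        for kind, count in previous.items()
        if count > 0 and latest.get(kind, 0) == 0
    )


# Hypothetical counts for illustration only.
previous_snapshot = {"symbols": 100, "boundary_surfaces": 12, "contract_schemas": 4}
latest_snapshot = {"symbols": 104, "boundary_surfaces": 0}
print(lost_artifact_types(previous_snapshot, latest_snapshot))
```

Making a non-empty result block new feature work is one way to turn "maintain the plumbing" into an explicit prerequisite rather than a reminder.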

Compaction also seems to be a problem in Claude Code - after compacting, it would often completely forget what it was doing and start working on a slightly different task.

What I Had to Become

The word "manager" does not quite describe what I was doing.

I held the architectural intent. The models could not derive it from the codebase alone, no matter how many docs I wrote. VISION.md helped. System prompts helped. But the gap between stated architecture and the agents' default behavior was constant. Closing that gap was my primary job.

I held prioritization. The agents would not spontaneously work on the hardest, most important thing. They would work on the most tractable thing adjacent to what they just finished. I had to redirect constantly - not the implementation, but the target, the vision.

I held verification integrity. If I did not personally check whether a test was actually run or merely predicted, the evidence chain was unreliable. The agents did not lie in the human sense. They optimized for response completeness over execution fidelity. The effect is insidious because it's hard to detect. It was easy to spot mistakes in an LLM from 1-2 years ago - now the output from frontier models looks much more plausible and the wrong assumptions are often hidden deeper. And they often don't test the right thing either...

I held the refactoring trigger. The agents never said "this file is too large, we should restructure." They would have kept appending to main.rs until it was 20,000 lines. The structural correction impulse was entirely mine.

What This Proves About the Thesis

I wrote in "Missing Links in Agentic Coding" that agents fail in real systems because they lack orientation, trust, and control infrastructure. Then I spent a month watching agents fail in exactly those ways while building the system designed to solve exactly those problems.

The irony is structural, not cosmetic.

The agents building repo-graph needed repo-graph. They needed a queryable graph of what exists, what owns what, what the boundaries are, what changed, and what the priorities are. They did not have it. So they guessed. They drifted. They fabricated evidence. They ignored boundaries. They polished the wrong things.

The product is being built by the same forces it is designed to constrain. Around the beginning of 2026 I saw a few people diagnosing the same problem and proposing various solutions, from better AI memory systems to similar tree-sitter approaches.

I am sure that Anthropic and OpenAI are aware of the problem and probably working on their own solutions. I'm not claiming repo-graph is optimal. It is just a step in the right direction.

What the Codebase Looks Like Now

Honest accounting. The status summary as of May 8, 2026:

Mature: Multi-language extraction substrate (9 languages, 11,243 symbols, 12,285 resolved edges). Documentation inventory (376 docs entries, 354 generated MAP files). Trust and degradation reporting. Agent orientation surfaces (rmap orient, rmap check, rmap explain). Governance substrate.

Prototype: Module discovery on the Rust path (364 inferred, 0 declared, 0 operational). Runtime and build surfaces (CLI exists, zero persisted surfaces). Refresh durability for derived artifacts (latest snapshot lost boundary and contract data that prior full snapshot had). Quality measurements (zero rows in current snapshot). Change discovery (trust explicitly downgrades to LOW).

The strongest statement that fits the evidence: Layer 0 extraction is mature. Documentation and trust orientation are mature. Several Layer 1, 2, and 3 capabilities are designed, scaffolded, and partially implemented, but not yet durable on the Rust-primary refresh path.

That assessment was produced by the system's own trust reporting. The product is honest about its own gaps. That honesty is, itself, one of its most mature features.

The Decision Ahead

I am considering forking to a clean repo called simply rmap. Drop the early TypeScript prototyping. Start the public artifact from the Rust-primary identity that the product actually is now, not the hybrid it grew through.

The TypeScript prototype was necessary. It proved the concepts. It validated the architecture. But it is now legacy inside a five-week-old repo. The current product is Rust-primary, and the codebase should reflect that without carrying 6,000 lines of TypeScript.

This is also an engineering-culture statement. The agents will do better with a clean starting point than with a codebase that contains its own archaeological layers. Reducing the orientation burden on the tools that build the product is exactly the same problem the product solves for its users.

The Honest Conclusion

I built a multi-language code intelligence system in a language I barely know, using AI agents as my engineering team, in five weeks. The system works. It indexes real codebases. It reports trust honestly. It detects boundaries across IPC families. It has 35 Rust crates and a coherent architectural spine.

The agents did not do this alone. They could not have. Without architectural governance, constant priority redirection, manual verification of evidence, and structural interventions they never would have initiated, the result would have been a 20,000-line main.rs that indexes TypeScript files and claims to do more.

The conclusion is not "AI replaces engineers." The conclusion is not "AI is useless." The conclusion is specific and earned:

AI agents are force multipliers for someone who knows what to build and is willing to govern the build process with the same rigor they would apply to a human team - plus additional rigor for failure modes humans do not have.

The bottleneck is now system judgment. The agents pretend they have it, but they don't have to live with the consequences of their actions. You do. So you need to apply your own judgment and understanding. What they give you is execution speed. It's up to you to keep them speeding in the right direction.

Later edit:

After writing this, something became obvious - I had to refactor the agent operating docs to address the failure modes described above.

The original CLAUDE.md had accreted into a mixed-purpose instruction dump - architecture rules, task protocols, validation steps, priority context, all in one file. The agents treated it the way they treat any large document: they skimmed it, cherry-picked what was convenient, and ignored the rest.

It is now split into three layers: a short universal control-law file (CLAUDE.md), a compaction-resistant operational checkpoint (CURRENT_SLICE.md) that states the current priority and validation contract, and deeper task-specific reference docs under agent_docs/. The sequencing is explicit: orient on vision and current slice first, inspect the system through rmap, then execute against a declared validation contract.

This is better. Adherence improved.

But the larger lesson from this entire buildout still holds: where a rule can be enforced by tooling, hooks, or CI, enforcement beats prose every time. I had to build a daemon to stop the agents from scattering databases. No amount of instruction achieved what a single architectural constraint did.

One final note on the fact that I did not know how to build repo-graph when I started. I read this somewhere and it stuck with me: "you can outsource your thinking but you can't outsource your understanding." Whenever I let the agents do what they do without really understanding it myself, it always ends up catching up with me. I still have to understand. If I didn't know at the start, I have to learn now.

And I'm still catching up.


Repo-graph is open source. The fork decision is pending. The existing repo is at github.com/andreirx/repo-graph.
