Graph retrieval

Personalized PageRank with HippoRAG node-specificity

The problem

Given a query — "where does user authentication happen?" — retrieve the smallest subgraph that is sufficient for Claude to answer. Flat vector retrieval misses multi-hop: the file that validates a JWT may not contain the word "authentication" anywhere; it's reached via auth-middleware.ts → jwt-verifier.ts → crypto-utils.ts.

HippoRAG (NeurIPS 2024) reports a ~20% improvement over flat retrieval on multi-hop QA by using Personalized PageRank over an entity-linked graph. Causalist borrows this.

Personalized PageRank

Starting from a query, we identify seed nodes via entity mentions and semantic matching. We then compute the stationary distribution of a random walk that restarts with probability α at the seeds:

r \;=\; (1 - \alpha)\, M r \;+\; \alpha s

where

  • r ∈ ℝ^|V| is the retrieved relevance score for every node,
  • M is the column-stochastic transition matrix of the graph (edge weights from confidence × inverse degree of the source),
  • s is the seed distribution (one-hot on query-relevant nodes, or a soft distribution from semantic similarity),
  • α ∈ [0, 1] is the restart probability, typically 0.15.

We solve this iteratively via power iteration until the L1 difference between successive iterates, ‖r⁽ᵏ⁺¹⁾ − r⁽ᵏ⁾‖₁, falls below 10⁻⁵. Typical convergence is 30–50 iterations.
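The update can be sketched in a few lines of TypeScript. Everything here is illustrative — the `Edge` shape, the function name, and the adjacency-list input are assumptions, not Causalist's actual interfaces:

```typescript
// Sketch of the PPR solve via power iteration over a weighted
// adjacency list. All names are illustrative, not Causalist's API.
type Edge = { to: number; weight: number };

function personalizedPageRank(
  adj: Edge[][],      // adj[u] = weighted out-edges of node u
  seeds: number[],    // seed distribution s (should sum to 1)
  alpha = 0.15,       // restart probability α
  tol = 1e-5,         // L1 convergence threshold
  maxIter = 100,
): number[] {
  // Column-stochastic transition: normalize each node's out-edge weights.
  const out = adj.map((edges) => {
    const total = edges.reduce((acc, e) => acc + e.weight, 0);
    return edges.map((e) => ({ to: e.to, p: total > 0 ? e.weight / total : 0 }));
  });

  let r = seeds.slice();
  for (let iter = 0; iter < maxIter; iter++) {
    const next = seeds.map((s) => alpha * s);   // restart mass α·s
    for (let u = 0; u < adj.length; u++) {
      for (const { to, p } of out[u]) {
        next[to] += (1 - alpha) * r[u] * p;     // walk mass (1 − α)·M·r
      }
    }
    // Note: dangling nodes (no out-edges) leak mass in this sketch.
    const delta = next.reduce((acc, v, i) => acc + Math.abs(v - r[i]), 0);
    r = next;
    if (delta < tol) break;                     // L1 convergence check
  }
  return r;
}
```

With the restart term, scores concentrate near the seeds and decay along outgoing edges, which is the behavior the retrieval relies on.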

Node specificity (the HippoRAG trick)

A naive PPR over a codebase drowns the retrieval in hub nodes — main.ts, index.ts, types.ts — that connect to everything. HippoRAG applies a node-specificity weight to the seed distribution:

s_i \;\propto\; \log\!\left(\frac{N}{\text{degree}(i)}\right)

where N = |V|. Hubs get small seed weight; specific, narrow nodes get large seed weight. The effect is dramatic: retrieval becomes about the distinctive files for a query, not the central ones.
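As a sketch, the reweighting can be applied to the seed vector before the PPR solve. The function and argument names are hypothetical:

```typescript
// Sketch of HippoRAG-style node specificity: scale each seed by
// log(N / degree(i)), then renormalize. Names are hypothetical.
function specificityWeightedSeeds(rawSeeds: number[], degrees: number[]): number[] {
  const n = degrees.length; // N = |V|
  const weighted = rawSeeds.map(
    // Hubs (high degree) get a small factor; narrow nodes a large one.
    (s, i) => s * Math.log(n / Math.max(degrees[i], 1)),
  );
  const total = weighted.reduce((a, b) => a + b, 0);
  // Renormalize so the seed vector remains a probability distribution.
  return total > 0 ? weighted.map((w) => w / total) : rawSeeds;
}
```

Note the degenerate case: a node connected to everything gets log(N/N) = 0 and drops out of the seeds entirely, which is exactly the hub-suppression the weighting is for.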

Subgraph extraction

Once r is computed, we extract the top-k highest-scoring nodes (typically k = 20) and all edges between them. The resulting subgraph is what Claude sees as context.
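A minimal sketch of the extraction step, assuming the graph's edges are available as a flat list (the `WeightedEdge` shape is an assumption):

```typescript
// Sketch: take the k highest-PPR nodes and their induced edges.
type WeightedEdge = { from: number; to: number; weight: number };

function extractSubgraph(
  scores: number[],        // PPR score per node id
  edges: WeightedEdge[],   // full edge list of the graph
  k = 20,
): { nodes: number[]; edges: WeightedEdge[] } {
  // Top-k node ids by descending PPR score.
  const nodes = scores
    .map((score, id) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.id);
  const keep = new Set(nodes);
  // Induced subgraph: keep only edges with both endpoints retained.
  return {
    nodes,
    edges: edges.filter((e) => keep.has(e.from) && keep.has(e.to)),
  };
}
```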

Topological ordering

Causal Graphs Meet Thoughts shows that putting cause nodes before effect nodes in the prompt, so chain-of-thought aligns with graph traversal, improves reasoning accuracy on causal queries. We sort the retrieved subgraph topologically before serialization:

\text{prompt\_order}(n) \;=\; \text{topo\_rank}(n)

with ties broken by descending PPR score. Cycles (which shouldn't exist but do during indexing) are broken by removing the lowest-confidence edge in the cycle.
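The ordering plus cycle-breaking can be sketched as a priority-driven Kahn's algorithm. The data shapes are assumptions, and this sketch breaks a cycle by removing the lowest-confidence edge among the still-unplaced nodes — a simplification of finding the edge inside the detected cycle itself:

```typescript
// Sketch: topological order (causes first), ties broken by descending
// PPR score, cycles broken by dropping a low-confidence edge.
type CausalEdge = { from: number; to: number; confidence: number };

function topoOrder(
  nodes: number[],
  edges: CausalEdge[],
  pprScore: Map<number, number>, // PPR score per node id
): number[] {
  const live = edges.slice(); // working copy; cycle-breaking mutates it
  const remaining = new Set(nodes);
  const order: number[] = [];
  while (remaining.size > 0) {
    // In-degrees over the still-unplaced nodes.
    const inDeg = new Map<number, number>();
    remaining.forEach((n) => inDeg.set(n, 0));
    for (const e of live) {
      if (remaining.has(e.from) && remaining.has(e.to)) {
        inDeg.set(e.to, (inDeg.get(e.to) ?? 0) + 1);
      }
    }
    const ready = Array.from(remaining).filter((n) => inDeg.get(n) === 0);
    if (ready.length === 0) {
      // Stuck on a cycle: drop the lowest-confidence edge among the
      // remaining nodes (simplification of "lowest in the cycle").
      let worst = -1;
      for (let i = 0; i < live.length; i++) {
        const e = live[i];
        if (!remaining.has(e.from) || !remaining.has(e.to)) continue;
        if (worst < 0 || e.confidence < live[worst].confidence) worst = i;
      }
      live.splice(worst, 1);
      continue;
    }
    // Emit causes before effects; ties by descending PPR score.
    ready.sort((a, b) => (pprScore.get(b) ?? 0) - (pprScore.get(a) ?? 0));
    order.push(ready[0]);
    remaining.delete(ready[0]);
  }
  return order;
}
```

Recomputing in-degrees every round is quadratic, which is irrelevant at k = 20 nodes; the loop always terminates because each pass removes either a node or an edge.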