
Verification

Contradictions, DAG invariants, interventional edges

Structural verification

Every time the graph is updated, Causalist runs:

  1. Cycle detection on causal edges. caused, prevented, and causes-to-fail must form a DAG. We use Tarjan's SCC algorithm in O(V + E); any non-singleton SCC is a contradiction.
  2. Pairwise contradiction detection. Edges (X, Y, cause) and (X, Y, prevent) with overlapping timestamps are incompatible; we surface both and ask the user (or the agent's introspection loop) to resolve.
  3. Confidence floor. Edges below c = 0.2 are dimmed in the viewer; edges below c = 0.05 are dropped from reasoning context entirely.
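The cycle check above can be sketched as follows. This is an illustrative implementation, not Causalist's actual code: `tarjan_sccs` is standard Tarjan, and `causal_contradictions` (a hypothetical name) restricts the check to the three causal edge kinds and reports any non-singleton SCC.

```python
def tarjan_sccs(adj):
    """Tarjan's strongly-connected-components algorithm in O(V + E).
    adj: dict mapping node -> list of successor nodes."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in adj:
        if v not in index:
            strongconnect(v)
    return sccs

def causal_contradictions(edges):
    """edges: iterable of (src, dst, kind). Only causal kinds must form
    a DAG; any non-singleton SCC among them is a contradiction."""
    causal = {"caused", "prevented", "causes-to-fail"}
    adj = {}
    for s, d, k in edges:
        if k in causal:
            adj.setdefault(s, []).append(d)
            adj.setdefault(d, [])
    return [scc for scc in tarjan_sccs(adj) if len(scc) > 1]
```

For example, `caused(A, B)` together with `prevented(B, A)` forms a two-node SCC and is flagged.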

From structural to interventional

Today's edges are observational or LLM-inferred. The research program for Causalist v2 is to promote them to interventional where possible.

Mutation testing

Given a function f, a test suite T, and an existing calls edge (f, g):

  1. Apply a semantic mutation to g (e.g. flip a comparison, return a constant).
  2. Re-run T. Let T_fail(g′) be the set of newly-failing tests.
  3. For each t ∈ T_fail(g′), promote or create the edge (g, t, causes-to-fail) with confidence

c = 1 − P(flake(t))

estimated from test-history flake rate. This yields genuinely causal edges in Pearl's sense: we intervened, we observed, the counterfactual is the unmutated world.
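The promotion step can be sketched in a few lines. `promote_failure_edges` is a hypothetical helper, and the flake-history representation (one 0/1 entry per historical run, 1 = failed with no code change) is an assumption for illustration:

```python
def promote_failure_edges(mutated_fn, newly_failing, flake_history):
    """For each test that newly fails under a mutation of `mutated_fn`,
    emit a causes-to-fail edge with confidence c = 1 - P(flake(t)),
    where the flake rate is estimated from historical runs."""
    edges = []
    for t in newly_failing:
        runs = flake_history.get(t, [])
        # flake rate: fraction of past runs that failed without any code change
        flake = sum(runs) / len(runs) if runs else 0.0
        edges.append((mutated_fn, t, "causes-to-fail", 1.0 - flake))
    return edges
```

A test that has flaked in 1 of its last 4 runs gets confidence 0.75; a never-flaky test gets confidence 1.0.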

Git commits as natural experiments

Every commit c touches a set of files F_c and either leaves the build green or turns a test t red. Across many commits we have samples that let us estimate

P(fail(t) | f ∈ F_c)   vs.   P(fail(t) | f ∉ F_c)

Files with a large positive gap are load-bearing for that test. This isn't a true intervention (commits correlate with each other), but it's a principled way to rank candidate causal edges for mutation testing without having to mutate everything.
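The ranking statistic is a simple difference of conditional frequencies over commit history. A minimal sketch, assuming each commit record carries the set of files it touched and the set of tests it broke (`failure_gap` is a hypothetical name):

```python
def failure_gap(commits, test, path):
    """commits: iterable of (files_touched: set, failed_tests: set).
    Returns P(fail(test) | path in F_c) - P(fail(test) | path not in F_c).
    A large positive gap marks `path` as load-bearing for `test`."""
    with_f  = [test in fails for files, fails in commits if path in files]
    without = [test in fails for files, fails in commits if path not in files]
    rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return rate(with_f) - rate(without)
```

Candidate edges are then ranked by gap, and only the top candidates are handed to the (expensive) mutation-testing pass.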

Counterfactual replay

Given a regression introduced by commit c, the counterfactual is "what would have happened if file f had been reverted to its state at c−1?" We use Claude plus the test suite to hypothesize, then run the test to verify. Edges surviving this check are labeled counterfactually-required.
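The verify half of the loop can be sketched as a small driver. This is illustrative, not Causalist's implementation: `run_tests` and `revert_file` are injected callables (e.g. a pytest wrapper and a `git checkout <commit>~1 -- <path>` shell-out), so the harness stays runner-agnostic:

```python
def counterfactual_replay(commit, path, run_tests, revert_file):
    """Revert `path` to its pre-commit state, re-run the tests, and
    label an edge for every test the revert fixes.
    run_tests() -> set of currently-failing test names.
    revert_file(path, rev) -> None (restores `path` at revision `rev`)."""
    failing_before = run_tests()
    revert_file(path, f"{commit}~1")   # counterfactual world: f as of c-1
    failing_after = run_tests()
    fixed = failing_before - failing_after
    return [(path, t, "counterfactually-required") for t in sorted(fixed)]
```

Only tests that fail with the commit's version of f and pass without it earn the label; tests that fail either way are unaffected by f and yield no edge.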

The introspection loop

Every few agent iterations, a dedicated pass reads the graph and reports:

  • Repeated actions — if the agent has called Edit on the same file 4+ times without an intervening successful test, it's stuck. Surface it.
  • Contradictions — any new violations of DAG or pairwise checks.
  • Low-confidence territory — clusters of nodes where average edge confidence < 0.3 deserve human review.
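The repeated-action check reduces to counting edits per file since the last green test. A minimal sketch, assuming a chronological action log (`stuck_on_file` is a hypothetical name; the event shapes are assumptions for illustration):

```python
def stuck_on_file(actions, threshold=4):
    """actions: chronological list of ("edit", path) or ("test_pass", None).
    Flags any file edited `threshold`+ times with no passing test between
    the edits -- the 'agent is stuck' signal."""
    counts = {}
    stuck = set()
    for kind, path in actions:
        if kind == "test_pass":
            counts.clear()              # a green test resets every streak
        elif kind == "edit":
            counts[path] = counts.get(path, 0) + 1
            if counts[path] >= threshold:
                stuck.add(path)
    return sorted(stuck)
```

Four consecutive edits to the same file trip the detector; a passing test anywhere in between clears it.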

This snapshot gets injected back into the next turn's context, so the agent can course-correct on its own memory. It echoes Reflexion (NeurIPS 2023) — agents that read their own trajectory outperform agents that don't.