Verification
Contradictions, DAG invariants, interventional edges
Structural verification
Every time the graph is updated, Causalist runs:
- Cycle detection on causal edges. `caused`, `prevented`, and `causes-to-fail` must form a DAG. We use Tarjan's SCC algorithm in O(V + E) time; any non-singleton SCC is a contradiction.
- Pairwise contradiction detection. Edges asserting both `caused` and `prevented` between the same nodes with overlapping timestamps are incompatible; we surface both and ask the user (or the agent's introspection loop) to resolve.
- Confidence floor. Edges below a display threshold are dimmed in the viewer; edges below a stricter threshold are dropped from reasoning context entirely.
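A minimal sketch of the cycle check, assuming causal edges arrive as `(src, dst)` pairs (the data shape and function name are illustrative, not Causalist's actual API):

```python
from collections import defaultdict

def contradiction_sccs(edges):
    """Tarjan's SCC over causal edges. Any SCC with more than one node
    means the caused/prevented/causes-to-fail subgraph is not a DAG,
    so every non-singleton SCC is reported as a contradiction."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    index, lowlink = {}, {}       # discovery order and low-link per node
    stack, on_stack = [], set()
    counter = [0]
    contradictions = []

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            if len(scc) > 1:                # non-singleton SCC = contradiction
                contradictions.append(scc)

    for v in nodes:
        if v not in index:
            strongconnect(v)
    return contradictions
```

A cycle such as `a → b → c → a` surfaces as one contradiction; edges hanging off the cycle stay in singleton components and pass the check.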
From structural to interventional
Today's edges are observational or LLM-inferred. The research program for Causalist v2 is to promote them to interventional where possible.
Mutation testing
Given a function `f`, a test suite `T`, and an existing `calls` edge into `f`:
- Apply a semantic mutation to `f` (e.g. flip a comparison, return a constant).
- Re-run `T`. Let `F` be the set of newly-failing tests.
- For each `t ∈ F`, promote or create the edge `f causes-to-fail t`, with confidence estimated from the test-history flake rate.

This yields genuinely causal edges in Pearl's sense: we intervened, we observed, and the counterfactual is the unmutated world.
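The steps above can be sketched as a single mutation round; all hooks into the mutation engine and test runner are hypothetical callables, and the edge dictionary shape is illustrative:

```python
def mutation_promote(f_name, run_tests, apply_mutation, revert_mutation, flake_rate):
    """One mutation-testing round for function `f_name`.

    `run_tests()` returns the set of currently-failing test names;
    `apply_mutation` / `revert_mutation` are hypothetical hooks that
    mutate and restore the function. Returns candidate causal edges."""
    before = run_tests()                  # baseline failures, pre-intervention
    apply_mutation(f_name)                # e.g. flip a comparison
    try:
        after = run_tests()
    finally:
        revert_mutation(f_name)           # restore the unmutated world
    newly_failing = after - before        # F: tests broken only by the mutation
    return [
        # confidence discounts tests that flake often in history
        {"src": f_name, "rel": "causes-to-fail", "dst": t,
         "confidence": 1.0 - flake_rate.get(t, 0.0)}
        for t in sorted(newly_failing)
    ]
```

The `finally` block matters: the mutation is reverted even if the test run raises, so the intervention never leaks into the real tree.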
Git commits as natural experiments
Every commit touches a set of files and either leaves the build green or turns a test red. Across many commits, these samples let us estimate, for each file `f` and test `t`, the gap P(t fails | f touched) − P(t fails | f untouched).
Files with a large positive gap are load-bearing for that test. This isn't a true intervention (commits correlate with each other), but it's a principled way to rank candidate causal edges for mutation testing without having to mutate everything.
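That estimate can be computed directly from commit history; here each sample is assumed to be a `(touched_files, failed_tests)` pair, which is an illustrative shape rather than Causalist's actual record format:

```python
from collections import defaultdict

def failure_rate_gap(commits, test):
    """Per file, estimate P(test fails | file touched) minus
    P(test fails | file untouched) from commit samples.

    `commits` is a list of (touched_files, failed_tests) pairs."""
    touched = defaultdict(lambda: [0, 0])    # file -> [fail count, sample count]
    untouched = defaultdict(lambda: [0, 0])
    all_files = set()
    for files, _ in commits:
        all_files.update(files)

    for files, failures in commits:
        failed = test in failures
        for f in all_files:
            bucket = touched[f] if f in files else untouched[f]
            bucket[0] += failed
            bucket[1] += 1

    gaps = {}
    for f in all_files:
        ft, nt = touched[f]
        fu, nu = untouched[f]
        p_t = ft / nt if nt else 0.0
        p_u = fu / nu if nu else 0.0
        gaps[f] = p_t - p_u                  # large positive => load-bearing
    return gaps
```

Ranking files by this gap gives a cheap shortlist for mutation testing, without pretending the commit samples are true interventions.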
Counterfactual replay
Given a regression introduced by commit `C`, the counterfactual is "what would have happened if file `f` had been reverted to its state just before `C`?" We use Claude + the test suite to hypothesize, then run the test to verify. Edges surviving this check are labeled `counterfactually-required`.
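A minimal replay harness, assuming the repository is the current working directory: `git checkout <rev> -- <path>` restores a single file from a commit, and the test command's exit code decides whether the edge survives. The function name and interface are illustrative:

```python
import subprocess

def counterfactually_required(commit, path, test_cmd):
    """Replay the counterfactual: revert `path` to its state just before
    `commit`, re-run the regressing test, then restore the file.
    Returns True if the test passes in the counterfactual world."""
    # take the file back to the parent of the regressing commit
    subprocess.run(["git", "checkout", f"{commit}^", "--", path], check=True)
    try:
        result = subprocess.run(test_cmd)   # e.g. ["pytest", "tests/test_regression.py"]
        return result.returncode == 0       # 0 = the regression disappears
    finally:
        # restore the actual world regardless of the test outcome
        subprocess.run(["git", "checkout", commit, "--", path], check=True)
```

If reverting `f` makes the test pass, the edge from `f` to that test is counterfactually required; if the test still fails, the regression has another cause and the edge is demoted.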
The introspection loop
Every few agent iterations, a dedicated pass reads the graph and reports:
- Repeated actions — if the agent has called `Edit` on the same file 4+ times without an intervening successful test, it's stuck. Surface it.
- Contradictions — any new violations of the DAG or pairwise checks.
- Low-confidence territory — clusters of nodes where average edge confidence < 0.3 deserve human review.
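The repeated-action and low-confidence checks can be sketched as one pass; the trajectory tuples and edge triples here are assumed shapes for illustration, not Causalist's actual log format:

```python
def introspect(actions, edges, repeat_threshold=4, conf_floor=0.3):
    """One introspection pass over the agent's trajectory and the graph.

    `actions` is a list of (tool, target, test_passed) tuples in order;
    `edges` is a list of (src, dst, confidence) triples."""
    stuck = set()
    streak = {}                              # file -> Edits since last green test
    for tool, target, test_passed in actions:
        if tool == "Edit":
            streak[target] = streak.get(target, 0) + 1
            if streak[target] >= repeat_threshold:
                stuck.add(target)
        elif tool == "Test" and test_passed:
            streak.clear()                   # a green test resets every streak

    totals = {}                              # node -> (confidence sum, edge count)
    for src, dst, conf in edges:
        for node in (src, dst):
            s, n = totals.get(node, (0.0, 0))
            totals[node] = (s + conf, n + 1)
    low_conf = sorted(
        node for node, (s, n) in totals.items() if s / n < conf_floor
    )
    return {"stuck_files": sorted(stuck), "low_confidence_nodes": low_conf}
```

The returned report is what gets serialized into the next turn's context, so a stuck file or a low-confidence cluster is visible to the agent on its very next step.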
This snapshot gets injected back into the next turn's context, so the agent can course-correct on its own memory. It echoes Reflexion (NeurIPS 2023) — agents that read their own trajectory outperform agents that don't.