We ran 24 API calls through Claude Sonnet 4.6 with the same questions about code changes, the same model, and the same temperature. One arm gets raw git diff output; the other gets structured JSON from sem diff. Results: 95.9% accuracy with sem diff vs 41.5% with git diff.
The failure modes are systematic, not random:
1. git diff has no concept of "entity." Ask an agent to count added functions and it counts + lines instead. On one commit it reported 238 adds - that was the number of + lines. Actual: 32 entities.
2. No way to distinguish add vs modify. A modified function shows +/- hunks, same as a new function in a changed file. The model listed 9 "added" functions - 4 were actually modified. Precision: 55.6%.
3. No entity type vocabulary. Asked for entity type counts, the model returned {"file": 11}. It counted files because line diffs have no AST. Truth: 15 functions, 12 interfaces, 3 variables, 1 class.
4. Config files are invisible. JSON/YAML changes are just +/- key-value lines. The model doesn't classify them as entities. Missed package.json entirely.
5. Large diffs get truncated. A 3,905-line diff hit the 100KB cap, so the model saw only partial context and found 25/67 functions (37% recall). The structured JSON for the same commit is compact enough to fit under the cap, and recall was 64%.
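To make failure modes 1 and 2 concrete, here is a minimal Python sketch of the gap between counting + lines and counting entities. It is not how sem works (sem uses tree-sitter, per the description in this post); it uses Python's built-in ast module as a stand-in parser, and the OLD/NEW snippets and helper names (functions, classify, plus_line_count) are made up for illustration.

```python
import ast
import difflib

# Hypothetical before/after versions of one file.
OLD = '''
def load(path):
    return open(path).read()
'''

NEW = '''
def load(path):
    with open(path) as fh:
        return fh.read()

def save(path, text):
    with open(path, "w") as fh:
        fh.write(text)
'''


def functions(source):
    """Map function name -> AST dump for every top-level def."""
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)  # dump omits line numbers by default
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


def classify(old_src, new_src):
    """Entity-level view: which functions were added, modified, deleted."""
    old, new = functions(old_src), functions(new_src)
    return {
        "added": sorted(set(new) - set(old)),
        "modified": sorted(n for n in set(old) & set(new) if old[n] != new[n]),
        "deleted": sorted(set(old) - set(new)),
    }


def plus_line_count(old_src, new_src):
    """Line-level view: number of '+' lines in a unified diff."""
    diff = difflib.unified_diff(
        old_src.splitlines(), new_src.splitlines(), lineterm=""
    )
    return sum(
        1 for line in diff
        if line.startswith("+") and not line.startswith("+++")
    )


if __name__ == "__main__":
    print("plus lines:", plus_line_count(OLD, NEW))
    print("entities:  ", classify(OLD, NEW))
```

The line-level count lumps the rewritten body of load together with the new save function, while the entity-level view reports one added and one modified function. That is the distinction the model could not recover from +/- hunks alone.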
sem is a Rust CLI (30ms, single binary) that sits on top of git. It uses tree-sitter to parse your code into entities - functions, classes, properties - then does three-phase matching (exact ID, content hash, fuzzy similarity) to classify each change as added/modified/deleted/renamed.
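For readers who want a feel for what three-phase matching could look like, here is a rough Python sketch. It is a guess at the general shape and not sem's actual Rust implementation: the Entity type, the match_entities function, the 0.7 threshold, the use of difflib for similarity, and the choice to label hash and fuzzy matches as "renamed" are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entity:
    id: str    # hypothetical identifier, e.g. a function name
    body: str  # source text of the entity, excluding its name


def content_hash(entity):
    return hashlib.sha256(entity.body.encode()).hexdigest()


def match_entities(old, new, threshold=0.7):
    """Return (kind, old_id, new_id) tuples; threshold is an assumed cutoff."""
    changes = []
    old_ids = {e.id: e for e in old}
    new_ids = {e.id: e for e in new}

    # Phase 1: exact ID. Same ID on both sides, different body -> modified.
    for e in old:
        if e.id in new_ids and new_ids[e.id].body != e.body:
            changes.append(("modified", e.id, e.id))
    left_old = [e for e in old if e.id not in new_ids]
    left_new = [e for e in new if e.id not in old_ids]

    # Phase 2: content hash. Identical body under a new ID -> renamed.
    by_hash = {}
    for e in left_old:
        by_hash.setdefault(content_hash(e), []).append(e)
    unresolved = []
    for e in left_new:
        candidates = by_hash.get(content_hash(e))
        if candidates:
            src = candidates.pop(0)
            left_old.remove(src)
            changes.append(("renamed", src.id, e.id))
        else:
            unresolved.append(e)

    # Phase 3: fuzzy similarity. Close-enough body -> renamed, else added.
    for e in unresolved:
        best, best_ratio = None, 0.0
        for o in left_old:
            ratio = SequenceMatcher(None, o.body, e.body).ratio()
            if ratio > best_ratio:
                best, best_ratio = o, ratio
        if best is not None and best_ratio >= threshold:
            left_old.remove(best)
            changes.append(("renamed", best.id, e.id))
        else:
            changes.append(("added", None, e.id))

    # Anything on the old side that never matched was deleted.
    changes.extend(("deleted", e.id, None) for e in left_old)
    return changes


if __name__ == "__main__":
    old = [
        Entity("f", "return 1"),
        Entity("g", "return x * 2"),
        Entity("h", "total = sum(items)\nreturn total / len(items)"),
        Entity("gone", "pass"),
    ]
    new = [
        Entity("f", "return 2"),
        Entity("double", "return x * 2"),
        Entity("mean", "total = sum(values)\nreturn total / len(values)"),
        Entity("fresh", "raise NotImplementedError"),
    ]
    for change in match_entities(old, new):
        print(change)
```

Running the cheap checks first (ID, then hash) means the pairwise similarity pass only sees the entities that are still unexplained, which is presumably why a phased design is attractive for large commits.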
Output is JSON. Pipe it into your agent, CI, or whatever. git stays your source of truth - sem just reads from it.
Install: brew install sem-diff
Benchmark script + full results: https://github.com/Ataraxy-Labs/sem/blob/main/bench/agent-ac...