Your AI context files are either helping or hurting. Here's how to tell.

Two studies show context files can either improve outcomes or make them worse. The difference comes down to novelty, freshness, and delivery reliability.

March 3, 2026 · ContextBridge Team

Context files are now standard in AI coding workflows: AGENTS.md, CLAUDE.md, Cursor rules, and Copilot instructions.

The open question is no longer whether teams should have context files. It is whether the files they have are improving outcomes or quietly making them worse.

Recent data shows both outcomes are possible.

  • In AGENTbench, LLM-generated context files reduced success rates by about 2-3% and increased costs by 20-23%.
  • In Vercel’s Next.js 16 evals, an 8KB AGENTS.md docs index improved pass rate from 53% to 100%.

Those results are not contradictory. They point to a practical rule:

Context works when it is novel, current, and reliably consumed by the agent.

Two studies, one conclusion

AGENTbench: context can hurt when it is generic

The AGENTbench study from JetBrains and Constructor University evaluated 138 real GitHub tasks across 12 Python repositories. Agents were run with no context, LLM-generated context, and human-written context.

| Configuration | Success Rate Change | Cost Change | Avg. Extra Steps |
| --- | --- | --- | --- |
| Baseline (no context file) | | | |
| LLM-generated context | -2% to -3% | +20% to +23% | +2.45 to +3.92 |
| Human-written context | ~+4% | +20% to +23% | +2.45 to +3.92 |

Source: “Evaluating AGENTS.md,” arXiv 2602.11988, Feb 2026

Two important takeaways from AGENTbench:

  • Agents did follow context instructions.
  • Following poor or redundant instructions increased activity and cost without improving outcomes.

The study also removed repository docs and reran tests. In that setting, LLM-generated context improved performance (+2.7%), which strongly suggests the main issue was redundancy, not the existence of a context file.

Vercel: context can be decisive when delivery is reliable

On January 27, 2026, Vercel published eval results for Next.js 16 APIs that were not in training data. They compared multiple ways of giving the model missing knowledge.

| Configuration | Pass Rate | Change vs. Baseline |
| --- | --- | --- |
| Baseline (no docs) | 53% | |
| Skills retrieval (default) | 53% | +0pp |
| Skills with explicit instructions | 79% | +26pp |
| AGENTS.md docs index (8KB) | 100% | +47pp |

Source: Vercel Engineering Blog, Jan 27, 2026

In this case, passive context in AGENTS.md beat default on-demand skill retrieval. That is a delivery reliability result as much as a content result. Vercel also calls out an important rule for agent systems: prefer retrieval-led reasoning over pre-training-led reasoning.

The Context File Paradox

| What increases with context | What can worsen despite context |
| --- | --- |
| Instruction following | Success rate for LLM-generated context (-2% to -3%) |
| Tool usage and exploration (+2.45 to +3.92 steps) | Token/compute cost (+20% to +23%) |
| Agent activity that looks thorough | Signal quality when context duplicates existing docs |

Takeaway: More agent activity is not the same as better outcomes.

What actually makes context effective

The two studies together point to three conditions.

1) Novelty

Low-value context repeats what the agent can already infer from repo docs, config files, and code. High-value context adds codebase-specific constraints the model cannot infer reliably.

Use this quick test for every line: if removing this line would not change model behavior, it does not belong in your context file.

2) Minimality

Long files are not automatically better. Dense files with a small number of high-impact constraints usually perform better than broad files full of generic advice.

AGENTbench’s recommendation is explicit: “minimal requirements.” In practice, this means shorter context with higher signal density.

3) Freshness

A context file can start useful and become harmful as the codebase evolves.

A context file that references old architecture, deprecated APIs, or moved directories will be followed faithfully, and that is exactly why stale context creates repeat failures.

Stale context is not neutral. It is a reliable way to generate the wrong code faster.

A practical context standard

You do not need a perfect file. You need a maintainable standard:

  • Keep repo-specific constraints and non-obvious workflows.
  • Remove generic advice that duplicates existing docs.
  • Keep commands and paths current.
  • Promote repeated review feedback into durable rules.

The target is not completeness. The target is high signal.
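As an illustration only (this file and its contents are hypothetical, not taken from either study), a context file built to this standard might look like:

```markdown
# AGENTS.md — example for a hypothetical Python service repo

## Commands
- Unit tests: `make test`
- Integration tests: `make test-integration` (required before PRs that cross service boundaries)

## Constraints
- Service-layer functions return `Result[T, ErrorCode]`; never raise from services except startup/config failures.
- External API handlers use `@handle_api_errors` from `core/middleware.py`; never return raw exceptions from routes.

## Layout
- `services/` — business logic. `routes/` — HTTP handlers. `tests/integration/` — cross-service tests.
```

Note what is absent: no style guidance, no "write good tests" platitudes, nothing the agent could already infer from the repo.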

A compact example of low-value vs high-value rules:

| Low-value (generic) | High-value (repo-specific) |
| --- | --- |
| "Write comprehensive tests" | "When behavior crosses service boundaries, add an integration test in tests/integration/ and run make test-integration before opening a PR." |
| "Follow code style" | "Service-layer functions must return Result[T, ErrorCode]; never raise from services except startup/config boot failures." |
| "Handle errors gracefully" | "For external API handlers, always use @handle_api_errors from core/middleware.py; do not return raw exceptions from route functions." |
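To make the high-value rules concrete, here is a minimal sketch of what they might enforce in code. The `Result` type, `handle_api_errors` decorator, and error codes are hypothetical, invented to match the example rules above; they are not a real library API.

```python
# Hypothetical sketch matching the example rules above. Result, handle_api_errors,
# and the error codes are illustrative assumptions, not a real API.
from dataclasses import dataclass
from functools import wraps
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    """Service-layer return type: carries either a value or an error code."""
    value: Optional[T] = None
    error_code: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error_code is None


def handle_api_errors(fn):
    """Wrap a route handler so exceptions become Result errors, never raw tracebacks."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return Result(value=fn(*args, **kwargs))
        except ValueError:
            return Result(error_code="BAD_REQUEST")
        except Exception:
            return Result(error_code="INTERNAL")
    return wrapper


@handle_api_errors
def get_user(user_id: int) -> dict:
    if user_id < 0:
        raise ValueError("invalid id")
    return {"id": user_id}
```

The point of encoding a rule this specifically is that an agent can follow it mechanically; "handle errors gracefully" gives it nothing to act on.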

How to run this as a team process

Treat context maintenance the same way you treat test maintenance. Signals come in continuously from PR review comments, session-level guidance, and codebase changes. When the same correction appears more than once, convert it into a concrete context update and ship it through normal PR review. This keeps changes visible, reversible, and tied to real engineering feedback.
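The "same correction appears more than once" trigger can be sketched mechanically. This is an illustrative toy, not a real tool: the comment strings are invented, and exact-match normalization stands in for the fuzzier matching real review data would need.

```python
# Toy sketch of the repeat-signal rule: corrections that recur across PR
# reviews become candidates for a context-file update. Sample data and
# exact-match normalization are illustrative assumptions.
from collections import Counter


def promotion_candidates(review_comments, threshold=2):
    """Return corrections seen at least `threshold` times across reviews."""
    counts = Counter(c.strip().lower() for c in review_comments)
    return [text for text, n in counts.most_common() if n >= threshold]


comments = [
    "Use @handle_api_errors on route functions",
    "use @handle_api_errors on route functions",
    "Add an integration test for cross-service behavior",
    "Rename the helper for clarity",
]
# Only the correction that appeared twice is promoted:
# promotion_candidates(comments) → ["use @handle_api_errors on route functions"]
```

Each candidate then becomes a small, reviewable PR against the context file, rather than an ad hoc edit.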

The key is cadence. Small, frequent updates work better than occasional rewrites because they reduce staleness and make each change easier to evaluate. Over time, this builds a context file that reflects how the team actually ships code, not how it shipped six months ago.

You still have to measure

As we argued in SWE-bench is not your codebase, benchmark wins do not automatically transfer to your repositories. AGENTbench and Vercel provide strong direction, but you still need local proof. Track a compact scorecard: acceptance rate of context-update PRs, recurrence of repeated review corrections, and time-to-merge for AI-generated PRs before and after updates. We are also building codebase-level benchmarks so teams can measure this directly and know when context is helping versus hurting.
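The scorecard above can be computed from a handful of fields per PR. The record shape and field names below are assumptions for illustration; adapt them to whatever your PR tooling actually exports.

```python
# Minimal scorecard sketch for the three local metrics named above.
# The PR record shape and field names are illustrative assumptions.
def scorecard(prs):
    """Compute acceptance rate of context-update PRs, repeat-correction rate,
    and mean time-to-merge (days) for merged PRs."""
    context_prs = [p for p in prs if p["kind"] == "context-update"]
    accepted = sum(1 for p in context_prs if p["merged"])
    acceptance_rate = accepted / len(context_prs) if context_prs else 0.0

    repeats = sum(p["repeat_corrections"] for p in prs)
    total = sum(p["corrections"] for p in prs)
    recurrence = repeats / total if total else 0.0

    merged_days = [p["days_to_merge"] for p in prs if p["merged"]]
    mean_ttm = sum(merged_days) / len(merged_days) if merged_days else 0.0

    return {
        "acceptance_rate": acceptance_rate,
        "recurrence": recurrence,
        "mean_days_to_merge": mean_ttm,
    }
```

Tracked before and after each context update, these three numbers are enough to tell whether the file is earning its keep.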

Limits of the evidence

AGENTbench and Vercel provide strong directional evidence, but they are not universal guarantees for every stack and team. Use these results to set better defaults, then validate on your own repositories with your own evals and review data.

Check out ContextBot

Most teams do not have a context writing problem. They have a context maintenance and quality problem.

We are building tools that keep context files high quality and current by turning code drift and team feedback into reviewable updates.

If your team uses AI coding tools and wants context that stays useful over time, check out ContextBot, our GitHub app that keeps context files current with weekly, reviewable PRs.