In agent systems, context matters more than the prompt

A practical guide to what the planner actually sees, why context is more than the prompt, and why better context selection matters more than simply adding more tokens.

This is the second post in a series called “Agent systems in plain English,” where I try to make agent systems easier to reason about in practical terms. In this post, I focus on context: the full run input the planner sees when it is choosing the next step.

If the model is allowed to choose the next step, what exactly does it see before it chooses?

I think people often talk about “prompt engineering” as if the prompt were the whole object. It is not.

Anthropic’s September 29, 2025 post “Effective context engineering for AI agents” frames the distinction clearly: the planner does not see “the prompt.” It sees the whole token budget for that run. OpenClaw’s current context docs make the same point from the runtime side: context is everything the system sends to the model for a run, bounded by the context window, the finite token budget the model can attend to in a single call.

The planner’s input is the full run context. The prompt is only one slice of it.

If the previous post, Agents in plain English for developers, asked “who owns the next step?”, this post asks the follow-up question: how is that input assembled?

This post is specifically about tool-using agent systems where the model is choosing the next step.

Here, “planner” just means the model as used for next-step choice, and “runtime” means the host system that assembles context, exposes tools, and enforces limits.

Context is not just the prompt

At the API level, an LLM call is effectively stateless: across calls, the model knows only what the runtime sends for this run.

That usually includes more than people expect, including few-shot examples: small input/output demonstrations shown in the prompt.

Part of the run                           | Usually visible to you? | Counts toward the window?
System prompt                             | partly                  | yes
Few-shot examples                         | yes                     | yes
Conversation history                      | yes                     | yes
Tool calls and tool results               | yes                     | yes
Injected files or workspace snippets      | sometimes               | yes
Tool schemas                              | often not as plain text | yes
Provider-added wrappers or hidden headers | often not visible       | yes

OpenClaw’s context docs explicitly note that provider wrappers or hidden headers still count even when you do not see them directly. Invisible tokens still consume budget.

Prompting still matters. It is just one slice of the assembled input. Files, examples, tool definitions, tool results, and runtime-added wrappers all compete for the same finite window.
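To make that shared budget concrete, here is a minimal sketch. The part names mirror the table above, but the token counts and the window size are invented for illustration; real numbers depend on the model and the provider's tokenizer.

```python
# Hypothetical sketch: every part of the run draws from one finite budget.
WINDOW_TOKENS = 8_000  # assumed window size, for illustration only

parts = {
    "system_prompt": 350,
    "few_shot_examples": 600,
    "tool_schemas": 900,
    "conversation_history": 2_200,
    "injected_files": 1_800,
    "provider_wrappers": 150,  # invisible to you, still billed against the window
}

used = sum(parts.values())
print(f"used {used} of {WINDOW_TOKENS}; {WINDOW_TOKENS - used} left for the answer")
# → used 6000 of 8000; 2000 left for the answer
```

The arithmetic is trivial, which is the point: whatever the output needs to spend has to come out of what the inputs did not.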

Bigger context is not automatically better context

A common instinct here is: if some context helps, more context should help more. Sometimes that works briefly. Then it starts failing in familiar ways.

Too much context often creates at least four problems:

  • it dilutes the signal
  • it creates more irrelevant choices
  • it makes tool selection murkier
  • it raises the cost of every turn

The issue is not just token cost. In many cases, the more important cost is decision quality. If you give the planner three similar tools, one stale incident note, and an example that fits the wrong checkout failure, you are not really informing it. You are making action selection harder.

I think the better question is not “how much context can I fit?” but “what is the smallest useful context for this task?”

Context is not memory

Context is what is inside the model’s current window for this run. Memory is what you may store somewhere else and decide to reload later.

OpenClaw’s context docs are very clear on this distinction: memory may live on disk and be reloaded later; context is what is actually inside the model window now.

That distinction matters because “memory” often ends up meaning “we pasted old things back into the next run.” The result is not memory. It is clutter.

We will cover this topic in more detail later in another blog post. For now, a useful rule is simpler: if you cannot explain why a piece of information belongs in the current run, it probably does not belong there.
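One way to keep that rule honest is to make the reload step explicit. The sketch below is hypothetical (the agent_memory.json file and the remember/recall helpers are invented names, not part of any runtime): memory persists on disk, and only notes tagged as relevant to the current task re-enter the window.

```python
import json
from pathlib import Path

# Hypothetical sketch: memory lives on disk; context is only what we reload.
MEMORY_FILE = Path("agent_memory.json")
MEMORY_FILE.unlink(missing_ok=True)  # start clean so the demo is deterministic

def remember(note: str, tags: list[str]) -> None:
    # Persist the note outside the model window.
    notes = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    notes.append({"note": note, "tags": tags})
    MEMORY_FILE.write_text(json.dumps(notes))

def recall(tag: str) -> list[str]:
    # Only notes tagged as relevant re-enter the next run's context.
    if not MEMORY_FILE.exists():
        return []
    return [n["note"] for n in json.loads(MEMORY_FILE.read_text()) if tag in n["tags"]]

remember("2024 migration: blank checkout came from missing STRIPE_PUBLIC_KEY", ["migration"])
remember("iOS white screen traced to replace('/checkout') route swap", ["router"])
print(recall("router"))
```

Here only the router note is loaded back; the migration note stays on disk until a task actually asks for it. That selection step is the difference between memory and clutter.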

A minimal setup

If you need uv, PEP 723, or free API access routes, see the previous post, Agents in plain English for developers. For this post, the only thing you need to know is that the script below is ordinary Python 3.11+ and runs with uv run. If no API key is set, it still works locally because it prints the assembled contexts.

A small context builder

This example builds two contexts for the same task:

  • a lean context with the right files, one clean reference example, and two clearly scoped tools
  • a bloated context with a stale but tempting incident note, a misaligned example, and broader tools that invite the wrong move

The important part is not just that the second profile is longer. It gives the model more ways to get pulled off course. In this post, those tools are only definitions inside the assembled context. Nothing executes yet; the host-side tool loop comes later.

If you want the commands exactly as written, call this file context_builder.py:

#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# ///

from __future__ import annotations

import json

FILES = {
    "checkout_spec.txt": "Checkout must preserve cart state and open the payment sheet.",
    "router_notes.txt": "The iOS white screen started after replacing push_checkout_screen() with replace('/checkout'). Inspect route replacement and state handoff.",
    "migration_incident_2024.txt": "During the 2024 payment-provider migration, blank checkout screens usually came from missing STRIPE_PUBLIC_KEY. Search env wiring before UI routes.",
    "support_macros.txt": "Start by apologizing, asking for a screenshot, and suggesting a reinstall if needed.",
}

TOOLS = {
    "lean": [
        ("read_file", "Read one provided file by name."),
        ("finish_json", "Return only JSON with probable_root_cause and next_file_to_read."),
    ],
    "bloated": [
        ("read_file", "Read one provided file by name."),
        ("search_repo", "Search broadly for anything related to the symptom, env keys, or past fixes."),
        ("open_incident_log", "Open an older incident note that may describe a similar checkout failure."),
        ("finish_json", "Return only JSON with probable_root_cause and next_file_to_read."),
    ],
}

EXAMPLES = {
    "lean": [
        ("white screen on checkout after router rewrite", '{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}'),
    ],
    "bloated": [
        ("white screen on checkout after router rewrite", '{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}'),
        ("blank checkout after provider upgrade", '{"probable_root_cause":"missing payment provider key","next_file_to_read":"migration_incident_2024.txt"}'),
    ],
}


def build_context(task: str, profile: str) -> dict:
    # The lean profile injects only the two directly relevant files;
    # the bloated profile injects every file, including the stale ones.
    files = ["checkout_spec.txt", "router_notes.txt"] if profile == "lean" else list(FILES)
    tools = TOOLS[profile]
    examples = EXAMPLES[profile]
    # Assemble one flat prompt: instructions, task, tool list, examples, files.
    prompt = "\n\n".join([
        "You are a debugging assistant. Answer only from the provided context.",
        f"Task: {task}",
        "Available tools:\n" + "\n".join(f"- {name}: {desc}" for name, desc in tools),
        "Examples:\n" + "\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples),
        "Files:\n" + "\n".join(f"## {name}\n{FILES[name]}" for name in files),
    ])
    return {"profile": profile, "files": files, "tools": [name for name, _ in tools], "example_count": len(examples), "prompt": prompt}


task = "Explain the likely checkout bug and name the next file to inspect."
for profile in ("lean", "bloated"):
    payload = build_context(task, profile)
    print(json.dumps({
        "profile": payload["profile"],
        "files": payload["files"],
        "tools": payload["tools"],
        "example_count": payload["example_count"],
        "chars": len(payload["prompt"]),
        "prompt_preview": payload["prompt"][:220] + "...",
    }, indent=2))

Run it:

uv run context_builder.py

Expected output sketch:

{
  "profile": "lean",
  "files": ["checkout_spec.txt", "router_notes.txt"],
  "tools": ["read_file", "finish_json"],
  "example_count": 1,
  "chars": 706,
  "prompt_preview": "You are a debugging assistant..."
}
{
  "profile": "bloated",
  "files": ["checkout_spec.txt", "router_notes.txt", "migration_incident_2024.txt", "support_macros.txt"],
  "tools": ["read_file", "search_repo", "open_incident_log", "finish_json"],
  "example_count": 2,
  "chars": 1337,
  "prompt_preview": "You are a debugging assistant..."
}

Even a tiny tool schema (the structured contract the model sees for a tool) adds to the context:

{
  "name": "search_repo",
  "description": "Search broadly for anything related to the symptom, env keys, or past fixes.",
  "input_schema": {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"]
  }
}

The same logic applies to any skill text or runtime metadata the host injects. If it is in the run, it spends tokens.
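A rough way to see that spend is to measure the schema's serialized size. The ~4 characters per token heuristic below is only an approximation (approx_tokens is a made-up helper); real counts depend on the provider's tokenizer, but the order of magnitude is what matters.

```python
import json

# Hypothetical helper: rough token estimate using the common ~4 chars/token heuristic.
# Real counts depend on the provider's tokenizer; use this only to compare magnitudes.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

schema = {
    "name": "search_repo",
    "description": "Search broadly for anything related to the symptom, env keys, or past fixes.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

serialized = json.dumps(schema)
print(approx_tokens(serialized), "approx tokens for one small schema")
```

Multiply that by every tool you expose, on every turn, and the schemas alone become a standing tax on the window.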

Illustrative contrast, not guaranteed model output:

  • lean: {"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}
  • bloated: {"probable_root_cause":"missing payment provider key","next_file_to_read":"migration_incident_2024.txt"}

Notice that the wrong answer is not random. It follows the stale file and the misaligned example.

If you have an API key set, add a tiny model call and compare the answers:

import os
from urllib import request

if os.getenv("OPENROUTER_API_KEY"):
    for profile in ("lean", "bloated"):
        payload = build_context(task, profile)
        print(f"[{profile}]")
        req = request.Request(
            "https://openrouter.ai/api/v1/chat/completions",
            data=json.dumps({
                "model": "openrouter/free",
                "temperature": 0,
                "messages": [{"role": "user", "content": payload["prompt"]}],
            }).encode("utf-8"),
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}", "Content-Type": "application/json"},
        )
        with request.urlopen(req, timeout=60) as response:
            body = json.loads(response.read().decode("utf-8"))
        print(body["choices"][0]["message"]["content"].strip())

When I ran the fuller editor-side version of this example on March 28, 2026, this is what I got in one pass:

OpenRouter:

[lean]
{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}

[bloated]
{"probable_root_cause":"missing payment provider key","next_file_to_read":"migration_incident_2024.txt"}

Gemini:

[lean]
{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}

[bloated]
{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}

That result is useful. A stronger model such as Gemini can sometimes work through worse context, but that does not make the worse context harmless.

The bloated profile still costs more, still gives the model more plausible ways to go wrong, and still makes the host pay for more ambiguity. The point is not to prove that a good model will always collapse under bad context. The practical goal is to set the system up for success: better odds of the right move, lower token cost, and fewer chances to drift.

What to notice:

  • The task is the same in both profiles. What changes is the planner’s situation.
  • The lean profile gives the model two files, two clearly scoped tools, and one clean reference example.
  • The bloated profile adds a stale checkout incident, a second example that points toward the wrong failure mode, and broader tools that make search look attractive.
  • That is the real difference between helpful context and misleading context. The second profile is not just longer. It is more ambiguous.
  • Even toy tool descriptions cost tokens. The schema for each tool costs more.
  • In this entry, the tools are still just part of the input. The script is not executing read_file or search_repo; it is showing what the model would see.
  • OpenRouter drifted on the bloated profile in the run above. Gemini did not. In this example, both observations point in the same direction: it helps to avoid making the planner solve a harder problem than necessary.

The behavioral consequence is the useful part. In the lean profile, a very plausible first move is “read the note about route replacement.” The debugging task already points to a narrow route/state-handoff problem, so broad search is a detour, not a good first move. In the bloated profile, search_repo and the old migration incident can start looking reasonable even though they are worse bets.

How I would test this

I would run the same task against both profiles several times and compare:

  • whether the lean profile identifies route replacement or state handoff and names router_notes.txt first
  • whether the bloated profile drifts toward migration_incident_2024.txt, provider-key explanations, or generic search
  • how long and how generic the answer becomes
  • how much prompt size grew relative to the extra useful signal

If I were wiring this into a real loop, I would also compare first-tool choice and later trace length. The point is to test whether extra context improved the planner’s situation or just made it busier.
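A minimal way to run that comparison is to score each answer by the file it names. Everything below is a stand-in: the hard-coded answers simulate repeated model calls per profile, and classify is a hypothetical scorer, not part of the article's script.

```python
import json
from collections import Counter

# Hypothetical scorer: bucket each answer by which file it names next.
def classify(answer: str) -> str:
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return "unparseable"
    return data.get("next_file_to_read", "missing_field")

# Stand-in answers; in a real test these would come from repeated model calls.
runs = {
    "lean": ['{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}'] * 3,
    "bloated": [
        '{"probable_root_cause":"missing payment provider key","next_file_to_read":"migration_incident_2024.txt"}',
        '{"probable_root_cause":"router state handoff regression","next_file_to_read":"router_notes.txt"}',
        '{"probable_root_cause":"missing payment provider key","next_file_to_read":"migration_incident_2024.txt"}',
    ],
}

for profile, answers in runs.items():
    print(profile, Counter(classify(a) for a in answers))
```

A tally like this turns "the bloated profile feels worse" into a drift rate you can track as the context changes.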

One thing to change

Remove the clean reference example from the lean profile.

Then run the same task again and see whether the answer gets less structured or the next file becomes less stable. That is a cheap way to feel the difference between:

  • context that shows the output contract clearly
  • context that only describes the task and hopes for the best

You can also add another stale migration note to the bloated profile and watch how quickly “helpful background” becomes an attractive distraction.

What breaks first in practice

1. Too much context

This is the dilution problem.

Large tool sets, long examples, full repo dumps, and noisy logs all compete for the same window. Even harmless extras steal attention and budget. Anthropic’s September 29, 2025 article makes this point directly: context is finite, and the real job is curating the smallest useful set of tokens.

2. Wrong context

Wrong context is about misdirection.

The model can fail because the right file was absent, because a stale incident note looked plausible, because the wrong example dominated the pattern, or because broad tool descriptions pushed it toward the wrong action. From the outside, that can look like a reasoning failure, even when the problem started earlier in the assembled context.

3. Untrusted context

Any file, webpage, pasted log, or tool output you inject into the run becomes part of the planner’s situation.

That matters even before the full safety post. If the source is noisy, hostile, stale, or misleading, then the planner is being asked to reason on top of shaky input. A more useful response is to treat context ingestion like a boundary worth designing, not just something to throw into the run.

4. Confusing memory with current context

If every old result, every note, and every trace gets stuffed back into the current run, then memory stops helping and starts polluting.

This is one reason I like OpenClaw’s explicit separation between context and memory. It forces the question that should have been asked earlier: why is this in the current window at all?

Limits of this entry

This post did not try to teach retrieval systems, RAG ranking, long-horizon memory, or context compaction in depth.

It also stayed on static context assembly on purpose. In real agentic systems, context changes across steps, tool calls, and retries, and that dynamic management deserves its own treatment later in the series.

The first task here was narrower: make context bigger than “the prompt,” show that context assembly is an engineering decision, and make the testing question explicit.

A separate question, which I will leave for another post, is when the developer should own the next step instead of the model.
