Context management is a core engineering problem for any reliable agent harness. At Cotool, our users rely on us to automatically solve this for their custom agentic security flows. These flows involve autonomously querying SIEMs, analyzing logs, and following investigation paths. However, users build the agents themselves: they can choose any model, write custom system prompts, and connect to security tools through our in-house connector platform, so we need to manage context across all of it.

The Problem: the LLM decides what data to retrieve and when. The agents query SIEMs, retrieve logs, read detection-as-code, and explore threat intelligence feeds to bring that data into context dynamically.

We thought context management would be a solved problem. Traditional patterns like RAG, chunking, and sliding windows work well for document Q&A and code generation. They don't work for security investigations. In this post, we'll walk through why security operations tasks are fundamentally different, the solutions we tried, and what we eventually built.

The Security Data Problem

Security tools are verbose. Extremely verbose.

Most security workflows center around log aggregators and SIEMs. These systems are designed to capture everything because security is fundamentally about anomaly detection: finding needles in haystacks. That suspicious PowerShell command buried in line 847 of a 5,000-line log dump? That could be your incident.

This creates a paradox for context management: you can't aggressively prune data before it enters the context window because you don't know what's anomalous yet. Any line, any field, any timestamp could be the critical piece of evidence. Traditional chunking and semantic search fall flat here; security logs are typically structured, and the anomaly is only meaningful in the context of surrounding logs.

We've had some success giving LLMs access to pruning tools: direct SQL-like SIEM queries, grep utilities, and filtering capabilities. But with long enough agent loops tackling complex investigations, you still eventually hit the context ceiling.
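
As a flavor of what those pruning tools look like, here's a rough sketch of a grep-style filter an agent can call before raw logs ever enter the context window (the function name, parameters, and line cap are illustrative, not our production connector code):

// Illustrative grep-style tool body: filter raw log output before it enters context.
// The name, parameters, and the 200-line cap are hypothetical.
interface GrepLogsInput {
  logs: string;        // raw tool output, e.g. a SIEM query result
  pattern: string;     // regex supplied by the model
  maxLines?: number;   // hard cap so the filtered result can't blow up context either
}

function grepLogs({ logs, pattern, maxLines = 200 }: GrepLogsInput): string {
  const regex = new RegExp(pattern, 'i');
  const matches = logs.split('\n').filter(line => regex.test(line));
  const omitted = Math.max(0, matches.length - maxLines);
  return [
    ...matches.slice(0, maxLines),
    ...(omitted > 0 ? [`[${omitted} additional matching lines omitted]`] : []),
  ].join('\n');
}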

Our Requirements

Based on our use case, we established three hard requirements:

Speed: Our agents need to respond quickly. We have interactive flows where high latency breaks the user experience. Any context management solution that adds seconds to every agent turn is a non-starter.

Minimal Compression: While context rot exists, our hypothesis is that models perform best on time-based correlation and anomaly detection when they can access as much tool output as possible. We only wanted to prune or compact when absolutely necessary.

Zero Context Errors: Running out of context mid-investigation is an unacceptable failure mode.

Measuring Context: Harder Than It Looks

To manage context, you first need to measure it accurately. But here's the constraint: you need a measurement system that always overestimates (to avoid context errors) but doesn't overestimate by too much (to avoid unnecessary pruning). We set a ceiling of 110% of the actual count, i.e., never overestimate by more than 10%.
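
Expressed as a predicate, that band looks like this (isAcceptableEstimate is illustrative shorthand, not actual code from our harness):

// Acceptance band for a token estimate: never below the true count (which risks
// context errors), and never more than 110% of it (which forces unnecessary pruning).
function isAcceptableEstimate(estimate: number, actual: number): boolean {
  return estimate >= actual && estimate <= actual * 1.1;
}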

Cotool supports all the major LLM providers, but this introduces a challenge: each model provider uses a different (and in some cases, black-box and proprietary) tokenizer. In practice, token lengths vary dramatically between providers for the same messages, so your estimator needs to be provider-aware.

What We Tried First

Generic Character Estimation: We tried JSON.stringify(messages).length / 4. This was either a massive overestimate or underestimate depending on the content mix (which we get into later); useless in practice.

OpenAI + Tiktoken: We started with Tiktoken, OpenAI's open-source encoder. On an M4 MacBook Pro, encoding a 673K token conversation takes ~368ms. Not heinous, but not particularly latency or resource-efficient either.

On our production GCP Cloud Run instances (CPU-only, serverless), the numbers were significantly worse: 3.3 seconds to encode 678K tokens. That's significant additional latency for interactive UX, on top of the provider's time to first token.

// Tiktoken encoding example
import { encoding_for_model } from 'tiktoken';
const enc = encoding_for_model('gpt-4');
const start = performance.now();
const tokens = enc.encode(JSON.stringify(conversation));
const end = performance.now();
console.log(`Encoded ${tokens.length} tokens in ${(end - start).toFixed(0)}ms`);
// Output on M4 MacBook Pro: Encoded 673,293 tokens in 368ms
// Output on GCP Cloud Run:  Encoded 678,227 tokens in 3,309ms
// 10-turn agent loop on Cloud Run: 9.8s of pure encoding overhead

More importantly, from first principles, this approach is heavy: do we need to fully tokenize the entire conversation repeatedly just to estimate token counts? And critically, this only solves the problem for OpenAI. We still needed solutions for Anthropic and Google, for example.

Provider Token APIs: Anthropic and Google both offer endpoints that return token counts. These were vulnerable to rate limits and still introduced unacceptable latency for multi-turn conversations. It also left us without a paddle for OpenAI.
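
For reference, the Anthropic variant looks roughly like this: accurate, but a full network round trip (plus rate-limit exposure) on every measurement. A sketch using the official SDK's count-tokens endpoint, with an illustrative model name:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Accurate but slow: one extra API call per measurement, per turn.
async function countTokensViaApi(messages: Anthropic.Messages.MessageParam[]): Promise<number> {
  const response = await anthropic.messages.countTokens({
    model: 'claude-sonnet-4-5', // illustrative model name
    messages,
  });
  return response.input_tokens;
}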

What Actually Worked

We built a custom token estimator using three key insights:

  1. Encoders behave differently for different input types. JSON payloads encode differently than files, which encode differently than plain text.

  2. Token counts are additive across messages: estimateTokens(m1, m2, m3) = estimateTokens(m1) + estimateTokens(m2) + estimateTokens(m3) (see the sketch after this list).

  3. LLMs can "learn" estimators through test-driven iteration.
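
Insight 2 is what keeps measurement cheap inside an agent loop: because estimates are additive, the conversation total is just a sum of per-message estimates, so each new turn only costs one more cheap estimate rather than re-encoding everything. A minimal sketch (estimateMessageTokens stands in for whichever per-provider estimator applies):

// Additivity in practice: the conversation estimate is the sum of per-message
// estimates, so a new message only adds one cheap estimate to the running total.
// ModelMessage is the same AI SDK message type used in the estimator below.
function estimateConversationTokens(
  messages: ModelMessage[],
  estimateMessageTokens: (message: ModelMessage) => number
): number {
  return messages.reduce((total, message) => total + estimateMessageTokens(message), 0);
}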

Here's what we did:

  1. Built a labeled dataset by generating diverse conversations across model providers and capturing actual token counts from API responses (see the capture sketch after this list).

  2. Initialized base weights for each message type in per-provider estimation functions, starting from

    JSON.stringify(m).length / seed_weight.
    // Weights (textWeight, toolResultWeight, etc.) are per-content-type constants
    // tuned against real provider token counts; see steps 3 and 4 below.
    function estimateAnthropicTokens(messages: ModelMessage[]): number {
        // Convert AI SDK messages into Anthropic's wire format before measuring,
        // since that's the shape the provider actually tokenizes.
        const anthropicMessages = convertToAnthropicMessagesPrompt({ prompt: messages }).messages;
        return anthropicMessages.map(message => {
            const contentLength = Array.isArray(message.content)
                ? message.content.map(block => {
                    if (block.type === 'text') {
                        return JSON.stringify(block).length / textWeight;
                    } else if (block.type === 'thinking') {
                        return JSON.stringify(block).length / thinkingWeight;
                    } else if (block.type === 'redacted_thinking') {
                        return JSON.stringify(block).length / redactedThinkingWeight;
                    } else if (block.type === 'tool_use') {
                        return JSON.stringify(block).length / toolUseWeight;
                    } else if (block.type === 'tool_result') {
                        return JSON.stringify(block).length / toolResultWeight;
                    }
                    // Conservative flat estimate for unrecognized block types
                    return 1600;
                }).reduce((l, r) => l + r, 0)
                : message.content.length / defaultWeight;
            // Count the message envelope (role, metadata) separately from its content
            return JSON.stringify({ ...message, content: undefined }).length + contentLength;
        }).reduce((l, r) => l + r, 0);
    }
  3. Created unit tests with real messages and their known token counts.

    expect(estimateTokens(m1)).toBeGreaterThan(actualCount);
    expect(estimateTokens(m1)).toBeLessThanOrEqual(actualCount * 1.1);
  4. Used Claude Code to iteratively adjust the estimation function until tests passed within our threshold.
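
For step 1, the capture shape is simple: run a generated conversation through a provider, then store the exact messages alongside the provider-reported input token count. A sketch (the type and field names are illustrative; the actual count comes from the provider's usage object, e.g. usage.input_tokens on Anthropic responses):

// One labeled example for the estimator dataset: the messages we sent, plus the
// provider's authoritative input-token count from the API response usage object.
// The shape and field names here are illustrative.
interface TokenCountSample {
  provider: 'anthropic' | 'openai' | 'google';
  messages: ModelMessage[];
  actualInputTokens: number;
}

function captureAnthropicSample(
  messages: ModelMessage[],
  usage: { input_tokens: number }
): TokenCountSample {
  return { provider: 'anthropic', messages, actualInputTokens: usage.input_tokens };
}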

The result: a performant, interpretable function that tracks provider token counts to within our 110% ceiling on the distribution of data we care about, with latency bound by computing the lengths of serialized JSON strings. We also considered fitting the weights with linear regression over the dataset, but we were interested in pushing the coding agent to optimize the entire estimation function.

Windowing Techniques: What We Learned the Hard Way

Attempt 1: Large Context Models

Initially, we just used large context window models and hoped for the best. This worked fine when use cases were simple. As investigations got more complex, requiring deeper log analysis and more tool calls, we started hitting limits even on million-token models like Gemini 2.5 Pro.

Attempt 2: Sliding Window

We implemented a sliding window that removed old messages when context exceeded limits. This somewhat worked, but Anthropic broke everything.

Problem 1: Anthropic's Reasoning Requirements

Anthropic requires that assistant messages include reasoning content (thinking blocks) when the model supports extended thinking. If you remove an old assistant message that had reasoning but keep a newer one without it, the API throws an error. You can’t insert fake reasoning messages because Anthropic signs their reasoning messages to prevent mutating the chain-of-thought.

// This breaks with Anthropic
const messages = [
  { role: 'user', content: 'Investigate this alert' },
  { 
    role: 'assistant', 
    content: [
      { type: 'thinking', thinking: 'I need to query the SIEM first...' },
      { type: 'text', text: "I'll check the logs" }
    ]
  },
  // ... more messages ...
  { role: 'user', content: '[Content pruned to save context space]' },
  {
    role: 'assistant',
    content: [
      { type: 'text', text: 'Based on the logs...' }
      // ❌ Missing thinking block - Anthropic API error
    ]
  }
];
// After removing old messages with a sliding window, you need to track
// which assistant messages had reasoning to maintain proper structure

Problem 2: AI SDK Tool Call/Result Pairing

The AI SDK enforces strict pairing between tool calls and tool results. You can't have an assistant message with a tool-call without the corresponding tool message containing the tool-result, and vice versa.

// Example: conversation before sliding window
const messages = [
  { role: 'user', content: 'Search for suspicious IPs' },
  {
    role: 'assistant',
    content: [
      { type: 'text', text: 'Querying SIEM...' },
      {
        type: 'tool-call',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        args: { query: 'SELECT * FROM logs WHERE...' }
      }
    ]
  },
  {
    role: 'tool',
    content: [
      {
        type: 'tool-result',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        result: '... 50,000 lines of logs ...'  // 200K tokens
      }
    ]
  },
  { role: 'assistant', content: 'I found 3 suspicious IPs...' }
];
// If the tool result is too large, you have two bad options:
// 1. Remove both the tool-call AND tool-result (lose critical context)
// 2. Keep both and exceed context limits
// You can't do this - AI SDK will error:
const brokenMessages = [
  { role: 'user', content: 'Search for suspicious IPs' },
  {
    role: 'assistant',
    content: [
      { type: 'text', text: 'Querying SIEM...' },
      {
        type: 'tool-call',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        args: { query: 'SELECT * FROM logs WHERE...' }
      }
    ]
  },
  // ❌ Missing tool result - AI SDK throws "tool call without result" error
  // Sentinel message to let the model know something's missing
  { role: 'user', content: '[Content pruned to save context space]' },
  { role: 'assistant', content: 'I found 3 suspicious IPs...' }
];

These two constraints don't prevent a sliding window approach, but they force you to maintain model-specific state and potentially over-prune messages to stay compliant with the Anthropic and AI SDK APIs. Rather than maintain a complex state machine just to achieve a sliding window, we suspected there was an easier way that keeps the entire conversation history in context.

Attempt 3: Tool Result Pruning (Current Solution)

Our current approach keeps all messages in order but truncates their content strategically.

We score each message based on age and size, prioritizing tool results first. Starting with the oldest, largest tool results, we prune content until we're back under the context limit. If tool results aren't enough, we continue with other message classes.

This preserves conversation structure while intelligently removing the least critical data. It's not perfect, but it works reliably across providers and maintains the context integrity that Anthropic and the AI SDK demand.

How It Works

Instead of removing entire messages, we replace oversized content with placeholders while preserving message structure:

// Same conversation that broke with sliding window
const messages = [
  { role: 'user', content: 'Search for suspicious IPs' },
  {
    role: 'assistant',
    content: [
      { type: 'text', text: 'Querying SIEM...' },
      {
        type: 'tool-call',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        args: { query: 'SELECT * FROM logs WHERE...' }
      }
    ]
  },
  {
    role: 'tool',
    content: [
      {
        type: 'tool-result',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        result: '... 50,000 lines of logs ...'  // 200K tokens
      }
    ]
  },
  { role: 'assistant', content: 'I found 3 suspicious IPs...' }
];
// After pruning: structure intact, content truncated
const prunedMessages = [
  { role: 'user', content: 'Search for suspicious IPs' },
  {
    role: 'assistant',
    content: [
      { type: 'text', text: 'Querying SIEM...' },
      {
        type: 'tool-call',
        toolCallId: 'call_123',
        toolName: 'query_siem',
        args: { query: 'SELECT * FROM logs WHERE...' }
      }
    ]
  },
  {
    role: 'tool',
    content: [
      {
        type: 'tool-result',
        toolCallId: 'call_123',  // ✅ Still paired correctly
        toolName: 'query_siem',
        result: '[Tool result truncated to make space in the context window]'  // 20 tokens
      }
    ]
  },
  { role: 'assistant', content: 'I found 3 suspicious IPs...' }
];
// ✅ AI SDK happy: tool-call has matching tool-result
// ✅ Anthropic happy: all message structure preserved
// ✅ 180K tokens saved without losing conversation flow

The pruning happens in phases:

  1. Tool result outputs (oldest, largest first) - except the most recent

  2. Assistant tool-call inputs - except the latest assistant message

  3. Last tool result - if still over budget

  4. Assistant text content - except latest assistant

  5. User content - except first and last user messages

Each phase exits early if we're back under budget, ensuring minimal pruning.
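
Here's a condensed sketch of the first phase, using the same message shapes as the examples above (ordered by age only, for brevity; the real scorer also weighs size, and this helper is illustrative, not our production pruner):

const PLACEHOLDER = '[Tool result truncated to make space in the context window]';

// Minimal shapes matching the examples above (not the full AI SDK types).
type ContentPart = { type: string; [key: string]: unknown };
type Message = { role: 'user' | 'assistant' | 'tool'; content: string | ContentPart[] };

// Phase 1: truncate tool results, oldest first, sparing the most recent one,
// and stop as soon as the conversation estimate fits the budget.
function pruneToolResults(
  messages: Message[],
  budget: number,
  estimate: (messages: Message[]) => number
): Message[] {
  const pruned: Message[] = structuredClone(messages);
  const toolIndexes = pruned
    .flatMap((message, i) => (message.role === 'tool' ? [i] : []))
    .slice(0, -1); // always spare the most recent tool result

  for (const i of toolIndexes) {
    if (estimate(pruned) <= budget) break; // early exit: prune as little as possible
    const message = pruned[i];
    if (Array.isArray(message.content)) {
      // Keep toolCallId/toolName so tool-call/tool-result pairing stays valid;
      // only the bulky output is replaced.
      message.content = message.content.map(part =>
        part.type === 'tool-result' ? { ...part, result: PLACEHOLDER } : part
      );
    }
  }
  return pruned;
}

The later phases apply the same replace-in-place pattern to assistant tool-call inputs, assistant text, and finally user content, in that order.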

Our Learnings

Context management is still unsolved for security AI. The patterns that work for document Q&A and code generation don't translate to security workflows.

Security logs are fundamentally different. They're not semantic, they're anomaly-focused, and you need surrounding context to derive meaning. RAG and chunking don't help when the needle could be anywhere in the haystack.

Generic agents are harder to manage. Giving control to the LLM produces better investigations but creates arbitrary conversation lengths with unpredictable tool usage. You can't pre-optimize for specific flows.

LLM-assisted optimization is powerful. Using an LLM coding tool to hill-climb toward an optimal solution through test cases proved surprisingly effective for hairy estimation problems. We'll definitely be applying this pattern elsewhere.

If you're building AI agents for security, or any domain with verbose, anomaly-driven data, traditional context management won't cut it. You need to measure fast, prune conservatively, and design for the fact that the LLM is in control, not you.

Cotool is a platform for building custom security AI agents. If you're dealing with alert fatigue and want to explore agentic automation, reach out.


Glossary

AI SDK: A software development kit (in this case, from Vercel) that provides standardized interfaces for working with different LLM providers and managing conversation flows, including tool calls and responses.

Chain-of-Thought: A reasoning technique where models explicitly show their thinking process step-by-step before arriving at conclusions, improving logical reasoning and transparency.

Chunking: The practice of breaking large documents or data into smaller pieces that fit within context limits, typically used in RAG systems.

Context Window: The maximum amount of text (measured in tokens) that an LLM can process in a single request, including both input and output. Different models have different limits (e.g., 128K, 1M tokens).

Context Rot: The degradation in model performance that occurs when context windows become too full, causing the model to lose track of earlier information or make errors.

Extended Thinking / Reasoning Blocks: Features in some models (like Anthropic's Claude) where the model performs internal reasoning that can be preserved in the conversation but may have special requirements for message structure.

RAG (Retrieval-Augmented Generation): A technique where relevant information is retrieved from a knowledge base and injected into an LLM's context to answer questions, typically using semantic search.

Semantic Search: A search technique that finds information based on meaning and context rather than exact keyword matching, often using embeddings to compare similarity.

Sliding Window: A context management technique that maintains a fixed-size "window" of recent messages by removing older messages as new ones are added.

Time to First Token (TTFT): The latency between sending a request to an LLM and receiving the first token of its response, a key metric for user experience.

Token / Tokenization: The process of breaking text into smaller units (tokens) that LLMs process. A token is roughly 4 characters or ¾ of a word. Token count determines context usage and API costs.

Tokenizer: The algorithm or tool that converts text into tokens. Different model providers use different tokenizers (e.g., Tiktoken for OpenAI), which can produce different token counts for the same text.

Tool Call: When an LLM requests to execute a specific function or API (like querying a database), including the function name and arguments. The result comes back as a "tool result."

Logan Carmody

CTO
