Claude Memory Strategies

Select the right context management strategy for your Claude application. Compare approaches, estimate token savings, and see implementation patterns.

Choose a Memory Strategy

Sliding Window: Keep N recent messages, drop oldest
Summary Compression: Condense old turns into summaries
RAG Retrieval: Retrieve relevant context per query
Hybrid: Summary + RAG + sliding window

Context Window Usage

How the selected strategy fills the context window at the configured conversation length:

Legend: System | Summary/RAG | Recent Messages | Free

Strategy Comparison

Strategy | Tokens Used | Monthly Cost | Info Retained | Complexity

Token Optimization Tips

1. Compress early, compress often. Do not wait until the context window is full to start summarizing. Trigger compression when the conversation reaches 60% of the window. This leaves headroom for large user inputs and prevents the abrupt quality drop that occurs when you hit the limit and must emergency-truncate.
2. Keep the system prompt under 500 tokens. The system prompt is a fixed cost on every request because it is included in every single API call. A 2,000-token system prompt across a 50-turn conversation adds up to 100,000 input tokens for the system prompt alone. Compress it to 500 tokens and that drops to 25,000, saving 75,000 tokens per conversation.
3. Use structured summaries, not prose. When compressing conversation history, use bullet points or key-value pairs instead of prose paragraphs. Structured summaries are often 40-60% shorter than prose summaries while preserving more actionable information. They are also easier for the model to parse on subsequent turns.
4. Cache system prompts with Anthropic's prompt caching. Cached reads are billed at about 10% of the base input price, so the system prompt costs roughly 90% less on subsequent requests that hit the cache. Cache writes carry a surcharge, entries expire after a short idle period (five minutes by default), and prompts below the model's minimum cacheable length are not cached, so the saving is largest for long, stable prefixes reused within a conversation. For applications with 100+ daily conversations, this is often the single largest cost saver.
5. Separate facts from conversation flow. Extract key facts, decisions, and user preferences into a structured memory store separate from the conversation transcript. Include only the relevant facts in each request rather than the full transcript. A user's name, preferences, and past decisions rarely need the full conversation context around them.
6. Benchmark with real conversations. Test your memory strategy against actual user conversation logs, not synthetic examples. Real conversations have irregular turn lengths, topic changes, and reference patterns that synthetic tests miss. Measure both token cost and answer quality to find the optimal compression ratio for your use case.
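
Tips 4 and 5 can be combined when assembling a request. The following is a minimal sketch, not a definitive implementation: `build_request` and its parameters are hypothetical names, the `cache_control` field follows the shape Anthropic documents for prompt caching, and required call arguments such as the model name and `max_tokens` are left out.

```python
from typing import Any


def build_request(system_prompt: str, facts: dict[str, str],
                  recent_messages: list[dict[str, str]]) -> dict[str, Any]:
    """Build the system and messages arguments for a Messages API call.

    The stable system prompt is marked cacheable. Stored facts are sent as a
    compact key-value block instead of the transcript they came from.
    """
    facts_block = "\n".join(
        f"- {key}: {value}" for key, value in sorted(facts.items())
    )
    return {
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: the prefix up to this block can be reused.
                "cache_control": {"type": "ephemeral"},
            },
            # Facts change over time, so they sit after the cached prefix.
            {"type": "text", "text": f"Known facts about the user:\n{facts_block}"},
        ],
        "messages": recent_messages,
    }
```

Placing the facts block after the cached system prompt keeps the cached prefix identical from turn to turn even when the fact store is updated.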

How the Memory Strategy Selector Works

The Memory Strategy Selector is a browser-based planning tool for choosing and configuring context management strategies in Claude applications. Every conversational AI application eventually hits the context window limit. A 200,000-token window sounds enormous, but a conversation averaging 800 tokens per turn fills it in just 250 turns. Many production chatbots handle conversations lasting hundreds of turns over days or weeks. Without a memory strategy, these conversations silently lose information as older messages are truncated to fit new ones. This tool helps you choose the right strategy before your users start complaining that the assistant forgot what they said five minutes ago.

Select a memory strategy from the four options to see how it handles your conversation parameters. Adjust the conversation length, tokens per turn, and system prompt size to match your actual application. The visual context window diagram shows exactly how the strategy allocates tokens across system prompt, compressed memory, recent messages, and free space. The comparison table lets you see all four strategies side by side with token usage, monthly cost, information retention quality, and implementation complexity.

Understanding the Four Memory Strategies

The sliding window strategy is the simplest and most common approach in production. It keeps the N most recent conversation turns and discards everything older. Implementation requires just a few lines of code that slice the message array to a maximum length before each API call. The advantage is zero additional API calls for memory management and perfectly preserved recent context. The disadvantage is binary information loss: everything outside the window is completely gone. The sliding window works well for task-focused conversations like coding sessions, customer support tickets, and short interactions where historical context is less important than the current task.
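
The slice described above might look like the following sketch. The function name and the default of 20 messages are illustrative choices, and the sketch assumes the trimmed conversation should still begin with a user message.

```python
def sliding_window(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Return the most recent max_messages messages, dropping older ones."""
    if max_messages <= 0:
        return []
    window = messages[-max_messages:]
    # Assumption: the conversation should start with a user message, so a
    # leading assistant message left behind by the cut is dropped as well.
    if window and window[0]["role"] == "assistant":
        window = window[1:]
    return window
```

Call it on the full message list immediately before each API request; the stored transcript itself does not need to be modified.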

Summary compression periodically condenses older conversation turns into a compact summary. When the conversation reaches a threshold, you make an API call to Claude asking it to summarize messages 1 through N into a few hundred tokens. That summary replaces the original messages, and subsequent turns include the summary plus the most recent raw messages. This approach preserves key facts and decisions from the entire conversation while using a fraction of the tokens. The trade-off is lossy compression: subtle nuances, exact phrasings, and minor details are lost. Summary compression works best for advisory conversations, project discussions, and any interaction where remembering key decisions matters more than preserving exact wording.
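
A rough sketch of that replacement step follows. The `summarize` argument is a hypothetical callable standing in for the extra API call to Claude, and the threshold of 50 messages and the 10 retained messages are illustrative defaults rather than recommendations.

```python
from typing import Callable, Optional

Message = dict[str, str]


def compress_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    keep_recent: int = 10,
    threshold: int = 50,
) -> tuple[Optional[str], list[Message]]:
    """Split history into (summary of old turns, recent raw messages).

    Returns (None, messages) unchanged while the conversation is still
    below the threshold, so no summarization call is spent on short chats.
    """
    if keep_recent <= 0 or len(messages) < threshold:
        return None, messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # In a real application this is where the extra API call happens.
    summary = summarize(old)
    return summary, recent
```

Returning the summary separately lets the caller decide where it goes, for example appended to the system prompt, without disturbing the user/assistant order of the recent messages.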

RAG (Retrieval-Augmented Generation) is fundamentally different from the other strategies because it does not try to keep conversation history in the context window. Instead, it stores conversation content and external knowledge in a vector database. On each turn, the user's message is used to retrieve the most relevant chunks from the database, and only those chunks are injected into the context window. RAG excels at knowledge-intensive applications where the relevant information for any given query is a tiny fraction of the total knowledge base. Customer support bots that need to reference thousands of help articles, internal assistants that span company documentation, and research assistants over large paper collections all benefit from RAG.
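
The retrieval step can be illustrated with a toy in-memory version. This is only a sketch: the bag-of-words `embed` function is a stand-in for a real embedding model, the plain list of chunks is a stand-in for a vector database, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]
```

Only the chunks returned by `retrieve` are placed in the context window; everything else stays in storage.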

The hybrid strategy combines all three approaches for maximum flexibility. It maintains a rolling summary of the full conversation history, uses RAG to retrieve relevant external knowledge for each query, and keeps the most recent messages in a sliding window. The context window on each turn contains the system prompt, the conversation summary, retrieved document chunks, and the last 5 to 10 raw messages. This gives Claude long-term memory through the summary, domain knowledge through RAG, and immediate conversational context through the sliding window. The complexity cost is real, requiring infrastructure for vector storage, summarization scheduling, and context assembly, but the result is the most capable of the four strategies within a fixed token budget.
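
Context assembly for the hybrid strategy could be sketched as below. The function name, the section labels, and the default of 8 recent messages are assumptions for illustration; the summary and retrieved chunks are expected to come from whatever summarizer and retriever the application already has.

```python
def assemble_context(
    system_prompt: str,
    summary: str,
    retrieved_chunks: list[str],
    messages: list[dict[str, str]],
    recent_count: int = 8,
) -> dict:
    """Combine system prompt, rolling summary, RAG chunks, and recent turns."""
    system_parts = [system_prompt]
    if summary:
        system_parts.append("Conversation summary:\n" + summary)
    if retrieved_chunks:
        system_parts.append("Retrieved documents:\n" + "\n---\n".join(retrieved_chunks))
    # Sliding window over the raw messages; older turns live in the summary.
    recent = messages[-recent_count:] if recent_count > 0 else []
    return {"system": "\n\n".join(system_parts), "messages": recent}
```

Keeping the summary and retrieved chunks in the system text leaves the message list as an unbroken run of recent turns.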

Token Growth Patterns and Cost Impact

Without any memory strategy, the input size of each request grows linearly with conversation length, and the cumulative total therefore grows quadratically. A 50-turn conversation at 800 tokens per turn consumes 40,000 input tokens on the final turn alone, because the entire history is sent with each request. Summed across all 50 turns, the input comes to approximately 1 million tokens. At $3 per million input tokens with Claude 3.5 Sonnet, that is about $3 per conversation. At 100 conversations per day, monthly cost is approximately $9,000 just for input tokens. These numbers surprise most developers because they forget that the full history is resent on every single turn.
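
The arithmetic can be checked with a short sketch. It assumes every turn adds exactly 800 tokens, ignores the system prompt and output tokens, and uses a 30-day month; the function names are illustrative.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when the full history is resent on every turn."""
    # Turn k sends k * tokens_per_turn tokens, so the total is a triangular sum.
    return tokens_per_turn * turns * (turns + 1) // 2


def monthly_cost(tokens_per_conversation: int, conversations_per_day: int,
                 price_per_million: float = 3.0, days: int = 30) -> float:
    """Monthly input cost in dollars at the given per-million-token price."""
    per_conversation = tokens_per_conversation / 1_000_000 * price_per_million
    return per_conversation * conversations_per_day * days


final_turn = 50 * 800                      # 40000 tokens on the last request
total = cumulative_input_tokens(50, 800)   # 1020000 tokens across all turns
cost = monthly_cost(total, 100)            # roughly 9,180 dollars per month
```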

A sliding window capped at 20 turns changes the economics substantially. After the conversation exceeds 20 turns, input token usage plateaus at approximately 16,000 tokens per turn (20 turns times 800 tokens) plus the system prompt. Total input across a 50-turn conversation drops to roughly 650,000 tokens: about 170,000 for the first 20 turns while the window fills, plus 30 turns at 16,000 each. Monthly cost at the same volume drops to approximately $5,800, a reduction of about 35%. The information loss from turns 1 through 30 is complete, but for many applications this is an acceptable trade-off.
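
Under the same simplifying assumptions (exactly 800 tokens per turn, system prompt excluded), the windowed total can be computed directly. The function name is illustrative.

```python
def windowed_input_tokens(turns: int, tokens_per_turn: int, window: int) -> int:
    """Total input tokens when each request carries at most `window` turns."""
    # Until the window fills, turn k sends k turns; afterwards it sends `window`.
    return sum(min(k, window) * tokens_per_turn for k in range(1, turns + 1))


plateau = 20 * 800                                   # 16000 tokens per request
windowed_total = windowed_input_tokens(50, 800, 20)  # 648000 tokens
```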

Summary compression offers the best balance for most applications. Compressing turns 1 through 40 into a 400-token summary means the final turn's context contains 400 tokens of summary plus 8,000 tokens of recent messages (10 turns) plus the system prompt. Total input tokens across all turns is roughly 350,000. Monthly cost drops to about $3,200. The summarization calls add approximately 10% overhead, but the net saving is still around 60% compared to no strategy. More importantly, Claude retains awareness of the full conversation through the summary, which prevents the jarring "I forgot what you said earlier" experience that degrades user trust.
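
The exact total depends on how often the summary is refreshed. As one simple model, assume that once the conversation passes 10 turns every request carries a 400-token summary plus the 10 most recent turns, and that summarization calls add a flat 10% on top. The sketch below uses those assumptions and lands in the same range as the rough figures above; the function name is illustrative.

```python
def summarized_input_tokens(turns: int, tokens_per_turn: int,
                            keep_recent: int, summary_tokens: int) -> int:
    """Total input tokens with a fixed-size summary plus recent raw turns."""
    total = 0
    for k in range(1, turns + 1):
        if k <= keep_recent:
            total += k * tokens_per_turn  # nothing to summarize yet
        else:
            total += summary_tokens + keep_recent * tokens_per_turn
    return total


baseline = 800 * 50 * 51 // 2                             # 1020000, no strategy
compressed = summarized_input_tokens(50, 800, 10, 400)    # 380000
with_overhead = compressed * 110 // 100                   # 418000 incl. 10% overhead
net_saving = 1 - with_overhead / baseline                 # about 0.59
```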

Implementation Considerations

Timing your compression triggers correctly is critical. Compressing too early wastes the API call on a short conversation that might never need it. Compressing too late risks hitting the context window limit and having to emergency-truncate. The optimal trigger point is when the conversation reaches 60 to 70 percent of the available context window. This leaves enough headroom for large user inputs and assistant responses while avoiding unnecessary compression of short conversations. For a 200K context window with a 500-token system prompt, trigger compression at approximately 120,000 tokens of conversation history.
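
A trigger check along these lines might look like the sketch below. The function name and the integer-percent parameter are illustrative, and token counts are assumed to come from whatever counting method the application already uses.

```python
def should_compress(history_tokens: int, system_prompt_tokens: int = 500,
                    context_window: int = 200_000, trigger_percent: int = 60) -> bool:
    """Return True once system prompt plus history reach the trigger point."""
    used = history_tokens + system_prompt_tokens
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return used * 100 >= context_window * trigger_percent
```

With the defaults, compression triggers once the history reaches about 120,000 tokens; raising `trigger_percent` to 70 moves the trigger to about 140,000.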

The quality of your summary prompt directly determines the quality of long-term memory. A generic "summarize this conversation" instruction produces vague summaries that lose critical details. Instead, instruct the summarizer to extract specific categories: key decisions made, user preferences stated, facts established, action items agreed, and open questions remaining. This structured approach consistently produces more useful summaries that preserve the information Claude needs on subsequent turns. Testing summary quality against real conversations is essential before deploying to production.
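
One way to encode those categories is a small prompt builder. The wording of the instruction and the names `SUMMARY_CATEGORIES` and `build_summary_prompt` are illustrative assumptions, not a tested prompt.

```python
SUMMARY_CATEGORIES = [
    "Key decisions made",
    "User preferences stated",
    "Facts established",
    "Action items agreed",
    "Open questions remaining",
]


def build_summary_prompt(transcript: list[dict[str, str]]) -> str:
    """Build a summarization prompt that asks for one section per category."""
    lines = [f"{m['role']}: {m['content']}" for m in transcript]
    sections = "\n".join(f"- {name}" for name in SUMMARY_CATEGORIES)
    return (
        "Summarize the conversation below as short bullet points under "
        "each of these headings. Omit a heading if it has no entries.\n"
        f"{sections}\n\nConversation:\n" + "\n".join(lines)
    )
```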

For RAG implementations, chunk size and retrieval count are the two parameters that most affect quality. Chunks that are too small lose context and produce fragmented retrieval. Chunks that are too large waste tokens on irrelevant information within the chunk. A chunk size of 300 to 500 tokens works well for most conversational applications. Retrieving 3 to 5 chunks per query balances coverage against token cost. Use the Context Window Optimizer to plan how much of your token budget to allocate to RAG chunks versus other context components.
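
The two parameters can be explored with a small sketch. It approximates tokens by whitespace-separated words, which is only a rough proxy for a real tokenizer, and the function names are illustrative.

```python
def chunk_words(text: str, chunk_size: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def rag_token_budget(chunks_per_query: int = 5, chunk_size: int = 500) -> int:
    """Upper bound on tokens spent on retrieved chunks per request."""
    return chunks_per_query * chunk_size
```

At the upper end of the ranges above (5 chunks of 500 tokens), retrieval adds at most 2,500 tokens per request; at the lower end (3 chunks of 300 tokens) it adds 900.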

Privacy and Local Execution

The Memory Strategy Selector runs entirely in your browser. Conversation parameters, strategy configurations, and cost calculations are processed client-side using JavaScript. No data is sent to any server. There are no accounts, no cookies, no analytics, and no server-side processing. Your application architecture details and cost projections remain completely private on your device.

Frequently Asked Questions

Why does Claude forget things in long conversations?

Claude does not actually forget. The API is stateless, so the application resends the conversation history on every turn and Claude processes all of it, but the context window has a fixed token limit. When the conversation exceeds this limit, older messages must be dropped before the request is sent. Without a memory strategy, the oldest messages are simply truncated. Memory strategies solve this by deliberately managing which information stays in the window.

What is the sliding window memory strategy?

The sliding window keeps the N most recent messages and drops everything older. It is the simplest to implement and works well for task-focused conversations where only recent context matters. The downside is abrupt information loss for anything outside the window.

When should I use RAG instead of keeping full history?

Use RAG when your application references a large knowledge base exceeding the context window, when users ask about specific topics matchable to stored documents, or when you need consistent answers across conversations. RAG retrieves only the most relevant chunks per query, dramatically reducing token usage.

How does summary compression work?

Summary compression periodically condenses older turns into a brief summary using an extra API call. Instead of keeping 50 raw messages, you keep a summary of a few hundred tokens covering messages 1-40 plus the 10 most recent raw messages. Key facts are preserved while using far fewer tokens.

What is a hybrid memory strategy?

A hybrid strategy combines summary compression for old messages, RAG for knowledge base lookups, and a sliding window for recent turns. It provides long-term memory, domain knowledge, and immediate context simultaneously. It is worth the complexity for long-running conversations that reference external knowledge.

Michael Lip

Solo developer building free tools for the AI engineering community. Creator of Zovo Tools, a network of 18 developer utilities. Focused on making AI workflows accessible to everyone, no sign-up required.