Why does Claude forget things in long conversations?

Claude does not actually forget. It processes the entire conversation history in its context window on every turn. The issue is that the context window has a fixed token limit (200K for Claude 3.5 Sonnet and Claude 4 Opus). When the conversation exceeds this limit, older messages must be dropped to make room for new ones. Without a memory strategy, the most common approach is to simply truncate the oldest messages, which means Claude loses access to information shared early in the conversation. Memory strategies solve this by intelligently managing which information stays in the context window.

What is the sliding window memory strategy?

The sliding window strategy keeps the N most recent messages in the context window and drops everything older. It is the simplest memory strategy to implement. You set a maximum number of conversation turns or tokens, and when the conversation exceeds that limit, the oldest messages are removed. This works well for task-focused conversations where only recent context matters, like coding sessions where the current file is more important than files discussed 50 turns ago. The downside is abrupt information loss: facts mentioned early in the conversation vanish completely once they slide out of the window.

When should I use RAG instead of keeping full conversation history?

Use RAG (Retrieval-Augmented Generation) when your application needs to reference a large knowledge base that exceeds the context window, when users ask about specific topics that can be matched to stored documents, or when you need consistent answers across different conversations about the same facts. RAG is better than full history when the relevant information is sparse within a large corpus. Instead of stuffing the entire knowledge base into the context, RAG retrieves only the 3-5 most relevant chunks for each query. This dramatically reduces token usage while maintaining answer quality for factual lookups.

How does summary compression work for long conversations?

Summary compression periodically condenses older conversation turns into a brief summary. Instead of keeping 50 raw messages, you use Claude to generate a 200-token summary of messages 1-40, then keep that summary plus the 10 most recent raw messages. This preserves the key facts and decisions from earlier in the conversation while using far fewer tokens. The trade-off is lossy compression: nuanced details are lost in the summary. Implementation requires an extra API call whenever you trigger a compression cycle, but the token savings on subsequent turns usually outweigh this cost.

What is a hybrid memory strategy and when is it worth the complexity?

A hybrid memory strategy combines multiple approaches. A common hybrid is summary compression for old messages plus RAG for knowledge base lookups plus a sliding window for recent turns. The context window contains: a compressed summary of the full conversation history, retrieved documents relevant to the current query, and the last 5-10 raw messages. This gives Claude long-term memory (via summary), domain knowledge (via RAG), and immediate context (via sliding window). It is worth the complexity when your application has long-running conversations that reference external knowledge and where users expect the assistant to remember earlier decisions.

Claude Memory Strategies

Select the right context management strategy for your Claude application. Compare approaches, estimate token savings, and see implementation patterns.

How the Memory Strategy Selector Works

The Memory Strategy Selector is a browser-based planning tool for choosing and configuring context management strategies in Claude applications. Every conversational AI application eventually hits the context window limit. A 200,000-token window sounds enormous, but a conversation averaging 800 tokens per turn fills it in just 250 turns. Many production chatbots handle conversations lasting hundreds of turns over days or weeks. Without a memory strategy, these conversations silently lose information as older messages are truncated to fit new ones. This tool helps you choose the right strategy before your users start complaining that the assistant forgot what they said five minutes ago.

Select a memory strategy from the four options to see how it handles your conversation parameters. Adjust the conversation length, tokens per turn, and system prompt size to match your actual application. The visual context window diagram shows exactly how the strategy allocates tokens across system prompt, compressed memory, recent messages, and free space. The comparison table lets you see all four strategies side by side with token usage, monthly cost, information retention quality, and implementation complexity.

Understanding the Four Memory Strategies

The sliding window strategy is the simplest and most common approach in production. It keeps the N most recent conversation turns and discards everything older. Implementation requires just a few lines of code that slice the message array to a maximum length before each API call. The advantage is zero additional API calls for memory management and perfectly preserved recent context. The disadvantage is binary information loss: everything outside the window is completely gone. The sliding window works well for task-focused conversations like coding sessions, customer support tickets, and short interactions where historical context is less important than the current task.

Summary compression periodically condenses older conversation turns into a compact summary. When the conversation reaches a threshold, you make an API call to Claude asking it to summarize messages 1 through N into a few hundred tokens. That summary replaces the original messages, and subsequent turns include the summary plus the most recent raw messages. This approach preserves key facts and decisions from the entire conversation while using a fraction of the tokens. The trade-off is lossy compression: subtle nuances, exact phrasings, and minor details are lost. Summary compression works best for advisory conversations, project discussions, and any interaction where remembering key decisions matters more than preserving exact wording.

RAG (Retrieval-Augmented Generation) retrieval is fundamentally different from the other strategies because it does not try to keep conversation history in the context window. Instead, it stores conversation content and external knowledge in a vector database. On each turn, the user's message is used to retrieve the most relevant chunks from the database, and only those chunks are injected into the context window. RAG excels at knowledge-intensive applications where the relevant information for any given query is a tiny fraction of the total knowledge base. Customer support bots that need to reference thousands of help articles, internal assistant tools that span company documentation, and research assistants over large paper collections all benefit from RAG.

The hybrid strategy combines all three approaches for maximum flexibility. It maintains a rolling summary of the full conversation history, uses RAG to retrieve relevant external knowledge for each query, and keeps the most recent messages in a sliding window. The context window on each turn contains: the system prompt, the conversation summary, retrieved document chunks, and the last 5 to 10 raw messages. This gives Claude long-term memory through the summary, domain knowledge through RAG, and immediate conversational context through the sliding window. The complexity cost is real, requiring infrastructure for vector storage, summarization scheduling, and context assembly, but the result is the most capable memory system possible within token limits.

Token Growth Patterns and Cost Impact

Without any memory strategy, token usage grows linearly with conversation length. A 50-turn conversation at 800 tokens per turn consumes 40,000 input tokens on the final turn alone, because the entire history is sent with each request. The total input tokens across all 50 turns is approximately 1 million tokens. At $3 per million tokens with Claude 3.5 Sonnet, that is $3 per conversation. At 100 conversations per day, monthly cost is approximately $9,000 just for input tokens. These numbers surprise most developers because they forget that the full history is resent on every single turn.

A sliding window capped at 20 turns changes the economics dramatically. After the conversation exceeds 20 turns, input token usage plateaus at approximately 16,000 tokens per turn (20 turns times 800 tokens) plus the system prompt. Total input tokens across a 50-turn conversation drops to roughly 500,000 tokens. Monthly cost at the same volume drops to approximately $4,500, a 50% reduction. The information loss from turns 1 through 30 is complete, but for many applications this is an acceptable trade-off.

Summary compression offers the best balance for most applications. Compressing turns 1 through 40 into a 400-token summary means the final turn's context contains 400 tokens of summary plus 8,000 tokens of recent messages (10 turns) plus the system prompt. Total input tokens across all turns is roughly 350,000. Monthly cost drops to about $3,200. The summarization calls add approximately 10% overhead, but the net saving is still around 60% compared to no strategy. More importantly, Claude retains awareness of the full conversation through the summary, which prevents the jarring "I forgot what you said earlier" experience that degrades user trust.

Implementation Considerations

Timing your compression triggers correctly is critical. Compressing too early wastes the API call on a short conversation that might never need it. Compressing too late risks hitting the context window limit and having to emergency-truncate. The optimal trigger point is when the conversation reaches 60 to 70 percent of the available context window. This leaves enough headroom for large user inputs and assistant responses while avoiding unnecessary compression of short conversations. For a 200K context window with a 500-token system prompt, trigger compression at approximately 120,000 tokens of conversation history.

The quality of your summary prompt directly determines the quality of long-term memory. A generic "summarize this conversation" instruction produces vague summaries that lose critical details. Instead, instruct the summarizer to extract specific categories: key decisions made, user preferences stated, facts established, action items agreed, and open questions remaining. This structured approach consistently produces more useful summaries that preserve the information Claude needs on subsequent turns. Testing summary quality against real conversations is essential before deploying to production.

For RAG implementations, chunk size and retrieval count are the two parameters that most affect quality. Chunks that are too small lose context and produce fragmented retrieval. Chunks that are too large waste tokens on irrelevant information within the chunk. A chunk size of 300 to 500 tokens works well for most conversational applications. Retrieving 3 to 5 chunks per query balances coverage against token cost. Use the Context Window Optimizer to plan how much of your token budget to allocate to RAG chunks versus other context components.

Privacy and Local Execution

The Memory Strategy Selector runs entirely in your browser. Conversation parameters, strategy configurations, and cost calculations are processed client-side using JavaScript. No data is sent to any server. There are no accounts, no cookies, no analytics, and no server-side processing. Your application architecture details and cost projections remain completely private on your device.

Claude Memory Strategies

Choose a Memory Strategy

Conversation Parameters

Context Window Usage

Strategy Comparison

Implementation Pattern

Token Optimization Tips

How the Memory Strategy Selector Works

Understanding the Four Memory Strategies

Token Growth Patterns and Cost Impact

Implementation Considerations

Privacy and Local Execution

Frequently Asked Questions

Explore ClaudFlow

Related Tools

Guides

Research