Context Windows Explained: Why AI Forgets Mid-Conversation

I’ve watched this happen to people pushing an AI hard for the first time. They build up a long conversation. Often, it might go twenty or thirty exchanges deep. Suddenly, the model starts contradicting itself.

For example, it asks a question the user answered twenty minutes ago.

Alternatively, it ignores a constraint set at the very beginning. The user figures the AI must be glitching. Or maybe they assume it just lacks intelligence.

The AI isn’t glitching. Instead, it simply ran out of room.

That gives you the short version. However, the longer version proves highly worth understanding. Once you see exactly what happens under the hood, the strange behaviour stops being mysterious. Furthermore, it starts being predictable. Consequently, you can easily work around predictable problems.

The model doesn’t remember. It re-reads.

This fact surprises most people when they first hear it. Fundamentally, the model does not maintain a persistent memory between your messages. It consults no internal journal. Moreover, it stores no understanding of who you are or what you told it.

Every single time you send a message, the model reads the entire conversation from scratch. First, it scans every message you sent. Next, the system reviews every response it generated. Finally, the AI also reads any documents or files you pasted in. Subsequently, it uses all that text together to figure out its next response.

This process isn’t like talking to someone who remembers you. Instead, imagine handing a fast reader a full transcript they never saw before. You then ask them to respond to the last line.

Hitting the Context Wall

The context window is the limit on how long that transcript can be. Once the conversation grows past that limit, something has to give. Usually, the system drops the oldest content. Specifically, it quietly deletes the beginning of the conversation. The model then continues as if those early messages never existed.

Developer Simon Willison tests language models extensively.

For instance, he described hitting this wall while using GPT-4 for a long coding session. The model maintained a specific architectural pattern throughout the conversation. Then, past a certain point, the AI started generating code that contradicted the pattern entirely.

The model wasn’t making a reasoning error. Rather, it simply could no longer see the messages establishing the pattern.

Willison concluded the behaviour looked like a bug. However, it actually represented a predictable consequence of running out of context. As a result, he now routinely starts new conversations when switching between major tasks. He treats context exhaustion as a known variable rather than a surprise.

“Forgetting” is how it looks from the outside. From the inside, though, the machine never had anything to forget. The information only ever existed alive inside that specific window.

**The model follows the instruction perfectly at first — every response is exactly two sentences**.

**After multiple turns, the same model completely ignores the two-sentence constraint.**

Why does a limit exist at all?

The architecture behind modern language models relies on transformers. Specifically, these transformers use a mechanism called self-attention to process text. For every token in the input, the model computes a relationship score with every other token.

This computation allows it to track pronouns effectively. Thus, the system knows that “it” in a paragraph refers to a noun from three sentences earlier. Additionally, it understands that a caveat buried in clause two affects the conclusion in clause seven.

The Math Behind the Memory

The problem stems from how this pairwise comparison scales quadratically. If you double the context length, the computation quadruples instead of doubling. Similarly, triple the length, and you reach nine times the cost.

The system must store intermediate values from all those comparisons in GPU memory. Developers call this the KV cache. Meanwhile, this storage happens while the computation runs. GPU memory remains finite regardless of how expensive the hardware is.

Engineers cannot fix this software limitation with a clever update. Ultimately, the mathematics of the attention mechanism dictate this hard limit.

Pushing the Boundaries

The limits continue moving fast, though. Initially, GPT-2 shipped in 2019 with a context window of 1,024 tokens. That equals roughly one dense page of text. Later, GPT-3 offered 4,096 tokens. GPT-4 launched at 8,192 tokens, and OpenAI offered a 32,000-token variant separately.

Anthropic released Claude 2 in July 2023 with a 100,000-token context window. They demonstrated this publicly by feeding the model the entire text of The Great Gatsby. Afterward, they asked it questions about minor details buried in the middle. The model handled it perfectly. Currently, Claude models boast 200,000 tokens.

Google announced Gemini 1.5 Pro in February 2024. Impressively, this model demonstrated one million tokens in benchmark testing. That holds roughly eleven hours of video transcript. Furthermore, it can fit the complete works of Shakespeare three times over.

A token equals roughly three to four characters of English text. It rarely equals a full word.

For example, the word “understanding” counts as two tokens.

Conversely, most common short words count as one. Therefore, 200,000 tokens translates to roughly 150,000 words. That matches the length of a fairly thick novel.

It sounds enormous, and it truly is. However, production conversations with heavy document uploads can chew through it incredibly fast.

What truncation actually looks like in practice

When a conversation crosses the context limit, most systems silently drop the oldest messages. Consequently, the model receives a new “start” and has no idea anything went missing. It just reads the available text and responds accordingly.

The Silent Data Loss

This explains why the degradation often looks so specific. For instance, a user on the r/ChatGPT subreddit documented this precisely last year. They gave GPT-4 detailed character profiles for a long-form story at the start of a session. Next, they worked through thirty-plus exchanges to build out scenes.

Around exchange thirty-five, the model assigned a secondary character traits that directly contradicted the initial profile. The user firmly believed the model “changed its mind.” In reality, it hadn’t. The moving window pushed the character profiles out of view.

Therefore, the model wrote with zero access to the original constraints. It simply filled the gaps using its own defaults.

Some systems handle this gracefully. To mitigate data loss, developers insert a rolling summary of the oldest content at the front of the context.

This compressed version replaces the dropped text. Admittedly, it helps with broad continuity, but you lose specific details.

The compression averages and blurs exact phrasing, particular data points, or established nuances. This method beats having nothing, but it fails to replicate having the original content present.

The Lost in the Middle Phenomenon

The placement of information within the context also matters. In 2023, Stanford researchers published a widely cited paper. Many people call it the “lost in the middle” study. Specifically, they tested multiple large language models, including GPT-3.5 Turbo, GPT-4, and Claude 1.3.

They gave these models tasks requiring them to find relevant information within long contexts. The team found consistent results across the board.

Notably, model performance peaked when relevant information appeared at the beginning or end of the context. Conversely, performance dropped meaningfully when the data appeared in the middle. Moreover, longer contexts caused more pronounced dips.

This creates a practical upshot for users. Suppose you work with a long contract, research paper, or report. You need the model to reason carefully about a specific clause. Therefore, you will get a better result by extracting that section and placing it at the front of your message.

Avoid dumping the full document and asking the model to locate the details. You can use a bigger context window with the same model. However, different placement yields measurably different answer quality.

The memory feature is not the same thing

Several AI products now advertise memory features. These tools boast the ability to carry information about you across separate conversations.

For example, ChatGPT rolled this out to Plus users in early 2024. Claude followed with its own memory system shortly after.

These features prove useful, but users commonly misunderstand them. Many people mistakenly think these features solve the context window problem. Unfortunately, they do not.

Instead, these systems maintain a separate store of facts outside any conversation. When you start a new session, the system pulls relevant entries from that store. Then, it inserts them at the beginning of your context.

Consider a ChatGPT user who mentions they are vegetarian and prefer metric units. The system saves those facts as memories. Subsequently, they appear in the context of every subsequent session. The model then “knows” both things without you telling it again.

RAG vs. Memory

This process relies on retrieved injection, not continuous memory. Meanwhile, the context window remains exactly as finite as before. Only one thing really changed. Specifically, high-value information now survives session boundaries because the system stores it externally.

The system then re-injects that data each time. Within a single long conversation, you will still hit the exact same limits. Ultimately, the memory feature never extends the window. It simply ensures certain facts return to each new session.

A completely different engineering solution handles scale in real deployments. Developers call it Retrieval-Augmented Generation, or RAG. A RAG system avoids loading an entire document library into the context.

Instead, the process breaks documents into smaller chunks. Next, it stores them as numerical vectors in a database. At query time, the system retrieves only the chunks most semantically relevant to your question. Finally, only those specific chunks enter the context alongside your prompt.

Notion AI works exactly this way. You might ask Notion AI a question about your workspace. However, it does not load every document you ever wrote into the context window. First, the system runs a retrieval step to pull the most relevant pages. Then, the application sends only those specific pages into the model.

Microsoft Copilot uses the same architecture to answer questions about your SharePoint library. Similarly, Intercom’s Fin chatbot uses it to handle support queries against thousands of articles. The context window still does the same finite job. It just receives curated input instead of a brute-force data dump.

The bit that doesn’t get said enough

The way people talk about AI memory causes real confusion in practice. The word “forgetting” carries a heavy implication. Specifically, it suggests the model held something and then lost it. Furthermore, this phrasing implies the model possessed the information at some point before letting it go.

That never actually happens. There is no possession. Instead, information exists inside a context window for the duration of a single inference pass. After that, it vanishes completely. The model does not hold onto it between turns.

The next time it responds, the system reads everything again. It even re-reads its own previous responses. Therefore, each new response acts as a completely fresh act of reading.

Human Memory vs. AI Processing

Human memory works differently. We consolidate experiences, compress them, and reconstruct them during retrieval. Consequently, human memory features a continuity that transformers lack completely.

Neuroscientist Lisa Feldman Barrett notes that human memory is not a recording. Instead, it functions as a reconstruction. We rebuild memories each time we access them. Additionally, we layer in new context and emotion.

This messy, reconstructive system differs radically from language models. A language model never reconstructs anything. Rather, it merely reads what sits directly in front of it.

This distinction matters greatly in practice. Consider a developer using Claude to debug a large codebase in a long session. Eventually, the model suggested fixes that broke things it carefully worked around earlier. The developer instinctively called it a hallucination problem.

However, a quick review revealed the truth. The shifting window pushed the relevant context out thirty messages earlier. Therefore, the model avoided confabulating entirely.

It reasoned correctly about an incomplete picture. Ultimately, the output looked wrong because the input silently changed. Nothing actually broke within the reasoning process itself.

How to actually work with this

People who get the most out of AI tools internalize the context window as a real constraint. Furthermore, they actively build habits around it.

A content team at a mid-sized SaaS company noticed an issue. Their AI-assisted drafting sessions produced inconsistent results. Initially, the model followed their style guide but drifted later. Their fix required a straightforward approach.

They stopped appending instructions to the beginning of a long conversation. Instead, the team created a “context header.” This short document contained their core style constraints. Now, they paste it at the top of every new session.

Additionally, they require a fresh start for any session longer than twenty exchanges. This simple rule eliminated the inconsistency problem almost entirely.

Resetting for Success

You should start a new conversation when the task changes. This habit goes beyond simple tidiness. Specifically, it resets the tool to a clean context.

Long conversations accumulate token weight incredibly fast. Unfortunately, not all those tokens do useful work. Conversely, a fresh conversation features a crisp, specific prompt. This fresh start often outperforms an old thread wandering across six different topics.

Sometimes, a working session goes wrong in the middle. The model might contradict an earlier instruction. Alternatively, the AI could start treating you like a stranger. Furthermore, the system might produce output ignoring constraints it followed perfectly just minutes before.

Do not ask, “What is wrong with the model?” Instead, ask, “What did it just lose access to?” You should start over and re-establish the key context up front. Undoubtedly, this approach proves much faster and more reliable than correcting errors inside an exhausted thread.

Optimizing Document Uploads

For document work, you must take the “lost in the middle” research seriously. Amazon’s Q Business handles long enterprise documents carefully. Specifically, it segments and prioritizes content placement specifically to combat this effect.

The system keeps the most query-relevant chunks near the context boundaries. Therefore, it avoids burying them centrally.

You might do this manually by pasting a long PDF into a chat. In this case, extract the relevant section first. Never dump the whole thing and hope the model finds what it needs.

This does not mean these systems suffer from deep fragility. On the contrary, they remain incredibly useful tools. The context window expanded from 1,024 tokens to 200,000 tokens in just five years. This represents a substantial amount of progress.

A single long document used to fill the entire window. Now, that same document sits comfortably alongside a full session of back-and-forth chat. Clearly, the ceiling keeps rising every year.

However, the ceiling absolutely still exists. It strictly shapes what the model can see. Furthermore, what the model sees represents everything it can work with.

Hitting this ceiling creates strange behaviours like contradictions, ignored instructions, and apparent confusion. Crucially, these are not signs of a broken model. They simply show what context exhaustion looks like from the outside.

Knowing this fact makes those strange moments much less mysterious. Finally, it gives you a concrete action plan when you inevitably hit them.

Context Windows Explained: Why AI ‘Forgets’ Mid-Conversation