How to Add Memory to Your AI Chatbot Without a Database
LLMs don't remember anything between calls: every message starts from scratch. Here's how to give your chatbot genuine conversational memory using pure in-memory techniques, no database required.
Last updated: July 2, 2026 · 21 min read · Python examples
Three memory strategies covered in this guide
Full history
Pass every message in every request. Simple, effective for short chats.
Sliding window
Keep only the last N messages. Caps token usage automatically.
Summarized memory
Compress old history into a rolling summary. Best of both worlds.
Here's the part most AI tutorials skip over: language models have no memory. At all. Every single API call is completely stateless: the model has no idea who you are, what you said five minutes ago, or that this isn't the first time you've talked. It just sees whatever text you send in the current request, and nothing else.
ChatGPT and Claude seem to remember your conversation because the application layer is doing the work: it passes the entire conversation history back to the model on every turn, packaged as a single long input. The model isn't remembering anything. It's just reading a transcript.
Once you understand that, you realize you have full control. You decide what goes into that transcript, how much of it, and in what form. This guide walks through three practical strategies for managing that memory, all without touching a database.
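To make the "just reading a transcript" point concrete, here is a toy sketch. `fake_model` is a hypothetical stand-in for a real LLM call (it is not part of any library); the only thing it can use is the list of messages passed to that one call.

```python
import re

def fake_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for an LLM: it sees only `messages`, nothing else."""
    transcript = " ".join(m["content"] for m in messages if m["role"] == "user")
    match = re.search(r"My name is (\w+)", transcript)
    if "What's my name?" in transcript:
        return f"Your name is {match.group(1)}." if match else "I don't know your name."
    return "Nice to meet you!"

intro = {"role": "user", "content": "My name is Sara."}
question = {"role": "user", "content": "What's my name?"}

# Same question, two independent calls: only the transcript differs.
without_history = fake_model([question])
with_history = fake_model(
    [intro, {"role": "assistant", "content": "Nice to meet you!"}, question]
)

print(without_history)  # I don't know your name.
print(with_history)     # Your name is Sara.
```

The "memory" lives entirely in the list the caller chooses to send.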
Why LLMs forget, and why that's actually by design
A language model processes text through what's called a context window, a fixed-size buffer of tokens (roughly, word-pieces) that it can "see" at once. Modern models have large context windows: GPT-4o handles 128k tokens, Claude handles 200k, and local models via Ollama typically support 4k-128k depending on the model and your configuration.
Within that window, the model can reference anything: your instructions, the conversation so far, documents you've provided, examples you've shown it. But the moment a request finishes, the model discards everything. There's no persistent state between API calls.
The core challenge
Every message you send costs tokens. Conversation history grows over time. If you naively include the full history on every call, you'll eventually hit the context limit, and long before that, you'll be paying for (or waiting on) a huge number of redundant tokens every single turn. Memory management is really context window management.
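A little arithmetic shows how quickly the redundancy adds up. The sketch below assumes every message averages 50 tokens and ignores the system prompt; both are illustrative assumptions, not measurements. On turn k the request carries the 2(k-1) earlier messages plus the new user message, so the total input over n turns grows with n squared.

```python
def input_tokens_on_turn(turn: int, tokens_per_message: int) -> int:
    """Input tokens sent on a given turn (1-based) when resending full history.

    Turn k carries the 2 * (k - 1) earlier messages plus the new user message.
    """
    return (2 * turn - 1) * tokens_per_message

def cumulative_input_tokens(turns: int, tokens_per_message: int) -> int:
    """Total input tokens sent across all turns so far."""
    return sum(
        input_tokens_on_turn(k, tokens_per_message) for k in range(1, turns + 1)
    )

for turns in (10, 50, 100):
    last = input_tokens_on_turn(turns, tokens_per_message=50)
    total = cumulative_input_tokens(turns, tokens_per_message=50)
    print(f"{turns} turns: last request {last} tokens, cumulative {total} tokens")
```

Under these assumptions, 100 turns send 500,000 input tokens in total, even though the conversation itself is only about 10,000 tokens long.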
Strategy 1: Full conversation history
The simplest approach: keep every message in a list and send the whole list on every API call. This is what most tutorial code does, and it works perfectly for short conversations.
from ollama import chat  # the same messages-list pattern works with the openai library

# This list IS the memory: it grows with every turn
conversation_history = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant. Be concise and direct. "
            "Remember details the user shares about themselves."
        ),
    }
]

def chat_with_memory(user_message: str) -> str:
    # Add the user's message to history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    # Send the full history to the model
    response = chat(
        model="llama3.2",
        messages=conversation_history,
    )
    assistant_reply = response["message"]["content"]
    # Add the model's reply to history too
    conversation_history.append({
        "role": "assistant",
        "content": assistant_reply
    })
    return assistant_reply

# Usage
print(chat_with_memory("My name is Sara and I'm learning Python."))
# → "Nice to meet you, Sara! ..."
print(chat_with_memory("What's my name?"))
# → "Your name is Sara." (it remembers!)

That's genuinely all it takes. The model sees the full conversation on every call, so it can reference anything said earlier. The list in memory is the memory.
Pros
- Dead simple to implement
- Perfect recall: nothing is lost
- The model can reference any past message
- Zero extra infrastructure
Cons
- Grows without limit and eventually hits the context maximum
- Each call gets slower and more expensive over time
- Lost when the process restarts
- Not practical beyond ~30-40 exchanges
When to use it: short-lived sessions, prototypes, demos, or any chatbot where conversations are expected to stay under 20-30 exchanges.
Strategy 2: Sliding window memory
Instead of keeping every message forever, you keep only the most recent N exchanges. Older messages fall off the back as new ones come in, like a conveyor belt. This caps your token usage automatically and keeps response times consistent.
from ollama import chat

SYSTEM_PROMPT = {
    "role": "system",
    "content": "You are a helpful assistant. Be concise.",
}
# Only keep the last N *pairs* of messages (user + assistant)
MAX_PAIRS = 10  # = 20 past messages in the window, plus the new user message
full_history: list[dict] = []

def build_window(history: list[dict], max_pairs: int) -> list[dict]:
    """Return system prompt + last max_pairs exchanges + the pending user message."""
    # Each exchange = 1 user message + 1 assistant message = 2 items.
    # The extra 1 is the user message that was just appended, so the window
    # always starts on a user turn rather than on a stray assistant reply.
    cutoff = max_pairs * 2 + 1
    recent = history[-cutoff:] if len(history) > cutoff else history
    return [SYSTEM_PROMPT] + recent

def chat_with_window(user_message: str) -> str:
    full_history.append({"role": "user", "content": user_message})
    # Only send the recent window to the model
    windowed_messages = build_window(full_history, MAX_PAIRS)
    response = chat(model="llama3.2", messages=windowed_messages)
    reply = response["message"]["content"]
    full_history.append({"role": "assistant", "content": reply})
    return reply

Notice that full_history still keeps everything; we just trim what we send to the model. This means you can always go back and look at the full log, even though the model only sees a window of it.
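If you don't need the full log, one alternative is to let the container do the trimming. The sketch below uses `collections.deque` from the standard library with `maxlen`, which silently discards the oldest items once it is full. It is a different trade-off from the code above (old messages are gone for good), and the small `MAX_PAIRS` value is chosen only to make the effect visible.

```python
from collections import deque

MAX_PAIRS = 3  # small value so the trimming is easy to see

# maxlen counts individual messages, so reserve two slots per exchange
window: deque[dict] = deque(maxlen=MAX_PAIRS * 2)

for i in range(1, 6):
    window.append({"role": "user", "content": f"question {i}"})
    window.append({"role": "assistant", "content": f"answer {i}"})

# Only the last three exchanges survive; exchanges 1 and 2 were dropped.
contents = [m["content"] for m in window]
print(contents)
```

To send the window to a model you would pass `list(window)` after the system prompt, since the chat call expects a plain list of messages.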
Choosing the right window size
The right number depends on your model's context limit and the average length of your messages. A rough rule of thumb: aim to use no more than 40-50% of the context window for history, leaving room for the system prompt, the current message, and the response.
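The rule of thumb can be turned into a quick calculation. This is a rough sketch: `estimate_tokens` uses the common approximation of about 4 characters per token for English text, which is a heuristic rather than a real tokenizer, and the function names and the 40% default are assumptions made for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def max_pairs_for_budget(
    context_limit: int,
    avg_message_tokens: int,
    history_share: float = 0.4,
) -> int:
    """How many user + assistant pairs fit in the share reserved for history."""
    history_budget = int(context_limit * history_share)
    return history_budget // (2 * avg_message_tokens)

print(estimate_tokens("hello world!"))                                    # 3
print(max_pairs_for_budget(context_limit=8192, avg_message_tokens=80))   # 20
print(max_pairs_for_budget(context_limit=4096, avg_message_tokens=80))   # 10
```

For anything close to the limit, measure with the tokenizer that matches your model instead of relying on the character heuristic.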
Rough token estimates per message
When to use it: most production chatbots. It's the most practical default: simple to implement, predictable cost, and works well for conversations that don't require recalling something from 2 hours ago.
Strategy 3: Summarized rolling memory
This is the most powerful no-database approach. Instead of just dropping old messages, you ask the model to summarize what was discussed before it falls off the window. That summary gets injected at the top of every prompt, giving the model a compressed but meaningful sense of the full conversation history.
Think of it like a meeting recap at the start of every call: "Last time we discussed X, Y, and Z. Sara mentioned she's learning Python and prefers short explanations."
from ollama import chat

WINDOW_SIZE = 6  # messages before we summarize
summary: str = ""  # grows over time as history compresses
recent_history: list[dict] = []

def summarize_history(messages: list[dict]) -> str:
    """Ask the model to compress a list of messages into a summary."""
    transcript = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = chat(
        model="llama3.2",
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarize this conversation excerpt in 3-5 sentences. "
                    "Focus on key facts, user preferences, and decisions made.\n\n"
                    f"{transcript}"
                ),
            }
        ],
    )
    return response["message"]["content"]

def build_messages_with_summary(user_message: str) -> list[dict]:
    system_content = "You are a helpful assistant. Be concise."
    if summary:
        system_content += (
            f"\n\nContext from earlier in this conversation:\n{summary}"
        )
    return [
        {"role": "system", "content": system_content},
        *recent_history,
        {"role": "user", "content": user_message},
    ]

def chat_with_summary_memory(user_message: str) -> str:
    global summary, recent_history
    messages = build_messages_with_summary(user_message)
    response = chat(model="llama3.2", messages=messages)
    reply = response["message"]["content"]
    # Add this exchange to recent history
    recent_history.append({"role": "user", "content": user_message})
    recent_history.append({"role": "assistant", "content": reply})
    # When the window fills up, compress the oldest half into the summary
    if len(recent_history) >= WINDOW_SIZE:
        to_summarize = recent_history[: WINDOW_SIZE // 2]
        new_summary_chunk = summarize_history(to_summarize)
        # Append the new chunk to the existing summary
        if summary:
            summary = f"{summary}\n{new_summary_chunk}"
        else:
            summary = new_summary_chunk
        # Keep only the newer half of recent history
        recent_history = recent_history[WINDOW_SIZE // 2 :]
    return reply

⚠️ One important caveat
Summarization costs an extra LLM call every time the window fills. With a local model that call costs no money but still takes time; with API-based models it adds both latency and cost. Trigger summarization asynchronously or between turns, and keep it out of the critical path of a user response if you can.
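One way to keep summarization off the reply path is a background thread. The sketch below is a minimal illustration using the standard library's `threading` module; `summarize_stub` is a placeholder for the real summarization call, and `compress_in_background` is a hypothetical helper name, not part of any library.

```python
import threading
from typing import Optional

WINDOW_SIZE = 6
state_lock = threading.Lock()  # guards summary and recent_history
summary = ""
recent_history: list[dict] = []

def summarize_stub(messages: list[dict]) -> str:
    """Placeholder for the real (slow) summarization call."""
    return f"[summary of {len(messages)} messages]"

def compress_in_background() -> Optional[threading.Thread]:
    """If the window is full, summarize its older half on a worker thread."""
    global recent_history
    with state_lock:
        if len(recent_history) < WINDOW_SIZE:
            return None
        half = WINDOW_SIZE // 2
        to_summarize = recent_history[:half]
        recent_history = recent_history[half:]

    def worker() -> None:
        global summary
        chunk = summarize_stub(to_summarize)  # slow work happens off the reply path
        with state_lock:
            summary = f"{summary}\n{chunk}" if summary else chunk

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread

# Demo: fill the window, then compress.
for i in range(3):
    recent_history.append({"role": "user", "content": f"q{i}"})
    recent_history.append({"role": "assistant", "content": f"a{i}"})

job = compress_in_background()
if job is not None:
    job.join()  # only for the demo; a real handler would return without waiting

print(summary)              # [summary of 3 messages]
print(len(recent_history))  # 3
```

One trade-off to be aware of: between the trim and the moment the worker finishes, the trimmed messages are in neither the window nor the summary, so a very fast follow-up message may briefly see less context.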
When to use it: long sessions (customer support, tutoring bots, personal assistants) where users expect the chatbot to remember context from an hour ago or earlier in the same session.
Bonus: Extracting and injecting user facts
Beyond conversation history, you can build a simple "user profile" in plain Python dictionaries, extracted by the model itself as the conversation unfolds. This gives you a lightweight fact store that persists independently of the sliding window.
import json
from ollama import chat

user_facts: dict = {}  # {"name": "Sara", "skill_level": "beginner", ...}

def extract_facts(user_message: str, assistant_reply: str) -> dict:
    """Ask the model to pull any new facts from this exchange."""
    prompt = f"""
    Extract any personal facts about the user from this exchange.
    Return ONLY a JSON object (or empty {{}} if nothing new).
    User: {user_message}
    Assistant: {assistant_reply}
    Examples of facts to extract: name, location, job, skill level,
    preferences, goals, constraints, tools they use.
    """
    response = chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        text = response["message"]["content"].strip()
        # Strip markdown fences if the model adds them
        text = text.removeprefix("```json").removeprefix("```")
        text = text.removesuffix("```").strip()
        facts = json.loads(text)
    except (json.JSONDecodeError, KeyError):
        return {}
    # Valid JSON that isn't an object (e.g. a list) is treated as "no facts"
    return facts if isinstance(facts, dict) else {}

def build_system_prompt() -> str:
    base = "You are a helpful assistant. Be concise."
    if user_facts:
        facts_text = "\n".join(f"- {k}: {v}" for k, v in user_facts.items())
        base += f"\n\nWhat you know about this user:\n{facts_text}"
    return base

def chat_with_facts(user_message: str) -> str:
    messages = [
        {"role": "system", "content": build_system_prompt()},
        {"role": "user", "content": user_message},
    ]
    response = chat(model="llama3.2", messages=messages)
    reply = response["message"]["content"]
    # Extract facts once the reply exists; this step can be moved to a
    # background task so it doesn't delay the response
    new_facts = extract_facts(user_message, reply)
    user_facts.update(new_facts)
    return reply

# Example session
print(chat_with_facts("I'm Alex, a backend dev who hates verbose docs."))
print(user_facts)
# → e.g. {"name": "Alex", "job": "backend developer", "preference": "concise docs"}
print(chat_with_facts("What's a good tool for API testing?"))
# The model now knows Alex is a backend dev and will tailor its answer

This approach is surprisingly powerful. The model personalizes its answers based on accumulated facts without you needing to engineer elaborate prompts: you just keep the facts dict up to date and inject it into the system prompt every turn.
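Because `user_facts.update(new_facts)` trusts whatever the model returns, a slightly more defensive merge can help keep the dict tidy. The helper below is a hypothetical sketch (`merge_facts` is not from the original code): it normalizes key spelling and skips empty values so that a blank extraction never overwrites a known fact.

```python
def merge_facts(existing: dict, new_facts: dict) -> dict:
    """Return a copy of `existing` updated with cleaned-up entries from `new_facts`."""
    merged = dict(existing)
    for key, value in new_facts.items():
        if not isinstance(key, str):
            continue
        # Treat "Skill Level" and "skill_level" as the same fact
        clean_key = key.strip().lower().replace(" ", "_")
        # Skip empty values so they never erase something already known
        if value in (None, "", [], {}):
            continue
        merged[clean_key] = value
    return merged

facts = merge_facts(
    {"name": "Alex"},
    {"Skill Level": "beginner", "job": "", "name": "Alex B."},
)
print(facts)  # {'name': 'Alex B.', 'skill_level': 'beginner'}
```

Newer values still win over older ones here; whether that is the right policy depends on how much you trust each extraction.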
Combining strategies for production
In practice, a production chatbot uses all three layers at once. Here's how they fit together:
System prompt
Static instructions + user facts dict
Always present. Shapes behavior and personalizes every response.
Rolling summary
Compressed history from older exchanges
Provides context from earlier in the session without flooding the window.
Recent window
Last 6-10 exchanges verbatim
Exact recent context for coherent back-and-forth flow.
Current message
The user's latest input
What the model is actually responding to.
Total context usage: roughly 1,000-3,500 tokens per request. That fits within even a 4k context window, though at the upper end it leaves only a few hundred tokens for the response, so use a shorter window or summary on small-context models.
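The layers listed above can be assembled in a single function. This is a minimal sketch rather than a definitive implementation: `build_context` is a hypothetical name, and the prompt wording simply reuses the phrasing from the earlier examples.

```python
def build_context(
    base_instructions: str,
    user_facts: dict,
    summary: str,
    recent_history: list[dict],
    user_message: str,
) -> list[dict]:
    """Combine facts, rolling summary, recent window, and the new message."""
    system_parts = [base_instructions]
    if user_facts:
        facts_text = "\n".join(f"- {k}: {v}" for k, v in user_facts.items())
        system_parts.append(f"What you know about this user:\n{facts_text}")
    if summary:
        system_parts.append(f"Context from earlier in this conversation:\n{summary}")
    return [
        {"role": "system", "content": "\n\n".join(system_parts)},
        *recent_history,
        {"role": "user", "content": user_message},
    ]

messages = build_context(
    base_instructions="You are a helpful assistant. Be concise.",
    user_facts={"name": "Sara", "skill_level": "beginner"},
    summary="Sara is learning Python and prefers short explanations.",
    recent_history=[
        {"role": "user", "content": "What is a list comprehension?"},
        {"role": "assistant", "content": "A compact way to build a list from an iterable."},
    ],
    user_message="Can you show one?",
)
print(len(messages))  # 4
```

The returned list can be passed directly as the `messages` argument of the chat call used throughout this guide.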
When do you actually need a database?
In-memory strategies work until they don't. Here's an honest look at where the ceiling is:
Memory survives a server restart
Write history/facts to a JSON file or SQLite. Even a flat file beats a full database for simple cases.
Multiple users with separate conversations
Use a session ID to key separate history lists. Still no database needed if you can keep sessions in memory (e.g., Redis, or a Python dict keyed by session ID).
Recall specific facts from months ago
This requires persistent storage. A vector database (Chroma, Qdrant) lets you retrieve semantically relevant old memories rather than re-reading the whole history.
High concurrency (many simultaneous users)
In-memory per-process works fine as long as sessions are sticky. For distributed systems with multiple server instances, you need shared storage.
Audit logs or compliance
Always use a proper database. You need durable, queryable, immutable records.
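The first two cases above (surviving a restart, and separate conversations per user) can be covered together with a dict keyed by session ID and a flat JSON file. The function names and the file name below are assumptions made for this sketch, and it is only suitable for a single process.

```python
import json
import tempfile
from pathlib import Path

# One history list per session ID, all held in process memory
sessions: dict[str, list[dict]] = {}

def get_history(session_id: str) -> list[dict]:
    """Return the history list for this session, creating it on first use."""
    return sessions.setdefault(session_id, [])

def save_sessions(path: Path) -> None:
    """Write every session's history to a JSON file."""
    path.write_text(json.dumps(sessions, indent=2), encoding="utf-8")

def load_sessions(path: Path) -> None:
    """Replace in-memory sessions with the file's contents, if the file exists."""
    if path.exists():
        sessions.clear()
        sessions.update(json.loads(path.read_text(encoding="utf-8")))

get_history("alice").append({"role": "user", "content": "Hi, I'm Alice."})
get_history("bob").append({"role": "user", "content": "Hi, I'm Bob."})

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "sessions.json"
    save_sessions(store)
    sessions.clear()      # simulate a process restart
    load_sessions(store)

print(sorted(sessions))  # ['alice', 'bob']
```

In a real service you would save after each turn or on shutdown, and move to shared storage once more than one server instance handles the same users.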
FAQ
Does this work with the OpenAI API, not just Ollama?
What happens when the context window fills up even with a sliding window?
How do I count tokens to know how close I am to the limit?
Is summarization accurate? What if the model misses something important?
Can I use this approach with LangChain or LlamaIndex?
How do I handle a user starting a new topic mid-conversation?
Quick recap
1. LLMs have no memory: you manage what they see by controlling the messages array.
2. Full history works great for short sessions. Simple, zero overhead.
3. Sliding window caps token usage. Best default for most production chatbots.
4. Summarized memory lets you span long sessions without hitting context limits.
5. A user facts dict gives structured, reliable recall of important personal details.
6. Combine all three layers for the best balance of recall, cost, and simplicity.
7. You only need a real database when memory must survive restarts, scale across servers, or meet compliance requirements.