How to Add Memory to Your AI Chatbot Without a Database

Here's the part most AI tutorials skip over: language models have no memory. At all. Every single API call is completely stateless — the model has no idea who you are, what you said five minutes ago, or that this isn't the first time you've talked. It just sees whatever text you send in the current request, and nothing else.

ChatGPT and Claude seem to remember your conversation because the application layer does the work: it passes the conversation history back to the model on every turn as a single long input. The model isn't remembering anything. It's just reading a transcript.

Once you understand that, you realize you have full control. You decide what goes into that transcript, how much of it, and in what form. This guide walks through three practical strategies for managing that memory — all without touching a database.

Why LLMs forget — and why that's actually by design

A language model processes text through what's called a context window — a fixed-size buffer of tokens (roughly, word-pieces) that it can "see" at once. Modern models have large context windows: GPT-4o handles 128k tokens, Claude handles 200k, and local models via Ollama typically support 4k–128k depending on the model and your configuration.

Within that window, the model can reference anything — your instructions, the conversation so far, documents you've provided, examples you've shown it. But the moment a request finishes, the model discards everything. There's no persistent state between API calls.

The core challenge

Every message you send costs tokens. Conversation history grows over time. If you naively include the full history on every call, you'll eventually hit the context limit — and long before that, you'll be paying for (or waiting on) a huge amount of redundant tokens every single turn. Memory management is really context window management.
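
To make that growth concrete, here is a rough back-of-the-envelope sketch. The figures (200 tokens per exchange, a 50-token system prompt) are illustrative assumptions, not measurements, and the function treats the prompt on turn t as the system prompt plus t exchanges, which slightly overestimates because the latest reply is not in the prompt yet.

```python
def cumulative_prompt_tokens(
    turns: int,
    tokens_per_exchange: int = 200,  # assumed average, for illustration only
    system_tokens: int = 50,         # assumed system prompt size
) -> int:
    """Total prompt tokens sent across `turns` calls when the full history is resent each time."""
    total = 0
    for turn in range(1, turns + 1):
        # Each call carries the system prompt plus everything said so far
        total += system_tokens + turn * tokens_per_exchange
    return total


for n in (10, 20, 40):
    print(f"{n} turns: {cumulative_prompt_tokens(n)} prompt tokens in total")
```

Under these assumptions, 10 turns add up to 11,500 prompt tokens and 40 turns to 166,000. The total grows roughly with the square of the conversation length, which is why the cost sneaks up on you.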

Strategy 1 — Full conversation history

The simplest approach: keep every message in a list and send the whole list on every API call. This is what most tutorial code does, and it works perfectly for short conversations.

from ollama import chat  # the same pattern works with the openai library; only the client call differs

# This list IS the memory — it grows with every turn
conversation_history = [
    {
        "role": "system",
        "content": (
            "You are a helpful assistant. Be concise and direct. "
            "Remember details the user shares about themselves."
        ),
    }
]

def chat_with_memory(user_message: str) -> str:
    # Add the user's message to history
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

    # Send the full history to the model
    response = chat(
        model="llama3.2",
        messages=conversation_history,
    )

    assistant_reply = response["message"]["content"]

    # Add the model's reply to history too
    conversation_history.append({
        "role": "assistant",
        "content": assistant_reply
    })

    return assistant_reply


# Usage
print(chat_with_memory("My name is Sara and I'm learning Python."))
# → "Nice to meet you, Sara! ..."

print(chat_with_memory("What's my name?"))
# → "Your name is Sara." ✓ It remembers!

That's genuinely all it takes. The model sees the full conversation on every call, so it can reference anything said earlier. The list in memory is the memory.
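
If you want to see the transcript idea without running a model, the sketch below swaps the real client for a stub. fake_chat and ask are hypothetical names invented for this illustration; the stub only records what it was sent and returns a canned reply.

```python
import copy

calls: list[list[dict]] = []  # every payload the "model" received


def fake_chat(model: str, messages: list[dict]) -> dict:
    """Hypothetical stand-in for a chat client: records the payload, returns a canned reply."""
    calls.append(copy.deepcopy(messages))
    return {"message": {"role": "assistant", "content": f"reply {len(calls)}"}}


history = [{"role": "system", "content": "You are a helpful assistant."}]


def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = fake_chat(model="stub", messages=history)
    reply = response["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


ask("My name is Sara.")
ask("What's my name?")

for i, payload in enumerate(calls, start=1):
    roles = [m["role"] for m in payload]
    print(f"call {i}: {len(payload)} messages, roles = {roles}")
```

The second call carries four messages, including the earlier "My name is Sara." turn. That resent text is the only reason a real model could answer the follow-up question.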

✓ Pros

Dead simple to implement
Perfect recall — nothing is lost
The model can reference any past message
Zero extra infrastructure

✗ Cons

Grows without limit — hits context max eventually
Each call gets slower and more expensive over time
Lost when the process restarts
Not practical beyond ~30–40 exchanges

When to use it: short-lived sessions, prototypes, demos, or any chatbot where conversations are expected to stay under 20–30 exchanges.

Strategy 2 — Sliding window memory

Instead of keeping every message forever, you keep only the most recent N exchanges. Older messages fall off the back as new ones come in — like a conveyor belt. This caps your token usage automatically and keeps response times consistent.

from ollama import chat

SYSTEM_PROMPT = {
    "role": "system",
    "content": "You are a helpful assistant. Be concise.",
}

# Only keep the last N *pairs* of messages (user + assistant)
MAX_PAIRS = 10  # = 20 messages total in the window

full_history: list[dict] = []

def build_window(history: list[dict], max_pairs: int) -> list[dict]:
    """Return system prompt + last max_pairs exchanges."""
    # Each exchange = 1 user message + 1 assistant message = 2 items
    cutoff = max_pairs * 2
    recent = history[-cutoff:] if len(history) > cutoff else history
    return [SYSTEM_PROMPT] + recent

def chat_with_window(user_message: str) -> str:
    full_history.append({"role": "user", "content": user_message})

    # Only send the recent window to the model
    windowed_messages = build_window(full_history, MAX_PAIRS)

    response = chat(model="llama3.2", messages=windowed_messages)
    reply = response["message"]["content"]

    full_history.append({"role": "assistant", "content": reply})
    return reply

Notice that full_history still keeps everything — we just trim what we send to the model. This means you can always go back and look at the full log, even though the model only sees a window of it.
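
One subtlety: because the new user message is appended before trimming, a window of an even number of messages can begin with an assistant reply whose question has already fallen off. The variant below is a sketch of one way to handle that, not the only one. build_window_aligned is a hypothetical helper that drops leading non-user messages so the window always opens on a user turn.

```python
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant. Be concise."}


def build_window_aligned(history: list[dict], max_pairs: int) -> list[dict]:
    """Return the system prompt plus recent messages, starting on a user turn."""
    if max_pairs < 1:
        raise ValueError("max_pairs must be at least 1")
    recent = history[-(max_pairs * 2):]
    # Drop orphaned replies at the front so the window opens with a user message
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return [SYSTEM_PROMPT] + recent


# Five finished exchanges plus a new, unanswered user message
demo_history: list[dict] = []
for i in range(1, 6):
    demo_history.append({"role": "user", "content": f"u{i}"})
    demo_history.append({"role": "assistant", "content": f"a{i}"})
demo_history.append({"role": "user", "content": "u6"})

window = build_window_aligned(demo_history, max_pairs=2)
print([m["content"] for m in window[1:]])
```

With max_pairs=2 the raw slice would be a4, u5, a5, u6; the helper drops the orphaned a4 and sends u5, a5, u6.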

Choosing the right window size

The right number depends on your model's context limit and the average length of your messages. A rough rule of thumb: aim to use no more than 40–50% of the context window for history, leaving room for the system prompt, the current message, and the response.

Rough token estimates per message

Short message (one sentence): ~20–40 tokens
Medium message (a paragraph): ~100–200 tokens
Long message (code + explanation): ~400–800 tokens
Average casual chat exchange (both sides): ~150–300 tokens
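
The rule of thumb and the estimates above can be turned into a quick calculation. This is a sketch: max_pairs_for_budget is a hypothetical helper, and the defaults (40% of the context for history, 300 tokens per exchange) are simply the conservative ends of the ranges given here.

```python
def max_pairs_for_budget(
    context_limit: int,
    history_fraction: float = 0.4,   # share of the context reserved for history
    tokens_per_exchange: int = 300,  # upper end of a casual exchange
) -> int:
    """Estimate how many user/assistant pairs fit in the history budget."""
    budget = int(context_limit * history_fraction)
    return max(1, budget // tokens_per_exchange)


for limit in (4096, 8192, 128_000):
    print(f"{limit} token context: about {max_pairs_for_budget(limit)} pairs")
```

For a 4,096-token context this gives 5 pairs, and for 8,192 tokens it gives 10, which is where the MAX_PAIRS = 10 default sits comfortably. If your messages routinely include code, rerun it with a larger tokens_per_exchange.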

When to use it: most production chatbots. It's the most practical default — simple to implement, predictable cost, and works well for conversations that don't require recalling something from 2 hours ago.

Strategy 3 — Summarized rolling memory

This is the most powerful no-database approach. Instead of just dropping old messages, you ask the model to summarize what was discussed before it falls off the window. That summary gets injected at the top of every prompt, giving the model a compressed but meaningful sense of the full conversation history.

Think of it like a meeting recap at the start of every call: "Last time we discussed X, Y, and Z. Sara mentioned she's learning Python and prefers short explanations."

from ollama import chat

WINDOW_SIZE = 6   # messages before we summarize
summary: str = "" # grows over time as history compresses
recent_history: list[dict] = []

def summarize_history(messages: list[dict]) -> str:
    """Ask the model to compress a list of messages into a summary."""
    transcript = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in messages
    )
    response = chat(
        model="llama3.2",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Summarize this conversation excerpt in 3-5 sentences. "
                    f"Focus on key facts, user preferences, and decisions made.\n\n"
                    f"{transcript}"
                ),
            }
        ],
    )
    return response["message"]["content"]

def build_messages_with_summary(user_message: str) -> list[dict]:
    system_content = "You are a helpful assistant. Be concise."
    if summary:
        system_content += (
            f"\n\nContext from earlier in this conversation:\n{summary}"
        )
    return [
        {"role": "system", "content": system_content},
        *recent_history,
        {"role": "user", "content": user_message},
    ]

def chat_with_summary_memory(user_message: str) -> str:
    global summary, recent_history

    messages = build_messages_with_summary(user_message)
    response = chat(model="llama3.2", messages=messages)
    reply = response["message"]["content"]

    # Add this exchange to recent history
    recent_history.append({"role": "user", "content": user_message})
    recent_history.append({"role": "assistant", "content": reply})

    # When the window fills up, compress roughly the oldest half into the summary.
    # The cut is rounded up to an even count so a user/assistant pair is never split.
    if len(recent_history) >= WINDOW_SIZE:
        half = WINDOW_SIZE // 2
        cut = half + (half % 2)
        to_summarize = recent_history[:cut]
        new_summary_chunk = summarize_history(to_summarize)

        # Append the new chunk to the existing summary
        if summary:
            summary = f"{summary}\n{new_summary_chunk}"
        else:
            summary = new_summary_chunk

        # Keep only the newer messages in recent history
        recent_history = recent_history[cut:]

    return reply
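
One thing the code above leaves open is that the summary string only ever grows, since each new chunk is appended to it. A possible refinement, sketched here under assumptions, is to re-compress the summary once it passes a size cap. fold_summary and MAX_SUMMARY_CHARS are hypothetical names, the cap is arbitrary, and the summarize argument stands in for whatever function calls your model.

```python
from typing import Callable

MAX_SUMMARY_CHARS = 1200  # arbitrary cap chosen for illustration


def fold_summary(
    existing: str,
    new_chunk: str,
    summarize: Callable[[str], str],
    max_chars: int = MAX_SUMMARY_CHARS,
) -> str:
    """Append new_chunk to the summary, re-compressing when it gets too long."""
    combined = f"{existing}\n{new_chunk}" if existing else new_chunk
    if len(combined) <= max_chars:
        return combined
    # Over the cap: ask the summarizer to compress the summary itself
    return summarize(combined)


def stub_summarize(text: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    return text[:50]


short = fold_summary("", "Sara is learning Python.", stub_summarize)
long = fold_summary("a" * 1000, "b" * 500, stub_summarize)
print(len(short), len(long))
```

Each re-compression loses some detail, so a cap that is too tight will erode older context quickly.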

⚠️ One important caveat

Summarization costs an extra LLM call every time the window fills. With a local model there is no per-token charge, but the call still takes time; with API-based models it adds both latency and cost. Where you can, trigger summarization asynchronously or between turns rather than in the critical path of a user response.
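
As a minimal sketch of keeping summarization off the critical path, the class below runs the summarizer in a background thread. BackgroundSummarizer is a hypothetical helper, and stub_summarize stands in for a real model call. If several jobs overlap, chunks may land in the summary out of order, which a production version would need to handle.

```python
import threading
from typing import Callable


class BackgroundSummarizer:
    """Runs a summarizer off the request path and accumulates its output."""

    def __init__(self, summarize: Callable[[list[dict]], str]) -> None:
        self._summarize = summarize
        self._lock = threading.Lock()
        self._summary = ""
        self._threads: list[threading.Thread] = []

    def submit(self, messages: list[dict]) -> None:
        snapshot = list(messages)  # copy, so later edits by the caller don't leak in

        def work() -> None:
            chunk = self._summarize(snapshot)
            with self._lock:
                self._summary = f"{self._summary}\n{chunk}".strip()

        thread = threading.Thread(target=work, daemon=True)
        self._threads.append(thread)
        thread.start()

    def current(self) -> str:
        """Return whatever summary is ready right now."""
        with self._lock:
            return self._summary

    def wait(self) -> None:
        """Block until all submitted jobs finish (useful in tests or at shutdown)."""
        for thread in self._threads:
            thread.join()


def stub_summarize(messages: list[dict]) -> str:
    return f"{len(messages)} messages summarized"


summarizer = BackgroundSummarizer(stub_summarize)
summarizer.submit([
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
])
summarizer.wait()
print(summarizer.current())
```

The reply to the user can be returned right after submit(); the next turn simply reads current() and uses whatever summary has arrived by then.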

When to use it: long sessions (customer support, tutoring bots, personal assistants) where users expect the chatbot to remember context from an hour ago or earlier in the same session.

Bonus — Extracting and injecting user facts

Beyond conversation history, you can build a simple "user profile" in plain Python dictionaries — extracted by the model itself as the conversation unfolds. This gives you a lightweight fact store that persists independently of the sliding window.

import json
from ollama import chat

user_facts: dict = {}  # {"name": "Sara", "skill_level": "beginner", ...}

def extract_facts(user_message: str, assistant_reply: str) -> dict:
    """Ask the model to pull any new facts from this exchange."""
    prompt = f"""
Extract any personal facts about the user from this exchange.
Return ONLY a JSON object (or empty {{}} if nothing new).

User: {user_message}
Assistant: {assistant_reply}

Examples of facts to extract: name, location, job, skill level,
preferences, goals, constraints, tools they use.
"""
    response = chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        text = response["message"]["content"].strip()
        # Strip markdown code fences if the model adds them
        fence = "`" * 3
        if text.startswith(fence):
            text = text.removeprefix(fence + "json").removeprefix(fence)
            text = text.removesuffix(fence).strip()
        facts = json.loads(text)
        # Only accept a JSON object; ignore lists, strings, and other shapes
        return facts if isinstance(facts, dict) else {}
    except (json.JSONDecodeError, KeyError):
        return {}

def build_system_prompt() -> str:
    base = "You are a helpful assistant. Be concise."
    if user_facts:
        facts_text = "\n".join(f"- {k}: {v}" for k, v in user_facts.items())
        base += f"\n\nWhat you know about this user:\n{facts_text}"
    return base

def chat_with_facts(user_message: str) -> str:
    messages = [
        {"role": "system", "content": build_system_prompt()},
        {"role": "user", "content": user_message},
    ]
    response = chat(model="llama3.2", messages=messages)
    reply = response["message"]["content"]

    # Extract facts once the reply is ready (this step could move to a background task)
    new_facts = extract_facts(user_message, reply)
    user_facts.update(new_facts)

    return reply

# Example session
print(chat_with_facts("I'm Alex, a backend dev who hates verbose docs."))
print(user_facts)
# → e.g. {"name": "Alex", "job": "backend developer", "preference": "concise docs"}

print(chat_with_facts("What's a good tool for API testing?"))
# The model now knows Alex is a backend dev and will tailor its answer

This approach is surprisingly powerful. The model personalizes its answers based on accumulated facts without you needing to engineer elaborate prompts — you just keep the facts dict up to date and inject it into the system prompt every turn.
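
A plain dict update will happily overwrite a good value with an empty one if the model returns something like {"name": ""}. The sketch below is one hedged way to guard against that. merge_facts is a hypothetical helper, and the choice to lowercase keys is an assumption about how you want the dict normalized.

```python
def merge_facts(known: dict, new: dict) -> dict:
    """Merge newly extracted facts into the known ones, skipping empty values."""
    merged = dict(known)
    for key, value in new.items():
        if value in (None, "", [], {}):
            continue  # don't let an empty extraction erase something we already know
        merged[str(key).strip().lower()] = value
    return merged


known = {"name": "Alex"}
extracted = {"name": "", "Job": "backend developer", "location": None}
print(merge_facts(known, extracted))
```

Genuine corrections ("actually, call me Alexandra") still overwrite the old value, because only empty values are skipped.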

Combining strategies for production

In practice, a production chatbot combines these techniques in layers. Here's how they fit together:

Layer 1: System prompt (static instructions + user facts dict), ~100–300 tokens
Always present. Shapes behavior and personalizes every response.

Layer 2: Rolling summary (compressed history from older exchanges), ~200–500 tokens
Provides context from earlier in the session without flooding the window.

Layer 3: Recent window (last 6–10 exchanges verbatim), ~500–2000 tokens
Exact recent context for coherent back-and-forth flow.

Layer 4: Current message (the user's latest input), ~20–500 tokens
What the model is actually responding to.

Total context usage: roughly 800–3,300 tokens per request. That fits within a 4k context window, although at the upper end only about 800 tokens remain for the response, so shrink the recent window if you need long answers from a small model.
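
Putting the layers together is mostly string and list assembly. The function below is a sketch of one way to do it. assemble_messages and its parameter names are hypothetical, and folding the facts and the summary into the system message is a design choice rather than a requirement.

```python
def assemble_messages(
    base_prompt: str,
    user_facts: dict,
    summary: str,
    recent_history: list[dict],
    user_message: str,
    max_recent: int = 12,  # messages kept verbatim; must be at least 1
) -> list[dict]:
    """Build the messages array from the four layers."""
    system_parts = [base_prompt]
    if user_facts:
        facts_text = "\n".join(f"- {k}: {v}" for k, v in user_facts.items())
        system_parts.append(f"What you know about this user:\n{facts_text}")
    if summary:
        system_parts.append(f"Context from earlier in this conversation:\n{summary}")
    return [
        {"role": "system", "content": "\n\n".join(system_parts)},  # layers 1 and 2
        *recent_history[-max_recent:],                             # layer 3
        {"role": "user", "content": user_message},                 # layer 4
    ]


messages = assemble_messages(
    base_prompt="You are a helpful assistant. Be concise.",
    user_facts={"name": "Sara", "skill_level": "beginner"},
    summary="Sara is learning Python and prefers short explanations.",
    recent_history=[
        {"role": "user", "content": "What is a list?"},
        {"role": "assistant", "content": "An ordered, mutable collection."},
    ],
    user_message="And a tuple?",
)
print(len(messages), messages[0]["content"])
```

Keeping assembly in one pure function makes it easy to log exactly what the model saw on any given turn.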

When do you actually need a database?

In-memory strategies work until they don't. Here's an honest look at where the ceiling is:

Needs DB: Memory survives a server restart
Write history/facts to a JSON file or SQLite. Even a flat file beats a full database for simple cases.

No DB needed: Multiple users with separate conversations
Use a session ID to key separate history lists. No database is needed as long as the sessions fit in memory, for example in a Python dict keyed by session ID. An external in-memory store such as Redis also works, though at that point you are running a separate data store.

Needs DB: Recall specific facts from months ago
This requires persistent storage. A vector database (Chroma, Qdrant) lets you retrieve semantically relevant old memories rather than re-reading the whole history.

Needs DB: High concurrency (many simultaneous users)
In-memory per-process works fine as long as sessions are sticky. For distributed systems with multiple server instances, you need shared storage.

Needs DB: Audit logs or compliance
Always use a proper database. You need durable, queryable, immutable records.
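
The two lightest options from this list, a dict keyed by session ID and a flat JSON file, can be sketched in a few lines. The names here (sessions, remember, save_sessions, load_sessions) are hypothetical, and the sketch ignores concurrent writers, which a real multi-process deployment would have to handle.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# One history list per session ID, all held in process memory
sessions: dict[str, list[dict]] = defaultdict(list)


def remember(session_id: str, role: str, content: str) -> None:
    sessions[session_id].append({"role": role, "content": content})


def save_sessions(path: Path) -> None:
    """Write every session to a single JSON file."""
    path.write_text(json.dumps(sessions, ensure_ascii=False, indent=2), encoding="utf-8")


def load_sessions(path: Path) -> None:
    """Restore sessions from disk, if a file exists."""
    if path.exists():
        sessions.clear()
        sessions.update(json.loads(path.read_text(encoding="utf-8")))


remember("alice", "user", "My name is Alice.")
remember("bob", "user", "I prefer short answers.")

store = Path(tempfile.mkdtemp()) / "sessions.json"
save_sessions(store)
sessions.clear()       # simulate a restart
load_sessions(store)
print(sorted(sessions))
```

Once you need several server instances to share these sessions, this is the point where shared storage stops being optional.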

FAQ

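Does this work with the OpenAI API, not just Ollama?

Yes. The memory logic lives entirely on your side, so every strategy here carries over to OpenAI, Anthropic, Mistral, Groq, or any other provider. Most of them accept the same role/content messages array. What changes is the client call, the shape of the response object, and occasionally where the system prompt goes: Anthropic's API, for example, takes it as a separate system parameter rather than as a message. Swap the chat() call for your preferred client and adapt those details.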
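
One detail worth checking when you swap clients is where the system prompt goes. Some APIs, Anthropic's Messages API among them, expect it as a separate top-level field instead of a message with the system role. The helper below is a sketch of that conversion; split_system is a hypothetical name, and no client library is called here.

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    """Separate system text from the user/assistant turns."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), turns


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Sara."},
    {"role": "assistant", "content": "Nice to meet you, Sara!"},
    {"role": "user", "content": "What's my name?"},
]

system_text, turns = split_system(history)
print(system_text)
print([m["role"] for m in turns])
```

You would then pass system_text and turns to the provider's client in whatever form its documentation specifies. The history list itself stays exactly as it was.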

What happens when the context window fills up even with a sliding window?

If individual messages are very long (large code blocks, pasted documents), even a small window can overflow. The fix is to truncate or summarize individual messages before adding them to history, not just the history as a whole. Always check token counts before sending if you're near the limit.
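
Truncating a single oversized message can be as simple as keeping its beginning and end. The sketch below works on characters rather than tokens to stay dependency-free. truncate_message, the 2,000-character cap, and the two-thirds head split are all assumptions chosen for illustration.

```python
def truncate_message(
    content: str,
    max_chars: int = 2000,
    marker: str = "\n[... truncated ...]\n",
) -> str:
    """Shorten an oversized message, keeping its start and end."""
    if len(content) <= max_chars:
        return content
    keep = max_chars - len(marker)
    if keep <= 0:
        return content[:max_chars]  # cap too small to fit the marker
    head = keep * 2 // 3            # favor the start, where the request usually is
    tail = keep - head
    return content[:head] + marker + content[len(content) - tail:]


pasted = "x" * 5000
shortened = truncate_message(pasted)
print(len(pasted), "->", len(shortened))
```

Apply it before appending to history, so the oversized original never enters the list in the first place.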

How do I count tokens to know how close I am to the limit?

For OpenAI models, use the tiktoken library, which tokenizes text the same way the models do. For local models, a rough estimate is 1 token ≈ 0.75 words. The ollama library has no function for counting tokens before you send a request, but after each call you can read the prompt_eval_count field in the response to see how many prompt tokens were processed.
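
When an exact tokenizer is not available, the 0.75-words-per-token rule can serve as a quick pre-flight check. This is a rough sketch. The function names are hypothetical, and the per-message overhead and reply reserve are guesses you should tune against real counts from your provider.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate using 1 token ≈ 0.75 words."""
    return math.ceil(len(text.split()) / 0.75)


def estimate_history_tokens(messages: list[dict], per_message_overhead: int = 4) -> int:
    """Sum the estimates, adding an assumed few tokens of framing per message."""
    return sum(estimate_tokens(m["content"]) + per_message_overhead for m in messages)


def fits(messages: list[dict], context_limit: int, reserve_for_reply: int = 512) -> bool:
    """True if the history plus room for a reply stays under the limit."""
    return estimate_history_tokens(messages) + reserve_for_reply <= context_limit


history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five six"},
]
print(estimate_history_tokens(history), fits(history, context_limit=4096))
```

Word counts badly underestimate code and non-English text, so treat the result as a lower bound and leave a generous margin.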

Is summarization accurate? What if the model misses something important?

Summarization is lossy by design — that's the tradeoff for fitting more history into fewer tokens. For critical facts (user preferences, confirmed decisions), the user facts dict approach is more reliable because you're storing structured data, not a prose summary. Use both: summaries for general context, structured facts for important specifics.

Can I use this approach with LangChain or LlamaIndex?

Yes. Both frameworks offer memory classes that implement these patterns (in LangChain, for example, ConversationBufferMemory and ConversationSummaryMemory). The code in this guide shows you what they're doing under the hood. Using the frameworks saves boilerplate, but understanding the underlying mechanics helps you debug and customize.

How do I handle a user starting a new topic mid-conversation?

With a sliding window, old context naturally fades out. With summarized memory, you can add a special instruction in the system prompt: 'If the user clearly changes topic, deprioritize earlier context.' For more control, let users explicitly say something like 'let's start fresh' and reset the history list programmatically.
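
The explicit reset can be a small check that runs before each turn. The sketch below uses simple phrase matching. RESET_PHRASES and maybe_reset are hypothetical, and a real bot might prefer a button or slash command over guessing from free text.

```python
RESET_PHRASES = ("let's start fresh", "start over")  # illustrative, not exhaustive


def maybe_reset(user_message: str, history: list[dict]) -> bool:
    """Clear everything except system messages if the user asks for a fresh start."""
    normalized = user_message.lower().replace("\u2019", "'")  # treat curly apostrophes like straight ones
    if any(phrase in normalized for phrase in RESET_PHRASES):
        history[:] = [m for m in history if m["role"] == "system"]  # in place
        return True
    return False


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about Python lists."},
    {"role": "assistant", "content": "Lists are ordered and mutable."},
]

was_reset = maybe_reset("OK, let's start fresh.", history)
print(was_reset, len(history))
```

If you also keep a rolling summary or a facts dict, decide separately whether a reset should clear those. Users who ask for a fresh topic usually still want the bot to remember their name.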

Quick recap

1. LLMs have no memory. You manage what they see by controlling the messages array.
2. Full history works great for short sessions. Simple, zero overhead.
3. Sliding window caps token usage. Best default for most production chatbots.
4. Summarized memory lets you span long sessions without hitting context limits.
5. A user facts dict gives structured, reliable recall of important personal details.
6. Combine the layers (user facts, rolling summary, recent window) for the best balance of recall, cost, and simplicity.
7. You only need a real database when memory must survive restarts, scale across servers, or meet compliance requirements.

Keep building

Memory is one piece of the puzzle. These guides cover the next steps for building more capable AI applications:

→ How to Build a Local AI Chatbot with Ollama (No Cloud, No Cost)
→ RAG vs Fine-Tuning: Which LLM Strategy Is Right for You?
→ What Is a Vector Database and When Do You Actually Need One?
→ Why Your AI App Gives Different Answers Every Time (And How to Fix It)
