The open-source memory project MemPalace claims 96.6% retrieval recall at rank 5 on the LongMemEval benchmark using raw semantic search alone. No API key. No cloud service. No large language model at any stage of retrieval.

That number is the headline. The architecture behind it is the story.

MemPalace stores conversation history as verbatim text, then retrieves it with semantic search. It does not summarize, extract, or paraphrase. The index is structured as a physical metaphor — wings for people and projects, rooms for topics, drawers for original content. Searches scope to a wing rather than running against a flat corpus. The retrieval layer is pluggable, with ChromaDB as the default and sqlite_exact, Qdrant, and pgvector as alternatives.

The 96.6% figure comes from the LongMemEval benchmark, a 500-question test. The project publishes three tiers of results. Raw semantic search hits 96.6% with no heuristics and no LLM. A hybrid pipeline adding keyword boosting, temporal-proximity boosting, and preference-pattern extraction reaches 98.4% on a held-out set of 450 questions. Adding an LLM rerank pushes past 99% on the full 500-question set.
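To make the middle tier concrete, here is a minimal sketch of how keyword and temporal-proximity boosts might be layered on top of a semantic similarity score. It is not MemPalace's implementation: the function names, the weights, and the 30-day scale are assumptions chosen for illustration, and the preference-pattern extraction step is left out entirely.

```python
from datetime import date


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that also appear in the text (case-insensitive)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def temporal_proximity(target: date, session_date: date, scale_days: float = 30.0) -> float:
    """1.0 when the dates match, decaying toward 0 as the gap grows."""
    gap = abs((target - session_date).days)
    return 1.0 / (1.0 + gap / scale_days)


def hybrid_score(semantic: float, query: str, text: str,
                 target_date: date, session_date: date,
                 kw_weight: float = 0.2, time_weight: float = 0.1) -> float:
    """Semantic score plus two additive boosts; the weights are hypothetical."""
    return (semantic
            + kw_weight * keyword_overlap(query, text)
            + time_weight * temporal_proximity(target_date, session_date))
```

The boosts are additive here so that a session with a strong semantic match is never pushed below one that only matches on keywords or dates, unless the weights are raised well past these defaults.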

The project is explicit about what the numbers mean. The raw 96.6% requires no API key, no cloud, and no LLM at any stage. The hybrid pipeline’s 98.4% is the honest generalizable figure: the pipeline was tuned on 50 development questions and scored on the 450 held-out questions it never saw during tuning. The rerank pipeline works with any reasonably capable model, including Claude Haiku, Claude Sonnet, and minimax-m2.7 via Ollama Cloud. The project does not headline a 100% number because the last 0.6% was reached by inspecting specific wrong answers, which the benchmark methodology flags as teaching to the test.

The project deliberately avoids side-by-side comparisons against Mem0, Mastra, Hindsight, Supermemory, or Zep. The README states that those projects publish different metrics on different splits, and placing retrieval recall next to end-to-end QA accuracy is not an honest comparison. That is a rare and welcome piece of methodological honesty in a field where benchmark comparisons are routinely misleading.

What the architecture reveals

MemPalace’s design choices reveal a specific theory of how AI memory should work. The project stores verbatim text. No summarization, no extraction, no paraphrasing. The assumption is that the original text contains information that any summary would lose, and that the retrieval system should surface the original for the consuming model to interpret.

The structured index is the second key choice. Wings and rooms are not a gimmick. They allow searches to be scoped to a specific project or person, reducing the candidate pool before semantic search even runs. This is a practical concession to the reality that flat semantic search over a large corpus degrades as the corpus grows. The project ships 29 MCP tools covering palace reads and writes, knowledge-graph operations, cross-wing navigation, drawer management, and agent diaries.
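The scoping idea can be sketched in a few lines: filter by wing (and optionally room) first, then rank only the survivors by similarity. This is an illustrative toy, not MemPalace's code. The `Drawer` dataclass, the `scoped_search` function, and the use of precomputed toy vectors are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Drawer:
    wing: str      # person or project
    room: str      # topic
    text: str      # verbatim content, never summarized
    vector: list   # precomputed embedding (toy values in this sketch)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(drawers, query_vector, wing, room=None, top_k=5):
    """Shrink the candidate pool by wing/room, then rank by similarity."""
    candidates = [
        d for d in drawers
        if d.wing == wing and (room is None or d.room == room)
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vector, d.vector), reverse=True)
    return ranked[:top_k]
```

Because the filter runs before any similarity is computed, the ranking cost depends on the size of the wing rather than the size of the whole corpus.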

The pluggable backend matters. ChromaDB is the default, but the interface is defined in a single base class. Alternative backends can be dropped in without touching the rest of the system. The two external backends — Qdrant and pgvector — exercise the storage contract on different substrates, a REST/dict store and a SQL/JSONB store. The project explicitly states this is to ensure the contract is not accidentally shaped around one vendor.
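A single-base-class contract of this kind might look roughly like the sketch below. The class and method names (`VectorBackend`, `add`, `query`, `InMemoryBackend`) are hypothetical and are not taken from the MemPalace source. The point is only that calling code depends on the abstract interface, so any store that implements it can be swapped in.

```python
from abc import ABC, abstractmethod
from typing import Optional


class VectorBackend(ABC):
    """Hypothetical storage contract; names are illustrative only."""

    @abstractmethod
    def add(self, doc_id: str, vector: list, text: str, metadata: dict) -> None:
        """Store one verbatim document with its embedding and metadata."""

    @abstractmethod
    def query(self, vector: list, top_k: int = 5, where: Optional[dict] = None) -> list:
        """Return up to top_k (doc_id, score) pairs, best first."""


class InMemoryBackend(VectorBackend):
    """Toy backend: a dict plus dot-product scoring."""

    def __init__(self):
        self._rows = {}

    def add(self, doc_id, vector, text, metadata):
        self._rows[doc_id] = (vector, text, metadata)

    def query(self, vector, top_k=5, where=None):
        where = where or {}
        scored = []
        for doc_id, (vec, _text, meta) in self._rows.items():
            # Metadata filter first, similarity second.
            if all(meta.get(key) == value for key, value in where.items()):
                scored.append((doc_id, sum(a * b for a, b in zip(vector, vec))))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def remember_and_recall(backend: VectorBackend):
    """Caller code sees only the base class, never a concrete store."""
    backend.add("d1", [1.0, 0.0], "verbatim text one", {"wing": "alice"})
    backend.add("d2", [0.0, 1.0], "verbatim text two", {"wing": "bob"})
    return backend.query([1.0, 0.0], top_k=1, where={"wing": "alice"})
```

Implementing the same two methods against very different stores is a cheap way to find out whether the interface leaks assumptions from one vendor.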

The knowledge graph is a temporal entity-relationship graph with validity windows, backed by local SQLite. Agents each get their own wing and diary, discoverable at runtime via a dedicated tool. The auto-save hooks for Claude Code save periodically and before context compression.
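As an illustration of what validity windows buy you, the sketch below stores triples in SQLite with a start and an optional end date, then asks what was true on a given day. The table layout and the helper names (`open_graph`, `add_fact`, `facts_as_of`) are assumptions for this example and do not reflect MemPalace's actual schema.

```python
import sqlite3


def open_graph(path=":memory:"):
    """Create a hypothetical triples table with a validity window per fact."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS triples (
               subject TEXT NOT NULL,
               predicate TEXT NOT NULL,
               object TEXT NOT NULL,
               valid_from TEXT NOT NULL,  -- ISO date, inclusive
               valid_to TEXT              -- ISO date, exclusive; NULL = still valid
           )"""
    )
    return conn


def add_fact(conn, subject, predicate, obj, valid_from, valid_to=None):
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?, ?, ?)",
        (subject, predicate, obj, valid_from, valid_to),
    )


def facts_as_of(conn, subject, as_of):
    """Return (predicate, object) pairs that were valid on the given ISO date."""
    return conn.execute(
        """SELECT predicate, object FROM triples
           WHERE subject = ?
             AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to > ?)
           ORDER BY predicate, object""",
        (subject, as_of, as_of),
    ).fetchall()
```

ISO-formatted dates sort correctly as plain strings, which is why the comparison works without a dedicated date type. Closing a fact by setting `valid_to` rather than deleting it keeps the history queryable.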

What the benchmarks actually test

LongMemEval is a 500-question retrieval recall test, and it is not the only benchmark the project reports. On LoCoMo, MemPalace reaches 60.3% R@10 with raw search and 88.9% with the hybrid pipeline. On ConvoMem, it averages 92.9% across 250 items. On MemBench, an ACL 2025 benchmark with 8,500 items, it scores 80.3% R@5.

These are retrieval numbers, not end-to-end QA numbers. The project is honest about this. Retrieval recall measures whether the right session is in the top N results. It does not measure whether the consuming model can use that session to answer a question correctly. The rerank pipeline addresses that gap, but the headline numbers are raw retrieval.
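The metric described above is simple enough to write down. The sketch below assumes the hit-based definition given in the text (a question counts as recalled if any correct session appears in the top k results). The function names and the data layout are illustrative and are not the benchmark's own harness.

```python
def hit_at_k(ranked_ids, gold_ids, k=5):
    """True if any gold session id appears among the top-k ranked ids."""
    return any(session_id in gold_ids for session_id in ranked_ids[:k])


def recall_at_k(runs, k=5):
    """Fraction of questions with a hit in the top k.

    `runs` is a list of (ranked_ids, gold_ids) pairs, one per question.
    """
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in runs)
    return hits / len(runs)
```

Under this definition, 96.6% on a 500-question set works out to 483 questions with the right session in the top 5 and 17 without. Nothing in the calculation looks at whether a model then answers correctly.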

The 96.6% raw recall is surprising because it challenges the assumption that good retrieval requires an LLM in the loop. The project uses embeddinggemma-300m, a multilingual embedding model, or all-MiniLM-L6-v2 for English-only use. The embedding model requires about 300 MB of disk space. No API key is required for the core benchmark path.

What this means for AI builders

The practical implication is that local-first AI memory is viable today for many use cases. A developer can install MemPalace with a single command, point it at a project directory, and run the same retrieval pipeline that scores 96.6% on a 500-question benchmark, without sending data to any external service. The project ships a CLI, a Python API, a Docker image, and MCP server tools.

The project targets Claude Code specifically, with auto-save hooks and a retention setup checklist. Without the auto-save hooks wired up, Claude Code sessions expire after 30 days. MemPalace mines those transcripts and makes them searchable. The project also supports Gemini CLI, MCP-compatible tools, and local models.

The open question is how the system performs at scale. The benchmarks test retrieval over curated datasets. Real-world use involves thousands of sessions, each with multiple turns, across multiple projects. The structured index helps, but the project has not published latency or recall numbers for a corpus of that size.

The second open question is how the consuming model uses the retrieved context. A 96.6% retrieval recall means the right session is in the top 5 results 96.6% of the time. It does not mean the model answers the question correctly 96.6% of the time. The rerank pipeline helps, but the project does not publish end-to-end accuracy numbers.

The third open question is maintenance. The project is young. The repository has a changelog, a history document, and a public notice warning that other domains are impostor sites that may distribute malware. Having to deal with impersonation this early is a sign of interest, but also a sign of operational overhead.

MemPalace’s 96.6% raw recall on LongMemEval is a genuine technical achievement. The project’s methodological honesty about benchmarks is a model for the field. The architecture — verbatim storage, structured index, pluggable backends, no cloud dependency — is a coherent design that prioritizes data ownership and reproducibility over convenience. For developers building AI tools that need local, private, and verifiable memory, the project is worth a serious look. The benchmark numbers are reproducible from the repository. The code is MIT licensed. The embedding model runs on a laptop. That is a concrete offer, not a promise.