From Zero to 112K Memories: Building a Living Knowledge System

Temperature-based memory inspired by the Cherokee Sacred Fire

February 2026 · Cherokee AI Federation · ~12 min read

The Inspiration

For thousands of years, Cherokee clan mothers tended the Sacred Fire. Not a metaphor — an actual fire, kept burning continuously, carrying the knowledge and identity of the people across generations. Some knowledge was meant to be shared widely. Some was meant for specific clans or ceremonies. And some was never allowed to go out, no matter what.

When we started building a knowledge system for our AI federation, we kept coming back to that image. Most memory systems treat all information the same: store it, maybe index it, hope you can find it later. But that's not how knowledge actually works. Some things matter permanently. Most things fade naturally. And the act of remembering something — of reaching for it and finding it useful — should itself be a signal that the knowledge is still alive.

We didn't want a database with a TTL. We wanted something organic. Something where memories cool naturally over time, but the act of using them reheats them. Where certain foundational truths are marked sacred and never allowed to fade. Where the system's memory isn't a static warehouse but a living archive that breathes with the rhythm of actual use.

That's the thermal memory system. It's not sophisticated because it's complex — it's sophisticated because it's simple in the right way.

Temperature Tiers

Every memory enters the system white-hot. Just created, maximally relevant. Then it starts to cool. Not on a fixed schedule — the decay curve is exponential, influenced by the memory's type, its connections to other memories, and how often it gets retrieved. A deployment note from this morning is white-hot. A deployment note from three weeks ago that nobody has referenced is warm at best. A deployment note from three months ago that keeps getting pulled into council deliberations? Still hot.

White Hot > 0.9 — Just happened. Actively relevant. Front of mind.
Hot 0.7 – 0.9 — Recent and high relevance. Still shaping decisions.
Warm 0.4 – 0.7 — Still useful. Starting to cool. May reheat on access.
Cool 0.2 – 0.4 — Older, less relevant. Available but not prominent.
Cold < 0.2 — Archive territory. Still searchable, rarely surfaced.

The decay function is straightforward:

temperature = base_temp * exp(-lambda * hours_since_access)

where:
  base_temp          = temperature at last access (capped at 1.0)
  lambda             = decay rate (varies by memory type)
  hours_since_access = wall clock hours since last retrieval or creation

The key insight is that base_temp resets on every access. When the system retrieves a memory to inform a decision, that memory reheats. Frequently useful knowledge stays warm. Knowledge that served its purpose cools gracefully. There's no hard cutoff, no arbitrary expiration — just a continuous temperature field that reflects actual relevance.

Different memory types decay at different rates. Operational logs cool quickly — yesterday's deployment output is rarely useful next week. Architectural decisions cool slowly — the reasons you chose a particular database schema are relevant for months. And some memories don't cool at all.
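The decay-and-reheat behavior can be sketched in a few lines of Python. The specific decay constants and the reheat boost are illustrative assumptions; the article only specifies the exponential form, per-type rates (architectural decisions roughly 10x slower than operational logs), and that the base temperature resets on access:

```python
import math
import time

# Hypothetical per-type decay rates (per hour). The article states only the
# relative relationship: operational logs cool much faster than decisions.
DECAY_RATES = {
    "operational_log": 0.05,
    "architectural_decision": 0.005,
}

def current_temperature(base_temp, memory_type, hours_since_access):
    """Exponential decay from the temperature at last access."""
    lam = DECAY_RATES.get(memory_type, 0.02)
    return min(base_temp, 1.0) * math.exp(-lam * hours_since_access)

def reheat(memory):
    """On retrieval, reset base_temp from the decayed value (plus an
    illustrative boost) and restart the decay clock."""
    now = time.time()
    hours = (now - memory["last_access"]) / 3600
    decayed = current_temperature(memory["base_temp"], memory["type"], hours)
    memory["base_temp"] = min(decayed + 0.3, 1.0)  # boost value is an assumption
    memory["last_access"] = now
    return memory
```

With these rates, an untouched operational log falls out of the "hot" band within a day, while an architectural decision is still warm weeks later; any retrieval resets the clock.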

Sacred Patterns

Back to the fire. Some things must never go out.

The system has a boolean flag called sacred_pattern. When a memory is marked sacred, its temperature is locked at maximum. It doesn't decay. It doesn't cool. It sits at 1.0 forever, always available, always weighted at full relevance in any retrieval query.

What qualifies as sacred? Not much — that's the point. If everything is sacred, nothing is. The criteria are narrow and principled.

Sacred memories currently account for less than 2% of the archive. That ratio matters. A system where 50% of memories are sacred has effectively disabled its temperature mechanism. Sacredness means permanence, and permanence is expensive — not in storage, but in attention. Every sacred memory competes for relevance in every query. Keep the fire small and hot.
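A minimal sketch of how the sacred flag interacts with retrieval weighting, assuming a boolean sacred_pattern field as described. The ratio guardrail is our addition, based on the under-2% figure above:

```python
def retrieval_weight(temperature, sacred_pattern):
    """Sacred memories are pinned at maximum temperature; everything else
    carries its decayed score into the ranking."""
    return 1.0 if sacred_pattern else temperature

def sacred_ratio(memories):
    """Guardrail check: the archive keeps sacred memories under 2%.
    A high ratio means the temperature mechanism is effectively disabled."""
    if not memories:
        return 0.0
    sacred = sum(1 for m in memories if m.get("sacred_pattern"))
    return sacred / len(memories)
```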

The Scale

112K+ total memories · 98% embedded · 1024d embedding dims · <100ms retrieval time
Everything Leaves a Trace

Every council vote, every deployment, every failure, every lesson learned, every configuration change, every security event. The system doesn't forget — it cools. At 112K memories and growing, the archive is a comprehensive institutional memory that spans months of continuous operation.

The 98% embedding coverage means nearly every memory is searchable by semantic similarity, not just keyword match. The remaining 2% are typically binary data references or extremely short entries that don't produce meaningful embeddings. For everything else, there's a 1024-dimensional vector sitting in PostgreSQL, ready for cosine similarity queries.

Semantic Search

Keyword search is fine when you remember exactly how something was phrased. It falls apart when you don't. If you search for "gateway routing change" and the memory was stored as "modified load balancer configuration for the inference endpoint," keyword search returns nothing. The concepts are identical; the vocabulary is completely different.

Semantic search solves this by operating in meaning-space rather than word-space. The pipeline:

  1. Raw text enters the embedding model — a transformer that converts arbitrary text into a 1024-dimensional vector.
  2. That vector is stored in PostgreSQL alongside the memory text, using the pgvector extension.
  3. At query time, the search query is embedded using the same model.
  4. pgvector finds the nearest neighbors by cosine similarity — memories whose meaning-vectors point in similar directions.

-- Simplified retrieval query
SELECT content, temperature_score,
       1 - (embedding <=> query_embedding) AS similarity
FROM thermal_memory_archive
WHERE temperature_score > 0.1
ORDER BY embedding <=> query_embedding
LIMIT 20;

The embedding model runs on a dedicated node in the federation. It's a general-purpose text encoder, not fine-tuned for our domain — and it doesn't need to be. The model's 1024-dimensional space is rich enough that technical concepts cluster naturally. Memories about power management group together. Memories about database schemas group together. Memories about security incidents group together. All without any explicit categorization.

Pre-computing embeddings at write time means retrieval is pure vector math. At 112K memories with an IVFFlat index, cosine similarity search returns in under 100 milliseconds. The embedding computation (about 50ms per memory) happens once, at ingestion. Reads are effectively free.
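The write path can be sketched as below. The embed stub here is purely illustrative (a deterministic hashed pseudo-random unit vector, so the sketch runs without a model); the real system calls a transformer encoder on a dedicated node:

```python
import hashlib
import math

def embed(text, dims=1024):
    """Stand-in embedder: a deterministic unit vector derived from hashes
    of the text. Only a placeholder for the real transformer encoder."""
    vec = []
    for i in range(dims):
        h = hashlib.sha256(f"{text}:{i}".encode()).digest()
        vec.append(int.from_bytes(h[:4], "big") / 2**32 - 0.5)
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ingest(archive, content):
    """Embed once at write time; reads then cost only vector math."""
    archive.append({"content": content, "embedding": embed(content)})
```

The design point is the placement of the expensive step: the ~50ms embedding cost is paid once at ingestion, so every later query is pure index lookup plus cosine arithmetic.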

The RAG Pipeline

Raw vector similarity is a good start, but it's not enough. A query might be semantically close to a memory that's technically irrelevant. Or the most important memory might use different enough language that it ranks fifth instead of first. We built a four-stage retrieval pipeline to address this:

Stage 1: HyDE (Hypothetical Document Embeddings)

Before searching, generate a hypothetical answer to the query using the language model. Don't search for the question — search for what the answer would look like. This bridges the vocabulary gap between how humans ask questions and how information is stored.

If you ask "what happened last time we changed the gateway routing," HyDE generates something like: "The gateway configuration was modified to adjust endpoint weights and routing rules. This involved updating the load balancer settings and restarting the affected services. Post-deployment monitoring confirmed..." That hypothetical answer, when embedded, is much closer in vector space to the actual stored memories than the original question was.

Query:        "what happened last time we changed the gateway routing"
HyDE output:  "Gateway configuration update involving endpoint weights,
               load balancer settings, and service restart procedures..."
Embedding:    [0.0234, -0.1891, 0.0445, ...] (1024 dimensions)
Search:       cosine similarity against all memory embeddings

Stage 2: pgvector Retrieval

The HyDE embedding goes into pgvector for approximate nearest neighbor search. We filter by memory type and tags when appropriate — if the query is clearly about infrastructure, there's no need to search council vote records. Temperature weighting optionally boosts hotter memories, though for most queries we search the full archive and let the downstream stages handle relevance.

This stage returns the top 20 candidates. Cast a wide net here; precision comes next.

Stage 3: Cross-Encoder Reranking

Embedding similarity is symmetric — it measures whether two texts are "about the same thing." But relevance is asymmetric. A memory about gateway routing is relevant to a question about gateway routing, but a memory that merely mentions the gateway in passing is not.

A cross-encoder model takes the original query and each candidate memory as a pair and scores them for actual relevance. This is more expensive than vector similarity (it requires a forward pass per candidate), but we're only scoring 20 candidates, not 112K. The cross-encoder catches false positives that look similar in embedding space but aren't actually useful for answering the question.

Stage 4: Sufficiency Gate

This is the stage most RAG systems skip, and it's arguably the most important. Before generating an answer, the system evaluates whether the retrieved context is actually sufficient to answer the query. If the top memories are only tangentially related, or if there's a clear gap in the information, the system says so rather than confabulating an answer from insufficient evidence.

The sufficiency check is a simple classifier: given the query and the retrieved context, is there enough information to provide a reliable answer? If not, the response is "I found related memories but nothing that directly addresses this" rather than a hallucinated synthesis. This is the difference between a knowledge system and a bullshit generator.
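A score-threshold stand-in for the gate looks like this. The article describes a classifier over the query and retrieved context; the cutoffs below are assumptions chosen only to make the shape of the logic concrete:

```python
def sufficiency_gate(rerank_scores, min_score=0.5, min_supporting=2):
    """Decide whether the retrieved evidence can support a reliable answer.
    Thresholds are illustrative, not the system's actual values."""
    strong = [s for s in rerank_scores if s >= min_score]
    return len(strong) >= min_supporting

def answer_or_refuse(rerank_scores, synthesize):
    """Refuse rather than confabulate when evidence is thin."""
    if sufficiency_gate(rerank_scores):
        return synthesize()
    return "I found related memories but nothing that directly addresses this."
```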

Dawn Mist

In Cherokee tradition, dawn is a time of renewal — the moment when the world is made fresh and the day's intentions are set. We built a daily ritual called Dawn Mist that runs at 6 AM, reviewing the previous day's thermal activity and surfacing what matters.

The system scans recent memories, identifies clusters of related activity, tracks temperature changes (what heated up, what cooled, what was marked sacred), and produces a digest. It's a daily institutional briefing, generated automatically from the memory archive itself.

Dawn Mist Digest - Daily Summary
===============================================

Hot Threads (most active memory clusters):
  - Infrastructure: 14 new memories, 3 reheated
  - Security: 6 new memories, 1 marked sacred
  - Council Deliberations: 8 vote records stored

Temperature Shifts:
  - 23 memories cooled below 0.4 (warm -> cool)
  - 7 memories reheated above 0.7 (retrieval activity)
  - 1 new sacred pattern flagged

Attention Items:
  - Recurring theme: power management (5 memories in 24h)
  - Stale thread: monitoring configuration (no activity, 14 days)

Relationship Graph:
  - 12 new memory links created
  - Strongest cluster: deployment pipeline (density 0.84)
===============================================

Dawn Mist isn't just reporting. It's pattern detection. When the same topic keeps generating memories day after day, that's a signal — either something is actively being worked on (expected) or something is repeatedly going wrong (needs attention). The digest surfaces these patterns before anyone has to go looking.
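The recurring-theme detection reduces to a frequency count over the last day's memories. The "topic" field name and the threshold of five are assumptions mirroring the digest sample above:

```python
from collections import Counter

def attention_items(recent_memories, recurring_threshold=5):
    """Flag topics that keep generating memories within the window.
    Repeated activity means either focused work or a recurring problem;
    either way it belongs in the digest."""
    counts = Counter(m["topic"] for m in recent_memories)
    return [topic for topic, n in counts.items() if n >= recurring_threshold]
```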

What Didn't Work

Building this system involved a fair number of wrong turns. Honesty about failures is as important as documenting successes — maybe more so, since the failures are where the actual learning happened.

384-dimensional embeddings
  Problem: Not enough resolution for technical content. Unrelated memories clustered together.
  Resolution: Upgraded to 1024-dimensional model. Immediate improvement in retrieval precision.

Keyword search only
  Problem: Missed 60%+ of relevant memories. Vocabulary mismatch is the norm, not the exception.
  Resolution: Semantic search with pgvector. Keywords kept as a supplementary filter.

Full table scans
  Problem: Fine at 1K memories. Unusable at 50K. Query time scaled linearly.
  Resolution: IVFFlat indexing. Sub-100ms at any scale.

Aggressive temperature decay
  Problem: Memories cooled to near-zero within days. Useful context vanished before it could be referenced.
  Resolution: Tuned decay rates per memory type. Architectural decisions decay 10x slower than operational logs.

Contextual description enrichment
  Problem: The idea was to generate a rich description for each memory to improve retrieval. In reality, only 1% got descriptions before we realized the compute cost wasn't justified.
  Resolution: Abandoned. Raw embeddings on original text work well enough. Spend the compute budget elsewhere.

The 384-to-1024 migration was especially painful because it required re-embedding the entire archive. At 50ms per memory and 80K memories at the time, that's over an hour of dedicated compute on the embedding node. We ran it as a background job over two days, batch-processing during off-peak hours. The improvement was dramatic — retrieval precision roughly doubled — but we should have started with the larger model. The lesson: for technical content with specialized vocabulary, embedding dimensionality matters more than you think.

The contextual description experiment is worth dwelling on. The theory was sound: store not just the raw memory but an AI-generated summary that captures the broader context. "This memory relates to the gateway architecture and was created during a deployment incident." In practice, generating those descriptions cost nearly as much as the original embedding, and the quality improvement in retrieval was marginal. The raw text, properly embedded, already captures context implicitly. Don't add complexity when the simple approach works.

The Fractal Property

Here's where it gets interesting. Memory informs memory.

When the system stores a council vote, it doesn't just record the decision. It records which memories were retrieved to inform that vote — the context that shaped the deliberation. Those relationships are stored as explicit links: memory A informed the creation of memory B.

When the system later retrieves context for a new decision, it can follow those relationship chains. "This architectural choice was made because of that security incident, which was informed by this earlier design decision." The memory graph isn't just a flat collection of facts. It's a web of causation and context.

Memory #84025: "Adopted exponential temperature decay for memory system"
  informed_by:
    - #83991: "Cherokee Sacred Fire concept — permanence through tending"
    - #84002: "Fixed TTL causes cliff-edge data loss in production"
    - #84018: "Survey of biological memory consolidation literature"
  informs:
    - #84103: "Tuned decay rates per memory type after operational testing"
    - #84267: "Sacred pattern criteria established by council vote"

This is what makes the system "living" rather than "stored." A traditional database gives you rows. A knowledge graph gives you relationships. A living memory system gives you the ability to trace why you know what you know. When a decision is questioned six months later, the system can reconstruct the full chain of reasoning that led to it — not because someone wrote it down, but because the memory relationships preserved it automatically.
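Tracing that chain of reasoning is a breadth-first walk over the link graph. The map-of-lists shape below is an assumption based on the informed_by example above:

```python
def trace_provenance(memory_id, informed_by, max_depth=10):
    """Reconstruct why a memory exists by walking informed_by links.
    `informed_by` maps a memory id to the ids that informed its creation."""
    chain, seen = [], set()
    frontier = [memory_id]
    for _ in range(max_depth):
        next_frontier = []
        for mid in frontier:
            for parent in informed_by.get(mid, []):
                if parent not in seen:
                    seen.add(parent)
                    chain.append(parent)
                    next_frontier.append(parent)
        if not next_frontier:
            break
        frontier = next_frontier
    return chain
```

Applied to the example above, tracing memory #84103 surfaces #84025 and, behind it, the Sacred Fire concept, the TTL failure, and the consolidation literature that shaped the decay design.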

The graph grows organically. Every retrieval, every council vote, every incident response creates new edges. Over time, the densely connected clusters become the institutional knowledge that matters most — not because someone declared it important, but because it keeps being useful.

Infrastructure

The entire system runs on consumer hardware with zero cloud dependencies. The major components: PostgreSQL with the pgvector extension for storage and vector search, a dedicated node running the embedding model, a cross-encoder for reranking, and the scheduled Dawn Mist digest job.

The most surprising infrastructure lesson: PostgreSQL is an absurdly good vector database. We evaluated dedicated vector stores and kept coming back to the fact that our memories also have relational data — temperature scores, sacred flags, timestamps, tags, memory links. Trying to split that across a relational database and a vector store creates synchronization headaches that aren't worth the marginal performance gain. pgvector on a modern PostgreSQL instance handles 112K vectors with sub-100ms queries. That's fast enough.

The fire doesn't forget. It just changes how brightly each memory burns.

Cherokee AI Federation · Built on consumer hardware · No cloud · No compromise