Building a 7-Specialist AI Council on Consumer Hardware
Seven perspectives, one GPU, zero unilateral decisions
The Problem
Most AI deployments follow the same pattern: one model, one system prompt, one perspective. You give the model a role — "you are a helpful assistant" or "you are a security analyst" — and it does its best impression of that role. For simple tasks, this works fine. For decisions that affect infrastructure, security, cultural alignment, and long-term strategy simultaneously, it is reckless.
A single-agent system has exactly one set of blind spots. It cannot argue with itself in any meaningful way. It will not raise a concern flag about its own recommendation. When it says "this looks safe," there is no second opinion, no dissent, no one in the room asking "what happens if you're wrong?"
We needed disagreement. Not the performative kind where you ask one model to "consider the counterarguments." Real structural disagreement, where an approval requires multiple independent evaluations and a single credible objection can halt the entire process. We needed a system where the security perspective could override the performance perspective, where long-term cultural impact could block a short-term optimization, and where no single viewpoint could push a decision through unchallenged.
So we built a council.
The Council
Seven specialists, each with a distinct domain and a distinct way of thinking about problems. They are not seven different models. They are seven different system prompts running against the same 72B language model on a single consumer GPU. The diversity comes from prompt engineering, not from model diversity.
| Specialist | Domain | Focus |
|---|---|---|
| Crawdad | Security | Attack surface analysis, credential exposure, operational security, threat modeling |
| Gecko | Technical Integration | Performance, compatibility, architecture fit, dependency chains |
| Turtle | Seven Generations | Will this still make sense in 175 years? Long-arc sustainability and wisdom |
| Eagle Eye | Observability | If we cannot see it, we cannot trust it. Monitoring, logging, alerting coverage |
| Spider | Cultural Integration | Alignment with community values, cultural coherence, tradition compatibility |
| Peace Chief | Democratic Coordination | Consensus building, conflict resolution, procedural fairness |
| Raven | Strategic Planning | Long-term positioning, resource allocation, risk assessment, game theory |
Each specialist receives the same proposal. Each evaluates it independently through their domain lens. Their responses are collected, synthesized, and voted on before any action is taken. No specialist can see another's response during evaluation — they form their opinions in isolation, then the synthesis layer brings them together.
This is not a committee that discusses until everyone agrees. It is a parallel evaluation where disagreement is preserved and surfaced, not smoothed over.
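The parallel, isolated evaluation can be sketched as follows. This is a minimal illustration, not the production orchestrator: the specialist names come from the table above, but `evaluate` is a stand-in for an inference call and its return shape is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

SPECIALISTS = ["Crawdad", "Gecko", "Turtle", "Eagle Eye",
               "Spider", "Peace Chief", "Raven"]

def evaluate(specialist: str, proposal: str) -> dict:
    """Stand-in for one inference call using that specialist's system prompt.
    Each call sees only its own prompt and the proposal text, never another
    specialist's response."""
    return {"specialist": specialist, "position": "approve",
            "confidence": 0.9, "concern": False, "rationale": "..."}

def council_vote(proposal: str) -> list[dict]:
    # Evaluations run in parallel and in isolation; the synthesis layer
    # only sees the responses after all of them are collected.
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        futures = [pool.submit(evaluate, s, proposal) for s in SPECIALISTS]
        return [f.result() for f in futures]

votes = council_vote("Deploy new monitoring dashboard")
```

The isolation matters more than the parallelism: collecting responses before any synthesis is what prevents one specialist's framing from leaking into another's evaluation.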
How Voting Works
Each specialist returns a structured vote with four components:
- Position: approve, reject, or conditional (approve with specific requirements)
- Confidence: a score from 0.0 to 1.0 indicating how certain the specialist is in their assessment
- Concern flag: a boolean — if any specialist raises a concern flag, the entire decision is halted for review
- Rationale: a written explanation of the specialist's reasoning
Here is what a real vote looks like. Suppose the proposal is "Deploy new monitoring dashboard with external API integration":
| Specialist | Position | Confidence | Concern |
|---|---|---|---|
| Crawdad | Conditional | 0.82 | Yes |
| Gecko | Approve | 0.91 | No |
| Turtle | Approve | 0.78 | No |
| Eagle Eye | Approve | 0.95 | No |
| Spider | Approve | 0.86 | No |
| Peace Chief | Approve | 0.88 | No |
| Raven | Conditional | 0.73 | No |
This vote would be halted. Despite six of seven specialists approving, Crawdad raised a concern flag about the external API integration — specifically, that the dashboard would make outbound calls to a third-party service, creating a data exfiltration vector. That single concern flag stops the deployment until the concern is addressed.
The overall confidence is calculated as the weighted mean of individual confidence scores. Agreement is measured as the proportion of specialists sharing the majority position. In this case: confidence 0.85, agreement 0.71 (5 approve, 2 conditional, 0 reject). High confidence, moderate agreement, but the concern flag overrides everything.
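The aggregation logic is simple enough to show directly. This sketch assumes equal specialist weights (which reproduces the 0.85 / 0.71 figures from the dashboard vote above); the actual weighting scheme is not specified in detail here.

```python
def tally(votes):
    """Aggregate a council vote.
    votes: list of (position, confidence, concern_flag) tuples."""
    positions = [p for p, _, _ in votes]
    majority = max(set(positions), key=positions.count)
    # Mean confidence, equal weights assumed for this sketch.
    confidence = sum(c for _, c, _ in votes) / len(votes)
    # Agreement: proportion sharing the majority position.
    agreement = positions.count(majority) / len(votes)
    # A single concern flag halts the decision, regardless of the tally.
    halted = any(flag for _, _, flag in votes)
    return majority, round(confidence, 2), round(agreement, 2), halted

# The dashboard vote from the table above:
votes = [("conditional", 0.82, True), ("approve", 0.91, False),
         ("approve", 0.78, False), ("approve", 0.95, False),
         ("approve", 0.86, False), ("approve", 0.88, False),
         ("conditional", 0.73, False)]
# -> majority "approve", confidence 0.85, agreement 0.71, halted True
```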
This is by design, not a bug. A council where security concerns can be outvoted by enthusiasm is not a council — it is a rubber stamp with extra steps.
The Orthogonality Crisis
Here is the dirty secret of running seven specialists on the same base model: they converge. They converge hard.
We discovered this about three months in, when we started measuring the embedding similarity between specialist outputs. We took each specialist's rationale, embedded it, and computed pairwise cosine similarity across all 21 specialist pairs. The average similarity was 0.91.
That number should alarm you. A cosine similarity of 0.91 means the specialists were saying essentially the same thing in slightly different words. "I approve because it improves monitoring" from Eagle Eye and "I approve because the architecture is sound" from Gecko were, at the embedding level, nearly identical statements with different vocabulary painted over the top.
Seven perspectives that all think the same way are not seven perspectives. They are one perspective wearing seven hats. The council was providing the illusion of diverse evaluation while producing monoculture outputs. We had built, with considerable effort, an elaborate mechanism for agreeing with ourselves.
This was the hardest problem we faced. Harder than any technical feature, harder than performance optimization, harder than scaling inference. Because it was invisible unless you went looking for it. The votes looked diverse — different positions, different confidence levels, different rationales. It was only when you measured the actual semantic content that the convergence became obvious.
The same property that makes a large language model useful — coherent, consistent reasoning from a unified knowledge base — is exactly what makes it a poor foundation for diverse perspectives. Consistency is the enemy of diversity. The model wants to converge. You have to fight it.
Council Reform
We attacked the convergence problem from four directions simultaneously. No single fix was sufficient. Together, they moved the needle from performative diversity to genuine disagreement.
Randomized Synthesis Ordering
The order in which specialist outputs are synthesized affects the final result. If Gecko always speaks first, later specialists are implicitly anchored to Gecko's framing. We randomize the synthesis order for every vote. This is a small change with outsized impact — it breaks the positional bias that creeps in when one voice consistently sets the frame.
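The mechanism itself is tiny, which is part of the point. A sketch, with the per-vote seed an assumption added here so that any given vote's ordering can be reproduced for audit:

```python
import random

def synthesis_order(specialists, vote_seed=None):
    """Return a fresh ordering for this vote's synthesis pass.
    Shuffling per vote breaks the positional anchoring that creeps in
    when the same specialist always sets the frame first."""
    rng = random.Random(vote_seed)  # seeded per vote for reproducibility
    order = list(specialists)
    rng.shuffle(order)
    return order
```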
Embedding Diversity Diagnostic
We built a diagnostic that runs after every council vote. It embeds each specialist's output, computes the full pairwise cosine similarity matrix, and flags votes where the average similarity exceeds a threshold. When diversity drops too low, the vote is flagged for review and the system logs which specialist pairs are converging. Over time, this gives us a map of which perspectives are collapsing into each other.
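For seven specialists there are C(7,2) = 21 pairs. A minimal version of the diagnostic might look like this; the 0.85 threshold is an assumption for illustration, since the article does not state the exact cutoff:

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def diversity_report(embeddings, threshold=0.85):
    """embeddings: {specialist_name: vector of that vote's rationale}.
    Computes all pairwise similarities, flags low-diversity votes, and
    reports which specialist pairs are converging."""
    sims = {(a, b): cosine(embeddings[a], embeddings[b])
            for a, b in combinations(embeddings, 2)}
    avg = sum(sims.values()) / len(sims)
    converging = sorted(pair for pair, s in sims.items() if s > threshold)
    return {"avg_similarity": avg,
            "flagged": avg > threshold,      # vote-level review trigger
            "converging_pairs": converging}  # the long-term convergence map
```

Logging `converging_pairs` over many votes is what builds the map of which perspectives are collapsing into each other.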
Orthogonal Prompt Engineering
This was the biggest lever. The original system prompts told each specialist what to focus on but not how to reason. Crawdad was told to focus on security, but it reasoned about security the same way Gecko reasoned about architecture — through the same general-purpose analytical framework the base model defaults to.
We rewrote the prompts to emphasize different reasoning frameworks, not just different topics:
- Crawdad thinks in attack trees — every proposal is a system to be broken, every feature is an attack surface
- Turtle thinks in generational impact chains — what does this look like in 7, 50, 175 years?
- Raven thinks in game theory — what are the incentives, who benefits, what are the second-order effects?
- Eagle Eye thinks in failure modes — what breaks first, what is the blast radius, what do we lose visibility on?
- Spider thinks in cultural coherence — does this strengthen or erode the values the community is built on?
Different reasoning frameworks produce genuinely different outputs even when the conclusion is the same. An approval from Crawdad reads like a penetration test report. An approval from Turtle reads like a sustainability assessment. The structure of the thinking diverges even when the conclusion converges.
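In practice this amounts to leading each system prompt with a framework instruction rather than a topic assignment. The wording below is illustrative, not the production prompts:

```python
# Hypothetical prompt fragments: each specialist gets a reasoning
# *framework*, not just a domain to attend to.
REASONING_FRAMES = {
    "Crawdad": ("Reason in attack trees. Treat every proposal as a system "
                "to be broken; enumerate attack surfaces before judging."),
    "Turtle": ("Reason in generational impact chains. Trace consequences "
               "at 7, 50, and 175 years before stating a position."),
    "Raven": ("Reason in game theory. Map incentives, beneficiaries, and "
              "second-order effects before evaluating the proposal."),
    "Eagle Eye": ("Reason in failure modes. Identify what breaks first, "
                  "the blast radius, and what visibility is lost."),
    "Spider": ("Reason in cultural coherence. Ask whether the proposal "
               "strengthens or erodes the community's founding values."),
}

def build_system_prompt(specialist: str, domain_prompt: str) -> str:
    # The framework instruction leads, so it shapes *how* the model
    # reasons, not merely what it pays attention to.
    return f"{REASONING_FRAMES[specialist]}\n\n{domain_prompt}"
```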
Adversarial Seeding
For a randomly selected subset of votes, one specialist receives an additional instruction to argue against the emerging consensus. This is not a permanent assignment — the devil's advocate role rotates. The specialist still forms their genuine assessment, but is additionally prompted to identify the strongest counterargument to the majority position and present it with full conviction.
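A sketch of the rotation logic. The 20% selection rate and the instruction wording are assumptions; the article states only that the subset is random and the role rotates. Seeding the RNG with the vote identifier makes the assignment reproducible for audit.

```python
import random

def seed_devils_advocate(specialists, vote_id, probability=0.2):
    """For a random subset of votes, pick one rotating specialist to argue
    against the emerging consensus. Returns None for unseeded votes."""
    rng = random.Random(vote_id)  # deterministic per vote
    if rng.random() >= probability:
        return None  # most votes run without adversarial seeding
    advocate = rng.choice(list(specialists))
    extra_instruction = (
        "In addition to your genuine assessment, identify the strongest "
        "counterargument to the emerging majority position and present it "
        "with full conviction.")
    return advocate, extra_instruction
```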
The results were measurable:
| Metric | Before Reform | After Reform | Change |
|---|---|---|---|
| Avg pairwise cosine similarity | 0.91 | 0.67 | -26% |
| Min pairwise similarity | 0.84 | 0.41 | -51% |
| Concern flags raised per 100 votes | 3.2 | 11.7 | +266% |
| Votes with unanimous approval | 68% | 31% | -54% |
The drop in unanimous approval is the number we are most proud of. A council that unanimously approves 68% of proposals is not deliberating — it is rubber-stamping. At 31%, the specialists are genuinely disagreeing, and those disagreements are producing better outcomes. Proposals that survive a contentious council vote are more robust than proposals that sail through unquestioned.
Sacred Patterns
Not everything is up for a vote. Some principles are constitutional. The council can deliberate on operational matters — which feature to build, how to architect a service, when to deploy an update — but certain foundational commitments are encoded as immutable constraints that specialists must reference but cannot override.
- The Two Wolves principle: every decision has both a constructive and a destructive interpretation. Both must be explicitly articulated before a vote is valid. If a specialist cannot name the destructive potential of a proposal they are approving, their approval is incomplete.
- The Seven Generations test: decisions must be evaluated against a 175-year horizon. This is not a metaphor. Turtle's system prompt requires explicit reasoning about multi-generational impact, and proposals that optimize for the short term at the expense of long-term sustainability are flagged.
- Data sovereignty: no data leaves the network. This is enforced at the infrastructure level, but it is also a sacred pattern that Crawdad validates on every proposal involving external integrations. The monitoring dashboard example above was halted precisely because of this principle.
- Community benefit requirement: every technical decision must have a legible connection to community benefit. Technology for its own sake is not sufficient justification. Spider evaluates this dimension on every vote.
Sacred patterns function as constitutional law in a democratic system. The council has broad authority over operational decisions, but it cannot vote away its own foundational principles. This is a deliberate constraint on the council's power — one that prevents the kind of value drift that occurs when optimization pressure is applied without structural guardrails.
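Constitutionally, this means the sacred-pattern checks run outside the tally and cannot be outvoted. A sketch under assumed names and an assumed proposal shape; the real encoding of these constraints is not shown in the article:

```python
# Sacred patterns as pre-vote validators: any violation invalidates the
# vote regardless of how the council would have decided.
SACRED_PATTERNS = {
    # Two Wolves: both readings must be articulated before a vote is valid.
    "two_wolves": lambda p: bool(p.get("constructive_reading"))
                            and bool(p.get("destructive_reading")),
    # Data sovereignty: no data leaves the network.
    "data_sovereignty": lambda p: not p.get("external_data_egress", False),
}

def constitutional_check(proposal: dict) -> list[str]:
    """Return the names of any sacred patterns the proposal violates."""
    return [name for name, ok in SACRED_PATTERNS.items() if not ok(proposal)]
```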
The Numbers
All of this runs on a single consumer GPU. One 72B language model, seven system prompts, parallel inference. The council does not need seven models — it needs seven perspectives. Each specialist is a different lens applied to the same underlying capability. The compute cost is roughly 7x that of a single inference call, which on consumer hardware means a council vote completes in about 45 seconds. For decisions that affect security, architecture, and cultural alignment simultaneously, 45 seconds of deliberation is a bargain.
The 0.85+ average confidence tells us the specialists are generally operating within their competence. When confidence drops below 0.7, it usually means the proposal is outside the council's domain expertise or is genuinely ambiguous — both useful signals. Low-confidence votes get additional human review.
The 8,600+ votes represent every non-trivial decision the federation has made over the past several months. Configuration changes, security policy updates, new service deployments, architectural decisions, cultural alignment reviews. Every one deliberated, every one recorded, every one traceable. The thermal memory system stores every vote, every rationale, every concern flag. We can audit any decision back to the individual specialist arguments that produced it.
What We'd Do Differently
Honesty requires this section. The council works, but we built it iteratively, and some things we got wrong early took real effort to fix later. If we were starting from scratch:
Diversity measurement from day one. We added embedding similarity monitoring months after deploying the council, only after noticing that votes felt suspiciously unanimous. The convergence problem was invisible without measurement, and we ran for months with a council that was providing the illusion of deliberation. If you are building a multi-perspective system, measure perspective diversity from the first vote. Do not wait until the monoculture is entrenched.
A minority report feature. When one specialist disagrees with the majority, that disagreement should be stored, tracked, and revisited. Currently, minority positions are recorded in the vote log but not systematically reviewed. We want a system that asks, after 30 days: "Crawdad objected to this deployment. Were they right?" Sometimes the minority is right. Without tracking, you never learn which dissents were prescient and which were noise.
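The proposed feature is small enough to sketch. Everything here is hypothetical, since the article describes it as unbuilt; the shape is just a dissent log with a 30-day review window:

```python
import datetime

def record_dissent(log, vote_id, specialist, rationale, today):
    """Store a minority position with a scheduled revisit date."""
    log.append({"vote_id": vote_id, "specialist": specialist,
                "rationale": rationale,
                "review_after": today + datetime.timedelta(days=30),
                "reviewed": False})

def dissents_due(log, today):
    """Dissents whose 30-day window has elapsed and are still unreviewed.
    Each is a prompt for the question: was the minority right?"""
    return [d for d in log if not d["reviewed"] and today >= d["review_after"]]
```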
Temporal diversity. Right now, all seven specialists evaluate a proposal at the same time horizon. But some decisions look good at 6 months and terrible at 5 years, or vice versa. We would add explicit temporal framing: "Evaluate this decision assuming it must last 6 months. Now 5 years. Now 50 years." Different time horizons produce different risk assessments, and the divergence between short-term and long-term evaluations is itself a signal worth capturing.
Graduated concern severity. The concern flag is binary — raised or not raised. This works, but it flattens the difference between "this has a minor issue that should be addressed" and "this will cause a security incident." We would add severity levels: advisory (log it, proceed with caution), blocking (halt until addressed), and critical (halt and escalate to human review). The binary flag was the right starting point — simple, hard to game — but a production system needs more granularity.
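The three proposed levels order naturally, which makes the halt rule a one-liner. A sketch of the unbuilt feature, with names taken from the descriptions above:

```python
from enum import Enum

class Severity(Enum):
    """Proposed graduated concern levels replacing the binary flag."""
    ADVISORY = 1   # log it, proceed with caution
    BLOCKING = 2   # halt until the concern is addressed
    CRITICAL = 3   # halt and escalate to human review

def decision_halted(concerns):
    """A vote halts if any raised concern is blocking or worse;
    advisory concerns are logged but do not stop the decision."""
    return any(c.value >= Severity.BLOCKING.value for c in concerns)
```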
The council does not eliminate bad decisions. It makes unilateral bad decisions structurally impossible.
Cherokee AI Federation · Built on consumer hardware · No cloud · No compromise