Building a 7-Specialist AI Council on Consumer Hardware
Seven perspectives, one GPU, zero unilateral decisions
The Problem
Most AI deployments follow the same pattern: one model, one system prompt, one perspective. You give the model a role — "you are a helpful assistant" or "you are a security analyst" — and it does its best impression of that role. For simple tasks, this works fine. For decisions that affect infrastructure, security, cultural alignment, and long-term strategy simultaneously, it is reckless.
A single-agent system has exactly one set of blind spots. It cannot argue with itself in any meaningful way. It will not raise a concern flag about its own recommendation. When it says "this looks safe," there is no second opinion, no dissent, no one in the room asking "what happens if you're wrong?"
We needed disagreement. Not the performative kind where you ask one model to "consider the counterarguments." Real structural disagreement, where an approval requires multiple independent evaluations and a single credible objection can halt the entire process. We needed a system where the security perspective could override the performance perspective, where long-term cultural impact could block a short-term optimization, and where no single viewpoint could push a decision through unchallenged.
So we built a council.
The Council
Seven specialists, each with a distinct domain and a distinct way of thinking about problems. They are not seven different models. They are seven different system prompts running against the same 72B language model on a single consumer GPU. The diversity comes from prompt engineering, not from model diversity.
| Specialist | Domain | Focus |
|---|---|---|
| Crawdad | Security | Attack surface analysis, credential exposure, operational security, threat modeling |
| Gecko | Technical Integration | Performance, compatibility, architecture fit, dependency chains |
| Turtle | Seven Generations | Will this still make sense in 175 years? Long-arc sustainability and wisdom |
| Eagle Eye | Observability | If we cannot see it, we cannot trust it. Monitoring, logging, alerting coverage |
| Spider | Cultural Integration | Alignment with community values, cultural coherence, tradition compatibility |
| Peace Chief | Democratic Coordination | Consensus building, conflict resolution, procedural fairness |
| Raven | Strategic Planning | Long-term positioning, resource allocation, risk assessment, game theory |
Each specialist receives the same proposal. Each evaluates it independently through their domain lens. Their responses are collected, synthesized, and voted on before any action is taken. No specialist can see another's response during evaluation — they form their opinions in isolation, then the synthesis layer brings them together.
This is not a committee that discusses until everyone agrees. It is a parallel evaluation where disagreement is preserved and surfaced, not smoothed over.
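The parallel, isolated evaluation can be sketched as follows. This is a minimal illustration, not the production orchestrator: the specialist names come from the table above, but `evaluate` is a stand-in for an inference call and its return shape is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

SPECIALISTS = ["Crawdad", "Gecko", "Turtle", "Eagle Eye",
               "Spider", "Peace Chief", "Raven"]

def evaluate(specialist: str, proposal: str) -> dict:
    """Stand-in for one inference call using that specialist's system prompt.
    Each call sees only its own prompt and the proposal text, never another
    specialist's response."""
    return {"specialist": specialist, "position": "approve",
            "confidence": 0.9, "concern": False, "rationale": "..."}

def council_vote(proposal: str) -> list[dict]:
    # Evaluations run in parallel and in isolation; the synthesis layer
    # only sees the responses after all of them are collected.
    with ThreadPoolExecutor(max_workers=len(SPECIALISTS)) as pool:
        futures = [pool.submit(evaluate, s, proposal) for s in SPECIALISTS]
        return [f.result() for f in futures]

votes = council_vote("Deploy new monitoring dashboard")
```

The isolation matters more than the parallelism: collecting responses before any synthesis is what prevents one specialist's framing from leaking into another's evaluation.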
How Voting Works
Each specialist returns a structured vote with four components:
- Position: approve, reject, or conditional (approve with specific requirements)
- Confidence: a score from 0.0 to 1.0 indicating how certain the specialist is in their assessment
- Concern flag: a boolean — if any specialist raises a concern flag, the entire decision is halted for review
- Rationale: a written explanation of the specialist's reasoning
Here is what a real vote looks like. Suppose the proposal is "Deploy new monitoring dashboard with external API integration":
| Specialist | Position | Confidence | Concern |
|---|---|---|---|
| Crawdad | Conditional | 0.82 | Yes |
| Gecko | Approve | 0.91 | No |
| Turtle | Approve | 0.78 | No |
| Eagle Eye | Approve | 0.95 | No |
| Spider | Approve | 0.86 | No |
| Peace Chief | Approve | 0.88 | No |
| Raven | Conditional | 0.73 | No |
This vote would be halted. Despite six of seven specialists approving, Crawdad raised a concern flag about the external API integration — specifically, that the dashboard would make outbound calls to a third-party service, creating a data exfiltration vector. That single concern flag stops the deployment until the concern is addressed.
The overall confidence is calculated as the weighted mean of individual confidence scores. Agreement is measured as the proportion of specialists sharing the majority position. In this case: confidence 0.85, agreement 0.71 (5 approve, 2 conditional, 0 reject). High confidence, moderate agreement, but the concern flag overrides everything.
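The aggregation logic is simple enough to show directly. This sketch assumes equal specialist weights (which reproduces the 0.85 / 0.71 figures from the dashboard vote above); the actual weighting scheme is not specified in detail here.

```python
def tally(votes):
    """Aggregate a council vote.
    votes: list of (position, confidence, concern_flag) tuples."""
    positions = [p for p, _, _ in votes]
    majority = max(set(positions), key=positions.count)
    # Mean confidence, equal weights assumed for this sketch.
    confidence = sum(c for _, c, _ in votes) / len(votes)
    # Agreement: proportion sharing the majority position.
    agreement = positions.count(majority) / len(votes)
    # A single concern flag halts the decision, regardless of the tally.
    halted = any(flag for _, _, flag in votes)
    return majority, round(confidence, 2), round(agreement, 2), halted

# The dashboard vote from the table above:
votes = [("conditional", 0.82, True), ("approve", 0.91, False),
         ("approve", 0.78, False), ("approve", 0.95, False),
         ("approve", 0.86, False), ("approve", 0.88, False),
         ("conditional", 0.73, False)]
# -> majority "approve", confidence 0.85, agreement 0.71, halted True
```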
This is by design, not a bug. A council where security concerns can be outvoted by enthusiasm is not a council — it is a rubber stamp with extra steps.
The Orthogonality Crisis
Here is the dirty secret of running seven specialists on the same base model: they converge. They converge hard.
We discovered this about three months in, when we started measuring the embedding similarity between specialist outputs. We took each specialist's rationale, embedded it, and computed pairwise cosine similarity across all 21 specialist pairs. The average similarity was 0.91.
That number should alarm you. A cosine similarity of 0.91 means the specialists were saying essentially the same thing in slightly different words. "I approve because it improves monitoring" from Eagle Eye and "I approve because the architecture is sound" from Gecko were, at the embedding level, nearly identical statements with different vocabulary painted over the top.
Seven perspectives that all think the same way are not seven perspectives. They are one perspective wearing seven hats. The council was providing the illusion of diverse evaluation while producing monoculture outputs. We had built, with considerable effort, an elaborate mechanism for agreeing with ourselves.
This was the hardest problem we faced. Harder than any technical feature, harder than performance optimization, harder than scaling inference. Because it was invisible unless you went looking for it. The votes looked diverse — different positions, different confidence levels, different rationales. It was only when you measured the actual semantic content that the convergence became obvious.
The same property that makes a large language model useful — coherent, consistent reasoning from a unified knowledge base — is exactly what makes it a poor foundation for diverse perspectives. Consistency is the enemy of diversity. The model wants to converge. You have to fight it.
Council Reform
We attacked the convergence problem from four directions simultaneously. No single fix was sufficient. Together, they moved the needle from performative diversity to genuine disagreement.
Randomized Synthesis Ordering
The order in which specialist outputs are synthesized affects the final result. If Gecko always speaks first, later specialists are implicitly anchored to Gecko's framing. We randomize the synthesis order for every vote. This is a small change with outsized impact — it breaks the positional bias that creeps in when one voice consistently sets the frame.
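The mechanism itself is tiny, which is part of the point. A sketch, with the per-vote seed an assumption added here so that any given vote's ordering can be reproduced for audit:

```python
import random

def synthesis_order(specialists, vote_seed=None):
    """Return a fresh ordering for this vote's synthesis pass.
    Shuffling per vote breaks the positional anchoring that creeps in
    when the same specialist always sets the frame first."""
    rng = random.Random(vote_seed)  # seeded per vote for reproducibility
    order = list(specialists)
    rng.shuffle(order)
    return order
```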
Embedding Diversity Diagnostic
We built a diagnostic that runs after every council vote. It embeds each specialist's output, computes the full pairwise cosine similarity matrix, and flags votes where the average similarity exceeds a threshold. When diversity drops too low, the vote is flagged for review and the system logs which specialist pairs are converging. Over time, this gives us a map of which perspectives are collapsing into each other.
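For seven specialists there are C(7,2) = 21 pairs. A minimal version of the diagnostic might look like this; the 0.85 threshold is an assumption for illustration, since the article does not state the exact cutoff:

```python
from itertools import combinations
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def diversity_report(embeddings, threshold=0.85):
    """embeddings: {specialist_name: vector of that vote's rationale}.
    Computes all pairwise similarities, flags low-diversity votes, and
    reports which specialist pairs are converging."""
    sims = {(a, b): cosine(embeddings[a], embeddings[b])
            for a, b in combinations(embeddings, 2)}
    avg = sum(sims.values()) / len(sims)
    converging = sorted(pair for pair, s in sims.items() if s > threshold)
    return {"avg_similarity": avg,
            "flagged": avg > threshold,      # vote-level review trigger
            "converging_pairs": converging}  # the long-term convergence map
```

Logging `converging_pairs` over many votes is what builds the map of which perspectives are collapsing into each other.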
Orthogonal Prompt Engineering
This was the biggest lever. The original system prompts told each specialist what to focus on but not how to reason. Crawdad was told to focus on security, but it reasoned about security the same way Gecko reasoned about architecture — through the same general-purpose analytical framework the base model defaults to.
We rewrote the prompts to emphasize different reasoning frameworks, not just different topics:
- Crawdad thinks in attack trees — every proposal is a system to be broken, every feature is an attack surface
- Turtle thinks in generational impact chains — what does this look like in 7, 50, 175 years?
- Raven thinks in game theory — what are the incentives, who benefits, what are the second-order effects?
- Eagle Eye thinks in failure modes — what breaks first, what is the blast radius, what do we lose visibility on?
- Spider thinks in cultural coherence — does this strengthen or erode the values the community is built on?
Different reasoning frameworks produce genuinely different outputs even when the conclusion is the same. An approval from Crawdad reads like a penetration test report. An approval from Turtle reads like a sustainability assessment. The structure of the thinking diverges even when the conclusion converges.
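In practice this amounts to leading each system prompt with a framework instruction rather than a topic assignment. The wording below is illustrative, not the production prompts:

```python
# Hypothetical prompt fragments: each specialist gets a reasoning
# *framework*, not just a domain to attend to.
REASONING_FRAMES = {
    "Crawdad": ("Reason in attack trees. Treat every proposal as a system "
                "to be broken; enumerate attack surfaces before judging."),
    "Turtle": ("Reason in generational impact chains. Trace consequences "
               "at 7, 50, and 175 years before stating a position."),
    "Raven": ("Reason in game theory. Map incentives, beneficiaries, and "
              "second-order effects before evaluating the proposal."),
    "Eagle Eye": ("Reason in failure modes. Identify what breaks first, "
                  "the blast radius, and what visibility is lost."),
    "Spider": ("Reason in cultural coherence. Ask whether the proposal "
               "strengthens or erodes the community's founding values."),
}

def build_system_prompt(specialist: str, domain_prompt: str) -> str:
    # The framework instruction leads, so it shapes *how* the model
    # reasons, not merely what it pays attention to.
    return f"{REASONING_FRAMES[specialist]}\n\n{domain_prompt}"
```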
Adversarial Seeding
For a randomly selected subset of votes, one specialist receives an additional instruction to argue against the emerging consensus. This is not a permanent assignment — the devil's advocate role rotates. The specialist still forms their genuine assessment, but is additionally prompted to identify the strongest counterargument to the majority position and present it with full conviction.
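A sketch of the rotation logic. The 20% selection rate and the instruction wording are assumptions; the article states only that the subset is random and the role rotates. Seeding the RNG with the vote identifier makes the assignment reproducible for audit.

```python
import random

def seed_devils_advocate(specialists, vote_id, probability=0.2):
    """For a random subset of votes, pick one rotating specialist to argue
    against the emerging consensus. Returns None for unseeded votes."""
    rng = random.Random(vote_id)  # deterministic per vote
    if rng.random() >= probability:
        return None  # most votes run without adversarial seeding
    advocate = rng.choice(list(specialists))
    extra_instruction = (
        "In addition to your genuine assessment, identify the strongest "
        "counterargument to the emerging majority position and present it "
        "with full conviction.")
    return advocate, extra_instruction
```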
The results were measurable:
| Metric | Before Reform | After Reform | Change |
|---|---|---|---|
| Avg pairwise cosine similarity | 0.91 | 0.67 | -26% |
| Min pairwise similarity | 0.84 | 0.41 | -51% |
| Concern flags raised per 100 votes | 3.2 | 11.7 | +266% |
| Votes with unanimous approval | 68% | 31% | -54% |
The drop in unanimous approval is the number we are most proud of. A council that unanimously approves 68% of proposals is not deliberating — it is rubber-stamping. At 31%, the specialists are genuinely disagreeing, and those disagreements are producing better outcomes. Proposals that survive a contentious council vote are more robust than proposals that sail through unquestioned.
Sacred Patterns
Not everything is up for a vote. Some principles are constitutional. The council can deliberate on operational matters — which feature to build, how to architect a service, when to deploy an update — but certain foundational commitments are encoded as immutable constraints that specialists must reference but cannot override.
- The Two Wolves principle: every decision has both a constructive and a destructive interpretation. Both must be explicitly articulated before a vote is valid. If a specialist cannot name the destructive potential of a proposal they are approving, their approval is incomplete.
- The Seven Generations test: decisions must be evaluated against a 175-year horizon. This is not a metaphor. Turtle's system prompt requires explicit reasoning about multi-generational impact, and proposals that optimize for the short term at the expense of long-term sustainability are flagged.
- Data sovereignty: no data leaves the network. This is enforced at the infrastructure level, but it is also a sacred pattern that Crawdad validates on every proposal involving external integrations. The monitoring dashboard example above was halted precisely because of this principle.
- Community benefit requirement: every technical decision must have a legible connection to community benefit. Technology for its own sake is not sufficient justification. Spider evaluates this dimension on every vote.
Sacred patterns function as constitutional law in a democratic system. The council has broad authority over operational decisions, but it cannot vote away its own foundational principles. This is a deliberate constraint on the council's power — one that prevents the kind of value drift that occurs when optimization pressure is applied without structural guardrails.
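Constitutionally, this means the sacred-pattern checks run outside the tally and cannot be outvoted. A sketch under assumed names and an assumed proposal shape; the real encoding of these constraints is not shown in the article:

```python
# Sacred patterns as pre-vote validators: any violation invalidates the
# vote regardless of how the council would have decided.
SACRED_PATTERNS = {
    # Two Wolves: both readings must be articulated before a vote is valid.
    "two_wolves": lambda p: bool(p.get("constructive_reading"))
                            and bool(p.get("destructive_reading")),
    # Data sovereignty: no data leaves the network.
    "data_sovereignty": lambda p: not p.get("external_data_egress", False),
}

def constitutional_check(proposal: dict) -> list[str]:
    """Return the names of any sacred patterns the proposal violates."""
    return [name for name, ok in SACRED_PATTERNS.items() if not ok(proposal)]
```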
The Numbers
All of this runs on a single consumer GPU. One 72B language model, seven system prompts, parallel inference. The council does not need seven models — it needs seven perspectives. Each specialist is a different lens applied to the same underlying capability. The compute cost is roughly 7x that of a single inference call, which on consumer hardware means a council vote completes in about 45 seconds. For decisions that affect security, architecture, and cultural alignment simultaneously, 45 seconds of deliberation is a bargain.
The 0.85+ average confidence tells us the specialists are generally operating within their competence. When confidence drops below 0.7, it usually means the proposal is outside the council's domain expertise or is genuinely ambiguous — both useful signals. Low-confidence votes get additional human review.
The 8,600+ votes represent every non-trivial decision the federation has made over the past several months. Configuration changes, security policy updates, new service deployments, architectural decisions, cultural alignment reviews. Every one deliberated, every one recorded, every one traceable. The thermal memory system stores every vote, every rationale, every concern flag. We can audit any decision back to the individual specialist arguments that produced it.
What We'd Do Differently
Honesty requires this section. The council works, but we built it iteratively, and some things we got wrong early took real effort to fix later. If we were starting from scratch:
Diversity measurement from day one. We added embedding similarity monitoring months after deploying the council, only after noticing that votes felt suspiciously unanimous. The convergence problem was invisible without measurement, and we ran for months with a council that was providing the illusion of deliberation. If you are building a multi-perspective system, measure perspective diversity from the first vote. Do not wait until the monoculture is entrenched.
A minority report feature. When one specialist disagrees with the majority, that disagreement should be stored, tracked, and revisited. Currently, minority positions are recorded in the vote log but not systematically reviewed. We want a system that asks, after 30 days: "Crawdad objected to this deployment. Were they right?" Sometimes the minority is right. Without tracking, you never learn which dissents were prescient and which were noise.
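The proposed feature is small enough to sketch. Everything here is hypothetical, since the article describes it as unbuilt; the shape is just a dissent log with a 30-day review window:

```python
import datetime

def record_dissent(log, vote_id, specialist, rationale, today):
    """Store a minority position with a scheduled revisit date."""
    log.append({"vote_id": vote_id, "specialist": specialist,
                "rationale": rationale,
                "review_after": today + datetime.timedelta(days=30),
                "reviewed": False})

def dissents_due(log, today):
    """Dissents whose 30-day window has elapsed and are still unreviewed.
    Each is a prompt for the question: was the minority right?"""
    return [d for d in log if not d["reviewed"] and today >= d["review_after"]]
```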
Temporal diversity. Right now, all seven specialists evaluate a proposal at the same time horizon. But some decisions look good at 6 months and terrible at 5 years, or vice versa. We would add explicit temporal framing: "Evaluate this decision assuming it must last 6 months. Now 5 years. Now 50 years." Different time horizons produce different risk assessments, and the divergence between short-term and long-term evaluations is itself a signal worth capturing.
Graduated concern severity. The concern flag is binary — raised or not raised. This works, but it flattens the difference between "this has a minor issue that should be addressed" and "this will cause a security incident." We would add severity levels: advisory (log it, proceed with caution), blocking (halt until addressed), and critical (halt and escalate to human review). The binary flag was the right starting point — simple, hard to game — but a production system needs more granularity.
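The three proposed levels order naturally, which makes the halt rule a one-liner. A sketch of the unbuilt feature, with names taken from the descriptions above:

```python
from enum import Enum

class Severity(Enum):
    """Proposed graduated concern levels replacing the binary flag."""
    ADVISORY = 1   # log it, proceed with caution
    BLOCKING = 2   # halt until the concern is addressed
    CRITICAL = 3   # halt and escalate to human review

def decision_halted(concerns):
    """A vote halts if any raised concern is blocking or worse;
    advisory concerns are logged but do not stop the decision."""
    return any(c.value >= Severity.BLOCKING.value for c in concerns)
```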
The council does not eliminate bad decisions. It makes unilateral bad decisions structurally impossible.
Cherokee AI Federation · Built on consumer hardware · No cloud · No compromise