800 Autonomous Tasks: How Our AI Agents Write Their Own Code

Task queues, bidding, SEARCH/REPLACE execution, and what happens when autonomous agents fail

February 2026 · Cherokee AI Federation · ~13 min read

The Problem

We had a bottleneck, and it was the smartest node in the system.

Our technical program manager — itself an AI agent running a 72B language model — was writing all the code. Every feature, every bug fix, every configuration change, every service file. The architect was laying every brick. It doesn't scale. You can't have the agent responsible for system-wide coordination also be the one writing individual Python functions at 2 AM.

The question we needed to answer: can you trust AI agents to write code autonomously, without a human reviewing every line?

The answer is yes, with the right guardrails. We've now completed over 800 autonomous coding tasks across an 8-node federation running on consumer hardware. Jr agents — specialized worker agents with narrower scope than the TPM — receive instructions, bid on work, execute code changes, and report results. The TPM writes the plan. The Jrs write the code. Here's how the whole thing works.

The Architecture

The Jr executor pipeline has five stages. Each one exists because we learned the hard way what happens without it.

Task Queue. A PostgreSQL-backed work queue with priority levels from 1 to 10. Tasks come from three sources: the TPM writing explicit instructions, the council (a panel of specialist AI agents) voting on needed work, and automated systems detecting issues that need fixes. Every task has a unique ID, a priority, a source, and a parent task reference for traceability.

Bidding. Multiple Jr agents can bid on a task. The bidding system considers agent capability (some Jrs are better at certain file types or domains), current workload (don't overload one agent while another is idle), and task requirements. This isn't round-robin — it's allocation with context.

Instruction Files. Each task gets a detailed instruction file with explicit SEARCH/REPLACE blocks or Create blocks. No ambiguity. No "rewrite this module to be better." The instruction specifies the exact strings to find, the exact strings to replace them with, and the exact files to touch. More on why this matters in the next section.

Execution. The executor processes instructions step by step with checkpointing. If step 3 of 5 fails, steps 1 and 2 are preserved and the failure is recorded with full context.

Review. Results are logged to the database. Successes update the task status. Failures route to a dead letter queue with the error, the step that failed, and the state of the file at the time of failure.

  TPM / Council / Automated System
              |
              v
      [ Instruction File ]
              |
              v
      [ Work Queue (PostgreSQL) ]
              |
              v
      [ Bidding & Assignment ]
              |
              v
      [ Executor: Step-by-Step ]
         |         |
         v         v
   [ Checkpoint ] [ Failure? ]
         |              |
         v              v
   [ Complete ]   [ Dead Letter Queue ]
                        |
                        v
                  [ Triage & Retry ]

The entire pipeline is auditable. Every task, every bid, every execution step, every failure — it's all in the database. When something goes wrong at 3 AM, we don't wonder what happened. We query it.

Why SEARCH/REPLACE

This was the most important architectural decision in the whole system, and it's deceptively simple.

Most AI code generation produces entire files. You ask the model to add a function, and it regenerates the whole file with the new function included. This is dangerous at scale for three reasons:

A 500-line file gets regenerated with a 3-line change, and the other 497 lines might be subtly wrong. A renamed variable here, a dropped import there, a slightly different conditional that passes tests but changes behavior.
Context window limits mean the model often can't even see the full file it's regenerating. You get a plausible-looking approximation of code the model has never fully read.
No diff. When the entire file is new, you can't tell what actually changed without running a line-by-line comparison against the original. The signal-to-noise ratio is terrible.

SEARCH/REPLACE is surgical. You specify the exact lines to find and the exact replacement. If the search string doesn't match — because the file changed, or the instruction was wrong, or the target code was already modified — the operation fails safely. No silent corruption. No partial writes. The file is either unchanged or correctly updated. There is no in-between.

File: `services/health_monitor.py`

<<<<<<< SEARCH
def check_service_status(service_name):
    result = subprocess.run(["systemctl", "is-active", service_name],
                          capture_output=True, text=True)
    return result.stdout.strip() == "active"
=======
def check_service_status(service_name, timeout=10):
    try:
        result = subprocess.run(["systemctl", "is-active", service_name],
                              capture_output=True, text=True, timeout=timeout)
        return result.stdout.strip() == "active"
    except subprocess.TimeoutExpired:
        log.warning(f"Service check timed out: {service_name}")
        return False
>>>>>>> REPLACE

The instruction is unambiguous. The executor finds the exact search string, replaces it with the exact replacement, and moves on. If check_service_status had already been modified by another task, the search string wouldn't match, the step would fail, and the task would route to the dead letter queue. This is the correct behavior. A failed task you know about is infinitely better than a corrupted file you don't.

The 50% Guardrail

This one saved us. More than once.

A Jr task was supposed to add a function to a critical file — one of the core modules that routes requests across the entire federation. The instruction was supposed to use SEARCH/REPLACE, but due to a bug in an early version of the instruction generator, it produced a Create block instead. The Create block contained the new function and about 40% of the original file, reconstructed from the model's memory of what it had read.

Without guardrails, this would have replaced a working production module with an incomplete approximation. Half the routing logic, gone. Import statements, gone. Error handling, gone.

The guardrail is simple: if a write operation would reduce a file's size by more than 50%, it's blocked. The executor logs the attempt, marks the task as failed, records the would-be output for debugging, and routes the whole thing to the dead letter queue. The original file is untouched.

This rule has fired multiple times since we deployed it. Every time, it was correct. Every time, the alternative was data loss.

Design Principle

The best guardrail is the one you never have to think about. It just quietly prevents disasters while you sleep.

The Dead Letter Queue

Tasks fail. Early on, about 15% of the time. Now, under 5%. But they still fail, and how you handle failure determines whether the system learns or just repeats its mistakes.

The dead letter queue is not a graveyard — it's a triage system. Every failed task lands there with its full context: the instruction, the step that failed, the error message, the state of the target files. A triage process categorizes failures by root cause, and each root cause leads to a systemic fix.

Root Cause	Frequency	Systemic Fix
Stale SEARCH strings	~35%	Read target files immediately before generating instructions, not hours before
Missing imports in generated code	~20%	Post-execution syntax validation step added to pipeline
Mixed step types in one instruction	~15%	Split Create and SEARCH/REPLACE into separate instruction files
Rate limiting on upstream APIs	~12%	Increased rate limits, added exponential backoff in executor
File path resolution errors	~10%	Relative-to-absolute path resolution in executor validation
50% guardrail triggered	~5%	Better instruction generator (favors SR blocks over Create for existing files)
Other / edge cases	~3%	Cataloged individually, patterns extracted when recurring

The most common failure — stale SEARCH strings — is also the most instructive. It happens when the TPM reads a file, writes an instruction targeting specific lines, and by the time the Jr executes the instruction, another task has already modified those lines. The search string no longer matches. The fix was simple: read files as close to execution time as possible. But understanding why it was the top failure required DLQ analysis. You can't fix what you can't see.

Recursive Decomposition

When a task fails after retries, the system doesn't just give up. It decomposes.

Consider a task with five steps. Steps 1 and 2 succeed. Step 3 fails. Steps 4 and 5 never execute. A naive system would mark the whole task as failed and move on. Our system does something smarter:

Steps 1–2: Already completed. Their changes are preserved via checkpointing.
Step 3: Isolated into its own sub-task with additional context about why it failed. Retried independently.
Steps 4–5: Queued as a new sub-task, blocked on step 3's completion.

The decomposer creates parent-child task chains, so you can trace any sub-task back to the original instruction that spawned it. This is critical for debugging — when you see a small two-step task in the queue, you need to know it's actually step 3 of a larger five-step task that partially failed.

Maximum recursion depth is 3. If a sub-task of a sub-task of a sub-task still fails, the system stops and escalates. This prevents infinite decomposition loops where a fundamentally broken instruction gets split into smaller and smaller pieces, each one failing for the same underlying reason. Three levels deep is enough to handle genuine multi-step complexity. Beyond that, the instruction itself needs to be rewritten.

Sacred Pattern Protection

Some code is too important for autonomous agents to touch.

The council constitution, security configurations, authentication modules, identity management, firewall rules — these are marked as sacred patterns in the system. Jr agents can read them for context. They cannot modify them. Any instruction that targets a sacred file is rejected at the bidding stage, before execution ever begins.

This isn't about trust. It's about blast radius.

An autonomous agent making a mistake in a utility function is a bug. You catch it in testing, fix it, move on. An autonomous agent making a mistake in the authentication system is a security incident. An autonomous agent modifying firewall rules is a potential breach. The consequences are categorically different, so the permissions must be categorically different.

Sacred patterns are maintained by the TPM or by direct human intervention. The council can vote to modify them, but the modification itself requires elevated privileges. This is separation of concerns applied to AI agency — the same principle that keeps production database credentials out of application code, extended to the agents themselves.

The Numbers

800+

Tasks Completed

<5%

Failure Rate

Max Recursion Depth

Root Causes Cataloged

The failure rate has dropped from roughly 15% in the first month to under 5% now. That improvement didn't come from making the agents smarter. It came from making the system around them more resilient. Better guardrails, better instruction formats, better failure handling. The agents are the same. The infrastructure learned.

Lessons Learned

Honesty section. These are the mistakes that taught us the most.

"Create" Appends, Not Replaces

When an instruction says to create a file that already exists, the executor appends the new content below the existing code. It does not replace the file. The result: two scripts concatenated into one file, with two if __name__ == "__main__" blocks. The first one wins. The second script is dead code that you might not notice for days. We found this the hard way when a monitoring daemon started silently ignoring its new configuration because the old entry point was still sitting above the new one.

SR-FIRST Extraction

The executor has two scanners: one for SEARCH/REPLACE blocks and one for generic code blocks (to write as files). Early on, the code block scanner ran first. It would see a SEARCH/REPLACE block, not recognize the special delimiters, and treat it as a regular code block to be written to a file. The SEARCH/REPLACE scanner would then find nothing to process.

The fix: extract SEARCH/REPLACE blocks first, record their character ranges, and tell the code block scanner to skip any overlapping ranges. Simple, but it took a dozen failed tasks to diagnose because the symptom — "file written with strange content" — didn't obviously point at a scanner ordering problem.

Never Put Bash Blocks in Instruction Files

The executor auto-runs any bash code block it encounters. This is by design — some instructions need to run shell commands (restart a service, create a directory, set permissions). But it means that an instruction with a "manual steps" section in a bash-tagged code block will execute those steps. An instruction that said "after deploying, you can test with: curl localhost:8080/health" would actually run that curl command during execution. We switched to plain text blocks for any human-readable instructions.

One Step Per Instruction

When an instruction file contains both Create blocks and SEARCH/REPLACE blocks, the executor processes only the first Create block and silently skips the SEARCH/REPLACE blocks. The two step types use different code paths, and the early-exit logic in the Create path doesn't hand off to the SR path. The workaround: split mixed instructions into separate files, one per step type. More files, but each one executes completely.

Reflection

Every bug in the executor taught us something about the gap between "what you meant" and "what the machine understood." That gap never fully closes. You just build better guardrails around it.

800 tasks completed. Not because the agents are perfect — because the system assumes they aren't.

Cherokee AI Federation · Built on consumer hardware · No cloud · No compromise