Introducing wood-fired-tasks.
Coordination infrastructure for fleets of AI coding agents — the missing primitive between one Claude Code session and ten of them working the same backlog without stepping on each other. MIT, on npm, self-hostable. Here's the origin story, the design, and how I run it.
This week I open-sourced wood-fired-tasks. It is now public at github.com/Wood-Fired-Games/wood-fired-tasks and on npm. MIT-licensed. The release tag is v1.12. It is a self-hostable task-tracking system with three peer interfaces — a REST API, a CLI, and an MCP server — built specifically so AI coding agents can read and write the same backlog as the humans supervising them, without anyone stepping on anyone else.
I have been using it daily for three months and I want to tell you both why it exists and what it can do for you.
Where it came from
For a good bit of my twenty-five years in games I have been a creative director and a game designer, and in both of those roles the actual unit of work was often more about managing Jira tickets than writing code. I have created thousands of them. I have assigned thousands of them. I have reviewed the results of thousands of them. Every shipped game I have ever worked on existed as a column of tickets long before it existed as a build. The ticket is not paperwork around the work; for a creative lead, the ticket is often the shape the work takes.
When AI agents started getting genuinely useful in late 2025, the workflow I wanted to build was obvious to me. Take a ticket out of the tracker. Hand it to an agent. Let the agent do the work. Review the result the same way I would review a teammate’s PR. Close the ticket. The unit of work would not change at all; the thing doing the work would. I spent the back half of 2025 trying to make that exact workflow run against the off-the-shelf trackers I had spent two decades inside — Jira, Trello, Notion, GitHub Issues — and the agents kept stumbling. They would hit the wrong API. They would lose track of which board they were operating on. They would fail to authenticate in ways the official integrations were not built to recover from. They would just decline to take an action without explaining why. They would ignore formatting requirements. The integrations existed. None of them were reliable enough to be load-bearing.
In early 2026 I gave up on bridging to the existing tools and decided to build a tracker that was agent-first from the foundation up. It needed to be CLI-friendly, because by then I was spending more time in terminals than I had since the early 2000s. And because my own day splits between a Windows machine where the front-end game development happens and a Linux box where the backend services and most of the AI experimentation live, the tracker needed to coordinate across machines too — across operating systems, across repos, across the agents themselves.
The first version of that coordination was undignified. I copy-pasted agent output between Slack threads. A Claude session on the Windows side would summarize what it had done; I would paste the summary into a thread on the Linux side and let a second Claude pick it up. The second version was a network file share where both machines could read the same JSON files. The third version was a recognition: a real task tracker, designed for this, was a better place to do these handoffs than any general-purpose messaging or filesystem layer. Tasks already have status. Tasks already have dependencies. Tasks already have the right shape for cross-machine, cross-repo, cross-agent work. They just needed an MCP server and a real claim protocol and the right CLI on top.
That is when the project started to compound. Around the same time I was learning from the GSD orchestration framework I run as a substrate — github.com/gsd-build/get-shit-down — that agentic work can be made systemic, verifiable, and auditable. So I pushed the tracker hard in that direction. Structured verdicts on every closed task. Read-only graders that physically cannot edit the code they are grading. Dependency-aware execution that refuses to dispatch a parallel loop against a cyclic graph. Generator/critic separation enforced by the agent definitions themselves. The result is what I open-sourced today.
The original vision — taking work directly out of an existing Jira or Trello board and dispatching it into agents — is still something I’m interested in exploring but is largely unrealized. What exists today is the substrate that vision will eventually run on top of, and it has been load-bearing enough to ship several thousand commits across the Wood Fired ecosystem in the past several months. That is the wedge the rest of this post unpacks.
What it does
Wood Fired Tasks is a single Node service that exposes three peer surfaces over one SQLite service layer:
A REST API with 47 route handlers (40 live in a default deploy).
A CLI with 31 commands: tasks create, tasks list, tasks claim, the usual shape.
An MCP server with 22 tools, so Claude Code, Cursor, Gemini, and Codex can all read and write tasks directly from inside an agent session.
All three surfaces hit the same data. Anything an agent does over MCP is immediately visible to the CLI; anything you do from the CLI is immediately visible to the agents. There is no “agent view” and no “human view.” There is one project, one truth, and three ways to talk to it.
The coordination primitives are first-class.
Atomic task claiming with optimistic locking. When a task is unclaimed, any number of agents can race to claim it. Exactly one wins. Every other contender gets a 409-equivalent error and moves on to a different task. This is verified end-to-end in CI: twenty concurrent agents race the same task, producing one success, nineteen conflicts, and zero unexpected errors. Stale claims auto-release after thirty minutes so a crashed agent does not lock the work forever.
Workflow automation. When a task closes, the system automatically unblocks every task whose blocked_by edge pointed at it. When the last subtask of a parent task closes, the parent auto-completes. You build the dependency graph once; the executor figures out the order.
Real-time SSE events. Every state change emits an event on a Server-Sent Events stream. Dashboards, second agents, and downstream automation all subscribe to one stream and react in real time. Closed tasks, status transitions, new dependencies, claim conflicts — all of it.
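The two workflow-automation cascades can be sketched in a few lines. This is a minimal in-memory illustration, not the shipped implementation: the Task shape, the status names other than in_progress, and the closeTask function are all hypothetical, and the real service does this against SQLite.

```typescript
// Hypothetical in-memory model; field and status names are assumptions.
type Status = "open" | "blocked" | "in_progress" | "closed";

interface Task {
  id: number;
  status: Status;
  parentId?: number;   // set on subtasks
  blockedBy: number[]; // ids of tasks that must close first
}

// Close a task, then apply both cascades.
// Returns the ids of every task whose status changed, in order.
function closeTask(tasks: Record<number, Task>, id: number): number[] {
  const task = tasks[id];
  if (!task || task.status === "closed") return [];
  task.status = "closed";
  const changed: number[] = [id];
  const all = Object.keys(tasks).map((k) => tasks[Number(k)]);

  // Cascade 1: unblock dependents whose blockers are now all closed.
  for (const t of all) {
    if (t.status !== "blocked" || t.blockedBy.indexOf(id) === -1) continue;
    const ready = t.blockedBy.every(
      (b) => tasks[b] !== undefined && tasks[b].status === "closed"
    );
    if (ready) {
      t.status = "open";
      changed.push(t.id);
    }
  }

  // Cascade 2: auto-complete the parent when its last subtask closes.
  if (task.parentId !== undefined) {
    const siblings = all.filter((t) => t.parentId === task.parentId);
    if (siblings.every((t) => t.status === "closed")) {
      // Recurse so the parent's own dependents are unblocked too.
      changed.push(...closeTask(tasks, task.parentId));
    }
  }
  return changed;
}
```

In this sketch a dependent stays blocked until every one of its blockers has closed, which is the behaviour the blocked_by description implies.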
The most elegant of the 22 MCP tools, and the one that captures the design philosophy in a sentence, is claim_task:
Atomically claim an unassigned task, setting assignee and transitioning status to in_progress. Returns 409-equivalent error if already claimed.
That is one tool definition. That is the entire story of how N agents work the same backlog without colliding.
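To make the claim semantics concrete, here is a minimal sketch of optimistic locking as a compare-and-set on a version counter. Everything in it is an assumption for illustration: the row shape, the version field, and the tryClaim and releaseIfStale helpers are hypothetical, and an in-memory object stands in for the database. In SQL the same idea is usually a single conditional UPDATE whose affected-row count tells you whether you won.

```typescript
// Hypothetical row shape; the real schema is not shown in this post.
interface ClaimableTask {
  id: number;
  status: "open" | "in_progress";
  assignee: string | null;
  claimedAt: number | null; // epoch milliseconds
  version: number;          // incremented on every write
}

type ClaimResult =
  | { ok: true; assignee: string }
  | { ok: false; code: 409; heldBy: string | null };

const STALE_MS = 30 * 60 * 1000; // the thirty-minute auto-release window

// Succeeds only if the task is unassigned AND nobody wrote since we read it.
function tryClaim(
  task: ClaimableTask, agent: string, seenVersion: number, now: number
): ClaimResult {
  if (task.assignee !== null || task.version !== seenVersion) {
    return { ok: false, code: 409, heldBy: task.assignee };
  }
  task.assignee = agent;
  task.status = "in_progress";
  task.claimedAt = now;
  task.version += 1;
  return { ok: true, assignee: agent };
}

// Frees a claim that has been held for the full stale window.
function releaseIfStale(task: ClaimableTask, now: number): boolean {
  if (task.assignee === null || task.claimedAt === null) return false;
  if (now - task.claimedAt < STALE_MS) return false;
  task.assignee = null;
  task.status = "open";
  task.claimedAt = null;
  task.version += 1;
  return true;
}

// Twenty agents all read version 0, then all try to claim.
const task: ClaimableTask = {
  id: 1, status: "open", assignee: null, claimedAt: null, version: 0,
};
const seen = task.version;
const results: ClaimResult[] = [];
for (let i = 1; i <= 20; i++) {
  results.push(tryClaim(task, `agent-${i}`, seen, 0));
}
const wins = results.filter((r) => r.ok).length;
console.log(wins, results.length - wins); // 1 19
```

The losers never overwrite the winner because their stale version number no longer matches, which is the whole point of the pattern.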
Capturing what the agent notices
There is a value pattern in this design that surprised me, and it might be the single best argument for putting a tracker behind an MCP server rather than behind any other interface.
When I give an agent a research task or a debugging task, that agent has a primary objective. It is trying to answer a specific question or fix a specific bug. The context window it is working in fills up fast with whatever serves that objective. Anything tangential — a smell in a neighboring file, an undocumented behavior, a brittle test, a missing migration, a TODO comment that has aged badly, a friction point in the build — would normally fall out of the agent’s context the moment its focus shifts. Context churn eats observations like that for breakfast. The agent is doing exactly what you asked it to do; everything it noticed on the way is gone the next session.
But because the tracker is an MCP server, the agent can pause for one tool call, file the observation as a new task in the right project, tag it appropriately, link it as a follow-up to whatever it was actually working on, and keep going. No interruption to the primary work. No working-memory cost beyond a single round-trip. The observation lands in the database with a timestamp, the file reference, and the agent’s reasoning preserved as the task body. I review the new tasks later, decide which ones to act on, and dispatch a worker against them when it makes sense.
That single pattern has become one of the most valuable things wood-fired-tasks does for me. Technical debt that would have been lost to context churn now accumulates as a tracked backlog. Bugs the agent noticed while doing something else become real bug reports instead of vanished neurons. The result is a queue I can plan against, contributed to by every agent session, without ever asking an agent to step outside its assigned scope. The tracker is doing classic backlog-grooming work, except the contributors filing the tickets are the agents themselves while they are nominally doing other things. That alone has made this a valuable tool.
The orchestration layer
A task tracker with twenty-two MCP tools and a REST API would already be useful. The thing that turns it from “tracker the agents can talk to” into “autonomous backlog drain” is the set of /tasks:* Claude Code skills shipped alongside the service.
/tasks:loop is the sequential autonomous executor. You point it at a project and it drains the backlog. For each task: pick the highest-priority open task, claim it, plan the validation depth, dispatch a fresh subagent to implement the fix, independently re-run build and test to verify the subagent’s claim, commit and push only the named files, dispatch a separate tasks-verifier subagent to grade the closed task against its acceptance criteria, close the task only on PASS, and emit a kill-safe LOOP-RUN.md audit artifact. The mental model the skill teaches:
Think of yourself as the foreman, not the carpenter. Each task: hand a self-contained brief to a fresh subagent (the carpenter), then independently re-check the work before signing it off. Your context only holds the plan, summaries, and verification results — never raw build logs, file scans, or trial-and-error.
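The foreman loop reduces to a small control flow. The sketch below is illustrative only: the Hooks interface, the drain function, and the rule that a higher priority number wins are all assumptions, and each hook is a stand-in for a subagent dispatch or shell step rather than anything the skill file actually defines. What it does show is the ordering that matters: re-verify before committing, grade after committing, close only on PASS.

```typescript
type LoopVerdict = "PASS" | "FAIL" | "PARTIAL" | "NOT_VERIFIED";

interface BacklogTask {
  id: number;
  priority: number; // assumption: larger number means more urgent
  status: "open" | "in_progress" | "closed";
}

// Every hook is a hypothetical stand-in for a real dispatch or command.
interface Hooks {
  implement(task: BacklogTask): string[];  // the carpenter; returns files changed
  rebuildAndTest(): boolean;               // the foreman re-runs build and tests
  commit(files: string[]): void;           // commits only the named files
  verify(task: BacklogTask): LoopVerdict;  // the read-only grader
}

// Drains the backlog once; returns one short log line per attempted task.
function drain(backlog: BacklogTask[], hooks: Hooks): string[] {
  const log: string[] = [];
  const attempted: Record<number, boolean> = {};
  for (;;) {
    const next = backlog
      .filter((t) => t.status === "open" && !attempted[t.id])
      .sort((a, b) => b.priority - a.priority)[0];
    if (!next) break;
    attempted[next.id] = true;
    next.status = "in_progress"; // the claim
    const files = hooks.implement(next);
    if (!hooks.rebuildAndTest()) {
      next.status = "open"; // assumption: a failed task returns to the queue
      log.push(`#${next.id} REBUILD_FAILED`);
      continue;
    }
    hooks.commit(files);
    const verdict = hooks.verify(next);
    next.status = verdict === "PASS" ? "closed" : "open";
    log.push(`#${next.id} ${verdict}`);
  }
  return log;
}
```

Note that the foreman only ever holds summaries: file names, a boolean, and a verdict. The raw work stays inside the hooks.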
/tasks:loop-dag is the wave-by-wave parallel version. Same primitives, but it computes the dependency frontier — the set of open tasks whose blockers are all satisfied — and dispatches a worker subagent per frontier task in parallel under a configurable concurrency cap. When the wave finishes, it runs the verifier per worker, runs an integration-auditor per file overlap, then recomputes the frontier and dispatches the next wave. The mental model:
Think of yourself as a foreman scheduling a build crew across independent foundations on the same site. Each foundation (wave) is a set of tasks that have no remaining dependencies. While the wave’s workers are pouring concrete in parallel, you (the orchestrator) plan the next wave. You never let a worker start before its supporting foundation has cured — that’s what blocked_by enforces.
Before either loop dispatches anything, it asks the topology_check MCP tool to classify the project. The tool walks the task_dependencies graph and returns one of three labels: FLAT (no dependency edges; parallel-safe; use /tasks:loop), DAG (acyclic with edges; use /tasks:loop-dag and let Kahn’s algorithm order the waves), or DAG_CYCLIC (refuse to run; cycles must be broken first, and no flag overrides this). The classifier is its own verification surface. You cannot accidentally run a parallel loop against a graph with a cycle.
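A three-way classifier like this falls out of Kahn's algorithm almost for free: if the algorithm can remove every node, the graph is acyclic. The following is a minimal sketch under stated assumptions, not the tool's actual code. The classifyTopology name, the edge tuple layout, and the requirement that every edge endpoint appears in taskIds are all mine.

```typescript
type Topology = "FLAT" | "DAG" | "DAG_CYCLIC";
type BlockEdge = [number, number]; // assumption: [blocker, blocked]

// Assumes every id mentioned in an edge also appears in taskIds.
function classifyTopology(taskIds: number[], edges: BlockEdge[]): Topology {
  if (edges.length === 0) return "FLAT";

  const indegree: Record<number, number> = {};
  const dependents: Record<number, number[]> = {};
  for (const id of taskIds) {
    indegree[id] = 0;
    dependents[id] = [];
  }
  for (const [blocker, blocked] of edges) {
    dependents[blocker].push(blocked);
    indegree[blocked] += 1;
  }

  // Kahn's algorithm: repeatedly remove tasks with no unmet blockers.
  const queue = taskIds.filter((id) => indegree[id] === 0);
  let removed = 0;
  while (queue.length > 0) {
    const id = queue.shift() as number;
    removed += 1;
    for (const next of dependents[id]) {
      indegree[next] -= 1;
      if (indegree[next] === 0) queue.push(next);
    }
  }

  // Any task left over sits on, or downstream of, a cycle.
  return removed === taskIds.length ? "DAG" : "DAG_CYCLIC";
}
```

Because the cycle check and the ordering come from the same pass, there is no way for the scheduler to disagree with the classifier about whether a graph is safe.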
The frontier correctness is fixture-tested. Edges {334→337, 335→337, 337→338, 337→339} must produce waves {334, 335}, then {337}, then {338, 339}. If that ordering ever breaks, the fixture test fails in CI and the change cannot land.
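The fixture can be reproduced with a level-by-level variant of Kahn's algorithm. This is a sketch, not the project's code: computeWaves is a hypothetical name, and I am assuming an edge a→b means "a blocks b". Sorting each wave is my choice to keep the output deterministic.

```typescript
type DepEdge = [number, number]; // assumption: [blocker, blocked]

// Groups tasks into waves; each wave contains only tasks whose
// blockers were all scheduled in earlier waves.
function computeWaves(taskIds: number[], edges: DepEdge[]): number[][] {
  const unmet: Record<number, number> = {};
  const dependents: Record<number, number[]> = {};
  for (const id of taskIds) {
    unmet[id] = 0;
    dependents[id] = [];
  }
  for (const [blocker, blocked] of edges) {
    dependents[blocker].push(blocked);
    unmet[blocked] += 1;
  }

  const waves: number[][] = [];
  let scheduled = 0;
  let frontier = taskIds
    .filter((id) => unmet[id] === 0)
    .sort((a, b) => a - b);

  while (frontier.length > 0) {
    waves.push(frontier);
    scheduled += frontier.length;
    const next: number[] = [];
    for (const id of frontier) {
      for (const dep of dependents[id]) {
        unmet[dep] -= 1;
        if (unmet[dep] === 0) next.push(dep);
      }
    }
    frontier = next.sort((a, b) => a - b);
  }

  if (scheduled !== taskIds.length) {
    throw new Error("cycle detected: refusing to schedule");
  }
  return waves;
}

const waves = computeWaves(
  [334, 335, 337, 338, 339],
  [[334, 337], [335, 337], [337, 338], [337, 339]]
);
console.log(JSON.stringify(waves)); // [[334,335],[337],[338,339]]
```

Task 337 lands in the second wave only after both 334 and 335 have been counted off, which is exactly the property the fixture pins down.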
The graders that grade the graders
The orchestration discipline that makes the loops trustworthy is generator/critic separation. The agent that wrote the code never grades the code. A separate, read-only agent grades it.
tasks-verifier is dispatched after every closed task. Its tool list is deliberately restricted:
tools: Read, Grep, Glob, Bash,
mcp__wood-fired-tasks__get_task,
mcp__wood-fired-tasks__get_comments,
mcp__wood-fired-tasks__get_dependencies,
mcp__wood-fired-tasks__list_tasks,
mcp__wood-fired-tasks__list_projects
No Edit, no Write, no mutating MCP tool. The verifier physically cannot alter the code it is grading or change the task it is grading against. Bash is allowlisted to read-only commands — git log, git diff, git show, the project’s test and build commands, cat, head, tail, sqlite3 SELECT-only — and explicitly denies everything that could mutate state. It runs with hard bounds: thirty tool calls and five minutes per task. The verdict is a structured JSON object — PASS, FAIL, PARTIAL, or NOT_VERIFIED — with cited evidence per acceptance criterion. The forbidden-evidence rule, paraphrased from the skill file:
FORBIDDEN evidence: “looks good”, “appears to satisfy”, “the worker said so”, any paraphrase that does not cite a file, command, or commit.
If the verifier emits anything ungrounded, a static gate rejects the verdict before the close sticks.
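A static gate of this kind can be quite small. The sketch below is one plausible shape, not the shipped gate: the CriterionCheck and Citation interfaces, the gateViolations function, and the decision to require a citation for every status are all assumptions. Only the three forbidden phrases and the four verdict values come from the post.

```typescript
type CheckStatus = "PASS" | "FAIL" | "PARTIAL" | "NOT_VERIFIED";

// Hypothetical verdict shape; the real JSON schema is not shown in this post.
interface Citation {
  kind: "file" | "command" | "commit";
  ref: string; // e.g. a path, a command line, or a commit hash
}

interface CriterionCheck {
  criterion: string;
  status: CheckStatus;
  note: string;
  citations: Citation[];
}

const FORBIDDEN_PHRASES = [
  "looks good",
  "appears to satisfy",
  "the worker said so",
];

// Returns a list of problems; an empty list means the verdict is grounded.
function gateViolations(checks: CriterionCheck[]): string[] {
  const problems: string[] = [];
  for (const check of checks) {
    const note = check.note.toLowerCase();
    for (const phrase of FORBIDDEN_PHRASES) {
      if (note.indexOf(phrase) !== -1) {
        problems.push(`${check.criterion}: forbidden phrase "${phrase}"`);
      }
    }
    // Assumption: every criterion needs at least one non-empty citation.
    const cited = check.citations.some((c) => c.ref.trim().length > 0);
    if (!cited) {
      problems.push(`${check.criterion}: no file, command, or commit cited`);
    }
  }
  return problems;
}
```

A phrase list can only catch the literal wording, so in this sketch the citation requirement is what does most of the work.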
integration-auditor is the second grader, and it catches a failure mode the per-task verifier physically cannot see. When two worker subagents in the same loop run touch the same file, the auditor is dispatched once per overlap to grade that one file × two-hunk seam:
You are the falsifiable gate that surfaces composition bugs the per-task verifier cannot see — because per-task verifier sees only one task’s diff against HEAD~, never the union of two workers’ edits to the same symbol. Without this gate, ten green tasks can compose into a broken system and the loop never notices.
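Finding the overlaps the auditor needs is a grouping problem: index the wave's edits by file, then emit every pair of tasks that share one. This is a hypothetical sketch. WorkerResult, Overlap, and findOverlaps are names I made up, and treating three workers on one file as three pairwise audits is my reading of "one file × two-hunk seam", not something the post states.

```typescript
interface WorkerResult {
  taskId: number;
  files: string[]; // files this worker edited
}

interface Overlap {
  file: string;
  taskIds: [number, number];
}

// Returns one entry per (file, pair of tasks) that touched the same file.
function findOverlaps(results: WorkerResult[]): Overlap[] {
  // Object.create(null) avoids collisions with names like "constructor".
  const byFile: Record<string, number[]> = Object.create(null);
  for (const result of results) {
    for (const file of result.files) {
      if (!byFile[file]) byFile[file] = [];
      if (byFile[file].indexOf(result.taskId) === -1) {
        byFile[file].push(result.taskId);
      }
    }
  }

  const overlaps: Overlap[] = [];
  for (const file of Object.keys(byFile).sort()) {
    const ids = byFile[file];
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        overlaps.push({ file, taskIds: [ids[i], ids[j]] });
      }
    }
  }
  return overlaps;
}
```

Files touched by a single worker produce no entries, so a wave with no shared files dispatches no auditors at all.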
The auditor’s verdict surface is SAFE / RISKY / BROKEN, with tighter bounds than the verifier (fifteen tool calls, three minutes, because the scope is one file). It is allowed to mark something BROKEN only if it can cite a concrete file:line referent; otherwise it falls back to RISKY. The auditor exists because of an incident on 2026-05-23 where the verifier emitted an invalid per-check status enum and the orchestrator silently upgraded the run from PARTIAL to PASS based on its own observation. The commit that hardened the no-upgrade rule, 6b26fc5, is in the public history:
“the orchestrator silently upgraded both runs from PARTIAL to PASS based on its own observation, violating the Generator/critic separation rule.”
The fix made the upgrade impossible to repeat. A static gate now refuses any verdict that promotes a per-check status outside the allowed enum. That is the kind of bug that does not exist if you do not separate the generator from the critic. It is also the kind of bug that, when you do separate them, you find at design time instead of at run time.
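Both rules in this section, the no-upgrade gate and the auditor's BROKEN-needs-a-referent fallback, are simple enough to express as pure functions. The sketch below is one interpretation rather than the code in commit 6b26fc5: gateVerdict and normalizeAudit are hypothetical names, and the rule that an overall PASS requires every check to be PASS is my assumption about what "no upgrade" means in practice.

```typescript
const ALLOWED_STATUSES = ["PASS", "FAIL", "PARTIAL", "NOT_VERIFIED"];

interface GateResult {
  accepted: boolean;
  reason: string;
}

// Rejects malformed verdicts outright instead of reinterpreting them.
function gateVerdict(overall: string, checkStatuses: string[]): GateResult {
  if (ALLOWED_STATUSES.indexOf(overall) === -1) {
    return { accepted: false, reason: `invalid overall status: ${overall}` };
  }
  if (checkStatuses.length === 0) {
    return { accepted: false, reason: "no per-check statuses supplied" };
  }
  const invalid = checkStatuses.filter(
    (s) => ALLOWED_STATUSES.indexOf(s) === -1
  );
  if (invalid.length > 0) {
    return {
      accepted: false,
      reason: `invalid per-check status: ${invalid.join(", ")}`,
    };
  }
  // Assumption: the overall verdict may not be better than its checks.
  const allPass = checkStatuses.every((s) => s === "PASS");
  if (overall === "PASS" && !allPass) {
    return { accepted: false, reason: "overall PASS with non-PASS checks" };
  }
  return { accepted: true, reason: "ok" };
}

type AuditVerdict = "SAFE" | "RISKY" | "BROKEN";

// BROKEN survives only with a concrete file:line referent.
function normalizeAudit(
  verdict: AuditVerdict, referent: string | null
): AuditVerdict {
  const hasFileLine = referent !== null && /^[^\s:]+:\d+$/.test(referent);
  return verdict === "BROKEN" && !hasFileLine ? "RISKY" : verdict;
}
```

The key design choice is that neither function ever returns a better verdict than it was given. The only moves available are reject and downgrade.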
A note on attribution. The generator/critic-separation pattern and the frontier-wave execution model are not my invention. They come from GSD, the third-party orchestration framework I credited earlier. What is mine is the implementation of those patterns inside a tasks-loop executor — the verifier and auditor agent definitions, the loop skills, the topology classifier, the database schema that persists verifier verdicts on the task row, the MCP server that exposes it all. The shipped skills are vendor-neutral by design; the commit that introduced /tasks:loop-dag explicitly removed cross-references to other agent-focused tooling from the skill text. The lineage is real and credited; the implementation is independent.
How I’m using it
I orchestrate everything I ship through wood-fired-tasks now. Three concrete examples from the last week, all part of the public commit history of this very repo.
When I want to grade a TypeScript codebase against current community standards, I open a separate Codex session and tell it the truth: I have not read this codebase, I did not write a single line of it, please produce a structured improvement plan and enter it as tasks in the production database. Codex emits a multi-phase roadmap. A follow-on Codex session turns the roadmap into a tracked project with created_by: codex stamped on every task. Then I run /tasks:loop project N and Claude Opus 4.7 implements the plan while Codex sleeps. Two frontier models from two different vendors grade each other through the task system without ever meeting inside the same CLI session.
When I want to harden a service before a public release, I dispatch a Codex audit and an Opus audit independently, paste one’s findings into the other, ask Opus to merge both views into a plan, run /tasks:decompose to turn the plan into a structured project, then /tasks:loop-dag project N to drain it in parallel waves with the verifier and integration-auditor running between waves. That is how the final pre-launch sweep on this repo happened the night of May 25. Every task closed before morning.
When I want to clear a backlog overnight, I leave /tasks:loop-dag running with --max-waves 3 --concurrency 4 and check the LOOP-RUN.md artifact when I wake up. Twenty tasks, dispatched in three waves, every one independently verified by an agent that cannot edit the code it is grading. Every claim conflict, every verifier verdict, every integration audit landed in the database as searchable history.
The thing I want every reader to take away from those three patterns is that the orchestrator is not me. The orchestrator is a skill file. I am the project planner and the appellate court. The execution happens because the primitives are first-class.
Extending it
I have extended wood-fired-tasks for my own use with a Grafana dashboard suite that visualizes loop runs and per-workflow cost in real time, a set of session summarizers that auto-generate commit messages and release notes from the telemetry stream, and an attribution hook that tags every agent transaction back to the task it was working on. That extension layer is personal scaffolding I happen to find useful; it is out of scope for this post and I will write it up separately. What matters for a reader installing the tracker today is that the SSE event stream and the structured task rows in the SQLite schema are designed to be exactly the integration points for that kind of observability. Whether you wire it into Grafana, Datadog, a homemade dashboard, or nothing at all, the primitives are there. The point of open-sourcing the tracker is that the parts you build on top of it are yours.
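If you want to consume the event stream from your own tooling, the wire format is the standard Server-Sent Events framing, so a parser is short. The sketch below handles complete frames only and ignores the id and retry fields. The event name and JSON payload in the usage note are invented for illustration, since the post does not list the actual event names or payload shapes.

```typescript
interface SseEvent {
  event: string; // defaults to "message" when the frame has no event field
  data: string;  // multiple data lines are joined with "\n"
}

// Parses a block of SSE text containing zero or more complete frames.
function parseSse(text: string): SseEvent[] {
  const events: SseEvent[] = [];
  let event = "message";
  let data: string[] = [];

  for (const line of text.split(/\r\n|\n|\r/)) {
    if (line === "") {
      // A blank line ends the frame; frames without data are dropped.
      if (data.length > 0) events.push({ event, data: data.join("\n") });
      event = "message";
      data = [];
      continue;
    }
    if (line.charAt(0) === ":") continue; // comment, often a keep-alive

    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.charAt(0) === " ") value = value.slice(1);

    if (field === "event") event = value;
    else if (field === "data") data.push(value);
  }
  return events;
}
```

Given a hypothetical frame such as `event: task.closed` followed by `data: {"id":337}` and a blank line, parseSse returns one event whose data field you can hand to JSON.parse. A real client also has to buffer partial frames across network chunks, which this sketch leaves out.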
Try it
git clone https://github.com/Wood-Fired-Games/wood-fired-tasks.git
cd wood-fired-tasks && npm install && npm run build
export API_KEYS="your-api-key-here"
export DATABASE_PATH="./data/tasks.db"
npm run migrate && npm start
tasks create --title "My first task" --project 1 --created-by "me"
That is the entire start. The included install.sh (and install.ps1 on Windows) registers the MCP server in ~/.claude.json, copies the /tasks:* skill files to ~/.claude/commands/tasks/, and copies the verifier and auditor agent definitions to ~/.claude/agents/. Restart Claude Code and the autonomous loops are wired up.
The README at the repo root is the full reference. The smallest interesting thing you can do once it is installed is open Claude Code, type /tasks:loop 1, and watch it run. The smallest interesting thing you can do without an agent at all is tasks list --project 1 and start filing real work into a tracker that an agent can pick up later.
Why this is open source
The honest answer is that wood-fired-tasks is load-bearing inside my own development practice. The verifier and auditor lanes that grade every closed task before the close sticks are most of the reason I can ship at the volume I ship. Hiding the executor while publishing posts about how I use it would have been dishonest. Open-sourcing it is the cost of being credible about the practice.
It is MIT-licensed. Use it for your own backlogs. Fork it. Replace the verifier with your own grader. Wire your own dashboards into the SSE stream. If you do, let me know — the part of this I most want to learn from is what other people’s orchestration patterns look like when they have a real task layer underneath them.
A companion post going up next week describes how I came to trust the validation infrastructure enough to ship this release without reading the code. The two posts are meant to be read together.
Repository: github.com/Wood-Fired-Games/wood-fired-tasks
npm: npm install wood-fired-tasks