Research & Development · Claude Code plugins

Agents create and judge. Code enforces.

A reusable, enforcement-first delivery operating system for teams of AI coding agents.

Felipe Cardoso · 2026 · github.com/fcms14/agents-foundation

Abstract

Individual developers got faster with AI assistants; teams got a new governance problem. Ambiguous intent scales into code in seconds, reviews drown in volume, and "done" becomes a claim rather than a fact. agents-foundation is an attempt to answer that at the process level: a team of role-based agents that create and judge, a Git-native markdown kanban they execute against, and a two-layer system of deterministic gates that make state transitions and quality invariants non-bypassable. The central design rule — agents judge; code does the bookkeeping and enforces it — is what separates this from prompt-centric methods that still trust a human (or an LLM) to remember the rules. It ships as a Claude Code plugin marketplace, split into an agnostic process layer and swappable stack layers.

An AI agent will happily mark its own homework. It will tick a checkbox it didn't satisfy, move a task to done it didn't finish, and report a green suite that only asserts a mock. None of this is malice — it is the predictable behavior of a system optimized to produce plausible continuations. The question this project asks is narrow and practical: what should an LLM be trusted to do, and what should be taken out of its hands entirely?

The problem: speed without governance

The first wave of AI coding tooling optimized the inner loop — one developer, one prompt, more code per hour. At team scale a different friction appears. Requirements turn into implementation before anyone agrees on scope. Pull requests arrive faster than they can be reviewed. Generated code is locally plausible but globally adrift from intent. And the record of why a system is the way it is evaporates the moment the chat window closes.

The naïve fix — "have the agent follow a checklist" — fails for the same reason the problem exists: the agent is the thing you cannot fully trust to follow the checklist. A method that depends on the model never forgetting a constraint has simply moved the failure, not removed it.

The thesis: judgment vs. bookkeeping

The foundation is built on one distinction, applied everywhere:

Agents create and judge. Deterministic steps apply state, and hooks enforce it.

Creating a plan, implementing a feature, judging whether code meets a spec — these need context, taste, and reasoning. They belong to agents. Ticking the acceptance-criteria boxes, stamping the verdict, moving the task between states, refusing a commit that violates an invariant — these are mechanical. They must never depend on an LLM's discretion, because discretion is exactly what fails silently.

So the reviewer agent returns a structured verdict; a deterministic script applies it (ticks the criteria, stamps the section, moves the file); and a hook refuses to let any task reach done without a recorded approval and every criterion checked. The judgment is the agent's. The bookkeeping is code. The enforcement is not optional.
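
As a minimal sketch of the "deterministic script applies it" step, the function below ticks the Spec criteria and stamps the Verdict section from a structured verdict. The `Verdict` shape, the `## ` heading level, the `- [ ]` checkbox syntax, and the `decision:` stamp line are all assumptions made for illustration; the foundation's real file format and script may differ.

```python
import re
from dataclasses import dataclass

CHECKBOX = re.compile(r"^(\s*- \[)([ xX])(\] .*)$")


@dataclass(frozen=True)
class Verdict:
    decision: str            # "approve" or "changes-requested" (assumed vocabulary)
    met: tuple[bool, ...]    # one flag per acceptance criterion, in file order


def apply_verdict(task_md: str, verdict: Verdict) -> str:
    """Tick the Spec criteria and stamp the Verdict section. No judgment happens here."""
    out: list[str] = []
    section = None
    i = 0
    for line in task_md.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            out.append(line)
            if section == "Verdict":
                # Stamp the decision; any previous stamp is replaced below.
                out.extend([f"decision: {verdict.decision}", ""])
            continue
        if section == "Verdict":
            continue  # drop the old body of the Verdict section
        match = CHECKBOX.match(line)
        if section == "Spec" and match:
            if i >= len(verdict.met):
                raise ValueError("verdict does not cover every acceptance criterion")
            mark = "x" if verdict.met[i] else " "
            line = f"{match.group(1)}{mark}{match.group(3)}"
            i += 1
        out.append(line)
    if i != len(verdict.met):
        raise ValueError("verdict lists more criteria than the Spec contains")
    return "\n".join(out) + "\n"
```

The point of the sketch is the division of labor: the reviewer only produces the `Verdict` value, and everything that touches the file is a pure text transformation that can be tested without a model in the loop.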

Anatomy of the foundation

Concretely, installing the foundation into a repository gives it five things: a team of agents, a board, a task contract, a set of rules, and the gates that bind them. The rest of this article walks each one and the reasoning behind it.

Two layers: process and stack

The system is split into two plugins. delivery-team carries how work flows — the role agents, the kanban, the gates, the engineering principles — and knows nothing about any particular framework. stack-turbo-nest-react carries which technology — the implementer agents and the opinionated conventions for one concrete stack (NestJS + React/Turborepo). A project installs the process layer alone, or pairs it with a stack.

Layer	Plugin	Owns
Process (agnostic)	`delivery-team`	kanban workflow, role agents, review discipline, the deterministic gates, engineering principles, ADR + doc + test philosophy
Stack (opinionated)	`stack-turbo-nest-react`	implementer agents (backend/frontend/infra) and their rules; stack-specific gates; the C4 documentation model

The split is not cosmetic. It is what lets the process be reused: a future stack-go-chi would plug into the same reviewer, board, and bootstrapper without touching a line of the process layer — because the process layer's active components contain no framework specifics.

The role team

The orchestrator (a Delivery Manager persona) reads the board, builds a dependency graph, and dispatches role agents: implementers run in parallel, in isolated Git worktrees, when their work is disjoint, and are serialized when it isn't. Workers run on a fast model; the reviewer, whose judgment matters most, runs on the strongest one.
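
The dispatch rule can be sketched as a topological sort plus a disjointness check. This is an illustration under stated assumptions, not the orchestrator's actual algorithm: it assumes each task declares the tasks it depends on and the paths it expects to touch (`deps` and `touches` are hypothetical inputs), and it treats two tasks as disjoint when those path sets do not overlap.

```python
from graphlib import TopologicalSorter


def schedule(deps: dict[str, set[str]], touches: dict[str, set[str]]) -> list[list[str]]:
    """Return batches in execution order; tasks inside one batch may run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        # Split the ready set greedily so no two tasks in a batch share a path.
        while ready:
            batch: list[str] = []
            claimed: set[str] = set()
            deferred: list[str] = []
            for task in ready:
                paths = touches.get(task, set())
                if claimed.isdisjoint(paths):
                    batch.append(task)
                    claimed |= paths
                else:
                    deferred.append(task)  # overlaps: serialize into a later batch
            batches.append(batch)
            sorter.done(*batch)
            ready = deferred
    return batches
```

With `api` and `ui` touching different directories, `api-fix` touching the same directory as `api`, and `docs` depending on both `api` and `ui`, the sketch yields three batches: `api` with `ui`, then `api-fix`, then `docs`.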

flowchart TD
  DM["Delivery Manager<br/>(orchestrator)"]
  PL["Planner<br/>PO + Tech Lead"]
  RV["Reviewer<br/>Quality Gate"]
  subgraph ENG["Engineering — stack layer"]
    BE["Backend"]
    FE["Frontend"]
    IN["Platform / DevOps"]
  end
  DOC["Technical Writer"]
  QA["QA<br/>(specialist)"]
  DM --> PL
  DM --> ENG
  DM --> DOC
  DM --> QA
  DM --> RV
  PL -. specifies tasks .-> ENG
  ENG -. implements + unit/e2e .-> RV
  QA -. load / journeys / resilience .-> RV
  DOC -. living docs .-> RV
  RV -. "approve / changes-requested" .-> DM

The role team. Solid edges: dispatch. Dotted edges: hand-offs converging on the review gate.

Two roles are deliberately not agents. Backlog replenishment and verdict application are deterministic skills (procedures run in-context, not reasoning tasks) precisely because they are bookkeeping.

The work kanban: state is a location

Work lives as a markdown kanban in work/, and a task's state is the folder it sits in: backlog → ready → active → review → done. A transition is a git mv; the commit history is the audit trail. There is deliberately no status: field in the task — a field goes stale the moment someone forgets to update it, but a file's location cannot lie about where it is.
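
A short sketch of what "state is a location" means in code: the state is read from the path, and a transition is nothing more than the arguments to `git mv`. The flat `work/<state>/<file>` layout and the helper names are assumptions for illustration.

```python
from pathlib import PurePosixPath

STATES = ("backlog", "ready", "active", "review", "done")


def state_of(task_path: str) -> str:
    """Derive a task's state from the folder it sits in (assumed layout: work/<state>/<file>)."""
    parts = PurePosixPath(task_path).parts
    if len(parts) != 3 or parts[0] != "work" or parts[1] not in STATES:
        raise ValueError(f"not a task on the board: {task_path}")
    return parts[1]


def git_mv_args(task_path: str, new_state: str) -> list[str]:
    """Build the git command that performs a transition; the caller decides whether to run it."""
    state_of(task_path)  # validates that the source is on the board
    if new_state not in STATES:
        raise ValueError(f"unknown state: {new_state}")
    target = PurePosixPath("work", new_state, PurePosixPath(task_path).name)
    return ["git", "mv", task_path, str(target)]
```

Because nothing is stored inside the file, there is no second copy of the state to fall out of sync with the first.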

Each task is a single file with a standardized shape — Spec (an immutable contract, with acceptance criteria as a checklist), Plan, Todo, Verdict, and Log. The headings and the verdict vocabulary are standardized because automation parses them. The reviewer judges the criteria; it never edits the file.
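
Standardized headings are what make the file machine-readable. The parser below is a sketch that assumes second-level markdown headings named exactly Spec, Plan, Todo, Verdict, and Log, and `- [ ]` / `- [x]` checkboxes for criteria; the real heading level and checkbox syntax are not specified here.

```python
import re

SECTIONS = ("Spec", "Plan", "Todo", "Verdict", "Log")


def parse_task(task_md: str) -> dict[str, list[str]]:
    """Split a task file into its sections and fail loudly when one is missing."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in task_md.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    missing = [name for name in SECTIONS if name not in sections]
    if missing:
        raise ValueError(f"task is missing sections: {', '.join(missing)}")
    return sections


def acceptance_criteria(sections: dict[str, list[str]]) -> list[tuple[bool, str]]:
    """Return (ticked, text) for every checkbox in the Spec section."""
    items = []
    for line in sections["Spec"]:
        match = re.match(r"\s*- \[([ xX])\] (.+)", line)
        if match:
            items.append((match.group(1) != " ", match.group(2)))
    return items
```

A file that does not match the shape is rejected before any automation acts on it, which is the practical reason to standardize the headings at all.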

The gates: two layers of enforcement

The same invariant — no task reaches done without a recorded verdict and every acceptance criterion ticked — is enforced twice, by design:

Agent-time — a PreToolUse hook fires the moment an agent tries to move a task into done/, blocking it early with the reason.
Commit-time — a Git pre-commit hook runs the same validator over the staged change, so the gate holds even outside the agent's context. (Because Git hooks run outside the tool, the bootstrapper copies the validators into the repository — a small, deliberate duplication: anything executed by something other than the agent must live where that something can see it.)
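
The two layers can share one validator, sketched below. `done_violations` is the invariant itself; `pre_tool_use` shows how an agent-time hook might call it. The sketch assumes the hook receives the tool call as JSON with `tool_name` and `tool_input.command` fields and that exit code 2 blocks the call and surfaces the message to the agent, which is how Claude Code's hook documentation describes the contract at the time of writing. The `decision: approve` stamp line and both function names are hypothetical.

```python
import re


def _section(task_md: str, name: str) -> str:
    """Body of one '## name' section, or '' when the heading is absent."""
    match = re.search(rf"^## {re.escape(name)}\n(.*?)(?=^## |\Z)", task_md, re.S | re.M)
    return match.group(1) if match else ""


def done_violations(task_md: str) -> list[str]:
    """Reasons a task may not sit in done/; an empty list means it may."""
    reasons = []
    boxes = re.findall(r"^\s*- \[([ xX])\]", _section(task_md, "Spec"), re.M)
    if not boxes:
        reasons.append("Spec has no acceptance criteria")
    elif " " in boxes:
        reasons.append("unticked acceptance criterion")
    if not re.search(r"^decision: approve\s*$", _section(task_md, "Verdict"), re.M):
        reasons.append("no recorded approval")
    return reasons


def pre_tool_use(payload: dict, read_text) -> tuple[int, str]:
    """Return (exit code, stderr message) for one PreToolUse event."""
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    move = re.search(r"\bgit mv (\S+) (\S*work/done/\S+)", command)
    if not move:
        return 0, ""
    reasons = done_violations(read_text(move.group(1)))
    if reasons:
        return 2, "blocked move into done/: " + "; ".join(reasons)
    return 0, ""
```

A commit-time hook would call the same `done_violations` on every staged file under `work/done/` (for example, the paths listed by `git diff --cached --name-only`) and exit non-zero when any list is non-empty, so both layers reject exactly the same states.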

A second gate requires that a schema migration ship with its data-model documentation, so the diagram of the database can't silently drift behind the schema. That gate is stack-specific, so it lives in the stack layer, not the agnostic one.
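
The migration gate reduces to a check over the set of staged paths. In this sketch both path constants are hypothetical placeholders, not the stack plugin's real layout, and the staged list is assumed to come from something like `git diff --cached --name-only`.

```python
from fnmatch import fnmatch

# Hypothetical locations; the stack layer's real paths are not specified here.
MIGRATION_GLOB = "apps/api/migrations/*"
DATA_MODEL_DOC = "docs/data-model.md"


def migration_gate(staged: list[str]) -> list[str]:
    """Flag every staged migration when the data-model doc is not staged with it."""
    migrations = [path for path in staged if fnmatch(path, MIGRATION_GLOB)]
    if migrations and DATA_MODEL_DOC not in staged:
        return [f"{path}: migration staged without an update to {DATA_MODEL_DOC}"
                for path in migrations]
    return []
```

The check is deliberately coarse: it cannot tell whether the documentation change is accurate, only that one exists, which leaves the accuracy question to the reviewer.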

Rule-driven agents, not rule-containing agents

An early version of the reviewer hard-coded a stack's concerns — rate limiting, pagination style, design tokens. That was a leak: the "agnostic" process layer secretly knew about NestJS and React. The fix generalizes a principle worth stating on its own:

A strong convention belongs in a rule, applied to every task — never restated per task, where it can be forgotten.

The reviewer became a rule-interpreter instead of a rule-container: it loads whatever rules are present and treats each one as part of the contract. Install a stack, and its rules join the checklist automatically. Install the process layer alone, and there is no stack noise. The same pattern was applied to the documentation agent and the docs-refresh command. The opinions live in exactly one place — the rules — so an agent and a rule can never disagree.
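
The rule-interpreter pattern can be sketched in a few lines: the component holds no opinions of its own and builds its checklist from whatever rule files it finds. The one-markdown-file-per-rule layout and the function names are assumptions for illustration.

```python
from pathlib import Path


def load_rules(rules_dir: Path) -> dict[str, str]:
    """Load every rule present, keyed by file name; an empty directory yields no rules."""
    return {path.stem: path.read_text(encoding="utf-8")
            for path in sorted(rules_dir.glob("*.md"))}


def review_contract(task_criteria: list[str], rules: dict[str, str]) -> list[str]:
    """The full checklist for a review: the task's own criteria plus every installed rule."""
    return ([f"criterion: {text}" for text in task_criteria]
            + [f"rule: {name}" for name in rules])
```

Nothing in either function names a framework. Installing a stack only changes the contents of the directory, which is why the process layer can stay agnostic.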

Abstraction-first planning

One idea was adopted from the research surveyed below: model the domain and the module boundaries before writing code, "otherwise the AI sprints on implementation details while the structure falls apart." But it was adopted the foundation's way — not as a section an author fills in per feature (and forgets), but as a principle the planner applies to every task and the reviewer checks. Every Plan opens by naming the entities, their relationships, and the dependency direction across modules, before any step-by-step. Structure that has drifted from the modeled shape is a review finding even when the behavior works.
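
To make "drifted from the modeled shape" concrete, here is a sketch that compares the dependency direction named in a Plan with the module dependencies actually observed in the code. The article describes this as something the reviewer checks, not as an automated gate, so treat the function and its inputs (`modeled`, `observed`) as a hypothetical aid.

```python
def structure_findings(modeled: dict[str, set[str]],
                       observed: list[tuple[str, str]]) -> list[str]:
    """Report observed module dependencies that the Plan's model does not allow."""
    findings = set()
    for source, target in observed:
        if source != target and target not in modeled.get(source, set()):
            findings.add(f"{source} -> {target} is not in the modeled dependency direction")
    return sorted(findings)
```

For a Plan that says `billing` may depend on `accounts` but not the reverse, an observed import from `accounts` into `billing` is reported even if every test passes, which matches the rule that structural drift is a finding in its own right.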

Related work: prompt-canvas methods (SPDD)

Structured Prompt-Driven Development (SPDD) shares this project's core conviction: make intent explicit and versioned before code, and keep humans in control through judgment rather than typing. SPDD's instrument is the REASONS Canvas, a seven-part structured prompt treated as a first-class, version-controlled artifact, kept in two-way sync with the code. It is a genuinely good idea, and the abstraction-first discipline above is borrowed from it.

Where this foundation diverges is the axis it optimizes. A canvas optimizes the time axis: one feature's spec lives, syncs, and compounds into the next. This foundation optimizes the control axis: the state of many concurrent tasks is kept mechanically correct, and the process that governs it cannot be skipped.

Dimension	agents-foundation	Prompt-canvas (e.g. SPDD)
Enforcement	Deterministic & non-bypassable — hooks + scripts refuse a bad `done`	Process discipline + manual review
State tracking	Explicit Git-native kanban, dependency graph, auto-replenish	Tracked implicitly through commits
Team / scale	Multi-agent orchestration, parallel worktrees, model tiering	One developer + AI, sequential
Standards	Strong rules applied to every task — impossible to forget	Declared per feature — relies on the author remembering
Reusability	Installable marketplace: agnostic process + swappable stacks	A single methodology + its CLI
Judgment vs. bookkeeping	Split by construction	Largely manual

The sharpest difference is the standards row. A canvas trusts the author to restate the security, performance, and structure constraints on every feature; people forget, and the gap ships silently. This foundation keeps those as strong rules enforced for all tasks, so a constraint cannot be omitted from one task by accident.

If you want a method a disciplined developer follows, a prompt canvas is excellent. If you want a system that won't let the process be skipped, and that scales to a team of agents, this is the bet.

Design principles, distilled

Take bookkeeping away from the LLM. If a step is mechanical, a script does it and a hook enforces it.
Make state un-lie-able. Encode it where it cannot be forgotten — a folder, a commit — not in a field someone updates by hand.
Conventions are rules, not reminders. Enforce for all; never rely on per-task recall.
Components read opinions; rules hold them. One source of truth, so an agent and a rule never diverge.
Model the shape before the detail. Abstraction-first, on every task.
Choose the form by the work. Agent for judgment, skill for procedure, rule for constraint, hook for the non-bypassable.

Try it

A marketplace is just a Git repository — no registry. Add it, install one or both plugins, and bootstrap a repo:

/plugin marketplace add fcms14/agents-foundation
/plugin install delivery-team@agents-foundation
/plugin install stack-turbo-nest-react@agents-foundation   # optional stack layer
/delivery-team:init                                         # scaffold work/, docs/, rules, gates

/delivery-team:task-new <goal>     # specify → /delivery-team:task-start → review → apply-verdict

The bootstrapper scaffolds the kanban and an ADR-seeded docs/ tree, materializes the rules, and wires the commit-time gates into whatever hook mechanism the repo uses.

References & further reading

Patton, J. et al. Structured Prompt-Driven Development. martinfowler.com.
Nygard, M. Documenting Architecture Decisions — the ADR practice the foundation seeds into every repo.
Brown, S. The C4 model for visualising software architecture — the documentation model used by the stack layer.
Anthropic. Claude Code documentation — plugins, marketplaces, hooks, and subagents.
agents-foundation — the source, the plugins, and the full design notes.
