Forking gstack to Close the Plan-to-Execution Gap
Garry Tan's gstack gives Claude Code structured planning skills. Orch gives it parallel execution. I forked gstack to bridge the two, so a reviewed plan becomes running agents in one step.
In the last post, I wrote about building orch - a Go CLI that coordinates multiple Claude Code instances via tmux. It works well once you have specs. But getting from “idea” to “specs” to “running agents” still required manual steps: write a plan, review it, manually translate it into role-specific spec files, then run orch up for each agent.
Meanwhile, gstack - Garry Tan’s open-source skill framework for Claude Code - had already solved the planning side. Skills like /plan-ceo-review and /plan-eng-review produce structured, reviewed plans with real artifacts. The problem was that these plans had nowhere to go. They’d sit in a markdown file until someone manually executed them.
So I forked gstack to make it orch-aware. The goal: a reviewed plan flows directly into parallel agent execution with zero manual translation.
What does gstack do for Claude Code planning?
gstack is a collection of 21+ skills that turn Claude Code into a structured workflow. The ones relevant here are the planning skills:
- /plan-ceo-review - Product-level review. Challenges premises, expands scope when warranted, finds the 10-star version of the idea.
- /plan-eng-review - Architecture-level review. Locks in data flow, edge cases, test coverage, performance characteristics.
These skills produce artifacts - review logs, test plans, architectural decisions - stored in ~/.gstack/projects/. They’re opinionated and thorough. After running both on a task, you have a plan that’s been stress-tested from both the product and engineering angles.
But then what? You copy-paste sections into spec files, manually adapt them for each agent role, and run the orch commands yourself. That handoff was the bottleneck.
What changes bridge gstack planning to orch execution?
The fork touches surprisingly little code. Nine files changed. The modifications fall into three categories.
1. The /orch skill
The biggest addition is a new skill that bridges gstack’s planning output into orch’s execution model. When you run /orch, it:
- Checks that orch is installed
- Looks for existing gstack artifacts - review logs, test plans, CEO review output
- Decides what to do based on context
If agents are already running, it offers to attach, send messages, or tear down. If a reviewed plan exists, it generates specs and launches agents. If unreviewed plans exist, it nudges you to run /plan-eng-review first. If nothing exists, it asks what you want to build.
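Condensed into shell, that decision tree looks roughly like this. It's a sketch, not the skill's actual code: the tmux session name, the artifact filenames, and the orch up arguments are illustrative assumptions.

```bash
# A sketch of /orch's decision logic (names and paths assumed).
PROJ="$HOME/.gstack/projects/$(basename "$PWD")"

if tmux has-session -t orch 2>/dev/null; then        # session name assumed
  echo "Agents running. Attach, send a message, or tear down?"
elif [ -f "$PROJ/eng-review.md" ]; then              # artifact name assumed
  orch specgen                                       # generate role specs
  for role in engineer pm reviewer; do orch up "$role"; done
elif [ -f "$PROJ/plan.md" ]; then
  echo "Plan exists but is unreviewed. Run /plan-eng-review first."
else
  echo "No artifacts yet. What do you want to build?"
fi
```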
The interesting part is the spec generation step. It calls orch specgen under the hood, which analyzes the codebase (tech stack, project structure, existing tests) and feeds that analysis along with the plan to Claude in print mode. Out come three role-specific specs - engineer, PM, reviewer - that reference actual file paths, actual test patterns, and actual dependencies from your project.
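The pattern is simple enough to approximate by hand. A minimal sketch, assuming the reviewed plan lives in plan.md; the codebase analysis here is deliberately crude, and everything beyond claude -p (print mode) is illustrative rather than orch's actual implementation:

```bash
# Collect a rough codebase summary, then generate one spec per role.
ANALYSIS=$(git ls-files | head -200; cat go.mod package.json 2>/dev/null)

mkdir -p specs
for role in engineer pm reviewer; do
  claude -p "Here is a reviewed plan and a codebase summary.

Plan:
$(cat plan.md)

Codebase:
$ANALYSIS

Write a detailed $role spec that references real file paths, test patterns,
and dependencies from this project." > "specs/$role.md"
done
```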
The flow looks like this:

```
/plan-ceo-review → plan.md (reviewed, stress-tested)
/plan-eng-review → plan.md (architecture locked in)
/orch            → specs generated → agents launched
```
Before the fork, steps 1-2 happened in gstack and step 3 happened manually. Now it’s one continuous flow.
2. Execution handoffs in the review skills
The subtler change: both /plan-ceo-review and /plan-eng-review now have an “Execution Handoff” section that runs after the review completes. If orch is installed and the task is substantial enough - roughly 8+ files or 2+ implementation phases - the skill offers to spin up orch agents right there.
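The skill applies that threshold in prose rather than code, but mechanized it would look something like this; the plan-parsing heuristics are purely illustrative:

```bash
# Rough size check on the reviewed plan (parsing assumed).
FILES=$(grep -oE '`[^`]+\.[a-z]+`' plan.md | sort -u | wc -l)  # file paths named in the plan
PHASES=$(grep -icE '^#+ *phase' plan.md)                       # phase headings

if [ "$FILES" -ge 8 ] || [ "$PHASES" -ge 2 ]; then
  echo "This looks like a multi-agent task. Generate specs and launch agents?"
fi
```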
This matters because it catches the moment of highest intent. You just finished reviewing a plan. You’re mentally committed. The handoff says: “This looks like a multi-agent task. Want me to generate specs and launch agents?” One confirmation and you’re running.
For smaller tasks, it stays silent. Not everything needs three parallel agents. A two-file bug fix is faster with a single Claude session.
3. Keeping the fork alive
The boring but necessary piece: a GitHub Actions workflow that syncs with upstream every 6 hours. Garry and contributors are actively developing gstack, and I don’t want to maintain a divergent fork.
The workflow fetches garrytan/gstack main and attempts a clean merge. If it succeeds, it pushes directly. If there are conflicts - which almost always hit the orch-specific sections - it uses the Claude API to resolve them automatically.
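The happy path is plain git. A sketch of the core step, minus the checkout and auth plumbing a real workflow needs:

```bash
# Fetch upstream and attempt a clean merge; escalate only on conflict.
git remote add upstream https://github.com/garrytan/gstack.git 2>/dev/null
git fetch upstream
if git merge --no-edit upstream/main; then
  git push origin main                    # clean merge: push directly
else
  git diff --name-only --diff-filter=U    # conflicted files, handed to Claude
fi
```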
```yaml
# The conflict resolution prompt tells Claude exactly what to preserve
- name: AI-assisted conflict resolution
  run: |
    claude -p --system-prompt "You are resolving git merge conflicts
      in a fork of gstack. The fork adds orch integration sections
      to plan-ceo-review and plan-eng-review skills, and changes
      update URLs to point to jeffdhooton/gstack. Preserve all
      fork-specific changes while accepting upstream improvements."
```
If Claude resolves everything cleanly, it opens a PR for review. If it can’t, it opens a PR with conflict markers for manual resolution. In practice, the conflicts are predictable - they’re always in the same few sections - so Claude handles them reliably.
This is a pattern I’d use for any opinionated fork: automate the merge, let AI handle the predictable conflicts, escalate the weird ones. The fork stays current with upstream while preserving local modifications.
Why does automating the plan-to-execution handoff matter?
The individual changes are small. But the compound effect is significant.
Before the fork, the workflow was:
- Think about what to build (unstructured)
- Maybe write a plan (optional, often skipped)
- Write 3 spec files manually (30-60 minutes, tedious)
- Run orch up commands (copy-paste from memory)
- Monitor and iterate
After the fork:
- /plan-ceo-review - structured product review
- /plan-eng-review - structured architecture review
- /orch - specs generated and agents launched
Steps 1-3 are guided conversations. You’re making decisions, not doing busywork. The translation from “reviewed plan” to “running agents” is automated.
The quality improvement is downstream of the process improvement. When spec generation was manual, I’d cut corners. I’d write vague PM specs because I was tired of writing specs. The engineer would get a detailed spec and the PM would get “check in periodically and coordinate.” Now all three specs get the same level of detail because it’s generated from the same plan with the same codebase analysis.
Why should planning and execution tools stay separate?
gstack is a planning tool. It’s great at structured thinking, review, and decision-making. Orch is an execution tool. It’s great at running parallel agents with defined roles and communication channels.
Trying to make one tool do both would have been worse. gstack’s skill architecture is designed for single-session interactive workflows - you talk to Claude, it guides you through a structured process. Orch is designed for headless parallel execution - agents run autonomously with minimal human interaction.
The fork is just the glue. It detects when planning is done and execution should begin, generates the translation layer (specs), and hands off. Each tool stays focused on what it does well.
This is the same separation of concerns that makes the PM/engineer/reviewer agent pattern work. The PM doesn’t write code. The reviewer doesn’t merge. The planning tool doesn’t execute. The execution tool doesn’t plan. The boundaries are where the value is.
What happened in the three weeks after?
I wrote the original version of this post on March 20. At that point, the fork had 21 skills and was at version 0.9.5. Three weeks later, it’s at version 0.21.4 with 48 skills. Thirteen of those are entirely new. The rest were significantly enhanced. What started as “add orch integration to the planning skills” turned into something much bigger.
Here’s what grew out of it.
How did the design pipeline change everything?
The original fork was planning-to-execution. But what about the step before planning, where you decide what the thing should look like?
/design-consultation was already in gstack. It interviews you about your product, researches the landscape, and generates a complete design system as a DESIGN.md file. Typography, colors, spacing, motion, the whole token set. But the output was a document. You still had to manually translate it into code.
Three skills closed that gap:
/design-ref loads brand design systems from 55+ companies. Stripe, Airbnb, Apple, Linear, Figma, Notion. You pick one as a reference and it fetches the actual design tokens, colors, type scales, spacing rules, and component patterns. These get cached locally and applied to your DESIGN.md. So instead of starting from “I want something clean and modern,” you start from “I want Stripe’s type hierarchy with Linear’s color system.”
/design-shotgun generates multiple design variants and opens a comparison board. You give it feedback, it iterates. It’s visual brainstorming without Figma.
/design-html takes whatever you approved, whether from /design-shotgun, a CEO review, or a plain description, and generates production HTML/CSS. Not mockups. Actual code you can ship.
The pipeline now looks like:

```
/design-ref (load brand tokens)
→ /design-consultation (generate DESIGN.md)
→ /design-shotgun (explore variants)
→ /design-html (generate production code)
→ /design-review (QA the visual output)
```
Each step feeds the next. No manual translation. The design review skill catches spacing issues, hierarchy problems, and what it calls “AI slop patterns,” visual artifacts that look obviously AI-generated. It fixes them in the source code, commits each fix atomically, and re-verifies.
How does cross-session messaging work without a server?
Orch coordinates agents through files and SQLite. But sometimes you don’t need the full orchestrator. You just want two Claude Code sessions to talk to each other.
/inbox is the low-tech version. File-based messaging between concurrent sessions. No server, no database. Each session gets an inbox directory. Messages are text files. A PreToolUse hook checks for new messages inline, so you see them without polling.
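The mechanics fit in a few lines of shell. A minimal sketch; the directory layout and message naming are my assumptions, not necessarily the skill's:

```bash
# File-based messaging between two concurrent sessions (layout assumed).
ME=".claude/inbox/session-a"
PEER=".claude/inbox/session-b"
mkdir -p "$ME" "$PEER"

# Send: drop a timestamped text file into the other session's inbox.
echo "auth module ready for review" > "$PEER/$(date +%s).msg"

# Receive: the PreToolUse hook runs a check like this before each tool call.
for msg in "$ME"/*.msg; do
  [ -e "$msg" ] || break        # no messages: the glob didn't match
  echo "inbox: $(cat "$msg")"
  rm "$msg"
done
```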
/pair builds on /inbox to create structured pair programming. One session builds, the other reviews in real time. It assigns roles, sets up the communication protocol, manages handoffs, and prevents conflicts (two sessions editing the same file). Think of it as the PM/engineer pattern from orch, but lighter weight and for two humans-plus-agents working together.
The design decision I like most: /inbox uses a claim system for work items. Before a session picks up a task, it claims it. Other sessions see the claim and skip it. No coordination server, no locks, just a file that says “session-abc claimed this at 14:32.” Good enough for the concurrency level we’re dealing with.
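Atomic file creation is the whole trick. A sketch assuming a shared claims directory; bash's noclobber mode makes the redirect fail if the file already exists, which turns creation into a test-and-set:

```bash
# Claim a work item by creating its claim file atomically (layout assumed).
claim() {
  local task="$1" file=".claude/claims/$1"
  mkdir -p .claude/claims
  if (set -C; echo "session-$$ claimed at $(date +%H:%M)" > "$file") 2>/dev/null; then
    echo "claimed $task"
  else
    echo "skipping $task: already $(cat "$file")"
    return 1
  fi
}
```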
What does the project hygiene suite do?
Not every skill needs to be clever. Some of the most useful ones are boring.
/env-sync audits environment variables. It finds vars used in code but missing from .env.example, and vars in .env.example that code never references. Framework-agnostic. Works with Node, PHP, Python, Ruby, Go, Rust. The kind of thing that catches deployment bugs before they happen.
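For a single framework the audit reduces to a set difference. A sketch for a Node codebase; the real skill detects the framework and its reference patterns itself:

```bash
# Vars referenced in code vs. vars declared in .env.example.
grep -rhoE 'process\.env\.[A-Z0-9_]+' src/ | sed 's/.*\.//' | sort -u > /tmp/used
grep -oE '^[A-Z0-9_]+' .env.example | sort -u > /tmp/declared

echo "used in code, missing from .env.example:"
comm -23 /tmp/used /tmp/declared
echo "declared, never referenced in code:"
comm -13 /tmp/used /tmp/declared
```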
/deps wraps every language’s dependency audit tool (npm audit, composer audit, pip-audit, cargo audit, etc.) into a unified report. Outdated packages, known CVEs, unused deps, license issues. One command, any stack.
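The unification is mostly dispatch, something like:

```bash
# Run whichever audit tool matches the stack; the real skill merges
# the outputs into one report.
[ -f package.json ]     && npm audit
[ -f composer.json ]    && composer audit
[ -f requirements.txt ] && pip-audit
[ -f Cargo.toml ]       && cargo audit
```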
/test-gen reads your existing tests to learn your project’s style, then writes new tests that match. It finds functions with no test coverage, prioritizes by complexity, and generates tests using the same framework, assertion style, naming conventions, and mocking patterns your project already uses. It doesn’t impose opinions. It mirrors yours.
/index generates a compact codebase index that gives AI assistants instant context. Maps routes, models, lib exports, pages, and components into small reference files. The claim is that it replaces 50K+ tokens of exploration per conversation. In practice, it means the agent starts working immediately instead of spending the first two minutes reading your project structure.
None of these are flashy. They’re the kind of thing that saves you 10 minutes every day, which adds up fast.
What is the notetaker and why does it matter?
/notetaker watches your tool calls via a PostToolUse hook and journals everything you do. Then it analyzes the journal to spot repeatable patterns.
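Claude Code hooks receive each tool call as JSON on stdin, so the capture half can be a one-liner registered as a PostToolUse hook. The journal path is my assumption:

```bash
# Append each tool call to a JSONL journal
# (registered as a PostToolUse hook in .claude/settings.json).
jq -c '{ts: now, tool: .tool_name, input: .tool_input}' \
  >> "$HOME/.gstack/journal.jsonl"
```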
The output isn’t a summary. It’s a full skill body. If it notices you keep doing the same three-step process, checking git status, running tests, then committing with a specific message format, it generates a skill that automates that exact workflow. Complete with the SKILL.md frontmatter, trigger conditions, and step-by-step instructions.
So the framework is building itself. You use gstack, it watches how you use it, and suggests new skills based on your actual patterns. I’m not sure how useful the generated skills are yet. The ones I’ve seen range from “yeah, that’s exactly what I do” to “close but not quite.” But the concept is right. The best automation targets are the things you do repeatedly without thinking about them.
How do you maintain a fork that moves this fast?
The auto-sync workflow from the original post is still the backbone. Every 6 hours, GitHub Actions fetches upstream, attempts a merge, and either pushes directly or uses Claude to resolve conflicts.
But the scale changed. When the fork was three modified files, conflicts were predictable. Now there are 48 skills, custom infrastructure for cross-session messaging, a browser extension, and a design library. The conflict surface is larger.
Two things kept it manageable.
First, most new skills are additive. They’re new files in new directories, not modifications to existing upstream files. A new skill doesn’t conflict with anything because upstream doesn’t have it. The conflicts still cluster around the same few files they always did: the README skill table, the CHANGELOG, and the plan review skills where orch integration lives.
Second, the fork’s additions are clearly scoped. The orch integration sections are always at the bottom of a skill file, clearly marked. Claude can identify them because they follow a consistent pattern. When upstream modifies the top of the file and the fork modifies the bottom, the merge is mechanical.
Going from v0.9.5 to v0.21.4 in three weeks, with upstream also shipping constantly, only produced a handful of merge conflicts that needed manual attention. Most were resolved automatically. The pattern works at scale, which is the thing I wasn’t sure about when I wrote the original post.
What’s next?
The pipeline is approaching something like completeness. Idea to deployed and monitored, without leaving Claude Code:
```
/office-hours (brainstorm)
→ /plan-ceo-review (product review)
→ /plan-design-review (visual review)
→ /plan-eng-review (architecture review)
→ /design-ref + /design-html (design to code)
→ /build or /orch (implementation)
→ /review (PR review)
→ /ship (push + PR)
→ /land-and-deploy (merge + deploy)
→ /canary (monitor production)
```
The gap now is in the feedback loops between stages. When /canary detects a regression, it should trigger /investigate automatically. When /design-review finds issues, it should feed back into the design stage rather than just fixing them in place. The stages work. Making them talk to each other is the next problem.
The /notetaker might help here. If it can observe the patterns in how I manually bridge stages, “canary found a bug, I ran investigate, then I fixed it and shipped again,” it could generate the glue skills that automate those transitions. The pipeline builds itself by watching how you use it. That’s the bet.