Spec-Driven Development with spec-kit

“Spec-driven development” is an approach to software development in which one writes a formal specification, or “spec,” before writing any code. A spec is a formalized plan that includes “user stories” (what a potential user would want to do with the code) and acceptance criteria (a detailed list of what counts as acceptable behavior for the code). Writing one by hand would be extremely tedious for the small-userbase computational biology projects we work on, but it’s 2026 and agents can do all the hard work for us.

Spec-kit is a toolkit from GitHub for spec-driven development with an agent. It addresses the central problem of agentic coding: agents require frequent guidance to do the right thing. Once you have a spec, the guardrails are set up in advance and the agent can execute without constant feedback. Spec-kit works with all major agentic coding tools.

I really recommend this approach. It feels like the way software development should be: you write out a file (with an agent) describing what you want, and then the agent asks clarifying questions, carefully works through edge cases, and produces a detailed plan, including tests. The agent then writes all the code and tests.

I recently used spec-kit to build a bioinformatics pipeline from scratch in a day. This post walks through that experience. Seeing it in action should be much more helpful than describing the process in vague terms.

Note 1: Spec-kit also works wonderfully for implementing new features in existing codebases. The pattern I like is: (1) raise a GitHub issue describing the vision for the feature, (2) create a spec-kit subdirectory named after the issue number (e.g. specs/298-mpnn-residual), (3) run through the full spec-kit workflow there, and (4) merge. This keeps each feature’s spec artifacts organized and co-located without cluttering the rest of the repo.

Note 2: Some respected colleagues have suggested that Claude Code’s improving long-range planning makes spec-kit irrelevant. That may become true in the future, but as of Claude Code 2.1.42 in February 2026, the depth of planning, consistency checks, and research that spec-kit produces far exceeds what you get from Claude Code alone in my experience.

Start with a vision

My from-scratch bioinformatics pipeline started with agentically writing a docs/vision.md file that described, in broad strokes, what I wanted to do. The agent also did some exploration of the databases we could use and the scale of computation involved, resulting in a docs/exploration-notes.md.

The Constitution

The process begins by installing spec-kit and establishing a “constitution”: the governing principles for your project. I pointed the agent at my existing coding standards and project documentation (note here that my prompts start with ❯ and are indented):

    ❯ install https://github.com/github/spec-kit read docs and readme for context
      for the constitution we are going to follow
      @/Users/matsen/.claude/agents/clean-code-reviewer.md and
      @/Users/matsen/.claude/agents/snakemake-pipeline-expert.md and
      ../dasm2-experiments/CLAUDE.md Other points include that we want to not
      have perfection be the enemy of good. We want to build a prototype on
      smaller datasets and then scale up to huge datasets. the goal is dasm
      training on all protein families, see ../nexus

The agent reads the spec-kit templates and creates a constitution combining principles from my existing standards:

    Now I have the context. Let me write the constitution combining principles
    from the agents and dasm2-experiments, tailored for this Snakemake
    bioinformatics pipeline project.

    Reading spec-kit templates... (ctrl+c to interrupt · 2m 11s · thought for 7s)

    Read spec-kit templates for consistency checks
    Write constitution.md based on principles from agents and dasm2
    Update .gitignore for ISSUE-*.md and other patterns
    Verify template consistency

The result is a set of principles tailored to the project. In this case they emphasize prototype-first development, fail-fast error handling, Snakemake best practices, real tests with real data, and clean code essentials. It also lists the key spec-kit commands you’ll use:

    Constitution Principles
    1. Prototype-First Development — Work on test datasets before scaling
    2. Fail Fast, Never Silently — No silent defaults; crash with clear errors
    3. Snakemake Best Practices — YAML configs, organized outputs, declarative rules
    4. Real Tests, Real Data — No mocks; use actual fixtures
    5. Clean Code Essentials — Single responsibility, meaningful names, DRY

    Next Steps

    # View constitution
    cat .specify/memory/constitution.md

    # Commit the setup
    git add .specify .claude CLAUDE.md .gitignore
    git commit -m "Add spec-kit setup and project constitution"

    You can now use spec-kit commands:
    - /speckit.specify — Create feature specifications
    - /speckit.plan — Create implementation plans
    - /speckit.tasks — Generate task lists
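To make principle 2 concrete, here is a minimal sketch of what fail-fast config validation can look like. The key names and helper function are hypothetical illustrations, not the pipeline’s actual schema:

    # A minimal sketch of principle 2 (fail fast, never silently); the key
    # names and helper are hypothetical, not the pipeline's actual schema.
    REQUIRED_KEYS = ["superfamily_id", "clustering_identity", "min_cluster_size"]

    def validate_config(config: dict) -> None:
        """Exit immediately with a clear error instead of using silent defaults."""
        missing = [key for key in REQUIRED_KEYS if key not in config]
        if missing:
            raise SystemExit(f"Missing required config keys: {', '.join(missing)}")

    validate_config({"superfamily_id": "SSF50939"})  # exits: two keys missing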

/speckit.specify: Creating the Specification

With the constitution in place, you run /speckit.specify to create a detailed specification. I pointed it at my project’s vision documents in docs from the first step:

    ❯ /speckit.specify . see @docs/ and @README.md
      └─ Listed directory docs/
      └─ Read README.md (21 lines)

    I'll analyze the docs and README to understand the project and create a
    specification for the superfamily-pcp pipeline.

    Read(docs/vision.md)
    └─ Read 247 lines
    Read(docs/exploration-notes.md)
    └─ Read 167 lines
    Read(.specify/templates/spec-template.md)
    └─ Read 116 lines

    Now I have a clear understanding of the project. Let me check for existing
    branches and create the feature branch with an appropriate short name.

The agent produces a structured spec with user stories, functional requirements, and success criteria. (The goal of this pipeline is to extract PCPs—parent-child pairs from ancestral sequence reconstructions on phylogenetic trees—but the details aren’t important for understanding the spec-kit workflow.)

    What the spec covers:

    5 prioritized user stories:
    1. P1: Run pipeline on test dataset (~100 sequences) for validation
    2. P2: Extract PCPs from sialidase superfamily (SSF50939, ~157K sequences)
    3. P2: Configure clustering identity thresholds (50%, 70%, 90%)
    4. P3: Filter low-quality clusters by size
    5. P3: Process Influenza HA superfamily (SSF49818)

    13 functional requirements covering the full pipeline flow:
    - Sequence retrieval from UniProt via SSF/Pfam IDs
    - MMseqs2 clustering with configurable identity
    - Per-cluster MAFFT alignment → tree building → PRANK ASR → PCP extraction
    - Configuration validation (fail-fast)
    - Taxonomic domain preservation
    - Parallel execution via Snakemake

    8 measurable success criteria for verification.

    The spec is ready for /speckit.clarify (if you want to refine further) or
    /speckit.plan to design the implementation.

    ✱ Brewed for 2m 26s

You can iterate here with the agent. For example, as I reviewed the spec, I noticed it was missing some requirements around tree rooting and directionality that were important for my use case. The agent updated the spec with new functional requirements:

    Good catch - directionality is essential for DASM since it models selection
    from ancestor → descendant. I'll update the spec to include tree rooting.

    ⦿ Updated spec.md — added 2 lines, removed 1 line

     98 +FR-002: System MUST cluster sequences using MMseqs2 at a configurable identity (default 99.5%)
     99 +FR-003: System MUST perform multiple sequence alignment on each cluster using MAFFT with the --auto flag
    102 +FR-004: System MUST reconstruct ancestral sequences using PRANK ASR
    103 ...
        +FR-005: System MUST root trees to establish ancestor-to-descendant directionality, using taxonomic outgroups where available (e.g., use bacterial sequences to root viral clusters, or use more distant taxonomic groups within the cluster)
    106 +FR-006: System MUST reconstruct ancestral sequences using PRANK ASR on rooted tree
    108 +FR-007: System MUST extract parent-child pairs from each tree edge (parent = ancestor, child = leaf)
    109 +FR-008: System MUST validate all required configuration parameters at startup and fail fast (exit immediately) if any are missing
    111 +FR-009: System MUST filter clusters by configurable minimum and maximum size thresholds
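For intuition about FR-007, here is a rough sketch of what extracting a parent-child pair from every tree edge might look like with ete3. It assumes a `seqs` dict mapping node names (leaves and ASR ancestors) to sequences; the details are hypothetical, not the pipeline’s actual code:

    from ete3 import Tree

    # A rough sketch of FR-007, assuming a `seqs` dict mapping node names
    # (leaves and ASR-inferred ancestors) to sequences; details hypothetical.
    def extract_pcps(tree: Tree, seqs: dict[str, str]) -> list[tuple[str, str]]:
        pcps = []
        for node in tree.traverse():
            if node.up is not None:  # every non-root node forms an edge with its parent
                pcps.append((seqs[node.up.name], seqs[node.name]))  # (parent, child)
        return pcps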

I also noted that I wanted the spec to be tool-agnostic at this point:

    Updated. The spec is now workflow-agnostic:

    Changes:
    - FR-012: "parallel execution across clusters" (no Snakemake reference)
    - FR-013 (new): Record wall-clock time for each step for bottleneck analysis
    - SC-009 (new): Timing data available for scaling decisions
    - User Story 1: Removed snakemake --configfile command
    - Deferred Decisions section added: Documents that Snakemake vs Nextflow
      choice will be made during planning, informed by benchmarking data

    The spec now focuses on what the pipeline needs to do (parallel execution,
    timing, scaling) without prescribing how (which workflow engine). The
    planning phase can evaluate both options with the prototype.

    ✱ Worked for 36s
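A requirement like FR-013 is cheap to satisfy with a small timing wrapper. A sketch, with hypothetical names:

    import time
    from contextlib import contextmanager

    # A sketch of FR-013 (record wall-clock time per step); names hypothetical.
    @contextmanager
    def timed_step(name: str, log: dict):
        start = time.perf_counter()
        try:
            yield
        finally:
            log[name] = time.perf_counter() - start  # wall-clock seconds

    timings = {}
    with timed_step("clustering", timings):
        ...  # run MMseqs2 here
    print(timings)  # e.g., {'clustering': 0.0001}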

/speckit.clarify: Resolving Ambiguities

Once you have a draft spec, /speckit.clarify helps identify and resolve gaps. The agent analyzes your spec and presents questions organized by category.

Each question comes with recommended options based on the constitution principles. For example, here’s a question about edge case handling with clear options and a recommendation:

    ❯ /speckit.clarify

    Category: Completion Signals    Status: Clear    Notes: Acceptance criteria testable

    Assessment: The spec is solid overall. The main gaps are in edge case
    resolution and some non-functional targets. However, given this is a
    research/prototyping pipeline, I have one high-impact question worth
    asking—the rest are either low-impact or better deferred to planning.

    Question 1 of 1
    Category: Edge Cases & Failure Handling

    When the UniProt API is unavailable or returns an error during sequence
    retrieval, what should the pipeline do?

    Recommended: Option B - This balances robustness with fail-fast principles.
    Transient network issues shouldn't require full reruns, but persistent
    failures should surface clearly.

    Option  Description
    ──────────────────────────────────────────────────────────────
    A       Fail immediately with clear error (no retry)
    B       Retry with exponential backoff (3 attempts), then fail
    C       Cache sequences locally; use cache if API unavailable
    D       Skip unavailable superfamilies and continue with others

    You can reply with the option letter (e.g., "A"), accept the recommendation
    by saying "yes" or "recommended", or provide your own short answer.
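Option B (which I accepted) corresponds to a pattern like the following; the function name and URL handling are hypothetical stand-ins for the real UniProt client:

    import time
    import requests

    # A sketch of Option B: up to 3 attempts with exponential backoff, then
    # fail loudly. The function name and URL handling are hypothetical.
    def fetch_with_backoff(url: str, attempts: int = 3) -> bytes:
        for i in range(attempts):
            try:
                response = requests.get(url, timeout=60)
                response.raise_for_status()
                return response.content
            except requests.RequestException:
                if i == attempts - 1:
                    raise  # persistent failure surfaces clearly
                time.sleep(2 ** i)  # 1s, then 2s, between attempts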

After working through the questions, you get a summary showing which categories are resolved and which are deferred to the planning phase:

    Category: Integration & External Dependencies   Status: Resolved
    Category: Edge Cases & Failure Handling         Status: Deferred (remaining 5 are tool-specific, better resolved in planning)
    Category: Constraints & Tradeoffs               Status: Clear
    Category: Terminology & Consistency             Status: Clear
    Category: Completion Signals                    Status: Clear

    Deferred items rationale: The remaining edge cases (non-standard amino
    acids, identical sequences, divergent sequences, ASR failures, rooting
    without taxonomic diversity) depend on tool behavior and are better
    addressed during planning when specific tools are evaluated.

    Recommended next step: /speckit.plan

Important: at this point you want to read the spec in detail. This is the best place to make sure that the process is going as desired.

/speckit.plan: Research and Planning

The planning phase is quite remarkable. /speckit.plan researches open questions for you. Here the agent spawns multiple research agents to investigate workflow engines, tree rooting methods, and output formats:

    ❯ /speckit.plan

    Now I need to do Phase 0 research to resolve the NEEDS CLARIFICATION items.
    The main ones are:
    1. Workflow engine: Snakemake vs Nextflow
    2. Tree rooting approach: How exactly to implement taxonomic reconciliation
    3. PCP output format: What format does DASM training expect?
    4. Edge case handling: Non-standard amino acids, identical sequences, ASR failures

    Let me research these topics.

    Running 3 agents... (ctrl+o to expand)
    ├─ Task (Research Snakemake vs Nextflow) · 2 tool uses · 21.5k tokens
    │  └─ Web Search: Nextflow metrics timing benchmarking pipeline
    ├─ Task (Research tree rooting methods) · 3 tool uses · 21.5k tokens
    │  └─ Web Search: gene tree species tree reconciliation software to...
    └─ Explore (Research PCP output format) · 25 tool uses · 69.6k tokens
       └─ Bash: head -200 /Users/matsen/re/dasm-experiments/dnsme...

    ctrl+b to run in background

The research is thorough. Here the agent is discussing taxonomy tools, comparing ETE3’s NCBITaxa with alternatives like taxoniq and direct NCBI dumps:

    So for the rooting algorithm you described:
    - Tree traversal: ETE3 (has viz)
    - NCBI taxonomy queries: Still need to figure out the best tooling

    For the taxonomy MRCA lookups, options I know of:
    1. ETE3's NCBITaxa - downloads and uses a local SQLite database of NCBI taxonomy
    2. taxoniq - another Python library with local SQLite
    3. Direct NCBI Taxonomy dump - download taxdump.tar.gz and query yourself

    Should I research these taxonomy tools more, or do you have a preference?
    ETE3's NCBITaxa would keep everything in one library.

    ❯ yeah let's use NCBITaxa, but can it do the MRCA that I'm hoping for?

It even writes code sketches to verify that an approach will work. In response to the question above, it did the following:

    Fetch(http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html)
    └ Received 30KB (200 OK)

    get_topology() returns the minimal tree connecting your taxa - so the root
    of that tree is the MRCA. But that seems indirect for just getting an MRCA
    taxid. Let me check if there's a more direct method:

    Web Search("ete3 NCBITaxa get_common_ancestor MRCA LCA taxid python")
    └ Did 1 search in 20s

    Yes! The approach works like this:

    from ete3 import NCBITaxa
    ncbi = NCBITaxa()

    # Get MRCA of multiple taxids
    tree = ncbi.get_topology([9606, 9598, 10090])  # human, chimp, mouse
    mrca_node = tree.get_tree_root()
    print(mrca_node.taxid, mrca_node.rank)  # e.g., Euarchontoglires, superorder

All the research culminates in a decision summary—a clear table of choices made and their rationale:

    Updated research.md with:
    - Decision 7: AA sequences primary; NT optional via EMBL/ENA xrefs for subset
    - Decision 8: Both FastTree and IQ-TREE as config options (IQ-TREE default)
    - Decision 9: NCBITaxa pre-download for cluster deployment

    Also updated data-model.md config schema with tree_builder option.

    Summary of all decisions made:

    #  Topic            Decision
    ──────────────────────────────────────────────────────────────
    1  Workflow engine  Snakemake
    2  Tree rooting     Taxonomic MRCA propagation + midpoint fallback (ETE3/NCBITaxa)
    3  PCP format       Simplified CSV for AA sequences
    4  Edge cases       Replace non-standard AA with X; skip identical clusters; log failures
    5  UniProt API      Retry with exponential backoff
    6  Dependencies     Pixi, conda-forge only
    7  Sequence type    AA primary; NT optional future enhancement
    8  Tree builder     IQ-TREE default, FastTree as option
    9  NCBITaxa         Pre-download for cluster deployment

    Anything else, or ready for /speckit.tasks?
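Decision 2 might reduce to something like this ete3 sketch: root on a taxonomically distant outgroup when one exists, and fall back to midpoint rooting when the cluster lacks taxonomic diversity. It assumes the outgroup leaf (if any) has already been chosen via the NCBITaxa MRCA lookups above; the names are hypothetical:

    from ete3 import Tree

    # A sketch of Decision 2, assuming the taxonomic outgroup (if any) has
    # already been chosen via NCBITaxa MRCA lookups; names are hypothetical.
    def root_tree(tree: Tree, outgroup_leaf: str | None) -> Tree:
        if outgroup_leaf is not None:
            tree.set_outgroup(tree & outgroup_leaf)  # root on the taxonomic outgroup
        else:
            tree.set_outgroup(tree.get_midpoint_outgroup())  # midpoint fallback
        return tree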

/speckit.tasks: Breaking Down the Work

With the plan in place, /speckit.tasks generates a comprehensive task breakdown. The output looks like:

    ★ Generated /Users/matsen/re/superfamily-pcp/specs/001-pcp-pipeline/tasks.md
    - Total Tasks: 49
    - Task breakdown by user story:
      - Setup (Phase 1): 5 tasks (TB01-TB05)
      - Foundational (Phase 2): 5 tasks (TB06-TB10)
      - User Story 1 - Test Dataset (P1): 6 tasks (TB11-TB29) w/ MVP
      - User Story 2 - Sialidase Superfamily (P2): 6 tasks (TB30-TB35)
      - User Story 3 - Clustering Parameters (P2): 5 tasks (TB36-TB38+)
      - User Story 4 - Quality Filtering (P3): 4 tasks (TB39-TB42)
      - User Story 5 - Influenza HA (P3): 4 tasks (TB43-TB46)
      - Polish (Phase 8): 4 tasks (TB46-TB49)
    - Parallel opportunities identified:
      - Setup: TB02, TB03, TB04 can run in parallel
      - Foundational: TB06, TB08 can run in parallel
      - Polish: TB46, TB47 can run in parallel
    - Independent test criteria:
      - US01: Run test config (~100 seq), verify combined.csv output
      - US02: Verify PCPs span virus/Bacteria/Eukaryota domains
      - US03: Compare cluster sizes at different identity thresholds
      - US04: Verify filtered clusters excluded from processing
      - US05: Verify bacterial HA sequences in output
    - Suggested MVP scope: Complete Phases 1-3 (Setup + Foundational + User
      Story 1) to get a working pipeline on test data.
    - Format validation: ✓ All tasks follow checklist format (checkbox, ID,
      labels, file paths)
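For reference, the checklist format it validates looks roughly like this; the task line below is illustrative, not copied from the generated tasks.md:

    - [ ] TB01 [P] Set up pixi environment with pipeline dependencies (pixi.toml)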

/speckit.analyze: Catching Problems Early

Before implementation, /speckit.analyze performs a read-only consistency check across the spec, plan, and tasks. Like /speckit.clarify, it’s a quality gate: it catches duplications, ambiguities, coverage gaps, and contradictions between artifacts before you’ve written any code. It also checks alignment with your constitution, so principles you established at the start don’t silently erode as the spec evolves.

/speckit.implement: Implementation and Validation

From here, the agent executes the implementation. After working through the tasks, here’s the summary of key features implemented:

    Key Features Implemented
    1. Pipeline Flow: UniProt → MMseqs2 cluster → MAFFT align → FastTree/IQ-TREE
       tree → Taxonomic rooting → PRANK ASR → PCP extraction
    2. Config validation: Fail-fast with clear error messages for missing/invalid parameters
    3. Taxonomic rooting: MRCA-based approach with midpoint fallback
    4. Quality filtering: Cluster size thresholds with skip logging
    5. Provenance tracking: Logs and metrics collection for all rules
    6. Multiple superfamilies: Sialidases (SSF50939) and Influenza HA (SSF49818)
    7. Scaling tiers: test (~100 seq) → pilot (~1K) → production (~157K)

    Remaining Tasks (require pixi environment)
    - TB28-TB29: Verify pipeline dry-run and actual execution
    - TB34: Verify pilot config runs
    - TB45: Verify HA pilot config runs
    - TB46-TB49: Documentation, validation and final review

    To test the pipeline, run:

    pixi run snakemake --configfile config/test.yaml -n         # Dry run
    pixi run snakemake --configfile config/test.yaml --cores 4  # Actual run

    Crunched for 11m 25s
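To make the scaling tiers concrete, here is a hypothetical sketch of what a config/test.yaml might contain. The keys are inferred from the functional requirements and decisions above (clustering identity, cluster-size filters, tree builder), not the pipeline’s literal schema:

    # Hypothetical sketch of config/test.yaml; keys inferred from the spec,
    # not the pipeline's literal schema.
    superfamily: SSF50939       # sialidase superfamily
    clustering_identity: 0.995  # MMseqs2 identity threshold (FR-002 default)
    min_cluster_size: 4         # quality filtering (FR-009)
    max_cluster_size: 1000
    tree_builder: iqtree        # or fasttree (Decision 8)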

The pipeline validation shows concrete results—92 sequences processed, 5 clusters created, 146 PCPs extracted:

    Pipeline validation complete. Here's a summary:

    Results
    ┌─────────────────────────┬────────────────────────────────────────┐
    │ Metric                  │ Value                                  │
    ├─────────────────────────┼────────────────────────────────────────┤
    │ Total sequences         │ 92                                     │
    │ Clusters                │ 5                                      │
    │ Total PCPs              │ 146                                    │
    │ Taxonomic breakdown     │ Eukaryota: 29, Bacteria: 45, Virus: 5  │
    └─────────────────────────┴────────────────────────────────────────┘

    Issues Fixed
    1. Python version: Pinned to <3.13 in pixi.toml because ete3 uses deprecated cgi module
    2. Sequence name format: Changed from taxid:NNNNN to taxid_NNNNN to avoid Newick colon conflicts
    3. PRANK flags: Added -once -prunedata flags for robustness with name mismatches
    4. ASR tree output: Preserved PRANK's ancestor tree (ancestors.nwk) for PCP
       extraction since it contains internal node names matching ancestral sequences

    Files Changed
    - pixi.toml - Python version constraint
    - data/samples/test_sialidases.fasta - sequence name format
    - workflow/scripts/root_tree.py - taxid pattern matching
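The Newick fix (#2) is worth internalizing: colons delimit branch lengths in Newick, so they cannot appear in sequence names. The fix amounts to a one-liner (function name hypothetical):

    # A sketch of fix #2: Newick uses ':' to delimit branch lengths,
    # so sequence names must not contain colons. Name hypothetical.
    def sanitize_name(name: str) -> str:
        return name.replace(":", "_")  # taxid:9606 -> taxid_9606

    assert sanitize_name("taxid:9606") == "taxid_9606"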

The agent then helped set up remote execution using the tooling we have available in our group.

    ❯ OK, great. what's the next step? should we run this on remote? note that
      ../dasm2-experiments has a nice makefile for driving remote computation,
      I esp like the tmux thing.

    Good question! Let me check what's in the dasm2-experiments Makefile to
    understand the pattern, then we can discuss next steps.

    Read(~/re/dasm2-experiments/Makefile)
    └─ Read 183 lines

    Nice pattern! The remote-tmux is perfect for Snakemake runs. Let me check
    what server resources are available and propose next steps.

    ┌ Server Status   CPU   Memory   Load Avg   GPU Usage ───
    ├──────────────────────────────────────────────────────────────────
    │ ... +12 lines (ctrl+o to expand)

    Plenty of resources available. Let me check the pilot config and then we
    can set up remote execution.

And here’s the pipeline actually running MMseqs2 clustering on a remote server:

    Index table: Masked residues: 297313
    Index table: fill ] 24.70K 0s 456ms
    Index statistics
    Entries: 7083765
    DB size: 520 MB
    Avg k-mer size: 8.118778
    Top 10 k-mers
        RSDGWS  2236
        RDMGWT  1486
        ...

    (clustering output continues...)

Finally, the pilot run completes with a summary of what was accomplished and issues discovered:

    Bash(bd create "Fix taxonomy lookup returning all Unknown" -p 2 -d "Pilot
    run shows all PCPs have Unknown taxonomy. Need to investigate why
    taxonomic_breakdown shows...")
    └─ ✓ Created issue: superfamily-pcp-s4q
       Title: Fix taxonomy lookup returning all Unknown
       Priority: P2
       Status: open

    Pilot run complete! Summary:
    - 1,225 PCPs extracted from 10 clusters
    - 157K sequences processed
    - Pipeline end-to-end working
    - Taxonomy lookup bug filed (P2)

    Fixes made during this session:
    1. Sequence ID mismatch (UniProt accession extraction)
    2. FASTA description stripping for PRANK compatibility
    3. Added max_clusters config for pilot runs

Conclusion

This is a remarkable milestone for me: we went from zero to a mostly working pipeline, starting with nothing but an idea, and participating mostly by answering some completely reasonable questions. Even without the agentic implementation step, this is a highly productive way to think through all of the gotchas that usually force code rewrites down the road. Give it a shot and let me know how it goes.


This is part 5 of a series on agentic coding:

  1. Agentic Coding from First Principles
  2. Agentic Git Flow
  3. Writing Scientific Code Using Agents
  4. The Human Experience of Coding with an Agent
  5. Spec-Driven Development with spec-kit (this post)

View the complete series →