A deep, evidence-based comparison of two AI software-engineering toolkits. Every claim below traces to files read and tests actually run inside both repositories.
The question worth asking
One of these projects has 105,637 GitHub stars. The other has 956. The natural assumption is that a 110x star gap reflects a 110x quality gap. It does not. Stars measure reach, trust, and timing. They are a weak proxy for engineering depth, and an actively misleading proxy when one author happens to run Y Combinator.
This is a head-to-head between gstack (garrytan/gstack) and Loki Mode (asklokesh/loki-mode). They are often mentioned in the same breath because both turn an AI coding agent into something more structured than a blank prompt. But they are not the same kind of tool, and "which is better" only has a real answer once you separate what they actually do.
The short version: they sit in different categories. gstack augments a human. Loki Mode tries to remove the human from the build loop. Judge them on that, not on the star counter.
The raw numbers (verified live)
| Metric | gstack | Loki Mode |
|---|---|---|
| GitHub stars | 105,637 | 956 |
| Forks | 15,742 | 184 |
| Open issues | 571 | 2 |
| Created | 2026-03-11 | 2025-12-26 |
| Age at measurement | ~3 months | ~5 months |
| Primary language | TypeScript | Python (plus TS and shell) |
| License | MIT (true OSI open source) | BUSL-1.1 (source-available, restricted) |
| Commits on default branch | 307 | 1,043 |
| Distinct authors (default branch) | 1 dominant (297 of 307) | 1 dominant (~1,022 of 1,043) |
| Current version | 1.55.0.0 | 7.8.3 |
| Source LOC (real product, approx) | ~156K TS repo-wide | ~318K across py/ts/sh |
Two facts jump out. First, gstack accumulated 105K stars in roughly three months, faster than Loki Mode reached 1K in five. That velocity is not an engineering measurement. It is distribution: Garry Tan is President and CEO of Y Combinator, and gstack launched on the back of a viral productivity narrative (the Karpathy "I have not typed a line of code since December" quote, the OpenClaw "247K stars solo" story). A founder with that platform starts every launch at a different altitude.
Second, both projects are effectively single-author. gstack is ~97% Garry Tan on the default branch with ten community contributors at exactly one squashed commit each. Loki Mode is ~98% one person (asklokesh / Lokesh Mure are the same author) with exactly two external human commits. Bus factor of one, on both sides. If you are betting your workflow on either, you are betting on one maintainer.
What each one actually is
gstack: a virtual engineering team as slash commands
gstack ("Garry's Stack") is a bundle of about 54 Markdown "skills" (slash commands) that give a Claude Code agent specialist personas, plus a genuinely substantial headless browser CLI for QA. The core unit is the SKILL.md file: a prompt template that turns the agent into a CEO reviewer, an eng manager, a designer who hunts AI slop, a security officer running OWASP and STRIDE, a QA lead who opens a real browser, or a release engineer who ships the PR.
It is human-in-the-loop by design and by stated philosophy. The workflow is a sprint (Think, Plan, Build, Review, Test, Ship, Reflect), and each skill writes artifacts the next one reads. The README is blunt about the positioning: "Eight commands, end to end. That is not a copilot. That is a team." Skills pause for your input at decision points. The agent does not run off and build the whole thing while you sleep.
The browser is the part people underrate. browse/ is a real Playwright-based headless browser compiled to a single Bun binary, roughly 24,000 lines of source exposing 76 commands (navigate, click, fill, extract text/html/forms/accessibility, screenshot, inspect console/network/perf, drive tabs, bridge to CDP). It runs as a daemon with a layered prompt-injection defense (datamarking, an ONNX classifier, a Haiku transcript check, canary tokens). This is not a prompt pack pretending to browse. It is a working automation engine.
Target users, per the README: founders and CEOs who still ship, first-time Claude Code users who want structure over a blank prompt, and tech leads who want review/QA/release rigor on every PR.
Loki Mode: autonomous spec-to-product
Loki Mode is an autonomous build system. You hand it a spec (a PRD, a GitHub or Jira issue, an OpenAPI/JSON/YAML file, or a one-line brief, auto-detected), and it drives a coding-agent CLI through repeated autonomous cycles until automated verification passes, emitting a Git repo with source, tests, configs, and audit logs.
The core abstraction is the RARV-C closure loop: Reason, Act, Reflect, Verify, Close. It is implemented in autonomy/run.sh, a 13,322-line bash orchestrator (the CLI dispatcher autonomy/loki is another 24,523 lines). The loop rotates model tiers per phase: a planning tier for Reason, a dev tier for Act and Reflect, a fast tier for Verify. The pitch is "describe what you want, walk away, come back to working code with tests."
Several of the headline claims hold up when you read the code:
- 41 agent types across 8 swarms is real.
agents/types.jsonis a 41-element array (Engineering 8, Operations 8, Business 8, Data 3, Product 3, Growth 4, Review 3, Orchestration 4). - Blind 3-reviewer code review and anti-sycophancy are code-enforced, not just prompt text.
run.shdispatches three parallel background reviews, waits, then flags a unanimous PASS as a "potential sycophancy risk" and triggers a devil's advocate pass. - Cross-project memory is real, with episodic/semantic/procedural tiers and a deliberately dependency-free numpy vector index (no FAISS). ChromaDB shows up only in the MCP server's code-search tool, a separate subsystem.
- Legacy healing (
loki heal, with archaeology/stabilize/isolate/modernize/validate phases) is a genuinely unusual feature with its own backward-compat gate.
Target user: a developer or team that wants to delegate a whole feature to an agent fleet, self-hosted, on their own provider keys, rather than use a hosted tool like Replit or Lovable.
Capabilities, side by side
| Dimension | gstack | Loki Mode |
|---|---|---|
| Core model | Human-in-the-loop specialist roles | Autonomous closure loop (RARV-C) |
| Primary unit | ~54 Markdown skills (slash commands) | Agent swarms + bash orchestrator |
| Browser / QA | Real 24K-LOC Playwright browser, 76 commands, daemon | Drives an agent CLI; no first-party browser engine |
| Code review | /review, /cso (OWASP + STRIDE), /investigate | Code-enforced blind 3-reviewer + anti-sycophancy |
| Memory | Optional gbrain semantic search (per-repo) | Cross-project episodic/semantic/procedural + vector search |
| Design tooling | Strong (/design-consultation, /design-shotgun, /design-html) | Minimal |
| Multi-provider | Claude-first; 10 hosts but others degraded | Provider-agnostic with failover (Claude/Codex/Cline/Aider) |
| Deploy | /ship, /land-and-deploy, /canary | Generates configs/CI but explicitly does NOT deploy |
| iOS support | Real on-device SwiftUI QA via DebugBridge | None |
| UIs | CLI + Chrome extension with live agent PTY pane | FastAPI dashboard + Purple Lab Monaco web IDE |
| Install | git clone + ./setup | npm / Bun / Homebrew / Docker / git, plus a GitHub Action |
Quality and maturity: both ran green
This is where the star gap matters least. I ran tests in both repos.
gstack has a serious three-tier test system: free static validation (fast), paid end-to-end runs via claude -p (about $3.85 each), and LLM-as-judge evals (about $0.15 each), with diff-based test selection. The free slice I executed (test/skill-validation.test.ts) returned 329 pass, 0 fail in about 600ms. CI is 10 GitHub workflows plus GitLab parity, including an AI-slop linter and static-grep tripwire tests that fail CI on regressions. Release cadence is extreme: 212 changelog entries over ~80 days, roughly 2.6 releases per day. The 54KB CLAUDE.md encodes real institutional knowledge (security thresholds, teardown contracts, CDP lifecycle invariants).
Loki Mode also has real, green tests. The protocol suite I ran (node --test tests/protocols/*.test.js) returned 137 pass, 0 fail. CI is 23 GitHub Actions workflows including SBOM, provenance, mutation testing, and a security audit. The changelog is 762KB with 424 version headers. Packaging discipline is high: multi-channel distribution, Dockerfiles, shell completions, a composite GitHub Action.
Both are polished. Both ship constantly. Neither is a weekend toy. The difference in maturity is not 110x. It is closer to even, with each strong in different places: gstack in eval rigor and browser engineering, Loki Mode in orchestration plumbing and packaging breadth.
Where the marketing outruns the code (on both sides)
A fair comparison has to name the overclaims, including in the user's own project.
Loki Mode:
- The README shows an "Open source: Yes" row and repeatedly uses "open source" language. The actual license is BUSL-1.1: source-available, with an Additional Use Grant that forbids offering it as a competing commercial or hosted product. It converts to Apache 2.0 on 2030-03-19. That is open-core / source-available monetization, not OSI open source. This is a concrete, real differentiator, and it is the place Loki Mode's own framing overstates its license.
- The "SWE-bench 299/300" figure is patch generation, not solved. The README itself notes the evaluator was not run. Reading 299/300 as a solve rate would be wrong. Those 649
.patchfiles in the repo are exactly these benchmark output artifacts, not product source. - HumanEval 98.78% is self-reported and unverified.
- The autonomy is real but bounded: by the project's own admission, it generates Dockerfiles and CI but does not deploy. A human runs the deploy commands.
gstack:
- The headline productivity numbers (810x, "team of twenty," 600K lines) are self-reported from Garry Tan's own repos and not independently verified. To its credit,
docs/ON_THE_LOC_CONTROVERSY.mdis unusually self-aware and shows the deflated math, but the figures are still the author's own. - The voice is enforced: community PR guardrails hard-block any PR that softens the YC/founder framing. That is a constraint on it being a true community project.
- It is effectively Claude Code-first. The other nine hosts are degraded, and the deepest features target Claude.
Both projects, in other words, lead with founder narrative. The underlying engineering on both sides is more substantial than the marketing is trustworthy.
Pros and cons
gstack
Pros
- A real, substantial browser automation engine (24K LOC, 76 commands), not a prompt that pretends to browse
- Genuinely permissive MIT license; fork it, sell it, embed it
- Mature paid-eval and CI infrastructure with diff-based test selection
- Strong design and security tooling (
/design-shotgun,/cso) - Cohesive human-in-the-loop workflow where skills chain artifacts
- Massive, fast-growing community and ecosystem (105K stars, 15.7K forks)
Cons
- Effectively single-author with enforced voice and gatekeeping
- Claude Code-first; other hosts are second-class
- Heavy, self-promotional README with unverified productivity multipliers
- macOS/arm64-centric; Windows path uses stale copies, not symlinks
- Full eval suite is paid (about $4/run) and slow
- Not autonomous: it structures you, it does not replace you
Loki Mode
Pros
- Real autonomous build loop (RARV-C), not just a prompt scaffold
- Provider-agnostic with actual failover (Claude/Codex/Cline/Aider): no single-vendor lock-in
- Code-enforced blind multi-reviewer and anti-sycophancy logic
- Cross-project memory with vector search, dependency-free by design
- Unusual and useful legacy-healing mode
- Self-hosted, your-keys, air-gappable; strong packaging and CI breadth
- Larger real codebase than the star count suggests
Cons
- BUSL-1.1 license restricts commercial/hosted use; "open source" labeling overstates it
- Does not deploy; stops at generated configs
- Benchmark headline numbers are soft or mislabeled (299/300 is generation, not solve)
- Mid-migration: bash is still the source of truth while a Bun port proceeds (dual-maintenance risk)
- Bus factor of one
- Much of the ambition (enterprise TLS/OIDC/RBAC, big test counts) is env-gated and self-reported
So which is better?
Segment it, then commit to a call.
If you want to stay in control and ship faster as a human: gstack. The roles, the review gates, the real browser, and the proven workflow make it the safer, more polished daily driver for most people today. The MIT license means you can build on it without legal friction. For a solo founder or a tech lead who wants rigor on every PR, this is the stronger pick right now.
If you want to delegate whole features and walk away: Loki Mode. Nothing in gstack attempts true autonomy. If your goal is spec-in, repo-out with quality gates enforcing the loop, Loki Mode is the only one of the two even trying, and the autonomy is real (if bounded by "does not deploy"). The provider-agnostic failover and cross-project memory are genuine advantages for teams that refuse vendor lock-in or run self-hosted and private.
Overall, for the typical user in mid-2026: gstack has the edge. It is more polished for its stated job, genuinely open, backed by a real browser engine and mature evals, and proven by adoption. That is a defensible call, not a popularity vote.
But the honest caveat cuts the other way too: the 110x star gap massively overstates the engineering-effort difference and understates Loki Mode's ambition. Loki Mode is a larger, more technically ambitious codebase attempting a harder problem (full autonomy) with less distribution behind it. If autonomous spec-to-product is where the field is going, Loki Mode is aimed at the bigger target, and it is further along than 956 stars implies.
Stars told you which project got more attention. They did not tell you which one is better built, and they certainly did not tell you which one fits your work. Pick by the loop you want: a structured human, or a structured machine.
Methodology: figures drawn from live GitHub API calls and from reading source in both repositories on 2026-06-01. Test results are from suites executed locally (gstack: 329 pass in skill-validation; Loki Mode: 137 pass in the protocol suite). Both projects are effectively single-author; this comparison is written to be neutral on authorship.