Research

How HELIX reasons.

HELIX is not a scanner with a chat box bolted on. It is a planner that hypothesizes, executes with real tools, observes the result, and re-decides, under hard constraints, with a witness behind every claim. Here is how the engine thinks.

01 · Planning

Tree-search planning for offense.

An offensive engagement is a search problem with a huge branching factor and expensive, irreversible moves. We treat it like one. HELIX runs a stateful Monte-Carlo Tree Search planner with UCB1 selection: it proposes candidate moves, executes the most promising one for real, scores what comes back, and re-decides.

Crucially, it learns from failure. When a branch dies (a WAF deflects a payload, or an endpoint returns a uniform error), the planner prunes it and re-strategizes instead of retrying the same closed door. The loop is hypothesize → execute → observe → re-decide: the same loop a human operator runs, made stateful and parallel.
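A minimal sketch of UCB1 selection with dead-branch pruning, to make the loop concrete. It is an illustration under stated assumptions, not HELIX's internals: the `Node` fields, the `record` helper, the reward scale, and the exploration constant `c` are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One candidate move in the search tree (hypothetical structure)."""
    move: str
    visits: int = 0
    total_reward: float = 0.0
    pruned: bool = False  # set once the branch is a confirmed dead end
    children: list["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # an untried move is always worth one attempt
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(parent: Node) -> Optional[Node]:
    """Pick the live child with the highest UCB1 score, skipping pruned ones."""
    live = [ch for ch in parent.children if not ch.pruned]
    if not live:
        return None
    return max(live, key=lambda ch: ucb1(ch, parent.visits))


def record(parent: Node, child: Node, reward: float, dead_end: bool = False) -> None:
    """Back up one observed result; prune the branch if the door is closed."""
    parent.visits += 1
    child.visits += 1
    child.total_reward += reward
    child.pruned = dead_end
```

In this sketch, a move that a WAF blocks would be recorded with `dead_end=True`, so `select` never proposes it again, while the exploration bonus keeps rarely tried live branches in play.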

MCTS · UCB1

A stateful planner, not a single mega-prompt

State carries across moves: what is in scope, what has been tried, what the target revealed. The planner balances exploring new branches against exploiting promising ones, and abandons dead ends rather than looping on them.

Hypothesize · Execute · Observe · Re-decide · Prune

02 · Evidence

Evidence-first, toward zero false positives.

A finding without proof is noise, and noise is what makes security teams stop reading reports. HELIX inverts the default: a hypothesis is not a finding until it has runtime corroboration. A correlator dedupes signals across agents; a filter drops any hypothesis that cannot be reproduced against the live target.

Then a dedicated Skeptic agent, codename DOUBT ("every claim needs a witness"), actively tries to refute what survived. What ships is a confirmed finding with a copy-pasteable reproducer, a CVSS score, a CWE, and language-specific remediation, tracked through a nine-state triage machine.
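The correlate, corroborate, and refute stages can be sketched as a small filter pipeline. This is a sketch under assumptions: the `Hypothesis` fields, the dedupe key (target plus weakness), and the `reproduce` and `refute` callables are hypothetical stand-ins for the live-target checks, and the triage states are left out because the source does not name them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Hypothesis:
    agent: str     # which agent raised the signal
    target: str    # e.g. an endpoint path
    weakness: str  # e.g. a CWE identifier


def correlate(signals: Iterable[Hypothesis]) -> list[Hypothesis]:
    """Dedupe signals from different agents that describe the same issue."""
    seen: dict[tuple[str, str], Hypothesis] = {}
    for h in signals:
        seen.setdefault((h.target, h.weakness), h)  # keep the first reporter
    return list(seen.values())


def confirm(
    signals: Iterable[Hypothesis],
    reproduce: Callable[[Hypothesis], bool],
    refute: Callable[[Hypothesis], bool],
) -> list[Hypothesis]:
    """Keep only hypotheses that reproduce and then survive refutation."""
    return [
        h for h in correlate(signals)
        if reproduce(h) and not refute(h)
    ]
```

The ordering is the point of the design: the refutation pass only runs on hypotheses that already reproduced, and nothing reaches the output list without passing both checks.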

DOUBT · PROOF

Refute before you report

The burden of proof sits with the engine, not the reader. Every claim must survive corroboration and an explicit refutation pass before it reaches a human. The goal is a report where every line is worth your attention.

Runtime corroboration · curl reproducer · CVSS + CWE · 9-state triage

03 · Architecture

Multi-agent over a shared blackboard.

No single prompt can be a recon specialist, an injection expert, an auth breaker, and a reporter at once. HELIX is 40+ specialized agents coordinating over a shared blackboard: a findings bus that every agent reads and writes. VANGUARD maps the surface; VENOM probes injection; KEYMASTER works auth; ASCENT hunts access control.

The blackboard is what makes the whole more than its parts. The chain hunter CASCADE ("sees bugs as dominoes") watches the shared bus for findings that compose into something larger than any single bug. Specialization plus shared state is how a swarm reasons coherently instead of stepping on itself.
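A blackboard of this kind can be sketched as a small publish/subscribe bus. The sketch below is illustrative only: the `Finding` fields, the `"*"` wildcard subscription, and the `ChainHunter` rule (a chain exists when a set of finding kinds all appear on one target) are assumptions, not a description of how CASCADE works.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    agent: str
    kind: str
    target: str


class Blackboard:
    """Shared findings bus: any agent can post, any agent can subscribe."""

    def __init__(self) -> None:
        self.findings: list[Finding] = []
        self._subs: dict[str, list[Callable[[Finding], None]]] = defaultdict(list)

    def subscribe(self, kind: str, handler: Callable[[Finding], None]) -> None:
        """Register a handler for one finding kind, or "*" for all kinds."""
        self._subs[kind].append(handler)

    def post(self, finding: Finding) -> None:
        """Store the finding, then notify matching and wildcard subscribers."""
        self.findings.append(finding)
        for handler in self._subs[finding.kind] + self._subs["*"]:
            handler(finding)


class ChainHunter:
    """Watches the bus for findings on one target that compose into a chain."""

    def __init__(self, board: Blackboard, needed: set[str]) -> None:
        self.board = board
        self.needed = needed
        self.chains: list[str] = []
        board.subscribe("*", self.on_finding)

    def on_finding(self, finding: Finding) -> None:
        kinds = {f.kind for f in self.board.findings if f.target == finding.target}
        if self.needed <= kinds and finding.target not in self.chains:
            self.chains.append(finding.target)
```

Because the hunter reads the whole board rather than a single message, it can react to a combination that no individual specialist reported on its own.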

40+ AGENTS

Specialists on a findings bus

The agent reasons; roughly 100 real offensive tools (sqlmap, nuclei, ffuf, Frida, Playwright, Semgrep, and more) do the work. These are not tools the LLM pretends to run. The blackboard keeps every specialist working from the same evolving picture of the target.

VANGUARD · VENOM · KEYMASTER · ASCENT · CASCADE · CONDUCTOR

04 · Safe autonomy

Safety is a research constraint, not an afterthought.

Pointing an autonomous agent at real systems is only defensible if the constraints come first. Every tool call passes through a six-layer guardrail engine, in order, before anything executes. We design the planner inside these limits, not around them.

LAYER 1–2

Scan mode & scope

You set aggressiveness per engagement: passive, safe, or full. A hard in-scope allow-list bounds everything the engine is permitted to touch.

LAYER 3–4

Destructive blocking & budget

A pattern detector stops data-destroying or service-saturating actions before they run. A hard LLM-spend ceiling caps every engagement.

LAYER 5–6

Rate limiting & human gate

Request pacing protects availability. Production targets require explicit human approval: staging first, then prod, never silently.
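The six layers can be sketched as an ordered chain of checks in which the first failing layer blocks the call. Everything concrete here is an assumption made for illustration: the `ToolCall` and `Engagement` fields, the `intrusive` flag, the destructive-pattern regex, and the per-window call counter are hypothetical and far simpler than a real guardrail engine.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    target: str
    command: str
    intrusive: bool = False    # hypothetical: does the call send attack traffic?
    est_cost_usd: float = 0.0  # hypothetical: estimated LLM spend for this step


@dataclass
class Engagement:
    mode: str                  # "passive", "safe", or "full"
    scope: set[str]            # in-scope allow-list
    budget_usd: float          # hard LLM-spend ceiling
    spent_usd: float = 0.0
    max_calls_per_window: int = 10
    calls_in_window: int = 0
    production: set[str] = field(default_factory=set)
    approved: set[str] = field(default_factory=set)  # human-approved prod targets


# Toy pattern list; a real detector would be far broader.
DESTRUCTIVE = re.compile(r"\b(drop\s+table|rm\s+-rf|truncate)\b", re.IGNORECASE)


def check(call: ToolCall, eng: Engagement) -> str:
    """Return "allow", or the name of the first layer that blocks the call."""
    layers = [
        ("scan_mode", lambda: not (eng.mode == "passive" and call.intrusive)),
        ("scope", lambda: call.target in eng.scope),
        ("destructive", lambda: DESTRUCTIVE.search(call.command) is None),
        ("budget", lambda: eng.spent_usd + call.est_cost_usd <= eng.budget_usd),
        ("rate_limit", lambda: eng.calls_in_window < eng.max_calls_per_window),
        ("human_gate", lambda: call.target not in eng.production
                               or call.target in eng.approved),
    ]
    for name, passes in layers:
        if not passes():
            return name
    # Only an allowed call consumes budget and rate-limit capacity.
    eng.spent_usd += call.est_cost_usd
    eng.calls_in_window += 1
    return "allow"
```

Evaluating the layers in a fixed order means a blocked call always reports one unambiguous reason, and nothing executes or spends budget until all six checks pass.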

05 · Forward-looking R&D

What we are exploring next.

These are active research directions, not shipped features, and we are honest about the difference. The first is a hardened execution sandbox: running offensive tooling inside microVM isolation so that even a misbehaving tool is contained at the hypervisor boundary, not just the process boundary.

The second is self-hosted offensive models: reducing dependence on third-party inference for sensitive engagements, giving teams the option to keep reasoning in-house. Both are research bets aimed at making autonomous offense safer and more deployable, not promises of a release date.

ROADMAP · R&D

Containment and control

The throughline is the same as everything above: make it safer to point real reasoning at real systems. Sandboxing hardens the blast radius; self-hosting hardens the trust boundary.

microVM isolation · Hardened sandbox · Self-hosted models · Exploratory

How we validate

Reproducible, no cherry-picking.

Research claims are only as good as the way you test them. We validate HELIX on public benchmarks, including the deliberately vulnerable OWASP Juice Shop application, where anyone can check our work, and on authorized bug-bounty scope, where we have explicit permission to test live systems.

Findings are reproducible by construction: every confirmed result carries a reproducer, so a third party can re-run it. We do not cherry-pick favorable runs to publish.

On numbers. We deliberately do not publish private validation metrics (exact vulnerability counts, run durations, or per-engagement costs) on this page. We will walk design partners through detailed results directly, under NDA.

Want to go deeper on the engine?

We are onboarding three to five design partners and will walk them through the architecture and validation in detail.

Request a demo · See how it works