Benchmark

Same model.
Better results.

We solved 5 business tasks with the same AI model — once as a simple prompt, once through the Reactor pipeline. Two independent judge models evaluated the results blind.

+15.5% overall quality
+23.7% critical thinking
+22.2% perspective diversity
25 / 25 perfect score (T4)
Engine: v1 PureStream · Base model: Gemini 3.1 Flash Lite (both conditions) · Judges: GPT-4o-mini + Claude Sonnet 4.6 · Blinded evaluation
Test Design

How we tested

Two conditions

Standard prompt

The task is sent directly to the model. No system prompt, no method binding, no agent decomposition. This is how most people use AI today.

Reactor

The same task, processed through a full recipe: multi-step orchestration, method binding to documented thinking frameworks, and specialized agents in structured sequence.
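
A minimal sketch of this two-condition harness, assuming both the base model and a full Reactor recipe can be treated as plain text-in, text-out callables. The names below (run_standard, run_reactor, collect_pair) are illustrative, not the engine's actual API.

```python
from typing import Callable

Model = Callable[[str], str]      # plain text in, plain text out
Pipeline = Callable[[str], str]   # a full Reactor recipe, viewed as a black box

def run_standard(model: Model, task: str) -> str:
    # Condition A: the raw task, no system prompt, no decomposition.
    return model(task)

def run_reactor(pipeline: Pipeline, task: str) -> str:
    # Condition B: the identical task, routed through the recipe pipeline.
    return pipeline(task)

def collect_pair(model: Model, pipeline: Pipeline, task_id: str, task: str) -> dict:
    # Condition labels live only in this record, never in the answer text,
    # so the judges can later score both outputs blind.
    return {
        "task": task_id,
        "standard": run_standard(model, task),
        "reactor": run_reactor(pipeline, task),
    }
```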

5 quality dimensions (1–5 points)

D1 Structural depth: clear hierarchy, tables, argumentation architecture
D2 Analytical rigor: causal chains, evidence-based reasoning
D3 Perspective diversity: genuine counterpositions, not just variations
D4 Actionability: concrete steps, timelines, priorities
D5 Critical friction: challenges assumptions instead of telling the user what they want to hear
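
Expressed as data, the rubric is small. The sketch below (our naming, not the engine's) only makes explicit how five 1–5 ratings add up to the per-task maximum of 25 points used throughout the results.

```python
DIMENSIONS = {
    "D1": "Structural depth",
    "D2": "Analytical rigor",
    "D3": "Perspective diversity",
    "D4": "Actionability",
    "D5": "Critical friction",
}

def total_score(ratings: dict[str, int]) -> int:
    # One 1-5 rating per dimension, so the per-task maximum is 5 x 5 = 25.
    assert ratings.keys() == DIMENSIONS.keys()
    assert all(1 <= r <= 5 for r in ratings.values())
    return sum(ratings.values())

# A perfect result, like T4 below, means
# total_score({"D1": 5, "D2": 5, "D3": 5, "D4": 5, "D5": 5}) == 25.
```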

5 real-world tasks

T1 · B2B → B2C pivot analysis · Strategic analysis · Recipe: Strategy Pivot Engine
T2 · Gen-Z coffee packaging brief · Creative ideation · Recipe: Ergodic Hive
T3 · US market entry ranking · Comparative evaluation · Recipe: TRIZ Swarm v9
T4 · Reducing ER wait times · TRIZ contradiction resolution · Recipe: TRIZ System Analyzer
T5 · Fintech data breach crisis response · Crisis management · Recipe: Red Team Simulator (CART Wargame)

Results

Quality in detail

Average scores across all 5 tasks, rated on a scale of 1–5 per dimension. The Reactor wins in every category.

Dimension (Ø 1–5)         Reactor   Standard  Δ
D1 Structural depth       4.9       4.6       +7%
D2 Analytical rigor       4.7       4.1       +15%
D3 Perspective diversity  4.4       3.6       +22%
D4 Actionability          4.8       4.2       +14%
D5 Critical friction      4.7       3.8       +24%
Total score (max. 25)     23.8      20.6      +15.5%
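
The headline figures follow directly from this table. As a sanity check, using the values above:

```python
reactor_total, standard_total = 23.8, 20.6
print(round((reactor_total / standard_total - 1) * 100, 1))  # 15.5 -> "+15.5% overall quality"

d5_reactor, d5_standard = 4.7, 3.8
print(round((d5_reactor / d5_standard - 1) * 100, 1))        # 23.7 -> the D5 gain in the hero stats
```
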
Individual results

Task by task

The Reactor wins all 5 tasks. The largest advantages emerge in creative and crisis scenarios — precisely where structured multi-agent orchestration makes the difference.

T1 · B2B → B2C pivot analysis · Strategic analysis · Recipe: Strategy Pivot Engine · Reactor: 23.0 · Δ vs. Std.: +7%
T2 · Gen-Z coffee packaging brief · Creative ideation · Recipe: Ergodic Hive · Reactor: 24.5 · Δ vs. Std.: +26%
T3 · US market entry ranking · Comparative evaluation · Recipe: TRIZ Swarm v9 · Reactor: 25.0 · Δ vs. Std.: +14%
T4 · Reducing ER wait times · TRIZ contradiction resolution · Recipe: TRIZ System Analyzer · Reactor: 25.0 · Δ vs. Std.: +19%
T5 · Fintech data breach crisis response · Crisis management · Recipe: Red Team Simulator (CART Wargame) · Reactor: 23.5 · Δ vs. Std.: +24%

★ T4 achieved a perfect score of 25 / 25 — maximum in all dimensions. Standard: T1 21.5 · T2 19.5 · T3 22.0 · T4 21.0 · T5 19.0

Architecture

Why this works

Decomposition, not mega-prompts

The longer a prompt gets, the more the results deteriorate — research calls this effect attention dilution. The Reactor breaks complex tasks into isolated micro-steps. The AI model doesn't need to manage the process — it only solves one clearly defined sub-task per step.
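
A minimal sketch of that idea, assuming each step is nothing more than a prompt template applied to the previous step's output. The steps shown are illustrative, not an actual Reactor recipe.

```python
from typing import Callable

Model = Callable[[str], str]   # plain text in, plain text out

# Illustrative micro-steps; a real recipe binds each one to a documented method.
STEPS = [
    "List the core constraints and unknowns in the following task:\n\n{context}",
    "For each constraint below, trace the causal chain that makes it binding:\n\n{context}",
    "Turn the analysis below into concrete, prioritized next steps with timelines:\n\n{context}",
]

def run_decomposed(model: Model, task: str) -> str:
    # The orchestrator carries the process; each model call only sees
    # one narrow, clearly defined sub-task.
    context = task
    for template in STEPS:
        context = model(template.format(context=context))
    return context
```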

True diversity, not variations

When a single model generates '3 different ideas', it samples all three from the same probability space. The result: stylistic variation, not real diversity. The Reactor deploys parallel agents with orthogonal methods — TRIZ, biomimicry, red team — enforcing genuine counterpositions.
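
Sketched in the same style, with the method prompts below standing in for the documented frameworks a real recipe binds to:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Model = Callable[[str], str]

# Orthogonal methods, not three samples from the same distribution.
METHODS = {
    "triz":       "Resolve the central contradiction in this task using TRIZ principles:\n\n{task}",
    "biomimicry": "Solve this task by analogy to mechanisms found in biological systems:\n\n{task}",
    "red_team":   "Attack the obvious solution to this task and expose its failure modes:\n\n{task}",
}

def orthogonal_ideas(model: Model, task: str) -> dict[str, str]:
    # One agent per method, run in parallel: counterpositions by construction.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(model, prompt.format(task=task))
                   for name, prompt in METHODS.items()}
    return {name: future.result() for name, future in futures.items()}
```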

Fair evaluation

AI models systematically rate their own text higher (self-enhancement bias). That's why the judges come from different providers than the base model. Instead of the mean, we use the median, which is more robust when individual ratings are noisy. Reasoning is generated before the score, not the other way around.
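
A sketch of that scoring step under these constraints; the judge prompt and the parsing are simplified, and the per-dimension rubric text is not reproduced here.

```python
import statistics
from typing import Callable

Judge = Callable[[str], str]   # judge prompt in, judge reply out

JUDGE_PROMPT = (
    "You will see one anonymized answer to a business task.\n"
    "Rate the dimension '{dimension}' from 1 to 5.\n"
    "First write your reasoning, then end with a line 'SCORE: <1-5>'.\n\n"
    "Answer:\n{answer}"
)

def judge_score(judge: Judge, dimension: str, answer: str) -> int:
    # Reasoning comes before the score; only the final line is parsed.
    reply = judge(JUDGE_PROMPT.format(dimension=dimension, answer=answer))
    return int(reply.rsplit("SCORE:", 1)[-1].strip())

def panel_score(judges: list[Judge], dimension: str, answer: str) -> float:
    # Median across judges from different providers: a single biased or
    # erratic rating shifts the result less than it would with a mean.
    return statistics.median(judge_score(j, dimension, answer) for j in judges)
```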

Cognitive offloading

An empty input field creates maximum cognitive load: the user must simultaneously formulate, structure, and prioritize. The Reactor handles navigation and structure. Humans can focus on what matters: the strategic evaluation of results.

Limitations

This benchmark covers 5 tasks across 5 dimensions — a relevant but deliberately focused sample. Results cannot be directly generalized to all task types, domains, or base models. The judge panel consists of two models; additional judges would increase statistical confidence. We consider this v1 of an ongoing evaluation program. As the engine evolves, tests will be repeated and published here.

See for yourself.

The best benchmark is your own problem. Pick a task that actually matters to you and compare the results.