Same model.
Better results.
We solved 5 business tasks with the same AI model — once as a simple prompt, once through the Reactor pipeline. Two independent judge models evaluated the results blind.
How we tested
Two conditions
Standard prompt
The task is sent directly to the model. No system prompt, no method binding, no agent decomposition. This is how most people use AI today.
Reactor
The same task, processed through a full recipe: multi-step orchestration, method binding to documented thinking frameworks, and specialized agents in structured sequence.
5 quality dimensions (1–5 points each, max. 25 per response)
Structural depth
Clear hierarchy, tables, argumentation architecture
Analytical rigor
Causal chains, evidence-based reasoning
Perspective diversity
Genuine counterpositions, not just variations
Actionability
Concrete steps, timelines, priorities
Critical friction
Challenges assumptions instead of telling the user what they want to hear
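Expressed as data, the rubric is simply five named dimensions at 1–5 points each. The sketch below is a hypothetical representation for illustration, not the benchmark's actual rubric file:

```python
# Hypothetical representation of the five-dimension rubric (1-5 points each).
RUBRIC = {
    "structural_depth":      "Clear hierarchy, tables, argumentation architecture",
    "analytical_rigor":      "Causal chains, evidence-based reasoning",
    "perspective_diversity": "Genuine counterpositions, not just variations",
    "actionability":         "Concrete steps, timelines, priorities",
    "critical_friction":     "Challenges assumptions instead of flattering the user",
}

MAX_TOTAL = 5 * len(RUBRIC)  # 25 points per response
```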
5 real-world tasks
T1 · B2B → B2C pivot analysis · Strategic analysis · Recipe: Strategy Pivot Engine
T2 · Gen-Z coffee packaging brief · Creative ideation · Recipe: Ergodic Hive
T3 · US market entry ranking · Comparative evaluation · Recipe: TRIZ Swarm v9
T4 · Reducing ER wait times · TRIZ contradiction resolution · Recipe: TRIZ System Analyzer
T5 · Fintech data breach crisis response · Crisis management · Recipe: Red Team Simulator (CART Wargame)
Quality in detail
Average scores across all 5 tasks, rated on a scale of 1–5 per dimension. The Reactor wins in every category.
Task by task
The Reactor wins all 5 tasks. The largest advantages emerge in creative and crisis scenarios — precisely where structured multi-agent orchestration makes the difference.
T1 · B2B → B2C pivot analysis · Strategic analysis · Recipe: Strategy Pivot Engine · Reactor: 23.0 · Δ vs. Std.: +7%
T2 · Gen-Z coffee packaging brief · Creative ideation · Recipe: Ergodic Hive · Reactor: 24.5 · Δ vs. Std.: +26%
T3 · US market entry ranking · Comparative evaluation · Recipe: TRIZ Swarm v9 · Reactor: 25.0 · Δ vs. Std.: +14%
T4 · Reducing ER wait times · TRIZ contradiction resolution · Recipe: TRIZ System Analyzer · Reactor: 25.0 · Δ vs. Std.: +19%
T5 · Fintech data breach crisis response · Crisis management · Recipe: Red Team Simulator (CART Wargame) · Reactor: 23.5 · Δ vs. Std.: +24%
★ T4 achieved a perfect score of 25/25, the maximum in all dimensions. Standard totals: T1 21.5 · T2 19.5 · T3 22.0 · T4 21.0 · T5 19.0
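To make the deltas reproducible: a short sketch of the arithmetic, assuming each Δ is the Reactor total's relative improvement over the standard total. This assumption matches all five percentages above:

```python
# Totals per task (max. 25 points): (Reactor, Standard).
TOTALS = {
    "T1": (23.0, 21.5), "T2": (24.5, 19.5), "T3": (25.0, 22.0),
    "T4": (25.0, 21.0), "T5": (23.5, 19.0),
}

for task, (reactor, standard) in TOTALS.items():
    delta = (reactor - standard) / standard  # relative improvement
    print(f"{task}: +{delta:.0%}")
# Prints: T1 +7%, T2 +26%, T3 +14%, T4 +19%, T5 +24%
```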
Why this works
Decomposition, not mega-prompts
The longer a prompt gets, the more output quality degrades; research calls this attention dilution. The Reactor breaks complex tasks into isolated micro-steps, so the AI model never has to manage the overall process: it only solves one clearly defined sub-task per step.
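As an illustration only (the step names and the `call_model` helper are hypothetical, not the Reactor's actual implementation), such a decomposition can be sketched as a chain of single-purpose calls, each seeing only the context it needs:

```python
# Hypothetical helper: a single chat-completion call to any provider.
def call_model(system: str, user: str) -> str:
    raise NotImplementedError  # wire up to your model API of choice

def run_pipeline(task: str) -> str:
    """Each step solves one focused sub-task instead of one mega-prompt."""
    framing  = call_model("Restate the problem, constraints, and success criteria.", task)
    analysis = call_model("Analyze causes, trade-offs, and evidence.", framing)
    options  = call_model("Propose three distinct options with risks.", analysis)
    return call_model("Rank the options and justify the ranking.", options)
```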
True diversity, not variations
When a single model generates '3 different ideas', it samples all three from the same probability space. The result: stylistic variation, not real diversity. The Reactor deploys parallel agents with orthogonal methods — TRIZ, biomimicry, red team — enforcing genuine counterpositions.
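A minimal sketch of the idea, reusing the hypothetical `call_model` helper from above; the method prompts are illustrative stand-ins, not the actual recipe definitions:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative method bindings: each agent starts from orthogonal premises
# rather than sampling three drafts from the same distribution.
METHODS = {
    "triz":       "Resolve the core contradiction using TRIZ principles.",
    "biomimicry": "Derive a solution from an analogous biological system.",
    "red_team":   "Attack the obvious solution and argue the strongest counterposition.",
}

def ideate(task: str) -> dict[str, str]:
    """Run one agent per method in parallel and collect the drafts."""
    with ThreadPoolExecutor(max_workers=len(METHODS)) as pool:
        futures = {name: pool.submit(call_model, method, task)
                   for name, method in METHODS.items()}
        return {name: future.result() for name, future in futures.items()}
```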
Fair evaluation
AI models systematically rate their own text higher (self-enhancement bias), so we use judges from different providers. Instead of the mean, we aggregate with the median, a statistic more robust to noisy judgments. And each judge generates its reasoning before the score, not the other way around.
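Sketched in code, again with the hypothetical `call_model` and `RUBRIC` from the earlier examples; the judge IDs and prompt wording are assumptions, not our actual judging harness:

```python
from statistics import median

JUDGES = ["provider_a/judge-model", "provider_b/judge-model"]  # illustrative IDs

def judge(answer: str) -> float:
    """Total score for one answer: median across judges, summed over dimensions."""
    total = 0.0
    for dimension, description in RUBRIC.items():
        per_judge = []
        for judge_id in JUDGES:
            # Reasoning first, score last: the number is anchored in the
            # critique instead of the critique rationalizing the number.
            verdict = call_model(
                "You are an impartial judge. Explain your reasoning, "
                "then end with 'Score: N' (1-5).",
                f"Dimension: {dimension} ({description})\n\nAnswer:\n{answer}",
            )
            per_judge.append(float(verdict.rsplit("Score:", 1)[-1].strip()))
        total += median(per_judge)  # median resists a single outlier judgment
    return total  # max. 25
```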
Cognitive offloading
An empty input field creates maximum cognitive load: the user must simultaneously formulate, structure, and prioritize. The Reactor handles navigation and structure. Humans can focus on what matters: the strategic evaluation of results.
This benchmark covers 5 tasks across 5 dimensions — a relevant but deliberately focused sample. Results cannot be directly generalized to all task types, domains, or base models. The judge panel consists of two models; additional judges would increase statistical confidence. We consider this v1 of an ongoing evaluation program. As the engine evolves, tests will be repeated and published here.
See for yourself.
The best benchmark is your own problem. Pick a task that actually matters to you and compare the results.