OpenEnv Hackathon 2026

SuperGeneral

Compositional Tool Environments for Long-Horizon Agents
├─ tree (composition)
│  ├─ trunk curve.js
│  ├─ branches path.js
│  ├─ forks circle.js
│  └─ blossoms circle.js
└─ growing (animation)
   ├─ grow-up clip-path
   ├─ draw-in stroke-dash
   └─ pop scale
Tool Use · Tool Composition · Tool Creation

Frontier models are surprisingly bad at using tools.

They cheat by memorizing knowledge instead of learning to act.
But tool use is the key to adapting to new domains and long-horizon tasks.

01 — The Evidence

We evaluated Claude, GPT, Qwen, and DeepSeek models across four domains: SVG illustration, law, consulting, and investment banking. No model achieves both tool composition and tool creation in every domain.

Hand-Draw: Hourglass
Model | Tool Use | Tool Comp. | Tool Create | Reward
Claude Sonnet 4 | | | | 0.700
Qwen3-Coder-30B | | | | 0.700
OpenAI GPT-5.4 | | | | 0.600
OpenAI GPT-4o-mini | | | | 0.600
DeepSeek V3 | | | | DNF

Law: Royalty Dispute
Model | Tool Use | Tool Comp. | Tool Create | Reward
Claude Sonnet 4 | | | | 0.730
OpenAI GPT-5.4 | | | | 0.700
Qwen3-Coder-30B | | | | 0.670
OpenAI GPT-4o-mini | | | | 0.210
DeepSeek V3 | | | | DNF

Consulting: Market Entry
Model | Tool Use | Tool Comp. | Tool Create | Reward
Claude Sonnet 4 | | | | 0.850
OpenAI GPT-5.4 | | | | 0.700
Qwen3-Coder-30B | | | | 0.550
OpenAI GPT-4o-mini | | | | 0.280
DeepSeek V3 | | | | DNF

Investment Banking: Financial Analysis
Model | Tool Use | Tool Comp. | Tool Create | Reward
Claude Sonnet 4 | | | | 0.741
OpenAI GPT-5.4 | | | | 0.741
OpenAI GPT-4o-mini | | | | 0.327
Qwen3-Coder-30B | | | | 0.327
DeepSeek V3 | | | | DNF

02 — How We Test This
One strategy. Every domain.
Core Strategy
├─ Tool Use — use existing building blocks
├─ Tool Composition — combine tools for new goal
└─ Tool Creation — create new tools for the task
Measured by file-system diff. Domain-agnostic. Learned once, applied everywhere.
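
To make "measured by file-system diff" concrete, here is a minimal sketch of one way such a measurement could work: snapshot the workspace before and after an episode, then classify each path as created, modified, or deleted. The function names (`snapshot`, `fs_diff`, `demo`) and the hashing approach are our assumptions for illustration, not the environment's actual implementation.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def fs_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as created, modified, or deleted between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "created": sorted(after.keys() - before.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
        "deleted": sorted(before.keys() - after.keys()),
    }


def demo() -> dict[str, list[str]]:
    """Simulate an episode in a temporary workspace (hypothetical file layout)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "elements").mkdir()
        (root / "elements" / "triangle.js").write_text("// triangle primitive\n")
        before = snapshot(root)
        # Stand-in for the agent's work: one new output file appears.
        (root / "hourglass.html").write_text("<svg></svg>\n")
        return fs_diff(before, snapshot(root))


if __name__ == "__main__":
    print(demo())
```

A diff like this records only what was written. How the environment maps those changes onto Tool Use, Tool Composition, and Tool Creation is not specified here, so that scoring step is left out of the sketch.
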
Hand-Draw (visual)
├─ use: triangle.js + line.js
├─ compose: 2×triangle → hourglass
└─ hourglass.html

Law (text-heavy)
├─ use: precedent_template.txt
├─ compose: facts + statute → memo
└─ legal_memo.txt

Consulting (text + tools)
├─ use: framework_template.md
├─ compose: framework + data → deck
└─ strategy_deck.md

Investment Banking (tool-heavy)
├─ use: xirr_tool.py + brief.txt
├─ compose: cashflows + xirr → analysis
└─ analysis.txt
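
For readers unfamiliar with the Investment Banking building block, the sketch below shows the kind of calculation an XIRR tool performs: find the annual rate at which the date-weighted net present value of a set of cash flows is zero. The contents of xirr_tool.py are not shown on this page, so everything here (function names, the bisection solver, the 365-day year convention borrowed from spreadsheet XIRR) is an assumption for illustration.

```python
from __future__ import annotations

from datetime import date


def xnpv(rate: float, cashflows: list[tuple[date, float]]) -> float:
    """Net present value of dated cash flows, discounted from the first date."""
    t0 = cashflows[0][0]
    return sum(
        amount / (1.0 + rate) ** ((d - t0).days / 365.0)
        for d, amount in cashflows
    )


def xirr(
    cashflows: list[tuple[date, float]],
    lo: float = -0.99,
    hi: float = 10.0,
    tol: float = 1e-9,
) -> float:
    """Solve xnpv(rate) = 0 by bisection on the bracket [lo, hi]."""
    f_lo, f_hi = xnpv(lo, cashflows), xnpv(hi, cashflows)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change in bracket; XIRR not found")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        f_mid = xnpv(mid, cashflows)
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Invest 1000, receive 1100 exactly 365 days later: a rate of about 10%.
    flows = [(date(2025, 1, 1), -1000.0), (date(2026, 1, 1), 1100.0)]
    print(round(xirr(flows), 6))
```

In the task, the agent would be expected to call the provided tool on the cash flows from brief.txt rather than reimplement it; the sketch only illustrates what the "cashflows + xirr → analysis" step computes.
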
How the environment teaches meta-strategy

Each domain has a worked example showing how to compose building blocks into a finished output. The agent learns the method — decompose → find blocks → compose — not the specific answer.

Domain | Worked Example | Building Blocks | Meta-Strategy Tip
Hand-Draw | diamond.html: 2 triangles → diamond | elements/*.js | “See how illustrations compose from building blocks”
Law | precedent_memo.txt: case → tool → memo | tools/royalty_calc.py | “See how memo uses tools and case data”
Consulting | market_analysis.txt: data → framework → strategy | tools/tam_tool.py | “See how strategy uses frameworks and market data”
Investment Banking | alpha_analysis.txt: brief → tool → result | tools/xirr_tool.py | “See how analysis uses tools and data”
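
The decompose → find blocks → compose method can be sketched for the hourglass task as follows. The environment's real building blocks are JavaScript files under elements/; this Python analogue, including the `triangle` and `hourglass` helpers and their parameters, is hypothetical and only illustrates the shape of the composition: one reusable primitive, placed twice.

```python
from __future__ import annotations


def triangle(points: list[tuple[float, float]]) -> str:
    """Hypothetical building block: an SVG polygon with exactly three vertices."""
    if len(points) != 3:
        raise ValueError("a triangle needs exactly three points")
    coords = " ".join(f"{x},{y}" for x, y in points)
    return f'<polygon points="{coords}" fill="none" stroke="black"/>'


def hourglass(width: float = 100, height: float = 160) -> str:
    """Compose two triangles that meet tip to tip at the centre of the canvas."""
    cx, cy = width / 2, height / 2
    # Decompose: an hourglass is an inverted triangle above an upright one.
    top = triangle([(0, 0), (width, 0), (cx, cy)])
    bottom = triangle([(0, height), (width, height), (cx, cy)])
    # Compose: wrap both blocks in a single SVG document.
    body = "\n  ".join([top, bottom])
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">\n'
        f"  {body}\n"
        "</svg>"
    )


if __name__ == "__main__":
    print(hourglass())
```

The same pattern mirrors the diamond worked example: there the two triangles share a base instead of an apex, so an agent that has learned the method only needs to change the placement, not invent a new primitive.
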
03 — See It In Action

Click a model to see its full agent trajectory on the Hand-Draw: Hourglass task. Watch how each model discovers (or fails to discover) the building blocks.
