OpenEnv Hackathon 2026

SuperGeneral

Compositional Tool Environments for Long-Horizon Agents

├─ tree (composition)
│  ├─ trunk → curve.js ✓
│  ├─ branches → path.js ✓
│  ├─ forks → circle.js ✓
│  └─ blossoms → circle.js ✓
└─ growing (animation)
   ├─ grow-up → clip-path ✓
   ├─ draw-in → stroke-dash ✓
   └─ pop → scale ✓

Tool Use → Tool Composition → Tool Creation

Frontier models are surprisingly bad at using tools.

They cheat by memorizing knowledge instead of learning to act.
But tool use is the key to adapting to new domains and long-horizon tasks.

01 — The Evidence

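We evaluated Claude, GPT, Qwen, and DeepSeek models across SVG illustration, law, consulting, and investment banking. The tables below report one task per domain. No model created a new tool in any domain, and none composed tools in all four. DeepSeek V3 did not finish (DNF) any task.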

Hand-Draw: Hourglass

Model	Tool Use	Tool Composition	Tool Creation	Reward
Claude Sonnet 4	✓	✗	✗	0.700
Qwen3-Coder-30B	✓	✗	✗	0.700
OpenAI GPT-5.4	✓	✗	✗	0.600
OpenAI GPT-4o-mini	✓	✗	✗	0.600
DeepSeek V3	DNF			—

Law: Royalty Dispute

Model	Tool Use	Tool Composition	Tool Creation	Reward
Claude Sonnet 4	✓	✓	✗	0.730
OpenAI GPT-5.4	✓	✗	✗	0.700
Qwen3-Coder-30B	✓	✓	✗	0.670
OpenAI GPT-4o-mini	✗	✓	✗	0.210
DeepSeek V3	DNF			—

Consulting: Market Entry

Model	Tool Use	Tool Composition	Tool Creation	Reward
Claude Sonnet 4	✓	✓	✗	0.850
OpenAI GPT-5.4	✓	✗	✗	0.700
Qwen3-Coder-30B	✓	✓	✗	0.550
OpenAI GPT-4o-mini	✓	✗	✗	0.280
DeepSeek V3	DNF			—

Investment Banking: Financial Analysis

Model	Tool Use	Tool Composition	Tool Creation	Reward
Claude Sonnet 4	✓	✓	✗	0.741
OpenAI GPT-5.4	✓	✓	✗	0.741
OpenAI GPT-4o-mini	✗	✗	✗	0.327
Qwen3-Coder-30B	✗	✗	✗	0.327
DeepSeek V3	DNF			—

02 — How We Test This

One strategy. Every domain.

Core Strategy

├─ Tool Use: use existing building blocks

├─ Tool Composition: combine tools for a new goal

└─ Tool Creation: create new tools for the task

Measured by file-system diff. Domain-agnostic. Learned once, applied everywhere.
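
As a rough illustration of what a file-system diff check could look like, the sketch below snapshots a workspace before and after an episode and reports which files were created, modified, or deleted. The function names (`snapshot`, `fs_diff`, `demo`) and the workspace layout are assumptions made for this example; the actual SuperGeneral scoring code is not shown in the source and may work differently.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to a SHA-256 digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def fs_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and list created, modified, and deleted files."""
    shared = before.keys() & after.keys()
    return {
        "created": sorted(after.keys() - before.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
        "deleted": sorted(before.keys() - after.keys()),
    }


def demo() -> dict[str, list[str]]:
    """Simulate one episode in a throwaway workspace and return its diff."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical workspace: one existing building block.
        (root / "elements").mkdir()
        (root / "elements" / "triangle.js").write_text("// triangle block\n")
        before = snapshot(root)
        # Stand-in for the agent's actions: it writes one new output file.
        (root / "hourglass.html").write_text("<svg></svg>\n")
        return fs_diff(before, snapshot(root))


if __name__ == "__main__":
    print(demo())
```

Because the check only looks at file paths and content hashes, the same code applies whether the output is an HTML illustration, a memo, or an analysis file, which is one way to read the "domain-agnostic" claim above.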

Hand-Draw (visual)

├─ use: triangle.js + line.js

├─ compose: 2×triangle → hourglass

└─ output: hourglass.html

Law (text-heavy)

├─ use: precedent_template.txt

├─ compose: facts + statute → memo

└─ output: legal_memo.txt

Consulting (text + tools)

├─ use: framework_template.md

├─ compose: framework + data → deck

└─ output: strategy_deck.md

Investment Banking (tool-heavy)

├─ use: xirr_tool.py + brief.txt

├─ compose: cashflows + xirr → analysis

└─ output: analysis.txt
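
To make the Hand-Draw composition step concrete, here is a minimal sketch of "2×triangle → hourglass". It is written in Python rather than JavaScript, and `triangle`, `hourglass`, and `to_html` are hypothetical stand-ins: the real `triangle.js` and `line.js` blocks are not shown in the source, and the `line.js` block is left out here for brevity.

```python
def triangle(points: list[tuple[float, float]], fill: str = "none") -> str:
    """Return an SVG <polygon> for three (x, y) vertices.

    Hypothetical stand-in for the triangle.js building block.
    """
    if len(points) != 3:
        raise ValueError("a triangle needs exactly three vertices")
    coords = " ".join(f"{x:g},{y:g}" for x, y in points)
    return f'<polygon points="{coords}" fill="{fill}" stroke="black"/>'


def hourglass(width: float = 100, height: float = 200) -> str:
    """Compose two triangles that meet tip to tip at the center."""
    cx, cy = width / 2, height / 2
    # Upper triangle: base along the top edge, apex at the center.
    top = triangle([(0, 0), (width, 0), (cx, cy)])
    # Lower triangle: base along the bottom edge, apex at the center.
    bottom = triangle([(0, height), (width, height), (cx, cy)])
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width:g} {height:g}">'
        f"{top}{bottom}</svg>"
    )


def to_html(svg: str) -> str:
    """Wrap an SVG string in a minimal HTML page, e.g. for hourglass.html."""
    return f"<!DOCTYPE html>\n<html><body>{svg}</body></html>\n"


if __name__ == "__main__":
    print(to_html(hourglass()))
```

The point of the sketch is the shape of the solution: `hourglass` never draws raw geometry itself, it only calls the existing block twice with different vertices, which is the behavior the Tool Composition level is meant to reward.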

How the environment teaches meta-strategy

Each domain has a worked example showing how to compose building blocks into a finished output. The agent learns the method — decompose → find blocks → compose — not the specific answer.

Domain	Worked Example	Building Blocks	Meta-Strategy Tip
Hand-Draw	`diamond.html` — 2 triangles → diamond	`elements/*.js`	“See how illustrations compose from building blocks”
Law	`precedent_memo.txt` — case → tool → memo	`tools/royalty_calc.py`	“See how memo uses tools and case data”
Consulting	`market_analysis.txt` — data → framework → strategy	`tools/tam_tool.py`	“See how strategy uses frameworks and market data”
Investment Banking	`alpha_analysis.txt` — brief → tool → result	`tools/xirr_tool.py`	“See how analysis uses tools and data”
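
For the Investment Banking row, the "brief → tool → result" pattern relies on an XIRR calculation. The contents of `tools/xirr_tool.py` are not shown in the source, so the sketch below is only an assumption about what such a tool might compute: the standard XIRR definition (the annual rate at which dated cash flows discount to zero, using an actual/365 day count as spreadsheet XIRR functions do), solved here by simple bisection. The names `xnpv` and `xirr` and the default bracket are illustrative choices.

```python
from datetime import date


def xnpv(rate: float, cashflows: list[tuple[date, float]]) -> float:
    """Net present value of dated cash flows, discounted on an actual/365 basis."""
    t0 = min(d for d, _ in cashflows)
    return sum(cf / (1.0 + rate) ** ((d - t0).days / 365.0) for d, cf in cashflows)


def xirr(
    cashflows: list[tuple[date, float]],
    lo: float = -0.99,
    hi: float = 10.0,
    tol: float = 1e-9,
    max_iter: int = 200,
) -> float:
    """Find the rate where xnpv is zero by bisection on [lo, hi]."""
    f_lo, f_hi = xnpv(lo, cashflows), xnpv(hi, cashflows)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change in [lo, hi]; XIRR is not bracketed")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        f_mid = xnpv(mid, cashflows)
        if abs(f_mid) < tol or (hi - lo) / 2.0 < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid  # root lies in the upper half
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Illustrative cash flows only: invest 1000, receive 1100 one year later.
    flows = [(date(2023, 1, 1), -1000.0), (date(2024, 1, 1), 1100.0)]
    print(f"XIRR = {xirr(flows):.4%}")
```

In the composition step, an agent would read the cash flows out of the brief, pass them to the tool, and write the resulting rate into its analysis file rather than estimating the figure from memory.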

03 — See It In Action

Click a model to see its full agent trajectory on the Hand-Draw: Hourglass task. Watch how each model discovers (or fails to discover) the building blocks.

› Compose an hourglass

Try It Live — run an agent on the SuperGeneral environment