They cheat by memorizing knowledge instead of learning to act.
But tool use is the key to adapting to new domains and long-horizon tasks.
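We evaluated Claude, GPT, Qwen, and DeepSeek models across four domains: SVG illustration, law, consulting, and investment banking. No model achieves both tool composition and tool creation in every domain. In the results below, no model creates a tool in any domain, and DeepSeek V3 did not finish (DNF).
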
| Model | Tool Use | Tool Comp. | Tool Create | Reward |
|---|---|---|---|---|
| Claude Sonnet 4 | ✓ | ✗ | ✗ | 0.700 |
| Qwen3-Coder-30B | ✓ | ✗ | ✗ | 0.700 |
| OpenAI GPT-5.4 | ✓ | ✗ | ✗ | 0.600 |
| OpenAI GPT-4o-mini | ✓ | ✗ | ✗ | 0.600 |
| DeepSeek V3 | DNF | — | — | — |

| Model | Tool Use | Tool Comp. | Tool Create | Reward |
|---|---|---|---|---|
| Claude Sonnet 4 | ✓ | ✓ | ✗ | 0.730 |
| OpenAI GPT-5.4 | ✓ | ✗ | ✗ | 0.700 |
| Qwen3-Coder-30B | ✓ | ✓ | ✗ | 0.670 |
| OpenAI GPT-4o-mini | ✗ | ✓ | ✗ | 0.210 |
| DeepSeek V3 | DNF | — | — | — |

| Model | Tool Use | Tool Comp. | Tool Create | Reward |
|---|---|---|---|---|
| Claude Sonnet 4 | ✓ | ✓ | ✗ | 0.850 |
| OpenAI GPT-5.4 | ✓ | ✗ | ✗ | 0.700 |
| Qwen3-Coder-30B | ✓ | ✓ | ✗ | 0.550 |
| OpenAI GPT-4o-mini | ✓ | ✗ | ✗ | 0.280 |
| DeepSeek V3 | DNF | — | — | — |

| Model | Tool Use | Tool Comp. | Tool Create | Reward |
|---|---|---|---|---|
| Claude Sonnet 4 | ✓ | ✓ | ✗ | 0.741 |
| OpenAI GPT-5.4 | ✓ | ✓ | ✗ | 0.741 |
| OpenAI GPT-4o-mini | ✗ | ✗ | ✗ | 0.327 |
| Qwen3-Coder-30B | ✗ | ✗ | ✗ | 0.327 |
| DeepSeek V3 | DNF | — | — | — |

Each domain has a worked example showing how to compose building blocks into a finished output. The agent learns the method (decompose → find blocks → compose), not the specific answer.

| Domain | Worked Example | Building Blocks | Meta-Strategy Tip |
|---|---|---|---|
| Hand-Draw | diamond.html: 2 triangles → diamond | elements/*.js | “See how illustrations compose from building blocks” |
| Law | precedent_memo.txt: case → tool → memo | tools/royalty_calc.py | “See how memo uses tools and case data” |
| Consulting | market_analysis.txt: data → framework → strategy | tools/tam_tool.py | “See how strategy uses frameworks and market data” |
| Investment Banking | alpha_analysis.txt: brief → tool → result | tools/xirr_tool.py | “See how analysis uses tools and data” |

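
The decompose → find blocks → compose method can be illustrated with the Hand-Draw row. The sketch below is not the benchmark's code: the `Triangle` class, the `BLOCKS` and `PLANS` tables, and the `compose` function are all hypothetical names chosen for illustration, and the coordinates are arbitrary. It assumes a diamond and an hourglass can each be built from the same two triangle blocks stacked vertically, which is the point of learning the method rather than one specific answer.

```python
from dataclasses import dataclass

Point = tuple[float, float]


@dataclass(frozen=True)
class Triangle:
    """Hypothetical building block: a filled triangle with three vertices."""
    a: Point
    b: Point
    c: Point

    def to_svg(self, fill: str = "black") -> str:
        points = " ".join(f"{x:g},{y:g}" for x, y in (self.a, self.b, self.c))
        return f'<polygon points="{points}" fill="{fill}"/>'


def triangle_up(cx: float, top_y: float, width: float, height: float) -> Triangle:
    """Apex at the top, base at the bottom (SVG y grows downward)."""
    half = width / 2
    return Triangle((cx, top_y), (cx - half, top_y + height), (cx + half, top_y + height))


def triangle_down(cx: float, top_y: float, width: float, height: float) -> Triangle:
    """Base at the top, apex at the bottom."""
    half = width / 2
    return Triangle((cx - half, top_y), (cx + half, top_y), (cx, top_y + height))


# "Find blocks": a registry of the available building blocks (assumed names).
BLOCKS = {"triangle_up": triangle_up, "triangle_down": triangle_down}

# "Decompose": each target is a top-to-bottom list of blocks (assumed plans).
PLANS = {
    "diamond": ["triangle_up", "triangle_down"],
    "hourglass": ["triangle_down", "triangle_up"],
}


def compose(target: str, cx: float = 50, width: float = 60, height: float = 40) -> str:
    """Compose: stack the planned blocks vertically into one SVG document."""
    parts = PLANS[target]
    shapes = [BLOCKS[name](cx, i * height, width, height) for i, name in enumerate(parts)]
    body = "".join(shape.to_svg() for shape in shapes)
    total_height = len(parts) * height
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 {total_height}">'
        f"{body}</svg>"
    )


if __name__ == "__main__":
    print(compose("diamond"))
    print(compose("hourglass"))
```

Only the plan differs between the two targets; the blocks and the composition step are shared, so adding a new shape in this sketch means adding a plan rather than new drawing code.
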
Click a model to see its full agent trajectory on the Hand-Draw: Hourglass task. Watch how each model discovers (or fails to discover) the building blocks.