# Feature Request: Lazy-Loading Architecture for Token Optimization (~70% Reduction Possible)

## Summary

A simple "hi" message currently costs ~53k tokens before any conversation begins. This proposal outlines architectural changes to reduce baseline context overhead by 60-70% through lazy-loading patterns.

**Core insight:** human memory doesn't load everything at once — it uses cue-based retrieval. Claude Code should work the same way.

```
Current:  Load ALL   → Process request → Respond
Proposed: Load INDEX → Match cues → Fetch relevant → Respond
```
## The Problem

Current token breakdown for a fresh session:
| Component | Tokens | % | Controllable? |
|---|---|---|---|
| System tools | 20.4k | 38% | No |
| Memory files (CLAUDE.md) | 10-18k | 19-34% | Yes |
| MCP tools | 9.1k | 17% | Partially |
| Custom agents (in Task tool) | 3.3k | 6% | Yes |
| Skills (in Skill tool) | 2.6k | 5% | Yes |
~35k tokens are controllable but currently load upfront regardless of need.
## Proposed Features

### 1. Lazy-Loading Memory Files (CLAUDE.md)

**Problem:** All CLAUDE.md files load immediately (~10-18k tokens).

**Proposed:** Index-based retrieval:

```yaml
# CLAUDE.md.index (~500 tokens)
sections:
  timestamps:
    triggers: ["timestamp", "time format"]
    file: CLAUDE.md#timestamps
  memory_system:
    triggers: ["RAG", "memory", "CTM"]
    file: RAG_GUIDE.md
  agents:
    triggers: ["agent", "delegate"]
    file: AGENTS_INDEX.md
```

Matching sections are fetched on demand based on keywords in the user message.

**Savings:** ~15k tokens
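The cue-matching step could look something like the following — a minimal Python sketch, assuming the index above has been parsed into a dict (the `sections_to_fetch` function, lowercased triggers, and substring matching are all illustrative choices, not Claude Code internals):

```python
# Hypothetical sketch: match user-message keywords against a parsed
# CLAUDE.md.index and return only the sections worth fetching.
# Structure mirrors the YAML index above; names are illustrative.

INDEX = {
    "timestamps": {"triggers": ["timestamp", "time format"], "file": "CLAUDE.md#timestamps"},
    "memory_system": {"triggers": ["rag", "memory", "ctm"], "file": "RAG_GUIDE.md"},
    "agents": {"triggers": ["agent", "delegate"], "file": "AGENTS_INDEX.md"},
}

def sections_to_fetch(message: str, index: dict = INDEX) -> list[str]:
    """Return the file for every section whose trigger appears in the message."""
    text = message.lower()
    return [entry["file"] for entry in index.values()
            if any(trigger in text for trigger in entry["triggers"])]
```

A plain "hi" matches nothing, so only the ~500-token index is ever in context; a message mentioning timestamps pulls in just `CLAUDE.md#timestamps`.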
### 2. Lazy-Loading Agent Definitions in the Task Tool

**Problem:** The Task tool embeds ALL agent descriptions (~82 agents = 3.3k+ tokens).

**Proposed:** A summary list with on-demand fetch:

```json
{
  "name": "Task",
  "description": "Launch agent. Available: Explore, Plan, Bash, worker, ... (82 total)",
  "dynamic_schema": {
    "subagent_type": { "fetch_on_use": true }
  }
}
```

**Savings:** ~3k tokens
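One way the fetch-on-use semantics could work, sketched in Python — the registry class, its method names, and the two stand-in definitions are all hypothetical, not actual Claude Code internals:

```python
# Hypothetical sketch of fetch-on-use agent definitions: the tool schema
# carries only names upfront; the full description is loaded the first
# time a subagent_type is actually invoked. Names are illustrative.

FULL_DEFINITIONS = {  # stand-in for the ~82 on-disk agent definitions
    "Explore": "Read-only codebase exploration agent. <long description>",
    "Plan": "Planning agent that produces step-by-step plans. <long description>",
}

class LazyAgentRegistry:
    def __init__(self, definitions: dict[str, str]):
        self._definitions = definitions
        self._cache: dict[str, str] = {}

    def summary(self) -> str:
        """What the Task tool embeds upfront: names only, not descriptions."""
        names = ", ".join(sorted(self._definitions))
        return f"Launch agent. Available: {names} ({len(self._definitions)} total)"

    def resolve(self, subagent_type: str) -> str:
        """Fetch the full definition on first use; cache it afterwards."""
        if subagent_type not in self._cache:
            self._cache[subagent_type] = self._definitions[subagent_type]
        return self._cache[subagent_type]
```

The summary string costs tokens proportional to the number of agent *names*, while the multi-hundred-token descriptions only enter context for agents the session actually launches.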
### 3. Lazy-Loading Skill Definitions

**Problem:** The Skill tool embeds all skill descriptions (~44 skills = 2.6k tokens).

**Proposed:** Same pattern as agents — a summary list, with the full schema fetched on invoke.

**Savings:** ~2k tokens
### 4. Sub-Agent Context Inheritance Control

**Problem:** Sub-agents inherit the full parent context, including the entire CLAUDE.md.

**Proposed:** A new Task parameter:

```yaml
Task:
  subagent_type: "Explore"
  prompt: "Find TypeScript files"
  context_inheritance: "minimal"  # NEW: minimal | full | none
```

**Savings:** ~8k tokens per sub-agent
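The three inheritance modes could resolve to something like this Python sketch — the function, the context dict shape, and the keys kept in "minimal" mode are assumptions for illustration, not Claude Code's actual internals:

```python
# Hypothetical sketch of the proposed context_inheritance parameter:
# choose what parent context a Task-spawned sub-agent receives.

def build_subagent_context(parent: dict, mode: str = "full") -> dict:
    """Return the context dict to hand a sub-agent at spawn time."""
    if mode == "full":
        return dict(parent)              # today's behavior: inherit everything
    if mode == "minimal":
        # Keep only what a focused sub-task needs; drop CLAUDE.md et al.
        return {k: parent[k] for k in ("cwd", "tools") if k in parent}
    if mode == "none":
        return {}                        # prompt-only sub-agent
    raise ValueError(f"unknown context_inheritance mode: {mode!r}")
```

An Explore agent asked to "find TypeScript files" needs the working directory and its tools, not the parent's full memory files — hence the ~8k saving per spawn.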
### 5. Usage-Based Tool Pruning

**Problem:** Tools and agents that are never used still load every session.

**Proposed:** Track usage and suggest disabling after N sessions of non-use:

```
📊 Token Optimization Suggestions

These tools haven't been used in 30+ sessions:
- mcp__fathom__create_webhook (214 tokens)
- agent: mermaid-converter (38 tokens)

Disable to save ~400 tokens? [y/n/never ask]
```

**Savings:** 1-5k tokens (varies by user)
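The underlying bookkeeping is simple; here is a Python sketch under the assumption that the client records the last session number each tool was used in (function name, data shape, and threshold default are illustrative):

```python
# Hypothetical sketch of usage-based pruning: find tools idle for
# `threshold`+ sessions and report the tokens each would save.

def pruning_candidates(last_used: dict[str, int],
                       token_cost: dict[str, int],
                       current_session: int,
                       threshold: int = 30) -> list[tuple[str, int]]:
    """Return (tool_name, token_cost) pairs worth suggesting for disable."""
    return sorted(
        (name, token_cost.get(name, 0))
        for name, session in last_used.items()
        if current_session - session >= threshold
    )
```

The returned list drives the prompt mocked up above; a "never ask" answer would simply remove that tool from future candidate lists.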
## Total Impact
| Metric | Current | Optimized | Reduction |
|---|---|---|---|
| Baseline for "hi" | ~53k | ~15k | ~70% |
| With 3 sub-agents | ~80k | ~20k | ~75% |
## The Brain Analogy

The human memory architecture that inspired this proposal:
| Human Memory | Proposed Claude Code |
|---|---|
| Index/tags always loaded | CLAUDE.md.index (~500 tokens) |
| Full memories fetched on cue | Sections fetched on keyword match |
| Recent memories cached | Session cache |
| Unused memories pruned | Usage-based suggestions |
| Context-dependent recall | Project-type detection |
The current architecture is like a human who recites their entire autobiography before every conversation.
## Current Workarounds

We've implemented partial solutions:
- Slim/Full CLAUDE.md switcher — wrapper script with `--slim` flag
- Sub-agent detection hook — auto-switches to slim for Task-spawned agents
- Project-level .mcp.json — disables unneeded MCP servers per project
These save ~20k tokens but require manual setup and maintenance.
## Environment
- Claude Code Version: 2.1+
- Model: Claude Opus 4.5
- Platform: macOS
Happy to provide more details or help test implementations.