
Feature Request: Lazy-Loading Architecture for Token Optimization (~70% reduction possible)

Open · rfenaux opened this issue 3 weeks ago · 1 comment

Summary

A simple "hi" message currently costs ~53k tokens before any conversation begins. This proposal outlines architectural changes to reduce baseline context overhead by 60-70% through lazy-loading patterns.

Core Insight: Human memory doesn't load everything at once — it uses cue-based retrieval. Claude Code should work the same way.

Current:  Load ALL → Process request → Respond
Proposed: Load INDEX → Match cues → Fetch relevant → Respond

The Problem

Current token breakdown for a fresh session:

| Component | Tokens | % | Controllable? |
|---|---|---|---|
| System tools | 20.4k | 38% | No |
| Memory files (CLAUDE.md) | 10-18k | 19-34% | Yes |
| MCP tools | 9.1k | 17% | Partially |
| Custom agents (in Task tool) | 3.3k | 6% | Yes |
| Skills (in Skill tool) | 2.6k | 5% | Yes |

~35k tokens are controllable but currently load upfront regardless of need.


Proposed Features

1. Lazy-Loading Memory Files (CLAUDE.md)

Problem: All CLAUDE.md files load immediately (~10-18k tokens)

Proposed: Index-based retrieval:

# CLAUDE.md.index (~500 tokens)
sections:
  timestamps:
    triggers: ["timestamp", "time format"]
    file: CLAUDE.md#timestamps
  memory_system:
    triggers: ["RAG", "memory", "CTM"]
    file: RAG_GUIDE.md
  agents:
    triggers: ["agent", "delegate"]
    file: AGENTS_INDEX.md

Fetch matching sections on-demand based on user message keywords.

Savings: ~15k tokens
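
As a sketch of the cue-matching step, the index above could be consulted like this. The `INDEX` dict mirrors the `CLAUDE.md.index` example, and `match_sections` is an illustrative name, not an existing Claude Code API:

```python
# Sketch of cue-based retrieval over a parsed CLAUDE.md.index.
# INDEX mirrors the YAML example above; match_sections is hypothetical.

INDEX = {
    "timestamps":    {"triggers": ["timestamp", "time format"], "file": "CLAUDE.md#timestamps"},
    "memory_system": {"triggers": ["rag", "memory", "ctm"],     "file": "RAG_GUIDE.md"},
    "agents":        {"triggers": ["agent", "delegate"],        "file": "AGENTS_INDEX.md"},
}

def match_sections(user_message: str) -> list[str]:
    """Return the files whose trigger keywords appear in the message."""
    msg = user_message.lower()
    return [
        entry["file"]
        for entry in INDEX.values()
        if any(trigger in msg for trigger in entry["triggers"])
    ]

# Only the ~500-token index stays resident; matched sections load on demand.
print(match_sections("Which timestamp format should agents use?"))
```

A message with no matching cues loads nothing beyond the index, which is where the savings come from.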


2. Lazy-Loading Agent Definitions in Task Tool

Problem: Task tool embeds ALL agent descriptions (~82 agents = 3.3k+ tokens)

Proposed: Summary list with on-demand fetch:

{
  "name": "Task",
  "description": "Launch agent. Available: Explore, Plan, Bash, worker, ... (82 total)",
  "dynamic_schema": {
    "subagent_type": { "fetch_on_use": true }
  }
}

Savings: ~3k tokens
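
A minimal sketch of the `fetch_on_use` idea: the system prompt carries only the one-line roster, and a single agent definition is pulled in the first time that `subagent_type` is invoked. `AGENT_STORE` and `fetch_agent_definition` are hypothetical names for illustration:

```python
# Hypothetical lazy agent registry: full definitions live outside the
# system prompt and are fetched per subagent_type on first use.

AGENT_STORE = {
    "Explore": {"description": "Read-only codebase exploration agent", "tools": ["Read", "Grep"]},
    "Plan":    {"description": "Planning agent", "tools": ["Read"]},
}

_cache: dict[str, dict] = {}

def fetch_agent_definition(subagent_type: str) -> dict:
    """Load one agent's schema on first use, then serve it from cache."""
    if subagent_type not in _cache:
        _cache[subagent_type] = AGENT_STORE[subagent_type]  # one fetch, not all 82
    return _cache[subagent_type]

print(fetch_agent_definition("Explore")["tools"])  # ['Read', 'Grep']
```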


3. Lazy-Loading Skill Definitions

Problem: Skill tool embeds all skill descriptions (~44 skills = 2.6k tokens)

Proposed: Same pattern as agents — summary list, fetch full schema on invoke.

Savings: ~2k tokens


4. Sub-Agent Context Inheritance Control

Problem: Sub-agents inherit full parent context including entire CLAUDE.md.

Proposed: New Task parameter:

Task:
  subagent_type: "Explore"
  prompt: "Find TypeScript files"
  context_inheritance: "minimal"  # NEW: minimal | full | none

Savings: ~8k tokens per sub-agent
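
The three modes could gate the sub-agent's context roughly like this. `build_subagent_context` and the key names are assumptions drawn from the proposal, not an existing API:

```python
# Sketch of a context_inheritance gate for Task-spawned sub-agents.
# The modes (none | minimal | full) follow the proposed parameter above;
# build_subagent_context and the context keys are illustrative.

def build_subagent_context(parent_context: dict, mode: str = "full") -> dict:
    if mode == "none":
        return {}
    if mode == "minimal":
        # Keep only what a focused task needs; drop the full CLAUDE.md body.
        return {k: parent_context[k] for k in ("cwd", "memory_index") if k in parent_context}
    return dict(parent_context)  # "full": today's behaviour

parent = {"cwd": "/repo", "memory_index": "...", "claude_md": "10-18k tokens of memory"}
assert "claude_md" not in build_subagent_context(parent, "minimal")
```

Under "minimal", an Explore-style sub-agent would still see the lightweight index and could cue-fetch sections itself if needed.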


5. Usage-Based Tool Pruning

Problem: Tools/agents never used still load every session.

Proposed: Track usage, suggest disabling after N sessions of non-use:

📊 Token Optimization Suggestions

These tools haven't been used in 30+ sessions:
- mcp__fathom__create_webhook (214 tokens)
- agent: mermaid-converter (38 tokens)

Disable to save ~400 tokens? [y/n/never ask]

Savings: 1-5k tokens (varies by user)
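
The tracking side could be as simple as counting idle sessions per tool and surfacing anything past a threshold. `UsageTracker` is an illustrative sketch, not part of Claude Code:

```python
# Sketch of usage-based pruning: count sessions since each tool was last
# invoked and flag candidates past a threshold for a disable prompt.

from collections import defaultdict

class UsageTracker:
    def __init__(self, threshold: int = 30):
        self.threshold = threshold
        self.sessions_idle = defaultdict(int)  # tool name -> idle session count

    def end_session(self, used_tools: set, all_tools: set) -> None:
        for tool in all_tools:
            self.sessions_idle[tool] = 0 if tool in used_tools else self.sessions_idle[tool] + 1

    def prune_candidates(self) -> list:
        """Tools idle for `threshold`+ sessions, eligible for the suggestion UI."""
        return [t for t, n in self.sessions_idle.items() if n >= self.threshold]

tracker = UsageTracker(threshold=2)
tools = {"mcp__fathom__create_webhook", "Read"}
tracker.end_session(used_tools={"Read"}, all_tools=tools)
tracker.end_session(used_tools={"Read"}, all_tools=tools)
print(tracker.prune_candidates())  # ['mcp__fathom__create_webhook']
```

Resetting the counter on any use keeps actively needed tools off the suggestion list.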


Total Impact

| Metric | Current | Optimized | Reduction |
|---|---|---|---|
| Baseline for "hi" | ~53k | ~15k | ~70% |
| With 3 sub-agents | ~80k | ~20k | ~75% |

The Brain Analogy

Human memory architecture that inspired this:

| Human Memory | Proposed Claude Code |
|---|---|
| Index/tags always loaded | CLAUDE.md.index (~500 tokens) |
| Full memories fetched on cue | Sections fetched on keyword match |
| Recent memories cached | Session cache |
| Unused memories pruned | Usage-based suggestions |
| Context-dependent recall | Project-type detection |

The current architecture is like a human who recites their entire autobiography before every conversation.


Current Workarounds

We've implemented partial solutions:

  • Slim/Full CLAUDE.md switcher — wrapper script with `--slim` flag
  • Sub-agent detection hook — auto-switches to slim for Task-spawned agents
  • Project-level .mcp.json — disables unneeded MCP servers per project

These save ~20k tokens but require manual setup and maintenance.


Environment

  • Claude Code Version: 2.1+
  • Model: Claude Opus 4.5
  • Platform: macOS

Happy to provide more details or help test implementations.

rfenaux · Jan 19 '26 00:01