
Feature Request: Lazy-Loading Architecture for Token Optimization (~70% reduction possible)

Open · rfenaux opened this issue 3 weeks ago · 1 comment

Summary

A simple "hi" message currently costs ~53k tokens before any conversation begins. This proposal outlines architectural changes to reduce baseline context overhead by 60-70% through lazy-loading patterns.

Core Insight: Human memory doesn't load everything at once — it uses cue-based retrieval. Claude Code should work the same way.

Current:  Load ALL → Process request → Respond
Proposed: Load INDEX → Match cues → Fetch relevant → Respond

The Problem

Current token breakdown for a fresh session:

| Component | Tokens | % | Controllable? |
|---|---|---|---|
| System tools | 20.4k | 38% | No |
| Memory files (CLAUDE.md) | 10-18k | 19-34% | Yes |
| MCP tools | 9.1k | 17% | Partially |
| Custom agents (in Task tool) | 3.3k | 6% | Yes |
| Skills (in Skill tool) | 2.6k | 5% | Yes |

~35k tokens are controllable but currently load upfront regardless of need.


Proposed Features

1. Lazy-Loading Memory Files (CLAUDE.md)

Problem: All CLAUDE.md files load immediately (~10-18k tokens)

Proposed: Index-based retrieval:

# CLAUDE.md.index (~500 tokens)
sections:
  timestamps:
    triggers: ["timestamp", "time format"]
    file: CLAUDE.md#timestamps
  memory_system:
    triggers: ["RAG", "memory", "CTM"]
    file: RAG_GUIDE.md
  agents:
    triggers: ["agent", "delegate"]
    file: AGENTS_INDEX.md

Fetch matching sections on-demand based on user message keywords.

Savings: ~15k tokens
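
As a sketch of the cue-matching step, the index above could be consulted like this. The `INDEX` dict mirrors the `CLAUDE.md.index` example, and `match_sections` is an illustrative name, not an existing Claude Code API:

```python
# Sketch of cue-based retrieval over a parsed CLAUDE.md.index.
# INDEX mirrors the YAML example above; match_sections is hypothetical.

INDEX = {
    "timestamps":    {"triggers": ["timestamp", "time format"], "file": "CLAUDE.md#timestamps"},
    "memory_system": {"triggers": ["rag", "memory", "ctm"],     "file": "RAG_GUIDE.md"},
    "agents":        {"triggers": ["agent", "delegate"],        "file": "AGENTS_INDEX.md"},
}

def match_sections(user_message: str) -> list[str]:
    """Return the files whose trigger keywords appear in the message."""
    msg = user_message.lower()
    return [
        entry["file"]
        for entry in INDEX.values()
        if any(trigger in msg for trigger in entry["triggers"])
    ]

# Only the ~500-token index stays resident; matched sections load on demand.
print(match_sections("Which timestamp format should agents use?"))
```

A message with no matching cues loads nothing beyond the index, which is where the savings come from.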


2. Lazy-Loading Agent Definitions in Task Tool

Problem: Task tool embeds ALL agent descriptions (~82 agents = 3.3k+ tokens)

Proposed: Summary list with on-demand fetch:

{
  "name": "Task",
  "description": "Launch agent. Available: Explore, Plan, Bash, worker, ... (82 total)",
  "dynamic_schema": {
    "subagent_type": { "fetch_on_use": true }
  }
}

Savings: ~3k tokens
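
A minimal sketch of the `fetch_on_use` idea: the system prompt carries only the one-line roster, and a single agent definition is pulled in the first time that `subagent_type` is invoked. `AGENT_STORE` and `fetch_agent_definition` are hypothetical names for illustration:

```python
# Hypothetical lazy agent registry: full definitions live outside the
# system prompt and are fetched per subagent_type on first use.

AGENT_STORE = {
    "Explore": {"description": "Read-only codebase exploration agent", "tools": ["Read", "Grep"]},
    "Plan":    {"description": "Planning agent", "tools": ["Read"]},
}

_cache: dict[str, dict] = {}

def fetch_agent_definition(subagent_type: str) -> dict:
    """Load one agent's schema on first use, then serve it from cache."""
    if subagent_type not in _cache:
        _cache[subagent_type] = AGENT_STORE[subagent_type]  # one fetch, not all 82
    return _cache[subagent_type]

print(fetch_agent_definition("Explore")["tools"])  # ['Read', 'Grep']
```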


3. Lazy-Loading Skill Definitions

Problem: Skill tool embeds all skill descriptions (~44 skills = 2.6k tokens)

Proposed: Same pattern as agents — summary list, fetch full schema on invoke.

Savings: ~2k tokens


4. Sub-Agent Context Inheritance Control

Problem: Sub-agents inherit full parent context including entire CLAUDE.md.

Proposed: New Task parameter:

Task:
  subagent_type: "Explore"
  prompt: "Find TypeScript files"
  context_inheritance: "minimal"  # NEW: minimal | full | none

Savings: ~8k tokens per sub-agent
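
The three modes could gate the sub-agent's context roughly like this. `build_subagent_context` and the key names are assumptions drawn from the proposal, not an existing API:

```python
# Sketch of a context_inheritance gate for Task-spawned sub-agents.
# The modes (none | minimal | full) follow the proposed parameter above;
# build_subagent_context and the context keys are illustrative.

def build_subagent_context(parent_context: dict, mode: str = "full") -> dict:
    if mode == "none":
        return {}
    if mode == "minimal":
        # Keep only what a focused task needs; drop the full CLAUDE.md body.
        return {k: parent_context[k] for k in ("cwd", "memory_index") if k in parent_context}
    return dict(parent_context)  # "full": today's behaviour

parent = {"cwd": "/repo", "memory_index": "...", "claude_md": "10-18k tokens of memory"}
assert "claude_md" not in build_subagent_context(parent, "minimal")
```

Under "minimal", an Explore-style sub-agent would still see the lightweight index and could cue-fetch sections itself if needed.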


5. Usage-Based Tool Pruning

Problem: Tools/agents never used still load every session.

Proposed: Track usage, suggest disabling after N sessions of non-use:

📊 Token Optimization Suggestions

These tools haven't been used in 30+ sessions:
- mcp__fathom__create_webhook (214 tokens)
- agent: mermaid-converter (38 tokens)

Disable to save ~400 tokens? [y/n/never ask]

Savings: 1-5k tokens (varies by user)
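
The tracking side could be as simple as counting idle sessions per tool and surfacing anything past a threshold. `UsageTracker` is an illustrative sketch, not part of Claude Code:

```python
# Sketch of usage-based pruning: count sessions since each tool was last
# invoked and flag candidates past a threshold for a disable prompt.

from collections import defaultdict

class UsageTracker:
    def __init__(self, threshold: int = 30):
        self.threshold = threshold
        self.sessions_idle = defaultdict(int)  # tool name -> idle session count

    def end_session(self, used_tools: set, all_tools: set) -> None:
        for tool in all_tools:
            self.sessions_idle[tool] = 0 if tool in used_tools else self.sessions_idle[tool] + 1

    def prune_candidates(self) -> list:
        """Tools idle for `threshold`+ sessions, eligible for the suggestion UI."""
        return [t for t, n in self.sessions_idle.items() if n >= self.threshold]

tracker = UsageTracker(threshold=2)
tools = {"mcp__fathom__create_webhook", "Read"}
tracker.end_session(used_tools={"Read"}, all_tools=tools)
tracker.end_session(used_tools={"Read"}, all_tools=tools)
print(tracker.prune_candidates())  # ['mcp__fathom__create_webhook']
```

Resetting the counter on any use keeps actively needed tools off the suggestion list.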


Total Impact

| Metric | Current | Optimized | Reduction |
|---|---|---|---|
| Baseline for "hi" | ~53k | ~15k | ~70% |
| With 3 sub-agents | ~80k | ~20k | ~75% |

The Brain Analogy

Human memory architecture that inspired this:

| Human Memory | Proposed Claude Code |
|---|---|
| Index/tags always loaded | CLAUDE.md.index (~500 tokens) |
| Full memories fetched on cue | Sections fetched on keyword match |
| Recent memories cached | Session cache |
| Unused memories pruned | Usage-based suggestions |
| Context-dependent recall | Project-type detection |

The current architecture is like a human who recites their entire autobiography before every conversation.


Current Workarounds

We've implemented partial solutions:

  • Slim/Full CLAUDE.md switcher — wrapper script with `--slim` flag
  • Sub-agent detection hook — auto-switches to slim for Task-spawned agents
  • Project-level .mcp.json — disables unneeded MCP servers per project

These save ~20k tokens but require manual setup and maintenance.


Environment

  • Claude Code Version: 2.1+
  • Model: Claude Opus 4.5
  • Platform: macOS

Happy to provide more details or help test implementations.

rfenaux · Jan 19 '26 00:01