
feat: add context tracking evaluation system for trace viewer

Open · amikofalvy opened this issue 2 months ago • 3 comments

Summary

  • Add a context tracking evaluation system that shows token usage breakdown for each component of the LLM context window in the trace viewer
  • Uses character-based token approximation (~4 chars/token) which works universally across all models (OpenAI, Anthropic, Gemini, custom)
  • Records context breakdown as OTEL span attributes on agent.generate spans
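As a rough illustration of the approach described above, a character-based estimator can be sketched in a few lines. This is a hypothetical sketch, not the PR's actual `token-estimator.ts`; the rounding behavior and the subset of `ContextBreakdown` fields shown are assumptions.

```ts
// Assumed ratio from the PR description (~4 chars/token).
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  // Round up so short non-empty strings still count as at least one token
  // (rounding strategy is an assumption, not confirmed from the PR).
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Illustrative subset of the ContextBreakdown fields; the real interface
// tracks all eleven components listed under "Components Tracked" below.
interface ContextBreakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  conversationHistory: number;
  total: number;
}
```

The appeal of this approach is that it needs no tokenizer dependency and gives comparable numbers across OpenAI, Anthropic, Gemini, and custom models.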

Changes

Backend (agents-run-api)

  • token-estimator.ts: New utility for character-based token estimation with ContextBreakdown interface
  • Phase1Config.ts / Phase2Config.ts: Return AssembleResult with prompt and breakdown from assembly
  • SystemPromptBuilder.ts: Pass through breakdown data
  • Agent.ts: Record 12 context breakdown attributes as OTEL span attributes

Shared (agents-core)

  • otel-attributes.ts: Add context.breakdown.* attribute keys

Frontend (agents-manage-ui)

  • context-breakdown.tsx: New visualization component with stacked bar chart and detailed list
  • conversation-detail.tsx: Integrate breakdown component into summary cards
  • route.ts: Parse breakdown attributes from span data

Components Tracked

| Component | Description |
| --- | --- |
| System Prompt Template | Base XML template tokens |
| Core Instructions | Core prompt instructions |
| Agent Prompt | Agent-level context |
| Tools Section | MCP, function, and relation tools |
| Artifacts Section | Artifact definitions |
| Data Components | Phase 2 data component instructions |
| Artifact Components | Artifact creation instructions |
| Transfer Instructions | Agent transfer guidance |
| Delegation Instructions | Agent delegation guidance |
| Thinking Preparation | Extended thinking instructions |
| Conversation History | Chat history tokens |

Test plan

  • [ ] Verify type checking passes across all packages
  • [ ] Test trace viewer shows context breakdown card for agent generation spans
  • [ ] Verify breakdown values are reasonable (sum to total)
  • [ ] Test with different agent configurations to ensure all components are tracked

amikofalvy avatar Dec 01 '25 21:12 amikofalvy

🦋 Changeset detected

Latest commit: fa0aa24c59235ca367d17cc9dc98e63233b3b7c6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
| Name | Type |
| --- | --- |
| @inkeep/agents-core | Patch |
| @inkeep/agents-manage-ui | Patch |
| @inkeep/agents-run-api | Patch |
| @inkeep/agents-cli | Patch |
| @inkeep/agents-manage-api | Patch |
| @inkeep/agents-manage-mcp | Patch |
| @inkeep/agents-sdk | Patch |
| @inkeep/ai-sdk-provider | Patch |
| @inkeep/create-agents | Patch |


changeset-bot[bot] avatar Dec 01 '25 21:12 changeset-bot[bot]

The latest updates on your projects.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| agents-docs | Ready | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-manage-api | Error | | Dec 22, 2025 7:35pm |
| agents-manage-ui | Ready | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-run-api | Ready | Preview, Comment | Dec 22, 2025 7:35pm |

vercel[bot] avatar Dec 01 '25 21:12 vercel[bot]

Claude finished @amikofalvy's task


TypeScript Staff Engineer & System Architect Review

I've completed a comprehensive review of the context tracking evaluation system. This is a well-designed feature that adds valuable observability to LLM context usage. Here are my key findings and recommendations:

Review Tasks

  • [x] Understand existing codebase patterns and architecture
  • [x] Review token estimation utility and approach
  • [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
  • [x] Review shared package changes (otel-attributes)
  • [x] Evaluate frontend visualization component
  • [x] Check for consistency across packages and existing patterns
  • [x] Assess extensibility and design decisions
  • [x] Provide comprehensive feedback

Strengths & Positive Aspects

Excellent Architecture & Design:

  • Smart choice of character-based token approximation (~4 chars/token) for universal compatibility
  • Well-structured separation between token estimation utility, prompt assembly, and OTEL instrumentation
  • Consistent AssembleResult return pattern across Phase1 and Phase2 configurations
  • Thoughtful breakdown into logical components that align with actual prompt structure

Strong Implementation Quality:

  • Comprehensive OTEL attribute coverage with consistent naming (context.breakdown.*)
  • Robust frontend visualization with proper responsive design and accessibility considerations
  • Good error handling and fallback patterns (e.g., getNumber with fallbacks in route.ts)
  • Proper TypeScript typing throughout

🔍 Areas for Consideration

1. Token Estimation Accuracy

Severity: Medium

The 4 chars/token approximation may be less accurate for some content types:

```ts
// agents-run-api/src/utils/token-estimator.ts:12
const CHARS_PER_TOKEN = 4;
```

Considerations:

  • XML/structured content typically has lower char/token ratios (~3.5)
  • Code snippets may have higher ratios (~4.5-5)
  • Different models have varying tokenization strategies

Recommendations:

  • Consider exposing the ratio as a configuration parameter
  • Add documentation about accuracy expectations
  • Consider future extensibility for model-specific estimators
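One way the "configurable ratio" recommendation could look is to make the ratio an optional parameter with the current value as the default. This is a sketch of a possible signature, not the utility's actual API:

```ts
// Current hardcoded value becomes the default.
const DEFAULT_CHARS_PER_TOKEN = 4;

function estimateTokens(
  text: string,
  charsPerToken: number = DEFAULT_CHARS_PER_TOKEN,
): number {
  return Math.ceil(text.length / charsPerToken);
}

// Callers could then tune per content type, e.g. a lower ratio
// (~3.5) for XML-heavy prompts as noted above:
const xmlTokens = estimateTokens("<agent><tools/></agent>", 3.5);
```

Defaulting keeps every existing call site working while opening the door to model-specific tuning later.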

2. Calculation Consistency

Severity: Low-Medium

Total calculation is duplicated across files, creating potential maintenance issues:

```ts
// agents-run-api/src/agents/Agent.ts:1056-1065 (manual calculation)
contextBreakdown.total =
  contextBreakdown.systemPromptTemplate +
  contextBreakdown.coreInstructions +
  // ... 9 more lines

// vs agents-run-api/src/utils/token-estimator.ts:82-95 (helper function)
export function calculateBreakdownTotal(breakdown: ContextBreakdown): ContextBreakdown
```

Recommendation: Use the calculateBreakdownTotal helper consistently instead of manual recalculation in Agent.ts:1056.
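A sketch of how such a helper might compute the total generically, so new components are picked up without editing a manual sum. The field names are assumed from the components table and the helper's internals are hypothetical, not the PR's actual implementation:

```ts
// Illustrative subset; the real interface has eleven component fields.
interface ContextBreakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  conversationHistory: number;
  total: number;
}

function calculateBreakdownTotal(b: ContextBreakdown): ContextBreakdown {
  // Sum every numeric field except `total` itself, so adding a new
  // component field never requires touching this function.
  const total = Object.entries(b)
    .filter(([key]) => key !== "total")
    .reduce((sum, [, value]) => sum + value, 0);
  return { ...b, total };
}
```

Agent.ts could then replace the eleven-line manual addition with a single `calculateBreakdownTotal(contextBreakdown)` call.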

3. Frontend Data Safety

Severity: Low

The frontend parsing could be more defensive:

```ts
// agents-manage-ui/src/app/api/signoz/conversations/[conversationId]/route.ts:568-580
contextBreakdown = {
  systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
  // ... potential NaN issues if non-numeric values exist
};
```

Recommendation: Use the existing getNumber helper function which includes Number.isFinite checks for better safety.
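For reference, the shape such a helper typically takes (this is a hypothetical reconstruction of `getNumber`, not the codebase's actual implementation):

```ts
// Coerce to number, reject NaN/Infinity, fall back to a default.
function getNumber(value: unknown, fallback = 0): number {
  const n = Number(value);
  return Number.isFinite(n) ? n : fallback;
}

// Unlike `Number(x) || 0`, this distinguishes a genuine 0 from a
// parse failure only via the finiteness check, and never yields NaN.
const tokens = getNumber("not-a-number"); // falls back to 0
```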

4. Architecture & Extensibility

System Design Considerations:

  • The character-based approach is pragmatic but may need evolution for different use cases
  • Consider future scenarios where users might want:
    • Model-specific token counting
    • Cost estimation (tokens × cost per token)
    • Context window utilization percentages
    • Historical trending data
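The cost-estimation scenario in particular falls out almost directly from the breakdown total. A minimal sketch, with illustrative (not real) pricing and a hypothetical `ModelRates` type:

```ts
// Hypothetical per-model rate card; numbers here are placeholders.
interface ModelRates {
  inputCostPerMillionTokens: number;
}

function estimateContextCost(totalTokens: number, rates: ModelRates): number {
  // tokens × cost per token, expressed via the common per-million pricing.
  return (totalTokens / 1_000_000) * rates.inputCostPerMillionTokens;
}

// e.g. 40,000 estimated context tokens at $3 per million input tokens:
const cost = estimateContextCost(40_000, { inputCostPerMillionTokens: 3 });
```

Context-window utilization would be similarly cheap to derive: `total / modelContextWindow` as a percentage.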

Naming & Semantics: The OTEL attribute naming is well-structured and follows existing patterns. The context.breakdown.* namespace clearly separates these metrics.

🚀 Recommended Actions

High Priority:

  1. Use consistent total calculation - Replace manual total calculation in Agent.ts with calculateBreakdownTotal helper
  2. Add defensive parsing - Use getNumber helper in route.ts for safer attribute parsing

Medium Priority:

  3. Enhance documentation - Add accuracy expectations and model compatibility notes to token-estimator.ts
  4. Consider configurability - Make chars/token ratio configurable for future model-specific tuning

Low Priority:

  5. Add validation - Consider adding runtime validation that breakdown components sum to total
  6. Future extensibility - Plan for cost estimation and context window percentage features

📊 Component Analysis

Frontend Visualization (context-breakdown.tsx):

  • Excellent responsive design with stacked bar chart + detailed breakdown
  • Proper accessibility with titles and color-coded indicators
  • Smart filtering (skips segments <0.5% in stacked view)
  • Good use of Tailwind utility classes and proper TypeScript typing

Backend Integration:

  • Seamless OTEL integration following existing patterns
  • Proper span attribute recording on agent.generate spans
  • Consistent with existing telemetry approach

Data Flow:

  • Clean data flow: Phase1/Phase2Config → Agent.ts → OTEL → SigNoz → Frontend
  • Proper separation of concerns between estimation, assembly, and instrumentation

🎯 Overall Assessment

This is a high-quality implementation that adds significant value for understanding LLM context utilization. The architecture is sound, the code is well-structured, and it follows existing patterns effectively. The character-based approximation is a pragmatic choice that balances accuracy with simplicity.

The feature provides immediate utility for developers debugging context issues and understanding where tokens are being consumed, which is crucial for managing costs and context window limits.

Approval Recommendation: Approved with minor suggestions

The identified issues are mostly minor improvements and future considerations rather than blocking concerns. This PR is ready for merge with the understanding that the suggested refinements could be addressed in follow-up work.


claude[bot] avatar Dec 01 '25 21:12 claude[bot]

@claude review

shagun-singh-inkeep avatar Dec 15 '25 16:12 shagun-singh-inkeep

Claude finished @shagun-singh-inkeep's task in 2m 48s


Code Review for Context Tracking Evaluation System

I've completed a comprehensive review of the PR that adds a context tracking evaluation system for the trace viewer. This feature tracks token usage breakdown for LLM context windows using character-based approximation.

Review Tasks:

  • [x] Read repository guidelines (CLAUDE.md)
  • [x] Review token estimation utility and approach
  • [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
  • [x] Review shared package changes (otel-attributes)
  • [x] Evaluate frontend visualization component
  • [x] Check for consistency across packages
  • [x] Assess overall architecture and design decisions
  • [x] Provide comprehensive feedback

Overall Assessment

This is a well-designed and implemented feature that adds valuable observability to LLM context usage. The character-based token approximation approach (~4 chars/token) is pragmatic and provides universal compatibility across different models.

✅ Strengths

Architecture & Design

  • Smart token estimation approach: The 4 chars/token approximation works universally across all models (OpenAI, Anthropic, Gemini) without external dependencies
  • Clean separation of concerns: Well-structured with dedicated token estimation utility, consistent AssembleResult pattern, and proper OTEL instrumentation
  • Comprehensive tracking: Covers all major context components (system prompt, instructions, tools, artifacts, conversation history, etc.)

Implementation Quality

  • Type safety: Properly typed interfaces (ContextBreakdown, AssembleResult) with consistent usage across packages
  • OTEL integration: Follows existing telemetry patterns with well-named attributes (context.breakdown.*)
  • Frontend visualization: Excellent responsive UI with both stacked bar chart and detailed breakdown, proper accessibility considerations

Code Organization

  • Consistent patterns: The AssembleResult return type is used consistently across Phase1Config and Phase2Config
  • Helper functions: Good utilities like createEmptyBreakdown(), estimateTokens(), and calculateBreakdownTotal()
  • Error handling: Proper fallbacks with || 0 in frontend parsing

🔍 Areas for Improvement

1. Calculation Consistency (Verified)

The calculateBreakdownTotal() helper function exists in token-estimator.ts and appears to be called correctly in Agent.ts:2279, so totals are computed consistently through the helper. Good use of the helper function.

2. Frontend Data Parsing Safety (Minor)

In route.ts:1137-1149, the parsing uses Number() with || 0 fallback, which is safe but could be more robust:

```ts
// Current approach is acceptable but could use existing helper
systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
```

Consider using a helper function like getNumber() if it exists in the codebase for consistency with other parsing patterns.

3. Token Estimation Accuracy Documentation (Suggestion)

The 4 chars/token ratio is documented in comments but could benefit from more detailed accuracy expectations:

  • XML/structured content typically has lower ratios (~3.5 chars/token)
  • Code snippets may have higher ratios (~4.5-5 chars/token)
  • Different models have varying tokenization strategies

Consider adding a comment block in token-estimator.ts explaining accuracy expectations and potential future enhancements.

4. Visualization Threshold (Minor)

The frontend skips segments <0.5% in the stacked bar (context-breakdown.tsx:132). This is reasonable but could be made configurable if needed in the future.

🎯 Recommendations

High Priority

None - the implementation is solid and ready for production.

Medium Priority

  1. Documentation: Add more detailed comments about token estimation accuracy expectations in token-estimator.ts
  2. Consistency: Consider using existing number parsing helpers in the frontend if available

Low Priority

  1. Future extensibility: Consider making the chars/token ratio configurable for future model-specific tuning
  2. Validation: Could add runtime validation that breakdown components sum to total (though the current implementation with calculateBreakdownTotal() makes this unlikely to be an issue)

📊 Component-by-Component Review

Backend (agents-run-api)

  • token-estimator.ts: Clean, well-documented utility with appropriate interfaces
  • Agent.ts: Proper integration with OTEL span attributes, correct use of helper functions
  • Phase1/Phase2Config: Consistent implementation of AssembleResult return type
  • SystemPromptBuilder: Correctly passes through breakdown data

Shared (agents-core)

  • otel-attributes.ts: Well-organized attribute constants following existing naming patterns

Frontend (agents-manage-ui)

  • context-breakdown.tsx: Excellent visualization component with responsive design
  • route.ts: Proper parsing of OTEL attributes with fallbacks
  • types.ts: Clean interface definition matching backend structure

✅ Test Coverage

The test plan in the PR description is comprehensive. The changes to test files show proper adaptation to the new AssembleResult return type.

Conclusion

This is a high-quality implementation that adds immediate value for developers debugging context issues and managing token costs. The architecture is sound, the code follows existing patterns well, and the feature is production-ready.

Recommendation: ✅ APPROVED

The minor suggestions above can be addressed in follow-up PRs if desired, but none are blocking issues.

claude[bot] avatar Dec 15 '25 16:12 claude[bot]