
feat: add context tracking evaluation system for trace viewer

Open · amikofalvy opened this issue 2 months ago • 3 comments

Summary

  • Add a context tracking evaluation system that shows token usage breakdown for each component of the LLM context window in the trace viewer
  • Uses character-based token approximation (~4 chars/token) which works universally across all models (OpenAI, Anthropic, Gemini, custom)
  • Records context breakdown as OTEL span attributes on agent.generate spans
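As a rough illustration of the approach described above, a character-based estimator can be sketched in a few lines. This is a hypothetical sketch, not the PR's actual `token-estimator.ts`; the rounding behavior and the subset of `ContextBreakdown` fields shown are assumptions.

```ts
// Assumed ratio from the PR description (~4 chars/token).
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  // Round up so short non-empty strings still count as at least one token
  // (rounding strategy is an assumption, not confirmed from the PR).
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Illustrative subset of the ContextBreakdown fields; the real interface
// tracks all eleven components listed under "Components Tracked" below.
interface ContextBreakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  conversationHistory: number;
  total: number;
}
```

The appeal of this approach is that it needs no tokenizer dependency and gives comparable numbers across OpenAI, Anthropic, Gemini, and custom models.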

Changes

Backend (agents-run-api)

  • token-estimator.ts: New utility for character-based token estimation with ContextBreakdown interface
  • Phase1Config.ts / Phase2Config.ts: Return AssembleResult with prompt and breakdown from assembly
  • SystemPromptBuilder.ts: Pass through breakdown data
  • Agent.ts: Record 12 context breakdown attributes as OTEL span attributes

Shared (agents-core)

  • otel-attributes.ts: Add context.breakdown.* attribute keys

Frontend (agents-manage-ui)

  • context-breakdown.tsx: New visualization component with stacked bar chart and detailed list
  • conversation-detail.tsx: Integrate breakdown component into summary cards
  • route.ts: Parse breakdown attributes from span data

Components Tracked

| Component | Description |
| --- | --- |
| System Prompt Template | Base XML template tokens |
| Core Instructions | Core prompt instructions |
| Agent Prompt | Agent-level context |
| Tools Section | MCP, function, and relation tools |
| Artifacts Section | Artifact definitions |
| Data Components | Phase 2 data component instructions |
| Artifact Components | Artifact creation instructions |
| Transfer Instructions | Agent transfer guidance |
| Delegation Instructions | Agent delegation guidance |
| Thinking Preparation | Extended thinking instructions |
| Conversation History | Chat history tokens |

Test plan

  • [ ] Verify type checking passes across all packages
  • [ ] Test trace viewer shows context breakdown card for agent generation spans
  • [ ] Verify breakdown values are reasonable (sum to total)
  • [ ] Test with different agent configurations to ensure all components are tracked

amikofalvy avatar Dec 01 '25 21:12 amikofalvy

🦋 Changeset detected

Latest commit: fa0aa24c59235ca367d17cc9dc98e63233b3b7c6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
| Name | Type |
| --- | --- |
| @inkeep/agents-core | Patch |
| @inkeep/agents-manage-ui | Patch |
| @inkeep/agents-run-api | Patch |
| @inkeep/agents-cli | Patch |
| @inkeep/agents-manage-api | Patch |
| @inkeep/agents-manage-mcp | Patch |
| @inkeep/agents-sdk | Patch |
| @inkeep/ai-sdk-provider | Patch |
| @inkeep/create-agents | Patch |


changeset-bot[bot] avatar Dec 01 '25 21:12 changeset-bot[bot]

The latest updates on your projects.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| agents-docs | Ready | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-manage-api | Error | | Dec 22, 2025 7:35pm |
| agents-manage-ui | Ready | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-run-api | Ready | Preview, Comment | Dec 22, 2025 7:35pm |

vercel[bot] avatar Dec 01 '25 21:12 vercel[bot]

Claude finished @amikofalvy's task


TypeScript Staff Engineer & System Architect Review

I've completed a comprehensive review of the context tracking evaluation system. This is a well-designed feature that adds valuable observability to LLM context usage. Here are my key findings and recommendations:

Review Tasks

  • [x] Understand existing codebase patterns and architecture
  • [x] Review token estimation utility and approach
  • [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
  • [x] Review shared package changes (otel-attributes)
  • [x] Evaluate frontend visualization component
  • [x] Check for consistency across packages and existing patterns
  • [x] Assess extensibility and design decisions
  • [x] Provide comprehensive feedback

Strengths & Positive Aspects

Excellent Architecture & Design:

  • Smart choice of character-based token approximation (~4 chars/token) for universal compatibility
  • Well-structured separation between token estimation utility, prompt assembly, and OTEL instrumentation
  • Consistent AssembleResult return pattern across Phase1 and Phase2 configurations
  • Thoughtful breakdown into logical components that align with actual prompt structure

Strong Implementation Quality:

  • Comprehensive OTEL attribute coverage with consistent naming (context.breakdown.*)
  • Robust frontend visualization with proper responsive design and accessibility considerations
  • Good error handling and fallback patterns (e.g., getNumber with fallbacks in route.ts)
  • Proper TypeScript typing throughout

🔍 Areas for Consideration

1. Token Estimation Accuracy

Severity: Medium

The 4 chars/token approximation may be less accurate for some content types:

```ts
// agents-run-api/src/utils/token-estimator.ts:12
const CHARS_PER_TOKEN = 4;
```

Considerations:

  • XML/structured content typically has lower char/token ratios (~3.5)
  • Code snippets may have higher ratios (~4.5-5)
  • Different models have varying tokenization strategies

Recommendations:

  • Consider exposing the ratio as a configuration parameter
  • Add documentation about accuracy expectations
  • Consider future extensibility for model-specific estimators
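One way the "configurable ratio" recommendation could look is to make the ratio an optional parameter with the current value as the default. This is a sketch of a possible signature, not the utility's actual API:

```ts
// Current hardcoded value becomes the default.
const DEFAULT_CHARS_PER_TOKEN = 4;

function estimateTokens(
  text: string,
  charsPerToken: number = DEFAULT_CHARS_PER_TOKEN,
): number {
  return Math.ceil(text.length / charsPerToken);
}

// Callers could then tune per content type, e.g. a lower ratio
// (~3.5) for XML-heavy prompts as noted above:
const xmlTokens = estimateTokens("<agent><tools/></agent>", 3.5);
```

Defaulting keeps every existing call site working while opening the door to model-specific tuning later.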

2. Calculation Consistency

Severity: Low-Medium

Total calculation is duplicated across files, creating potential maintenance issues:

```ts
// agents-run-api/src/agents/Agent.ts:1056-1065 (manual calculation)
contextBreakdown.total =
  contextBreakdown.systemPromptTemplate +
  contextBreakdown.coreInstructions +
  // ... 9 more lines

// vs agents-run-api/src/utils/token-estimator.ts:82-95 (helper function)
export function calculateBreakdownTotal(breakdown: ContextBreakdown): ContextBreakdown
```

Recommendation: Use the calculateBreakdownTotal helper consistently instead of manual recalculation in Agent.ts:1056.
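A sketch of how such a helper might compute the total generically, so new components are picked up without editing a manual sum. The field names are assumed from the components table and the helper's internals are hypothetical, not the PR's actual implementation:

```ts
// Illustrative subset; the real interface has eleven component fields.
interface ContextBreakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  conversationHistory: number;
  total: number;
}

function calculateBreakdownTotal(b: ContextBreakdown): ContextBreakdown {
  // Sum every numeric field except `total` itself, so adding a new
  // component field never requires touching this function.
  const total = Object.entries(b)
    .filter(([key]) => key !== "total")
    .reduce((sum, [, value]) => sum + value, 0);
  return { ...b, total };
}
```

Agent.ts could then replace the eleven-line manual addition with a single `calculateBreakdownTotal(contextBreakdown)` call.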

3. Frontend Data Safety

Severity: Low

The frontend parsing could be more defensive:

```ts
// agents-manage-ui/src/app/api/signoz/conversations/[conversationId]/route.ts:568-580
contextBreakdown = {
  systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
  // ... potential NaN issues if non-numeric values exist
};
```

Recommendation: Use the existing getNumber helper function which includes Number.isFinite checks for better safety.
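For reference, the shape such a helper typically takes (this is a hypothetical reconstruction of `getNumber`, not the codebase's actual implementation):

```ts
// Coerce to number, reject NaN/Infinity, fall back to a default.
function getNumber(value: unknown, fallback = 0): number {
  const n = Number(value);
  return Number.isFinite(n) ? n : fallback;
}

// Unlike `Number(x) || 0`, this distinguishes a genuine 0 from a
// parse failure only via the finiteness check, and never yields NaN.
const tokens = getNumber("not-a-number"); // falls back to 0
```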

4. Architecture & Extensibility

System Design Considerations:

  • The character-based approach is pragmatic but may need evolution for different use cases
  • Consider future scenarios where users might want:
    • Model-specific token counting
    • Cost estimation (tokens × cost per token)
    • Context window utilization percentages
    • Historical trending data
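The cost-estimation scenario in particular falls out almost directly from the breakdown total. A minimal sketch, with illustrative (not real) pricing and a hypothetical `ModelRates` type:

```ts
// Hypothetical per-model rate card; numbers here are placeholders.
interface ModelRates {
  inputCostPerMillionTokens: number;
}

function estimateContextCost(totalTokens: number, rates: ModelRates): number {
  // tokens × cost per token, expressed via the common per-million pricing.
  return (totalTokens / 1_000_000) * rates.inputCostPerMillionTokens;
}

// e.g. 40,000 estimated context tokens at $3 per million input tokens:
const cost = estimateContextCost(40_000, { inputCostPerMillionTokens: 3 });
```

Context-window utilization would be similarly cheap to derive: `total / modelContextWindow` as a percentage.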

Naming & Semantics: The OTEL attribute naming is well-structured and follows existing patterns. The context.breakdown.* namespace clearly separates these metrics.

🚀 Recommended Actions

High Priority:

  1. Use consistent total calculation - Replace manual total calculation in Agent.ts with calculateBreakdownTotal helper
  2. Add defensive parsing - Use getNumber helper in route.ts for safer attribute parsing

Medium Priority:

  3. Enhance documentation - Add accuracy expectations and model compatibility notes to token-estimator.ts
  4. Consider configurability - Make chars/token ratio configurable for future model-specific tuning

Low Priority:

  5. Add validation - Consider adding runtime validation that breakdown components sum to total
  6. Future extensibility - Plan for cost estimation and context window percentage features

📊 Component Analysis

Frontend Visualization (context-breakdown.tsx):

  • Excellent responsive design with stacked bar chart + detailed breakdown
  • Proper accessibility with titles and color-coded indicators
  • Smart filtering (skips segments <0.5% in stacked view)
  • Good use of Tailwind utility classes and proper TypeScript typing

Backend Integration:

  • Seamless OTEL integration following existing patterns
  • Proper span attribute recording on agent.generate spans
  • Consistent with existing telemetry approach

Data Flow:

  • Clean data flow: Phase1/Phase2Config → Agent.ts → OTEL → SigNoz → Frontend
  • Proper separation of concerns between estimation, assembly, and instrumentation

🎯 Overall Assessment

This is a high-quality implementation that adds significant value for understanding LLM context utilization. The architecture is sound, the code is well-structured, and it follows existing patterns effectively. The character-based approximation is a pragmatic choice that balances accuracy with simplicity.

The feature provides immediate utility for developers debugging context issues and understanding where tokens are being consumed, which is crucial for managing costs and context window limits.

Approval Recommendation: Approved with minor suggestions

The identified issues are mostly minor improvements and future considerations rather than blocking concerns. This PR is ready for merge with the understanding that the suggested refinements could be addressed in follow-up work.


claude[bot] avatar Dec 01 '25 21:12 claude[bot]

@claude review

shagun-singh-inkeep avatar Dec 15 '25 16:12 shagun-singh-inkeep

Claude finished @shagun-singh-inkeep's task in 2m 48s


Code Review for Context Tracking Evaluation System

I've completed a comprehensive review of the PR that adds a context tracking evaluation system for the trace viewer. This feature tracks token usage breakdown for LLM context windows using character-based approximation.

Review Tasks:

  • [x] Read repository guidelines (CLAUDE.md)
  • [x] Review token estimation utility and approach
  • [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
  • [x] Review shared package changes (otel-attributes)
  • [x] Evaluate frontend visualization component
  • [x] Check for consistency across packages
  • [x] Assess overall architecture and design decisions
  • [x] Provide comprehensive feedback

Overall Assessment

This is a well-designed and implemented feature that adds valuable observability to LLM context usage. The character-based token approximation approach (~4 chars/token) is pragmatic and provides universal compatibility across different models.

✅ Strengths

Architecture & Design

  • Smart token estimation approach: The 4 chars/token approximation works universally across all models (OpenAI, Anthropic, Gemini) without external dependencies
  • Clean separation of concerns: Well-structured with dedicated token estimation utility, consistent AssembleResult pattern, and proper OTEL instrumentation
  • Comprehensive tracking: Covers all major context components (system prompt, instructions, tools, artifacts, conversation history, etc.)

Implementation Quality

  • Type safety: Properly typed interfaces (ContextBreakdown, AssembleResult) with consistent usage across packages
  • OTEL integration: Follows existing telemetry patterns with well-named attributes (context.breakdown.*)
  • Frontend visualization: Excellent responsive UI with both stacked bar chart and detailed breakdown, proper accessibility considerations

Code Organization

  • Consistent patterns: The AssembleResult return type is used consistently across Phase1Config and Phase2Config
  • Helper functions: Good utilities like createEmptyBreakdown(), estimateTokens(), and calculateBreakdownTotal()
  • Error handling: Proper fallbacks with || 0 in frontend parsing

🔍 Areas for Improvement

1. Calculation Consistency (Verified)

The calculateBreakdownTotal() helper function exists in token-estimator.ts and appears to be called correctly in Agent.ts:2279, so totals are computed consistently through the helper. Good use of the helper function.

2. Frontend Data Parsing Safety (Minor)

In route.ts:1137-1149, the parsing uses Number() with || 0 fallback, which is safe but could be more robust:

```ts
// Current approach is acceptable but could use existing helper
systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
```

Consider using a helper function like getNumber() if it exists in the codebase for consistency with other parsing patterns.

3. Token Estimation Accuracy Documentation (Suggestion)

The 4 chars/token ratio is documented in comments but could benefit from more detailed accuracy expectations:

  • XML/structured content typically has lower ratios (~3.5 chars/token)
  • Code snippets may have higher ratios (~4.5-5 chars/token)
  • Different models have varying tokenization strategies

Consider adding a comment block in token-estimator.ts explaining accuracy expectations and potential future enhancements.

4. Visualization Threshold (Minor)

The frontend skips segments <0.5% in the stacked bar (context-breakdown.tsx:132). This is reasonable but could be made configurable if needed in the future.

🎯 Recommendations

High Priority

None - the implementation is solid and ready for production.

Medium Priority

  1. Documentation: Add more detailed comments about token estimation accuracy expectations in token-estimator.ts
  2. Consistency: Consider using existing number parsing helpers in the frontend if available

Low Priority

  1. Future extensibility: Consider making the chars/token ratio configurable for future model-specific tuning
  2. Validation: Could add runtime validation that breakdown components sum to total (though the current implementation with calculateBreakdownTotal() makes this unlikely to be an issue)

📊 Component-by-Component Review

Backend (agents-run-api)

  • token-estimator.ts: Clean, well-documented utility with appropriate interfaces
  • Agent.ts: Proper integration with OTEL span attributes, correct use of helper functions
  • Phase1/Phase2Config: Consistent implementation of AssembleResult return type
  • SystemPromptBuilder: Correctly passes through breakdown data

Shared (agents-core)

  • otel-attributes.ts: Well-organized attribute constants following existing naming patterns

Frontend (agents-manage-ui)

  • context-breakdown.tsx: Excellent visualization component with responsive design
  • route.ts: Proper parsing of OTEL attributes with fallbacks
  • types.ts: Clean interface definition matching backend structure

✅ Test Coverage

The test plan in the PR description is comprehensive. The changes to test files show proper adaptation to the new AssembleResult return type.

Conclusion

This is a high-quality implementation that adds immediate value for developers debugging context issues and managing token costs. The architecture is sound, the code follows existing patterns well, and the feature is production-ready.

Recommendation: ✅ APPROVED

The minor suggestions above can be addressed in follow-up PRs if desired, but none are blocking issues.

claude[bot] avatar Dec 15 '25 16:12 claude[bot]