feat: add context tracking evaluation system for trace viewer
Summary
- Add a context tracking evaluation system that shows token usage breakdown for each component of the LLM context window in the trace viewer
- Uses character-based token approximation (~4 chars/token) which works universally across all models (OpenAI, Anthropic, Gemini, custom)
- Records context breakdown as OTEL span attributes on `agent.generate` spans
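In sketch form, the approximation amounts to the following (names are illustrative; the actual implementation lives in token-estimator.ts and may differ):

```typescript
// Hypothetical sketch of the character-based estimator described above.
// ~4 characters per token is a rough cross-model average; real tokenizers vary.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  if (!text) return 0;
  // Round up so any non-empty string counts as at least one token.
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}
```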
Changes
Backend (agents-run-api)
- `token-estimator.ts`: New utility for character-based token estimation with a `ContextBreakdown` interface
- `Phase1Config.ts` / `Phase2Config.ts`: Return an `AssembleResult` with prompt and breakdown from assembly
- `SystemPromptBuilder.ts`: Pass through breakdown data
- `Agent.ts`: Record 12 context breakdown attributes as OTEL span attributes
Shared (agents-core)
- `otel-attributes.ts`: Add `context.breakdown.*` attribute keys
Frontend (agents-manage-ui)
- `context-breakdown.tsx`: New visualization component with stacked bar chart and detailed list
- `conversation-detail.tsx`: Integrate breakdown component into summary cards
- `route.ts`: Parse breakdown attributes from span data
Components Tracked
| Component | Description |
|---|---|
| System Prompt Template | Base XML template tokens |
| Core Instructions | Core prompt instructions |
| Agent Prompt | Agent-level context |
| Tools Section | MCP, function, and relation tools |
| Artifacts Section | Artifact definitions |
| Data Components | Phase 2 data component instructions |
| Artifact Components | Artifact creation instructions |
| Transfer Instructions | Agent transfer guidance |
| Delegation Instructions | Agent delegation guidance |
| Thinking Preparation | Extended thinking instructions |
| Conversation History | Chat history tokens |
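For illustration, the tracked components map naturally onto an interface like the following. The field names here are inferred from the table and attribute keys above, not copied from the PR, and the real `ContextBreakdown` in token-estimator.ts may differ:

```typescript
// Hypothetical interface inferred from the components table above.
interface ContextBreakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  agentPrompt: number;
  toolsSection: number;
  artifactsSection: number;
  dataComponents: number;
  artifactComponents: number;
  transferInstructions: number;
  delegationInstructions: number;
  thinkingPreparation: number;
  conversationHistory: number;
  total: number; // sum of the eleven component fields
}

// A createEmptyBreakdown-style initializer (the reviews below mention such a helper).
function createEmptyBreakdown(): ContextBreakdown {
  return {
    systemPromptTemplate: 0,
    coreInstructions: 0,
    agentPrompt: 0,
    toolsSection: 0,
    artifactsSection: 0,
    dataComponents: 0,
    artifactComponents: 0,
    transferInstructions: 0,
    delegationInstructions: 0,
    thinkingPreparation: 0,
    conversationHistory: 0,
    total: 0,
  };
}
```

Eleven components plus `total` gives the 12 span attributes recorded in Agent.ts.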
Test plan
- [ ] Verify type checking passes across all packages
- [ ] Test trace viewer shows context breakdown card for agent generation spans
- [ ] Verify breakdown values are reasonable (components sum to the total)
- [ ] Test with different agent configurations to ensure all components are tracked
🦋 Changeset detected
Latest commit: fa0aa24c59235ca367d17cc9dc98e63233b3b7c6
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 9 packages
| Name | Type |
|---|---|
| @inkeep/agents-core | Patch |
| @inkeep/agents-manage-ui | Patch |
| @inkeep/agents-run-api | Patch |
| @inkeep/agents-cli | Patch |
| @inkeep/agents-manage-api | Patch |
| @inkeep/agents-manage-mcp | Patch |
| @inkeep/agents-sdk | Patch |
| @inkeep/ai-sdk-provider | Patch |
| @inkeep/create-agents | Patch |
The latest updates on your projects.
| Project | Deployment | Review | Updated (UTC) |
|---|---|---|---|
| agents-docs | | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-manage-api | | | Dec 22, 2025 7:35pm |
| agents-manage-ui | | Preview, Comment | Dec 22, 2025 7:35pm |
| agents-run-api | | Preview, Comment | Dec 22, 2025 7:35pm |
Claude finished @amikofalvy's task
TypeScript Staff Engineer & System Architect Review
I've completed a comprehensive review of the context tracking evaluation system. This is a well-designed feature that adds valuable observability to LLM context usage. Here are my key findings and recommendations:
Review Tasks
- [x] Understand existing codebase patterns and architecture
- [x] Review token estimation utility and approach
- [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
- [x] Review shared package changes (otel-attributes)
- [x] Evaluate frontend visualization component
- [x] Check for consistency across packages and existing patterns
- [x] Assess extensibility and design decisions
- [x] Provide comprehensive feedback
✅ Strengths & Positive Aspects
Excellent Architecture & Design:
- Smart choice of character-based token approximation (~4 chars/token) for universal compatibility
- Well-structured separation between token estimation utility, prompt assembly, and OTEL instrumentation
- Consistent
AssembleResultreturn pattern across Phase1 and Phase2 configurations - Thoughtful breakdown into logical components that align with actual prompt structure
Strong Implementation Quality:
- Comprehensive OTEL attribute coverage with consistent naming (`context.breakdown.*`)
- Robust frontend visualization with proper responsive design and accessibility considerations
- Good error handling and fallback patterns (e.g., `getNumber` with fallbacks in route.ts)
- Proper TypeScript typing throughout
🔍 Areas for Consideration
1. Token Estimation Accuracy
Severity: Medium
The 4 chars/token approximation may be less accurate for some content types:
// agents-run-api/src/utils/token-estimator.ts:12
const CHARS_PER_TOKEN = 4;
Considerations:
- XML/structured content typically has lower char/token ratios (~3.5)
- Code snippets may have higher ratios (~4.5-5)
- Different models have varying tokenization strategies
Recommendations:
- Consider exposing the ratio as a configuration parameter
- Add documentation about accuracy expectations
- Consider future extensibility for model-specific estimators
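One possible shape for the configurable-ratio recommendation, sketched with the rough per-content-type figures mentioned above (these are illustrative estimates, not measured values, and nothing like this exists in the PR):

```typescript
// Sketch of a configurable estimator. The default of 4 chars/token matches the
// PR; the per-kind overrides are the rough figures from the considerations above.
type ContentKind = 'xml' | 'code' | 'plain';

const DEFAULT_CHARS_PER_TOKEN = 4;

const RATIO_BY_KIND: Record<ContentKind, number> = {
  xml: 3.5,  // structured/XML content tends toward fewer chars per token
  code: 4.5, // code snippets tend toward more chars per token
  plain: DEFAULT_CHARS_PER_TOKEN,
};

function estimateTokensFor(text: string, kind: ContentKind = 'plain'): number {
  if (!text) return 0;
  return Math.ceil(text.length / RATIO_BY_KIND[kind]);
}
```

A `Record<ContentKind, number>` keeps the door open for model-specific estimators later: a per-model table could replace the per-kind one without changing call sites.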
2. Calculation Consistency
Severity: Low-Medium
Total calculation is duplicated across files, creating potential maintenance issues:
// agents-run-api/src/agents/Agent.ts:1056-1065 (manual calculation)
contextBreakdown.total =
contextBreakdown.systemPromptTemplate +
contextBreakdown.coreInstructions +
// ... 9 more lines
// vs agents-run-api/src/utils/token-estimator.ts:82-95 (helper function)
export function calculateBreakdownTotal(breakdown: ContextBreakdown): ContextBreakdown
Recommendation:
Use the calculateBreakdownTotal helper consistently instead of manual recalculation in Agent.ts:1056.
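In sketch form, that consolidation might look like the following (the field names and the summation strategy are assumptions; the real helper's signature is in token-estimator.ts):

```typescript
// Illustrative sketch: derive the total from the component fields in one place
// rather than hand-summing them at each call site.
interface Breakdown {
  systemPromptTemplate: number;
  coreInstructions: number;
  conversationHistory: number;
  // ...remaining component fields elided for brevity
  total: number;
}

function calculateBreakdownTotal(b: Breakdown): Breakdown {
  // Sum every numeric field except `total` itself.
  const total = Object.entries(b)
    .filter(([key]) => key !== 'total')
    .reduce((sum, [, v]) => sum + (typeof v === 'number' ? v : 0), 0);
  return { ...b, total };
}
```

Centralizing the sum means a newly added component field can never silently drift out of the total.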
3. Frontend Data Safety
Severity: Low
The frontend parsing could be more defensive:
// agents-manage-ui/src/app/api/signoz/conversations/[conversationId]/route.ts:568-580
contextBreakdown = {
systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
// ... potential NaN issues if non-numeric values exist
};
Recommendation:
Use the existing getNumber helper function which includes Number.isFinite checks for better safety.
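A `getNumber`-style helper of that shape might look like this (sketch only; the actual helper in the codebase may have a different signature):

```typescript
// Sketch of a defensive numeric parser in the spirit of the getNumber helper
// referenced above. Number.isFinite explicitly rejects NaN and ±Infinity and
// allows a caller-chosen fallback, rather than relying on `|| 0` coercion.
function getNumber(value: unknown, fallback = 0): number {
  const n = typeof value === 'string' ? Number(value) : value;
  return typeof n === 'number' && Number.isFinite(n) ? n : fallback;
}
```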
4. Architecture & Extensibility
System Design Considerations:
- The character-based approach is pragmatic but may need evolution for different use cases
- Consider future scenarios where users might want:
- Model-specific token counting
- Cost estimation (tokens × cost per token)
- Context window utilization percentages
- Historical trending data
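The cost-estimation and utilization ideas above reduce to simple arithmetic on top of the recorded token counts; hypothetical sketches (the names and rates are placeholders, not real model prices):

```typescript
// Cost estimation: tokens × per-token cost, per the future scenario above.
function estimateCostUsd(tokens: number, usdPerToken: number): number {
  return tokens * usdPerToken;
}

// Context window utilization as a percentage of the model's window size.
function contextUtilizationPct(usedTokens: number, windowTokens: number): number {
  return (usedTokens / windowTokens) * 100;
}
```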
Naming & Semantics:
The OTEL attribute naming is well-structured and follows existing patterns. The context.breakdown.* namespace clearly separates these metrics.
🚀 Recommended Actions
High Priority:
1. Use consistent total calculation - Replace the manual total calculation in Agent.ts with the `calculateBreakdownTotal` helper
2. Add defensive parsing - Use the `getNumber` helper in route.ts for safer attribute parsing
Medium Priority:
3. Enhance documentation - Add accuracy expectations and model compatibility notes to token-estimator.ts
4. Consider configurability - Make the chars/token ratio configurable for future model-specific tuning
Low Priority:
5. Add validation - Consider adding runtime validation that breakdown components sum to the total
6. Future extensibility - Plan for cost estimation and context window percentage features
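The runtime validation suggested above could be as small as the following (illustrative sketch; field handling is an assumption):

```typescript
// Sketch of the low-priority validation suggestion: check that the recorded
// component values actually sum to the recorded total.
function breakdownIsConsistent(
  breakdown: Record<string, number>,
  totalKey = 'total',
): boolean {
  const componentSum = Object.entries(breakdown)
    .filter(([key]) => key !== totalKey)
    .reduce((sum, [, v]) => sum + v, 0);
  return componentSum === breakdown[totalKey];
}
```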
📊 Component Analysis
Frontend Visualization (context-breakdown.tsx):
- Excellent responsive design with stacked bar chart + detailed breakdown
- Proper accessibility with titles and color-coded indicators
- Smart filtering (skips segments <0.5% in stacked view)
- Good use of Tailwind utility classes and proper TypeScript typing
Backend Integration:
- Seamless OTEL integration following existing patterns
- Proper span attribute recording on `agent.generate` spans
- Consistent with existing telemetry approach
Data Flow:
- Clean data flow: Phase1/Phase2Config → Agent.ts → OTEL → SigNoz → Frontend
- Proper separation of concerns between estimation, assembly, and instrumentation
🎯 Overall Assessment
This is a high-quality implementation that adds significant value for understanding LLM context utilization. The architecture is sound, the code is well-structured, and it follows existing patterns effectively. The character-based approximation is a pragmatic choice that balances accuracy with simplicity.
The feature provides immediate utility for developers debugging context issues and understanding where tokens are being consumed, which is crucial for managing costs and context window limits.
Approval Recommendation: ✅ Approved with minor suggestions
The identified issues are mostly minor improvements and future considerations rather than blocking concerns. This PR is ready for merge with the understanding that the suggested refinements could be addressed in follow-up work.
@claude review
Claude finished @shagun-singh-inkeep's task in 2m 48s
Code Review for Context Tracking Evaluation System
I've completed a comprehensive review of the PR that adds a context tracking evaluation system for the trace viewer. This feature tracks token usage breakdown for LLM context windows using character-based approximation.
Review Tasks:
- [x] Read repository guidelines (CLAUDE.md)
- [x] Review token estimation utility and approach
- [x] Analyze backend changes (Agent.ts, Phase1/2Config, SystemPromptBuilder)
- [x] Review shared package changes (otel-attributes)
- [x] Evaluate frontend visualization component
- [x] Check for consistency across packages
- [x] Assess overall architecture and design decisions
- [x] Provide comprehensive feedback
Overall Assessment
This is a well-designed and implemented feature that adds valuable observability to LLM context usage. The character-based token approximation approach (~4 chars/token) is pragmatic and provides universal compatibility across different models.
✅ Strengths
Architecture & Design
- Smart token estimation approach: The 4 chars/token approximation works universally across all models (OpenAI, Anthropic, Gemini) without external dependencies
- Clean separation of concerns: Well-structured with a dedicated token estimation utility, a consistent `AssembleResult` pattern, and proper OTEL instrumentation
- Comprehensive tracking: Covers all major context components (system prompt, instructions, tools, artifacts, conversation history, etc.)
Implementation Quality
- Type safety: Properly typed interfaces (`ContextBreakdown`, `AssembleResult`) with consistent usage across packages
- OTEL integration: Follows existing telemetry patterns with well-named attributes (`context.breakdown.*`)
- Frontend visualization: Excellent responsive UI with both stacked bar chart and detailed breakdown, proper accessibility considerations
Code Organization
- Consistent patterns: The `AssembleResult` return type is used consistently across Phase1Config and Phase2Config
- Helper functions: Good utilities like `createEmptyBreakdown()`, `estimateTokens()`, and `calculateBreakdownTotal()`
- Error handling: Proper fallbacks with `|| 0` in frontend parsing
🔍 Areas for Improvement
1. Potential Calculation Consistency Issue (Minor)
The `calculateBreakdownTotal()` helper defined in token-estimator.ts is in fact called correctly in Agent.ts:2279, so no inconsistency was found here; good use of the helper function.
2. Frontend Data Parsing Safety (Minor)
In route.ts:1137-1149, the parsing uses Number() with || 0 fallback, which is safe but could be more robust:
// Current approach is acceptable but could use existing helper
systemPromptTemplate: Number(data['context.breakdown.system_template_tokens']) || 0,
Consider using a helper function like getNumber() if it exists in the codebase for consistency with other parsing patterns.
3. Token Estimation Accuracy Documentation (Suggestion)
The 4 chars/token ratio is documented in comments but could benefit from more detailed accuracy expectations:
- XML/structured content typically has lower ratios (~3.5 chars/token)
- Code snippets may have higher ratios (~4.5-5 chars/token)
- Different models have varying tokenization strategies
Consider adding a comment block in token-estimator.ts explaining accuracy expectations and potential future enhancements.
4. Visualization Threshold (Minor)
The frontend skips segments <0.5% in the stacked bar (context-breakdown.tsx:132). This is reasonable but could be made configurable if needed in the future.
🎯 Recommendations
High Priority
None - the implementation is solid and ready for production.
Medium Priority
- Documentation: Add more detailed comments about token estimation accuracy expectations in `token-estimator.ts`
- Consistency: Consider using existing number parsing helpers in the frontend if available
Low Priority
- Future extensibility: Consider making the chars/token ratio configurable for future model-specific tuning
- Validation: Could add runtime validation that breakdown components sum to the total (though the current implementation with `calculateBreakdownTotal()` makes this unlikely to be an issue)
📊 Component-by-Component Review
Backend (agents-run-api)
- ✅ token-estimator.ts: Clean, well-documented utility with appropriate interfaces
- ✅ Agent.ts: Proper integration with OTEL span attributes, correct use of helper functions
- ✅ Phase1/Phase2Config: Consistent implementation of the `AssembleResult` return type
- ✅ SystemPromptBuilder: Correctly passes through breakdown data
Shared (agents-core)
- ✅ otel-attributes.ts: Well-organized attribute constants following existing naming patterns
Frontend (agents-manage-ui)
- ✅ context-breakdown.tsx: Excellent visualization component with responsive design
- ✅ route.ts: Proper parsing of OTEL attributes with fallbacks
- ✅ types.ts: Clean interface definition matching backend structure
✅ Test Coverage
The test plan in the PR description is comprehensive. The changes to test files show proper adaptation to the new AssembleResult return type.
Conclusion
This is a high-quality implementation that adds immediate value for developers debugging context issues and managing token costs. The architecture is sound, the code follows existing patterns well, and the feature is production-ready.
Recommendation: ✅ APPROVED
The minor suggestions above can be addressed in follow-up PRs if desired, but none are blocking issues.