Feature Request: Integrate mgrep for Semantic Code Search
Add support for mgrep - a semantic grep tool that enables natural language queries across code, images, and PDFs.
Summary
When working with large codebases, finding code often requires knowing exact function/class names. OpenCode's current search works well for exact text matching, but fails when I ask:
- "Where do we handle API rate limiting?"
- "Show me error handling for payment processing"
- "Find all places that validate user input"
- "Search for database migration files"
These questions describe what code does, not what it's called.
mgrep solves this with semantic search:
- Matches concepts and meanings, not just text strings
- Works across code, images, and PDFs
- CLI-native, fast, lightweight
- Uses embeddings for natural language understanding
Why mgrep?
Key Features
- Natural Language Queries: Search by intent ("where do we handle errors?" vs grep "error")
- Multimodal: Search code + images + PDFs in one query
- 100% Local: Embeddings run locally via transformers.js, no API keys
- Repository Isolation: Each project gets its own index
- Token Efficient: Reduces LLM token usage by ~20% vs grep
Use Case Examples
Current (grep/ripgrep):
# User asks: "Where is authentication logic?"
# Must know file is IdentityManager.ts
grep -r "authenticate" ./src
With mgrep:
# User asks: "Where is authentication logic?"
# Returns: IdentityManager.ts, AuthController.ts, LoginService.ts
# With line numbers and context
mgrep --semantic "authentication logic"
Proposed Implementation
Option 1: Built-in Tool Integration (Recommended)
Add mgrep as a built-in tool in OpenCode's search arsenal:
{
  name: "mgrep_semantic_search",
  description: "Semantic search using mgrep - finds code by meaning, not just text",
  parameters: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Natural language query (e.g., 'where do we handle rate limiting?')"
      },
      path: {
        type: "string",
        description: "Search path (default: current directory)"
      },
      file_types: {
        type: "array",
        items: { type: "string" },
        description: "File extensions to search (e.g., ['ts', 'js', 'py'])"
      }
    },
    required: ["query"]
  }
}
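A tool handler would need to validate these parameters before invoking anything. The following is a minimal sketch of that step, assuming the schema above; `MgrepSearchParams` and `parseSearchParams` are hypothetical names, and the normalization of extensions is an illustrative choice, not OpenCode's or mgrep's behavior.

```typescript
// Hypothetical parameter shape mirroring the proposed schema.
interface MgrepSearchParams {
  query: string;
  path: string;
  fileTypes: string[];
}

// Validate untrusted tool input and apply the defaults the schema describes.
function parseSearchParams(input: unknown): MgrepSearchParams {
  if (typeof input !== "object" || input === null) {
    throw new Error("parameters must be an object");
  }
  const raw = input as Record<string, unknown>;

  // "query" is the only required field.
  if (typeof raw.query !== "string" || raw.query.trim() === "") {
    throw new Error("query must be a non-empty string");
  }

  // "path" defaults to the current directory.
  const path = typeof raw.path === "string" && raw.path !== "" ? raw.path : ".";

  // "file_types" is optional; "ts" and ".TS" are normalized to the same form.
  const fileTypes: string[] = [];
  if (raw.file_types !== undefined) {
    if (!Array.isArray(raw.file_types)) {
      throw new Error("file_types must be an array of strings");
    }
    for (let i = 0; i < raw.file_types.length; i++) {
      const ext: unknown = raw.file_types[i];
      if (typeof ext !== "string") {
        throw new Error("file_types must be an array of strings");
      }
      fileTypes.push(ext.replace(/^\./, "").toLowerCase());
    }
  }

  return { query: raw.query.trim(), path, fileTypes };
}
```

Rejecting bad input here keeps malformed model output from ever reaching the CLI call.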
Option 2: Auto-Detection
Automatically integrate mgrep when:
- mgrep CLI is installed and in PATH
- mgrep index exists for project
- User enables semantic search in config
Config:
{
  "$schema": "https://opencode.ai/config.json",
  "tools": {
    "mgrep": {
      "enabled": true,
      "autoIndex": true,
      "defaultPath": ".",
      "indexOnStart": false
    }
  }
}
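The three auto-detection conditions can be expressed as a single decision function. This is a sketch only: `MgrepConfig` mirrors the proposed config block, while `MgrepEnvironment` and `shouldUseMgrep` are hypothetical names, and how the host probes PATH or the index is left open.

```typescript
// Hypothetical config shape mirroring the proposed "tools.mgrep" block.
interface MgrepConfig {
  enabled: boolean;
  autoIndex: boolean;
  defaultPath: string;
  indexOnStart: boolean;
}

// Facts gathered by the host at startup (probing is not shown here).
interface MgrepEnvironment {
  cliOnPath: boolean;   // mgrep binary found in PATH
  indexExists: boolean; // an index already exists for this project
}

type Activation =
  | { active: true }
  | { active: false; reason: string };

// All three conditions from the list must hold before mgrep is used.
function shouldUseMgrep(config: MgrepConfig | undefined, env: MgrepEnvironment): Activation {
  if (!config || !config.enabled) {
    return { active: false, reason: "semantic search not enabled in config" };
  }
  if (!env.cliOnPath) {
    return { active: false, reason: "mgrep CLI not found in PATH" };
  }
  if (!env.indexExists) {
    return { active: false, reason: "no mgrep index for this project" };
  }
  return { active: true };
}
```

Returning a reason alongside the decision gives the UI something concrete to show when semantic search is unavailable.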
Option 3: MCP Server
Build mgrep as MCP server (requires upstream work):
{
  "mcp": {
    "mgrep": {
      "type": "local",
      "command": "mgrep",
      "args": ["mcp"],
      "enabled": true
    }
  }
}
Note: This requires mgrep team to add MCP support first.
Use Cases
1. Finding Logic Without Knowing Names
Scenario: Developer wants to find "rate limiting" but doesn't know file is named RateLimiter.ts
User prompt:
"Where do we handle API rate limiting in the payment service?"
Current behavior:
- Uses grep for "rate" - returns many false positives (separate, generate, iterate)
- User must manually scan results
With mgrep:
- Returns src/payment/RateLimiter.ts with relevant code snippet
- Also finds src/api/ThrottlingMiddleware.ts (related concept)
- Shows context from both files
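The false-positive problem in the current behavior can be reproduced with plain substring matching. The identifier list below is made up for illustration and is not taken from any real project.

```typescript
// Illustrative identifiers; only the first and last relate to rate limiting.
const identifiers = [
  "rateLimiter",
  "separateBatches",
  "generateInvoice",
  "iterateOrders",
  "throttleRequests",
];

// Case-insensitive substring matching, which is roughly what `grep -i` does.
function substringMatches(needle: string, haystack: string[]): string[] {
  const n = needle.toLowerCase();
  return haystack.filter((name) => name.toLowerCase().indexOf(n) !== -1);
}

const hits = substringMatches("rate", identifiers);
console.log(hits);
// prints rateLimiter, separateBatches, generateInvoice, iterateOrders
```

Three of the four hits are noise, and `throttleRequests`, which is conceptually related, is missed because it never contains the literal text.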
2. Cross-Filetype Search
Scenario: Architecture documentation is in PDFs, code is in TS
User prompt:
"Show me how payment flow is documented"
Current behavior:
- Only searches code files
- Misses architecture PDFs
With mgrep:
- Returns docs/architecture/payment-flow.pdf
- Returns src/payment/checkout.ts
- Returns docs/api/payment-endpoints.md
3. Bug Investigation
Scenario: Bug report mentions "error when processing payments"
User prompt:
"Find all error handling for payment processing"
Current behavior:
- grep "error" payment/ - returns hundreds of matches
- Includes console.log errors, unrelated error variables
With mgrep:
- Returns only PaymentErrorHandler.ts and PaymentException.ts
- Shows try-catch blocks in PaymentService.ts
- Returns error handling tests in payment.test.ts
4. Code Discovery
Scenario: New developer joining team, exploring codebase
User prompt:
"How do we validate user input?"
With mgrep:
- Returns ValidationService.ts
- Returns UserValidator.ts
- Returns input validation schemas in src/schemas/
- Shows usage examples across project
Technical Details
mgrep Capabilities
From mgrep repo:
- Semantic search using local embeddings
- Multi-format support (code, images, PDFs)
- Automatic repository indexing
- Natural language query processing
- Fast, CLI-native
Performance Characteristics
- Indexing: O(n) - one-time per project
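- Query: O(n) for a brute-force similarity scan, or roughly O(log n) with an approximate nearest-neighbor index (embeddings cannot be binary-searched)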
- Memory: ~10-50MB index per project (depending on size)
- Startup: <100ms to load existing index
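To make the query cost concrete, here is a sketch of the simplest form of embedding search: score every indexed chunk against the query vector by cosine similarity and keep the best k. It illustrates the general technique only and is not mgrep's implementation; `IndexedChunk` and `topK` are hypothetical names, and the embedding model that produces the vectors is not shown. This brute-force version is O(n) per query; sublinear query time would need an approximate nearest-neighbor index.

```typescript
interface IndexedChunk {
  file: string;
  embedding: number[]; // vector produced by some embedding model (not shown)
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score every chunk, sort descending, keep the best k.
function topK(query: number[], index: IndexedChunk[], k: number): { file: string; score: number }[] {
  return index
    .map((chunk) => ({ file: chunk.file, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
const toyIndex: IndexedChunk[] = [
  { file: "RateLimiter.ts", embedding: [1, 0] },
  { file: "README.md", embedding: [0, 1] },
  { file: "ThrottlingMiddleware.ts", embedding: [1, 1] },
];
const best = topK([1, 0], toyIndex, 2);
// best files, in order: RateLimiter.ts, ThrottlingMiddleware.ts
```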
Integration Effort
- Built-in Tool: Medium (1-2 days) - Add tool registration
- Auto-Detection: Low (4-8 hours) - Add detection + config
- MCP Server: High (Requires mgrep team) - Out of scope for now
Comparison with Existing Tools
| Tool | Text Search | Semantic Search | Multimodal | Local |
|---|---|---|---|---|
| grep | ✅ | ❌ | ❌ | ✅ |
| ripgrep | ✅ | ❌ | ❌ | ✅ |
| fzf | ✅ | ❌ | ❌ | ✅ |
| mgrep | ✅ | ✅ | ✅ | ✅ |
Why mgrep Complements Existing Tools
- grep/ripgrep: Still needed for exact text matching
- fzf: Still needed for interactive file finding
- mgrep: Adds semantic understanding layer on top
Token Efficiency Analysis
Current Workflow (grep)
User: "Find payment error handling"
AI → Runs grep → 5000 matches
AI → Reads top 50 results → 15K tokens
AI → Filters relevant → 10K tokens
Total context: 25K tokens
With mgrep
User: "Find payment error handling"
AI → Runs mgrep → 5 relevant results
AI → Reads all results → 3K tokens
Total context: 3K tokens
Savings: ~22K tokens (88% reduction)
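The arithmetic behind this comparison, written out as a small helper. The token counts are the figures from the two workflows in this section, not measurements; `tokenSavings` is a hypothetical name.

```typescript
// Compute absolute and relative savings between two context sizes.
function tokenSavings(baselineTokens: number, semanticTokens: number): { saved: number; percent: number } {
  if (baselineTokens <= 0) throw new Error("baseline must be positive");
  const saved = baselineTokens - semanticTokens;
  return { saved, percent: (saved / baselineTokens) * 100 };
}

const grepTotal = 15000 + 10000; // read top 50 results + filter relevant
const mgrepTotal = 3000;         // read all 5 results
const savings = tokenSavings(grepTotal, mgrepTotal);
console.log(`${savings.saved} tokens saved (${savings.percent.toFixed(0)}%)`);
// prints "22000 tokens saved (88%)"
```

The same helper could back a benchmark that checks measured savings against a target threshold.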
Implementation Considerations
1. Auto-Indexing Strategy
- On startup: Check if mgrep index exists
- On file change: Re-index only changed files (incremental)
- User control: Config option autoIndex (default: false)
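The incremental re-indexing step could work by comparing two snapshots of the project. In this sketch, a snapshot maps each file path to a fingerprint; whether that fingerprint is a content hash or a modification time is an assumption left open, and `Snapshot` and `planReindex` are hypothetical names.

```typescript
// Maps a file path to a fingerprint (content hash or mtime; an assumption).
type Snapshot = Record<string, string>;

interface ReindexPlan {
  toIndex: string[];  // new or modified files
  toRemove: string[]; // files deleted since the previous snapshot
}

// Compare two snapshots and return only the work that is actually needed.
function planReindex(previous: Snapshot, current: Snapshot): ReindexPlan {
  const toIndex = Object.keys(current).filter((path) => previous[path] !== current[path]);
  const toRemove = Object.keys(previous).filter((path) => !(path in current));
  return { toIndex, toRemove };
}

const plan = planReindex(
  { "a.ts": "h1", "b.ts": "h2", "c.ts": "h3" },
  { "a.ts": "h1", "b.ts": "h9", "d.ts": "h4" },
);
// plan.toIndex is ["b.ts", "d.ts"], plan.toRemove is ["c.ts"]
```

Unchanged files such as `a.ts` never appear in the plan, which is what keeps re-indexing cheap after small edits.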
2. Error Handling
- If mgrep not installed: Fall back to grep
- If index missing: Show user-friendly message
- If query fails: Retry with grep as backup
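The three fallback rules can be sketched as one wrapper around two injected search functions. All names here (`SearchFn`, `Availability`, `searchWithFallback`) are hypothetical, and falling back to text search when the index is missing is an assumption beyond showing the message.

```typescript
interface SearchResult {
  file: string;
  line: number;
}

type SearchFn = (query: string) => Promise<SearchResult[]>;
type Availability = "ready" | "not_installed" | "index_missing";

interface SearchOutcome {
  results: SearchResult[];
  backend: "mgrep" | "grep";
  notice?: string; // user-facing message when a fallback happened
}

async function searchWithFallback(
  query: string,
  availability: Availability,
  semantic: SearchFn, // wraps the mgrep call (not shown)
  text: SearchFn,     // wraps the existing grep-based search (not shown)
): Promise<SearchOutcome> {
  if (availability === "not_installed") {
    return { results: await text(query), backend: "grep", notice: "mgrep is not installed. Falling back to text search" };
  }
  if (availability === "index_missing") {
    return { results: await text(query), backend: "grep", notice: "No mgrep index found for this project. Falling back to text search" };
  }
  try {
    return { results: await semantic(query), backend: "mgrep" };
  } catch {
    // Any semantic failure retries the same query with grep as backup.
    return { results: await text(query), backend: "grep", notice: "Falling back to text search" };
  }
}
```

Injecting both backends keeps the fallback logic testable without either binary installed.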
3. User Experience
- Discovery: "Semantic search using mgrep" (new option in search)
- Fallback: "Falling back to text search" if mgrep fails
- Feedback: "Found 5 relevant results in 23ms"
Related Issues
- Semantic code understanding (Serena): [see companion issue]
References
- mgrep repo: https://github.com/mixedbread-ai/mgrep
- mgrep demo: https://www.scriptbyai.com/osgrep-semantic-search/
- Codebase indexing: https://developertoolkit.ai/en/shared-workflows/context-management/codebase-indexing/
Acceptance Criteria
- [ ] mgrep integrated as built-in tool
- [ ] Semantic search option available to users
- [ ] Auto-indexing configurable (on/off)
- [ ] Fallback to grep if mgrep unavailable
- [ ] Documentation updated with mgrep examples
- [ ] Performance benchmarked (should be <500ms per query)
- [ ] Token efficiency demonstrated (>20% savings)
Questions for Maintainers
- Should mgrep be opt-in (config) or auto-detected?
- Should mgrep auto-index on startup, or on-demand?
- How should mgrep results be ranked vs grep results?
- Any concerns about adding another binary dependency?