Feature Request: Integrate mgrep for Semantic Code Search

Open layeddie opened this issue 1 month ago • 4 comments

Add support for mgrep - a semantic grep tool that enables natural language queries across code, images, and PDFs.

Summary

When working with large codebases, finding code often requires knowing the exact function or class name. OpenCode's current search works well for exact text matching, but falls short when I ask questions like:

"Where do we handle API rate limiting?"
"Show me error handling for payment processing"
"Find all places that validate user input"
"Search for database migration files"

These questions describe what code does, not what it's called.

mgrep solves this with semantic search:

Matches concepts and meanings, not just text strings
Works across code, images, and PDFs
CLI-native, fast, lightweight
Uses embeddings for natural language understanding

Why mgrep?

Key Features

Natural Language Queries: Search by intent ("where do we handle errors?" vs grep "error")
Multimodal: Search code + images + PDFs in one query
100% Local: Embeddings run locally via transformers.js, no API keys
Repository Isolation: Each project gets its own index
Token Efficient: Reduces LLM token usage by ~20% vs grep

Use Case Examples

Current (grep/ripgrep):

# User asks: "Where is authentication logic?"
# Only finds literal matches; to locate IdentityManager.ts you must already know its name or the exact terms it uses
grep -r "authenticate" ./src

With mgrep:

# User asks: "Where is authentication logic?"
# Returns: IdentityManager.ts, AuthController.ts, LoginService.ts
# With line numbers and context
mgrep --semantic "authentication logic"

Proposed Implementation

Option 1: Built-in Tool Integration (Recommended)

Add mgrep as a built-in tool in OpenCode's search arsenal:

{
  name: "mgrep_semantic_search",
  description: "Semantic search using mgrep - finds code by meaning, not just text",
  parameters: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Natural language query (e.g., 'where do we handle rate limiting?')"
      },
      path: {
        type: "string",
        description: "Search path (default: current directory)"
      },
      file_types: {
        type: "array",
        items: { type: "string" },
        description: "File extensions to search (e.g., ['ts', 'js', 'py'])"
      }
    },
    required: ["query"]
  }
}
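
To make the schema above more concrete, here is a minimal sketch of how a handler might validate the arguments before calling mgrep. `parseSearchParams` and `SearchParams` are hypothetical names (not existing OpenCode or mgrep APIs); the default path of "." follows the schema description, and normalizing extensions is my own assumption.

```typescript
// Hypothetical shape of the validated arguments; field names mirror the schema above.
interface SearchParams {
  query: string;
  path: string;
  fileTypes: string[];
}

// Validate raw tool arguments and apply defaults.
// Throws if the required `query` is missing or empty.
function parseSearchParams(input: Record<string, unknown>): SearchParams {
  const rawQuery = input["query"];
  if (typeof rawQuery !== "string" || rawQuery.trim() === "") {
    throw new Error("query must be a non-empty string");
  }

  const rawPath = input["path"];
  if (rawPath !== undefined && typeof rawPath !== "string") {
    throw new Error("path must be a string");
  }
  // Default to the current directory, as the schema description states.
  const path = typeof rawPath === "string" ? rawPath : ".";

  const rawTypes = input["file_types"];
  if (rawTypes !== undefined && !Array.isArray(rawTypes)) {
    throw new Error("file_types must be an array of strings");
  }
  const list: unknown[] = Array.isArray(rawTypes) ? rawTypes : [];
  const fileTypes: string[] = [];
  for (const t of list) {
    if (typeof t !== "string") {
      throw new Error("file_types must be an array of strings");
    }
    // Assumption: treat ".ts" and "TS" the same as "ts".
    fileTypes.push(t.replace(/^\./, "").toLowerCase());
  }

  return { query: rawQuery.trim(), path, fileTypes };
}
```

For example, `parseSearchParams({ query: "where do we handle rate limiting?" })` would return the query with `path` set to "." and an empty `fileTypes` list.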

Option 2: Auto-Detection

Automatically integrate mgrep when:

mgrep CLI is installed and in PATH
mgrep index exists for project
User enables semantic search in config
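
A rough sketch of the detection check, assuming all three conditions above must hold together (the list does not say whether it is "all" or "any"). `isOnPath`, `shouldUseMgrep`, and `DetectionInput` are hypothetical names; the file-existence check is injected so the logic stays testable, and the PATH split assumes a POSIX-style ":" separator.

```typescript
// Each field maps to one condition in the list above.
interface DetectionInput {
  pathEnv: string;                              // value of the PATH environment variable
  fileExists: (candidate: string) => boolean;   // injected filesystem check
  indexExists: boolean;                         // whether an mgrep index exists for the project
  semanticSearchEnabled: boolean;               // user setting from config
}

// Look for `command` in each directory listed in PATH (POSIX ":" separator assumed).
function isOnPath(
  command: string,
  pathEnv: string,
  fileExists: (candidate: string) => boolean
): boolean {
  return pathEnv
    .split(":")
    .filter((dir) => dir !== "")
    .some((dir) => fileExists(`${dir}/${command}`));
}

// Assumption: mgrep is used only when every condition is satisfied.
function shouldUseMgrep(input: DetectionInput): boolean {
  return (
    input.semanticSearchEnabled &&
    input.indexExists &&
    isOnPath("mgrep", input.pathEnv, input.fileExists)
  );
}
```

A real integration would also need to handle Windows (";" separator, ".exe" suffix), which this sketch skips.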

Config:

{
  "$schema": "https://opencode.ai/config.json",
  "tools": {
    "mgrep": {
      "enabled": true,
      "autoIndex": true,
      "defaultPath": ".",
      "indexOnStart": false
    }
  }
}
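
One possible way to resolve this config block against defaults, as a sketch only. `MgrepConfig` and `resolveMgrepConfig` are hypothetical names. The `autoIndex: false` default matches the auto-indexing proposal later in this issue; `enabled: false` (opt-in) is my assumption and is still an open question for maintainers.

```typescript
// Mirrors the keys of the "mgrep" block in the config example above.
interface MgrepConfig {
  enabled: boolean;
  autoIndex: boolean;
  defaultPath: string;
  indexOnStart: boolean;
}

// Assumed defaults: opt-in, no automatic indexing.
const DEFAULT_MGREP_CONFIG: MgrepConfig = {
  enabled: false,
  autoIndex: false,
  defaultPath: ".",
  indexOnStart: false,
};

// Fill in any keys the user left out; explicit user values always win.
function resolveMgrepConfig(user: Partial<MgrepConfig> = {}): MgrepConfig {
  return {
    enabled: user.enabled ?? DEFAULT_MGREP_CONFIG.enabled,
    autoIndex: user.autoIndex ?? DEFAULT_MGREP_CONFIG.autoIndex,
    defaultPath: user.defaultPath ?? DEFAULT_MGREP_CONFIG.defaultPath,
    indexOnStart: user.indexOnStart ?? DEFAULT_MGREP_CONFIG.indexOnStart,
  };
}
```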

Option 3: MCP Server

Expose mgrep as an MCP server (requires upstream work):

{
  "mcp": {
    "mgrep": {
      "type": "local",
      "command": "mgrep",
      "args": ["mcp"],
      "enabled": true
    }
  }
}

Note: This requires the mgrep team to add MCP support first.

Use Cases

1. Finding Logic Without Knowing Names

Scenario: A developer wants to find the rate-limiting logic but doesn't know the file is named RateLimiter.ts

User prompt:

"Where do we handle API rate limiting in the payment service?"

Current behavior:

Uses grep for "rate" - returns many false positives (separate, generate, migrate)
User must manually scan results

With mgrep:

Returns src/payment/RateLimiter.ts with relevant code snippet
Also finds src/api/ThrottlingMiddleware.ts (related concept)
Shows context from both files
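
The false-positive problem with plain text search is easy to reproduce. A small sketch, using made-up identifiers and a case-insensitive substring match as a stand-in for what a text search does on each line:

```typescript
// Hypothetical identifiers; only the first and last relate to rate limiting.
const identifiers = [
  "RateLimiter",
  "separate",
  "generateReport",
  "migrateSchema",
  "applyRateLimit",
];

// Case-insensitive substring match: every string containing `needle` is a hit.
function substringMatches(needle: string, haystack: string[]): string[] {
  const lowered = needle.toLowerCase();
  return haystack.filter((s) => s.toLowerCase().includes(lowered));
}

// All five identifiers match "rate", although only two concern rate limiting.
const hits = substringMatches("rate", identifiers);
console.log(hits.length); // 5
```

A text matcher has no way to tell these apart; ranking by meaning is what the semantic layer is supposed to add.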

2. Cross-Filetype Search

Scenario: Architecture documentation is in PDFs, code is in TS

User prompt:

"Show me how payment flow is documented"

Current behavior:

Only searches code files
Misses architecture PDFs

With mgrep:

Returns docs/architecture/payment-flow.pdf
Returns src/payment/checkout.ts
Returns docs/api/payment-endpoints.md

3. Bug Investigation

Scenario: Bug report mentions "error when processing payments"

User prompt:

"Find all error handling for payment processing"

Current behavior:

grep -r "error" payment/ - returns hundreds of matches
Includes console.log errors, unrelated error variables

With mgrep:

Returns PaymentErrorHandler.ts and PaymentException.ts
Shows try-catch blocks in PaymentService.ts
Returns error handling tests in payment.test.ts

4. Code Discovery

Scenario: New developer joining team, exploring codebase

User prompt:

"How do we validate user input?"

With mgrep:

Returns ValidationService.ts
Returns UserValidator.ts
Returns input validation schemas in src/schemas/
Shows usage examples across project

Technical Details

mgrep Capabilities

From mgrep repo:

Semantic search using local embeddings
Multi-format support (code, images, PDFs)
Automatic repository indexing
Natural language query processing
Fast, CLI-native

Performance Characteristics

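Indexing: O(n) in the number of files for the initial build, then incremental updates
Query: roughly O(log n) if an approximate nearest-neighbor index is used; a brute-force similarity scan is O(n). Embeddings are compared by vector similarity, not binary search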
Memory: ~10-50MB index per project (depending on size)
Startup: <100ms to load existing index
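
For intuition on the query cost, here is a minimal sketch of the brute-force form of embedding search: score every indexed chunk by cosine similarity to the query vector and keep the top k. This is a generic illustration, not mgrep's actual implementation (which I have not inspected); `IndexedChunk`, `cosineSimilarity`, and `topK` are hypothetical names.

```typescript
// One indexed piece of a file together with its embedding vector.
interface IndexedChunk {
  file: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk (a linear scan), sort by score descending, keep the best k.
function topK(query: number[], index: IndexedChunk[], k: number): IndexedChunk[] {
  return index
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

The scan touches every vector, so cost grows linearly with index size; getting below that requires an approximate nearest-neighbor structure, which trades some recall for speed.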

Integration Effort

Built-in Tool: Medium (1-2 days) - Add tool registration
Auto-Detection: Low (4-8 hours) - Add detection + config
MCP Server: High (Requires mgrep team) - Out of scope for now

Comparison with Existing Tools

| Tool | Text Search | Semantic Search | Multimodal | Local |
|---|---|---|---|---|
| `grep` | ✅ | ❌ | ❌ | ✅ |
| `ripgrep` | ✅ | ❌ | ❌ | ✅ |
| `fzf` | ✅ | ❌ | ❌ | ✅ |
| `mgrep` | ✅ | ✅ | ✅ | ✅ |

Why mgrep Complements Existing Tools

grep/ripgrep: Still needed for exact text matching
fzf: Still needed for interactive file finding
mgrep: Adds semantic understanding layer on top

Token Efficiency Analysis

Current Workflow (grep)

User: "Find payment error handling"
AI → Runs grep → 5000 matches
AI → Reads top 50 results → 15K tokens
AI → Filters relevant → 10K tokens
Total context: 25K tokens

With mgrep

User: "Find payment error handling"
AI → Runs mgrep → 5 relevant results
AI → Reads all results → 3K tokens
Total context: 3K tokens

Savings in this illustrative scenario: ~22K tokens (88% reduction)
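
The arithmetic behind that figure, as a small sketch. The token counts are the example numbers from the two workflows above, not measurements; `tokenSavings` is a hypothetical helper.

```typescript
// Compute absolute and percentage savings between two context sizes.
function tokenSavings(
  baselineTokens: number,
  newTokens: number
): { saved: number; percent: number } {
  if (baselineTokens <= 0) {
    throw new Error("baselineTokens must be positive");
  }
  const saved = baselineTokens - newTokens;
  return { saved, percent: (saved * 100) / baselineTokens };
}

// grep workflow: 15K (reading results) + 10K (filtering) = 25K tokens.
const grepTotal = 15000 + 10000;
// mgrep workflow: 3K tokens to read all results.
const mgrepTotal = 3000;

const savings = tokenSavings(grepTotal, mgrepTotal);
console.log(savings.saved, savings.percent); // 22000 88
```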

Implementation Considerations

1. Auto-Indexing Strategy

On startup: Check if mgrep index exists
On file change: Re-index only changed files (incremental)
User control: Config option autoIndex (default: false)
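
The incremental step could work roughly like this: keep a fingerprint per file from the last indexing run, compare it with the current state, and re-index only what differs. A sketch under that assumption; `Snapshot`, `IndexPlan`, and `planIncrementalIndex` are hypothetical names, and how fingerprints are produced (content hash, mtime) is left open.

```typescript
// Maps a file path to a fingerprint of its content (e.g. a hash or mtime string).
type Snapshot = Record<string, string>;

interface IndexPlan {
  reindex: string[]; // new or changed files
  remove: string[];  // files that no longer exist
}

// Compare the previous snapshot with the current one and decide what to touch.
function planIncrementalIndex(previous: Snapshot, current: Snapshot): IndexPlan {
  const reindex: string[] = [];
  const remove: string[] = [];

  for (const path of Object.keys(current)) {
    // Added files have no previous fingerprint; changed files have a different one.
    if (previous[path] !== current[path]) {
      reindex.push(path);
    }
  }
  for (const path of Object.keys(previous)) {
    if (!(path in current)) {
      remove.push(path);
    }
  }
  return { reindex, remove };
}
```

Unchanged files never appear in the plan, so the cost of an update depends on the size of the change rather than the size of the repository.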

2. Error Handling

If mgrep not installed: Fall back to grep
If index missing: Show user-friendly message
If query fails: Retry with grep as backup
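
A sketch of how these fallback rules might fit together. The searchers are passed in as plain functions so the logic does not depend on any particular CLI invocation; all names here (`SearchHit`, `Searcher`, `searchWithFallback`) are hypothetical, and the notice strings are placeholders. It is synchronous for brevity; a real integration would likely be async.

```typescript
interface SearchHit {
  file: string;
  line: number;
}

// A searcher takes a query and returns hits, or throws on failure.
type Searcher = (query: string) => SearchHit[];

interface SearchOutcome {
  hits: SearchHit[];
  engine: "mgrep" | "grep";
  notice?: string; // user-facing message when a fallback happened
}

// `semantic` is undefined when mgrep is not installed.
function searchWithFallback(
  query: string,
  semantic: Searcher | undefined,
  text: Searcher
): SearchOutcome {
  if (semantic === undefined) {
    return {
      hits: text(query),
      engine: "grep",
      notice: "mgrep not installed; falling back to text search",
    };
  }
  try {
    return { hits: semantic(query), engine: "mgrep" };
  } catch (err) {
    // Covers a missing index as well as a failed query.
    const reason = err instanceof Error ? err.message : String(err);
    return {
      hits: text(query),
      engine: "grep",
      notice: `Falling back to text search (${reason})`,
    };
  }
}
```

Returning the engine and notice alongside the hits lets the UI tell the user which search actually ran.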

3. User Experience

Discovery: "Semantic search using mgrep" (new option in search)
Fallback: "Falling back to text search" if mgrep fails
Feedback: "Found 5 relevant results in 23ms"

Related Issues

Semantic code understanding (Serena): [see companion issue]

References

mgrep repo: https://github.com/mixedbread-ai/mgrep
Related demo (osgrep, a similar semantic search tool): https://www.scriptbyai.com/osgrep-semantic-search/
Codebase indexing: https://developertoolkit.ai/en/shared-workflows/context-management/codebase-indexing/

Acceptance Criteria

[ ] mgrep integrated as built-in tool
[ ] Semantic search option available to users
[ ] Auto-indexing configurable (on/off)
[ ] Fallback to grep if mgrep unavailable
[ ] Documentation updated with mgrep examples
[ ] Performance benchmarked (should be <500ms per query)
[ ] Token efficiency demonstrated (>20% savings)

Questions for Maintainers

Should mgrep be opt-in (config) or auto-detected?
Should mgrep auto-index on startup, or on-demand?
How should mgrep results be ranked vs grep results?
Any concerns about adding another binary dependency?

Dec 27 '25 20:12 layeddie