Feature Request: Integrate mgrep for Semantic Code Search

Open layeddie opened this issue 1 month ago • 4 comments

Add support for mgrep - a semantic grep tool that enables natural language queries across code, images, and PDFs.

Summary

When working with large codebases, finding code often requires knowing exact function/class names. OpenCode's current search works well for exact text matching, but fails when I ask:

  • "Where do we handle API rate limiting?"
  • "Show me error handling for payment processing"
  • "Find all places that validate user input"
  • "Search for database migration files"

These questions describe what code does, not what it's called.

mgrep solves this with semantic search:

  • Matches concepts and meanings, not just text strings
  • Works across code, images, and PDFs
  • CLI-native, fast, lightweight
  • Uses embeddings for natural language understanding

Why mgrep?

Key Features

  1. Natural Language Queries: Search by intent ("where do we handle errors?" vs grep "error")
  2. Multimodal: Search code + images + PDFs in one query
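  3. 100% Local: Embeddings run locally via transformers.js, no API keys (this is how the osgrep write-up under References describes it; to be confirmed against the mgrep README)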
  4. Repository Isolation: Each project gets its own index
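  5. Token Efficient: Reduces LLM token usage vs grep (~20% is the conservative target; the scenario under Token Efficiency Analysis is an illustrative best case)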

Use Case Examples

Current (grep/ripgrep):

# User asks: "Where is authentication logic?"
# Only matches the literal string, so you must already know the naming (e.g. IdentityManager.ts)
grep -r "authenticate" ./src

With mgrep:

# User asks: "Where is authentication logic?"
# Expected results (illustrative): IdentityManager.ts, AuthController.ts, LoginService.ts
# With line numbers and context
mgrep "authentication logic"

Proposed Implementation

Option 1: Built-in Tool Integration (Recommended)

Add mgrep as a built-in tool in OpenCode's search arsenal:

{
  name: "mgrep_semantic_search",
  description: "Semantic search using mgrep - finds code by meaning, not just text",
  parameters: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Natural language query (e.g., 'where do we handle rate limiting?')"
      },
      path: {
        type: "string",
        description: "Search path (default: current directory)"
      },
      file_types: {
        type: "array",
        items: { type: "string" },
        description: "File extensions to search (e.g., ['ts', 'js', 'py'])"
      }
    },
    required: ["query"]
  }
}
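
A minimal sketch of how the arguments for this tool could be validated before any search runs. The function name `parseSemanticSearchArgs` and the `SemanticSearchArgs` type are hypothetical; only the field names `query`, `path`, and `file_types` come from the definition above, and OpenCode's real tool registration API may look different.

```typescript
// Hypothetical validated form of the tool arguments (not an OpenCode type).
interface SemanticSearchArgs {
  query: string;
  path: string;
  fileTypes: string[];
}

function parseSemanticSearchArgs(input: Record<string, unknown>): SemanticSearchArgs {
  // 'query' is the only required field.
  const query = input["query"];
  if (typeof query !== "string" || query.trim() === "") {
    throw new Error("'query' is required and must be a non-empty string");
  }

  // 'path' is optional and defaults to the current directory.
  let path = ".";
  const rawPath = input["path"];
  if (typeof rawPath === "string") {
    if (rawPath !== "") path = rawPath;
  } else if (rawPath !== undefined) {
    throw new Error("'path' must be a string");
  }

  // 'file_types' is optional; accept both "ts" and ".ts" and normalize case.
  const fileTypes: string[] = [];
  const rawTypes = input["file_types"];
  if (Array.isArray(rawTypes)) {
    for (const t of rawTypes) {
      if (typeof t !== "string") {
        throw new Error("'file_types' must contain only strings");
      }
      fileTypes.push(t.replace(/^\./, "").toLowerCase());
    }
  } else if (rawTypes !== undefined) {
    throw new Error("'file_types' must be an array of strings");
  }

  return { query: query.trim(), path, fileTypes };
}
```

Validating up front keeps malformed model output from reaching the CLI call, whatever form that call ends up taking.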

Option 2: Auto-Detection

Automatically integrate mgrep when:

  1. mgrep CLI is installed and in PATH
  2. mgrep index exists for project
  3. User enables semantic search in config

Config:

{
  "$schema": "https://opencode.ai/config.json",
  "tools": {
    "mgrep": {
      "enabled": true,
      "autoIndex": true,
      "defaultPath": ".",
      "indexOnStart": false
    }
  }
}
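
The three auto-detection conditions can be sketched as a pair of small functions. Everything here is an assumption for illustration: `findOnPath`, `shouldUseMgrep`, and the `MgrepConfig` type are hypothetical names, the config fields simply mirror the proposed block above, and the PATH handling is POSIX-style (":" separator, "/" paths). The file-existence check is injected so the logic can be tested without touching the file system.

```typescript
// Hypothetical config shape mirroring the proposed "tools.mgrep" block.
interface MgrepConfig {
  enabled: boolean;
  autoIndex: boolean;
  defaultPath: string;
  indexOnStart: boolean;
}

// Returns the first "<dir>/<binary>" candidate from a PATH-style string
// for which `exists` returns true, or undefined if none matches.
function findOnPath(
  binary: string,
  pathEnv: string,
  exists: (candidate: string) => boolean,
  separator: string = ":",
): string | undefined {
  for (const dir of pathEnv.split(separator)) {
    if (dir === "") continue;
    const candidate = dir.endsWith("/") ? dir + binary : dir + "/" + binary;
    if (exists(candidate)) return candidate;
  }
  return undefined;
}

// All three conditions from the list must hold:
// binary found, index present, and the user enabled the tool in config.
function shouldUseMgrep(
  binaryPath: string | undefined,
  indexExists: boolean,
  config: MgrepConfig | undefined,
): boolean {
  return binaryPath !== undefined && indexExists && config?.enabled === true;
}
```

If any condition fails, the caller would simply keep using the existing text search.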

Option 3: MCP Server

Build mgrep as MCP server (requires upstream work):

{
  "mcp": {
    "mgrep": {
      "type": "local",
      "command": "mgrep",
      "args": ["mcp"],
      "enabled": true
    }
  }
}

Note: This requires mgrep team to add MCP support first.

Use Cases

1. Finding Logic Without Knowing Names

Scenario: Developer wants to find "rate limiting" but doesn't know file is named RateLimiter.ts

User prompt:

"Where do we handle API rate limiting in the payment service?"

Current behavior:

  • Uses grep for "rate" - returns many false positives (separate, generate, iterate)
  • User must manually scan results
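
To make the false-positive problem concrete, here is a small sketch comparing a plain substring match (what an unanchored grep pattern does) with a whole-word match (roughly `grep -w`). The sample lines are made up for illustration.

```typescript
// Made-up source lines standing in for a codebase.
const sampleLines: string[] = [
  "const rateLimiter = createLimiter();",
  "// keep billing and shipping separate",
  "function generateInvoice() {}",
  "for (const item of items) iterate(item);",
  "const latency = Date.now() - start;",
  "// rate limit: 100 requests per minute",
];

// Substring match: hits every line that contains the letters "rate" anywhere.
const substringHits = sampleLines.filter((line) => line.includes("rate"));

// Whole-word match: removes the noise, but also drops the rateLimiter line,
// because "rate" is only part of that identifier.
const wordHits = sampleLines.filter((line) => /\brate\b/.test(line));

console.log(substringHits.length, wordHits.length);
```

The substring search returns five of the six lines, three of them unrelated; the whole-word search returns only the comment and misses the actual limiter. Neither captures intent, which is the gap semantic search is meant to fill.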

With mgrep:

  • Returns src/payment/RateLimiter.ts with relevant code snippet
  • Also finds src/api/ThrottlingMiddleware.ts (related concept)
  • Shows context from both files

2. Cross-Filetype Search

Scenario: Architecture documentation is in PDFs, code is in TS

User prompt:

"Show me how payment flow is documented"

Current behavior:

  • Only searches code files
  • Misses architecture PDFs

With mgrep:

  • Returns docs/architecture/payment-flow.pdf
  • Returns src/payment/checkout.ts
  • Returns docs/api/payment-endpoints.md

3. Bug Investigation

Scenario: Bug report mentions "error when processing payments"

User prompt:

"Find all error handling for payment processing"

Current behavior:

  • grep -r "error" payment/ - returns hundreds of matches
  • Includes console.log errors, unrelated error variables

With mgrep:

  • Returns only PaymentErrorHandler.ts and PaymentException.ts
  • Shows try-catch blocks in PaymentService.ts
  • Returns error handling tests in payment.test.ts

4. Code Discovery

Scenario: New developer joining team, exploring codebase

User prompt:

"How do we validate user input?"

With mgrep:

  • Returns ValidationService.ts
  • Returns UserValidator.ts
  • Returns input validation schemas in src/schemas/
  • Shows usage examples across project

Technical Details

mgrep Capabilities

From mgrep repo:

  • Semantic search using local embeddings
  • Multi-format support (code, images, PDFs)
  • Automatic repository indexing
  • Natural language query processing
  • Fast, CLI-native

Performance Characteristics

  • Indexing: O(n) - one-time per project
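  • Query: O(n) for a brute-force similarity scan over embeddings; sub-linear only with an approximate nearest-neighbor index (embeddings cannot be binary-searched)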
  • Memory: ~10-50MB index per project (depending on size)
  • Startup: <100ms to load existing index
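
For intuition about query cost, here is a minimal sketch of a brute-force embedding search using cosine similarity. It is a generic illustration, not mgrep's actual implementation; `IndexedChunk`, `cosineSimilarity`, and `topK` are hypothetical names. The scan touches every stored vector once, so it is linear in index size; real tools often add an approximate nearest-neighbor index to avoid that.

```typescript
// Hypothetical index entry: one embedded chunk of a file.
interface IndexedChunk {
  file: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every chunk against the query vector and keeps the k best.
function topK(query: number[], chunks: IndexedChunk[], k: number): { file: string; score: number }[] {
  return chunks
    .map((chunk) => ({ file: chunk.file, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k);
}
```

With tiny two-dimensional vectors, a query of `[1, 0.1]` ranks a chunk embedded as `[1, 0]` ahead of `[1, 1]`, and both ahead of `[0, 1]`.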

Integration Effort

  • Built-in Tool: Medium (1-2 days) - Add tool registration
  • Auto-Detection: Low (4-8 hours) - Add detection + config
  • MCP Server: High (Requires mgrep team) - Out of scope for now

Comparison with Existing Tools

Tool      Text Search   Semantic Search   Multimodal   Local
grep      Yes           No                No           Yes
ripgrep   Yes           No                No           Yes
fzf       Yes (fuzzy)   No                No           Yes
mgrep     No            Yes               Yes          Yes

Why mgrep Complements Existing Tools

  • grep/ripgrep: Still needed for exact text matching
  • fzf: Still needed for interactive file finding
  • mgrep: Adds semantic understanding layer on top

Token Efficiency Analysis

Current Workflow (grep)

User: "Find payment error handling"
AI → Runs grep → 5000 matches
AI → Reads top 50 results → 15K tokens
AI → Filters relevant → 10K tokens
Total context: 25K tokens

With mgrep

User: "Find payment error handling"
AI → Runs mgrep → 5 relevant results
AI → Reads all results → 3K tokens
Total context: 3K tokens

Savings: ~22K tokens (88% reduction)
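
The arithmetic behind that figure, as a small sketch. The token counts are the illustrative numbers from the two workflows above, not measurements; `tokenSavings` is a hypothetical helper.

```typescript
function tokenSavings(baselineTokens: number, semanticTokens: number): { saved: number; percent: number } {
  const saved = baselineTokens - semanticTokens;
  // Multiply before dividing to keep the result exact for round inputs.
  return { saved, percent: (saved * 100) / baselineTokens };
}

// grep workflow: 15K tokens read + 10K tokens spent filtering.
const grepTotal = 15000 + 10000;
// mgrep workflow: 3K tokens for the five relevant results.
const mgrepTotal = 3000;

const { saved, percent } = tokenSavings(grepTotal, mgrepTotal);
console.log(`saved ${saved} tokens (${percent}%)`);
```

Real savings would depend on the query and codebase, so the benchmark in the acceptance criteria should report measured values.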

Implementation Considerations

1. Auto-Indexing Strategy

  • On startup: Check if mgrep index exists
  • On file change: Re-index only changed files (incremental)
  • User control: Config option autoIndex (default: false)
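
A rough sketch of the incremental part: compare a stored snapshot of modification times with the current one and re-index only what changed. `Snapshot` and `planReindex` are hypothetical names, and using mtimes as the change signal is an assumption; content hashes would be a stricter alternative.

```typescript
// File path -> last-modified time in milliseconds.
type Snapshot = Record<string, number>;

function planReindex(previous: Snapshot, current: Snapshot): { reindex: string[]; remove: string[] } {
  const reindex: string[] = [];
  const remove: string[] = [];

  // New files (absent from the old snapshot) and modified files need indexing.
  for (const file of Object.keys(current)) {
    if (previous[file] !== current[file]) reindex.push(file);
  }

  // Files that disappeared should be dropped from the index.
  for (const file of Object.keys(previous)) {
    if (!(file in current)) remove.push(file);
  }

  return { reindex, remove };
}
```

Unchanged files never appear in either list, so a typical edit touches only a handful of index entries.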

2. Error Handling

  • If mgrep not installed: Fall back to grep
  • If index missing: Show user-friendly message
  • If query fails: Retry with grep as backup
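
The fallback rules could look roughly like this. The backends are passed in as plain functions so the sketch stays independent of how mgrep or grep is actually invoked; `SearchBackend`, `SearchOutcome`, and `searchWithFallback` are hypothetical names, and the calls are synchronous only for brevity.

```typescript
type SearchBackend = (query: string) => string[];

interface SearchOutcome {
  results: string[];
  backend: "mgrep" | "grep";
  notice?: string; // user-facing message when a fallback happened
}

function searchWithFallback(
  query: string,
  semantic: SearchBackend | undefined, // undefined when mgrep is not installed
  textSearch: SearchBackend,
): SearchOutcome {
  if (semantic === undefined) {
    return {
      results: textSearch(query),
      backend: "grep",
      notice: "mgrep not installed; using text search",
    };
  }
  try {
    return { results: semantic(query), backend: "mgrep" };
  } catch (err) {
    // Covers both a missing index and a failed query.
    const reason = err instanceof Error ? err.message : String(err);
    return {
      results: textSearch(query),
      backend: "grep",
      notice: `Falling back to text search (${reason})`,
    };
  }
}
```

Returning the backend name with the results lets the UI tell the user which search actually ran.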

3. User Experience

  • Discovery: "Semantic search using mgrep" (new option in search)
  • Fallback: "Falling back to text search" if mgrep fails
  • Feedback: "Found 5 relevant results in 23ms"

Related Issues

  • Semantic code understanding (Serena): [see companion issue]

References

  • mgrep repo: https://github.com/mixedbread-ai/mgrep
  • Demo write-up (the page covers osgrep): https://www.scriptbyai.com/osgrep-semantic-search/
  • Codebase indexing: https://developertoolkit.ai/en/shared-workflows/context-management/codebase-indexing/

Acceptance Criteria

  • [ ] mgrep integrated as built-in tool
  • [ ] Semantic search option available to users
  • [ ] Auto-indexing configurable (on/off)
  • [ ] Fallback to grep if mgrep unavailable
  • [ ] Documentation updated with mgrep examples
  • [ ] Performance benchmarked (should be <500ms per query)
  • [ ] Token efficiency demonstrated (>20% savings)

Questions for Maintainers

  1. Should mgrep be opt-in (config) or auto-detected?
  2. Should mgrep auto-index on startup, or on-demand?
  3. How should mgrep results be ranked vs grep results?
  4. Any concerns about adding another binary dependency?

layeddie • Dec 27 '25 20:12