Feat: Add image extraction for webpages (media parsing)

Open • Aman-Raj-bat opened this issue 4 months ago • 3 comments

Media Parsing: Image Extraction

  • Extracts all image URLs and alt text from a given webpage.
  • Converts relative URLs to absolute URLs.
  • Removes duplicate images.
  • Returns an array of objects: { url, altText }.
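
The URL resolution and deduplication steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `RawImage`, `ExtractedImage`, and `normalizeImages` are hypothetical names, and the sketch assumes the `src`/`alt` pairs have already been pulled out of the HTML. Skipping `data:` URIs follows the walkthrough's description of `mediaParser.js`.

```typescript
// Hypothetical shapes; the PR only specifies the output form { url, altText }.
interface RawImage {
  src: string;
  alt?: string;
}

interface ExtractedImage {
  url: string;
  altText: string;
}

// Resolve each source against the page URL, skip inline data: URIs,
// and keep only the first occurrence of every absolute URL.
function normalizeImages(pageUrl: string, raw: RawImage[]): ExtractedImage[] {
  const seen = new Set<string>();
  const result: ExtractedImage[] = [];

  for (const { src, alt } of raw) {
    const trimmed = src.trim();
    if (trimmed === "" || trimmed.toLowerCase().startsWith("data:")) {
      continue;
    }

    let absolute: string;
    try {
      // The WHATWG URL constructor resolves relative references against a base.
      absolute = new URL(trimmed, pageUrl).href;
    } catch {
      continue; // unparseable URL; a real implementation might log a warning
    }

    if (seen.has(absolute)) {
      continue;
    }
    seen.add(absolute);
    result.push({ url: absolute, altText: (alt ?? "").trim() });
  }

  return result;
}
```

With this approach, `/img/a.png` and `../img/a.png` on the page `https://example.com/blog/post` both resolve to `https://example.com/img/a.png`, so only the first entry (and its alt text) is kept.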

Closes Issue #164

Summary by CodeRabbit

  • New Features

    • Extract images from a webpage URL with alt text; handle responsive sources, resolve relative links, and deduplicate.
    • On-click media capture in the browser: capture images and PDFs, extract text (including OCR for images and PDF text), and forward captured media/text to the active session/workflow.
  • Behavior Changes

    • Iframe-originated media messages are now strictly origin-validated before forwarding.
    • Fetch and processing errors surface clear messages; invalid media URLs generate warnings.
  • Chores

    • Added HTTP/HTML parsing, OCR, and PDF libraries.

Aman-Raj-bat, Oct 03 '25 14:10

Walkthrough

Adds a media extraction pipeline with the following parts: a new server-side mediaParser.extractImages(url); an in-page scraper that extracts image and PDF text (alt/title, OCR, pdf.js) and posts maxun:media-extracted; a front-end relay that emits dom:media-extracted to the server; a socket input handler that forwards the payload to the active generator; and a WorkflowGenerator method that records media events and emits them to the client. Dependencies are updated to match.
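
As a rough illustration of the last step in that pipeline (recording a media event and notifying the client on a best-effort basis), consider the sketch below. It is written under assumptions: `recordMediaEvent`, the `"media"` type value, the numeric timestamp, and the `emit` callback signature are all hypothetical. The source only states that the event carries type, url, tag, selector, extractedText, and timestamp, and that emit errors are caught and logged.

```typescript
// Payload fields as named in the walkthrough.
interface MediaPayload {
  url: string;
  tag: string;
  selector: string;
  extractedText: string;
}

// Assumed event shape; the literal "media" and the epoch-ms timestamp are guesses.
interface MediaEvent extends MediaPayload {
  type: "media";
  timestamp: number;
}

type Emit = (eventName: string, payload: MediaEvent) => void;

// Append the event to the record, then try to notify the client.
// A failing emit must not lose the recorded event.
function recordMediaEvent(
  events: MediaEvent[],
  data: MediaPayload,
  emit?: Emit,
  now: () => number = Date.now,
): MediaEvent {
  const event: MediaEvent = {
    type: "media",
    url: data.url,
    tag: data.tag,
    selector: data.selector,
    extractedText: data.extractedText,
    timestamp: now(),
  };
  events.push(event);

  try {
    emit?.("workflow:media-added", event);
  } catch (err) {
    console.warn("Failed to emit workflow:media-added", err);
  }

  return event;
}
```

Recording before emitting means a dropped socket only costs the client notification, not the workflow entry itself.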

Changes

Cohort / File(s) | Summary of Changes
Media parsing module: mediaParser.js | New async extractImages(url): validates input, fetches HTML with axios (10s timeout, 10MB limits, custom User-Agent, up to 5 redirects), requires HTML content-type, parses with cheerio, extracts images from <img> (src/srcset) and <picture>/<source> (srcset), resolves absolute URLs, deduplicates (ignores data:), preserves altText, logs warnings for invalid URLs, returns [{ url, altText }], logs and throws on failure.
Client-side media scraper: maxun-core/src/browserSide/scraper.js | Adds click handler and helpers to extract media: image text via alt/title or Tesseract OCR fallback, PDF text via pdfjs-dist across pages, builds structural selector (via GetSelectorStructural if present), posts maxun:media-extracted to parent with { url, tag, selector, extractedText }; dynamically loads tesseract.js and pdfjs-dist.
Front-end relay: src/components/recorder/DOMBrowserRenderer.tsx | Adds effect listening for message events of type maxun:media-extracted; strictly validates origin matches recorded iframe (and that data.url has same origin when snapshot.baseUrl present), constructs payload { url, tag, selector, extractedText }, and emits dom:media-extracted on the socket; cleans up listener on unmount.
Server socket input handler: server/src/browser-management/inputHandlers.ts | Adds onMediaExtracted wrapper and handleMediaExtracted to forward media extraction payloads to the active generator with page-closed checks, error handling/logging, and registers dom:media-extracted socket listener.
Workflow generator: server/src/workflow-management/classes/Generator.ts | Adds public method handleMediaExtracted(data, page) that appends a media event to workflowRecord.events (type, url, tag, selector, extractedText, timestamp) and attempts to emit workflow:media-added to the client; errors are caught and logged.
Dependencies: package.json | Bumps axios to ^1.12.2; adds cheerio ^1.1.2, pdfjs-dist ^5.4.296, and tesseract.js ^6.0.1.
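
The srcset handling described for mediaParser.js could look something like the sketch below. This is a simplified assumption about the approach, not the PR's code: `parseSrcset` is a hypothetical helper, and it splits on every comma, so it will mis-handle candidate URLs that themselves contain commas (the HTML specification's srcset parsing algorithm covers that case).

```typescript
// Return the URL part of each srcset candidate, dropping the
// width ("480w") or pixel-density ("2x") descriptor that follows it.
function parseSrcset(srcset: string): string[] {
  return srcset
    .split(",")
    .map((candidate) => candidate.trim().split(/\s+/)[0] ?? "")
    .filter((url) => url.length > 0);
}
```

For example, `parseSrcset("small.jpg 480w, medium.jpg 800w, large.jpg 2x")` yields the three file names in order; each would then be resolved against the page URL and deduplicated like any other image source.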

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant U as User (click)
    participant B as In-page Scraper
    participant P as Parent Window (iframe)
    participant R as DOMBrowserRenderer
    participant S as Socket / Server
    participant H as Input Handler
    participant G as WorkflowGenerator

    U->>B: click element (img / iframe / object)
    alt image
        B->>B: resolve url, read alt/title or OCR (tesseract)
    else pdf/frame
        B->>B: fetch/read PDF text (pdfjs)
    end
    B->>P: postMessage {type: "maxun:media-extracted", url, tag, selector, extractedText}
    P->>R: message received (origin checked)
    R->>S: socket.emit "dom:media-extracted" {url, tag, selector, extractedText}
    S->>H: receive dom:media-extracted
    H->>G: call handleMediaExtracted(data, page)
    G->>G: append media event to workflowRecord.events (timestamp)
    G-->>S: emit "workflow:media-added" (best-effort)
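
The origin check between the in-page scraper and the relay (the "origin checked" step in the diagram) might be sketched as a pure function like the one below. The names `toMediaPayload`, `iframeOrigin`, and `baseUrl` are illustrative assumptions, as is the choice to default missing string fields to an empty string; how the real component obtains the expected origin is not shown in the source.

```typescript
interface MediaPayload {
  url: string;
  tag: string;
  selector: string;
  extractedText: string;
}

// Return a clean payload if the message is trustworthy, otherwise null.
function toMediaPayload(
  eventOrigin: string,
  data: unknown,
  iframeOrigin: string,
  baseUrl?: string,
): MediaPayload | null {
  // Reject messages that do not come from the recorded iframe's origin.
  if (eventOrigin !== iframeOrigin) return null;
  if (typeof data !== "object" || data === null) return null;

  const msg = data as Record<string, unknown>;
  if (msg["type"] !== "maxun:media-extracted") return null;

  const url = msg["url"];
  if (typeof url !== "string") return null;

  // When a base URL is known, the media URL must share its origin.
  if (baseUrl !== undefined) {
    try {
      if (new URL(url).origin !== new URL(baseUrl).origin) return null;
    } catch {
      return null; // either URL failed to parse
    }
  }

  const asString = (value: unknown): string =>
    typeof value === "string" ? value : "";

  return {
    url,
    tag: asString(msg["tag"]),
    selector: asString(msg["selector"]),
    extractedText: asString(msg["extractedText"]),
  };
}
```

In a message listener, the relay would call this with `event.origin` and `event.data` and emit `dom:media-extracted` on the socket only when the result is non-null. Rebuilding the payload field by field, rather than forwarding `event.data` as-is, keeps unexpected properties from reaching the server.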

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • getmaxun/maxun#637 — related axios dependency bump and context for axios usage in the new media parsing code.

Suggested labels

Type: Feature, Scope: Recorder

Suggested reviewers

  • amhsirak
  • RohitR311

Poem

I hop through pages, whiskers keen,
sniff srcsets, alt text, every scene.
Axios fetches, Cheerio peers,
Tesseract reads what sight once feared.
A rabbit cheers: media found — hop, click, glean! 🐇📸

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title succinctly highlights the primary feature added (image extraction for webpages via media parsing), mirrors the main functional change implemented in the PR, and is clear and concise.
Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a68e600b688d829467710c1f44b922314a47c79 and 976884473a5098fd133fbef815a0b4fa5713032a.

📒 Files selected for processing (1)
  • src/components/recorder/DOMBrowserRenderer.tsx (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/components/recorder/DOMBrowserRenderer.tsx

coderabbitai[bot], Oct 03 '25 14:10

Worked on Issue #164 to add image extraction for webpages. Let me know if any improvements are needed!

Aman-Raj-bat, Oct 03 '25 14:10

Thanks for the feedback! I understand that this PR doesn’t address the feature requested in issue #164. I’ll update it to extract data from web-hosted media files and integrate that directly into the robot recording flow, and I’ll submit an updated PR soon.

Aman-Raj-bat, Oct 07 '25 17:10