Feat: Add image extraction for webpages (media parsing)
Media Parsing: Image Extraction
- Extracts all image URLs and alt text from a given webpage.
- Converts relative URLs to absolute URLs.
- Removes duplicate images.
- Returns an array of objects: { url, altText }.
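
The resolve-and-deduplicate steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `normalizeImages` and its `{ src, alt }` input shape are hypothetical names, and only the standard `URL` constructor is used.

```javascript
// Hypothetical helper (not taken from the PR): turns raw { src, alt } pairs
// scraped from a page into the [{ url, altText }] shape described above.
function normalizeImages(candidates, pageUrl) {
  const seen = new Set();
  const images = [];
  for (const { src, alt } of candidates) {
    // Skip empty sources and inline data: URIs.
    if (!src || src.trim().toLowerCase().startsWith('data:')) continue;
    let absolute;
    try {
      // Resolve relative paths against the page URL.
      absolute = new URL(src.trim(), pageUrl).href;
    } catch {
      continue; // Unparseable URL: skip it.
    }
    if (seen.has(absolute)) continue; // Drop duplicates, keep first occurrence.
    seen.add(absolute);
    images.push({ url: absolute, altText: (alt || '').trim() });
  }
  return images;
}
```

Keeping the first occurrence of each URL means the alt text of the earliest matching element wins; a different policy (for example, preferring a non-empty alt text) would be an equally reasonable choice.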
Closes Issue #164
Summary by CodeRabbit
- New Features
  - Extract images from a webpage URL with alt text; handle responsive sources, resolve relative links, and deduplicate.
  - On-click media capture in the browser: capture images and PDFs, extract text (including OCR for images and PDF text), and forward captured media/text to the active session/workflow.
- Behavior Changes
  - Iframe-originated media messages are now strictly origin-validated before forwarding.
  - Fetch and processing errors surface clear messages; invalid media URLs generate warnings.
- Chores
  - Added HTTP/HTML parsing, OCR, and PDF libraries.
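
The origin validation mentioned under Behavior Changes might look something like the sketch below. The function name `isTrustedMediaMessage` and the single `expectedOrigin` parameter are assumptions made for illustration; the real check lives in `DOMBrowserRenderer.tsx` and may differ in detail.

```javascript
// Hypothetical helper (not the PR's code): decides whether a "message" event
// should be relayed. `expectedOrigin` is assumed to be the origin of the
// recorded iframe, e.g. derived from the snapshot's base URL.
function isTrustedMediaMessage(event, expectedOrigin) {
  // Reject messages posted from any other origin.
  if (!event || event.origin !== expectedOrigin) return false;
  const data = event.data;
  if (!data || data.type !== 'maxun:media-extracted') return false;
  try {
    // The reported media URL must share the expected origin as well.
    return new URL(data.url).origin === expectedOrigin;
  } catch {
    return false; // Missing or malformed URL.
  }
}
```

In a browser this would be called from a `window.addEventListener('message', ...)` handler before emitting anything on the socket.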
Walkthrough
Adds a media extraction pipeline: a new server-side mediaParser.extractImages(url); an in-page scraper that extracts image/PDF text (alt/title, OCR, pdf.js) and posts maxun:media-extracted; a front-end relay that emits dom:media-extracted to the server; a socket input handler that forwards the payload to the active generator; WorkflowGenerator changes that record and emit media events; and dependency updates.
Changes
| Cohort / File(s) | Summary of Changes |
|---|---|
| Media parsing module `mediaParser.js` | New async `extractImages(url)`: validates input, fetches HTML with axios (10s timeout, 10MB limits, custom User-Agent, up to 5 redirects), requires HTML content-type, parses with cheerio, extracts images from `<img>` (src/srcset) and `<picture>`/`<source>` (srcset), resolves absolute URLs, deduplicates (ignores `data:`), preserves altText, logs warnings for invalid URLs, returns `[{ url, altText }]`, logs and throws on failure. |
| Client-side media scraper `maxun-core/src/browserSide/scraper.js` | Adds click handler and helpers to extract media: image text via alt/title or Tesseract OCR fallback, PDF text via pdfjs-dist across pages, builds structural selector (via `GetSelectorStructural` if present), posts `maxun:media-extracted` to parent with `{ url, tag, selector, extractedText }`; dynamically loads tesseract.js and pdfjs-dist. |
| Front-end relay `src/components/recorder/DOMBrowserRenderer.tsx` | Adds effect listening for message events of type `maxun:media-extracted`; strictly validates origin matches recorded iframe (and that `data.url` has same origin when `snapshot.baseUrl` present), constructs payload `{ url, tag, selector, extractedText }`, and emits `dom:media-extracted` on the socket; cleans up listener on unmount. |
| Server socket input handler `server/src/browser-management/inputHandlers.ts` | Adds `onMediaExtracted` wrapper and `handleMediaExtracted` to forward media extraction payloads to the active generator with page-closed checks, error handling/logging, and registers `dom:media-extracted` socket listener. |
| Workflow generator `server/src/workflow-management/classes/Generator.ts` | Adds public method `handleMediaExtracted(data, page)` that appends a media event to `workflowRecord.events` (type, url, tag, selector, extractedText, timestamp) and attempts to emit `workflow:media-added` to the client; errors are caught and logged. |
| Dependencies `package.json` | Bumps axios to ^1.12.2; adds cheerio ^1.1.2, pdfjs-dist ^5.4.296, and tesseract.js ^6.0.1. |
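
For the srcset handling summarized in the table, a simplified parser could look like the sketch below. `parseSrcset` is a hypothetical name and this is not the PR's implementation; it splits on commas, so it would mis-handle URLs that themselves contain commas.

```javascript
// Hypothetical, simplified srcset parser: returns the URL part of each
// candidate and drops the width/density descriptors ("480w", "2x").
// Limitation: a plain comma split breaks on URLs containing commas.
function parseSrcset(srcset) {
  if (!srcset) return [];
  return srcset
    .split(',')
    .map((candidate) => candidate.trim().split(/\s+/)[0])
    .filter(Boolean);
}
```

Each returned URL would then go through the same resolve-and-deduplicate step as a plain `src` attribute.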
Sequence Diagram(s)
sequenceDiagram
autonumber
participant U as User (click)
participant B as In-page Scraper
participant P as Parent Window (iframe)
participant R as DOMBrowserRenderer
participant S as Socket / Server
participant H as Input Handler
participant G as WorkflowGenerator
U->>B: click element (img / iframe / object)
alt image
B->>B: resolve url, read alt/title or OCR (tesseract)
else pdf/frame
B->>B: fetch/read PDF text (pdfjs)
end
B->>P: postMessage {type: "maxun:media-extracted", url, tag, selector, extractedText}
P->>R: message received (origin checked)
R->>S: socket.emit "dom:media-extracted" {url, tag, selector, extractedText}
S->>H: receive dom:media-extracted
H->>G: call handleMediaExtracted(data, page)
G->>G: append media event to workflowRecord.events (timestamp)
G-->>S: emit "workflow:media-added" (best-effort)
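
The last three steps of the diagram (handler to generator, record, best-effort emit) can be sketched as below. `SketchGenerator` and `forwardMediaExtracted` are illustrative stand-ins rather than the PR's classes, the event `type` value `'media'` is an assumption, and `page` is only expected to expose an `isClosed()` method as Playwright pages do.

```javascript
// Illustrative stand-in for the workflow generator (not the PR's class).
class SketchGenerator {
  constructor(socket) {
    this.socket = socket;
    this.workflowRecord = { events: [] };
  }

  handleMediaExtracted(data, page) {
    const event = {
      type: 'media', // assumed value; the source only says a "type" field exists
      url: data.url,
      tag: data.tag,
      selector: data.selector,
      extractedText: data.extractedText,
      timestamp: Date.now(),
    };
    this.workflowRecord.events.push(event);
    try {
      // Best-effort notification: a failing socket must not lose the event.
      this.socket.emit('workflow:media-added', event);
    } catch (err) {
      console.error('Failed to emit workflow:media-added:', err.message);
    }
    return event;
  }
}

// Illustrative input handler: forwards only while the page is still open.
function forwardMediaExtracted(generator, page, data) {
  if (!page || page.isClosed()) return false;
  generator.handleMediaExtracted(data, page);
  return true;
}
```

Recording the event before emitting keeps the workflow record consistent even when the client connection has dropped.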
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
- getmaxun/maxun#637 — related axios dependency bump and context for axios usage in the new media parsing code.
Suggested labels
Type: Feature, Scope: Recorder
Suggested reviewers
- amhsirak
- RohitR311
Poem
I hop through pages, whiskers keen,
sniff srcsets, alt text, every scene.
Axios fetches, Cheerio peers,
Tesseract reads what sight once feared.
A rabbit cheers: media found — hop, click, glean! 🐇📸
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title is clear and concise and highlights the primary feature added (image extraction for webpages via media parsing), which matches the main functional change in the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 3a68e600b688d829467710c1f44b922314a47c79 and 976884473a5098fd133fbef815a0b4fa5713032a.
📒 Files selected for processing (1)
- src/components/recorder/DOMBrowserRenderer.tsx (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/components/recorder/DOMBrowserRenderer.tsx
Worked on Issue #164 to add image extraction for webpages. Let me know if any improvements are needed!
Thanks for the feedback! I understand that this PR doesn’t address the feature requested in issue #164. I’ll update it to extract data from web-hosted media files and integrate that directly into the robot recording flow, and I’ll submit an updated PR soon.