maxun Feat: Add image extraction for webpages (media parsing)

Media Parsing: Image Extraction

Extracts all image URLs and alt text from a given webpage.
Converts relative URLs to absolute URLs.
Removes duplicate images.
Returns an array of objects: { url, altText }.
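
The resolve-and-deduplicate steps listed above could look roughly like the sketch below. It is not the code from this PR: `normalizeImages` and the `{ src, alt }` input shape are hypothetical names chosen for illustration, and only the standard `URL` constructor is used.

```typescript
interface ExtractedImage {
  url: string;
  altText: string;
}

// Hypothetical helper (not from the PR): resolves each raw src against the
// page URL, skips empty values and data: URIs, and keeps the first
// occurrence of every absolute URL.
function normalizeImages(
  rawImages: { src: string; alt?: string }[],
  pageUrl: string
): ExtractedImage[] {
  const seen = new Set<string>();
  const result: ExtractedImage[] = [];

  for (const { src, alt } of rawImages) {
    const trimmed = src.trim();
    if (trimmed === "" || trimmed.startsWith("data:")) continue;

    let absolute: string;
    try {
      // new URL(relative, base) performs the relative-to-absolute resolution.
      absolute = new URL(trimmed, pageUrl).href;
    } catch (_err) {
      continue; // unparsable URL: skip it rather than fail the whole run
    }

    if (seen.has(absolute)) continue;
    seen.add(absolute);
    result.push({ url: absolute, altText: alt ?? "" });
  }
  return result;
}
```

Keying the `Set` on the resolved URL means `/a.png` and its fully qualified form count as the same image, with the alt text of the first occurrence kept.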

Closes Issue #164

Summary by CodeRabbit

New Features
- Extract images from a webpage URL with alt text; handle responsive sources, resolve relative links, and deduplicate.
- On-click media capture in the browser: capture images and PDFs, extract text (including OCR for images and PDF text), and forward captured media/text to the active session/workflow.
Behavior Changes
- Iframe-originated media messages are now strictly origin-validated before forwarding.
- Fetch and processing errors surface clear messages; invalid media URLs generate warnings.
Chores
- Added HTTP/HTML parsing, OCR, and PDF libraries.

Oct 03 '25 14:10 Aman-Raj-bat

Walkthrough

Adds a media extraction pipeline: a new server-side mediaParser.extractImages(url); an in-page scraper that extracts image and PDF text (alt/title, OCR, pdf.js) and posts maxun:media-extracted; a front-end relay that emits dom:media-extracted to the server; a socket input handler that forwards the payload to the active generator; and a WorkflowGenerator method that records and emits media events. Dependencies are updated accordingly.

Changes

Cohort / File(s)	Summary of Changes
Media parsing module `mediaParser.js`	New async `extractImages(url)` — validates input, fetches HTML with `axios` (10s timeout, 10MB limits, custom User-Agent, up to 5 redirects), requires HTML content-type, parses with `cheerio`, extracts images from `<img>` (`src`/`srcset`) and `<picture>`/`<source>` (`srcset`), resolves absolute URLs, deduplicates (ignores `data:`), preserves `altText`, logs warnings for invalid URLs, returns `[{ url, altText }]`, logs and throws on failure.
Client-side media scraper `maxun-core/src/browserSide/scraper.js`	Adds click handler and helpers to extract media: image text via `alt/title` or Tesseract OCR fallback, PDF text via `pdfjs-dist` across pages, builds structural selector (via `GetSelectorStructural` if present), posts `maxun:media-extracted` to parent with `{ url, tag, selector, extractedText }`; dynamically loads `tesseract.js` and `pdfjs-dist`.
Front-end relay `src/components/recorder/DOMBrowserRenderer.tsx`	Adds effect listening for `message` events of type `maxun:media-extracted`; strictly validates origin matches recorded iframe (and that `data.url` has same origin when snapshot.baseUrl present), constructs payload `{ url, tag, selector, extractedText }`, and emits `dom:media-extracted` on the socket; cleans up listener on unmount.
Server socket input handler `server/src/browser-management/inputHandlers.ts`	Adds `onMediaExtracted` wrapper and `handleMediaExtracted` to forward media extraction payloads to the active generator with page-closed checks, error handling/logging, and registers `dom:media-extracted` socket listener.
Workflow generator `server/src/workflow-management/classes/Generator.ts`	Adds public method `handleMediaExtracted(data, page)` that appends a media event to `workflowRecord.events` (type, url, tag, selector, extractedText, timestamp) and attempts to emit `workflow:media-added` to the client; errors are caught and logged.
Dependencies `package.json`	Bumps `axios` to `^1.12.2`; adds `cheerio` `^1.1.2`, `pdfjs-dist` `^5.4.296`, and `tesseract.js` `^6.0.1`.
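
The `srcset` handling mentioned for `mediaParser.js` can be illustrated with a small sketch. This is an assumption about one simple way to do it, not the PR's implementation; `parseSrcset` is a hypothetical name, and the naive comma split would mis-handle URLs that themselves contain commas.

```typescript
// Hypothetical helper (not from the PR): returns the URL part of every
// candidate in a srcset value such as "a.jpg 1x, b.jpg 2x".
// Limitation: splitting on "," breaks URLs that contain commas.
function parseSrcset(srcset: string): string[] {
  return srcset
    .split(",")
    // Each candidate is "<url> [descriptor]"; keep only the first token.
    .map((candidate) => candidate.trim().split(/\s+/)[0] ?? "")
    .filter((url) => url.length > 0);
}
```

Each returned URL would still need to be resolved against the page URL and deduplicated before being reported as `{ url, altText }`.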

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant U as User (click)
    participant B as In-page Scraper
    participant P as Parent Window (iframe)
    participant R as DOMBrowserRenderer
    participant S as Socket / Server
    participant H as Input Handler
    participant G as WorkflowGenerator

    U->>B: click element (img / iframe / object)
    alt image
        B->>B: resolve url, read alt/title or OCR (tesseract)
    else pdf/frame
        B->>B: fetch/read PDF text (pdfjs)
    end
    B->>P: postMessage {type: "maxun:media-extracted", url, tag, selector, extractedText}
    P->>R: message received (origin checked)
    R->>S: socket.emit "dom:media-extracted" {url, tag, selector, extractedText}
    S->>H: receive dom:media-extracted
    H->>G: call handleMediaExtracted(data, page)
    G->>G: append media event to workflowRecord.events (timestamp)
    G-->>S: emit "workflow:media-added" (best-effort)
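
The origin check between the parent window and the renderer (the `message received (origin checked)` step) might be sketched as a pure function like the one below. The name `toMediaPayload` and its parameters are hypothetical; the actual logic in `DOMBrowserRenderer.tsx` may differ. The idea is to reject any message whose origin, type, or URL does not match expectations before anything is emitted on the socket.

```typescript
interface MediaPayload {
  url: string;
  tag: string;
  selector: string;
  extractedText: string;
}

// Hypothetical validator (not from the PR): returns a clean payload, or null
// if the message must not be forwarded.
function toMediaPayload(
  eventOrigin: string,    // MessageEvent.origin of the incoming message
  data: unknown,          // MessageEvent.data
  expectedOrigin: string, // origin of the recorded iframe
  baseUrl?: string        // assumed to play the role of snapshot.baseUrl
): MediaPayload | null {
  if (eventOrigin !== expectedOrigin) return null;
  if (typeof data !== "object" || data === null) return null;

  const msg = data as Record<string, unknown>;
  if (msg["type"] !== "maxun:media-extracted") return null;

  const url = msg["url"];
  if (typeof url !== "string") return null;

  if (baseUrl !== undefined) {
    try {
      // When a base URL is known, the media URL must share its origin.
      if (new URL(url).origin !== new URL(baseUrl).origin) return null;
    } catch (_err) {
      return null; // unparsable URL: treat as invalid
    }
  }

  const asString = (v: unknown): string => (typeof v === "string" ? v : "");
  return {
    url,
    tag: asString(msg["tag"]),
    selector: asString(msg["selector"]),
    extractedText: asString(msg["extractedText"]),
  };
}
```

A `message` listener would call this with `event.origin` and `event.data`, and emit `dom:media-extracted` only when the result is non-null.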

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

getmaxun/maxun#637 — related axios dependency bump and context for axios usage in the new media parsing code.

Suggested labels

Type: Feature, Scope: Recorder

Suggested reviewers

amhsirak
RohitR311

Poem

I hop through pages, whiskers keen,
sniff srcsets, alt text, every scene.
Axios fetches, Cheerio peers,
Tesseract reads what sight once feared.
A rabbit cheers: media found — hop, click, glean! 🐇📸

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly highlights the primary feature added (image extraction for webpages via media parsing), mirrors the main functional change implemented in the PR, and is clear and concise.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a68e600b688d829467710c1f44b922314a47c79 and 976884473a5098fd133fbef815a0b4fa5713032a.

📒 Files selected for processing (1)

src/components/recorder/DOMBrowserRenderer.tsx (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/components/recorder/DOMBrowserRenderer.tsx

Oct 03 '25 14:10 coderabbitai[bot]

Worked on Issue #164 to add image extraction for webpages. Let me know if any improvements are needed!

Oct 03 '25 14:10 Aman-Raj-bat

Thanks for the feedback! I understand that this PR doesn’t address the feature requested in issue #164. I’ll update it to extract data from web-hosted media files and integrate that directly into the robot recording flow, and I’ll submit an updated PR soon.

Oct 07 '25 17:10 Aman-Raj-bat