[Feature Request] Support actions beyond the current viewport (full-page interaction)
I briefly mentioned this on Discord, but I think it's worth discussing here.
Midscene is a fantastic project - I've been using it with Playwright and Puppeteer inside a Docker container, combined with the Qwen 2.5 VL model. However, there's a limitation that makes it hard to fully use aiAction, aiAssert, and aiQuery: they only work with elements visible in the current viewport.
This is a problem because important elements often sit outside the visible area - at the bottom of the page or just below the fold. As a result, it's hard to build a reliable autonomous AI agent on top of it.
I’d love to hear your thoughts. Are there any plans to support full-page interactions, or is the focus mainly on using Midscene with the browser extension?
As a temporary workaround, I tried running aiAction in a loop: if it failed, I'd scroll down by one viewport height and retry. This worked on most pages, as long as the prompt described only a single action. Apart from that "single action" limitation, it's not a robust or clean solution, and it's not cost-effective for long pages due to LLM token usage.
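The workaround loop can be sketched roughly as below. The `act` and `scrollDown` callbacks are injected so the retry logic stands alone; `actWithScrollRetry` and the `maxScreens` parameter are names I made up for illustration, and the Midscene/Puppeteer wiring in the usage comment follows their docs but isn't verified here:

```typescript
// Sketch of the scroll-and-retry workaround (hypothetical helper, not Midscene API).
// `act` runs one aiAction attempt; `scrollDown` scrolls one viewport height.
async function actWithScrollRetry(
  act: () => Promise<void>,
  scrollDown: () => Promise<void>,
  maxScreens = 5,
): Promise<boolean> {
  for (let i = 0; i < maxScreens; i++) {
    try {
      await act();        // try the action against the current viewport
      return true;        // succeeded somewhere on the page
    } catch {
      await scrollDown(); // target likely below the fold: scroll one screen and retry
    }
  }
  return false;           // gave up after maxScreens viewports
}

// Possible wiring with Puppeteer + Midscene (assumed, per their docs):
// const agent = new PuppeteerAgent(page);
// const ok = await actWithScrollRetry(
//   () => agent.aiAction('click the "Load more" button'),
//   () => page.evaluate(() => window.scrollBy(0, window.innerHeight)),
// );
```

The downside mentioned above is visible in the structure: every failed attempt still costs a full LLM round trip, so long pages multiply token usage.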
Here's the full Discord message for reference:
Yeah, so maybe I'll start with a quick introduction. 🙂 I'm the author of monity.ai, an app for tracking website changes.
It works like this: The user enters a page URL, and the app takes a full-page screenshot with a DOM overlay, allowing users to define actions like clicks to interact with the webpage. Based on the configuration, it tracks specific areas of the page, compares them periodically, and sends alerts if changes are detected.
There's a blog on my website with more details. It shares some similarities with Midscene, but for actions/automation, I currently rely more on CSS selectors and XPaths.
I came across Midscene.js and really like its API and design—great job! 😊 I'm currently working on something very similar. Once the user gets a full-page preview, they can define various actions using prompts, like:
- Filter articles by "newest"
- Get the latest 10 articles
- Notify when a new article is added today

Once I found Midscene, I realized that these are pretty much the equivalents of aiActions, aiAsserts, and aiQueries.
Interacting only with elements in the viewport is not ideal because important data or elements to interact with are often outside of it. Since I have a full-page screenshot, I resize it and send it in "slices" to a VL model, which kind of works for data extraction - but it's not very cost-effective at the moment. I might rely more on the actual HTML/DOM instead.
Anyone have any thoughts on this? I'm a bit surprised that there doesn't seem to be much demand for this feature. I'm open to collaborating and contributing to the project if there's interest.
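The slicing approach described in the message above boils down to computing viewport-sized strips over the full page height. A minimal sketch of just the offset math (the cropping, encoding, and VL-model calls are omitted; `computeSlices` and the `overlap` parameter, added to avoid cutting elements in half at slice boundaries, are my own illustration):

```typescript
// Hypothetical helper: split a full-page screenshot height into strips.
interface Slice {
  top: number;    // y offset of the strip in the full-page image
  height: number; // strip height (last strip may be shorter)
}

function computeSlices(pageHeight: number, sliceHeight: number, overlap = 0): Slice[] {
  const slices: Slice[] = [];
  const step = sliceHeight - overlap; // advance less than a full slice if overlapping
  for (let top = 0; top < pageHeight; top += step) {
    const height = Math.min(sliceHeight, pageHeight - top);
    slices.push({ top, height });
    if (top + height >= pageHeight) break; // bottom of the page reached
  }
  return slices;
}
```

Each strip would then be cropped from the screenshot and sent to the VL model separately, which is exactly why the approach gets expensive: token cost scales linearly with page height.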
Now Midscene can see the DOM outside of viewport: https://midscenejs.com/changelog-0.17.4.html