
[FEATURE] Web scrapers - ignore / remove some elements, or add a webpage content transformer

Open bendadaniel opened this issue 1 year ago • 4 comments

Hello, I have a Flowise workflow to scrape our entire website (150+ pages) and then save it to Pinecone. We are currently using the Cheerio Web Scraper node (it could be Puppeteer or Playwright - it doesn't matter). We set 'Selector (CSS)' to 'main' to ignore the header/footer of each page and scrape only valid data.

Problem: When I look at the data in Pinecone, I can see a lot of invalid/unwanted text.

For example:

  • it scraped the 'video' HTML tag and saved the fallback text "Your browser does not support video"
  • it scraped a CTA block that appears on all pages - that text has zero retrieval value and duplicates the same data across every page
  • it scraped the 'last articles' block - also duplicated, and it makes no sense next to the other information on the page
  • etc.

I have tried: extending 'Selector (CSS)' to something like "main > *:not(.headbar__video), main > *:not(.headbar__video) *", but this doesn't work.

Question: Do you have any idea whether it is possible to exclude some HTML elements from the scrape, or to transform the page's result before saving? I don't think there is a way to do this in Flowise right now.
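For illustration, excluding elements before text extraction could look like the sketch below. It uses naive regexes so it stays self-contained; a real implementation should use an HTML parser instead (e.g. Cheerio's `$('video').remove()` followed by `$('main').text()`). The function name and the sample HTML are made up for the example:

```javascript
// Remove entire <video> elements (including their fallback text), then strip
// the remaining tags to get plain text. Regex-based for illustration only;
// use a real HTML parser such as Cheerio in production.
function htmlToCleanText(html) {
  const withoutVideo = html.replace(/<video\b[\s\S]*?<\/video>/gi, "");
  return withoutVideo
    .replace(/<[^>]+>/g, " ") // drop remaining tags
    .replace(/\s+/g, " ")     // collapse whitespace
    .trim();
}

const page = `<main><h1>Pricing</h1>
  <video src="intro.mp4">Your browser does not support video</video>
  <p>Plans start at $9.</p></main>`;
console.log(htmlToCleanText(page));
// → "Pricing Plans start at $9."
```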

Idea: I can see that Puppeteer and Playwright expose an "evaluate" function in their code, where we could theoretically transform the result of a scraped page (see the evaluate function in the Puppeteer code on GitHub). So maybe Flowise could add an optional input to the Puppeteer node that accepts, for example, a 'Custom JS Function' node: evaluate would pass the data to that node, the function would transform the data however the user wants, and then return the updated data. Or something completely different - this was just my idea.
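A transform like the one described above could also run after scraping, on the collected text. The sketch below assumes the scraper yields one plain-text document per page, with blocks separated by blank lines; blocks that appear on most pages (CTAs, "last articles" widgets) are dropped as boilerplate. The function name and threshold are illustrative, not part of any Flowise API:

```javascript
// Drop text blocks that repeat across more than `threshold` of all pages,
// since site-wide CTA/widget text has no retrieval value.
function dedupeSharedBlocks(pages, threshold = 0.5) {
  const counts = new Map();
  for (const page of pages) {
    // Count each block once per page.
    const blocks = new Set(page.split(/\n{2,}/).map(b => b.trim()).filter(Boolean));
    for (const block of blocks) {
      counts.set(block, (counts.get(block) || 0) + 1);
    }
  }
  const cutoff = pages.length * threshold;
  return pages.map(page =>
    page.split(/\n{2,}/)
        .map(b => b.trim())
        .filter(b => b && counts.get(b) <= cutoff)
        .join("\n\n")
  );
}

const pages = [
  "About us\n\nSign up for our newsletter!",
  "Pricing details\n\nSign up for our newsletter!",
  "Contact form\n\nSign up for our newsletter!",
];
console.log(dedupeSharedBlocks(pages));
// → ["About us", "Pricing details", "Contact form"]
```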

I think this would be a good feature, because it is important to have a good retrieval dataset without garbage data.

What do you think about it? Thanks, Daniel

For example: [Screenshot 2024-05-05 at 22 00 01]

bendadaniel avatar May 05 '24 20:05 bendadaniel

I have noticed the same with the Cheerio node today. It did not work as expected.

For your website of 150 pages I would recommend using Apify and, there, the actor called Website Content Crawler. It is free to set up and you get $5 of scraping credit every month. This scraper provides really clean text from websites, and the dataset can be filtered down and exported as JSON, XML, CSV, etc.


Here is an example: dataset_website-content-crawler_2024-05-06_01-56-33-249.json

toi500 avatar May 06 '24 01:05 toi500

We have added a new Custom JS Loader, so users can perform custom operations on their data. You can also get better visibility of the chunks by doing it in the new Document Store feature.

HenryHengZJ avatar May 08 '24 22:05 HenryHengZJ
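A custom loader along the lines Henry describes might look like the sketch below. It assumes the loader is expected to return an array of `{ pageContent, metadata }` objects (the LangChain Document shape); the HTML cleanup is a naive regex stand-in for a real parser, and the input format (a URL-to-HTML map) is made up for the example:

```javascript
// Turn pre-fetched HTML pages into LangChain-style Document objects,
// stripping <video> fallback text along the way. Regex cleanup is for
// illustration; use a real HTML parser in production.
function toDocuments(htmlByUrl) {
  return Object.entries(htmlByUrl).map(([url, html]) => ({
    pageContent: html
      .replace(/<video\b[\s\S]*?<\/video>/gi, "") // drop video fallback text
      .replace(/<[^>]+>/g, " ")                   // strip remaining tags
      .replace(/\s+/g, " ")
      .trim(),
    metadata: { source: url },
  }));
}

const docs = toDocuments({
  "https://example.com/":
    "<main><p>Hello</p><video>Your browser does not support video</video></main>",
});
console.log(docs);
// → [{ pageContent: "Hello", metadata: { source: "https://example.com/" } }]
```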

> We have added a new Custom JS Loader, so users can perform custom operations on their data. You can also get better visibility of the chunks by doing it in the new Document Store feature.

You mean the 'Custom Document Loader' node, right?

bendadaniel avatar May 09 '24 06:05 bendadaniel

> We have added a new Custom JS Loader, so users can perform custom operations on their data. You can also get better visibility of the chunks by doing it in the new Document Store feature.

> You mean the 'Custom Document Loader' node, right?

But today I can't use any of the Text Splitters with the Custom Document Loader 😟

Giusti10 avatar Aug 25 '24 06:08 Giusti10