
feat: Integrate WaterCrawl.dev as a new knowledge base provider

Open amirasaran opened this pull request 10 months ago • 3 comments

Summary

Add WaterCrawl.dev as an alternative provider for website crawling in datasets/knowledge base alongside Firecrawl and Jina Reader.

This integration enhances the data source options for knowledge bases, allowing users to configure and use WaterCrawl for their website content extraction needs.

Resolves #15950
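For context, a crawl with this provider starts with a POST to WaterCrawl's crawl-requests endpoint (the URL visible in the traceback later in this thread). The sketch below is illustrative only, not the PR's actual code: the X-API-Key header name, the payload shape, and the option names (spider_options, max_depth, page_limit) are assumptions and may differ from the real API.

```python
import json
import urllib.request

# Endpoint seen in the error traceback later in this thread.
WATERCRAWL_ENDPOINT = "https://app.watercrawl.dev/api/v1/core/crawl-requests/"


def build_crawl_payload(url: str, max_depth: int = 1, page_limit: int = 10) -> dict:
    """Build the JSON body for a crawl request (field names are hypothetical)."""
    return {
        "url": url,
        "options": {"spider_options": {"max_depth": max_depth, "page_limit": page_limit}},
    }


def start_crawl(api_key: str, url: str, **options) -> dict:
    """Submit a crawl request and return the created crawl-request object."""
    req = urllib.request.Request(
        WATERCRAWL_ENDPOINT,
        data=json.dumps(build_crawl_payload(url, **options)).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )
    # A 403 here can mean a plan/limit problem, as discussed later in this thread.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```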

Screenshots

configuration, configuration-modal, configuration-success, crawl-process, crawl-result-preview, importing, sync, crawl-config

Checklist

[!IMPORTANT]
Please review the checklist below before submitting your pull request.

  • [x] This change requires a documentation update, included: Dify Document
  • [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • [x] I've updated the documentation accordingly.
  • [x] I ran dev/reformat (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

amirasaran avatar Mar 20 '25 23:03 amirasaran

It is great to have different options for website crawling. Competition brings prices down and raises quality.

alexmofidi avatar Mar 21 '25 00:03 alexmofidi

Hello, I have fixed the lint by running ./dev/reformat. Can you fix the errors in the tests?

crazywoola avatar Mar 24 '25 10:03 crazywoola

> Hello, I have fixed the lint by running ./dev/reformat. Can you fix the errors in the tests?

Hey @crazywoola, thank you for the update. Yes, I will do it.

amirasaran avatar Mar 24 '25 10:03 amirasaran

Hey @crazywoola, @JohnJyong Any updates?

amirasaran avatar Apr 04 '25 05:04 amirasaran

@amirasaran Hi, I have a problem with crawling the pages. Could you check it out?

https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/403


Traceback (most recent call last):
  File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask_restful/__init__.py", line 489, in wrapper
    resp = resource(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/views.py", line 110, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)  # type: ignore[no-any-return]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask_restful/__init__.py", line 604, in dispatch_request
    resp = meth(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/controllers/console/wraps.py", line 198, in decorated
    return view(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/libs/login.py", line 94, in decorated_view
    return current_app.ensure_sync(func)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/controllers/console/wraps.py", line 30, in decorated
    return view(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/crazywoola/Program/dify/api/controllers/console/datasets/website.py", line 32, in post
    raise WebsiteCrawlError(str(e))
controllers.console.datasets.error.WebsiteCrawlError: 500 Internal Server Error: 403 Client Error: Forbidden for url: https://app.watercrawl.dev/api/v1/core/crawl-requests/

It turns out I hadn't activated a plan. I think we can refine the error message later.

crazywoola avatar Apr 16 '25 03:04 crazywoola

@crazywoola It is a 403 error, which means it is a permission error.

I see the problem: I think it is related to the plan you have activated in WaterCrawl. If you are using the free plan, you may run into one of the following issues:

  1. Daily limit exceeded
  2. Monthly limit exceeded
  3. Max depth greater than 2
  4. A crawl is already running (the free plan allows only 1 concurrent crawl)

Plans in WaterCrawl

These are the limitations of the free plan:

  • 1,000 page credits
  • 100 daily page credits
  • 1 seat (team collaboration)
  • Max depth: 2
  • Max page limit: 50
  • Max concurrent crawls: 1

If you need more than that, you have to upgrade your plan or use the self-hosted version.


That said, I will prepare a fix to handle such errors better and present clearer messages to the user.
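As a rough illustration of such a fix (not the actual implementation in this PR), the controller could translate the WaterCrawl API's HTTP status codes into clearer messages before raising WebsiteCrawlError. The status codes covered and the wording below are assumptions.

```python
# Hypothetical status-code-to-message table; a real fix might inspect the
# API's error response body instead of the status code alone.
FRIENDLY_MESSAGES = {
    401: "Invalid WaterCrawl API key. Please check your credentials.",
    403: (
        "WaterCrawl rejected the request. On the free plan this usually means "
        "a daily or monthly credit limit was reached, max depth exceeds 2, or "
        "another crawl is already running."
    ),
    429: "Too many requests to WaterCrawl. Please wait and try again.",
}


def friendly_crawl_error(status_code: int, detail: str = "") -> str:
    """Map an HTTP status from the WaterCrawl API to a user-facing message."""
    message = FRIENDLY_MESSAGES.get(status_code)
    if message is None:
        # Fall back to a generic message for codes we don't special-case.
        message = f"Website crawl failed ({status_code})"
        if detail:
            message += f": {detail}"
    return message
```

With something like this, the 403 in the traceback above would surface as a plan-limit hint instead of a raw "Forbidden" error.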

amirasaran avatar Apr 16 '25 05:04 amirasaran