feat: Integrate WaterCrawl.dev as a new knowledge base provider
Summary
Add WaterCrawl.dev as an alternative provider for website crawling in datasets/knowledge base alongside Firecrawl and Jina Reader.
This integration enhances the data source options for knowledge bases, allowing users to configure and use WaterCrawl for their website content extraction needs.
Resolves #15950
Screenshots
Checklist
> [!IMPORTANT]
> Please review the checklist below before submitting your pull request.
- [x] This change requires a documentation update, included: Dify Document
- [x] I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
- [x] I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
- [x] I've updated the documentation accordingly.
- [x] I ran `dev/reformat` (backend) and `cd web && npx lint-staged` (frontend) to appease the lint gods
It is great to have different options for website crawling. Competition brings prices down and raises quality.
Hello, I have fixed the lint by running `./dev/reformat`. Can you fix the errors in the tests?
Hey @crazywoola, Thank you for the update. Yes, I will do it.
Hey @crazywoola, @JohnJyong, any updates?
@amirasaran Hi, I have a problem with crawling the pages. Could you check it out?
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/403
Traceback (most recent call last):
File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/app.py", line 917, in full_dispatch_request
rv = self.dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask_restful/__init__.py", line 489, in wrapper
resp = resource(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask/views.py", line 110, in view
return current_app.ensure_sync(self.dispatch_request)(**kwargs) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/.venv/lib/python3.12/site-packages/flask_restful/__init__.py", line 604, in dispatch_request
resp = meth(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/controllers/console/wraps.py", line 198, in decorated
return view(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/libs/login.py", line 94, in decorated_view
return current_app.ensure_sync(func)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/controllers/console/wraps.py", line 30, in decorated
return view(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/crazywoola/Program/dify/api/controllers/console/datasets/website.py", line 32, in post
raise WebsiteCrawlError(str(e))
controllers.console.datasets.error.WebsiteCrawlError: 500 Internal Server Error: 403 Client Error: Forbidden for url: https://app.watercrawl.dev/api/v1/core/crawl-requests/
It turns out I hadn't activated the plan. I think we can refine the error message later.
@crazywoola It is a 403 error, which means it is a permission error.
I found the problem. I think it is related to the plan you activated in WaterCrawl. If you are using the free plan, you may run into the following issues:
- Daily limit exceeded
- Monthly limit exceeded
- Max depth is more than 2
- You already have 1 crawl running at the moment
These are the limitations of the free plan:
- 1,000 page credit
- 100 daily page credit
- 1 Seat (Team collaboration)
- Max depth: 2
- Max page limit: 50
- Max Concurrent Crawls: 1
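The limits listed above could be checked on the client side before a crawl request is sent, so the user sees a concrete reason instead of a bare 403. The following is only a minimal sketch under stated assumptions: `PlanLimits` and `find_limit_violations` are hypothetical names, not part of Dify or the WaterCrawl client, and the default values are simply the free-plan numbers quoted in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimits:
    """Hypothetical container for plan limits; defaults are the free-plan values quoted above."""
    max_depth: int = 2
    max_page_limit: int = 50
    max_concurrent_crawls: int = 1


def find_limit_violations(depth, page_limit, running_crawls, limits=PlanLimits()):
    """Return human-readable reasons why a crawl request would exceed the given limits."""
    problems = []
    if depth > limits.max_depth:
        problems.append(
            f"Requested depth {depth} exceeds the plan maximum of {limits.max_depth}."
        )
    if page_limit > limits.max_page_limit:
        problems.append(
            f"Requested page limit {page_limit} exceeds the plan maximum of {limits.max_page_limit}."
        )
    # A new crawl is one more than what is already running.
    if running_crawls >= limits.max_concurrent_crawls:
        problems.append(
            f"{running_crawls} crawl(s) already running; the plan allows "
            f"{limits.max_concurrent_crawls} at a time."
        )
    return problems
```

An empty list means none of these three limits is exceeded. Daily and monthly credit usage is not covered here, because it is only known on the WaterCrawl side.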
If you want to continue or need more, you have to upgrade your plan or use the self-hosted version.
However, I will prepare a fix to handle such errors better and present the user with a clearer message.
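As a rough idea of what such a fix could look like (this is not the actual change in this PR), the status code returned by the provider could be translated into a hint before it is wrapped in `WebsiteCrawlError`. `describe_crawl_error` is a hypothetical helper, and the 403 hint only restates the causes discussed in this thread.

```python
def describe_crawl_error(status_code: int, detail: str = "") -> str:
    """Map a provider HTTP status code to a friendlier message (hypothetical helper)."""
    hints = {
        # Causes taken from the discussion above, not from WaterCrawl documentation.
        403: (
            "The crawl request was refused (403 Forbidden). The plan may be inactive, "
            "or a plan limit (daily or monthly credits, max depth, concurrent crawls) "
            "may have been reached."
        ),
    }
    message = hints.get(status_code, f"The crawl provider returned HTTP {status_code}.")
    # Keep the original detail so nothing is hidden from whoever debugs the issue.
    return f"{message} Details: {detail}" if detail else message
```

For example, `describe_crawl_error(403, "Forbidden for url: ...")` keeps the raw detail at the end while leading with the likely cause.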