Add compatibility so that both URL path types are supported (absolute and relative)
Hello,
I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.
In fact, I have many instances of my documentation website hosted in different places.
That means that I have to run the docs-scraper for each site (on each repository update).
I wish I could run only one scraper for all my sites and be independent of where each documentation site is hosted. So my question is:
- Is it possible to replace absolute URLs with relative URLs in docs-scraper?
I guess I could do that by overriding some of the logic in the scraper's source code. But is there another way? (Maybe someone else has already thought about or discussed this.)
Thanks in advance!!
Hello @suppadeliux!
Does the start_urls option work for your use case?
https://github.com/meilisearch/docs-scraper#start_urls
You should be able to define the absolute URLs in the array like this:
{
"start_urls": ["https://www.mysite1.com", "https://www.mysite2.com"]
}
Sorry if I misunderstood your issue.
PS: I transferred your issue to the docs-scraper repo.
For more documentation, you can check out the README of this repo (docs-scraper).
Hello @curquiza, and thanks for taking the time to answer my question.
I have many docs-scraper.config files, each one containing the URL of one documentation website. Each time I run the scraper, I run it for each site.
What I wish I could do is run the scraper only once and have relative URLs in my index instead of absolute ones (like in the Meilisearch doc site).

So when I get the response from my Meilisearch instance, I will only have relative paths (e.g. /getting-started/introduction or /about-us), and I can redirect the user to the right result using just the relative URLs. This way, the search API response on each of my documentation websites doesn't contain the raw URL of another site.
I hope it clears it up a little bit :+1:
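A minimal sketch of the idea described above, assuming the index stores only relative paths and each site knows its own base URL. The helper name `resolve_hit_url` is hypothetical and is not part of docs-scraper or docs-searchbar; the base URLs are the example ones used earlier in this thread.

```python
from urllib.parse import urljoin


def resolve_hit_url(site_base: str, hit_url: str) -> str:
    """Resolve a relative URL from a search hit against the site being browsed."""
    return urljoin(site_base, hit_url)


# The same indexed path resolves to a different absolute URL on each site.
for base in ("https://www.mysite1.com", "https://www.mysite2.com"):
    print(resolve_hit_url(base, "/getting-started/introduction"))
```

In a browser, a link with a relative href is resolved against the current page automatically, so this step would mostly matter for code that builds absolute links itself.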
I understand now. Unfortunately, if I'm not mistaken, there is no way to change the url field...
Once your documents are added to MeiliSearch, what you can do is update all the url fields in your documents:
- you get all of them (browsing them using offset) with this route: https://docs.meilisearch.com/reference/api/documents.html#get-documents
- you update them with this route: https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents
We have clients for many languages that you can use to update your documents: https://github.com/meilisearch/integration-guides#-sdks-for-meilisearch-api
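The two steps above could be sketched roughly as follows. This is not an official recipe: `fetch_page` and `push_documents` are hypothetical callables that you would implement with the SDK of your choice (one wrapping the get-documents route with an offset and a limit, the other wrapping the add-or-update route), and the sketch assumes each document stores its link in a `url` field.

```python
from typing import Callable, Dict, List
from urllib.parse import urlsplit, urlunsplit

Document = Dict[str, object]


def to_relative(url: str) -> str:
    """Drop the scheme and host, keeping path, query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))


def relativize_all(
    fetch_page: Callable[[int, int], List[Document]],  # hypothetical: (offset, limit) -> documents
    push_documents: Callable[[List[Document]], None],  # hypothetical: add-or-update wrapper
    page_size: int = 100,
) -> int:
    """Browse all documents page by page and push them back with relative URLs."""
    offset = 0
    updated = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        # Copy each document so the whole record is sent back, only the url differs.
        changed = [dict(doc, url=to_relative(str(doc["url"]))) for doc in page if "url" in doc]
        if changed:
            push_documents(changed)
            updated += len(changed)
        offset += len(page)
    return updated
```

Because full documents are sent back with their primary key unchanged, the update replaces existing records instead of creating new ones. This would have to be rerun after every scrape, since the scraper writes absolute URLs again.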
We need a PR that adds compatibility with both path techniques. Thanks for raising this 🔥 Feel free to implement it; otherwise we will wait for a contributor to do so.
@suppadeliux I had this issue as well on my end and ended up writing a really hacky patch that simply makes all URLs relative.
This works for my very narrow use case and will very likely break for yours, but, on the off chance this patch can help you, here it is:
diff --git a/scraper/src/documentation_spider.py b/scraper/src/documentation_spider.py
index 88bd125..704b13d 100644
--- a/scraper/src/documentation_spider.py
+++ b/scraper/src/documentation_spider.py
@@ -13,6 +13,8 @@ import os
 # End of import for the sitemap behavior
+from urllib.parse import urlparse
+
 from scrapy.spidermiddlewares.httperror import HttpError
 from scrapy.exceptions import CloseSpider
@@ -148,6 +150,11 @@ class DocumentationSpider(CrawlSpider, SitemapSpider):
         return super()._parse(response, **kwargs)
     def add_records(self, response, from_sitemap):
+
+        parsedURL = urlparse(response.url)
+        response = response.replace(url=parsedURL._replace(scheme="",netloc=None).geturl())
+        print("Changed {} to relative URL {}".format(parsedURL.geturl(), response.url))
+
         records = self.strategy.get_records_from_response(response)
         self.meilisearch_helper.add_records(records, response.url, from_sitemap)
Does it still work with absolute paths? It could be an acceptable solution. If you don't have time to try it out, no problem :)
@bidoubiwa This code always changes the URL from absolute to relative. It would need to be adapted to provide the option to enable/disable this feature.
Unfortunately, the docs-scraper codebase is a little hard to follow and my Python skills are lacking, so I can't really provide a better solution.
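One way the enable/disable behaviour could look, as a rough sketch only: the `relative_urls` flag and the `record_url` helper are hypothetical and do not exist in docs-scraper. The URL rewriting reuses the `urlparse` approach from the patch above, and absolute URLs stay the default so existing setups are unaffected.

```python
from urllib.parse import urlparse


def record_url(url: str, relative_urls: bool = False) -> str:
    """Return the URL to store in the index.

    relative_urls is a hypothetical opt-in flag; when it is False the
    URL is returned untouched, which keeps the current absolute behaviour.
    """
    if not relative_urls:
        return url
    # Strip scheme and host, keep path, query string and fragment.
    relative = urlparse(url)._replace(scheme="", netloc="").geturl()
    # A bare host such as https://www.mysite1.com has an empty path.
    return relative or "/"


print(record_url("https://www.mysite1.com/about-us"))
print(record_url("https://www.mysite1.com/about-us", relative_urls=True))
```

In the patch, a call like this would replace the unconditional rewrite in add_records, with the flag read from the scraper config file.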
As this repo is now low-maintenance, this PR is no longer relevant today. I'm closing all issues that are not bugs.