docs-scraper Add compatibility so that both URL path types are supported (absolute and relative)

Hello,

I am currently working on a documentation website generated with Jekyll, and Meilisearch has been pretty easy to add, with the docs-scraper and the docs-searchbar.

In fact, I have many instances of my documentation website hosted in different places.

That means I have to run docs-scraper once for each site (on every repository update).

I wish I could run the scraper only once for all my sites, independently of where each documentation site is hosted. So my question is:

Is it possible to replace absolute URLs with relative URLs in docs-scraper?

I guess I could do that by overriding some of the logic in the scraper's source code. But is there another way? (Maybe someone else has already thought about or discussed this.)

Thanks in advance!!

Mar 15 '21 10:03 suppadeliux

Hello @suppadeliux!

Does the start_urls option work for your use case? https://github.com/meilisearch/docs-scraper#start_urls You should be able to define the absolute URLs in the array like this:

{
  "start_urls": ["https://www.mysite1.com", "https://www.mysite2.com"]
}

Sorry if I misunderstood your issue.

PS: I transferred your issue to the docs-scraper repo.

Mar 16 '21 14:03 curquiza

For more documentation, you can check out the README of this repo (docs-scraper).

Mar 16 '21 14:03 curquiza

Hello @curquiza, and thanks for taking the time to answer my question.

I have many docs-scraper.config files, each one containing the URL of one documentation website. Each time I run the scraper, I run it once per site.

What I wish I could do is run the scraper only once and have relative URLs in my index instead of absolute ones (like in the Meilisearch doc site).

So when I get the response from my Meilisearch instance, I will only have relative paths (e.g. /getting-started/introduction or /about-us), and I can redirect the user to the right result using just those relative URLs. This way, the search API response on each of my documentation websites doesn't contain the raw URL of another site.
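To illustrate the idea, here is a minimal sketch of how one relative path stored in a shared index would resolve against whichever site the user is browsing. The host names are the placeholder ones used earlier in this thread, and `resolve` is a hypothetical helper, not part of docs-scraper or docs-searchbar. In a browser this resolution happens automatically when following a relative link.

```python
from urllib.parse import urljoin

# Placeholder hosts that serve the same documentation (assumption).
HOSTS = ["https://www.mysite1.com", "https://www.mysite2.com"]

# A relative path as it would be stored once in the shared index.
relative_hit = "/getting-started/introduction"


def resolve(host: str, path: str) -> str:
    """Resolve an indexed relative path against the current site."""
    return urljoin(host, path)


# The same index entry leads to a different absolute URL on each host.
resolved = [resolve(host, relative_hit) for host in HOSTS]
```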

I hope it clears it up a little bit :+1:

Mar 16 '21 15:03 suppadeliux

I understand now. Unfortunately, if I'm not wrong, there is no way to change the url field... Once your documents are added to MeiliSearch, what you can do is update all the url fields in your documents:

you get all of them (browsing them using offset) with this route: https://docs.meilisearch.com/reference/api/documents.html#get-documents
you update them with this route: https://docs.meilisearch.com/reference/api/documents.html#add-or-update-documents

We have clients for many languages, so you can pick your favorite one here to update your documents: https://github.com/meilisearch/integration-guides#-sdks-for-meilisearch-api
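The two steps above (browse with an offset, then update) could be sketched roughly as follows. This is only an illustration: `fetch_page` and `push` are hypothetical callables that you would implement on top of the two routes or one of the SDKs, since the exact response shape and authentication depend on your MeiliSearch version. The sketch assumes each document is a dict with a `url` field.

```python
from typing import Callable, List
from urllib.parse import urlsplit, urlunsplit


def to_relative(url: str) -> str:
    """Drop the scheme and host, keeping path, query and fragment."""
    parts = urlsplit(url)
    return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))


def relativize_index(
    fetch_page: Callable[[int, int], List[dict]],  # hypothetical: (offset, limit) -> documents
    push: Callable[[List[dict]], None],            # hypothetical: sends updated documents back
    page_size: int = 100,
) -> int:
    """Walk the index page by page and rewrite every url field."""
    offset = 0
    updated = 0
    while True:
        docs = fetch_page(offset, page_size)
        if not docs:
            break  # no more documents to browse
        changed = [{**doc, "url": to_relative(doc["url"])} for doc in docs if "url" in doc]
        if changed:
            push(changed)
            updated += len(changed)
        offset += len(docs)
    return updated
```

Because the documents are sent back with their other fields untouched, the number of documents in the index stays the same while the loop advances the offset.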

Mar 16 '21 15:03 curquiza

We need a PR that adds compatibility with both path techniques. Thanks for raising this 🔥 Feel free to implement it; otherwise we will wait for a contributor to do so.

Sep 29 '21 08:09 bidoubiwa

@suppadeliux I had this issue as well on my end and ended up writing a really hacky patch that simply makes all URLs relative.

This works for my very narrow use case and will very likely break for yours, but, on the off chance this patch can help you, here it is:

diff --git a/scraper/src/documentation_spider.py b/scraper/src/documentation_spider.py
index 88bd125..704b13d 100644
--- a/scraper/src/documentation_spider.py
+++ b/scraper/src/documentation_spider.py
@@ -13,6 +13,8 @@ import os

 # End of import for the sitemap behavior

+from urllib.parse import urlparse
+
 from scrapy.spidermiddlewares.httperror import HttpError

 from scrapy.exceptions import CloseSpider
@@ -148,6 +150,11 @@ class DocumentationSpider(CrawlSpider, SitemapSpider):
         return super()._parse(response, **kwargs)

     def add_records(self, response, from_sitemap):
+
+        parsedURL = urlparse(response.url)
+        response = response.replace(url=parsedURL._replace(scheme="",netloc=None).geturl())
+        print("Changed {} to relative URL {}".format(parsedURL.geturl(), response.url))
+
         records = self.strategy.get_records_from_response(response)
         self.meilisearch_helper.add_records(records, response.url, from_sitemap)

Oct 07 '21 22:10 huguesalary

Does it still work with absolute paths? It could be an acceptable solution. If you don't have time to try it out, no problem :)

Oct 11 '21 10:10 bidoubiwa

@bidoubiwa This code always changes the URL from absolute to relative. It would need to be adapted to provide the option to enable/disable this feature.
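As a rough idea of what such a toggle could look like, here is a minimal standalone sketch. The `relative_urls` flag and the `record_url` helper are hypothetical names, not existing docs-scraper options, and wiring them into the spider and its config is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def record_url(url: str, relative_urls: bool = False) -> str:
    """Return the URL to store in the index.

    relative_urls is a hypothetical flag; when it is off, the URL is
    returned unchanged so existing absolute-URL setups keep working.
    """
    if not relative_urls:
        return url
    parts = urlsplit(url)
    # Keep only path, query and fragment; fall back to "/" for a bare host.
    return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))
```

With the flag left at its default the behavior matches the current scraper, so the change would be opt-in.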

Unfortunately, the docs-scraper codebase is a little hard to follow and my Python skills are lacking, so I can't really provide a better solution.

Oct 15 '21 23:10 huguesalary

As this repo is now low-maintenance, this PR is no longer relevant today. I'm closing all issues that are not bugs.

Sep 06 '23 11:09 alallema