
Rewrite the scraper in JavaScript

Open TheMBeat opened this issue 5 years ago • 6 comments

Is your feature request related to a problem? Please describe. No

Describe the solution you'd like First, test whether the website provides JSON data; if it does not, iterate through a list of site-specific scrapers. If no scraper is registered for the website, fall back to a universal scraper.

Additionally, we could write our own converters in JavaScript for different recipe formats to import them.
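The proposed fallback chain could be sketched roughly as follows. This is only an illustration, not the app's actual code: all function names, the scraper registry, and the JSON-LD extraction approach are assumptions.

```javascript
// Hedged sketch of the proposed scraper fallback chain.
// All names here (extractJsonLdRecipe, siteScrapers, universalScraper,
// scrapeRecipe) are hypothetical, not part of the Cookbook codebase.

// Step 1 helper: pull the first schema.org Recipe out of JSON-LD
// <script> blocks embedded in the page.
function extractJsonLdRecipe(html) {
  const re = /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      const nodes = Array.isArray(data) ? data : [data];
      const recipe = nodes.find((n) => n && n['@type'] === 'Recipe');
      if (recipe) return recipe;
    } catch {
      // Malformed JSON block: skip it and keep scanning.
    }
  }
  return null;
}

// Step 2: registry of site-specific scrapers, keyed by hostname.
const siteScrapers = new Map();

// Step 3: generic last-resort scraper (stubbed out here).
function universalScraper(html) {
  return { '@type': 'Recipe', name: 'unknown' };
}

function scrapeRecipe(url, html) {
  // 1. Prefer structured schema.org data if the site embeds it.
  const recipe = extractJsonLdRecipe(html);
  if (recipe) return recipe;
  // 2. Fall back to a scraper registered for this host.
  const site = siteScrapers.get(new URL(url).hostname);
  if (site) return site(html);
  // 3. Last resort: the universal heuristic scraper.
  return universalScraper(html);
}
```

In this sketch, a site-specific scraper would be registered with `siteScrapers.set('example.com', myScraper)` and is only consulted when no JSON-LD recipe is found.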

A first attempt from my side

TheMBeat avatar Nov 16 '20 07:11 TheMBeat

The problem here is that the "trivial" approach of letting the browser fetch the recipe from the third-party server conflicts with the strict CORS settings of Nextcloud. The Nextcloud server deliberately avoids/forbids loading resources from foreign domains (for good reasons).

Apart from the technical implementation, we should have a clear view of what exactly we are allowing here.

Read also:

christianlupus avatar Nov 16 '20 10:11 christianlupus

For the sake of completeness: the idea of @TheMBeat was to use a library like this one that provides a set of recipe site parsers.

One has to check whether we can use such a library from a licensing perspective. Here are some links I quickly found.

christianlupus avatar Mar 19 '21 16:03 christianlupus

To push the issue again: I thought about it some more and came up with the following. The server fetches the page, tries to extract the JSON, and returns the complete DOM to the client in case of an error. Site-specific parsers can then run on the client and generate the JSON.
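The flow described above could look roughly like this. This is a minimal sketch under stated assumptions: the function name, the injected `serverFetch`/`extractJson` helpers, and the client parser list are all hypothetical, not an existing API of the app.

```javascript
// Hedged sketch of the proposed two-stage import flow (all names are
// hypothetical). The server downloads the page (avoiding CORS), tries to
// extract the schema.org JSON, and on failure hands the raw HTML back so
// client-side, site-specific parsers can take over.
async function importRecipe(url, serverFetch, extractJson, clientParsers) {
  // Stage 1: the server fetches the page and attempts JSON extraction.
  const html = await serverFetch(url);
  const json = extractJson(html);
  if (json) return json;

  // Stage 2: server-side extraction failed, so the full document is
  // returned to the client and each registered parser gets a chance.
  for (const parse of clientParsers) {
    const result = parse(html);
    if (result) return result;
  }
  return null; // no parser could handle this page
}
```

A design question this sketch leaves open is what "in case of an error" means exactly: returning the whole DOM to the client is simple, but it doubles the payload for pages the server could not parse.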

What do you think of this idea?

TheMBeat avatar Apr 28 '22 19:04 TheMBeat

The license problem is not affected by your suggestion at all.

Of course, we could do the parsing on the server and then fall back to some heuristics if no schema.org data was found. This could be done either on the client or on the server.
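One possible heuristic fallback, sketched here purely for illustration, is to scan for schema.org microdata attributes (`itemprop`) when no JSON-LD block exists. The function name and regex-based extraction are assumptions, not anything the app currently does.

```javascript
// Hedged sketch of one heuristic fallback (hypothetical name): look for
// schema.org microdata markup when no JSON-LD recipe was found.
function microdataHeuristic(html) {
  // Grab the text content of the first element carrying itemprop="<prop>".
  // A regex is a crude stand-in for real DOM traversal, used here only to
  // keep the sketch self-contained.
  const grab = (prop) => {
    const m = html.match(new RegExp(`itemprop="${prop}"[^>]*>([^<]*)<`));
    return m ? m[1].trim() : null;
  };
  const name = grab('name');
  if (!name) return null; // without a name there is no usable recipe
  return { '@type': 'Recipe', name, description: grab('description') };
}
```

In a real implementation this would run on a parsed DOM rather than raw HTML, but the idea is the same: microdata is the next-most-structured source after JSON-LD, before resorting to free-text guessing.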

Having the server do the download avoids the CORS problems, but it introduces a new one: we do not have a full-blown browser on the server. The advantage of the client browser is that it executes JavaScript, so we could also handle pages that are altered by JS (often done precisely to prevent scraping). See #431.

I would even consider it better to avoid the back and forth with the server and instead add a new endpoint that serves a dedicated page with CORS relaxed only for that page, allowing downloads via JS. Then everything could be done on the frontend side.

christianlupus avatar Apr 29 '22 17:04 christianlupus

To avoid licensing problems, we should write completely custom scrapers.

I like the idea of the temporary permission for the specific URL.

TheMBeat avatar Apr 29 '22 18:04 TheMBeat

Since it is your idea, I am assigning you as responsible for this issue. We can split this issue into smaller pieces using GitHub task lists. Apart from that, I do not see a good fit for this at the moment. PRs and solutions are welcome.

christianlupus avatar Jun 25 '22 09:06 christianlupus