HTTP API for Spider

Open • Ehsan-U opened this issue 1 year ago • 1 comment

Scrapy can expose spiders over HTTP through a third-party library called ScrapyRT. You send ScrapyRT a request with the spider name and a URL, and the response contains the items the spider collected from that URL.

It would be great if Crawlee could provide similar functionality out of the box. It's like turning a website into a real-time API.

Ehsan-U avatar Jul 14 '24 04:07 Ehsan-U

I think it's pretty straightforward to expose the crawler.run method using FastAPI, for instance. The following snippet is without any guarantees, but it shouldn't be too incorrect :smile:

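from typing import Any

from fastapi import FastAPI
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# A single crawler instance shared by all incoming API requests.
crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Placeholder extraction: store whatever the endpoint should return.
    # Replace this with real scraping logic for the target site.
    await context.push_data({"url": context.request.url})

app = FastAPI()

@app.post("/crawl")
async def crawl(url: str) -> Any:
    # `url` is a plain str parameter, so FastAPI reads it from the query string.
    await crawler.run([url])
    # Return the items stored by the handler.
    return await crawler.get_data()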

Note that this has to be run as a FastAPI app, e.g. with fastapi run.
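For completeness, here is a rough sketch of how a client might call that endpoint, using only the standard library. It assumes the app is served at http://localhost:8000 (the default for fastapi run, adjust as needed) and that the endpoint returns JSON. The helper names build_crawl_request and fetch_items are made up for this illustration and are not part of Crawlee or FastAPI.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the FastAPI app from the snippet above is listening here.
DEFAULT_API_BASE = "http://localhost:8000"


def build_crawl_request(target_url: str, api_base: str = DEFAULT_API_BASE) -> Request:
    """Build a POST request for /crawl with the target URL as a query parameter."""
    # The endpoint declares `url: str`, which FastAPI treats as a query
    # parameter, so the target URL is percent-encoded into the query string.
    query = urlencode({"url": target_url})
    return Request(f"{api_base}/crawl?{query}", method="POST")


def fetch_items(target_url: str, api_base: str = DEFAULT_API_BASE) -> Any:
    """Send the crawl request and decode the JSON body of the response."""
    # No timeout is set here; the call blocks until the crawl has finished.
    with urlopen(build_crawl_request(target_url, api_base)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, fetch_items("https://example.com") should return whatever the /crawl endpoint serializes from crawler.get_data().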

It is possible that we will implement something even more streamlined in the future though, so feel free to throw ideas here.

janbuchar avatar Jul 16 '24 09:07 janbuchar