scrapy-playwright

XML gets rendered as HTML

Open galloj opened this issue 1 year ago • 2 comments

I am trying to use SitemapSpider with Playwright, but it doesn't work because XML gets transformed into HTML (webkit-xml-viewer). Is there any way to prevent this from happening? I cannot disable Playwright for the sitemap request as it is protected by Cloudflare.

galloj, Feb 11 '25 13:02

It might be possible to do better response detection somewhere around this line. I'll look into it, thanks for the report.

elacuesta, Feb 19 '25 15:02

The response is correctly being created as an XmlResponse; you can check that by overriding the upstream _parse_sitemap method. This doesn't seem to be a problem for Firefox: with PLAYWRIGHT_BROWSER_TYPE = "firefox" the spider works correctly. If you want or need to stick to Chromium, you could override the upstream _get_sitemap_body method to remove the outer part of the page content (keep in mind that you'd be overriding a private method).
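
For the Firefox route, a minimal settings sketch could look like the following. PLAYWRIGHT_BROWSER_TYPE is the setting named above; the other two entries are copied from the spider below. The dict name FIREFOX_SETTINGS is hypothetical and only groups the values for illustration. This assumes Firefox has been installed for Playwright (for example with `playwright install firefox`).

```python
# Hypothetical name: merge these entries into settings.py or into a
# spider's custom_settings.
FIREFOX_SETTINGS = {
    # scrapy-playwright launches Chromium unless told otherwise.
    "PLAYWRIGHT_BROWSER_TYPE": "firefox",
    "DOWNLOAD_HANDLERS": {
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
```

The spider below takes the other route and stays on Chromium.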

from scrapy.spiders import SitemapSpider
from scrapy.http import Response


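class PlaywrightDownloaderMiddleware:
    def process_request(self, request, spider):
        # Route every request (sitemap requests included) through Playwright,
        # unless the request already set the "playwright" meta key itself.
        request.meta.setdefault("playwright", True)
        return None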


class TestSpider(SitemapSpider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:142.0) Gecko/20100101 Firefox/142.0",
        "DOWNLOADER_MIDDLEWARES": {PlaywrightDownloaderMiddleware: 543},
    }
    sitemap_urls = ["https://.../sitemap.xml"]

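    def _get_sitemap_body(self, response: Response) -> bytes | None:
        # Chromium wraps the XML in its viewer page; the original document is
        # the child of the element with id "webkit-xml-viewer-source-xml".
        body_str = response.css("#webkit-xml-viewer-source-xml > *").get()
        # Without the wrapper, fall back to the upstream implementation.
        return body_str.encode("utf-8") if body_str else super()._get_sitemap_body(response)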

    def _parse_sitemap(self, response: Response):
        # Debug aid: shows that the response arrives as an XmlResponse.
        print("Sitemap response type:", type(response))
        return super()._parse_sitemap(response)

    def parse(self, response: Response):
        return {"url": response.url, "status": response.status}
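
The unwrapping step in _get_sitemap_body can also be tried without a browser. The following is a rough standard-library sketch, not what Scrapy does internally: the WRAPPED sample is a hypothetical, heavily simplified stand-in for Chromium's viewer page (the real page carries styles and extra markup and may not parse as strict XML), the example.com URLs are made up, and the helper names unwrap_xml_viewer and sitemap_locs are invented for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIEWER_SOURCE_ID = "webkit-xml-viewer-source-xml"

# Hypothetical, simplified stand-in for the page Chromium builds around XML.
WRAPPED = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head/><body>'
    b'<div id="webkit-xml-viewer-source-xml">'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/a</loc></url>"
    b"<url><loc>https://example.com/b</loc></url>"
    b"</urlset></div>"
    b'<div class="pretty-print">rendered tree goes here</div>'
    b"</body></html>"
)


def unwrap_xml_viewer(body: bytes) -> bytes:
    """Return the XML held inside the viewer wrapper, or body unchanged."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML: leave the body alone.
        return body
    for elem in root.iter():
        # The source container holds the original document as its first child.
        if elem.get("id") == VIEWER_SOURCE_ID and len(elem):
            return ET.tostring(elem[0], encoding="utf-8")
    return body


def sitemap_locs(xml_body: bytes) -> list[str]:
    """Collect the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_body)
    return [(loc.text or "").strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


if __name__ == "__main__":
    print(sitemap_locs(unwrap_xml_viewer(WRAPPED)))
```

A body that does not contain the wrapper element is returned untouched, which mirrors the fallback to super()._get_sitemap_body in the spider above.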

elacuesta, Aug 03 '25 21:08