XML gets rendered as HTML
I am trying to use SitemapSpider with Playwright, but it doesn't work because the XML gets transformed into HTML by Chromium's built-in viewer (webkit-xml-viewer). Is there any way to prevent this from happening? I cannot disable Playwright for the sitemap request, as the site is protected by Cloudflare.
It might be possible to do better response detection somewhere around this line. I'll look into it, thanks for the report.
The response is correctly being created as an XmlResponse; you can verify that by overriding the upstream _parse_sitemap method.
This doesn't seem to be a problem for Firefox: with PLAYWRIGHT_BROWSER_TYPE = "firefox" the spider works correctly. If you want or need to stick with Chromium, you could override the upstream _get_sitemap_body method to strip the outer part of the page content (keep in mind that you'd be overriding a private method).
from scrapy.spiders import SitemapSpider
from scrapy.http import Response


class PlaywrightDownloaderMiddleware:
    """Route every request through Playwright by setting the meta flag."""

    def process_request(self, request, spider):
        request.meta.setdefault("playwright", True)
        return None


class TestSpider(SitemapSpider):
    name = "test"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:142.0) Gecko/20100101 Firefox/142.0",
        "DOWNLOADER_MIDDLEWARES": {PlaywrightDownloaderMiddleware: 543},
    }
    sitemap_urls = ["https://.../sitemap.xml"]

    def _get_sitemap_body(self, response: Response) -> bytes | None:
        # Chromium wraps raw XML in an HTML viewer page; the original
        # document is preserved under #webkit-xml-viewer-source-xml.
        body_str = response.css("#webkit-xml-viewer-source-xml > *").get()
        return body_str.encode("utf-8") if body_str else super()._get_sitemap_body(response)

    def _parse_sitemap(self, response: Response):
        print("Sitemap response type:", type(response))
        return super()._parse_sitemap(response)

    def parse(self, response: Response):
        return {"url": response.url, "status": response.status}