
httpx.InvalidURL: Invalid non-printable ASCII character in URL


Hey, I'm trying to scrape music, but it seems that the crawler adds an invalid URL via await context.enqueue_links(strategy="all"). When I run my code, I get this error:

[crawlee.autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
Traceback (most recent call last):
  File "/home/jourdelune/dev/Crawler/src/main.py", line 21, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/jourdelune/dev/Crawler/src/main.py", line 14, in main
    await crawler.run(
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 359, in run
    await run_task
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 398, in _run_crawler
    await self._pool.run()
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 185, in run
    await run.result
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 336, in _worker_task
    await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 734, in __run_task_function
    await self._commit_request_handler_result(crawling_context, result)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 653, in _commit_request_handler_result
    destination = httpx.URL(request_model.url)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urls.py", line 115, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urlparse.py", line 163, in urlparse
    raise InvalidURL("Invalid non-printable ASCII character in URL")
httpx.InvalidURL: Invalid non-printable ASCII character in URL

The invalid URL is: https://www.linkedin.com/company/nic-br/
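
What's confusing is that this URL parses fine on its own, so the string that actually gets enqueued must contain something that isn't visible when printed. A quick check with httpx:

import httpx

# Parsing the displayed URL directly raises no error, so the enqueued
# string presumably contains an extra, non-printable character:
print(httpx.URL("https://www.linkedin.com/company/nic-br/"))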

Code:

import re
import urllib.parse

from crawlee.basic_crawler import Router
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext

# The handler below receives a BeautifulSoupCrawlingContext, so the router
# should be parametrized accordingly.
router = Router[BeautifulSoupCrawlingContext]()

regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/\/=]*)\.(mp3|wav|ogg)"


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    # Un-escape JSON-style escaped slashes so the regex can match plain URLs.
    html_page = str(context.soup).replace("\\/", "/")

    matches = re.finditer(regex, html_page)

    audio_links = [html_page[match.start() : match.end()] for match in matches]

    for link in audio_links:
        link = urllib.parse.urljoin(url, link)

        data = {
            "url": link,
            "label": "audio",
        }

        await context.push_data(data)

    await context.enqueue_links(strategy="all")

Jourdelune · Jul 21 '24 19:07

Hello, and thanks for your interest in Crawlee! Could you please provide a minimal reproducing example for this? The other file that imports your router should do it 🙂

From what I see, you're trying to crawl the whole world-wide-web - that's what strategy="all" does. Do you really want this?
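
If you only mean to stay on the site you start from, a narrower strategy would avoid most of that; a minimal sketch, assuming this version accepts the same string literals as your strategy="all" call:

# Only enqueue links that point to the same domain as the current page:
await context.enqueue_links(strategy="same-domain")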

Also, could you try to get the offending URL and the page where it was found? Seeing it could help us find out how those non-printable characters got there.

janbuchar · Jul 22 '24 15:07

Hey, thank you for the answer. Here is the code that imports the router:

"""
main script for the crawler
"""

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

from routes import router
from utils import process


async def main() -> None:
    """
    Function to launch the crawler
    """

    crawler = BeautifulSoupCrawler(
        request_handler=router,
    )

    await crawler.run(
        ["https://www.cgi.br/publicacao/revista-br-ano-07-2016-edicao-09/"]
    )
    await crawler.export_data("results.json")
    process("results.json")


if __name__ == "__main__":
    asyncio.run(main())

I want to crawl the whole web to create a dataset of song URLs (to train an AI music generation model); that's why I use strategy="all". If you run the code, you should get the error.

The page where the invalid URL is found is https://www.cgi.br/publicacao/revista-br-ano-07-2016-edicao-09/ and the invalid URL is https://www.linkedin.com/company/nic-br/

Jourdelune · Jul 22 '24 15:07

Huh, this is getting interesting. I added this to the request handler:

links = "\n".join(repr(link.attrs.get("href")) for link in context.soup.select("a"))
context.log.info(f"links found: {links}")

...and it showed me that the LinkedIn link in fact contains a line break:

<a class="btn-floating btn-lg btn-li" type="button" role="button" href="https://www.linkedin.com/company/nic-br/
  " target="_blank">
  <i class="fab fa-linkedin-in"></i>
</a>

However unusual this is, I'll add a .strip() to the enqueue_links implementation.
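
For illustration, stripping the surrounding whitespace is enough to make the extracted href parse again; a quick check with httpx, using the href value from the snippet above:

import httpx

href = "https://www.linkedin.com/company/nic-br/\n  "  # as extracted from the page
# httpx.URL(href) raises InvalidURL because of the newline; after
# stripping the whitespace, it parses cleanly:
print(httpx.URL(href.strip()))  # https://www.linkedin.com/company/nic-br/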

janbuchar · Jul 23 '24 08:07