httpx.InvalidURL: Invalid non-printable ASCII character in URL
Hey, I'm trying to scrape music, but it seems that the crawler adds an invalid URL when using await context.enqueue_links(strategy="all"). When I run my code, I get this error:
[crawlee.autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
Traceback (most recent call last):
  File "/home/jourdelune/dev/Crawler/src/main.py", line 21, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/jourdelune/dev/Crawler/src/main.py", line 14, in main
    await crawler.run(
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 359, in run
    await run_task
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 398, in _run_crawler
    await self._pool.run()
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 185, in run
    await run.result
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 336, in _worker_task
    await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 734, in __run_task_function
    await self._commit_request_handler_result(crawling_context, result)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 653, in _commit_request_handler_result
    destination = httpx.URL(request_model.url)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urls.py", line 115, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urlparse.py", line 163, in urlparse
    raise InvalidURL("Invalid non-printable ASCII character in URL")
httpx.InvalidURL: Invalid non-printable ASCII character in URL
The invalid URL is: https://www.linkedin.com/company/nic-br/
Code:
import re
import urllib.parse

from crawlee.basic_crawler import Router
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext

router = Router[BeautifulSoupCrawlingContext]()

# Matches absolute URLs that point at an .mp3, .wav or .ogg file
regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/\/=]*)\.(mp3|wav|ogg)"


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url

    # Undo JSON-style escaping (\/ -> /) so the regex can match plain URLs
    html_page = str(context.soup).replace("\\/", "/")

    matches = re.finditer(regex, html_page)
    audio_links = [html_page[match.start() : match.end()] for match in matches]

    for link in audio_links:
        # Resolve relative matches against the page URL
        link = urllib.parse.urljoin(url, link)
        data = {
            "url": link,
            "label": "audio",
        }
        await context.push_data(data)

    await context.enqueue_links(strategy="all")
Hello, and thanks for your interest in Crawlee! Could you please provide a minimal reproducing example for this? The other file that imports your router should do it :slightly_smiling_face:
From what I see, you're trying to crawl the whole world-wide-web - that's what strategy="all" does. Do you really want this?
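If you only meant to stay within the sites you start from, a narrower strategy might serve you better. A quick sketch (the exact set of accepted values can vary between Crawlee versions):

# Only follow links that stay on the current page's domain:
await context.enqueue_links(strategy="same-domain")  # or "same-hostname" / "same-origin"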
Also, could you try to get the offending URL and the page where it was found? Seeing it could help us find out how those non-printable characters got there.
Hey, thank you for the answer. Here is the code that imports the router:
"""
main script for the crawler
"""
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from routes import router
from utils import process
async def main() -> None:
"""
Function to launch the crawler
"""
crawler = BeautifulSoupCrawler(
request_handler=router,
)
await crawler.run(
["https://www.cgi.br/publicacao/revista-br-ano-07-2016-edicao-09/"]
)
await crawler.export_data("results.json")
process("results.json")
if __name__ == "__main__":
asyncio.run(main())
I want to crawl the full web to create a dataset of song URLs (to train an AI music generation model), which is why I use strategy="all". If you run the code, you should get the error.
The page where it picks up the invalid URL is https://www.cgi.br/publicacao/revista-br-ano-07-2016-edicao-09/ and the invalid URL is https://www.linkedin.com/company/nic-br/
Huh, this is getting interesting. I added this to the request handler:
links = "\n".join(repr(link.attrs.get("href")) for link in context.soup.select("a"))
context.log.info(f"links found: {links}")
...and it showed me that the LinkedIn link in fact contains a line break:
<a class="btn-floating btn-lg btn-li" type="button" role="button" href="https://www.linkedin.com/company/nic-br/
" target="_blank">
<i class="fab fa-linkedin-in"></i>
</a>
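For the record, that trailing newline alone is enough to trip httpx's parser; a minimal check you can run yourself:

import httpx

url = "https://www.linkedin.com/company/nic-br/\n"
# httpx.URL(url) raises httpx.InvalidURL: Invalid non-printable ASCII character in URL
httpx.URL(url.strip())  # parses fine once the whitespace is gone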
However unusual this is, I'll add a .strip() to the enqueue_links implementation.
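Until that lands in a release, a possible workaround on your side is to enqueue the links yourself and strip them first. A rough sketch for the body of your default_handler (it assumes context.add_requests accepts plain URL strings; depending on your Crawlee version you may need to wrap them in Request objects instead):

from urllib.parse import urljoin

# Collect hrefs manually, strip stray whitespace, resolve relative links,
# then hand them to the crawler instead of calling enqueue_links:
urls = []
for anchor in context.soup.select("a[href]"):
    href = anchor.attrs["href"].strip()  # drops the embedded line break
    if href:
        urls.append(urljoin(context.request.url, href))
await context.add_requests(urls)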