Hey, I'm trying to scrape music, but it seems that the crawler, with await context.enqueue_links(strategy="all"), enqueues an invalid URL. When I run my code I get this error:
[crawlee.autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
Traceback (most recent call last):
  File "/home/jourdelune/dev/Crawler/src/main.py", line 21, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/jourdelune/dev/Crawler/src/main.py", line 14, in main
    await crawler.run(
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 359, in run
    await run_task
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 398, in _run_crawler
    await self._pool.run()
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 185, in run
    await run.result
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 336, in _worker_task
    await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 734, in __run_task_function
    await self._commit_request_handler_result(crawling_context, result)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 653, in _commit_request_handler_result
    destination = httpx.URL(request_model.url)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urls.py", line 115, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urlparse.py", line 163, in urlparse
    raise InvalidURL("Invalid non-printable ASCII character in URL")
httpx.InvalidURL: Invalid non-printable ASCII character in URL
The invalid URL is: https://www.linkedin.com/company/nic-br/
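The printed URL looks perfectly fine, so my guess is that it carries an invisible control character (for example a trailing newline from the href). A quick stdlib-only check like this (just my own sketch, not part of Crawlee) shows how such a character would be invisible when printed yet rejected as "non-printable ASCII":

```python
def has_non_printable_ascii(url: str) -> bool:
    # True if the URL contains any ASCII control character
    # (0x00-0x1F or 0x7F), the class of characters httpx rejects.
    return any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url)

clean = "https://www.linkedin.com/company/nic-br/"
dirty = clean + "\n"  # looks identical to `clean` when printed

print(has_non_printable_ascii(clean))  # False
print(has_non_printable_ascii(dirty))  # True
```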
Code:
import re
import urllib.parse

from crawlee.basic_crawler import Router
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawlingContext

router = Router[PlaywrightCrawlingContext]()

regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/\/=]*)\.(mp3|wav|ogg)"


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    html_page = str(context.soup).replace("\/", "/")

    matches = re.finditer(regex, html_page)
    audio_links = [html_page[match.start() : match.end()] for match in matches]

    for link in audio_links:
        link = urllib.parse.urljoin(url, link)
        data = {
            "url": link,
            "label": "audio",
        }
        await context.push_data(data)

    await context.enqueue_links(strategy="all")
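As a workaround while I look for the source of the stray character, I was thinking of stripping control characters from URLs myself before they reach httpx. Something like this (a minimal sketch; sanitize_url is my own hypothetical helper, not a Crawlee API):

```python
def sanitize_url(url: str) -> str:
    # Drop ASCII control characters (0x00-0x1F and 0x7F), e.g. a stray
    # newline or tab embedded in an href, which httpx refuses to parse.
    return "".join(ch for ch in url if ord(ch) >= 0x20 and ord(ch) != 0x7F)

print(sanitize_url("https://www.linkedin.com/company/nic-br/\n"))
# https://www.linkedin.com/company/nic-br/
```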