httpx.InvalidURL: Invalid non-printable ASCII character in URL #337

@Jourdelune

Description

Hey, I'm trying to scrape music, but it seems that the crawler adds an invalid URL via `await context.enqueue_links(strategy="all")`. When I run my code, I get this error:

[crawlee.autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
Traceback (most recent call last):
  File "/home/jourdelune/dev/Crawler/src/main.py", line 21, in <module>
    asyncio.run(main())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/jourdelune/dev/Crawler/src/main.py", line 14, in main
    await crawler.run(
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 359, in run
    await run_task
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 398, in _run_crawler
    await self._pool.run()
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 185, in run
    await run.result
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/autoscaling/autoscaled_pool.py", line 336, in _worker_task
    await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 734, in __run_task_function
    await self._commit_request_handler_result(crawling_context, result)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/crawlee/basic_crawler/basic_crawler.py", line 653, in _commit_request_handler_result
    destination = httpx.URL(request_model.url)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urls.py", line 115, in __init__
    self._uri_reference = urlparse(url, **kwargs)
  File "/home/jourdelune/dev/Crawler/env/lib/python3.10/site-packages/httpx/_urlparse.py", line 163, in urlparse
    raise InvalidURL("Invalid non-printable ASCII character in URL")
httpx.InvalidURL: Invalid non-printable ASCII character in URL

The invalid URL is: https://www.linkedin.com/company/nic-br/
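The printed URL looks perfectly valid, which suggests the raw href probably carries an invisible character (a newline or tab) that httpx rejects before parsing. A minimal stdlib-only sketch of a sanitizer that strips such characters (the helper name is hypothetical, not part of Crawlee or httpx):

```python
def sanitize_url(raw: str) -> str:
    # httpx raises InvalidURL for any URL containing a non-printable
    # ASCII character (newline, tab, etc.); drop those characters
    # while leaving the visible part of the URL untouched.
    return "".join(
        ch for ch in raw if not (ch.isascii() and not ch.isprintable())
    )


print(sanitize_url("https://www.linkedin.com/company/nic-br/\n"))
# → https://www.linkedin.com/company/nic-br/
```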

Code:

import re
import urllib.parse

from crawlee.basic_crawler import Router
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.playwright_crawler import PlaywrightCrawlingContext

router = Router[PlaywrightCrawlingContext]()

regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()!@:%_\+.~#?&\/\/=]*)\.(mp3|wav|ogg)"


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    url = context.request.url
    html_page = str(context.soup).replace("\\/", "/")  # unescape JSON-style "\/" sequences

    matches = re.finditer(regex, html_page)

    audio_links = [html_page[match.start() : match.end()] for match in matches]

    for link in audio_links:
        link = urllib.parse.urljoin(url, link)

        data = {
            "url": link,
            "label": "audio",
        }

        await context.push_data(data)

    await context.enqueue_links(strategy="all")
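To pinpoint which enqueued link actually triggers the crash, one could mirror the check httpx performs and log every offending character before the request is committed. The helper below is only a debugging sketch (the function name is made up, and the check is a reimplementation of httpx's validation, not a call into it):

```python
def find_nonprintable(url: str) -> list[tuple[int, str]]:
    # httpx rejects any URL containing an ASCII character that is not
    # printable; report the index and repr of each offending character
    # so the hidden byte in an otherwise valid-looking URL shows up.
    return [
        (i, repr(ch))
        for i, ch in enumerate(url)
        if ch.isascii() and not ch.isprintable()
    ]


print(find_nonprintable("https://a.b/\tc"))  # → [(12, "'\\t'")]
```

Running this on every URL just before `enqueue_links` (or on the request queue contents) should reveal whether the LinkedIn link really contains a stray newline or tab.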

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.