UTF8 characters cause valid links to be detected as broken #234

@matkoniecz

I prepared a test case at https://github.com/matkoniecz/broken-link-checker-local-utf8 and ran:

blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r

See https://matkoniecz.github.io/broken-link-checker-local-utf8/ - both links work, but the one with UTF-8 characters gets BLC_UNKNOWN/HTTP_undefined errors:

mateusz@grima:~$ blc https://matkoniecz.github.io/broken-link-checker-local-utf8 -r
Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8
├───OK─── https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (BLC_UNKNOWN)
Finished! 2 links found. 1 broken.

Getting links from: https://matkoniecz.github.io/broken-link-checker-local-utf8/test%20space.html
└─BROKEN─ https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html (HTTP_undefined)
Finished! 2 links found. 1 excluded. 1 broken.

Finished! 4 links found. 1 excluded. 2 broken.
Elapsed time: 1 second

Sorry if this is a misunderstanding on my part, but as I understand it, UTF-8 characters de facto work in links.

UTF-8 characters may be represented differently internally, but browsers seem 100% fine with links containing letters such as those with an ogonek (https://en.wikipedia.org/wiki/Ogonek).

Sanity check: https://stackoverflow.com/questions/22357509/can-urls-have-utf-8-characters

Even DNS supports UTF-8 characters (with some workarounds and restrictions): https://en.wikipedia.org/wiki/Internationalized_domain_name
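
To illustrate what "internally different" means for the path part: as far as I understand, browsers percent-encode the UTF-8 bytes of non-ASCII path characters before sending the request, so the server only ever sees an ASCII URL. Below is a minimal Python sketch of that step. It is not blc's actual code, and `percent_encode_path` is a hypothetical helper name I made up for this example.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path of `url` as UTF-8 bytes.

    Hypothetical helper for illustration only. "/" and "%" are kept as-is so
    that path separators and already-encoded sequences (e.g. "%20") survive.
    """
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe="/%")  # quote() encodes as UTF-8 by default
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


raw = "https://matkoniecz.github.io/broken-link-checker-local-utf8/test_zażółć.html"
encoded = percent_encode_path(raw)
print(encoded)

# Decoding the ASCII form gives back the original link.
print(unquote(encoded) == raw)
```

For the link from the test case this should print `https://matkoniecz.github.io/broken-link-checker-local-utf8/test_za%C5%BC%C3%B3%C5%82%C4%87.html`, while the already-encoded `test%20space.html` link passes through unchanged. My guess is that a link checker requesting the encoded form would get the same page a browser gets, but I have not verified where exactly blc fails.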

Replaces LukasHechenberger/broken-link-checker-local#50
