External links support #5

untitaker · 2020-10-17T14:12:46Z

External link support was not built because fetching remote content is slow and flaky. Ideas:

Support sitemap.xml only, or at least attempt to use it as a fast path.
Maybe make the user cache/store sitemaps for all external domains so flakiness can be kept in check.
Add a subcommand to generate sitemap.xml for one's own static site.
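
To make the sitemap fast path concrete, here is a minimal Python sketch, not hyperlink's implementation. It assumes the sitemap follows the standard sitemaps.org schema; the function names and the sample URLs are made up for illustration. Fragments are dropped before lookup because a sitemap lists pages, not anchors.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def load_sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a sitemap, without trailing slashes."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip().rstrip("/")
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    }


def find_broken(links, known_urls):
    """Return the links whose page is not listed in the sitemap."""
    broken = []
    for link in links:
        # Sitemaps cannot validate anchors, so only the page part is checked.
        page = link.split("#", 1)[0].rstrip("/")
        if page not in known_urls:
            broken.append(link)
    return broken


# Hypothetical sample data, not the real sentry.io sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://sentry.io/</loc></url>
  <url><loc>https://sentry.io/pricing/</loc></url>
</urlset>"""

known = load_sitemap_urls(SAMPLE)
print(find_broken(
    ["https://sentry.io/pricing/#plans", "https://sentry.io/missing/"],
    known,
))
```

Because the lookup is a set membership test against a file that can be cached, no network request is needed per link.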

Why do it this way? Because our actual use case is only checking links from docs.sentry.io to sentry.io. Both are static sites we control, so we could make sure everything has sitemaps and still get away with very fast builds. sentry.io already has a sitemap.

However, a general-purpose external link checker probably needs to support real HTTP requests and maybe build a local cache file. Also, sitemap.xml doesn't work for anchor checking.

untitaker · 2020-12-13T20:37:15Z

Thinking more about this, it does not make sense to check for broken external links when a new commit has been created on master. If an external reference becomes invalid, that is not logically associated with a code change so doing it on code change (and associating results with a code change) does not make for sensible UX.

GitHub Actions could still be used to run hyperlink periodically, but instead of having a binary result (master broken/not broken) I think it would make more sense to create GitHub issues and assign them to the people who introduced the broken link. This is similar to how Sentry tracks prod errors (Sentry + suspect commits could probably be abused to avoid building a new frontend).

External sites' URLs are too much of a moving target, so nobody wants to check them in CI, and they end up not checking them at all. Or they use some spider service that does it for them, but those are either expensive or don't really have an issue-tracking workflow of some sort.

mwcz · 2022-09-01T16:09:29Z

How would you feel about having an option to simply count and print the number of external links in the final summary, something like "Found 371 external links"? I'm happy to put together a PR if this would be useful.
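
For illustration only, counting external links in a page could look like the Python sketch below. It assumes "external" means an `<a href>` with an absolute http or https URL, which may differ from how hyperlink classifies links; protocol-relative links such as `//example.com` are not counted here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalLinkCounter(HTMLParser):
    """Counts <a href="..."> tags whose target is an absolute http(s) URL."""

    def __init__(self):
        super().__init__()
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a link is external if it carries an http(s) scheme.
        if href and urlparse(href).scheme in ("http", "https"):
            self.external += 1


def count_external_links(html_text):
    parser = ExternalLinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.external


page = '<a href="https://sentry.io/">a</a> <a href="/local/">b</a>'
print(f"Found {count_external_links(page)} external links")
```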

untitaker · 2022-09-01T16:24:14Z

@mwcz that's probably useful, yeah. feel free to give it a try

matklad · 2023-11-02T18:32:44Z

Another idea:

Allow to specify URL remaps on the command line, like hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to ../local-tigerbeetle-repo.

Specific use case here: we publish our docs as a static site to https://docs.tigerbeetle.com. From those docs, we occasionally want to refer to source files in the GitHub repo. It would be great to check those links in HTML, but there's no need to actually curl GitHub for these; we can check links against local files in the neighbouring dir.

P.S. Thanks for building hyperlink, such a no-nonsense piece of software, love it!

untitaker · 2023-11-03T01:13:12Z

> Allow to specify URL remaps on the command line, like hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to ../local-tigerbeetle-repo.

I think if the API looks like that, hyperlink will have to make assumptions about which paths are valid, and those assumptions are incompatible with how static sites are typically served. For example, a link to a directory such as https://github.com/tigerbeetle/tigerbeetle/src/ should be considered valid because GitHub can serve something valid at that URL, but hyperlink would not generally consider it valid today because most static site hosts do not do directory listings.

One could fix this particular example by adding an "assume directory listings" option, but that's just one example... particularly around anchors, GitHub's way of serving up a directory tree just differs too much from a static site host (https://github.com/user/repo/file.txt#L123). And then there is the question of whether hyperlink should parse any HTML file in the referenced directory. It's a slippery slope of adding too many "server-specific" tweaking options.

Right now I believe the sitemap thing is the best idea, except make it not XML, but a simple text file:

(cd ../local-tigerbeetle-repo && find .) > urls.txt
hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to urls.txt

then one can add/remove -type f or use a completely different script to have control over the list of valid URLs.
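
As a rough sketch of how such a text-file lookup could behave (the --remap-from/--remap-to flags are only a proposal in this thread, and the function names and file names below are hypothetical), in Python:

```python
def load_url_list(lines):
    """Normalize `find .` output ('./src/main.zig') to relative paths ('src/main.zig')."""
    paths = set()
    for line in lines:
        line = line.strip()
        if line.startswith("./"):
            line = line[2:]
        # Skip blank lines and the '.' entry that `find .` prints for the root.
        if line and line != ".":
            paths.add(line.rstrip("/"))
    return paths


def check_remapped(link, prefix, known_paths):
    """Return None if link is outside prefix, else True/False for listed/unlisted."""
    base = prefix.rstrip("/")
    page = link.split("#", 1)[0]  # anchors cannot be validated from a path list
    if page != base and not page.startswith(base + "/"):
        return None
    rel = page[len(base):].strip("/")
    return rel == "" or rel in known_paths


PREFIX = "https://github.com/tigerbeetle/tigerbeetle"
# Hypothetical listing; real content would come from urls.txt.
known = load_url_list([".", "./src", "./src/main.zig", "./README.md"])
print(check_remapped(PREFIX + "/src/main.zig#L12", PREFIX, known))
print(check_remapped(PREFIX + "/docs/missing.md", PREFIX, known))
```

Since directories appear in the output of `find .` unless -type f is added, a link to a directory passes here without any "assume directory listings" switch. The sketch follows the simplified URL shape used in this thread; real GitHub file URLs typically contain extra segments such as blob/<branch>, which would need to be handled by whatever script produces the list.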

the downside of course is that all URLs have to be enumerated upfront, which hurts if there are very few external links to check. another option is this:

hyperlink ... --remap-to 'test -f ../local-tigerbeetle-repo/{}'

but shelling out for every link is unacceptable in all other cases except the one where there's really very few external links to check.

thoughts?

matklad · 2023-11-03T08:51:08Z

🤔 maybe flip this around? Have hyperlink produce the list of external urls as a .txt on stdout to allow the user to pipe that into a custom script with arbitrary logic?

And also maybe a dual subcommand to take a list of URLs as input and check them against the directory?

That way, the original issue with two cross-linking static sites could be solved by:

hyperlink ./site-a -print-urls | hyperlink ./site-b -read-urls
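
The reading half of that pipeline could be approximated by a user script along the following lines. This is a sketch under assumptions: -print-urls and -read-urls are proposed flags, not existing ones; the input is assumed to be one URL per line; and a directory is treated as valid only if it contains index.html, mirroring typical static hosting. All function names are hypothetical.

```python
import os


def url_exists_in_site(url, site_root, base_url):
    """Return None if url is not under base_url, else whether it maps to a file."""
    base = base_url.rstrip("/")
    # Drop fragment and query; only the path is checked against the file tree.
    page = url.split("#", 1)[0].split("?", 1)[0]
    if page != base and not page.startswith(base + "/"):
        return None
    rel = page[len(base):].strip("/")
    candidate = os.path.join(site_root, *rel.split("/")) if rel else site_root
    if os.path.isdir(candidate):
        # Assumption: directories are served via their index.html.
        return os.path.isfile(os.path.join(candidate, "index.html"))
    return os.path.isfile(candidate)


def broken_urls(lines, site_root, base_url):
    """Return URLs under base_url that have no matching file in site_root."""
    broken = []
    for line in lines:
        url = line.strip()
        if url and url_exists_in_site(url, site_root, base_url) is False:
            broken.append(url)
    return broken
```

Passing `sys.stdin` as `lines` would let the script sit at the end of a pipe. URLs outside `base_url` are skipped rather than reported, since this script can only vouch for the one site it has on disk.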

untitaker · 2023-11-03T11:26:10Z

yeah I think that's better

untitaker · 2023-11-05T14:05:55Z

Now I see no cheap way to validate changes to one site without caching the SSG output somewhere. If I want to run CI to validate a newly added external link to site A, I need to:

In site B's CI, store the SSG output as an artifact somewhere (multiple megabytes)
In site A's CI, pull the SSG output from B, so I can run hyperlink ./site-a -print-urls | hyperlink ./site-b -read-urls

Compare to sitemap-style:

In site B's CI, store the sitemap as an artifact (<1MB)
In site A's CI, pull the sitemap

Perhaps both approaches need to be implemented.

untitaker · 2023-11-29T14:53:50Z

version 0.1.32 is out which contains a new experimental dump-external-links subcommand. let me know if it helps scripting a solution to external links as outlined in #158 (comment) -- if that works out we can perhaps add better support for ingesting those external links and showing source information correctly.

untitaker added the enhancement (New feature or request) label Oct 24, 2020

mwcz mentioned this issue Sep 2, 2022: add dump-external-links subcommand #158 (Closed)

untitaker mentioned this issue Nov 29, 2023: add dump-external links command #168 (Merged)

untitaker closed this as completed in 66f8416 Oct 21, 2024
