External links support #5

Closed
untitaker opened this issue Oct 17, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@untitaker
Owner

untitaker commented Oct 17, 2020

External link support was not built because fetching remote content is slow and flaky. Ideas:

  • Support sitemap.xml only or at least attempt to use it as fastpath.
  • Maybe make user cache/store sitemaps for all external domains so flakiness can be kept in check
  • Add subcommand to generate sitemap.xml for own static site

Why do it this way? Because our actual use case is only checking links from docs.sentry.io to sentry.io. Both are static sites we control, so we could make sure everything has sitemaps and still get away with very fast builds. sentry.io already has a sitemap.

However, for a general-purpose external links checker we probably really need to support real HTTP + build a local cache file, maybe. Also for anchor-checking sitemap.xml doesn't work.
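The sitemap fast path described above could look roughly like the following. This is only a sketch under assumptions, not how hyperlink works: the helper names `load_sitemap_urls` and `is_known` are hypothetical, the sitemap is assumed to be a plain `<urlset>` (no sitemap index files), and fragments are dropped because a sitemap cannot say anything about anchors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def load_sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a sitemap, without trailing slashes."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip().rstrip("/")
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def is_known(link, known_urls):
    """Check a link against the sitemap, ignoring any #fragment."""
    url, _fragment = urldefrag(link)
    return url.rstrip("/") in known_urls
```

Because the lookup is a set membership test on data fetched once per domain, no HTTP request is needed per link, which is the point of the fast path.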

@untitaker untitaker added the enhancement New feature or request label Oct 24, 2020
@untitaker
Owner Author

untitaker commented Dec 13, 2020

Thinking more about this, it does not make sense to check for broken external links when a new commit has been created on master. If an external reference becomes invalid, that is not logically associated with a code change so doing it on code change (and associating results with a code change) does not make for sensible UX.

GitHub Actions could still be used to run hyperlink periodically, but instead of having a binary result (master broken/not broken) I think it would make more sense to create GitHub issues and assign them to the people who introduced the broken link. This is similar to how Sentry tracks prod errors (Sentry + suspect commits could probably be abused to avoid building a new frontend).

External sites' URLs are too much of a moving target, so nobody wants to check them in CI, and they end up not checking them at all. Or they use some spider service that does it for them, but those are either expensive or don't really have an issue-tracking workflow of some sort.

@mwcz
Contributor

mwcz commented Sep 1, 2022

How would you feel about having an option to simply count and print the number of external links in the final summary, something like "Found 371 external links"? I'm happy to put together a PR if this would be useful.
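For illustration, counting external links could be sketched like this. It is not hyperlink's implementation: `ExternalLinkCounter` and `count_external_links` are hypothetical names, and "external" is assumed to mean an `<a href>` with an http/https scheme or a protocol-relative `//host/...` target.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ExternalLinkCounter(HTMLParser):
    """Counts <a href> targets that point at another host."""

    def __init__(self):
        super().__init__()
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: absolute http(s) URLs and //host/... links are external.
        if urlsplit(href).scheme in ("http", "https") or href.startswith("//"):
            self.external += 1


def count_external_links(html_documents):
    """Sum external links over an iterable of HTML strings."""
    total = 0
    for doc in html_documents:
        parser = ExternalLinkCounter()  # fresh parser per document
        parser.feed(doc)
        parser.close()
        total += parser.external
    return total
```

A summary line in the proposed style would then be `print(f"Found {count_external_links(docs)} external links")`.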

@untitaker
Owner Author

@mwcz that's probably useful, yeah. feel free to give it a try

@matklad

matklad commented Nov 2, 2023

Another idea:

Allow to specify URL remaps on the command line, like hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to ../local-tigerbeetle-repo.

Specific use-case here: we publish our docs as a static site to https://docs.tigerbeetle.com. From those docs, we occasionally want to refer to source files in the github repo. It would be great to check those links in HTML, but there's no need to go and actually curl github for these, we can check links against local files in the neighbouring dir.

P.S. Thanks for building hyperlink, such a no-nonsense piece of software, love it!

@untitaker
Owner Author

> Allow to specify URL remaps on the command line, like hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to ../local-tigerbeetle-repo.

I think if the API looks like that, hyperlink will have to make assumptions about which paths are valid, and those assumptions are incompatible with how static sites are typically served. For example, a link to a directory such as https://github.com/tigerbeetle/tigerbeetle/src/ should be considered valid because GitHub can serve something valid at that URL, but hyperlink would not generally consider it valid today because most static site hosts do not do directory listings.

One could fix this particular example by adding an "assume directory listings" option, but that's just one example. Particularly around anchors, GitHub's way of serving up a directory tree just differs too much from a static site host (https://github.com/user/repo/file.txt#L123). And then there is the question of whether hyperlink should parse any HTML file in the referenced directory. It's a slippery slope toward too many "server-specific" tweaking options.

Right now I believe the sitemap thing is the best idea, except make it not XML, but a simple textfile:

(cd ../local-tigerbeetle-repo && find .) > urls.txt
hyperlink --remap-from https://github.com/tigerbeetle/tigerbeetle --remap-to urls.txt

then one can add/remove -type f or use a completely different script to control the list of valid URLs.
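To make the text-file idea concrete, here is a rough sketch of how such a list could be matched against links. Note that --remap-from and --remap-to are only proposed in this thread, and `load_url_list` and `check_link` are hypothetical helpers. The sketch assumes the file holds `find .` output (entries prefixed with `./`) and that fragments such as `#L123` are ignored.

```python
from urllib.parse import urldefrag


def load_url_list(lines, remap_from):
    """Turn `find .` output into the set of URLs considered valid."""
    base = remap_from.rstrip("/")
    urls = set()
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # `find .` prints the root as "." and everything else as "./name".
        if path == ".":
            rel = ""
        elif path.startswith("./"):
            rel = path[2:]
        else:
            rel = path
        urls.add(f"{base}/{rel}".rstrip("/"))
    return urls


def check_link(link, valid_urls):
    """True if the link, minus fragment and trailing slash, is in the list."""
    url, _fragment = urldefrag(link)
    return url.rstrip("/") in valid_urls
```

Whether directories count as valid is then decided entirely by what the list contains, for example by whether `-type f` was passed to `find`.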

the downside of course is that all URLs have to be enumerated upfront, which hurts if there are very few external links to check. another option is this:

hyperlink ... --remap-to 'test -f ../local-tigerbeetle-repo/{}'

but shelling out for every link is unacceptable except in the case where there are really very few external links to check.

thoughts?

@matklad

matklad commented Nov 3, 2023

🤔 maybe flip this around? Have hyperlink produce the list of external urls as a .txt on stdout to allow the user to pipe that into a custom script with arbitrary logic?

And also maybe a dual subcommand to take a list of URLs as input and check them against the directory?

That way, the original issue with two cross-linking static sites could be solved by:

hyperlink ./site-a -print-urls | hyperlink ./site-b -read-urls
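The receiving half of that pipe could be approximated by a small script. The following is a sketch under assumptions rather than anything hyperlink provides: `missing_urls` is a hypothetical helper, URLs are assumed to arrive one per line, a directory is considered valid only if it contains index.html, and percent-encoding and anchors are not handled.

```python
from pathlib import Path
from urllib.parse import urlsplit


def missing_urls(urls, base_url, site_root):
    """Return the URLs under base_url that have no matching file in site_root."""
    base = base_url.rstrip("/")
    root = Path(site_root)
    missing = []
    for raw in urls:
        raw = raw.strip()
        # Drop query string and fragment before mapping the URL to a path.
        url = urlsplit(raw)._replace(query="", fragment="").geturl()
        if url != base and not url.startswith(base + "/"):
            continue  # link does not point into this site
        target = root / url[len(base):].strip("/")
        if target.is_dir():
            # Assumption: directories are served through their index.html.
            target = target / "index.html"
        if not target.is_file():
            missing.append(raw)
    return missing
```

Used as a filter, it could read `sys.stdin` and exit non-zero when the returned list is non-empty, which would let a custom script sit at the end of the pipeline.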

@untitaker
Owner Author

untitaker commented Nov 3, 2023 via email

@untitaker
Owner Author

Now I see no cheap way to validate changes to one site without caching the SSG output somewhere. If I want to run CI to validate a newly added external link to site A, I need to:

  1. In site B's CI, store the SSG output as an artifact somewhere (multiple megabytes)
  2. In site A's CI, pull the SSG output from B, so I can run hyperlink ./site-a -print-urls | hyperlink ./site-b -read-urls

Compare to sitemap-style:

  1. In site B's CI, store the sitemap as an artifact (<1MB)
  2. In site A's CI, pull the sitemap

Perhaps both approaches need to be implemented.

@untitaker
Owner Author

version 0.1.32 is out which contains a new experimental dump-external-links subcommand. let me know if it helps scripting a solution to external links as outlined in #158 (comment) -- if that works out we can perhaps add better support for ingesting those external links and showing source information correctly.
