Crawler resolves relative URLs incorrectly on pages reached from a redirect #6
Comments
The responsible line is in
Here, I was going to write a fix, but I think I need to step back and understand the design first -- the The more I think about it, the more I feel it would make sense to not handle redirects in a loop in the When receiving a redirect, handle that response as its own page. Update the Some reasons I think this design makes sense:
In other words, I think it makes better sense to index them separately. What do you think?
Edit: One possible downside with this design is that it is hard to enforce a maximum number of redirects. But is this any different from, say, a capsule that dynamically generates thousands of pages with their own URLs and links in between them? Maybe redirect behavior could be caught by other, more general mechanisms (if such exist; I haven't read the code), such as a maximum number of URLs per host, or similar.
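A minimal sketch of the design proposed above, not TLGS's actual code: a redirect is indexed as its own page whose only outgoing link is the redirect target, and a per-host URL cap stands in for a redirect limit. Everything here is hypothetical: the `crawl` function, the `max_urls_per_host` parameter, and the in-memory `FAKE_RESPONSES` table that replaces real network requests. Registering `gemini` with `urllib.parse` is a workaround, since `urljoin` only resolves relative references for schemes it knows about.

```python
import urllib.parse
from collections import defaultdict, deque
from urllib.parse import urljoin, urlsplit

# Workaround: let urljoin resolve relative references for gemini:// URLs.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")

# Hypothetical stand-in for network responses: url -> (status, payload).
# A 2x payload is the list of links on the page; a 3x payload is the target.
FAKE_RESPONSES = {
    "gemini://a.example/": ("20", ["/next"]),
    "gemini://a.example/next": ("30", "gemini://b.example/"),
    "gemini://b.example/": ("20", ["/about"]),
    "gemini://b.example/about": ("20", []),
}

def crawl(start, max_urls_per_host=100):
    """Breadth-first crawl that treats every redirect as its own page."""
    queue = deque([start])
    seen = {start}
    fetched_per_host = defaultdict(int)
    index = {}  # url -> absolute outgoing links
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if fetched_per_host[host] >= max_urls_per_host:
            continue  # general cap, also bounds redirect chains on one host
        fetched_per_host[host] += 1
        status, payload = FAKE_RESPONSES.get(url, ("51", None))
        if status.startswith("3"):
            # The redirect is a page with exactly one link: its target.
            targets = [urljoin(url, payload)]
        elif status.startswith("2"):
            # Links are always resolved against the URL that was fetched.
            targets = [urljoin(url, link) for link in payload]
        else:
            continue
        index[url] = targets
        for target in targets:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index
```

Because each fetch resolves links against the URL it actually requested, there is no separate redirect loop in which the base URL could go stale.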
Hi, I saw your email, and sorry I haven't replied yet; I'm traveling this weekend. As you said, fixing this could be a little tricky. I'm still debating how I should solve it. As for other solutions: I try to block orbits (web rings), as they interfere with the SALSA ranking algorithm while not contributing much useful information about the structure of Geminispace (I might be wrong on this, but Low Earth Orbit does interfere severely). But I also want a more robust method for generic redirection handling. I'll think more about this tomorrow and share my thoughts with you.
No worries! It wasn't my intention to "escalate". I just didn't know that TLGS was open source at first. I wrote the issue when I found out, and by then I had already sent the mail. This is not urgent at all! Hmm... Could you explain what it is about orbits that causes interference? I host orbits on my capsule (well, they only have three users in total, and one is me, so they haven't taken off. Yet.), and I would like to be a good Geminispace citizen. I'd be happy to adjust if there is something I can do from my end. (For example, I could add the redirection links to my robots.txt file.) I also have some thoughts about why orbits can be useful (and how they ought to be implemented), but I'd like to hear your point of view first.
To clarify, I don't block servers that host orbits. But I do block links to orbit endpoints. (ex: The issue is that TLGS's ranking algorithm unreasonably favors hosts with lots of links to each other. This is called the Tightly Knit Community (TKC) effect. Both HITS and SALSA are vulnerable. SALSA, the current default, is less vulnerable, but still affected. PageRank (IIRC used by geminispace.info/GUS) has the same problem. Normally the ranking score should grow approximately linearly with the number of links referencing your page. But under TKC, a small set of pages can gobble up more than 50% of the score with just a few links among them. Empirically, before I blocked LEO, searching for "gemini" on TLGS put 5 LEO capsules in the top 10 results, ranked among geminispace.info, gemini.circumlunar.space, medusae.space, etc., which isn't what most people are looking for when searching that term. You don't need to do anything. robots.txt should be used to block crawlers from pages that you don't want crawled. This is a search engine problem caused by a deficiency in the ranking algorithm. I'll do my best to keep things running and serve quality results.
When the crawler processes a URL that results in a redirect, and the target page contains relative links, those relative links should be resolved using the target URL as a base, not the redirecting URL.
Example:
gemini://raek.se/ links to gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F
gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F redirects to gemini://hanicef.me/
gemini://hanicef.me/ links to /about
Then the crawler should resolve the last link into gemini://hanicef.me/about, but currently it incorrectly resolves it into gemini://raek.se/about.
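The expected behavior can be sketched as follows. This is a minimal illustration and not the crawler's actual code: `crawl_links`, `max_redirects`, and the in-memory `FAKE_RESPONSES` table (standing in for real network requests) are hypothetical names, and registering `gemini` with `urllib.parse` is a workaround, since `urljoin` only resolves relative references for schemes it knows about. The key point is that the base URL is updated on every redirect before any link is resolved.

```python
import urllib.parse
from urllib.parse import urljoin

# Workaround: let urljoin resolve relative references for gemini:// URLs.
for registry in (urllib.parse.uses_relative, urllib.parse.uses_netloc):
    if "gemini" not in registry:
        registry.append("gemini")

ORBIT_NEXT = "gemini://raek.se/orbits/omloppsbanan/next?gemini%3A%2F%2Fraek.se%2F"

# Hypothetical stand-in for network responses: url -> (status, payload).
# A 2x payload is the list of links on the page; a 3x payload is the target.
FAKE_RESPONSES = {
    "gemini://raek.se/": ("20", [ORBIT_NEXT]),
    ORBIT_NEXT: ("30", "gemini://hanicef.me/"),
    "gemini://hanicef.me/": ("20", ["/about"]),
}

def crawl_links(url, max_redirects=5):
    """Return the absolute links found on the page that `url` leads to."""
    for _ in range(max_redirects + 1):
        status, payload = FAKE_RESPONSES[url]
        if status.startswith("3"):
            # Follow the redirect and make the target the new base URL.
            url = urljoin(url, payload)
            continue
        # Resolve links against the URL that actually served the page.
        return [urljoin(url, link) for link in payload]
    raise RuntimeError("too many redirects")
```

With this table, `crawl_links(ORBIT_NEXT)` yields `["gemini://hanicef.me/about"]`, whereas resolving against the redirecting URL, `urljoin(ORBIT_NEXT, "/about")`, reproduces the incorrect `gemini://raek.se/about`.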