Output URL are not correctly encoded #142

Open
maaaaz opened this issue Oct 29, 2024 · 5 comments

Comments

@maaaaz

maaaaz commented Oct 29, 2024

Hello there,

I've observed that even the latest version of ODD (v3.1.0.1) does not properly encode URLs in the output file.

Let me detail the case:

  1. First, let's run ODD against a website (randomly found on the internet) that contains some special characters in its paths:
$ ./OpenDirectoryDownloader -u "https://gregoirelorieux.net/paysagescomposes/villes/Melle/" --output-file test
[...]
Finshed indexing
[...]
Saving URL list to file..
Saved URL list to file: /tmp/test.txt
  2. Then let's look at the first results in the output file:
$ head test.txt
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
[...]
  3. If we try to download the first file with wget (or other download managers), it fails because the URL contains unencoded characters: "#" and spaces.
$ wget -v "https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
--2024-10-29 23:22:12--  https://gregoirelorieux.net/paysagescomposes/villes/Melle/
Resolving gregoirelorieux.net (gregoirelorieux.net)... 213.186.33.87
Connecting to gregoirelorieux.net (gregoirelorieux.net)|213.186.33.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 844 [text/html]
Saving to: ‘index.html’

index.html                              100%[===============================================================================>]     844  --.-KB/s    in 0s

2024-10-29 23:22:13 (550 MB/s) - ‘index.html’ saved [844/844]

Here, the downloaded file:

  • is not the requested one: https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
  • but comes from this automatically truncated link: https://gregoirelorieux.net/paysagescomposes/villes/Melle/
    wget drops everything from the first special character onward; here that is "#", which starts the URL fragment and is never sent to the server
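
To illustrate why the request ends at "Melle/", here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. It only shows how a generic URL parser reads the raw line from the output file; it is not wget's actual code, and the variable names are just for illustration.

```python
from urllib.parse import urlsplit

# The raw line exactly as it appears in the ODD output file.
raw_url = ("https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
           "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif")

parts = urlsplit(raw_url)

# The fragment (everything after the first '#') stays on the client side,
# so the URL that is actually requested is the one without it.
requested_url = parts._replace(fragment="").geturl()

print("path:    ", parts.path)
print("fragment:", parts.fragment)
print("fetched: ", requested_url)
```

The intended directory name "#3 21 jan" ends up in the fragment, so the path stops at the directory listing itself.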

The correctly encoded link in the ODD output file should be:
https://gregoirelorieux.net/paysagescomposes/villes/Melle/%233%2021%20jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Instead of:
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Can you fix it?

Something like the encodeURIComponent function, applied to each path segment, should help.
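
For reference, a rough Python sketch of that idea: percent-encode each path segment separately, so that "/" separators survive while "#" and spaces get escaped. `encode_listing_url` is a hypothetical helper, not part of ODD, and it assumes every line has the form scheme://host/path with the whole remainder being a literal path (no real query string or fragment). `urllib.parse.quote(..., safe="")` is close to, but not identical to, JavaScript's encodeURIComponent.

```python
from urllib.parse import quote


def encode_listing_url(raw_url: str) -> str:
    """Percent-encode the path of a raw URL, one segment at a time.

    Hypothetical helper: assumes 'scheme://host/literal/path' input.
    """
    scheme, sep, rest = raw_url.partition("://")
    if not sep:
        raise ValueError(f"not an absolute URL: {raw_url!r}")

    # Everything after the first '/' is treated as a literal path.
    host, slash, raw_path = rest.partition("/")

    # safe="" makes quote() escape '#', ' ', '?' and friends inside a segment;
    # the '/' separators are re-added by the join.
    encoded_path = "/".join(quote(segment, safe="") for segment in raw_path.split("/"))
    return f"{scheme}://{host}{slash}{encoded_path}"


raw = ("https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
       "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif")
print(encode_listing_url(raw))
```

Run on the first line of the output file, this should print the encoded link given above.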

Cheers!

@KoalaBear84
Owner

Hi, thanks for letting me know. I'll try to look at it ASAP 😅

KoalaBear84 added a commit that referenced this issue Oct 31, 2024
@KoalaBear84
Owner

I tried to release a new version with a partial fix, which may be the definitive fix for now, but GitHub won't let me anymore because they deprecated/disabled older build actions. Will continue another time.

@Chaphasilor
Contributor

I think this should be optional (but maybe the default). I've encountered servers in the past that didn't treat encoded URLs the same as the raw URL, seemingly because they didn't decode them (or didn't decode them correctly). Improving the parsing in the downloader itself, or manually passing a quoted URL to it, should work even with URLs that aren't encoded.
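
As a sketch of what "optional, but on by default" could look like, here is a small Python example. `format_output_line` and its `encode` parameter are hypothetical names for illustration only; they are not an existing ODD option or API.

```python
from urllib.parse import quote


def format_output_line(raw_url: str, encode: bool = True) -> str:
    """Return the line to write to the URL list (hypothetical helper).

    encode=True  -> percent-encode everything except ':' and '/'
    encode=False -> keep the raw URL, for servers that mishandle encoded paths
    """
    if not encode:
        return raw_url
    return quote(raw_url, safe=":/")


print(format_output_line("https://example.com/a b/#c"))
print(format_output_line("https://example.com/a b/#c", encode=False))
```

With the default, download managers get a usable URL; turning it off preserves the raw form for the servers described above.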

@maaaaz
Author

maaaaz commented Nov 1, 2024

I think this should be the default, as download managers do not support unencoded URLs.

In the meantime, here is a Python workaround to properly encode the ODD output file:

$ cat script.py
#!/usr/bin/python3

import sys
import urllib.parse

for line in sys.stdin:
    # Percent-encode unsafe characters ('#', spaces, ...) but keep ':' and '/' as-is
    print(urllib.parse.quote(line.strip(), safe=':/'))

$ cat oddresult | python3 script.py

@josephroosen

Thank you for working on this issue! This just burned me on something, and I am glad it was already addressed.

4 participants