Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance download_and_extract #8216

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

Jerome-Hsieh
Copy link

@Jerome-Hsieh Jerome-Hsieh commented Nov 16, 2024

Fixes #5463

Description

According to issue, the error messages are not very intuitive.
I think maybe we can check if the file name matches the downloaded file’s base name before starting the download.
If it doesn’t match, it will notify user.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@ericspod
Copy link
Member

Hi @Jerome-Hsieh If you're having issues with putting this PR together we can discuss how to resolve them, but please do not close and then open a new PR. Thanks for the contribution!

with tempfile.TemporaryDirectory() as tmp_dir:
filename = filepath or Path(tmp_dir, _basename(url)).resolve()
if filepath:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get the idea for this change here. Seems equivalent to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change I want to make sure if user don't set the filepath, the process can still work by using the default path

if filepath:
FilepathExtenstion = ''.join(Path(".", _basename(filepath)).resolve().suffixes)
if urlFilenameExtension != FilepathExtenstion:
raise NotImplementedError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the issue here is that if filepath can directly get the name of the downloaded file, the download_and_extract function would work as expected, rather than only raising an error at this point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but my perspective is that an error with the filepath occurs when the program tries to write the downloaded file to an invalid filepath.
E.g.,filepath='./test' is invalid, filepath='./test.tar.gz' is valid
So I would like to validate the filepath at the very beginning to ensure it is valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know your point here.
Do you think when user specify filepath='./test', it would be better to give a warning and automatically catch the filename?

cc @ericspod @Nic-Ma

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would warn instead of forcing extensions, it takes control away from people to force things like this that aren't strictly necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I have a question.
Is it necessary for the user to specify the file extension when setting the filepath for the downloaded file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user gives a path without extension then you'll get a file with no extension, the user would have to account for that and provide an extension. You could add an extension if none is given (and warn that this happened), but if the user has given a different extension than what's expected, this should be a warning and not an error. If you add an extension you should also be careful not to overwrite an existing file. I realise this is getting rather complicated now, sorry!

monai/apps/utils.py Outdated Show resolved Hide resolved
monai/apps/utils.py Outdated Show resolved Hide resolved
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
Signed-off-by: jerome_Hsieh <[email protected]>
@Jerome-Hsieh
Copy link
Author

@ericspod @KumoLiu
Here's newest change. like we discussed before. I check the filepath every time at the very beginning.
If no extension is provided, the system will automatically add the appropriate file extension with a warning.
A warning will also appear if provided extension doesn't match the downloaded file's extension .

@KumoLiu
Copy link
Contributor

KumoLiu commented Dec 5, 2024

@ericspod @KumoLiu Here's newest change. like we discussed before. I check the filepath every time at the very beginning. If no extension is provided, the system will automatically add the appropriate file extension with a warning. A warning will also appear if provided extension doesn't match the downloaded file's extension .

Hi @Jerome-Hsieh, thank you for the quick update. I tested your latest changes, but neither "." nor "./test" as the file path seems to work. Is this the expected behavior? In my opinion, both should work. The logic could be adjusted so that if the file path is a directory and not empty, it automatically captures the name and downloads the file into the specified directory. What do you think?

url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
monai.apps.utils.download_and_extract(url, filepath="./test")

@Jerome-Hsieh
Copy link
Author

Hi @KumoLiu my test results:

>>> url = "https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz"
>>> monai.apps.utils.download_and_extract(url, filepath=".")
2024-12-06 18:06:56,333 - INFO - Expected md5 is None, skip md5 check for file ..
2024-12-06 18:06:56,334 - INFO - File exists: ., skipped downloading.
2024-12-06 18:06:56,334 - INFO - Non-empty folder exists in ., skipped extracting.

and

>>> monai.apps.utils.download_and_extract(url, filepath="./test")
2024-12-06 18:08:41,246 - WARNING - filepath=./test, which missing file extension. Auto-appending extension to: ./test.tar.gz
test.tar.gz: 59.0MB [00:10, 6.07MB/s]
2024-12-06 18:08:51,443 - INFO - Downloaded: test.tar.gz
2024-12-06 18:08:51,443 - INFO - Expected md5 is None, skip md5 check for file test.tar.gz.
2024-12-06 18:08:51,443 - INFO - Writing into directory: ..

If the results like above, that's expected behavior.

Your opinion sounds like
url=example.com/file.tar.gz, filepath=./test
the program will work :./test/file.tar.gz.
If I don't misunderstanding your thoughts, this method although can automatically captures the name, not only capture the strange filename make user confused if the URL don't have appropriate naming, but also can't allow user set the filename they want.

@mingxin-zheng
Copy link
Contributor

Regarding file downloading from URL, sometimes we can infer the name from the content deposition. Here is one example snippet: https://github.com/Project-MONAI/VLM/blob/7be688aa457a1806f908eb758f2f3ee816fea017/m3/demo/experts/utils.py#L135 @KumoLiu

@Jerome-Hsieh
Copy link
Author

Thank @mingxin-zheng provide information to find file name.
But I want to say that at first my initial approach to this issue that filepath should be a compressed file rather than a directory.
I wanted to focus on making filepath valid. Method provided by @KumoLiu it's great, but which somewhat deviates from my original intent.
Sorry for not providing a clear explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parameter filepath of download_and_extract
4 participants