Removing duplicate files #99

Merged: 5 commits into IDAES:main on Nov 21, 2024

@ksbeattie ksbeattie commented Mar 29, 2024

Remove duplicate files

Addresses #83 which points out that we have duplicate logo.png files in the repo, bloating the released package.

I found a tool (fdupes) that finds duplicate files under a directory (by size and MD5 checksum). I ran it and found a few more duplicates than are mentioned in #83. Specifically:

$ fdupes -r .
./idaes_examples/notebooks/docs/surrogates/pysmo_rbf_surrogate.json
./idaes_examples/notebooks/docs/surrogates/pysmo/pysmo_rbf_surrogate.json

./idaes_examples/notebooks/docs/surrogates/AR_PFD.png
./idaes_examples/notebooks/docs/surrogates/pysmo/AR_PFD.png

./idaes_examples/notebooks/_dev/notebooks/logo.png
./idaes_examples/notebooks/docs/tut/ui/idaes-logo.png
./idaes_examples/notebooks/logo.png

./idaes_examples/notebooks/references.bib
./idaes_examples/notebooks/_dev/notebooks/references.bib

./idaes_examples/archive/ripe/sv.alm
./idaes_examples/archive/ripe/temp.alm

./idaes_examples/archive/dmf/data_management_framework.ipynb
./idaes_examples/archive/dmf/my_workspace/files/928385c3acda4a449412c5bfbbaa83b5/data_management_framework.ipynb

./idaes_examples/archive/power_gen/supercritical/supercritical_steam_cycle.svg
./idaes_examples/notebooks/docs/power_gen/supercritical/supercritical_steam_cycle.svg

./idaes_examples/archive/data_reconciliation/Boiler_scpc_PFD.svg
./idaes_examples/archive/power_gen/supercritical/Boiler_scpc_PFD.svg
./idaes_examples/notebooks/docs/power_gen/supercritical/Boiler_scpc_PFD.svg

Several other files were also found, such as empty __init__.py files and notebooks, but I've removed those from the above list, leaving only the ones that look like they could be cleaned up.
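For readers without fdupes installed, the size-then-checksum idea can be approximated in a few lines of Python. This is only a sketch of the general technique, not how fdupes is implemented; the function names `md5_of` and `find_duplicates` are made up for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that share both size and MD5 digest."""
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Calling `find_duplicates(Path("."))` from the repo root returns one list per group of identical files. Hashing only files whose size collides keeps the scan cheap on large trees.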

This is my attempt at cleaning them up, first by simply removing 2 of the logo.png dupes and seeing what the CI says about that.


Legal Acknowledgement

By contributing to this software project, I agree to the following terms and conditions for my contribution:

I agree my contributions are submitted under the license terms described in the LICENSE.txt file at the top level of this directory.
I represent I am authorized to make the contributions and grant the license. If my employer has rights to intellectual property that includes these contributions, I represent that I have received permission to make contributions and grant the required license on behalf of that employer.

📚 Documentation preview 📚: https://idaes-examples--99.org.readthedocs.build/en/99/

@ksbeattie ksbeattie added the Priority:Normal Normal Priority Issue or PR label Mar 29, 2024
@ksbeattie ksbeattie self-assigned this Mar 29, 2024
@lbianchi-lbl lbianchi-lbl self-requested a review April 4, 2024 18:31
@lbianchi-lbl
Contributor

  • In general I think this is a very good idea, and we should check for duplicate files either regularly (as part of the CI) or periodically
  • We should wait until the CI is fully functional again and make sure we're able to detect missing/broken links before merging
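One possible shape for such a CI check, as a rough sketch: a small script that fails when it finds identical non-empty files. Everything here is an assumption for illustration (the `ALLOWED_NAMES` allowlist, the function names, and the use of SHA-256 instead of MD5); it is not an existing check in this repo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical allowlist: file names that may legitimately repeat.
ALLOWED_NAMES = {"__init__.py"}


def unexpected_duplicates(root: Path) -> list[list[str]]:
    """Return groups of identical, non-empty files not covered by ALLOWED_NAMES."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name in ALLOWED_NAMES:
            continue
        data = path.read_bytes()
        if data:  # skip empty files, which always match each other
            by_digest[hashlib.sha256(data).hexdigest()].append(
                path.relative_to(root).as_posix()
            )
    return [group for group in by_digest.values() if len(group) > 1]


def main(argv: list[str]) -> int:
    """Print duplicate groups and return a non-zero exit code if any exist."""
    root = Path(argv[1]) if len(argv) > 1 else Path(".")
    groups = unexpected_duplicates(root)
    for group in groups:
        print("duplicate files:", ", ".join(group))
    return 1 if groups else 0
```

A CI step could end the script with `sys.exit(main(sys.argv))` so that any new duplicate group turns the job red.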

@dangunter
Member

@lbianchi-lbl and @ksbeattie could we remove the identified files now and deal with the bigger issue separately?

@ksbeattie
Member Author

I've updated this PR from main (which fixed a few of the things listed above) and re-ran fdupes -nSr in idaes_examples/. Here are some questions for the team (and specific people where it seems possible).

(BTW: this skips all the *{_usr,_doc,_solution,_test}.ipynb duplicate files, which do take up a lot of space. Not sure if we can remove any of those.)
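As an illustration of that skip rule, the filter below drops paths whose file name ends in one of the four suffixes from the glob above. It is a sketch that assumes the files are excluded purely by name; `is_derived_notebook` and `drop_derived` are hypothetical helper names.

```python
import re
from pathlib import PurePosixPath

# Suffixes copied from the glob above; matching on the file name alone
# is an assumption of this sketch.
DERIVED_RE = re.compile(r"(?:_usr|_doc|_solution|_test)\.ipynb$")


def is_derived_notebook(path: str) -> bool:
    """True if the file name ends in _usr, _doc, _solution or _test plus .ipynb."""
    return DERIVED_RE.search(PurePosixPath(path).name) is not None


def drop_derived(paths: list[str]) -> list[str]:
    """Remove the skipped notebook variants from a list of paths."""
    return [p for p in paths if not is_derived_notebook(p)]
```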


@bpaul4 - The .mdl_ files look like checkpoint files written when training Keras models. I've removed them in this PR. Are these safe to remove from the repo (and maybe also add to the .gitignore file)?

50022 bytes each:
./notebooks/docs/surrogates/.mdl_wts.keras
./notebooks/docs/surrogates/keras_surrogate/keras_model.keras

47725 bytes each:
./notebooks/docs/surrogates/sco2/omlt/.mdl_co2.keras
./notebooks/docs/surrogates/sco2/omlt/sco2_keras_surr/sco2_keras_model.keras

50022 bytes each:
./notebooks/docs/surrogates/omlt/.mdl_wts.keras
./notebooks/docs/surrogates/omlt/keras_surrogate/keras_model.keras

These are small, so not that much of a problem, but @JavalVyas2000 is there a way to just have one copy of this file?

9143 bytes each:
./notebooks/docs/surrogates/sco2/alamo/flowsheet_optimization.py
./notebooks/docs/surrogates/sco2/pysmo/flowsheet_optimization.py

Neither of these seems to be referenced anywhere that I can see in this repo. @dangunter are they still used?

113565 bytes each:
./notebooks/docs/surrogates/pysmo_rbf_surrogate.json
./notebooks/docs/surrogates/pysmo/pysmo_rbf_surrogate.json
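A quick way to double-check the "not referenced anywhere" observation is to search text files in the repo for the bare file name. The sketch below does that; `find_references` and the `TEXT_SUFFIXES` set are assumptions for illustration, and a plain substring match cannot tell which of two same-named copies a relative path points to.

```python
from pathlib import Path

# Hypothetical set of file types worth searching for references.
TEXT_SUFFIXES = {".ipynb", ".py", ".md", ".yml", ".yaml", ".toml", ".cfg", ".txt"}


def find_references(root: Path, filename: str) -> list[Path]:
    """Return text files under root whose contents mention filename."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        if path.name == filename:
            continue  # the file itself is not a reference
        text = path.read_text(encoding="utf-8", errors="ignore")
        if filename in text:
            hits.append(path)
    return hits
```

An empty result for a name suggests, but does not prove, that the file is unused, since the name could also be built dynamically at run time.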

@dangunter, it looks like this file is used in several notebooks, but only the first copy is referenced? I've removed the second one in this PR.

194781 bytes each:
./notebooks/docs/surrogates/AR_PFD.png
./notebooks/docs/surrogates/pysmo/AR_PFD.png

Is the _dev/ subdir needed here?

1033 bytes each:
./notebooks/references.bib
./notebooks/_dev/notebooks/references.bib

Is the dmf/ dir still needed?

871 bytes each:
./archive/dmf/data_management_framework.ipynb
./archive/dmf/my_workspace/files/928385c3acda4a449412c5bfbbaa83b5/data_management_framework.ipynb

@luohezhiming, can one of these be removed?

372990 bytes each:
./archive/power_gen/supercritical/supercritical_steam_cycle.svg
./notebooks/docs/power_gen/supercritical/supercritical_steam_cycle.svg

@luohezhiming, @dangunter can 2 of these be removed?

275711 bytes each:
./archive/power_gen/supercritical/Boiler_scpc_PFD.svg
./notebooks/docs/power_gen/supercritical/Boiler_scpc_PFD.svg
./archive/data_reconciliation/Boiler_scpc_PFD.svg

@andrewlee94
Member

@JavalVyas2000 is no longer part of the project so probably will not be able to look at those files. Are we certain those are exact duplicates of each other, or are there minor differences between the two? I suspect we probably could find a way to avoid duplication, but I wonder if it is worth the extra overhead.
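Whether two files are exact duplicates can be confirmed independently of fdupes with the standard library: `filecmp.cmp` with `shallow=False` compares contents byte for byte, and `difflib.unified_diff` shows any differences. The helper names below are made up for this sketch.

```python
import difflib
import filecmp
from pathlib import Path


def identical(first: Path, second: Path) -> bool:
    """Compare file contents byte for byte rather than by os.stat() signature."""
    return filecmp.cmp(first, second, shallow=False)


def differences(first: Path, second: Path) -> list[str]:
    """Return a unified diff of two text files; the list is empty if they match."""
    a = first.read_text(encoding="utf-8").splitlines()
    b = second.read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile=str(first), tofile=str(second), lineterm="")
    )
```

For the two flowsheet_optimization.py copies, `identical(...)` answers the yes/no question, and `differences(...)` would list any changed lines.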

@JavalVyas2000
Contributor

@ksbeattie the files you mentioned are essentially the same, but the property package each one uses is different (each folder uses a different surrogate). I am not sure of an elegant way to remove the duplication in this case. I can try to work that out but won't be able to until next week.

@lbianchi-lbl
Contributor

Since the main driver for this PR was to address file size limitations, I'd suggest leaving those .py files as they are, since (1) their impact on the package size is most likely negligible, and (2) there might be extra complications in trying to relocate them to a common location (e.g. issues importing them from notebooks, needing to mess around with sys.path, etc.), making "the juice not worth the squeeze".
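To make the sys.path concern concrete, this is roughly the kind of helper each notebook would need if the script lived in a shared directory. It is a sketch only: `import_shared` and the idea of a common directory are hypothetical, not something this repo provides.

```python
import importlib
import sys
from pathlib import Path


def import_shared(module_name: str, shared_dir: Path):
    """Import module_name from shared_dir by temporarily prepending it to sys.path."""
    shared = str(shared_dir.resolve())
    sys.path.insert(0, shared)
    try:
        importlib.invalidate_caches()  # pick up files created after startup
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(shared)  # avoid leaking the path into later imports
```

Every notebook would also have to compute the right relative path to the shared directory, which is the extra complication mentioned above.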

@bpaul4
Contributor

bpaul4 commented Oct 28, 2024

@bpaul4 - The .mdl_ files look like checkpoint files written when training Keras models. I've removed them in this PR. Are these safe to remove from the repo (and maybe also add to the .gitignore file)?

50022 bytes each:
./notebooks/docs/surrogates/.mdl_wts.keras
./notebooks/docs/surrogates/keras_surrogate/keras_model.keras

47725 bytes each:
./notebooks/docs/surrogates/sco2/omlt/.mdl_co2.keras
./notebooks/docs/surrogates/sco2/omlt/sco2_keras_surr/sco2_keras_model.keras

50022 bytes each:
./notebooks/docs/surrogates/omlt/.mdl_wts.keras
./notebooks/docs/surrogates/omlt/keras_surrogate/keras_model.keras

Yes, these are model checkpoint files that seem to be generated during model training as a place to store history data like mean squared error. I believe it's safe to remove them and ignore them in future commits - it looks like the examples still pass without them and they are regenerated when the examples run.

We definitely need to keep the keras_model.keras files.
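A small cleanup helper along these lines could remove stray checkpoints while leaving the keras_model.keras files untouched. It is a sketch: the `.mdl_*.keras` pattern is inferred from the three file names listed above, and the function names are hypothetical.

```python
from pathlib import Path

# Pattern inferred from .mdl_wts.keras and .mdl_co2.keras above (an assumption).
CHECKPOINT_GLOB = ".mdl_*.keras"


def find_checkpoints(root: Path) -> list[Path]:
    """Return checkpoint files under root; keras_model.keras does not match."""
    return sorted(p for p in root.rglob(CHECKPOINT_GLOB) if p.is_file())


def remove_checkpoints(root: Path, dry_run: bool = True) -> list[Path]:
    """List matching checkpoints, deleting them only when dry_run is False."""
    found = find_checkpoints(root)
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

The same glob, `.mdl_*.keras`, would presumably also work as a .gitignore entry, since the saved models use names that do not start with `.mdl_`.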

@ksbeattie ksbeattie marked this pull request as ready for review November 15, 2024 01:26
@lbianchi-lbl lbianchi-lbl merged commit f7893e7 into IDAES:main Nov 21, 2024
11 checks passed