Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Colab as a base image. #1444

Merged
merged 12 commits into from
Nov 27, 2024
Merged

Use Colab as a base image. #1444

merged 12 commits into from
Nov 27, 2024

Conversation

djherbis
Copy link
Contributor

This change makes a number of major changes:

  • Colab is the base image
  • uv is the main package install tool
  • leveraging requirements.txt instead of many separate installs
  • stop building and installing tensorflow/torch/lightbgm/jax since those are managed by the Colab base image now

In order to decide what packages to explicitly install I:

  • looked at what packages are in the Colab base image
  • looked at what packages were in the Kaggle image
  • looked at what packages were explicitly mentioned in Kaggle Dockerfile

This may still take a few iterations to get all the right parts in the image, but this should hopefully make the image much more manageable.

http://b/365782129

This change makes a number of major changes:
- Colab is the base image
- uv is the main package install tool
- leveraging requirements.txt instead of many separate installs
- stop building and installing tensorflow/torch/lightbgm/jax since those
  are managed by the Colab base image now

In order to decide what packages to explicitly install I:
- looked at what packages are in the Colab base image
- looked at what packages were in the Kaggle image
- looked at what packages were explicitly mentioned in Kaggle Dockerfile

This may still take a few iterations to get all the right parts in the
image, but this should hopefully make the image much more manageable.

http://b/365782129
}
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are no longer needed since they are installed in the base image.

@@ -3,7 +3,7 @@ set -e

IMAGE_TAG='kaggle/python-build'
IMAGE_TAG_OVERRIDE=''
ADDITONAL_OPTS=''
ADDITONAL_OPTS='--runtime runc ' # Use the CPU runtime by default
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Colab image fails if using the nvidia runtime (which Jenkins does by default) while on a CPU VM.

learn = tabular_learner(dls, layers=[200, 100])
learn.fit_one_cycle(n_epoch=1)
with learn.no_bar():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fastprogress was trying to render html while running in the background and it caused issues, so I turned off the progress bar rendering.

Copy link
Contributor

@rosbo rosbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work on this!

CUDA_MAJOR_VERSION=12
CUDA_MINOR_VERSION=3
CUDA_MINOR_VERSION=2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this downgrade is to align with Colab?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Colab is on 12.2, so we are too now. Sucks to go backwards, but better that we're aligned.

@@ -34,7 +34,9 @@ def test_cpu(self):

self.assertEqual(1, gbm.best_iteration)

# TODO(b/381256047): Colab needs to install GPU-enabled lightgbm.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you consider this a launch blocker for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how popular lightgbm (on GPU) is, I'm planning to have it be a launch blocker.
I'm working with the Colab team to get it in there :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-added it in #1451

Colab is also working on adding it to the base so we can re-drop once its out in their image

@djherbis djherbis merged commit 66bac48 into main Nov 27, 2024
4 checks passed
@djherbis djherbis deleted the uv-colab-base-image branch November 27, 2024 22:05
# b/183041606#comment5: the Kaggle data proxy doesn't support these APIs. If the library is missing, it falls back to using a regular BigQuery query to fetch data.
RUN uv pip uninstall --system google-cloud-bigquery-storage

# NOTE(herbison): uv fails to install this for some reason
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer a bug over ldap, unless you intend to change it to a uv install soon.

Copy link
Contributor Author

@djherbis djherbis Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this in #1448.

# NOTE(herbison): uv fails to install this for some reason
RUN pip install git+https://github.com/Kaggle/learntools

# We install an incompatible pair of libs (shapely<, libpysal==4.9.2) so we can't put this one in the requirements.txt
Copy link
Contributor

@calderjo calderjo Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mention that it can't be added, but it's still in the requirements.txt:

https://github.com/Kaggle/docker-python/pull/1444/files#diff-2fb70e5756d0842b40d8bbe8b41a8be4d3a68c7d98fdaac2914b1dbf43638ed3R75-R76

did you intend on removing it?

also be nice to drop the related bug number here too: b/328788268

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not fully pinned in the requirements.txt, its a "<=4.9.2" there, but then we "upgrade it to "==4.9.2" here.

/tmp/clean-layer.sh
{{ end}}

# Install PyTorch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳🎉

@@ -0,0 +1,139 @@
altair>=5.4.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super optional: seem sorta alphabetically ordered, is this intentional?
do we want to add a comment at the top to enforce it? ngl it is easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I originally created this list by doing pip freeze on Colab and on Kaggle, then diffing the results and taking the diff subset, and then only including explicitly mentioned packages from our Dockerfile.

It is roughly alpha ordered because of pip freeze (caps is going above other things, caps probably shouldn't matter, I'm not sure why some packages are capitalized). I'll add a comment in my follow-up.

vtk
wandb
wavio
xgboost==2.0.3
Copy link
Contributor

@calderjo calderjo Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe include " b/350573866: xgboost v2.1.0 breaks learntools " comment from before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've been adding back pins and bug info in #1448

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants