-
Notifications
You must be signed in to change notification settings - Fork 957
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Colab as a base image. #1444
Conversation
This change makes a number of major changes: - Colab is the base image - uv is the main package install tool - leveraging requirements.txt instead of many separate installs - stop building and installing tensorflow/torch/lightbgm/jax since those are managed by the Colab base image now In order to decide what packages to explicitly install I: - looked at what packages are in the Colab base image - looked at what packages were in the Kaggle image - looked at what packages were explicitly mentioned in Kaggle Dockerfile This may still take a few iterations to get all the right parts in the image, but this should hopefully make the image much more manageable. http://b/365782129
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are no longer needed since they are installed in the base image.
@@ -3,7 +3,7 @@ set -e | |||
|
|||
IMAGE_TAG='kaggle/python-build' | |||
IMAGE_TAG_OVERRIDE='' | |||
ADDITONAL_OPTS='' | |||
ADDITONAL_OPTS='--runtime runc ' # Use the CPU runtime by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Colab image fails if using the nvidia runtime (which Jenkins does by default) while on a CPU VM.
learn = tabular_learner(dls, layers=[200, 100]) | ||
learn.fit_one_cycle(n_epoch=1) | ||
with learn.no_bar(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fastprogress was trying to render html while running in the background and it caused issues, so I turned off the progress bar rendering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work on this!
CUDA_MAJOR_VERSION=12 | ||
CUDA_MINOR_VERSION=3 | ||
CUDA_MINOR_VERSION=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this downgrade is to align with Colab?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Colab is on 12.2, so we are too now. Sucks to go backwards, but better that we're aligned.
@@ -34,7 +34,9 @@ def test_cpu(self): | |||
|
|||
self.assertEqual(1, gbm.best_iteration) | |||
|
|||
# TODO(b/381256047): Colab needs to install GPU-enabled lightgbm. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you consider this a launch blocker for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how popular lightgbm (on GPU) is, I'm planning to have it be a launch blocker.
I'm working with the Colab team to get it in there :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-added it in #1451
Colab is also working on adding it to the base so we can re-drop once its out in their image
# b/183041606#comment5: the Kaggle data proxy doesn't support these APIs. If the library is missing, it falls back to using a regular BigQuery query to fetch data. | ||
RUN uv pip uninstall --system google-cloud-bigquery-storage | ||
|
||
# NOTE(herbison): uv fails to install this for some reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer a bug over ldap, unless you intend to change it to a uv install soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix this in #1448.
# NOTE(herbison): uv fails to install this for some reason | ||
RUN pip install git+https://github.com/Kaggle/learntools | ||
|
||
# We install an incompatible pair of libs (shapely<, libpysal==4.9.2) so we can't put this one in the requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mention that it can't be added, but it's still in the requirements.txt:
did you intend on removing it?
also be nice to drop the related bug number here too: b/328788268
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not fully pinned in the requirements.txt, its a "<=4.9.2" there, but then we "upgrade it to "==4.9.2" here.
/tmp/clean-layer.sh | ||
{{ end}} | ||
|
||
# Install PyTorch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥳🎉
@@ -0,0 +1,139 @@ | |||
altair>=5.4.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super optional: seem sorta alphabetically ordered, is this intentional?
do we want to add a comment at the top to enforce it? ngl it is easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I originally created this list by doing pip freeze
on Colab and on Kaggle, then diffing the results and taking the diff subset, and then only including explicitly mentioned packages from our Dockerfile.
It is roughly alpha ordered because of pip freeze (caps is going above other things, caps probably shouldn't matter, I'm not sure why some packages are capitalized). I'll add a comment in my follow-up.
vtk | ||
wandb | ||
wavio | ||
xgboost==2.0.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe include " b/350573866: xgboost v2.1.0 breaks learntools " comment from before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I've been adding back pins and bug info in #1448
This change makes a number of major changes:
In order to decide what packages to explicitly install I:
This may still take a few iterations to get all the right parts in the image, but this should hopefully make the image much more manageable.
http://b/365782129