Added output cells to these notebooks #1259
base: dev
Conversation
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
```diff
@@ -1,4 +1,4 @@
-data-prep-toolkit-transforms[ray,all]==1.1.1.dev0
+data-prep-toolkit-transforms[ray,all]==1.1.1.dev1
```
Now that 1.1.1 is released, please consider using it.
Done.
Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:

```python
from dpk_docling2parquet.ray import Docling2Parquet
```

instead of:

```python
from dpk_docling2parquet.ray.transform import Docling2Parquet
```
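As a sketch of why the shorter import stays stable: the subpackage's `__init__.py` can re-export the class, so the public import path no longer depends on the internal module name. The throwaway package below is purely hypothetical, built on disk to demonstrate the re-export pattern; the actual `dpk_docling2parquet` layout may differ.

```python
import pathlib
import sys
import tempfile

# Build a hypothetical package layout mirroring the suggestion:
#   demo_pkg/ray/transform.py   defines Docling2Parquet
#   demo_pkg/ray/__init__.py    re-exports it at the subpackage level
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
(pkg / "ray").mkdir(parents=True)
(pkg / "__init__.py").write_text("")
(pkg / "ray" / "transform.py").write_text(
    "class Docling2Parquet:\n    pass\n"
)
# The re-export: callers never need to know about transform.py
(pkg / "ray" / "__init__.py").write_text(
    "from .transform import Docling2Parquet\n"
)

sys.path.insert(0, str(root))
# Notebooks use only the stable public path; transform.py can later be
# renamed or split without breaking this import.
from demo_pkg.ray import Docling2Parquet
print(Docling2Parquet.__name__)
```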
Done.
Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without breaking the notebook:

```python
from dpk_doc_id.ray import DocID
```

instead of:

```python
from dpk_doc_id.ray.transform import DocID
```
Done
Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:

```python
from dpk_docling2parquet import Docling2Parquet
```

instead of:

```python
from dpk_docling2parquet.transform_python import Docling2Parquet
```
Done
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@touma-I This PR is only about the output cells of the pdfprocessing in the examples folder.
Cell #2: Consider adding file_utils.py to the util folder so you get it with pip install. This will simplify the notebook when running with and without Colab.
Can you provide some explanation of cell #3 and the reason for having to recreate a conda environment?
We won't need cell #5 if the utils are part of the pip install.
Cell #6: We should always be downloading the data files (with or without Colab). We should not assume that we can get to the files directly. This will make the notebook easier to support and maintain.
There are some comments that would be better explained in the markdown, for example:

```python
# set up a sandbox env to avoid conflicts with Colab libraries
!pip install -q condacolab
import condacolab
condacolab.install()
!conda create -n my_env python=3.11 -y
!conda activate my_env

# to install everything, use 'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0'
!pip install --default-timeout=100 \
    'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0' \
    humanfriendly

# terminate the current kernel, so we restart the runtime
os.kill(os.getpid(), 9)
# restart the session
```

Since some cells contain "grayed out" (commented-out) code, it helps not to rely on important statements being buried in comments, e.g.:

```python
print("Input data dimensions (rows x columns) = ", input_df.shape)
print("Output data dimensions (rows x columns) = ", output_df.shape)
```
touma-I
left a comment
Several suggestions to streamline the notebook:
- Submit a PR to add the utils to the package so they are installed via pip install, since they can be used by all the notebooks.
- data-files is a shared bucket for all the examples to use. Consider always downloading the specific data file from git (don't assume a local copy obtained via git clone).
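A minimal sketch of the "always download" suggestion. The URL below is an assumption, not the notebook's real path; substitute the actual raw location of the data file the notebook uses.

```python
import os
import urllib.request

# Hypothetical raw URL -- replace with the real data-files path in the repo.
RAW_URL = ("https://raw.githubusercontent.com/<org>/<repo>/"
           "dev/examples/data-files/example.pdf")

def fetch(url: str, dest: str) -> str:
    """Download url to dest if not already present; return the local path.

    Works identically on Colab and on a local machine, so the notebook
    never has to assume a git clone is available.
    """
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Usage in a notebook cell (commented out here to avoid a network call):
# local_path = fetch(RAW_URL, "input/example.pdf")
```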
@sujee As you can see from the comments by Maroun, he is asking for more than just adding the output cells that I have done.
swith005
left a comment
Encountered some errors with ray in Colab; please review.
When running Step-4: Extract Data from PDF (docling2parquet) in Google Colab, I got:

```
21:54:48 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-05-13 21:54:48,751 ERROR services.py:1350 -- Failed to start the dashboard , return code 1
2025-05-13 21:54:48,753 ERROR services.py:1375 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-05-13 21:54:48,754 ERROR services.py:1385 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2025-05-13 21:54:48,755 ERROR services.py:1419 --
The last 20 lines of /tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.err (it contains the error message from the dashboard):
  File "/usr/local/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 11, in <module>
    import ray._private.ray_constants as ray_constants
ModuleNotFoundError: No module named 'ray'
```

Is ray missing from the installation?
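One way to diagnose this before launching the transform is to check whether `ray` is importable by the active kernel; a condacolab restart can leave the notebook on an interpreter whose packages were installed into a different environment. This is a hedged sketch, not part of the notebook:

```python
import importlib.util
import sys

def ray_available() -> bool:
    """Return True if the ray package can be found by the current interpreter."""
    return importlib.util.find_spec("ray") is not None

# If this prints a warning, ray was likely installed into a different
# environment than the one the kernel is running (check sys.executable).
if not ray_available():
    print(f"ray not found for interpreter {sys.executable}; "
          "reinstall with: pip install 'data-prep-toolkit-transforms[ray,all]'")
```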
@swith005 There are two issues here:
Looks like the notebook is using `==1.1.1.dev0`; it would be better to use the latest (1.1.1).
swith005
left a comment
import ray issues and version
@swith005 These two notebooks install DPK modules by pip installing requirements.txt in the local environment, and if you look at the requirements.txt in the PR, you will see that it is using 1.1.1 and not 1.1.1.dev0. When running on Google Colab, please refer to my note above for instructions on opening a notebook that is still in PR. I see that the Python version uses 1.1.1 and the ray version uses 1.1.1.dev1, which should be changed to 1.1.1.
@sujee @shahrokhDaijavad There is a lot of good stuff in this notebook, and it would be nice if we can streamline things so others can use it as a template for their work. Right now, I am worried that only a few of us understand how it works, and we should try to streamline it so it is easier for others to consume. A few things we discussed before that we may want to take action on, assuming this is the last iteration we do on this notebook:
Addressing issue #1258