Conversation

@shahrokhDaijavad

Addressing issue #1258

@shahrokhDaijavad shahrokhDaijavad requested a review from touma-I May 8, 2025 23:38
@@ -1,4 +1,4 @@
data-prep-toolkit-transforms[ray,all]==1.1.1.dev0
data-prep-toolkit-transforms[ray,all]==1.1.1.dev1
Collaborator

Now that 1.1.1 is released, please consider using it.

Collaborator Author

Done.

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:
from dpk_docling2parquet.ray import Docling2Parquet
instead of:
from dpk_docling2parquet.ray.transform import Docling2Parquet
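A minimal sketch of the re-export pattern behind this suggestion: the package's `__init__.py` re-exports the class, so callers never depend on the internal module path. Here `dpk_demo` is a hypothetical package name built in a temp directory purely for illustration; the real dpk packages may organize this differently:

```python
import os
import sys
import tempfile

# Build a throwaway package on disk to demonstrate the pattern.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "dpk_demo")
os.makedirs(pkg_dir)

# transform.py is the internal module holding the implementation.
with open(os.path.join(pkg_dir, "transform.py"), "w") as f:
    f.write("class Docling2Parquet:\n    pass\n")

# __init__.py re-exports the public name at the package root, so the
# internal layout can change without breaking notebooks.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .transform import Docling2Parquet\n")

sys.path.insert(0, pkg_root)
from dpk_demo import Docling2Parquet  # stable public import path

print(Docling2Parquet.__name__)
```

With this in place, `from dpk_demo.transform import Docling2Parquet` still works, but users can stay on the shorter, stable import.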

Collaborator Author

Done.

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without breaking the notebook:
from dpk_doc_id.ray import DocID
instead of:
from dpk_doc_id.ray.transform import DocID

Collaborator Author

Done

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:
from dpk_docling2parquet import Docling2Parquet
instead of:
from dpk_docling2parquet.transform_python import Docling2Parquet

Collaborator Author

Done

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad shahrokhDaijavad requested a review from touma-I May 12, 2025 23:48
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad

@touma-I This PR is only about the output cells of the pdfprocessing notebooks in the examples folder.

Collaborator

Cell #2: consider adding file_utils.py to the util folder so you get it with pip install. This will simplify the notebook when running with and without Colab.

Collaborator

Can you provide some explanation on cell #3 and the reason for having to recreate a conda environment?

Collaborator

We won't need cell #5 if the utils are part of the pip install.

Collaborator

Cell #6: we should always be downloading the data files (with or without Colab). We should not assume that we can get to the files directly. This will make the notebook easier to support and maintain.
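An unconditional download along these lines would make the notebook independent of a local clone; `fetch_if_missing` and the raw-file URL below are hypothetical illustrations, not existing toolkit code:

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists locally."""
    if not os.path.exists(dest):
        d = os.path.dirname(dest)
        if d:
            os.makedirs(d, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest

# Hypothetical raw-file URL; the real path depends on the repo layout:
# fetch_if_missing(
#     "https://raw.githubusercontent.com/<org>/<repo>/dev/examples/data-files/sample.pdf",
#     "input/sample.pdf",
# )
```

Because the helper is a no-op when the file is already present, the same cell works both in Colab and in a locally cloned repo.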

Collaborator

There are some comments that would be better explained in the markdown, for example:

"setup a sandbox env to avoid conflicts with Colab libraries"

!pip install -q condacolab
import condacolab
condacolab.install()
!conda create -n my_env python=3.11 -y
!conda activate my_env

"to install everything, use 'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0'"

!pip install --default-timeout=100 'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0' humanfriendly

"terminate the current kernel, so we restart the runtime"

os.kill(os.getpid(), 9)

"restart the session"

Since some cells contain "grayed out" (commented-out) code, it helps not to have important statements buried in comments where they can be missed, e.g.:

print("Input data dimensions (rows x columns) = ", input_df.shape)

print("Output data dimensions (rows x columns) = ", output_df.shape)

@touma-I touma-I left a comment

Several suggestions to streamline the notebook:

  • Submit a PR to add the utils to the package so they are installed via pip install, since they can be used by all the notebooks
  • data-files is a bucket for all the examples to use. Consider always downloading the specific data file from git (don't assume we are working with a local copy obtained via git clone)

@touma-I touma-I requested a review from swith005 May 13, 2025 19:53
@shahrokhDaijavad

@sujee As you can see from the comments by Maroun, he is asking for more than just adding the output cells that I have done.

@swith005 swith005 left a comment

Encountered some errors with Ray in Colab; please review.

Collaborator

When running Step-4: Extract Data from PDF (docling2parquet) in Google Colab, I got:
21:54:48 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-05-13 21:54:48,751 ERROR services.py:1350 -- Failed to start the dashboard , return code 1
2025-05-13 21:54:48,753 ERROR services.py:1375 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-05-13 21:54:48,754 ERROR services.py:1385 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2025-05-13 21:54:48,755 ERROR services.py:1419 --
The last 20 lines of /tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.err (it contains the error message from the dashboard):
File "/usr/local/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 11, in <module>
import ray._private.ray_constants as ray_constants
ModuleNotFoundError: No module named 'ray'

Is ray missing from the installation?


@shahrokhDaijavad

@swith005 There are two issues here:

  1. If you click on the "Google Colab" icon in the notebook, it opens the version of the notebook that is in the dev branch, not the version that is in this PR! If you want to test a notebook that is still in a PR on Google Colab, you have to start Colab in the browser and then upload the notebook (from the PR) to Colab manually.
  2. Even if you do this, there is no guarantee that the Ray version runs successfully on Colab, but I think from the error you got above, the problem is number 1.

Collaborator

Looks like the notebook is using ==1.1.1.dev0; it would be better to use the latest (1.1.1).

Collaborator

Looks like the notebook is using ==1.1.1.dev0; it would be better to use the latest (1.1.1).

@swith005 swith005 left a comment

import ray issues and version

@shahrokhDaijavad

@swith005 These two notebooks install DPK modules by pip installing requirements.txt in the local environment, and if you look at the requirements.txt in the PR, you will see that it uses 1.1.1, not 1.1.1.dev0. When running on Google Colab, please refer to my note above for instructions on opening a notebook that is still in a PR. I see that the Python version uses 1.1.1, while the Ray version uses 1.1.1.dev1, which should be changed to 1.1.1.

@touma-I

touma-I commented May 15, 2025

> @sujee As you can see from the comments by Maroun, he is asking for more than just adding the output cells that I have done.

@sujee @shahrokhDaijavad there is a lot of good stuff in this notebook, and it would be nice if we can streamline things so others can use it as a template for their work. Right now, I am worried that only a few of us understand how it works, and we should try to streamline it so it is easier for others to consume. A few things we discussed before that we may want to take action on, assuming this is the last iteration we do on this notebook:

  • Let's get rid of the wget of utils.py and either add it to the notebook itself or submit a PR to the data-processing-lib util library if we feel those services are widely needed in other notebooks

  • There are a lot of special things going on for Colab vs non-Colab. It would be nice if we can streamline this. From my experience, if we build it for Colab, then we can run it as-is in any environment. If this is not a correct assumption, please let me know where the Colab extensions break the notebook when running in a local environment.

  • Get rid of requirements.txt and add the dependencies to the notebook, regardless of Colab or no Colab.

  • Can you help me understand why we are using conda?
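On the Colab vs non-Colab point, one common way to keep a single code path is to branch once on a runtime probe and keep the rest of the notebook identical. A minimal sketch: `in_colab` is a hypothetical helper name, using the widely used `google.colab` import check:

```python
def in_colab() -> bool:
    """True when running inside a Google Colab runtime."""
    try:
        import google.colab  # noqa: F401 -- only importable inside Colab
        return True
    except ImportError:
        return False

# Branch once; everything after this cell runs the same in both environments.
if in_colab():
    pass  # Colab-only setup (pip installs, runtime restart) would go here
```

Concentrating all Colab-specific setup behind one check keeps the data-processing cells themselves environment-agnostic.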
