Conversation

@shahrokhDaijavad

Addressing issue #1258

@shahrokhDaijavad shahrokhDaijavad requested a review from touma-I May 8, 2025 23:38
@@ -1,4 +1,4 @@
data-prep-toolkit-transforms[ray,all]==1.1.1.dev0
data-prep-toolkit-transforms[ray,all]==1.1.1.dev1
Collaborator

Now that 1.1.1 is released, please consider using it.

Collaborator Author

Done.

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:
from dpk_docling2parquet.ray import Docling2Parquet
instead of:
from dpk_docling2parquet.ray.transform import Docling2Parquet
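A minimal sketch of the re-export pattern behind this suggestion: the package's `__init__.py` re-exports the class, so callers never depend on the internal module path. Here `dpk_demo` is a hypothetical package name built in a temp directory purely for illustration; the real dpk packages may organize this differently:

```python
import os
import sys
import tempfile

# Build a throwaway package on disk to demonstrate the pattern.
pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "dpk_demo")
os.makedirs(pkg_dir)

# transform.py is the internal module holding the implementation.
with open(os.path.join(pkg_dir, "transform.py"), "w") as f:
    f.write("class Docling2Parquet:\n    pass\n")

# __init__.py re-exports the public name at the package root, so the
# internal layout can change without breaking notebooks.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .transform import Docling2Parquet\n")

sys.path.insert(0, pkg_root)
from dpk_demo import Docling2Parquet  # stable public import path

print(Docling2Parquet.__name__)
```

With this in place, `from dpk_demo.transform import Docling2Parquet` still works, but users can stay on the shorter, stable import.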

Collaborator Author

Done.

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without breaking the notebook:
from dpk_doc_id.ray import DocID
instead of:
from dpk_doc_id.ray.transform import DocID

Collaborator Author

Done

Collaborator

Consider changing imports as below. This will allow us to modify the internal structure of the transform in the future without having to rework the notebook:
from dpk_docling2parquet import Docling2Parquet
instead of:
from dpk_docling2parquet.transform_python import Docling2Parquet

Collaborator Author

Done

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad shahrokhDaijavad requested a review from touma-I May 12, 2025 23:48
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad

@touma-I This PR is only about the output cells of the pdfprocessing notebooks in the examples folder.

Collaborator

Cell #2: consider adding file_utils.py to the util folder so you get it with pip install. This will simplify the notebook when running with and without Colab.

Collaborator

Can you provide some explanation on cell #3 and the reason for having to recreate a conda environment?

Collaborator

We won't need cell #5 if the utils are part of the pip install.

Collaborator

Cell #6: we should always be downloading the data files (with or without Colab). We should not assume that we can get to the files directly. This will make the notebook easier to support and maintain.
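An unconditional download along these lines would make the notebook independent of a local clone; `fetch_if_missing` and the raw-file URL below are hypothetical illustrations, not existing toolkit code:

```python
import os
import urllib.request

def fetch_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists locally."""
    if not os.path.exists(dest):
        d = os.path.dirname(dest)
        if d:
            os.makedirs(d, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest

# Hypothetical raw-file URL; the real path depends on the repo layout:
# fetch_if_missing(
#     "https://raw.githubusercontent.com/<org>/<repo>/dev/examples/data-files/sample.pdf",
#     "input/sample.pdf",
# )
```

Because the helper is a no-op when the file is already present, the same cell works both in Colab and in a locally cloned repo.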

Collaborator

There are some comments that would be better explained in the markdown, for example:

"setup a sandbox env to avoid conflicts with Colab libraries"

!pip install -q condacolab
import condacolab
condacolab.install()
!conda create -n my_env python=3.11 -y
!conda activate my_env

"to install everything, use 'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0'"

!pip install --default-timeout=100 'data-prep-toolkit-transforms[ray,all]==1.1.1.dev0' humanfriendly

"terminate the current kernel, so we restart the runtime"

os.kill(os.getpid(), 9)

"restart the session"

Since some cells contain "grayed out" (commented-out) code, it helps not to have important statements buried in comments where they can be missed, e.g.:

print("Input data dimensions (rows x columns) = ", input_df.shape)

print("Output data dimensions (rows x columns) = ", output_df.shape)

@touma-I touma-I left a comment

Several suggestions to streamline the notebook:

  • Submit a PR to add the utils to the package so they are installed via pip install, since they can be used by all the notebooks
  • data-files is a bucket for all the examples to use. Consider always downloading the specific data file from git (don't assume we are working with a local copy obtained via git clone)

@touma-I touma-I requested a review from swith005 May 13, 2025 19:53
@shahrokhDaijavad

@sujee As you can see from the comments by Maroun, he is asking for more than just adding the output cells that I have done.

@swith005 swith005 left a comment

Encountered some errors with Ray in Colab; please review.

Collaborator

When running Step-4: Extract Data from PDF (docling2parquet) in Google Colab, I got:
21:54:48 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-05-13 21:54:48,751 ERROR services.py:1350 -- Failed to start the dashboard , return code 1
2025-05-13 21:54:48,753 ERROR services.py:1375 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure' to find where the log file is.
2025-05-13 21:54:48,754 ERROR services.py:1385 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2025-05-13 21:54:48,755 ERROR services.py:1419 --
The last 20 lines of /tmp/ray/session_2025-05-13_21-54-48_187893_5035/logs/dashboard.err (it contains the error message from the dashboard):
File "/usr/local/lib/python3.11/site-packages/ray/dashboard/dashboard.py", line 11, in <module>
import ray._private.ray_constants as ray_constants
ModuleNotFoundError: No module named 'ray'

Is ray missing from the installation?


@shahrokhDaijavad

@swith005 There are two issues here:

  1. If you click on the "Google Colab" icon in the notebook, it opens the version of the notebook that is in the dev branch, not the version that is in this PR! If you want to test a notebook that is still in a PR on Google Colab, you have to start Colab in the browser and then upload the notebook (from the PR) to Colab manually.
  2. Even if you do this, there is no guarantee that the Ray version runs successfully on Colab, but I think from the error you got above, the problem is number 1.

Collaborator

Looks like the notebook is using ==1.1.1.dev0; it would be better to use the latest (1.1.1).

Collaborator

Looks like the notebook is using ==1.1.1.dev0; it would be better to use the latest (1.1.1).

@swith005 swith005 left a comment

import ray issues and version

@shahrokhDaijavad

@swith005 These two notebooks install DPK modules by pip installing requirements.txt in the local environment, and if you look at the requirements.txt in the PR, you will see that it uses 1.1.1, not 1.1.1.dev0. When running on Google Colab, please refer to my note above for instructions on opening a notebook that is still in a PR. I see that the Python version uses 1.1.1, while the Ray version uses 1.1.1.dev1, which should be changed to 1.1.1.

@touma-I

touma-I commented May 15, 2025

> @sujee As you can see from the comments by Maroun, he is asking for more than just adding the output cells that I have done.

@sujee @shahrokhDaijavad there is a lot of good stuff in this notebook, and it would be nice if we can streamline things so others can use it as a template for their work. Right now, I am worried that only a few of us understand how it works, and we should try to streamline it so it is easier for others to consume. A few things we discussed before that we may want to take action on, assuming this is the last iteration we do on this notebook:

  • Let's get rid of the wget of utils.py and either add it to the notebook itself or submit a PR to the data-processing-lib util library if we feel those services are widely needed in other notebooks

  • There are a lot of special things going on for Colab vs non-Colab. It would be nice if we can streamline this. From my experience, if we build it for Colab, then we can run it as-is in any environment. If this is not a correct assumption, please let me know where the Colab extensions break the notebook when running in a local environment.

  • Get rid of requirements.txt and add the dependencies to the notebook, regardless of Colab or no Colab.

  • Can you help me understand why we are using conda?
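On the Colab vs non-Colab point, one common way to keep a single code path is to branch once on a runtime probe and keep the rest of the notebook identical. A minimal sketch: `in_colab` is a hypothetical helper name, using the widely used `google.colab` import check:

```python
def in_colab() -> bool:
    """True when running inside a Google Colab runtime."""
    try:
        import google.colab  # noqa: F401 -- only importable inside Colab
        return True
    except ImportError:
        return False

# Branch once; everything after this cell runs the same in both environments.
if in_colab():
    pass  # Colab-only setup (pip installs, runtime restart) would go here
```

Concentrating all Colab-specific setup behind one check keeps the data-processing cells themselves environment-agnostic.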
