Skip to content

Conversation

@anna-charlotte
Copy link
Contributor

@anna-charlotte anna-charlotte commented Feb 28, 2023

Goals:

Add map_docs function and map_batch:

docs = list(map_docs(da, lambda x : x, num_worker=4 ))

This will be different from doing

for doc in da:
    doc = f(doc)
  • leverage multiprocessing by benchmarking in test, check that using 2 CPUS is faster then using 1

  • map_docs()

  • map_docs_batch()

  • benchmarking tests

  • check and update documentation, if required. See guide

@JohannesMessner
Copy link
Member

Is apply_batched also in scope for this issue? And map/map_batched?

func: Callable[[BaseDocument], BaseDocument],
num_worker: Optional[int] = None,
pool: Optional['Pool'] = None,
show_progress: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a backend options to switch between multi-processing and multi-threading?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understood we only wanted to keep multiprocessing, not multithreading @samsja

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoanFM maybe know more. But to me only multi processing makes sense. Can multi threading really improve performance here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the keep the PR discussion up to date with what we discussed on Discord: there will be multi threading since it makes sense for IO bound ops and tf/np/torch stuff

@anna-charlotte
Copy link
Contributor Author

Is apply_batched also in scope for this issue? And map/map_batched?

Map is already included, but private right now. Not sure if we want to expose it again, or only one of the two.
The batched one, I can add in this PR, but right now I am still trying to get the benchmark to work correctly. Test is passing in the CI, but only because it is being skipped I think, because there in only 1 CPU available I think, therefore no performance comparison with 1 cpu vs 2 cpus.

Copy link
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! We will make the map functions public, right?

@anna-charlotte
Copy link
Contributor Author

Looks good! We will make the map functions public, right?

I dont really have an opinion on this. @samsja suggested to keep only one of the two exposed, map or apply. Why would you prefer this @samsja

@anna-charlotte anna-charlotte marked this pull request as ready for review March 3, 2023 09:38
@JohannesMessner
Copy link
Member

Looks good! We will make the map functions public, right?

I dont really have an opinion on this. @samsja suggested to keep only one of the two exposed, map or apply. Why would you prefer this @samsja

Why should we keep only one @samsja? I think both make sense, the user might already have an in-place or pure function that they want to use without rewriting

anna-charlotte added 17 commits March 3, 2023 14:30
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
@anna-charlotte anna-charlotte changed the title feat: add apply function feat: add map function Mar 3, 2023
anna-charlotte added 2 commits March 3, 2023 15:15
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
anna-charlotte added 2 commits March 3, 2023 15:51
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
@github-actions
Copy link

github-actions bot commented Mar 3, 2023

📝 Docs are deployed on https://ft-feat-map-apply--jina-docs.netlify.app 🎉

@anna-charlotte anna-charlotte merged commit 6cd5912 into feat-rewrite-v2 Mar 3, 2023
@anna-charlotte anna-charlotte deleted the feat-map-apply branch March 3, 2023 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Map and apply with multiprocessing

4 participants