[BUG] DataCatalog problem with ThreadRunner on kedro >=0.19.7 #4191
Comments
I'm encountering the same issue while trying to run a pipeline in Databricks after upgrading from kedro 0.19.3.
Hey @andresrq-mckinsey and @cramirez98, thanks for reporting this! I've added this to our backlog. Also cc'ing @ElenaKhaustova to see whether any changes in the new catalog already address this.
Hey all! Thank you for reporting the issue. As previously discussed in the support channel, this bug will be solved once the new catalog is released. Currently one can use the main branch with
There's also a temporary alternative: stick to the older Kedro version.
@ElenaKhaustova shall we mark this as resolved and close it, or does it make sense to keep it open?
Let's keep it open until the release at least, in case we decide to fix it for the old catalog.
The initial task is to pin down when this started failing (which version) and at what point in the code.
@cramirez98 @andresrq-mckinsey Can you confirm whether you are using the dataset factory pattern? Is this a pure version upgrade, or did you update code at the same time? I created a separate issue: #4191. My finding is that this seems to be linked to dataset factories specifically, and that the bug exists in much earlier versions.
Is this solved by #4262, @noklam @ElenaKhaustova?
I believe so. @andresrq-mckinsey, are you able to confirm this with:
Closing this issue; the solution in #4262 will be in the next release.
Description
Starting with kedro 0.19.7, PR #3990 changed the DataCatalog by adding __repr__ to _FrozenDatasets. This causes the error "dictionary changed size during iteration" when using the ThreadRunner in some scenarios that handle big volumes of data.
Context
So, there are 3 components that act together to generate this error:
- The ThreadRunner, which runs nodes in parallel.
- pluggy/_tracing.py has an inherited logger.debug that prints the DataCatalog in the after_node_run [hook] log from pluggy.
- Datasets not declared in the catalog.yml are added into the DataCatalog self.datasets property once the node finishes, to keep track of them.
The flow is as follows:
1. Because the ThreadRunner runs nodes in parallel, the hooks before_node_run, on_node_error and after_node_run also get to run in parallel.
2. The after_node_run log from _tracing.py will try to print the entire dataset object (see func __repr__ from DataCatalog), which triggers the print (__repr__) of the _FrozenDatasets.
3. Datasets that are not declared in the catalog.yml will be added into the DataCatalog as a MemoryDataset during the dataset saving process (runner.py:530 in _run_node_sequential: catalog.save(name, data) -> data_catalog.py:579 in save: dataset = self._get_dataset(name) -> data_catalog.py:453 in _get_dataset: self.add(dataset_name, dataset)).
This was not a problem before __repr__ was implemented in #3990, because the entire dictionary was printed at once (so step 2 was like an atomic action). Now the output is built with formatting, by iterating over the keys from self._original_names. If one node is at step 2 at the same time that a MemoryDataset from another node is being saved and added into the DataCatalog (step 3), the keys of self._original_names change during iteration, which triggers the error.
How to fix this
There are two ways that I know of to fix this problem: a kedro fix or a user fix.
- Kedro fix: in the _FrozenDatasets.__repr__(self) function, iterate over a shallow copy of the dictionary, so that if new items get added, the process that is iterating over self._original_names does not see the change.
- User fix: in the catalog.yml, declare the outputs that are currently undeclared.
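The kedro fix described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from Kedro or from #4262: FrozenDatasetsSketch is a hypothetical stand-in for _FrozenDatasets, and the only detail taken from this issue is that __repr__ iterates over a dictionary called self._original_names. The sketch also assumes CPython, where copying a dict with plain string keys is generally treated as a single step under the GIL.

```python
class FrozenDatasetsSketch:
    """Hypothetical stand-in for _FrozenDatasets; NOT Kedro's actual code."""

    def __init__(self, original_names):
        # Assumption taken from the issue text: a dict that other threads
        # may still be adding entries to while __repr__ runs.
        self._original_names = original_names

    def __repr__(self):
        # Take a shallow copy first, then iterate over the copy only.
        # Entries added to self._original_names after this line are simply
        # not part of this particular repr; they cannot break the loop.
        snapshot = dict(self._original_names)
        body = ", ".join(
            f"{name!r}: {dataset!r}" for name, dataset in snapshot.items()
        )
        return "{" + body + "}"


if __name__ == "__main__":
    names = {"declared_input": "dataset_a", "declared_output": "dataset_b"}
    print(repr(FrozenDatasetsSketch(names)))
```

The trade-off is that the printed catalog may be slightly out of date by the time it is logged, which seems acceptable for a debug message.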
Steps to Reproduce
I've tried to reproduce this problem on a kedro starter project, but those projects do not operate on datasets big enough to show the behavior. The needed ingredients are dataset outputs that are not declared in the catalog.yml, the ThreadRunner, and a big DataCatalog that takes time to print.
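Outside of Kedro, the same mechanism can be shown with plain Python threads. The sketch below is not a Kedro reproduction: all names in it (format_names_unsafe, reproduce, the dataset keys) are hypothetical, and two threading.Event objects are used to force the interleaving that, in a real pipeline, only happens by chance when the catalog is large. One thread iterates over a dict the way a formatted __repr__ would, while a second thread adds a key in the middle of that iteration.

```python
import threading


def format_names_unsafe(names, started, resume):
    """Build a repr-like string by iterating over the live dict (no copy)."""
    parts = []
    for index, key in enumerate(names):
        parts.append(f"{key}: {names[key]}")
        if index == 0:
            started.set()   # signal that iteration is under way
            resume.wait()   # pause mid-iteration until the writer has run
    return "{" + ", ".join(parts) + "}"


def reproduce():
    """Return the error messages raised while formatting the dict."""
    names = {"declared_a": "dataset", "declared_b": "dataset"}
    started, resume = threading.Event(), threading.Event()
    errors = []

    def reader():
        # Plays the role of the logging call that prints the catalog.
        try:
            format_names_unsafe(names, started, resume)
        except RuntimeError as exc:
            errors.append(str(exc))

    def writer():
        # Plays the role of another node saving an undeclared output.
        started.wait()
        names["undeclared_output"] = "MemoryDataset"
        resume.set()

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    print(reproduce())
```

Because the events pin the ordering, the reader always resumes after the dict has grown, so the failure does not depend on dataset size or timing here.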
Expected Result
The pipeline should work as it did in previous kedro versions.
Actual Result
Pipeline fails with the following error: dictionary changed size during iteration
Your Environment
- Kedro version used (pip show kedro or kedro -V): 0.19.8, but it also happens in 0.19.7 (when the change was introduced)
- Python version used (python -V): 3.11