Node caching: avoid unnecessary node execution when inputs, outputs, or logic remain unchanged #4350

Closed as not planned
@froxec

Description

First of all, thank you for your efforts in developing Kedro.
I believe it would be highly beneficial if Kedro had a built-in node caching feature. By node caching, I mean a mechanism to avoid re-executing a node when its inputs, outputs, and logic remain unchanged.
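
To make the idea concrete, here is a minimal sketch in plain Python of what I mean. It is not Kedro's API and not how kedro-cache works; `fingerprint` and `NodeCache` are hypothetical names made up for illustration. It assumes node inputs are picklable, keeps the cache in memory only, and detects logic changes by hashing the function's bytecode and constants, which is only a rough approximation.

```python
import hashlib
import pickle


def fingerprint(func, inputs):
    """Hash a node function's compiled code together with its inputs.

    Assumption: inputs are picklable and pickle to identical bytes when equal.
    Changes in helper functions called by `func` are NOT detected.
    """
    digest = hashlib.sha256()
    code = func.__code__
    digest.update(code.co_code)                        # bytecode of the node logic
    digest.update(repr(code.co_consts).encode("utf-8"))  # literals used in the body
    for name in sorted(inputs):                        # stable ordering of inputs
        digest.update(name.encode("utf-8"))
        digest.update(pickle.dumps(inputs[name]))
    return digest.hexdigest()


class NodeCache:
    """Hypothetical in-memory cache: node name -> (fingerprint, outputs)."""

    def __init__(self):
        self._store = {}

    def run(self, name, func, inputs):
        """Return (outputs, was_cached); execute func only if something changed."""
        key = fingerprint(func, inputs)
        cached = self._store.get(name)
        if cached is not None and cached[0] == key:
            return cached[1], True      # inputs and logic unchanged: skip execution
        outputs = func(**inputs)        # something changed: re-execute the node
        self._store[name] = (key, outputs)
        return outputs, False
```

Calling `cache.run("double", double, {"x": 3})` twice would execute `double` only the first time; changing the input or the function body would trigger a re-run. A real implementation would also need to persist fingerprints between runs and confirm that the stored outputs still exist.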

Context

This feature is important to me because, in some scenarios, it is necessary to run the entire pipeline multiple times with different configurations. Re-executing nodes that remain unchanged between runs can significantly increase the time required for experiments.

For instance, when tracking pipeline parameters with MLflow, we need to run the entire pipeline to record parameters for every node, because kedro-mlflow records parameters node by node.

Possible Implementation

There is already an existing plugin, kedro-cache, that implements similar functionality. The plugin is well-written and could work effectively with some adjustments. However, it is outdated and incompatible with the most recent Kedro releases. Moreover, there are compatibility issues with specific datasets, such as tracking.JSONDataset and tracking.MetricsDataset, which are write-only and cannot be loaded.

I believe that integrating node caching directly into Kedro's core design would help mitigate such compatibility issues and provide a more robust solution for users.

Labels: Community, Issue: Feature Request