Bug using AzureMLAssetDataset locally #147

Open
@robertmcleod2

Description

AzureMLAssetDataset works fine when deployed. Locally, however, I get an error when one pipeline outputs an AzureMLAssetDataset and another pipeline tries to consume that asset. Here is a reproducible example:

The first pipeline:

from kedro.pipeline import Pipeline, node
import pandas as pd


def create_dataset():
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    return df


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=create_dataset,
                inputs=None,
                outputs="test_raw",
                name="create_test_raw",
            ),
        ],
    )

The second pipeline:

from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        nodes=[
            node(
                func=lambda x: x,
                inputs="test_raw",
                outputs="test_raw_copy",
                name="copy_test_raw",
            ),
        ],
    )

and the catalog:

test_raw:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: test_raw
  root_dir: data/00_azurelocals
  versioned: true
  dataset:
    type: pandas.CSVDataset
    filepath: test_raw.csv

test_raw_copy:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: test_raw_copy
  root_dir: data/00_azurelocals
  versioned: true
  dataset:
    type: pandas.CSVDataset
    filepath: test_raw_copy.csv

Running the first pipeline locally with kedro run --pipeline test creates a local file at data/00_azurelocals/test_raw/local/test_raw.csv. Running the second pipeline with kedro run --pipeline copy_test then fails with the following stack trace:

(enerfore-deployment) C:\Users\Robert.McLeod2\git_repos\ptx-ds-enerfore-deployment>kedro run --pipeline copy_test
[07/17/24 16:34:09] INFO     Kedro project ptx-ds-enerfore-deployment                                                                                                                                                                                session.py:365
[07/17/24 16:34:18]                                                                                                                                                                 
                    WARNING  Replacing dataset 'test_raw'                                                                                                                                                                                       data_catalog.py:606
                    WARNING  Replacing dataset 'test_raw_copy'                                                                                                                                                                                  data_catalog.py:606
                    INFO     Loading data from 'test_raw' (AzureMLAssetDataset)...                                                                                                                                                              data_catalog.py:502
Found the config file in: C:\Users\ROBERT~1.MCL\AppData\Local\Temp\tmpxxwei3q5\config.json
Found the config file in: C:\Users\ROBERT~1.MCL\AppData\Local\Temp\tmp5omkfo_i\config.json
[07/17/24 16:34:31] WARNING  No nodes ran. Repeat the previous command to attempt a new run.                                                                                                                                                          runner.py:213
Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 775, in _get_latest_version_from_container
    else container_operation.get(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\core\tracing\decorator.py", line 94, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_restclient\v2023_04_01_preview\operations\_data_containers_operations.py", line 430, in get
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\core\exceptions.py", line 161, in map_error
    raise error
azure.core.exceptions.ResourceNotFoundError: (UserError) test_raw container was not found.
Code: UserError
Message: test_raw container was not found.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 265, in get
    return _resolve_label_to_asset(self, name, label)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 1022, in _resolve_label_to_asset
    return resolver(name)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 675, in _get_latest_version
    latest_version = _get_latest_version_from_container(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_utils\_asset_utils.py", line 795, in _get_latest_version_from_container
    raise ValidationException(
azure.ai.ml.exceptions.ValidationException: Asset test_raw does not exist in workspace azuremlworkspace.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 193, in load
    return self._load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 188, in _load
    azureml_ds = self._get_azureml_dataset()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 182, in _get_azureml_dataset
    self._azureml_dataset, version=self.resolve_load_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 576, in resolve_load_version
    return self._fetch_latest_load_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\cachetools\__init__.py", line 799, in wrapper
    v = method(self, *args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 175, in _fetch_latest_load_version
    return self._get_latest_version()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro_azureml\datasets\asset_dataset.py", line 169, in _get_latest_version
    return ml_client.data.get(self._azureml_dataset, label="latest").version
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_telemetry\activity.py", line 292, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 279, in get
    log_and_raise_error(ex)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\azure\ai\ml\_exception_helper.py", line 337, in log_and_raise_error
    raise MlException(message=formatted_error, no_personal_data_message=formatted_error)
azure.ai.ml.exceptions.MlException:


1) Resource was not found.


Details:

(x) Asset test_raw does not exist in workspace azuremlworkspace.

Resolutions:
1) Double-check that the resource has been specified correctly and that you have access to it.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\cli.py", line 211, in main
    cli_collection()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\cli.py", line 139, in main
    super().main(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\cli\project.py", line 453, in run
    session.run(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\framework\session\session.py", line 436, in run
    run_result = runner.run(
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 103, in run
    self._run(pipeline, catalog, hook_manager, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\sequential_runner.py", line 70, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\runner\runner.py", line 414, in _run_node_sequential
    inputs[name] = catalog.load(name)
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\data_catalog.py", line 506, in load
    result = dataset.load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 614, in load
    return super().load()
  File "C:\Users\Robert.McLeod2\AppData\Local\anaconda3\envs\enerfore-deployment\lib\site-packages\kedro\io\core.py", line 202, in load
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while loading data from data set AzureMLAssetDataset(dataset_config={'filepath': test_raw.csv}, dataset_type=CSVDataset, filepath_arg=filepath, root_dir=data/00_azurelocals).



1) Resource was not found.


Details:

(x) Asset test_raw does not exist in workspace azuremlworkspace.

Resolutions:
1) Double-check that the resource has been specified correctly and that you have access to it.
If using the CLI, you can also check the full log in debug mode for more details by adding --debug to the end of your command

Additional Resources: The easiest way to author a yaml specification file is using IntelliSense and auto-completion Azure ML VS code extension provides: https://code.visualstudio.com/docs/datascience/azure-machine-learning. To set up VS Code, visit https://docs.microsoft.com/azure/machine-learning/how-to-setup-vs-code

So it seems to be looking for a version of the asset on Azure rather than using the local copy. When a version does exist on Azure, it puts that Azure version number in the directory path instead of local, i.e. it looks for a file at data/00_azurelocals/test_raw/4/test_raw.csv.
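To make the two locations concrete, here is a rough sketch of the directory layout as I understand it. The helper name `asset_file_path` is hypothetical, and the `<root_dir>/<azureml_dataset>/<version>/<filepath>` layout is only inferred from the paths observed in this issue, not taken from the kedro-azureml source:

```python
from pathlib import PurePosixPath


def asset_file_path(
    root_dir: str, azureml_dataset: str, version: str, filepath: str
) -> PurePosixPath:
    """Hypothetical helper: <root_dir>/<azureml_dataset>/<version>/<filepath>."""
    return PurePosixPath(root_dir) / azureml_dataset / version / filepath


# Where the first pipeline wrote its output during the local run.
written = asset_file_path("data/00_azurelocals", "test_raw", "local", "test_raw.csv")

# Where the second pipeline looks once an Azure version (here 4) is resolved.
looked_up = asset_file_path("data/00_azurelocals", "test_raw", "4", "test_raw.csv")

print(written)
print(looked_up)
```

The only difference between the two paths is the version segment, which is why the file saved under local is never found on load.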

I'm not sure why it tries to find the dataset on Azure; I would expect it to just read the local files instead. The error only occurs when an AzureMLAssetDataset is used as an input locally. Any help is appreciated, thanks.
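For clarity, the behaviour I would expect looks roughly like the sketch below. Everything here is hypothetical (the function `expected_load_version`, the `running_in_azure` flag, and the `fetch_latest` callback are made up for illustration); it is not how the plugin is implemented, just the rule I had in mind:

```python
from typing import Callable


def expected_load_version(
    running_in_azure: bool, fetch_latest: Callable[[], str]
) -> str:
    """Hypothetical rule: query the workspace only when running on Azure ML."""
    if running_in_azure:
        return fetch_latest()
    # Local runs never contact the workspace and read from the "local" folder.
    return "local"


def failing_lookup() -> str:
    # Stands in for the workspace lookup that fails in the stack trace above.
    raise LookupError("Asset test_raw does not exist in workspace")


local_version = expected_load_version(False, failing_lookup)
azure_version = expected_load_version(True, lambda: "4")

print(local_version)
print(azure_version)
```

With a rule like this, a local run would not fail when the asset has not been registered in the workspace yet, because the lookup is never attempted.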
