Azure Data Assets
kedro-azureml adds support for two new datasets that can be used in the Kedro catalog: the AzureMLFileDataSet and the AzureMLPandasDataSet, which correspond to the File/Folder dataset and the Tabular dataset in Azure Machine Learning, respectively. Both fully support the Azure versioning mechanism and can be used in the same way as any other Kedro dataset.
Apart from these, kedro-azureml also adds the AzureMLPipelineDataSet, which is used to pass data between pipeline nodes when the pipeline is run on Azure ML and the pipeline data passing feature is enabled. By default, data is then saved and loaded using the PickleDataSet as the underlying dataset. Any other underlying dataset can be used instead by adding an explicit AzureMLPipelineDataSet entry to the catalog.
All of these can be found under the kedro_azureml.datasets module. For details on usage, see the API Reference below.
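From the Python API, these entries behave like any other Kedro catalog config. As a minimal, illustrative sketch (the dataset names are hypothetical, and the AzureML workspace is assumed to be resolvable automatically):

from kedro.io import DataCatalog

# minimal illustrative config; assumes the AzureML workspace can be
# resolved automatically (see the workspace / workspace_args parameters below)
catalog = DataCatalog.from_config({
    "my_tabular_data": {
        "type": "kedro_azureml.datasets.AzureMLPandasDataSet",
        "azureml_dataset": "my_azureml_tabular_dataset",  # hypothetical dataset name
    },
})

df = catalog.load("my_tabular_data")  # loads a pandas DataFrame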
API Reference
- class kedro_azureml.datasets.AzureMLPandasDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None)
AzureML tabular dataset integration with pandas DataFrames and Kedro. Can be used to save a pandas DataFrame as an AzureML tabular dataset and load it back into a DataFrame.
Args
- azureml_dataset: Name of the AzureML dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the TabularDatasetFactory.register_pandas_dataframe method. Read more: register_pandas_dataframe.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
Example
Example of a catalog.yml entry:
my_pandas_dataframe_dataset:
  type: kedro_azureml.datasets.AzureMLPandasDataSet
  azureml_dataset: my_new_azureml_dataset
  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
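The same dataset can also be constructed directly through the Python API, where save and load behave as for any Kedro dataset. A minimal sketch mirroring the catalog entry above (the dataset name is illustrative, and the workspace is assumed to load automatically):

import pandas as pd
from kedro_azureml.datasets import AzureMLPandasDataSet

data_set = AzureMLPandasDataSet(
    azureml_dataset='my_new_azureml_dataset',  # illustrative name
    azureml_dataset_load_args={'version': 1},  # optional; defaults to the latest version
)

# register a DataFrame as an AzureML tabular dataset
data_set.save(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}))

# load it back into a pandas DataFrame
df = data_set.load()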
- class kedro_azureml.datasets.AzureMLFileDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None, **kwargs)
AzureML file dataset integration with Kedro, using kedro.io.PartitionedDataSet as the base class. Can be used to save (register) data stored in Azure Blob Storage as an AzureML file dataset. The data can then be loaded from the AzureML file dataset into a convenient format (e.g. a pandas DataFrame, a Pillow image, etc.).
Args
- azureml_dataset: Name of the AzureML file dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the AbstractDataset.register method. Make sure to pass create_new_version=True to create a new version of an existing dataset. Note: if there is no difference in file paths, a new version will not be created and the existing version will be overwritten, even if create_new_version=True. Read more: AbstractDataset.register.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: azureml.core.Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
- kwargs: Additional arguments to pass to the PartitionedDataSet constructor. Make sure not to pass the path argument, as it will be built from the azureml_datastore argument.
Example
Example of a catalog.yml entry:
processed_images:
  type: kedro_azureml.datasets.AzureMLFileDataSet
  dataset: pillow.ImageDataSet
  filename_suffix: '.png'
  azureml_dataset: processed_images
  azureml_dataset_save_args:
    create_new_version: true
  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
  # optional, if not provided, the environment variables
  # `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` will be used
  credentials:
    account_name: my_storage_account_name
    account_key: my_storage_account_key
Example of Python API usage:
import pandas as pd

from kedro_azureml.datasets import AzureMLFileDataSet

# create dummy data
dict_df = {}
dict_df['path/in/azure/blob/storage/file_1'] = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dict_df['path/in/azure/blob/storage/file_2'] = pd.DataFrame({'c': [3, 4], 'd': [5, 6]})

# init AzureMLFileDataSet
data_set = AzureMLFileDataSet(
    azureml_dataset='my_azureml_file_dataset_name',
    azureml_datastore='my_azureml_datastore_name',  # optional, if not provided, the default datastore will be used
    dataset='pandas.CSVDataSet',
    filename_suffix='.csv',  # optional, will add this suffix to the file names (file_1.csv, file_2.csv)
    # optional - if not provided, the environment variables
    # AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY will be used
    credentials={
        'account_name': 'my_storage_account_name',
        'account_key': 'my_storage_account_key',
    },
    # create a new version if the dataset already exists
    # (otherwise, trying to save will raise an error)
    azureml_dataset_save_args={
        'create_new_version': True,
    },
)

# this will create 2 blobs, one for each dataframe, in the following paths:
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_1.csv>
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_2.csv>
# it will also register a corresponding AzureML file dataset
# under the name <my_azureml_file_dataset_name>
data_set.save(dict_df)

# load() creates lazy load functions instead of loading data into memory immediately
loaded = data_set.load()

# load all the partitions
for file_path, load_func in loaded.items():
    df = load_func()
    # process pandas dataframe
    # ...
- class kedro_azureml.datasets.AzureMLPipelineDataSet(dataset: str | Type[AbstractDataSet] | Dict[str, Any], filepath_arg: str = 'filepath')
Dataset to support pipeline data passing between nodes in Azure ML, using kedro.io.AbstractDataSet as the base class. Wraps around an underlying dataset, which can be any dataset supported by Kedro, and adds the ability to modify the underlying dataset's file path so that it points to the mount paths on the Azure ML compute where the node is run.
Args
- dataset: Underlying dataset definition. Accepted formats are: (a) an object of a class that inherits from AbstractDataSet, (b) a string representing a fully qualified class name to such a class, or (c) a dictionary with a type key pointing to a string from (b); other keys are passed to the dataset initializer.
- filepath_arg: Underlying dataset initializer argument that will set the filepath. If unspecified, defaults to "filepath".
Example
Example of a catalog.yml entry:
processed_images:
  type: kedro_azureml.datasets.AzureMLPipelineDataSet
  dataset:
    type: pillow.ImageDataSet
    filepath: 'images.png'
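The same wrapper can be built programmatically using the dictionary format (c) described above. A minimal sketch, equivalent to the catalog entry: note that the filepath given here is only a placeholder, since the plugin is expected to repoint it at the Azure ML mount path when the pipeline runs on Azure ML:

from kedro_azureml.datasets import AzureMLPipelineDataSet

data_set = AzureMLPipelineDataSet(
    dataset={
        'type': 'pillow.ImageDataSet',
        # placeholder path; modified by the plugin to point to the
        # Azure ML mount path at run time
        'filepath': 'images.png',
    },
)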