Azure Data Assets

kedro-azureml adds two new datasets that can be used in the Kedro catalog: AzureMLFileDataSet and AzureMLPandasDataSet, which correspond to the File/Folder dataset and the Tabular dataset in Azure Machine Learning, respectively. Both fully support the Azure ML dataset versioning mechanism and can be used in the same way as any other Kedro dataset.

Both of these can be found under the kedro_azureml.datasets module.
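
For programmatic use, both classes can be imported directly from that module, for example:

from kedro_azureml.datasets import AzureMLFileDataSet, AzureMLPandasDataSet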

For details on usage, see the API Reference below.

API Reference

class kedro_azureml.datasets.AzureMLPandasDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None)

AzureML tabular dataset integration with Pandas DataFrame and Kedro. Can be used to save a Pandas DataFrame to an AzureML tabular dataset and load it back into a Pandas DataFrame.

Args

- azureml_dataset: Name of the AzureML tabular dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the TabularDatasetFactory.register_pandas_dataframe method. Read more: register_pandas_dataframe
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: Dataset.get_by_name
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().

Example

Example of a catalog.yml entry:

my_pandas_dataframe_dataset:
  type: kedro_azureml.datasets.AzureMLPandasDataSet
  azureml_dataset: my_new_azureml_dataset

  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
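
Example of Python API usage (a minimal sketch; the dataset name is illustrative, and the workspace is assumed to load automatically as described above):

import pandas as pd

from kedro_azureml.datasets import AzureMLPandasDataSet

# 'my_new_azureml_dataset' is an illustrative name; the default datastore
# and an automatically loaded workspace are assumed
data_set = AzureMLPandasDataSet(azureml_dataset='my_new_azureml_dataset')

# register the DataFrame as an AzureML tabular dataset
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
data_set.save(df)

# load the latest registered version back into a Pandas DataFrame
loaded_df = data_set.load()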

class kedro_azureml.datasets.AzureMLFileDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None, **kwargs)

AzureML file dataset integration with Kedro, using kedro.io.PartitionedDataSet as the base class. Can be used to save (register) data stored in Azure Blob Storage as an AzureML file dataset. The data can then be loaded from the AzureML file dataset into a convenient format (e.g. a pandas DataFrame, a Pillow image, etc.).

Args

- azureml_dataset: Name of the AzureML file dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the AbstractDataset.register method. Make sure to pass create_new_version=True to create a new version of an existing dataset. Note: if there is no difference in file paths, a new version will not be created and the existing version will be overwritten, even if create_new_version=True. Read more: AbstractDataset.register.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: azureml.core.Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
- kwargs: Additional arguments to pass to the PartitionedDataSet constructor. Make sure not to pass the path argument, as it will be built from the azureml_datastore argument.

Example

Example of a catalog.yml entry:

processed_images:
  type: kedro_azureml.datasets.AzureMLFileDataSet
  dataset: pillow.ImageDataSet
  filename_suffix: '.png'
  azureml_dataset: processed_images
  azureml_dataset_save_args:
    create_new_version: true

  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1

  # optional; if not provided, the environment variables
  # `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` will be used
  credentials:
    account_name: my_storage_account_name
    account_key: my_storage_account_key

Example of Python API usage:

import pandas as pd

from kedro_azureml.datasets import AzureMLFileDataSet

# create dummy data
dict_df = {}
dict_df['path/in/azure/blob/storage/file_1'] = pd.DataFrame({'a': [1,2], 'b': [3,4]})
dict_df['path/in/azure/blob/storage/file_2'] = pd.DataFrame({'c': [3,4], 'd': [5,6]})

# init AzureMLFileDataSet
data_set = AzureMLFileDataSet(
    azureml_dataset='my_azureml_file_dataset_name',
    azureml_datastore='my_azureml_datastore_name',  # optional; if not provided, the default datastore will be used
    dataset='pandas.CSVDataSet',
    filename_suffix='.csv',  # optional, will add this suffix to the file names (file_1.csv, file_2.csv)

    # optional - if not provided, will use the environment variables
    # AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
    credentials={
        'account_name': 'my_storage_account_name',
        'account_key': 'my_storage_account_key',
    },

    # create a new version if the dataset already exists (otherwise saving will raise an error)
    azureml_dataset_save_args={
        'create_new_version': True,
    }
)

# this will create 2 blobs, one for each dataframe, in the following paths:
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_1.csv>
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_2.csv>
# also, it will register a corresponding AzureML file dataset under the name <my_azureml_file_dataset_name>
data_set.save(dict_df)

# this will create lazy load functions instead of loading data into memory immediately.
loaded = data_set.load()

# load all the partitions
for file_path, load_func in loaded.items():
    df = load_func()

    # process pandas dataframe
    # ...