Azure Data Assets
kedro-azureml adds support for two new datasets that can be used in the Kedro catalog: the AzureMLFileDataSet and the AzureMLPandasDataSet, which correspond to the File/Folder dataset and the Tabular dataset in Azure Machine Learning, respectively. Both fully support the Azure versioning mechanism and can be used in the same way as any other Kedro dataset.
Apart from these, kedro-azureml also adds the AzureMLPipelineDataSet, which is used to pass data between pipeline nodes when the pipeline is run on Azure ML and the pipeline data passing feature is enabled. By default, data is then saved and loaded using the PickleDataSet as the underlying dataset. Any other underlying dataset can be used instead by adding an explicit AzureMLPipelineDataSet entry to the catalog.
All of these can be found under the kedro_azureml.datasets module. For details on usage, see the API Reference below.
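From the Python API, these entries behave like any other Kedro catalog config. As a minimal, illustrative sketch (the dataset names are hypothetical, and the AzureML workspace is assumed to be resolvable automatically):

from kedro.io import DataCatalog

# minimal illustrative config; assumes the AzureML workspace can be
# resolved automatically (see the workspace / workspace_args parameters below)
catalog = DataCatalog.from_config({
    "my_tabular_data": {
        "type": "kedro_azureml.datasets.AzureMLPandasDataSet",
        "azureml_dataset": "my_azureml_tabular_dataset",  # hypothetical dataset name
    },
})

df = catalog.load("my_tabular_data")  # loads a pandas DataFrame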
API Reference
- class kedro_azureml.datasets.AzureMLPandasDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None)
AzureML tabular dataset integration with pandas DataFrames and Kedro. Can be used to save a pandas DataFrame as an AzureML tabular dataset and load it back into a DataFrame.
Args
- azureml_dataset: Name of the AzureML dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the TabularDatasetFactory.register_pandas_dataframe method. Read more: register_pandas_dataframe.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
Example
Example of a catalog.yml entry:
my_pandas_dataframe_dataset:
  type: kedro_azureml.datasets.AzureMLPandasDataSet
  azureml_dataset: my_new_azureml_dataset
  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
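The same dataset can also be constructed directly through the Python API, where save and load behave as for any Kedro dataset. A minimal sketch mirroring the catalog entry above (the dataset name is illustrative, and the workspace is assumed to load automatically):

import pandas as pd
from kedro_azureml.datasets import AzureMLPandasDataSet

data_set = AzureMLPandasDataSet(
    azureml_dataset='my_new_azureml_dataset',  # illustrative name
    azureml_dataset_load_args={'version': 1},  # optional; defaults to the latest version
)

# register a DataFrame as an AzureML tabular dataset
data_set.save(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}))

# load it back into a pandas DataFrame
df = data_set.load()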
- class kedro_azureml.datasets.AzureMLFileDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None, **kwargs)
AzureML file dataset integration with Kedro, using kedro.io.PartitionedDataSet as the base class. Can be used to save (register) data stored in Azure Blob Storage as an AzureML file dataset. The data can then be loaded from the AzureML file dataset into a convenient format (e.g. a pandas DataFrame, a Pillow image, etc.).
Args
- azureml_dataset: Name of the AzureML file dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the AbstractDataset.register method. Make sure to pass create_new_version=True to create a new version of an existing dataset. Note: if there is no difference in file paths, a new version will not be created and the existing version will be overwritten, even if create_new_version=True. Read more: AbstractDataset.register.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: azureml.core.Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
- kwargs: Additional arguments to pass to the PartitionedDataSet constructor. Make sure not to pass the path argument, as it will be built from the azureml_datastore argument.
Example
Example of a catalog.yml entry:
processed_images:
  type: kedro_azureml.datasets.AzureMLFileDataSet
  dataset: pillow.ImageDataSet
  filename_suffix: '.png'
  azureml_dataset: processed_images
  azureml_dataset_save_args:
    create_new_version: true
  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
  # optional, if not provided, the environment variables
  # `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` will be used
  credentials:
    account_name: my_storage_account_name
    account_key: my_storage_account_key
Example of Python API usage:
import pandas as pd

from kedro_azureml.datasets import AzureMLFileDataSet

# create dummy data
dict_df = {}
dict_df['path/in/azure/blob/storage/file_1'] = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dict_df['path/in/azure/blob/storage/file_2'] = pd.DataFrame({'c': [3, 4], 'd': [5, 6]})

# init AzureMLFileDataSet
data_set = AzureMLFileDataSet(
    azureml_dataset='my_azureml_file_dataset_name',
    azureml_datastore='my_azureml_datastore_name',  # optional, if not provided, the default datastore will be used
    dataset='pandas.CSVDataSet',
    filename_suffix='.csv',  # optional, will add this suffix to the file names (file_1.csv, file_2.csv)
    # optional - if not provided, the environment variables
    # AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY will be used
    credentials={
        'account_name': 'my_storage_account_name',
        'account_key': 'my_storage_account_key',
    },
    # create a new version if the dataset already exists
    # (otherwise, trying to save will raise an error)
    azureml_dataset_save_args={
        'create_new_version': True,
    },
)

# this will create 2 blobs, one for each dataframe, in the following paths:
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_1.csv>
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_2.csv>
# it will also register a corresponding AzureML file dataset
# under the name <my_azureml_file_dataset_name>
data_set.save(dict_df)

# load() creates lazy load functions instead of loading data into memory immediately
loaded = data_set.load()

# load all the partitions
for file_path, load_func in loaded.items():
    df = load_func()
    # process pandas dataframe
    # ...
- class kedro_azureml.datasets.AzureMLPipelineDataSet(dataset: str | Type[AbstractDataSet] | Dict[str, Any], filepath_arg: str = 'filepath')
Dataset to support pipeline data passing between nodes in Azure ML, using kedro.io.AbstractDataSet as the base class. Wraps around an underlying dataset, which can be any dataset supported by Kedro, and adds the ability to modify the underlying dataset's file path so that it points to the mount paths on the Azure ML compute where the node is run.
Args
- dataset: Underlying dataset definition. Accepted formats are: (a) an object of a class that inherits from AbstractDataSet, (b) a string representing a fully qualified class name to such a class, or (c) a dictionary with a type key pointing to a string from (b); other keys are passed to the dataset initializer.
- filepath_arg: Underlying dataset initializer argument that will set the filepath. If unspecified, defaults to "filepath".
Example
Example of a catalog.yml entry:
processed_images:
  type: kedro_azureml.datasets.AzureMLPipelineDataSet
  dataset:
    type: pillow.ImageDataSet
    filepath: 'images.png'
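The same wrapper can be built programmatically using the dictionary format (c) described above. A minimal sketch, equivalent to the catalog entry: note that the filepath given here is only a placeholder, since the plugin is expected to repoint it at the Azure ML mount path when the pipeline runs on Azure ML:

from kedro_azureml.datasets import AzureMLPipelineDataSet

data_set = AzureMLPipelineDataSet(
    dataset={
        'type': 'pillow.ImageDataSet',
        # placeholder path; modified by the plugin to point to the
        # Azure ML mount path at run time
        'filepath': 'images.png',
    },
)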