Azure Data Assets
kedro-azureml adds support for two new datasets that can be used in the Kedro catalog: AzureMLFileDataSet and AzureMLPandasDataSet, which translate to File/Folder datasets and Tabular datasets in Azure Machine Learning, respectively. Both fully support the Azure versioning mechanism and can be used in the same way as any other dataset in Kedro.
Both of these can be found under the kedro_azureml.datasets module.
For details on usage, see the API Reference below.
API Reference
- class kedro_azureml.datasets.AzureMLPandasDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None)
AzureML tabular dataset integration with pandas DataFrames and Kedro. Can be used to save a pandas DataFrame to an AzureML tabular dataset and load it back into a pandas DataFrame.
Args

- azureml_dataset: Name of the AzureML tabular dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the TabularDatasetFactory.register_pandas_dataframe method. Read more: register_pandas_dataframe
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: Dataset.get_by_name
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().

Example
Example of a catalog.yml entry:
```yaml
my_pandas_dataframe_dataset:
  type: kedro_azureml.datasets.AzureMLPandasDataSet
  azureml_dataset: my_new_azureml_dataset

  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1
```
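The same dataset can also be driven from the Python API, analogous to the file-dataset example further below. A minimal sketch — the dataset and datastore names here are placeholders, and the save/load calls require access to a real AzureML workspace:

```python
import pandas as pd

from kedro_azureml.datasets import AzureMLPandasDataSet

# placeholder name - replace with a dataset name from your own workspace
data_set = AzureMLPandasDataSet(
    azureml_dataset='my_azureml_tabular_dataset_name',
    # azureml_datastore='my_datastore',  # optional; default datastore is used if omitted
)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# registers the DataFrame as a new version of the tabular dataset
data_set.save(df)

# loads the registered tabular dataset back into a pandas DataFrame
reloaded = data_set.load()
```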
- class kedro_azureml.datasets.AzureMLFileDataSet(azureml_dataset: str, azureml_datastore: str | None = None, azureml_dataset_save_args: Dict[str, Any] | None = None, azureml_dataset_load_args: Dict[str, Any] | None = None, workspace: azureml.core.Workspace | None = None, workspace_args: Dict[str, Any] | None = None, **kwargs)
AzureML file dataset integration with Kedro, using kedro.io.PartitionedDataSet as the base class. Can be used to save (register) data stored in Azure Blob Storage as an AzureML file dataset. The data can then be loaded from the AzureML file dataset into a convenient format (e.g. a pandas DataFrame, Pillow image, etc.).
Args

- azureml_dataset: Name of the AzureML file dataset.
- azureml_datastore: Name of the AzureML datastore. If not provided, the default datastore will be used.
- azureml_dataset_save_args: Additional arguments to pass to the AbstractDataset.register method. Make sure to pass create_new_version=True to create a new version of an existing dataset. Note: if there is no difference in file paths, a new version will not be created and the existing version will be overwritten, even if create_new_version=True. Read more: AbstractDataset.register.
- azureml_dataset_load_args: Additional arguments to pass to the azureml.core.Dataset.get_by_name method. Read more: azureml.core.Dataset.get_by_name.
- workspace: AzureML Workspace. If not specified, will attempt to load the workspace automatically.
- workspace_args: Additional arguments to pass to utils.get_workspace().
- kwargs: Additional arguments to pass to the PartitionedDataSet constructor. Make sure not to pass the path argument, as it will be built from the azureml_datastore argument.

Example
Example of a catalog.yml entry:
```yaml
processed_images:
  type: kedro_azureml.datasets.AzureMLFileDataSet
  dataset: pillow.ImageDataSet
  filename_suffix: '.png'
  azureml_dataset: processed_images
  azureml_dataset_save_args:
    create_new_version: true

  # if version is not provided, the latest dataset version will be used
  azureml_dataset_load_args:
    version: 1

  # optional, if not provided, the environment variables
  # `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` will be used
  credentials:
    account_name: my_storage_account_name
    account_key: my_storage_account_key
```
Example of Python API usage:
```python
import pandas as pd

from kedro_azureml.datasets import AzureMLFileDataSet

# create dummy data
dict_df = {}
dict_df['path/in/azure/blob/storage/file_1'] = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dict_df['path/in/azure/blob/storage/file_2'] = pd.DataFrame({'c': [3, 4], 'd': [5, 6]})

# init AzureMLFileDataSet
data_set = AzureMLFileDataSet(
    azureml_dataset='my_azureml_file_dataset_name',
    azureml_datastore='my_azureml_datastore_name',  # optional, if not provided, the default datastore will be used
    dataset='pandas.CSVDataSet',
    filename_suffix='.csv',  # optional, will add this suffix to the file names (file_1.csv, file_2.csv)

    # optional - if not provided, will use the environment variables
    # AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY
    credentials={
        'account_name': 'my_storage_account_name',
        'account_key': 'my_storage_account_key',
    },

    # create a new version if the dataset already exists
    # (otherwise, trying to save will raise an error)
    azureml_dataset_save_args={
        'create_new_version': True,
    },
)

# this will create 2 blobs, one for each dataframe, in the following paths:
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_1.csv>
# <my_storage_account_name/my_container/path/in/azure/blob/storage/file_2.csv>
# it will also register a corresponding AzureML file dataset
# under the name <my_azureml_file_dataset_name>
data_set.save(dict_df)

# load() creates lazy load functions instead of loading data into memory immediately
loaded = data_set.load()

# load all the partitions
for file_path, load_func in loaded.items():
    df = load_func()
    # process pandas dataframe
    # ...
```
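Because load() returns zero-argument load functions rather than the data itself, partitions can be materialized on demand and combined as needed. A minimal sketch of that pattern, using plain callables to stand in for the real load functions so it runs without any Azure connection (the paths are purely illustrative):

```python
import pandas as pd

# stand-ins for the lazy load functions that a PartitionedDataSet-based
# dataset such as AzureMLFileDataSet returns from load()
loaded = {
    'path/in/azure/blob/storage/file_1': lambda: pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'path/in/azure/blob/storage/file_2': lambda: pd.DataFrame({'a': [5, 6], 'b': [7, 8]}),
}

# each partition is materialized only when its load function is called;
# concatenating the dict keeps the partition path as an outer index level
combined = pd.concat({path: load_func() for path, load_func in loaded.items()})

# combined now holds 4 rows and 2 columns, indexed by (partition path, row)
print(combined.shape)
```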