Azure Data Assets
kedro-azureml adds support for two new datasets that can be used in the Kedro catalog. Right now we support both the Azure ML v1 SDK (direct Python) and the Azure ML v2 SDK (fsspec-based) APIs.
For the v2 API (fsspec-based), use AzureMLAssetDataset, which makes it possible to use Azure ML v2 SDK Folder/File datasets for both remote and local runs.
Currently only the uri_file and uri_folder types are supported. Because of limitations of the Azure ML SDK, the uri_file type can only be used for pipeline inputs,
not for outputs. The uri_folder type can be used for both inputs and outputs.
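The input/output rule above can be expressed as a small lookup. The following is a minimal sketch for illustration only: `ALLOWED_ROLES` and `check_asset_usage` are hypothetical names, not part of kedro-azureml, and the plugin may enforce this rule differently or not at all.

```python
# Hypothetical helper, NOT part of kedro-azureml: encodes the rule stated in
# the text (uri_file is input-only, uri_folder works for inputs and outputs).
ALLOWED_ROLES = {
    "uri_file": {"input"},
    "uri_folder": {"input", "output"},
}


def check_asset_usage(azureml_type: str, role: str) -> None:
    """Raise ValueError if the asset type cannot be used in the given role."""
    if azureml_type not in ALLOWED_ROLES:
        raise ValueError(f"Unsupported azureml_type: {azureml_type!r}")
    if role not in ALLOWED_ROLES[azureml_type]:
        raise ValueError(f"{azureml_type} cannot be used as a pipeline {role}")


check_asset_usage("uri_folder", "output")  # allowed, returns None
check_asset_usage("uri_file", "input")     # allowed, returns None
```

Calling `check_asset_usage("uri_file", "output")` raises `ValueError`, which mirrors the limitation described above.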
For the v1 API (deprecated ⚠️), use the AzureMLFileDataset and the AzureMLPandasDataset, which translate to a File/Folder dataset and a Tabular dataset respectively in Azure Machine Learning. Both fully support the Azure versioning mechanism and can be used in the same way as any other dataset in Kedro.
Apart from these, kedro-azureml also adds the AzureMLPipelineDataset, which is used to pass data between pipeline nodes when the pipeline is run on Azure ML and the pipeline data passing feature is enabled. By default, data is then saved and loaded using the PickleDataset as the underlying dataset. Any other underlying dataset can be used instead by adding an AzureMLPipelineDataset to the catalog.
All of these can be found under the kedro_azureml.datasets module.
For details on usage, see the API Reference below.
API Reference
Pipeline data passing
⚠️ Cannot be used when run locally.
- class kedro_azureml.datasets.AzureMLPipelineDataset(dataset: str | Type[AbstractDataset] | Dict[str, Any], root_dir: str = 'data', filepath_arg: str = 'filepath', metadata: Dict[str, Any] | None = None)
Dataset to support pipeline data passing in Azure ML between nodes, using kedro.io.AbstractDataset as base class. Wraps around an underlying dataset, which can be any dataset supported by Kedro, and adds the ability to modify the file path of the underlying dataset, to point to the mount paths on the Azure ML compute where the node is run.
Args
- dataset: Underlying dataset definition. Accepted formats are: a) an object of a class that inherits from AbstractDataset, b) a string representing a fully qualified class name of such a class, c) a dictionary with a type key pointing to a string from b); other keys are passed to the dataset initializer.
- root_dir: Folder (path) to prepend to the filepath of the underlying dataset. If unspecified, defaults to “data”.
- filepath_arg: Underlying dataset initializer argument that will set the filepath. If unspecified, defaults to “filepath”.
Example
Example of a catalog.yml entry:
processed_images:
  type: kedro_azureml.datasets.AzureMLPipelineDataset
  root_dir: 'data/01_raw'
  dataset:
    type: pillow.ImageDataset
    filepath: 'images.png'
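To make the roles of root_dir and filepath_arg concrete, here is a minimal sketch of the prepending step described in the Args list. `prefix_filepath` is a hypothetical helper written for this illustration, not the library's implementation, and it does not model the rewrite to Azure ML mount paths that happens on remote compute.

```python
from pathlib import PurePosixPath
from typing import Any, Dict


def prefix_filepath(
    dataset_config: Dict[str, Any],
    root_dir: str = "data",
    filepath_arg: str = "filepath",
) -> Dict[str, Any]:
    """Return a copy of the config with root_dir prepended to the filepath.

    Hypothetical helper: it only illustrates the documented behaviour of
    root_dir and filepath_arg and leaves the input dictionary untouched.
    """
    updated = dict(dataset_config)
    updated[filepath_arg] = str(PurePosixPath(root_dir) / updated[filepath_arg])
    return updated


# The underlying dataset definition from the catalog entry above.
underlying = {"type": "pillow.ImageDataset", "filepath": "images.png"}
resolved = prefix_filepath(underlying, root_dir="data/01_raw")
print(resolved["filepath"])  # data/01_raw/images.png
```

If the wrapped dataset takes its location under a different argument name, that name would be passed as filepath_arg instead of the default.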
V2 SDK
Use the dataset below when you’re using Azure ML SDK v2 (fsspec-based).
✅ Can be used for both remote and local runs.
- class kedro_azureml.datasets.asset_dataset.AzureMLAssetDataset(azureml_dataset: str, dataset: str | Type[AbstractDataset] | Dict[str, Any], root_dir: str = 'data', filepath_arg: str = 'filepath', azureml_type: Literal['uri_file', 'uri_folder'] = 'uri_folder', version: Version | None = None, metadata: Dict[str, Any] | None = None)
AzureMLAssetDataset enables kedro-azureml to use azureml v2-sdk Folder/File datasets for remote and local runs.
Args
- azureml_dataset: Name of the AzureML dataset.
- dataset: Definition of the underlying dataset saved in the Folder/File dataset, e.g. Parquet, CSV etc.
- root_dir: The local folder where the dataset should be saved during local runs. Relevant for local execution via kedro run.
- filepath_arg: Filepath arg on the wrapped dataset, defaults to filepath.
- azureml_type: Either uri_folder or uri_file.
- version: Version of the AzureML dataset to be used, in Kedro format.
Example
Example of a catalog.yml entry:
my_folder_dataset:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: my_azureml_folder_dataset
  root_dir: data/01_raw/some_folder/
  versioned: True
  dataset:
    type: pandas.ParquetDataset
    filepath: "."

my_file_dataset:
  type: kedro_azureml.datasets.AzureMLAssetDataset
  azureml_dataset: my_azureml_file_dataset
  root_dir: data/01_raw/some_other_folder/
  versioned: True
  dataset:
    type: pandas.ParquetDataset
    filepath: "companies.csv"
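Neither entry above sets azureml_type, so going by the class signature both would fall back to uri_folder. The sketch below makes that visible by merging the documented defaults into an entry. `DOCUMENTED_DEFAULTS` and `with_documented_defaults` are hypothetical names used only here. Whether a file asset needs an explicit azureml_type: uri_file is an assumption to check against your workspace, not something this page states.

```python
from typing import Any, Dict

# Defaults copied from the AzureMLAssetDataset signature shown in this section.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "root_dir": "data",
    "filepath_arg": "filepath",
    "azureml_type": "uri_folder",
}


def with_documented_defaults(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in documented defaults for missing keys."""
    return {**DOCUMENTED_DEFAULTS, **entry}


# The my_file_dataset entry above, written as a Python dictionary.
my_file_dataset = {
    "type": "kedro_azureml.datasets.AzureMLAssetDataset",
    "azureml_dataset": "my_azureml_file_dataset",
    "root_dir": "data/01_raw/some_other_folder/",
    "versioned": True,
    "dataset": {"type": "pandas.ParquetDataset", "filepath": "companies.csv"},
}

effective = with_documented_defaults(my_file_dataset)
print(effective["azureml_type"])  # uri_folder
print(effective["root_dir"])      # data/01_raw/some_other_folder/
```

Values given explicitly in the entry, such as root_dir here, take precedence over the defaults.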
- property azure_config: AzureMLConfig
AzureML config to be used by the dataset.
V1 SDK
Use the datasets below when you’re using Azure ML SDK v1 (direct Python).
⚠️ Deprecated - will be removed in a future version of kedro-azureml.
- class kedro_azureml.datasets.AzureMLPandasDataset(*args, **kwargs)
- class kedro_azureml.datasets.AzureMLFileDataset(*args, **kwargs)