- class ibm_watson_machine_learning.helpers.connections.connections.DataConnection(location: S3Location | FSLocation | AssetLocation | CP4DAssetLocation | WMLSAssetLocation | WSDAssetLocation | CloudAssetLocation | NFSLocation | DeploymentOutputAssetLocation | ConnectionAssetLocation | DatabaseLocation | ContainerLocation | None = None, connection: S3Connection | NFSConnection | ConnectionAsset | None = None, data_join_node_name: str | List[str] | None = None, data_asset_id: str | None = None, connection_asset_id: str | None = None, **kwargs)[source]¶
Data Storage Connection class needed for WML training metadata (input data).
connection (NFSConnection or ConnectionAsset, optional) – connection parameters of specific type
location (Union[S3Location, FSLocation, AssetLocation]) – required location parameters of specific type
data_join_node_name (None or str or list[str], optional) –
name(s) for node(s):
None - data file name will be used as node name
str - it will become the node name
list[str] - multiple names passed, several nodes will have the same data connection (used for excel files with multiple sheets)
data_asset_id (str, optional) – data asset ID if the DataConnection should point to a data asset
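The three accepted forms of data_join_node_name can be illustrated with a small stand-in helper (hypothetical, not part of the library) that mirrors the documented rules:

```python
from typing import List, Optional, Union


def resolve_node_names(
    data_join_node_name: Union[None, str, List[str]],
    data_file_name: str,
) -> List[str]:
    """Mirror the documented rules: None -> the data file name is used,
    str -> it becomes the node name, list[str] -> several nodes share
    the same data connection (e.g. excel files with multiple sheets)."""
    if data_join_node_name is None:
        return [data_file_name]
    if isinstance(data_join_node_name, str):
        return [data_join_node_name]
    return list(data_join_node_name)
```

For example, `resolve_node_names(["sheet1", "sheet2"], "data.xlsx")` yields two node names backed by one connection, matching the list[str] case above.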
- classmethod from_studio(path: str) List[DataConnection] [source]¶
Create DataConnections from the credentials stored (connected) in Watson Studio. Only for COS.
path (str) – path in COS bucket to the training dataset
list with DataConnection objects
- Return type:
List[DataConnection]
```python
data_connections = DataConnection.from_studio(path='iris_dataset.csv')
```
- read(with_holdout_split: bool = False, csv_separator: str = ',', excel_sheet: str | int | None = None, encoding: str | None = 'utf-8', raw: bool | None = False, binary: bool | None = False, read_to_file: str | None = None, number_of_batch_rows: int | None = None, sampling_type: str | None = None, sample_size_limit: int | None = None, sample_rows_limit: int | None = None, sample_percentage_limit: float | None = None, **kwargs) DataFrame | Tuple[DataFrame, DataFrame] | bytes [source]¶
Download the dataset stored in remote data storage. Returns a batch of up to 1 GB.
with_holdout_split (bool, optional) – if True, the data will be split into train and holdout datasets, as it was during AutoAI training
csv_separator (str, optional) – separator / delimiter for CSV file
excel_sheet (str, optional) – excel sheet name to use; only applicable when the input is an xlsx file. Passing the sheet number is deprecated.
encoding (str, optional) – encoding type of the CSV
raw (bool, optional) – if False, simple data preprocessing is applied (the same as in the backend); if True, the data is not preprocessed
binary (bool, optional) – if True, data is retrieved in binary mode and the result is a Python bytes object
read_to_file (str, optional) – stream the data to a file at the path given as the value of this parameter; use this parameter to avoid keeping the data in memory
number_of_batch_rows (int, optional) – number of rows to read in each batch when reading from flight connection
sampling_type (str, optional) – sampling strategy used when reading the data
sample_size_limit (int, optional) – upper limit for overall data that should be downloaded in bytes, default: 1 GB
sample_rows_limit (int, optional) – upper limit for overall data that should be downloaded in number of rows
sample_percentage_limit (float, optional) – upper limit on the downloaded data as a fraction of the whole dataset; must be a float between 0 and 1. Ignored when the sampling_type parameter is set to first_n_records.
If more than one of: sample_size_limit, sample_rows_limit, sample_percentage_limit are set, then downloaded data is limited to the lowest threshold.
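The "lowest threshold wins" rule for the three sampling limits can be sketched with a hypothetical helper (not part of the library); avg_row_bytes, an assumed average row size, is used only to convert the byte limit into rows:

```python
from typing import Optional


def effective_rows(total_rows: int,
                   avg_row_bytes: int,
                   sample_size_limit: Optional[int] = None,
                   sample_rows_limit: Optional[int] = None,
                   sample_percentage_limit: Optional[float] = None) -> int:
    """Return the row count after applying whichever set limit is lowest."""
    candidates = [total_rows]
    if sample_size_limit is not None:
        # byte budget converted to an approximate row count
        candidates.append(sample_size_limit // avg_row_bytes)
    if sample_rows_limit is not None:
        candidates.append(sample_rows_limit)
    if sample_percentage_limit is not None:
        candidates.append(int(total_rows * sample_percentage_limit))
    return min(candidates)
```

With a 20,000-row dataset at ~100 bytes per row, setting sample_size_limit=1_000_000 (≈10,000 rows), sample_rows_limit=3_000, and sample_percentage_limit=0.5 (10,000 rows) downloads 3,000 rows, the lowest of the three thresholds.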
One of:
- pandas.DataFrame containing the dataset from remote data storage: Xy_train
- Tuple[pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, pandas.DataFrame]: X_train, X_holdout, y_train, y_holdout
- Tuple[pandas.DataFrame, pandas.DataFrame]: X_test, y_test, containing training data and holdout data from remote storage
- bytes object, auto holdout split from backend (only train data provided)
```python
train_data_connections = optimizer.get_data_connections()

data = train_data_connections[0].read()  # all train data

# or
X_train, X_holdout, y_train, y_holdout = train_data_connections[0].read(with_holdout_split=True)  # train and holdout data
```
User provided train and test data:
```python
optimizer.fit(training_data_reference=[DataConnection],
              training_results_reference=DataConnection,
              test_data_reference=DataConnection)

test_data_connection = optimizer.get_test_data_connections()
X_test, y_test = test_data_connection.read()  # only holdout data

# and
train_data_connections = optimizer.get_data_connections()
data = train_data_connections[0].read()  # only train data
```
- set_client(wml_client: APIClient)[source]¶
Set an initialized wml client in the connection to enable write/read operations with the connection to the service.
wml_client (APIClient) – WML client used to connect to the service
- class ibm_watson_machine_learning.helpers.connections.connections.S3Location(bucket: str, path: str, **kwargs)[source]¶
Connection class to COS data storage in S3 format.
bucket (str) – COS bucket name
path (str) – COS data path in the bucket
excel_sheet (str, optional) – name of the excel sheet, if the pointed dataset is an excel file; used for batch deployment scoring
model_location (str, optional) – path to the pipeline model in the COS
training_status (str, optional) – path to the training status json in COS
- class ibm_watson_machine_learning.helpers.connections.connections.DeploymentOutputAssetLocation(name: str, description: str = '')[source]¶
Connection class to data assets where output of batch deployment will be stored.
name (str) – name of the .csv file which will be saved as a data asset
description (str, optional) – description of the data asset
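Since the batch deployment output is saved as a .csv data asset, a caller might normalize the asset name before constructing the location. This is a hypothetical convenience helper, not part of the library:

```python
def normalize_output_name(name: str) -> str:
    """Ensure the batch-deployment output asset name carries a .csv suffix."""
    return name if name.lower().endswith(".csv") else f"{name}.csv"
```

For example, `normalize_output_name("predictions")` returns "predictions.csv", while names already ending in .csv (in any case) are left unchanged.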