sagemaker-python-sdk

[Feature Store] Feature group creation: provide a DataCatalogConfig while enabling glue table creation

Open simonvdk opened this issue 3 years ago • 6 comments

Use case: Create a feature group with automatic Glue table creation for the offline store metadata, while configuring the Glue data catalog database and table names.

Issue encountered: It seems that providing a DataCatalogConfig and setting disable_glue_table_creation to false are mutually exclusive:

  • I can either not configure the glue database and table names and enable the glue table creation, so that the glue table with the default name and database is created upon feature group creation
  • OR I can provide a DataCatalogConfig but then I have to disable the glue table creation, so that the requested glue table is not created upon feature group creation

But I cannot provide a DataCatalogConfig and enable the glue table creation. Error encountered:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: DataCatalogConfig is not permitted in the request unless AutoCreateGlueTable is turned off. Please either set AutoCreateGlueTable to false or remove DataCatalogConfig from the request.
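The rule stated in this error can be checked on the client before the request is sent. The following is a minimal sketch based only on the behaviour reported in this issue; `check_offline_store_config` is a hypothetical helper, not part of any AWS SDK, and it assumes that an omitted `DisableGlueTableCreation` means automatic creation is enabled.

```python
def check_offline_store_config(offline_store_config: dict) -> None:
    """Raise ValueError for the combination the service is reported to reject.

    Hypothetical helper: it mirrors the ValidationException quoted above,
    it does not call AWS.
    """
    has_catalog = "DataCatalogConfig" in offline_store_config
    # Assumption: when the flag is omitted, Glue table auto-creation is on.
    auto_create = not offline_store_config.get("DisableGlueTableCreation", False)
    if has_catalog and auto_create:
        raise ValueError(
            "DataCatalogConfig requires DisableGlueTableCreation=True; "
            "either disable Glue table creation or remove DataCatalogConfig."
        )


# Accepted: default catalog names, table auto-created.
check_offline_store_config({"S3StorageConfig": {"S3Uri": "s3://my_bucket/my_prefix"}})

# Accepted: custom catalog names, but the table is NOT created by the service.
check_offline_store_config(
    {
        "S3StorageConfig": {"S3Uri": "s3://my_bucket/my_prefix"},
        "DataCatalogConfig": {"TableName": "my_table", "Catalog": "account_id", "Database": "my_db"},
        "DisableGlueTableCreation": True,
    }
)
```

Calling the helper with both a `DataCatalogConfig` and `DisableGlueTableCreation` set to false raises `ValueError`, which matches the failure described here.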

Why this seems to be an issue:

  • this behaviour (mutually exclusive) is not mentioned in the documentation. Also, there is no further mention or example of how to configure the offline store data catalog in the documentation
  • given the current state of the documentation, a user may want to configure the name of the glue database and table where the offline store metadata will be stored, while benefiting from the glue table creation upon feature group creation (with all the configuration - schema, storage descriptor etc - coming from the feature group information)
  • this extract from the Java SDK documentation seems to indicate that the DataCatalogConfig should not be mutually exclusive with the automatic table creation

Ways to reproduce issue: Reproduced with AWS SDK (2.50.0) and AWS CLI. Providing an OfflineStoreConfig with both DisableGlueTableCreation=False and a DataCatalogConfig that names an existing Glue database and a Glue table that does not yet exist raises the above error. Providing the DataCatalogConfig with DisableGlueTableCreation=True does not raise, but the Glue table is not created either.

Example with AWS CLI:

aws sagemaker create-feature-group --cli-input-json '{"EventTimeFeatureName": "timestamp", "Description": "", "RecordIdentifierFeatureName": "record_id", "FeatureDefinitions": [{"FeatureName": "record_id", "FeatureType": "Integral"}, {"FeatureName": "timestamp", "FeatureType": "String"}], "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": "s3://my_bucket/my_prefix", "KmsKeyId": "arn:aws:kms:region:account_id:key/key_id"}, "DataCatalogConfig": {"TableName": "my_table", "Catalog": "account_id", "Database": "my_db"}, "DisableGlueTableCreation": false}, "FeatureGroupName": "my-feature-group"}'
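For comparison, the two `OfflineStoreConfig` variants that the service is reported to accept can be generated programmatically. This is a sketch under the assumptions of this issue: `build_offline_store_config` is a hypothetical helper, and the bucket, database and table names are the placeholders from the command above.

```python
import json


def build_offline_store_config(s3_uri, data_catalog=None):
    """Build an OfflineStoreConfig dict that avoids the rejected combination.

    Hypothetical helper: without a data catalog the Glue table is auto-created
    with default names; with one, auto-creation is switched off.
    """
    config = {"S3StorageConfig": {"S3Uri": s3_uri}}
    if data_catalog is None:
        config["DisableGlueTableCreation"] = False
    else:
        config["DataCatalogConfig"] = dict(data_catalog)
        config["DisableGlueTableCreation"] = True
    return config


default_variant = build_offline_store_config("s3://my_bucket/my_prefix")
custom_variant = build_offline_store_config(
    "s3://my_bucket/my_prefix",
    {"TableName": "my_table", "Catalog": "account_id", "Database": "my_db"},
)

# Either dict can be placed under "OfflineStoreConfig" in the --cli-input-json payload.
print(json.dumps(custom_variant, indent=2))
```

With the second variant the request is accepted, but as noted above the Glue table then has to be created separately.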

Expected output: Clearer documentation on how to configure the offline store data catalog (e.g. with an example in a notebook), and ideally the ability to configure the data catalog while still benefiting from the automatic Glue table creation.

NB: A similar issue has been opened on the aws-cli repository

simonvdk avatar Feb 08 '22 20:02 simonvdk

so is this supported or not?

clausagerskov avatar Oct 26 '22 14:10 clausagerskov

@clausagerskov Apologies for the delay.

> It seems that providing a DataCatalogConfig and setting disable_glue_table_creation to false are mutually exclusive:

Your conclusions are correct. Currently, we do not allow customers to provide DataCatalogConfig if they want Glue table to be auto-created. I'll discuss with the service team about the feasibility of supporting this.

In the meantime, we'll be updating both API documentation and notebook examples to make service expectations clear. Thank you for bringing this to our attention.

psnilesh avatar Jan 30 '23 04:01 psnilesh

Thanks @simonvdk for the explanation of details. I'm running into the same problem. Basically I want to specify my database and table when the Feature Group is created AND I want the Glue Catalog table created for me.

So currently what's the workaround?

  • Create the feature group
  • Manually create the Glue Catalog Table

If I manually create the glue table.. I'm guessing I can use the Feature Definitions within the Feature Group to 'help me' create the types for the glue table.

Here's pseudo code of how I might approach this.. I'm happy for someone to suggest an easier/better alternative. :)

from sagemaker.feature_store.feature_group import FeatureGroup

my_feature_group = FeatureGroup(name=self.output_uuid, sagemaker_session=self.sm_session)
my_feature_group.load_feature_definitions(data_frame=self.input_df)
my_feature_group.create(
    s3_uri=s3_storage_path,
    record_identifier_name=self.id_column,
    event_time_feature_name=self.event_time_column,
    role_arn=self.sageworks_role_arn,
    enable_online_store=True,
    data_catalog_config=my_config,  # DataCatalogConfig naming the target database and table
    disable_glue_table_creation=True,  # required whenever data_catalog_config is passed
)

<grab the feature definitions>
my_feature_group.feature_definitions
Out[4]: 
[FeatureDefinition(feature_name='id', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='name', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='age', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='score', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='date', feature_type=<FeatureTypeEnum.STRING: 'String'>)]

<take the above output, extract names, types, and fill in 'StorageDescriptor':  'Columns'>
<manually create Glue catalog table with boto3>

boto3 create table docs

Yes? .. this seems like a lot of work just so that we can place Feature Groups where we want them...
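The "extract names and types, then fill in the columns" step could look roughly like the sketch below. It is not an official recipe: `build_glue_table_input` and the type mapping (Integral to bigint, Fractional to double, String to string) are assumptions, as are the Parquet storage settings and the `location` argument, which would have to point at the S3 path where the offline store actually writes its data. Feature definitions are taken as plain dicts in the `FeatureName` / `FeatureType` shape used by the CreateFeatureGroup request.

```python
# Assumed mapping from Feature Store feature types to Glue/Hive column types.
GLUE_TYPE_BY_FEATURE_TYPE = {
    "Integral": "bigint",
    "Fractional": "double",
    "String": "string",
}


def build_glue_table_input(table_name, location, feature_definitions):
    """Build a TableInput dict for the Glue CreateTable API (hypothetical helper)."""
    columns = [
        {"Name": fd["FeatureName"], "Type": GLUE_TYPE_BY_FEATURE_TYPE[fd["FeatureType"]]}
        for fd in feature_definitions
    ]
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": columns,
            "Location": location,
            # Assumption: the offline store data is read as Parquet.
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


table_input = build_glue_table_input(
    "my_table",
    "s3://my_bucket/my_prefix",
    [
        {"FeatureName": "id", "FeatureType": "Integral"},
        {"FeatureName": "name", "FeatureType": "String"},
        {"FeatureName": "score", "FeatureType": "Fractional"},
    ],
)
print(table_input["StorageDescriptor"]["Columns"])
```

The resulting dict could then be passed to the boto3 Glue client as `create_table(DatabaseName="my_db", TableInput=table_input)`. The table that Feature Store creates automatically may also contain extra metadata columns and partitions that this sketch does not reproduce.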

brifordwylie avatar Mar 19 '23 15:03 brifordwylie

We are facing something similar, but now by using Iceberg table types. If we set:

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.inputs import DataCatalogConfig
from sagemaker.feature_store.inputs import TableFormatEnum

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-feature-store'

customers_df = pd.read_csv('.././data/transformed/customers.csv')
customers_feature_group_name = 'customers-whatever'
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
customers_feature_group.load_feature_definitions(data_frame=customers_df)

customers_feature_group.create(
    s3_uri=f's3://{default_bucket}/{prefix}', 
    record_identifier_name='customer_id', 
    event_time_feature_name='event_time', 
    role_arn=role, 
    enable_online_store=True,
    table_format=TableFormatEnum.ICEBERG,
    disable_glue_table_creation=True,
    data_catalog_config=DataCatalogConfig(
        catalog='AwsDataCatalog',
        database='dev_engineering_provisioned',
        table_name='customers_feature_group'
    )
)

we get:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: Iceberg table format is only supported when DisableGlueTableCreation is turned off. Please either set DisableGlueTableCreation to false or use Default table format.

And then, if we indeed turn it off (i.e., changing disable_glue_table_creation=True to disable_glue_table_creation=False), we end up with the error described in this issue:

An error occurred (ValidationException) when calling the CreateFeatureGroup operation: Validation Error: DataCatalogConfig is not permitted in the request unless AutoCreateGlueTable is turned off. Please either set AutoCreateGlueTable to false or remove DataCatalogConfig from the request.

Which is surprising, because according to the documentation, the AutoCreateGlueTable parameter named in this error does not exist.

In our case, we don't want the table created in the default database under the default name, but in a specific database with a specific table name.
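Taken together, the two validation errors quoted in this comment leave no valid setting of the flag when Iceberg is combined with a custom DataCatalogConfig. The sketch below encodes only those two reported rules; `violations` is a hypothetical helper and the string `"Iceberg"` is used as an illustrative label for the table format.

```python
def violations(table_format, disable_glue_table_creation, has_data_catalog_config):
    """Return the reported validation rules that a combination would break."""
    problems = []
    if table_format == "Iceberg" and disable_glue_table_creation:
        problems.append("Iceberg requires DisableGlueTableCreation=False")
    if has_data_catalog_config and not disable_glue_table_creation:
        problems.append("DataCatalogConfig requires DisableGlueTableCreation=True")
    return problems


# Try both values of the flag for Iceberg with a custom data catalog.
for disable in (True, False):
    print(f"disable_glue_table_creation={disable}:", violations("Iceberg", disable, True))
```

Each value of the flag breaks one of the two rules, so under the behaviour reported here an Iceberg feature group can only use the default database and table name.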

toderesa97 avatar May 16 '23 14:05 toderesa97

Hello! I'm facing the same problem, is there a solution for this?

francescocamussoni avatar Jun 12 '23 14:06 francescocamussoni

I encountered the same issue. Due to data governance protocols, I cannot place the features in the default database (sagemaker_featurestore), so I would like to select a different database. However, I ran into the same problems described by the other users in this thread. A prompt solution would be greatly appreciated!

yanivg10 avatar Aug 01 '23 00:08 yanivg10