Amazon Redshift is a fast, petabyte-scale cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.
Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), KMEANS, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren't always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.
In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can't be shared across the organization for regulatory and compliance reasons.
We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features, and then use these models to make in-database predictions on new data for use cases such as fraud risk scoring.
Solution overview
For this post, we create an ML model to predict whether a transaction is fraudulent, given the transaction record. To build this, we need to engineer features that describe an individual credit card's spending pattern, such as the number of transactions or the average transaction amount, as well as information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.
To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.
The following diagram illustrates the solution architecture.
The workflow contains the following steps:
- Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
- Create a Redshift table and load local feature data into the table.
- Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
- Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
- Use the offline feature store and local store for inference.
Dataset
To demonstrate this use case, we use a synthetic dataset with two tables: `identity` and `transactions`. They can both be joined on the `TransactionID` column. The `transactions` table contains information about a particular transaction, such as the amount and whether a credit or debit card was used, and the `identity` table contains information about the user, such as device type and browser. A transaction must exist in the `transactions` table, but might not always be available in the `identity` table.
The following is an example of the transactions dataset.
The following is an example of the identity dataset.
Let's assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table that they own.
We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join them with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.
Create the offline feature group and ingest data
To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store them in Amazon S3 for later loading into the Redshift table.
Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.
Let's explore some of the key steps in this section.
- Download the sample notebook.
- On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
- Locate your notebook instance and choose Open Jupyter.
- Choose Upload and upload the notebook you just downloaded.
- Open the notebook `sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb`.
- Follow the instructions and run all the cells up to the Cleanup Resources section.
The following are key steps from the notebook:
- We create a pandas DataFrame from the initial CSV data. We apply feature transformations for this dataset.
- We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later into the Redshift table for building the local feature store.
- Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is `TransactionID`. An `EventTime` column can be appended to your data when no timestamp is available. In the notebook code, you can see how these variables are set, and then `EventTime` is appended to both features' data.
- We then create the feature group and ingest the data into it using the SageMaker SDK `FeatureGroup.ingest` API. This is a small dataset and can therefore be loaded into a pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
- We can verify that data has been ingested into the feature group by running Athena queries in the notebook or by running queries on the Athena console.
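The `EventTime` step above can be sketched as follows. The column values are hypothetical (the real notebook loads them from CSV), and the `FeatureGroup.ingest` call is shown commented out because it requires a configured AWS session and an existing feature group:

```python
import time

import pandas as pd

# Hypothetical identity records; the notebook loads these from the CSV files.
identity_df = pd.DataFrame(
    {
        "TransactionID": [2987004, 2987008],
        "DeviceType": ["mobile", "desktop"],  # assumed column name/values
    }
)

# Record identifier and event time feature names required by Feature Store.
record_identifier_feature_name = "TransactionID"
event_time_feature_name = "EventTime"

# Append EventTime, since the raw data has no timestamp column.
current_time_sec = int(round(time.time()))
identity_df[event_time_feature_name] = pd.Series(
    [current_time_sec] * len(identity_df), dtype="float64"
)

print(identity_df.columns.tolist())

# With an AWS session configured, ingestion would look roughly like:
# from sagemaker.feature_store.feature_group import FeatureGroup
# identity_fg = FeatureGroup(name="identity-feature-group",
#                            sagemaker_session=sagemaker_session)
# identity_fg.ingest(data_frame=identity_df, max_workers=3, wait=True)
```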
At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.
Create a Redshift table and load local feature data
To build a Redshift ML model, we build a training dataset by joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records is persisted in Amazon S3. The transaction data is local feature data for training that must be made available in the Redshift table.
Let's explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.
- Create the `customer_transaction` table and load daily transaction data into the table, which you'll use to train the ML model:
- Load the sample data using the following command. Replace your Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section of the notebook or by checking the `dataset_uri_prefix` value in the notebook.
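The table definition and load can be sketched as follows. The column list is illustrative rather than the full transaction schema, and the S3 path, IAM role, and Region are placeholders you must replace:

```sql
-- Sketch only: an assumed subset of the transaction columns.
CREATE TABLE customer_transaction (
    transactionid   INT8 NOT NULL,
    transactiondt   INT8,
    transactionamt  NUMERIC(18, 2),
    card_type       VARCHAR(20),
    isfraud         INT2
);

-- Replace the bucket/prefix (see dataset_uri_prefix in the notebook),
-- the IAM role ARN, and the Region with your own values.
COPY customer_transaction
FROM 's3://<offline-store-bucket>/<prefix>/transaction.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
CSV IGNOREHEADER 1
REGION 'us-east-1';
```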
Now that we have created a local feature store for the transaction data, we focus on integrating the centralized feature store with Amazon Redshift to access the identity data.
Create an external schema for Redshift Spectrum to access the offline store data
We've created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and `customer_transaction`, joining on the `TransactionID` column.
This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.
- First, confirm that the `identity-feature-group` table is present in the Data Catalog under the `sagemaker_featurestore` database.
- Using Redshift Query Editor V2, create an external schema using the following command:
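A minimal version of that command, assuming the Data Catalog database is named `sagemaker_featurestore` and using a placeholder IAM role, might look like this:

```sql
-- Map the Glue Data Catalog database backing the offline store
-- to an external schema in Redshift.
CREATE EXTERNAL SCHEMA sagemaker_featurestore
FROM DATA CATALOG
DATABASE 'sagemaker_featurestore'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-spectrum-role>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```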
All the tables, including the `identity-feature-group` external table, are visible under the `sagemaker_featurestore` external schema. In Redshift Query Editor V2, you can verify the contents of the external schema.
- Run the following query to sample a few records; note that your table name may be different:
- Create a view to join the latest records from `identity-feature-group` and `customer_transaction` on the `TransactionID` column. Be sure to change the external table name to match your external table name:
Train and validate the fraud risk scoring ML model
Redshift ML gives you the flexibility to specify your own algorithms and model types, and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type of XGBOOST. By turning AUTO OFF and using XGBoost, we provide the necessary inputs for SageMaker to train the model. A benefit of this can be faster training times. XGBoost is an open-source implementation of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.
We train the model using 80% of the dataset by filtering on `transactiondt < 12517618`. The other 20% will be used for inference. A centralized feature store is helpful in providing the latest supplemental data for training requests. Note that you need to provide an S3 bucket name in the CREATE MODEL statement. It will take approximately 10 minutes to create the model.