Friday, October 27, 2023

Create, train, and deploy Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store


Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.

Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), K-Means, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren't always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.

In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can't be shared across the organization for regulatory and compliance reasons.

We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features, and how to use these models to make in-database predictions on new data for use cases such as fraud risk scoring.

Overview of solution

For this post, we create an ML model to predict if a transaction is fraudulent or not, given the transaction record. To build this, we need to engineer features that describe an individual credit card's spending pattern, such as the number of transactions or the average transaction amount, and also information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.

The following diagram illustrates the solution architecture.

The workflow contains the following steps:

  1. Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
  2. Create a Redshift table and load local feature data into the table.
  3. Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
  4. Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
  5. Use the offline feature store and local store for inference.

Dataset

To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. They can both be joined by the TransactionID column. The transaction table contains information about a particular transaction, such as amount and credit or debit card, and the identity table contains information about the user, such as device type and browser. A transaction must exist in the transaction table, but might not always be available in the identity table.
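
Because not every transaction has a matching identity record, the two tables relate through a left join on TransactionID. The following toy sketch (made-up miniature tables, not the actual dataset) illustrates that join behavior in pandas:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [50.0, 20.0, 99.9]}
)
identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# Left join: every transaction survives, even without an identity record
joined = transactions.merge(identity, on="TransactionID", how="left")
print(int(joined["DeviceType"].isna().sum()))  # → 1 (transaction 2 has no identity row)
```

This is the same semantics the Redshift view later applies with LEFT JOIN.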

The following is an example of the transactions dataset.

The following is an example of the identity dataset.

Let's assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table, owned by them.

We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join them with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.

Create the offline feature group and ingest data

To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store them in Amazon S3 for later loading into the Redshift table.

Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.

Let's explore some of the key steps in this section.

  1. Download the sample notebook.
  2. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  3. Locate your notebook instance and choose Open Jupyter.
  4. Choose Upload and upload the notebook you just downloaded.
  5. Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
  6. Follow the instructions and run all the cells up to the Cleanup Resources section.

The following are key steps from the notebook:

  1. We create a Pandas DataFrame with the initial CSV data and apply feature transformations for this dataset.
    identity_data = pd.read_csv(io.BytesIO(identity_data_object["Body"].read()))
    transaction_data = pd.read_csv(io.BytesIO(transaction_data_object["Body"].read()))
    
    identity_data = identity_data.round(5)
    transaction_data = transaction_data.round(5)
    
    identity_data = identity_data.fillna(0)
    transaction_data = transaction_data.fillna(0)
    
    # Feature transformations for this dataset are applied 
    # One-hot encode card4, card6
    encoded_card_bank = pd.get_dummies(transaction_data["card4"], prefix="card_bank")
    encoded_card_type = pd.get_dummies(transaction_data["card6"], prefix="card_type")
    
    transformed_transaction_data = pd.concat(
        [transaction_data, encoded_card_type, encoded_card_bank], axis=1
    )

  2. We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later into the Redshift table for building the local feature store.
    transformed_transaction_data.to_csv("transformed_transaction_data.csv", header=False, index=False)
    s3_client.upload_file("transformed_transaction_data.csv", default_s3_bucket_name, prefix + "/training_input/transformed_transaction_data.csv")

  3. Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID. EventTime will be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both features' data.
    # record identifier and event time feature names
    record_identifier_feature_name = "TransactionID"
    event_time_feature_name = "EventTime"
    
    # append EventTime feature
    identity_data[event_time_feature_name] = pd.Series(
        [current_time_sec] * len(identity_data), dtype="float64"
    )

  4. We then create the feature group and ingest the data into it using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and therefore can be loaded into a Pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
    identity_feature_group_name = "identity-feature-group"
    
    # load feature definitions to the feature group. SageMaker FeatureStore Python SDK
    # will auto-detect the data schema based on input data.
    identity_feature_group.load_feature_definitions(data_frame=identity_data)
    
    identity_feature_group.create(
        s3_uri=<S3_Path_Feature_Store>,
        record_identifier_name=record_identifier_feature_name,
        event_time_feature_name=event_time_feature_name,
        role_arn=<role_arn>,
        enable_online_store=False,
    )
    
    identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)
    

  5. We can verify that data has been ingested into the feature group by running Athena queries in the notebook or running queries on the Athena console.
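
As a toy illustration of the one-hot encoding applied in step 1 (made-up card6 values, not the real dataset), pd.get_dummies produces one indicator column per category:

```python
import pandas as pd

# Hypothetical card6 values; the notebook applies this to card4 and card6
toy = pd.DataFrame({"card6": ["credit", "debit", "credit"]})
encoded = pd.get_dummies(toy["card6"], prefix="card_type")

# Categories become columns, sorted alphabetically
print(list(encoded.columns))  # → ['card_type_credit', 'card_type_debit']
```

These generated columns are why the Redshift table later declares card_type_credit, card_type_debit, and the card_bank_* columns.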

At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.

Create a Redshift table and load local feature data

To build a Redshift ML model, we build a training dataset joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records is persisted in Amazon S3. The transaction data is a local feature for training data that needs to be made available in the Redshift table.

Let's explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.

  1. Create the customer_transaction table and load daily transaction data into the table, which you'll use to train the ML model:
    DROP TABLE customer_transaction;
    CREATE TABLE customer_transaction (
      TransactionID INT,    
      isFraud INT,  
      TransactionDT INT,    
      TransactionAmt decimal(10,2), 
      card1 INT,    
      card2 decimal(10,2),card3 decimal(10,2),  
      card4 VARCHAR(20),card5 decimal(10,2),    
      card6 VARCHAR(20),    
      B1 INT,B2 INT,B3 INT,B4 INT,B5 INT,B6 INT,
      B7 INT,B8 INT,B9 INT,B10 INT,B11 INT,B12 INT,
      F1 INT,F2 INT,F3 INT,F4 INT,F5 INT,F6 INT,
      F7 INT,F8 INT,F9 INT,F10 INT,F11 INT,F12 INT,
      F13 INT,F14 INT,F15 INT,F16 INT,F17 INT,  
      N1 VARCHAR(20),N2 VARCHAR(20),N3 VARCHAR(20), 
      N4 VARCHAR(20),N5 VARCHAR(20),N6 VARCHAR(20), 
      N7 VARCHAR(20),N8 VARCHAR(20),N9 VARCHAR(20), 
      card_type_0  boolean,
      card_type_credit boolean,
      card_type_debit  boolean,
      card_bank_0  boolean,
      card_bank_american_express boolean,
      card_bank_discover  boolean,
      card_bank_mastercard  boolean,
      card_bank_visa boolean  
    );

  2. Load the sample data by using the following command. Replace your Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section in the notebook or by checking the dataset_uri_prefix in the notebook.
    COPY customer_transaction
    FROM '<s3path>/transformed_transaction_data.csv' 
    IAM_ROLE default delimiter ',' 
    region 'your-region';

Now that we have created a local feature store for the transaction data, we focus on integrating the centralized feature store with Amazon Redshift to access the identity data.

Create an external schema for Redshift Spectrum to access the offline store data

We have created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.

This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.

  1. First, check that the identity-feature-group table is present in the Data Catalog under the sagemaker_featurestore database.
  2. Using Redshift Query Editor V2, create an external schema using the following command:
    CREATE EXTERNAL SCHEMA sagemaker_featurestore
    FROM DATA CATALOG
    DATABASE 'sagemaker_featurestore'
    IAM_ROLE default
    create external database if not exists;

All the tables, including the identity-feature-group external table, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can check the contents of the external schema.

  1. Run the following query to sample a few records—note that your table name may be different:
    Select * from sagemaker_featurestore.identity_feature_group_1680208535 limit 10;

  2. Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name to match your external table name:
    create or replace view public.credit_fraud_detection_v
    AS select  "isfraud",
            "transactiondt",
            "transactionamt",
            "card1","card2","card3","card5",
             case when "card_type_credit" = 'False' then 0 else 1 end as card_type_credit,
             case when "card_type_debit" = 'False' then 0 else 1 end as card_type_debit,
             case when "card_bank_american_express" = 'False' then 0 else 1 end as card_bank_american_express,
             case when "card_bank_discover" = 'False' then 0 else 1 end as card_bank_discover,
             case when "card_bank_mastercard" = 'False' then 0 else 1 end as card_bank_mastercard,
             case when "card_bank_visa" = 'False' then 0 else 1 end as card_bank_visa,
            "id_01","id_02","id_03","id_04","id_05"
    from public.customer_transaction ct left join sagemaker_featurestore.identity_feature_group_1680208535 id
    on id.transactionid = ct.transactionid with no schema binding;
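
The CASE expressions in the view map the string flags ('False'/'True') surfaced through Spectrum to integers. A minimal Python equivalent of that mapping (a hypothetical helper, shown only for illustration):

```python
def flag_to_int(value: str) -> int:
    """Mimics: case when col = 'False' then 0 else 1 end"""
    return 0 if value == "False" else 1

print([flag_to_int(v) for v in ["False", "True", "False"]])  # → [0, 1, 0]
```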

Train and validate the fraud risk scoring ML model

Redshift ML gives you the flexibility to specify your own algorithms and model types and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type of XGBOOST. By turning AUTO OFF and using XGBoost, we are providing the necessary inputs for SageMaker to train the model. A benefit of this can be faster training times. XGBoost is an open-source implementation of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.

We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplementing data for training requests. Note that you will need to provide an S3 bucket name in the create model statement. It will take approximately 10 minutes to create the model.

CREATE MODEL frauddetection_xgboost
FROM (select  "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05"
from credit_fraud_detection_v where transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetection_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS (S3_BUCKET <s3_bucket>);
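
The WHERE clause in the statement realizes the time-based 80/20 split described earlier. The same idea, sketched in pandas on made-up rows (the cutoff matches the statement; the data is invented):

```python
import pandas as pd

CUTOFF = 12517618  # same cutoff used in the CREATE MODEL statement

# Hypothetical transaction timestamps
df = pd.DataFrame({"transactiondt": [100, 12517617, 12517618, 20000000]})

train = df[df["transactiondt"] < CUTOFF]   # rows fed to CREATE MODEL
test = df[df["transactiondt"] >= CUTOFF]   # rows held out for inference

print(len(train), len(test))  # → 2 2
```

Splitting on a timestamp rather than randomly avoids leaking future transactions into training, which matters for fraud detection.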

When you run the create model command, it completes quickly in Amazon Redshift while the model training happens in the background using SageMaker. You can check the status of the model by running a show model command:

show model frauddetection_xgboost;

The output of the show model command shows that the model state is TRAINING. It also shows other information such as the model type and the training job name that SageMaker assigned.
After a few minutes, we run the show model command again:

show model frauddetection_xgboost;

Now the output shows the model state is READY. We can also see the train:error score here, which at 0 tells us we have a good model. Now that the model is trained, we can use it for running inference queries.

Use the offline feature store and local store for inference

We can use the SQL function to apply the ML model to data in queries, reports, and dashboards. Let's use the function ml_fn_frauddetection_xgboost created by our model against our test dataset by filtering where transactiondt >= 12517618, to predict whether a transaction is fraudulent or not. SageMaker Feature Store can be useful in supplementing data for inference requests.

Run the following query to predict whether transactions are fraudulent or not:

choose  "isfraud" as "Precise",
        ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") as "Predicted"
from credit_fraud_detection_v where transactiondt >= 12517618;

For binary and multi-class classification problems, we compute the accuracy as the model metric. Accuracy can be calculated based on the following:

accuracy = (sum (actual == predicted)/total) * 100

Let's apply the preceding formula to our use case to find the accuracy of the model. We use the test data (transactiondt >= 12517618) to test the accuracy, and use the newly created function ml_fn_frauddetection_xgboost to predict, taking the columns other than the target and label as the input:

-- check accuracy 
WITH infer_data AS (
SELECT "isfraud" AS label,
ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS predicted,
CASE 
   WHEN label IS NULL
       THEN 0
   ELSE label
   END AS actual,
CASE 
   WHEN actual = predicted
       THEN 1::INT
   ELSE 0::INT
   END AS correct
FROM credit_fraud_detection_v where transactiondt >= 12517618),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data) 

SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;
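
The accuracy query mirrors a simple calculation, sketched here in plain Python on hypothetical label and prediction lists:

```python
actual = [0, 1, 0, 0, 1]     # hypothetical true labels
predicted = [0, 1, 1, 0, 1]  # hypothetical model output

# count matches, then divide by the total, as in the SQL above
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual) * 100
print(accuracy)  # → 80.0
```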

Clean up

As a final step, clean up the resources:

  1. Delete the Redshift cluster.
  2. Run the Cleanup Resources section of your notebook.

Conclusion

Redshift ML enables you to bring machine learning to your data, powering fast and informed decision-making. SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams.

In this post, we showed how you can train an XGBoost model using Redshift ML with data spread across SageMaker Feature Store and a Redshift table. Additionally, we showed how you can make inferences on a trained model to detect fraud using Amazon Redshift SQL commands.


About the authors

Anirban Sinha is a Senior Technical Account Manager at AWS. He is passionate about building scalable data warehouses and big data solutions, working closely with customers. He works with large ISV customers to help them build and operate secure, resilient, scalable, and high-performance SaaS applications in the cloud.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.

Gaurav Singh is a Senior Solutions Architect at AWS, specializing in AI/ML and Generative AI. Based in Pune, India, he focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. In his spare time, Gaurav loves to explore nature, read, and run.
