Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.
Amazon Redshift has added many features to enhance analytical processing, like ROLLUP, CUBE, and GROUPING SETS, which were demonstrated in the post Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS. Amazon Redshift has recently added many SQL commands and expressions. In this post, we discuss two new SQL features, the MERGE command and QUALIFY clause, which simplify data ingestion and data filtering.
One familiar task in most downstream applications is change data capture (CDC) and applying it to target tables. This task requires examining the source data to determine whether it is an update or an insert to existing target data. Without the MERGE command, you needed to test the new dataset against the existing dataset using a business key. When that didn't match, you inserted new rows in the existing dataset; otherwise, you updated existing dataset rows with new dataset values.
The MERGE command conditionally merges rows from a source table into a target table. Traditionally, this could only be achieved by using multiple insert, update, or delete statements separately. When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The merge operation reduces this risk by ensuring that all operations are performed together in a single transaction.
The QUALIFY clause filters the results of a previously computed window function according to user-specified search conditions. You can use the clause to apply filtering conditions to the result of a window function without using a subquery. This is similar to the HAVING clause, which applies a condition to further filter rows from a WHERE clause. The difference between QUALIFY and HAVING is that filtered results from the QUALIFY clause can be based on the result of running window functions on the data. You can use both the QUALIFY and HAVING clauses in a single query.
In this post, we demonstrate how to use the MERGE command to implement CDC and how to use QUALIFY to simplify validation of those changes.
Solution overview
In this use case, we have a data warehouse with a customer dimension table that always needs to receive the latest data from the source system. This data must also reflect the initial creation time and last update time for auditing and tracking purposes.
A simple way to solve this is to overwrite the customer dimension fully every day; however, that won't achieve the update tracking, which is an audit mandate, and it might not be feasible for bigger tables.
You can load sample data from Amazon S3 by following the instructions here. Using the existing customer table under sample_data_dev.tpcds, we create a customer dimension table and a source table that will contain both updates for existing customers and inserts for new customers. We use the MERGE command to merge the source table data with the target table (customer dimension). We also demonstrate how to use the QUALIFY clause to simplify validating the changes in the target table.
To follow along with the steps in this post, we recommend downloading the accompanying notebook, which contains all the scripts to run for this post. To learn about authoring and running notebooks, refer to Authoring and running notebooks.
Prerequisites
You should have the following prerequisites:
Create and populate the dimension table
We use the existing customer table under sample_data_dev.tpcds to create a customer_dimension table. Complete the following steps:
- Create a table using several selected fields, including the business key, and add a couple of maintenance fields for insert and update timestamps:
- Populate the dimension table:
- Validate the row count and the contents of the table:
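The notebook scripts are not reproduced here, but the three steps could look like the following sketch. The column selection and all names other than sample_data_dev.tpcds.customer are assumptions based on the TPC-DS customer schema, not the post's actual scripts:

```sql
-- Step 1: dimension table with selected fields plus maintenance timestamps
-- (column names assumed from the TPC-DS customer table)
CREATE TABLE customer_dimension (
  c_customer_sk    INT,
  c_customer_id    CHAR(16),   -- business key
  c_first_name     CHAR(20),
  c_last_name      CHAR(30),
  c_email_address  CHAR(50),
  record_insert_ts TIMESTAMP,  -- set on initial load
  record_update_ts TIMESTAMP   -- null until the row is updated
);

-- Step 2: populate from the sample data, stamping the insert time
INSERT INTO customer_dimension
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name,
       c_email_address, GETDATE(), NULL
FROM sample_data_dev.tpcds.customer;

-- Step 3: validate the row count and contents
SELECT COUNT(*) FROM customer_dimension;
SELECT * FROM customer_dimension LIMIT 10;
```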
Simulate customer table changes
Use the following code to simulate changes made to the table:
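A minimal way to simulate such changes, assuming a hypothetical src_customer staging table and the TPC-DS-style columns sketched earlier (the specific values are illustrative only):

```sql
-- Hypothetical staging table with the same structure as the dimension's
-- source columns (WHERE 1 = 0 copies the structure without any rows)
CREATE TABLE src_customer AS
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name,
       c_email_address
FROM customer_dimension
WHERE 1 = 0;

-- An update: a business key that already exists in the dimension,
-- carrying a changed email address
INSERT INTO src_customer VALUES
  (1, 'AAAAAAAABAAAAAAA', 'James', 'Brown', 'james.brown@new-domain.example');

-- An insert: a brand-new customer not yet in the dimension
INSERT INTO src_customer VALUES
  (99999999, 'ZZZZZZZZZZZZZZZZ', 'New', 'Customer', 'new.customer@example.com');
```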
Merge the source table into the target table
Now you have a source table with some changes that you need to merge with the customer dimension table.
Before the MERGE command, this type of task needed two separate UPDATE and INSERT commands to implement:
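A sketch of that two-statement approach, wrapped in an explicit transaction to keep the two operations consistent (table and column names are the assumed ones from earlier):

```sql
BEGIN TRANSACTION;

-- Update rows whose business key already exists in the dimension
UPDATE customer_dimension
SET c_first_name     = src.c_first_name,
    c_last_name      = src.c_last_name,
    c_email_address  = src.c_email_address,
    record_update_ts = GETDATE()
FROM src_customer AS src
WHERE customer_dimension.c_customer_sk = src.c_customer_sk;

-- Insert rows whose business key is not yet present
INSERT INTO customer_dimension
SELECT src.c_customer_sk, src.c_customer_id, src.c_first_name,
       src.c_last_name, src.c_email_address, GETDATE(), NULL
FROM src_customer AS src
WHERE src.c_customer_sk NOT IN (SELECT c_customer_sk FROM customer_dimension);

END TRANSACTION;
```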
The MERGE command uses a more straightforward syntax, in which we use the key comparison result to decide whether to perform an update DML operation (when matched) or an insert DML operation (when not matched):
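The equivalent MERGE could be sketched as follows, matching on the assumed c_customer_sk business key:

```sql
MERGE INTO customer_dimension
USING src_customer AS src
ON customer_dimension.c_customer_sk = src.c_customer_sk
-- Key found in the target: update the row and stamp the update time
WHEN MATCHED THEN UPDATE
  SET c_first_name     = src.c_first_name,
      c_last_name      = src.c_last_name,
      c_email_address  = src.c_email_address,
      record_update_ts = GETDATE()
-- Key not found: insert a new row with a fresh insert timestamp
WHEN NOT MATCHED THEN INSERT
  VALUES (src.c_customer_sk, src.c_customer_id, src.c_first_name,
          src.c_last_name, src.c_email_address, GETDATE(), NULL);
```

Both branches run as a single atomic statement, so there is no window in which the target holds a partially applied change set.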
Validate the data changes in the target table
Now we need to validate that the data has made it correctly to the target table. We can first check the updated data using the update timestamp. Because this was our first update, we can examine all rows where the update timestamp is not null:
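One way to run that check, using the assumed record_update_ts maintenance column:

```sql
-- Rows touched by the MERGE's matched (update) branch carry a
-- non-null update timestamp
SELECT *
FROM customer_dimension
WHERE record_update_ts IS NOT NULL;
```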
Use QUALIFY to simplify validation of the data changes
We need to examine the data inserted into this table most recently. One way to do that is to rank the data by its insert timestamp and get the rows with the first rank. This requires using the window function rank() and also requires a subquery to get the results.
Before the availability of QUALIFY, we needed to build that using a subquery like the following:
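A sketch of that subquery form, ranking rows by the assumed record_insert_ts column:

```sql
-- The window function must live in an inner query because it
-- cannot appear directly in a WHERE clause
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name,
       c_email_address, record_insert_ts
FROM (
  SELECT *,
         RANK() OVER (ORDER BY record_insert_ts DESC) AS insert_rank
  FROM customer_dimension
)
WHERE insert_rank = 1;
```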
The QUALIFY clause eliminates the need for the subquery, as in the following code snippet:
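The same query with QUALIFY, which lets the window-function filter sit directly in the outer query:

```sql
-- QUALIFY filters on the window function's result without a subquery
SELECT c_customer_sk, c_customer_id, c_first_name, c_last_name,
       c_email_address, record_insert_ts
FROM customer_dimension
QUALIFY RANK() OVER (ORDER BY record_insert_ts DESC) = 1;
```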
Validate all data changes
We can union the results of both queries to get all the insert and update changes:
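A sketch of such a union, labeling each row with its change type (column names are the assumed ones from earlier):

```sql
-- Updates: rows the MERGE touched via its matched branch
SELECT c_customer_sk, c_customer_id, c_email_address,
       record_insert_ts, record_update_ts, 'update' AS change_type
FROM customer_dimension
WHERE record_update_ts IS NOT NULL
UNION ALL
-- Inserts: the most recently inserted rows, filtered with QUALIFY
SELECT c_customer_sk, c_customer_id, c_email_address,
       record_insert_ts, record_update_ts, 'insert' AS change_type
FROM customer_dimension
QUALIFY RANK() OVER (ORDER BY record_insert_ts DESC) = 1;
```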
Clean up
To clean up the resources used in this post, delete the Redshift provisioned cluster or Redshift Serverless workgroup and namespace you created for this post (this will also drop all the objects created).
If you used an existing Redshift provisioned cluster or Redshift Serverless workgroup and namespace, use the following code to drop these objects:
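For example, for the hypothetical tables used in this post:

```sql
-- Drop only the objects created for this walkthrough
DROP TABLE IF EXISTS customer_dimension;
DROP TABLE IF EXISTS src_customer;
```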
Conclusion
When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The MERGE operation reduces this risk by ensuring that all operations are performed together in a single transaction. For Amazon Redshift customers who are migrating from other data warehouse systems or who regularly need to ingest fast-changing data into their Redshift warehouse, the MERGE command is a straightforward way to conditionally insert, update, and delete data in target tables based on existing and new source data.
In many analytic queries that use window functions, you may need to use those window functions in your WHERE clause as well. However, this is not permitted, and to achieve it, you have to build a subquery that contains the required window function and then use its results in the parent query's WHERE clause. The QUALIFY clause eliminates the need for that subquery and therefore simplifies the SQL statement, making it easier to write and read.
We encourage you to start using these new features and give us your feedback. For more details, refer to MERGE and QUALIFY clause.
About the authors
Yanzhu Ji is a Product Manager on the Amazon Redshift team. She has experience in product vision and strategy for industry-leading data products and platforms, and outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.
Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.
Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.