AWS Glue is a serverless knowledge integrating service that you should use to catalog knowledge and put together for analytics. With AWS Glue, you may uncover your knowledge, develop scripts to rework sources into targets, and schedule and run extract, remodel, and cargo (ETL) jobs in a serverless atmosphere. AWS Glue jobs are answerable for working the info processing logic.
One essential function of AWS Glue jobs is the power to make use of bookmark keys to course of knowledge incrementally. When an AWS Glue job is run, it reads knowledge from a knowledge supply and processes it. A number of columns from the supply desk will be specified as bookmark keys. The column ought to have sequentially growing or reducing values with out gaps. These values are used to mark the final processed file in a batch. The subsequent run of the job resumes from that time. This lets you course of giant quantities of information incrementally. With out job bookmark keys, AWS Glue jobs must reprocess all the info throughout each run. This may be time-consuming and dear. Through the use of bookmark keys, AWS Glue jobs can resume processing from the place they left off, saving time and decreasing prices.
This publish explains easy methods to use a number of columns as job bookmark keys in an AWS Glue job with a JDBC connection to the supply knowledge retailer. It additionally demonstrates easy methods to parameterize the bookmark key columns and desk names within the AWS Glue job connection choices.
This publish is concentrated in direction of architects and knowledge engineers who design and construct ETL pipelines on AWS. You might be anticipated to have a fundamental understanding of the AWS Administration Console, AWS Glue, Amazon Relational Database Service (Amazon RDS), and Amazon CloudWatch logs.
Resolution overview
To implement this answer, we full the next steps:
- Create an Amazon RDS for PostgreSQL occasion.
- Create two tables and insert pattern knowledge.
- Create and run an AWS Glue job to extract knowledge from the RDS for PostgreSQL DB occasion utilizing a number of job bookmark keys.
- Create and run a parameterized AWS Glue job to extract knowledge from totally different tables with separate bookmark keys
The next diagram illustrates the parts of this answer.
Deploy the answer
For this answer, we offer an AWS CloudFormation template that units up the companies included within the structure, to allow repeatable deployments. This template creates the next assets:
- An RDS for PostgreSQL occasion
- An Amazon Easy Storage Service (Amazon S3) bucket to retailer the info extracted from the RDS for PostgreSQL occasion
- An AWS Id and Entry Administration (IAM) position for AWS Glue
- Two AWS Glue jobs with job bookmarks enabled to incrementally extract knowledge from the RDS for PostgreSQL occasion
To deploy the answer, full the next steps:
- Select to launch the CloudFormation stack:
- Enter a stack identify.
- Choose I acknowledge that AWS CloudFormation may create IAM assets with customized names.
- Select Create stack.
- Wait till the creation of the stack is full, as proven on the AWS CloudFormation console.
- When the stack is full, copy the AWS Glue scripts to the S3 bucket
job-bookmark-keys-demo-<accountid>
. - Open AWS CloudShell.
- Run the next instructions and change
<accountid>
together with your AWS account ID:
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_1_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_1_job.py
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2907/glue/scenario_2_job.py s3://job-bookmark-keys-demo-<accountid>/scenario_2_job.py
Add pattern knowledge and run AWS Glue jobs
On this part, we connect with the RDS for PostgreSQL occasion through AWS Lambda and create two tables. We additionally insert pattern knowledge into each the tables.
- On the Lambda console, select Features within the navigation pane.
- Select the operate
LambdaRDSDDLExecute
. - Select Take a look at and select Invoke for the Lambda operate to insert the info.
The 2 tables product and deal with shall be created with pattern knowledge, as proven within the following screenshot.
Run the multiple_job_bookmark_keys AWS Glue job
We run the multiple_job_bookmark_keys
AWS Glue job twice to extract knowledge from the product desk of the RDS for PostgreSQL occasion. Within the first run, all the present information shall be extracted. Then we insert new information and run the job once more. The job ought to extract solely the newly inserted information within the second run.
- On the AWS Glue console, select Jobs within the navigation pane.
- Select the job
multiple_job_bookmark_keys
. - Select Run to run the job and select the Runs tab to watch the job progress.
- Select the Output logs hyperlink below CloudWatch logs after the job is full.
- Select the log stream within the subsequent window to see the output logs printed.
The AWS Glue job extracted all information from the supply desk product. It retains monitor of the final mixture of values within the columnsproduct_id
andmodel
.Subsequent, we run one other Lambda operate to insert a brand new file. Theproduct_id
45 already exists, however the inserted file could have a brand new model as 2, making the mix sequentially growing. - Run the
LambdaRDSDDLExecute_incremental
Lambda operate to insert the brand new file within theproduct
desk. - Run the AWS Glue job
multiple_job_bookmark_keys
once more after you insert the file and await it to succeed. - Select the Output logs hyperlink below CloudWatch logs.
- Select the log stream within the subsequent window to see solely the newly inserted file printed.
The job extracts solely these information which have a mixture higher than the beforehand extracted information.
Run the parameterised_job_bookmark_keys AWS Glue job
We now run the parameterized AWS Glue job that takes the desk identify and bookmark key column as parameters. We run this job to extract knowledge from totally different tables sustaining separate bookmarks.
The primary run shall be for the deal with desk with bookmarkkey
as address_id
. These are already populated with the job parameters.
- On the AWS Glue console, select Jobs within the navigation pane.
- Select the job
parameterised_job_bookmark_keys
. - Select Run to run the job and select the Runs tab to watch the job progress.
- Select the Output logs hyperlink below CloudWatch logs after the job is full.
- Select the log stream within the subsequent window to see all information from the deal with desk printed.
- On the Actions menu, select Run with parameters.
- Develop the Job parameters part.
- Change the job parameter values as follows:
- Key
--bookmarkkey
with worthproduct_id
- Key
--table_name
with worthproduct
- The S3 bucket identify is unchanged (
job-bookmark-keys-demo-<accountnumber>
)
- Key
- Select Run job to run the job and select the Runs tab to watch the job progress.
- Select the Output logs hyperlink below CloudWatch logs after the job is full.
- Select the log stream to see all of the information from the product desk printed.
The job maintains separate bookmarks for every of the tables when extracting the info from the supply knowledge retailer. That is achieved by including the desk identify to the job identify and transformation contexts within the AWS Glue job script.
Clear up
To keep away from incurring future prices, full the next steps:
- On the Amazon S3 console, select Buckets within the navigation pane.
- Choose the bucket with job-bookmark-keys in its identify.
- Select Empty to delete all of the information and folders in it.
- On the CloudFormation console, select Stacks within the navigation pane.
- Choose the stack you created to deploy the answer and select Delete.
Conclusion
This publish demonstrated passing multiple column of a desk as jobBookmarkKeys
in a JDBC connection to an AWS Glue job. It additionally defined how one can a parameterized AWS Glue job to extract knowledge from a number of tables whereas conserving their respective bookmarks. As a subsequent step, you may check the incremental knowledge extract by altering knowledge within the supply tables.
In regards to the Authors
Durga Prasad is a Sr Lead Advisor enabling clients construct their Knowledge Analytics options on AWS. He’s a espresso lover and enjoys enjoying badminton.
Murali Reddy is a Lead Advisor at Amazon Internet Providers (AWS), serving to clients construct and implement knowledge analytics answer. When he’s not working, Murali is an avid bike rider and loves exploring new locations.