Construct environment friendly ETL pipelines with AWS Step Features distributed map and redrive characteristic

AWS Step Features is a completely managed visible workflow service that lets you construct advanced knowledge processing pipelines involving a various set of extract, remodel, and cargo (ETL) applied sciences similar to AWS Glue, Amazon EMR, and Amazon Redshift. You may visually construct the workflow by wiring particular person knowledge pipeline duties and configuring payloads, retries, and error dealing with with minimal code.

Whereas Step Features helps computerized retries and error dealing with when knowledge pipeline duties fail as a consequence of momentary or transient errors, there could be everlasting failures similar to incorrect permissions, invalid knowledge, and enterprise logic failure throughout the pipeline run. This requires you to determine the difficulty within the step, repair the difficulty and restart the workflow. Beforehand, to rerun the failed step, you wanted to restart the whole workflow from the very starting. This results in delays in finishing the workflow, particularly if it’s a posh, long-running ETL pipeline. If the pipeline has many steps utilizing map and parallel states, this additionally results in elevated price as a consequence of will increase within the state transition for operating the pipeline from the start.

Step Features now helps the flexibility so that you can redrive your workflow from a failed, aborted, or timed-out state so you’ll be able to full workflows quicker and at a decrease price, and spend extra time delivering enterprise worth. Now you’ll be able to get better from unhandled failures quicker by redriving failed workflow runs, after downstream points are resolved, utilizing the identical enter offered to the failed state.

On this put up, we present you an ETL pipeline job that exports knowledge from Amazon Relational Database Service (Amazon RDS) tables utilizing the Step Features distributed map state. Then we simulate a failure and exhibit the best way to use the brand new redrive characteristic to restart the failed activity from the purpose of failure.

Resolution overview

One of many frequent functionalities concerned in knowledge pipelines is extracting knowledge from a number of knowledge sources and exporting it to a knowledge lake or synchronizing the info to a different database. You should utilize the Step Features distributed map state to run a whole bunch of such export or synchronization jobs in parallel. Distributed map can learn hundreds of thousands of objects from Amazon Easy Storage Service (Amazon S3) or hundreds of thousands of information from a single S3 object, and distribute the information to downstream steps. Step Features runs the steps inside the distributed map as baby workflows at a most parallelism of 10,000. A concurrency of 10,000 is nicely above the concurrency supported by many different AWS companies similar to AWS Glue, which has a comfortable restrict of 1,000 job runs per job.

The pattern knowledge pipeline sources product catalog knowledge from Amazon DynamoDB and buyer order knowledge from Amazon RDS for PostgreSQL database. The information is then cleansed, reworked, and uploaded to Amazon S3 for additional processing. The information pipeline begins with an AWS Glue crawler to create the Knowledge Catalog for the RDS database. As a result of beginning an AWS Glue crawler is asynchronous, the pipeline has a wait loop to verify if the crawler is full. After the AWS Glue crawler is full, the pipeline extracts knowledge from the DynamoDB desk and RDS tables. As a result of these two steps are impartial, they’re run as parallel steps: one utilizing an AWS Lambda operate to export, remodel, and cargo the info from DynamoDB to an S3 bucket, and the opposite utilizing a distributed map with AWS Glue job sync integration to do the identical from the RDS tables to an S3 bucket. Be aware that AWS Id and Entry Administration (IAM) permissions are required for invoking an AWS Glue job from Step Features. For extra info, confer with IAM Insurance policies for invoking AWS Glue job from Step Features.

The next diagram illustrates the Step Features workflow.

There are a number of tables associated to clients and order knowledge within the RDS database. Amazon S3 hosts the metadata of all of the tables as a .csv file. The pipeline makes use of the Step Features distributed map to learn the desk metadata from Amazon S3, iterate on each single merchandise, and name the downstream AWS Glue job in parallel to export the info. See the next code:

"States": {
            "Map": {
              "Sort": "Map",
              "ItemProcessor": {
                "ProcessorConfig": {
                  "Mode": "DISTRIBUTED",
                  "ExecutionType": "STANDARD"
                },
                "StartAt": "Export knowledge for a desk",
                "States": {
                  "Export knowledge for a desk": {
                    "Sort": "Activity",
                    "Useful resource": "arn:aws:states:::glue:startJobRun.sync",
                    "Parameters": {
                      "JobName": "ExportTableData",
                      "Arguments": {
                        "--dbtable.$": "$.tables"
                      }
                    },
                    "Finish": true
                  }
                }
              },
              "Label": "Map",
              "ItemReader": {
                "Useful resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {
                  "InputType": "CSV",
                  "CSVHeaderLocation": "FIRST_ROW"
                },
                "Parameters": {
                  "Bucket": "123456789012-stepfunction-redrive",
                  "Key": "tables.csv"
                }
              },
              "ResultPath": null,
              "Finish": true
            }
          }

Conditions

To deploy the answer, you want the next conditions:

Launch the CloudFormation template

Full the next steps to deploy the answer sources utilizing AWS CloudFormation:

Select Launch Stack to launch the CloudFormation stack:
Enter a stack identify.
Choose all of the verify containers beneath Capabilities and transforms.
Select Create stack.

The CloudFormation template creates many sources, together with the next:

The information pipeline described earlier as a Step Features workflow
An S3 bucket to retailer the exported knowledge and the metadata of the tables in Amazon RDS
A product catalog desk in DynamoDB
An RDS for PostgreSQL database occasion with pre-loaded tables
An AWS Glue crawler that crawls the RDS desk and creates an AWS Glue Knowledge Catalog
A parameterized AWS Glue job to export knowledge from the RDS desk to an S3 bucket
A Lambda operate to export knowledge from DynamoDB to an S3 bucket

Simulate the failure

Full the next steps to check the answer:

On the Step Features console, select State machines within the navigation pane.
Select the workflow named ETL_Process.
Run the workflow with default enter.

Inside a number of seconds, the workflow fails on the distributed map state.

You may examine the map run errors by accessing the Step Features workflow execution occasions for map runs and baby workflows. On this instance, you’ll be able to identification the exception is because of Glue.ConcurrentRunsExceededException from AWS Glue. The error signifies there are extra concurrent requests to run an AWS Glue job than are configured. Distributed map reads the desk metadata from Amazon S3 and invokes as many AWS Glue jobs because the variety of rows within the .csv file, however AWS Glue job is about with the concurrency of three when it’s created. This resulted within the baby workflow failure, cascading the failure to the distributed map state after which the parallel state. The opposite step within the parallel state to fetch the DynamoDB desk ran efficiently. If any step within the parallel state fails, the entire state fails, as seen with the cascading failure.

Deal with failures with distributed map

By default, when a state reviews an error, Step Features causes the workflow to fail. There are a number of methods you’ll be able to deal with this failure with distributed map state:

Step Features lets you catch errors, retry errors, and fail again to a different state to deal with errors gracefully. See the next code:

Retry": [
                      {
                        "ErrorEquals": [
                          "Glue.ConcurrentRunsExceededException "
                        ],
                        "BackoffRate": 20,
                        "IntervalSeconds": 10,
                        "MaxAttempts": 3,
                        "Remark": "Exception",
                        "JitterStrategy": "FULL"
                      }
                    ]

Typically, companies can tolerate failures. That is very true if you find yourself processing hundreds of thousands of things and also you count on knowledge high quality points within the dataset. By default, when an iteration of map state fails, all different iterations are aborted. With distributed map, you’ll be able to specify the utmost variety of, or proportion of, failed gadgets as a failure threshold. If the failure is inside the tolerable degree, the distributed map doesn’t fail.
The distributed map state lets you management the concurrency of the kid workflows. You may set the concurrency to map it to the AWS Glue job concurrency. Keep in mind, this concurrency is relevant solely on the workflow execution degree—not throughout workflow executions.
You may redrive the failed state from the purpose of failure after fixing the basis explanation for the error.

Redrive the failed state

The foundation explanation for the difficulty within the pattern answer is the AWS Glue job concurrency. To deal with this by redriving the failed state, full the next steps:

On the AWS Glue console, navigate to the job named ExportsTableData.
On the Job particulars tab, beneath Superior properties, replace Most concurrency to five.

With the launch of redrive characteristic, You should utilize redrive to restart executions of customary workflows that didn’t full efficiently within the final 14 days. These embrace failed, aborted, or timed-out runs. You may solely redrive a failed workflow from the step the place it failed utilizing the identical enter because the final non-successful state. You may’t redrive a failed workflow utilizing a state machine definition that’s totally different from the preliminary workflow execution. After the failed state is redriven efficiently, Step Features runs all of the downstream duties robotically. To study extra about how distributed map redrive works, confer with Redriving Map Runs.

As a result of the distributed map runs the steps contained in the map as baby workflows, the workflow IAM execution position wants permission to redrive the map run to restart the distributed map state:

{
  "Model": "2012-10-17",
  "Assertion": [
    {
      "Effect": "Allow",
      "Action": [
        "states:RedriveExecution"
      ],
      "Useful resource": "arn:aws:states:us-east-2:123456789012:execution:myStateMachine/myMapRunLabel:*"
    }
  ]
}

You may redrive a workflow from its failed step programmatically, by way of the AWS Command Line Interface (AWS CLI) or AWS SDK, or utilizing the Step Features console, which offers a visible operator expertise.

On the Step Features console, navigate to the failed workflow you need to redrive.
On the Particulars tab, select Redrive from failure.

The pipeline now runs efficiently as a result of there may be sufficient concurrency to run the AWS Glue jobs.

To redrive a workflow programmatically from its level of failure, name the new Redrive Execution API motion. The identical workflow begins from the final non-successful state and makes use of the identical enter because the final non-successful state from the preliminary failed workflow. The state to redrive from the workflow definition and the earlier enter are immutable.

Be aware the next relating to various kinds of baby workflows:

Redrive for categorical baby workflows – For failed baby workflows which can be categorical workflows inside a distributed map, the redrive functionality ensures a seamless restart from the start of the kid workflow. This lets you resolve points which can be particular to particular person iterations with out restarting the whole map.
Redrive for traditional baby workflows – For failed baby workflows inside a distributed map which can be customary workflows, the redrive characteristic capabilities the identical method as with standalone customary workflows. You may restart the failed state inside every map iteration from its level of failure, skipping pointless steps which have already efficiently run.

You should utilize Step Features standing change notifications with Amazon EventBridge for failure notifications similar to sending an electronic mail on failure.

Clear up

To wash up your sources, delete the CloudFormation stack by way of the AWS CloudFormation console.

Conclusion

On this put up, we confirmed you the best way to use the Step Features redrive characteristic to redrive a failed step inside a distributed map by restarting the failed step from the purpose of failure. The distributed map state lets you write workflows that coordinate large-scale parallel workloads inside your serverless functions. Step Features runs the steps inside the distributed map as baby workflows at a most parallelism of 10,000, which is nicely above the concurrency supported by many AWS companies.

To study extra about distributed map, confer with Step Features – Distributed Map. To study extra about redriving workflows, confer with Redriving executions.

Concerning the Authors

Sriharsh Adari is a Senior Options Architect at Amazon Net Providers (AWS), the place he helps clients work backwards from enterprise outcomes to develop revolutionary options on AWS. Through the years, he has helped a number of clients on knowledge platform transformations throughout business verticals. His core space of experience embrace Expertise Technique, Knowledge Analytics, and Knowledge Science. In his spare time, he enjoys enjoying Tennis.

Joe Morotti is a Senior Options Architect at Amazon Net Providers (AWS), working with Enterprise clients throughout the Midwest US to develop revolutionary options on AWS. He has held a variety of technical roles and enjoys exhibiting clients the artwork of the attainable. He has attained seven AWS certification and has a ardour for AI/ML and the contact middle house. In his free time, he enjoys spending high quality time along with his household exploring new locations and overanalyzing his sports activities group’s efficiency.

Uma Ramadoss is a specialist Options Architect at Amazon Net Providers, centered on the Serverless platform. She is answerable for serving to clients design and function event-driven cloud-native functions and trendy enterprise workflows utilizing companies like Lambda, EventBridge, Step Features, and Amazon MWAA.