Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake


This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services.

Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines. In this post, we discuss the following:

  • Advantages of Iceberg tables for data lakes
  • Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
    • Manage your Iceberg tables with AWS Glue Data Catalog
    • Manage your Iceberg tables with Snowflake
  • The process of converting existing data lake tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let's dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale and keeps track of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance, and is now supported by a robust community of developers focused on continually improving and adding new features to the project, serving real user needs and providing them with optionality.
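To make these features concrete, the following PySpark sketch runs a time travel query against a hypothetical Iceberg table. It assumes a Spark session whose demo catalog is already configured for Iceberg; the catalog, namespace, and table names are placeholders for illustration, not part of the original post.

    from pyspark.sql import SparkSession

    # Assumes a Spark session whose "demo" catalog is already configured as an
    # Iceberg catalog (for example, org.apache.iceberg.spark.SparkCatalog).
    spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

    # Inspect the snapshots Iceberg has recorded for a hypothetical orders table.
    spark.sql("SELECT snapshot_id, committed_at FROM demo.sales.orders.snapshots").show()

    # Query the table as it existed at an earlier point in time.
    spark.sql("SELECT * FROM demo.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()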

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.

Manage your Iceberg table with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans.
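As a rough illustration of the write side of such a pipeline, the following PySpark sketch configures an Iceberg catalog backed by AWS Glue Data Catalog and writes a table in Iceberg format. It assumes an environment with the Iceberg AWS integrations available (for example, an AWS Glue 4.0 job started with --datalake-formats iceberg); the bucket, database, and table names are placeholders.

    from pyspark.sql import SparkSession

    # Hypothetical S3 location that serves as the Iceberg warehouse.
    WAREHOUSE = "s3://my-data-lake-bucket/iceberg/"

    # Define a Spark catalog named "glue_catalog" backed by AWS Glue Data Catalog.
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl",
                "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.warehouse", WAREHOUSE)
        .config("spark.sql.catalog.glue_catalog.io-impl",
                "org.apache.iceberg.aws.s3.S3FileIO")
        .getOrCreate()
    )

    # Read raw data from a placeholder path and write it as an Iceberg table;
    # the table metadata is registered in AWS Glue Data Catalog automatically.
    df = spark.read.parquet("s3://my-raw-bucket/orders/")
    df.writeTo("glue_catalog.sales.orders").using("iceberg").createOrReplace()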

You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern.

The workflow includes the following steps:

  1. AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
  2. The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
  3. Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location (see the sketch after this list).
  4. In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
  5. Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with multiple accounts in the same Snowflake region. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.
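On the Snowflake side, steps 3 and 4 hinge on a catalog integration that points Snowflake at AWS Glue Data Catalog. The following Python sketch issues the relevant Snowflake SQL through the snowflake-connector-python package; the account details, IAM role ARN, external volume, and table names are all hypothetical, and the exact DDL options should be checked against the Snowflake documentation for your account.

    import snowflake.connector

    # Hypothetical connection details; in practice, pull credentials from a secrets manager.
    conn = snowflake.connector.connect(
        account="myorg-myaccount", user="my_user", password="...", warehouse="my_wh"
    )
    cur = conn.cursor()

    # Catalog integration that lets Snowflake read Iceberg metadata from Glue.
    cur.execute("""
        CREATE CATALOG INTEGRATION glue_catalog_int
          CATALOG_SOURCE = GLUE
          CATALOG_NAMESPACE = 'sales'
          TABLE_FORMAT = ICEBERG
          GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
          GLUE_CATALOG_ID = '123456789012'
          ENABLED = TRUE
    """)

    # Unmanaged Iceberg table: metadata tracked in Glue, data files in S3
    # (assumes an external volume named my_s3_vol was created beforehand).
    cur.execute("""
        CREATE ICEBERG TABLE orders
          EXTERNAL_VOLUME = 'my_s3_vol'
          CATALOG = 'glue_catalog_int'
          CATALOG_TABLE_NAME = 'orders'
    """)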

Manage your Iceberg table with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes the data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.
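For a sense of what this looks like in practice, the following sketch creates a Snowflake-managed Iceberg table whose data and metadata files land directly in Amazon S3, again via the snowflake-connector-python package. The external volume name, S3 path, IAM role, and table definition are assumptions for illustration.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myorg-myaccount", user="my_user", password="...", warehouse="my_wh"
    )
    cur = conn.cursor()

    # External volume mapping Snowflake to a hypothetical S3 location.
    cur.execute("""
        CREATE EXTERNAL VOLUME my_iceberg_vol
          STORAGE_LOCATIONS = ((
            NAME = 'us-east-1-primary'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://my-data-lake-bucket/iceberg/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
          ))
    """)

    # Snowflake-managed Iceberg table: Snowflake acts as the Iceberg catalog and
    # writes both data and metadata files to the external volume on every commit.
    cur.execute("""
        CREATE ICEBERG TABLE sales.public.orders (
          order_id BIGINT,
          order_ts TIMESTAMP_NTZ,
          amount   DECIMAL(10, 2)
        )
          CATALOG = 'SNOWFLAKE'
          EXTERNAL_VOLUME = 'my_iceberg_vol'
          BASE_LOCATION = 'orders/'
    """)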

Similar to the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party doesn't have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.

This workflow consists of the following steps:

  1. In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
  2. Snowflake writes Iceberg tables to Amazon S3 and updates the metadata automatically with every transaction.
  3. Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
  4. Apache Spark services on AWS can access snapshot locations from Snowflake via a Snowflake Iceberg Catalog SDK and directly scan the Iceberg table data in Amazon S3 (see the sketch after this list).
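For step 4, the Snowflake Iceberg Catalog SDK plugs into Spark as an Iceberg catalog implementation. The following PySpark sketch shows roughly what that wiring can look like; the package versions, account URL, credentials, and table names are assumptions rather than tested values.

    from pyspark.sql import SparkSession

    # Hypothetical Snowflake account URL used by the catalog's JDBC connection.
    SNOWFLAKE_URI = "jdbc:snowflake://myorg-myaccount.snowflakecomputing.com"

    spark = (
        SparkSession.builder
        # Iceberg Spark runtime plus the Snowflake JDBC driver (versions are assumptions).
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,"
                "net.snowflake:snowflake-jdbc:3.14.5")
        # Register a Spark catalog backed by Snowflake's Iceberg catalog.
        .config("spark.sql.catalog.snowflake_catalog",
                "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.snowflake_catalog.catalog-impl",
                "org.apache.iceberg.snowflake.SnowflakeCatalog")
        .config("spark.sql.catalog.snowflake_catalog.uri", SNOWFLAKE_URI)
        .config("spark.sql.catalog.snowflake_catalog.jdbc.user", "my_user")
        .config("spark.sql.catalog.snowflake_catalog.jdbc.password", "...")
        .getOrCreate()
    )

    # Spark resolves the snapshot location through Snowflake, then scans the
    # Iceberg data files directly in Amazon S3.
    spark.sql("SELECT COUNT(*) FROM snowflake_catalog.sales.public.orders").show()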

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you're already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you're not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.

Considering that reads and writes will probably operate on a per-table basis rather than across the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to gain the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in place to Iceberg format, which is preferable to rewriting all of the underlying data files, a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it's useful for custom migrations.

For the ADD_FILES option, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use, without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.
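Under the hood, this style of in-place migration corresponds to Iceberg's add_files Spark procedure, which registers existing data files with an Iceberg table instead of rewriting them. The following PySpark sketch illustrates the call, reusing the hypothetical glue_catalog configuration from the earlier sketch; the schema, paths, and table names are placeholders.

    from pyspark.sql import SparkSession

    # Assumes a Spark session with an Iceberg-enabled "glue_catalog" configured
    # as in the earlier sketch.
    spark = SparkSession.builder.appName("iceberg-add-files").getOrCreate()

    # Create an empty Iceberg table with the target schema (hypothetical DDL).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS glue_catalog.sales.orders_iceberg (
          order_id BIGINT, order_ts TIMESTAMP, amount DECIMAL(10, 2)
        ) USING iceberg
    """)

    # Register the existing Parquet files with the Iceberg table in place;
    # Iceberg generates metadata and statistics but does not copy the data files.
    spark.sql("""
        CALL glue_catalog.system.add_files(
          table => 'sales.orders_iceberg',
          source_table => '`parquet`.`s3://my-data-lake-bucket/orders/`'
        )
    """)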

This option requires that you pause data pipelines while converting the data to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.

Conclusion

In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.

Join AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.


About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.
