This post is co-written with Karthik Kondamudi and Jenny Thompson from Stitch Fix.
Stitch Fix is a personalized clothing styling service for men, women, and kids. At Stitch Fix, we have been powered by data science since our founding and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing. We have used Kafka extensively as part of our data infrastructure to support various needs across the business for over six years. Kafka plays a central role in Stitch Fix's efforts to overhaul its event delivery infrastructure and build a self-service data integration platform.
If you'd like to know more background about how we use Kafka at Stitch Fix, please refer to our previously published blog post, Putting the Power of Kafka into the Hands of Data Scientists. That post includes much more information on business use cases, architecture diagrams, and technical infrastructure.
In this post, we will describe how and why we decided to migrate from self-managed Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK). We'll start with an overview of our self-managed Kafka clusters, explain why we chose to migrate to Amazon MSK, and finally describe how we did it.
- Kafka clusters overview
- Why migrate to Amazon MSK
- How we migrated to Amazon MSK
- Navigating challenges and lessons learned
- Conclusion
Kafka clusters overview
At Stitch Fix, we rely on several different Kafka clusters dedicated to specific purposes. This allows us to scale these clusters independently and apply more stringent SLAs and message delivery guarantees per cluster. It also reduces overall risk by minimizing the impact of changes and upgrades, and allows us to isolate and fix any issues that occur within a single cluster.
Our main Kafka cluster serves as the backbone of our data infrastructure. It handles a multitude of critical functions, including managing business events, facilitating microservice communication, supporting feature generation for machine learning workflows, and much more. The stability, reliability, and performance of this cluster are of utmost importance to our operations.
Our logging cluster plays a vital role in our data infrastructure. It serves as a centralized repository for various application logs, including web server and Nginx server logs. These logs provide valuable insights for monitoring and troubleshooting. The logging cluster ensures smooth operations and efficient analysis of log data.
Why migrate to Amazon MSK
For the past six years, our data infrastructure team has diligently managed our Kafka clusters. While the team has acquired extensive knowledge of maintaining Kafka, we have also faced challenges such as rolling deployments for version upgrades, applying OS patches, and the overall operational overhead.
At Stitch Fix, our engineers thrive on creating new features and expanding our service offerings to delight our customers. However, we recognized that allocating significant resources to Kafka maintenance was taking time away from innovation. To overcome this challenge, we set out to find a managed service provider that could handle maintenance tasks like upgrades and patching while granting us complete control over cluster operations, including partition management and rebalancing. We also sought a straightforward scaling solution for storage volumes, keeping our costs in check while retaining the ability to accommodate future growth.
After a thorough evaluation of multiple options, we found the right fit in Amazon MSK, because it allows us to offload cluster maintenance to highly skilled Amazon engineers. With Amazon MSK in place, our teams can focus their energy on developing innovative applications that are unique and valuable to Stitch Fix, instead of getting caught up in Kafka administration tasks.
Amazon MSK streamlines the process, eliminating the need for manual configuration, additional software installation, and worries about scaling. It simply works, enabling us to concentrate on delivering exceptional value to our customers.
How we migrated to Amazon MSK
While planning our migration, we wanted to switch specific services to Amazon MSK individually with no downtime, so that only a specific subset of services would be migrated at a time. The overall infrastructure would run in a hybrid environment where some services connect to Amazon MSK and others to the existing Kafka infrastructure.
We decided to start the migration with our less critical logging cluster first and then proceed to the main cluster. Although the logs are essential for monitoring and troubleshooting, they are relatively less significant to the core business operations. Additionally, the number and types of consumers and producers for the logging cluster is smaller, making it an easier choice to start with. We were then able to apply what we learned from the logging cluster migration to the main cluster. This deliberate choice allowed us to execute the migration in a controlled manner, minimizing any potential disruption to our critical systems.
Over the years, our experienced data infrastructure team has employed Apache Kafka MirrorMaker 2 (MM2) to replicate data between different Kafka clusters. Currently, we rely on MM2 to replicate data from two different production Kafka clusters. Given its proven track record within our team, we decided to use MM2 as the primary tool for our data migration.
The general guidance for MM2 is as follows:
- Begin with less critical applications.
- Perform live migrations.
- Familiarize yourself with key best practices for MM2.
- Implement monitoring to validate the migration.
- Gather essential insights for migrating other applications.
MM2 offers flexible deployment options, allowing it to run as a standalone cluster or be embedded within an existing Kafka Connect cluster. For our migration project, we deployed a dedicated Kafka Connect cluster running in distributed mode.
This setup provided the scalability we needed, allowing us to easily expand the standalone cluster if necessary. Depending on the use case, such as geoproximity, high availability (HA), or migrations, MM2 can be configured for active-active replication, active-passive replication, or both. In our case, as we migrated from self-managed Kafka to Amazon MSK, we opted for an active-passive configuration, where MirrorMaker was used for migration purposes and subsequently taken offline upon completion.
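As an illustration of this setup, the sketch below submits a one-way (active-passive) MirrorSourceConnector to a distributed Kafka Connect cluster through its REST API; the endpoint, cluster aliases, and broker addresses are placeholders rather than our production values.

```python
import requests

# Submit a one-way (active-passive) MirrorSourceConnector to a distributed
# Kafka Connect cluster. The endpoint, aliases, and broker addresses are
# illustrative placeholders.
CONNECT_URL = "http://localhost:8083/connectors"

source_connector = {
    "name": "mm2-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "existing",
        "target.cluster.alias": "msk",
        "source.cluster.bootstrap.servers": "existing-broker1:9092",
        "target.cluster.bootstrap.servers": "msk-broker1:9092",
        "topics": ".*",           # start by mirroring every topic
        "tasks.max": "16",        # scale tasks with partition count
        "sync.topic.configs.enabled": "true",  # carry topic configs across
    },
}

resp = requests.post(CONNECT_URL, json=source_connector, timeout=30)
resp.raise_for_status()
print("Created connector:", resp.json()["name"])
```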
MirrorMaker configuration and replication policy
By default, MirrorMaker renames replicated topics by prefixing the name of the source Kafka cluster to the topic name in the destination cluster. For instance, if we replicate topic A from the source cluster "existing" to the new cluster "newkafka," the replicated topic would be named "existing.A" in "newkafka." However, this default behavior can be modified to maintain consistent topic names within the newly created MSK cluster.
To maintain consistent topic names within the newly created MSK cluster and avoid downstream issues, we used the CustomReplicationPolicy JAR provided by AWS. This JAR, included in our MirrorMaker setup, allowed us to replicate topics with identical names in the MSK cluster. Additionally, we used MirrorCheckpointConnector to synchronize consumer offsets from the source cluster to the target cluster and MirrorHeartbeatConnector to check connectivity between the clusters.
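The relevant pieces of that configuration looked roughly like the following sketch. The replication policy class name is an assumption standing in for the class shipped in the AWS-provided JAR, and the checkpoint and heartbeat settings shown are standard MM2 options.

```python
# Key MM2 settings for identity topic names and offset syncing. The policy
# class name below is an assumption standing in for the class shipped in the
# AWS-provided CustomReplicationPolicy JAR; check the JAR for the exact name.
replication_policy_override = {
    # Applied to all three connectors so replicated topics are NOT prefixed
    # with the source cluster alias (topic A stays topic A on MSK).
    "replication.policy.class": "com.amazonaws.kafka.samples.CustomMM2ReplicationPolicy",
}

checkpoint_config = {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
    # Translate and sync consumer group offsets into the target cluster.
    "sync.group.offsets.enabled": "true",
    "emit.checkpoints.interval.seconds": "60",
    **replication_policy_override,
}

heartbeat_config = {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorHeartbeatConnector",
    # Heartbeats verify end-to-end connectivity between the clusters.
    "emit.heartbeats.interval.seconds": "5",
    **replication_policy_override,
}
```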
Monitoring and metrics
MirrorMaker comes equipped with built-in metrics to monitor replication lag and other essential parameters. We integrated these metrics into our MirrorMaker setup, exporting them to Grafana for visualization. Because we had already been using Grafana to monitor other systems, we decided to use it during the migration as well. This enabled us to closely monitor replication status during the migration process. The specific metrics we monitored are described in more detail below.
Additionally, we monitored the MirrorCheckpointConnector included with MirrorMaker, as it periodically emits checkpoints in the destination cluster. These checkpoints contain offsets for each consumer group in the source cluster, ensuring seamless offset synchronization between the clusters.
Network architecture
At Stitch Fix, we use multiple virtual private clouds (VPCs) through Amazon Virtual Private Cloud (Amazon VPC) for environment isolation in each of our AWS accounts. We have used separate production and staging VPCs since we first started using AWS. When necessary, peering of VPCs across accounts is handled through AWS Transit Gateway. To maintain the strong isolation between environments that we have relied on all along, we created separate MSK clusters in their respective VPCs for the production and staging environments.
Side note: It is now easier to connect Kafka clients hosted in different virtual private clouds using the recently announced Amazon MSK multi-VPC private connectivity, which was not available at the time of our migration.
Migration steps: High-level overview
In this section, we outline the high-level sequence of events for the migration process.
Kafka Connect setup and MM2 deployment
First, we deployed a new Kafka Connect cluster on an Amazon Elastic Compute Cloud (Amazon EC2) cluster as an intermediary between the existing Kafka cluster and the new MSK cluster. Next, we deployed the three MirrorMaker connectors to this Kafka Connect cluster. Initially, this cluster was configured to mirror all the existing topics and their configurations into the destination MSK cluster. (We eventually changed this configuration to be more granular, as described in the "Navigating challenges and lessons learned" section below.)
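Once the connectors were registered, a quick status check against the Connect REST API confirmed that each connector and its tasks were running. A minimal sketch, again assuming an illustrative worker address:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # illustrative Connect worker address

# Confirm each MM2 connector and all of its tasks report RUNNING.
for name in requests.get(f"{CONNECT_URL}/connectors", timeout=30).json():
    status = requests.get(f"{CONNECT_URL}/connectors/{name}/status", timeout=30).json()
    task_states = [task["state"] for task in status["tasks"]]
    print(name, status["connector"]["state"], task_states)
    if status["connector"]["state"] != "RUNNING" or set(task_states) != {"RUNNING"}:
        raise RuntimeError(f"Connector {name} is not fully running")
```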
Monitor replication progress with MM metrics
Utilize the JMX metrics provided by MirrorMaker to monitor the progress of data replication. Beyond the comprehensive set of metrics available, we primarily focused on two key metrics: replication-latency-ms and checkpoint-latency-ms. These provide invaluable insight into replication status, including critical aspects such as replication lag and checkpoint latency. By exporting these metrics to Grafana, you can visualize and closely monitor replication progress, ensuring that MirrorMaker successfully reproduces both historical and new data.
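As a spot check outside of Grafana, one common way to get JMX metrics into dashboards is the Prometheus JMX exporter; the sketch below assumes that setup, and both the port and the exported metric name are assumptions that depend on your exporter mapping rules.

```python
import requests

# Illustrative: a Prometheus JMX exporter attached to a Connect worker. The
# exported name of MirrorMaker's replication-latency-ms metric is an
# assumption; the exact name depends on your JMX exporter mapping rules.
METRICS_URL = "http://localhost:7071/metrics"
LATENCY_METRIC = "kafka_connect_mirror_source_connector_replication_latency_ms_max"
THRESHOLD_MS = 60_000  # flag topics lagging by more than a minute

for line in requests.get(METRICS_URL, timeout=30).text.splitlines():
    if line.startswith(LATENCY_METRIC):
        labels, value = line.rsplit(" ", 1)
        if float(value) > THRESHOLD_MS:
            print("LAGGING:", labels, value)
```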
Evaluate usage metrics and provisioning
Analyze the usage metrics of the new MSK cluster to confirm it is properly provisioned. Consider factors such as storage, throughput, and performance. If required, resize the cluster to meet the observed usage patterns. While resizing may add time to the migration process, it is a cost-effective measure in the long run.
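If the cluster turns out to be under-provisioned on storage, broker volumes can be resized through the Amazon MSK API. A minimal boto3 sketch, with a placeholder cluster ARN and an illustrative target size:

```python
import boto3

kafka = boto3.client("kafka", region_name="us-east-1")  # region illustrative
# Placeholder ARN; substitute your MSK cluster's ARN.
CLUSTER_ARN = "arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/abc123"

# CurrentVersion guards against concurrent modifications to the cluster.
current = kafka.describe_cluster(ClusterArn=CLUSTER_ARN)["ClusterInfo"]["CurrentVersion"]

# Grow every broker's EBS volume to match observed usage (size illustrative).
kafka.update_broker_storage(
    ClusterArn=CLUSTER_ARN,
    CurrentVersion=current,
    TargetBrokerEBSVolumeInfo=[{"KafkaBrokerNodeId": "ALL", "VolumeSizeGB": 2000}],
)
```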
Sync consumer offsets between source and target clusters
Ensure that consumer offsets are synchronized between the source in-house clusters and the target MSK clusters. Once the consumer offsets are in sync, redirect the consumers of the existing in-house clusters to consume data from the new MSK cluster. This step ensures a seamless transition for consumers and allows uninterrupted data flow during the migration.
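Before redirecting consumers, it helps to confirm that a group's committed offsets line up on both clusters. A rough sketch using kafka-python, assuming identical topic names (courtesy of the custom replication policy) and illustrative addresses and group names:

```python
from kafka import KafkaAdminClient

SOURCE = "existing-broker1:9092"  # illustrative bootstrap addresses
TARGET = "msk-broker1:9092"
GROUP = "example-consumer-group"  # illustrative consumer group

def group_offsets(bootstrap_servers):
    """Return {TopicPartition: committed offset} for the consumer group."""
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        offsets = admin.list_consumer_group_offsets(GROUP)
        return {tp: meta.offset for tp, meta in offsets.items()}
    finally:
        admin.close()

src, dst = group_offsets(SOURCE), group_offsets(TARGET)
# Topic names are identical on both clusters thanks to the identity replication
# policy; small gaps can still appear between checkpoint emissions.
for tp, offset in sorted(src.items()):
    match = "OK" if dst.get(tp) == offset else "DRIFT"
    print(tp.topic, tp.partition, offset, dst.get(tp), match)
```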
Update producer applications
After confirming that all consumers are successfully consuming data from the new MSK cluster, update the producer applications to write data directly to the new cluster. This final step completes the migration, ensuring that all data is written to the new MSK cluster and taking full advantage of its capabilities.
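The producer cutover itself can be reduced to a bootstrap-server change. A small sketch with kafka-python that reads the target cluster from the environment, so switching to MSK is purely a configuration change (topic and variable names illustrative):

```python
import json
import os

from kafka import KafkaProducer

# Point KAFKA_BOOTSTRAP_SERVERS at the MSK brokers once all consumers read
# from MSK; the cutover is then purely a configuration change.
producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP_SERVERS"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # durable writes during the cutover window
)

producer.send("example-topic", {"event": "example"})  # illustrative topic
producer.flush()
```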
Navigating challenges and lessons learned
During our migration, we encountered three challenges that required careful attention: scalable storage, more granular replication configuration, and memory allocation.
Initially, we faced issues with auto scaling Amazon MSK storage. We learned that storage auto scaling requires a 24-hour cool-off period before another scaling event can occur. We observed this while migrating the logging cluster, and we applied this learning by factoring the cool-off period into the production cluster migration.
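For reference, MSK storage auto scaling is configured through Application Auto Scaling. The sketch below registers a target-tracking policy on broker storage utilization; the ARN, capacity bounds, and threshold are illustrative, and the 24-hour cool-off applies regardless of the target value.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")
# Placeholder ARN; substitute your MSK cluster's ARN.
CLUSTER_ARN = "arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/abc123"

# Register broker storage as a scalable target (volume sizes in GiB).
autoscaling.register_scalable_target(
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    MinCapacity=1000,
    MaxCapacity=4000,
)

# Expand storage when average broker storage utilization crosses 60 percent.
# MSK still enforces the 24-hour cool-off between storage scaling events.
autoscaling.put_scaling_policy(
    PolicyName="msk-storage-scaling",
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "KafkaBrokerStorageUtilization"
        },
    },
)
```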
Additionally, to optimize MirrorMaker replication speed, we updated the original configuration to divide the replication jobs into batches based on volume and allocated more tasks to high-volume topics.
During the initial phase, we started replication using a single connector to transfer all topics from the source to the target cluster, encompassing a significant number of tasks. However, we encountered challenges such as growing replication lag for high-volume topics and slower replication for specific topics. After careful examination of the metrics, we took an alternative approach and segregated high-volume topics across multiple connectors. In essence, we divided the topics into high-, medium-, and low-volume categories, assigned them to their respective connectors, and adjusted the number of tasks based on replication latency. This adjustment yielded positive results, allowing us to achieve faster and more efficient data replication across the board.
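Concretely, this meant replacing the single catch-all connector with per-tier connectors along the following lines; the topic groupings, task counts, and addresses shown are illustrative, not our production values.

```python
import requests

# Replace the single catch-all connector with per-tier connectors so that
# high-volume topics get more tasks. Topic groupings, task counts, and
# addresses are illustrative.
BASE_CONFIG = {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "existing",
    "target.cluster.alias": "msk",
    "source.cluster.bootstrap.servers": "existing-broker1:9092",
    "target.cluster.bootstrap.servers": "msk-broker1:9092",
}

TIERS = {
    "mm2-source-high": {"topics": "orders|clickstream", "tasks.max": "32"},
    "mm2-source-medium": {"topics": "payments|inventory", "tasks.max": "12"},
    "mm2-source-low": {"topics": "audit.*", "tasks.max": "4"},
}

for name, tier in TIERS.items():
    connector = {"name": name, "config": {**BASE_CONFIG, **tier}}
    resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
    resp.raise_for_status()
```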
Finally, we ran into Java virtual machine heap memory exhaustion, which resulted in missing metrics while MirrorMaker replication was running. To address this, we increased the memory allocation and restarted the MirrorMaker process.
Conclusion
Stitch Fix's migration from self-managed Kafka to Amazon MSK has allowed us to shift our focus from maintenance tasks to delivering value for our customers. It has reduced our infrastructure costs by 40 percent and given us the confidence that we can easily scale the clusters in the future if needed. By strategically planning the migration and using Apache Kafka MirrorMaker, we achieved a seamless transition while ensuring high availability. The integration of monitoring and metrics provided valuable insights during the migration process, and Stitch Fix successfully navigated challenges along the way. The migration to Amazon MSK has empowered Stitch Fix to maximize the capabilities of Kafka while benefiting from the expertise of Amazon engineers, setting the stage for continued growth and innovation.
About the Authors
Karthik Kondamudi is an Engineering Manager in the Data and ML Platform Group at Stitch Fix. His interests lie in distributed systems and large-scale data processing. Beyond work, he enjoys spending time with family and hiking. A dog lover, he's also passionate about sports, particularly cricket, tennis, and soccer.
Jenny Thompson is a Data Platform Engineer at Stitch Fix. She works on a variety of systems for data scientists, and enjoys making things clean, simple, and easy to use. She also likes making pancakes and Pavlova, buying furniture on Craigslist, and getting rained on during picnics.
Rahul Nammireddy is a Senior Solutions Architect at AWS, and focuses on guiding digital native customers through their cloud native transformation. With a passion for AI/ML technologies, he works with customers in industries such as retail and telecom, helping them innovate at a rapid pace. Throughout his 23+ year career, Rahul has held key technical leadership roles in a diverse range of companies, from startups to publicly listed organizations, showcasing his expertise as a builder and driving innovation. In his spare time, he enjoys watching football and playing cricket.
Todd McGrath is a data streaming specialist at Amazon Web Services where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his 3 kids in their preferred activities as well as following his own pursuits such as fishing, pickleball, ice hockey, and happy hour with friends and family on pontoon boats. Connect with him on LinkedIn.