
How Salesforce optimized their detection and response platform using AWS managed services


This is a guest blog post co-authored with Atul Khare and Bhupender Panwar from Salesforce.

Headquartered in San Francisco, Salesforce, Inc. is a cloud-based customer relationship management (CRM) software company building artificial intelligence (AI)-powered business applications that allow businesses to connect with their customers in new and personalized ways.

The Salesforce Trust Intelligence Platform (TIP) log platform team is responsible for the data pipeline and data lake infrastructure, providing log ingestion, normalization, persistence, search, and detection capability to keep Salesforce safe from threat actors. It runs various services to facilitate investigation, mitigation, and containment for security operations. The TIP team is critical to securing Salesforce's infrastructure, detecting malicious threat activities, and providing timely responses to security events. This is achieved by collecting and inspecting petabytes of security logs across dozens of organizations, some with thousands of accounts.

In this post, we discuss how the Salesforce TIP team optimized their architecture using Amazon Web Services (AWS) managed services to achieve better scalability, cost, and operational efficiency.

TIP existing architecture bird's-eye view and scale of the platform

The main key performance indicator (KPI) for the TIP platform is its capability to ingest a high volume of security logs from a variety of Salesforce internal systems in real time and process them at high velocity. The platform ingests more than 1 PB of data per day, more than 10 million events per second, and more than 200 different log types. The platform ingests log data in JSON, text, and Common Event Format (CEF) formats.

The message bus in TIP's existing architecture primarily uses Apache Kafka for ingesting the different log types coming from the upstream systems. Kafka had a single topic for all the log types before they were consumed by different downstream applications, including Splunk, Streaming Search, and the Log Normalizer. The normalized Parquet logs are stored in an Amazon Simple Storage Service (Amazon S3) data lake and cataloged into a Hive Metastore (HMS) on an Amazon Relational Database Service (Amazon RDS) instance based on S3 event notifications. The data lake consumers then use Apache Presto running on an Amazon EMR cluster to perform ad hoc queries. Other teams, including the Data Science and Machine Learning teams, use the platform to detect, analyze, and control security threats.
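
As a point of reference for the challenges below, the following minimal sketch shows how a downstream application would consume that shared topic with Spark Structured Streaming (assuming the spark-sql-kafka connector is available). The broker endpoint, topic name, and record schema are illustrative assumptions, not Salesforce's actual configuration. Because every log type shares one topic, each consumer filters for the types it cares about, and its parallelism is bounded by the topic's partition count.

```python
# Minimal sketch of the previous single-topic consumption pattern (requires the
# spark-sql-kafka connector). Broker, topic name, and schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("tip-kafka-consumer-sketch").getOrCreate()

# Every log type arrives on the same shared topic, so each consumer filters on a
# log type field and its parallelism is capped by the topic's partition count.
log_schema = StructType([
    StructField("log_type", StringType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # assumed broker endpoint
    .option("subscribe", "security-logs")                # assumed shared topic
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), log_schema).alias("event"))
    .select("event.*")
)
splunk_feed = events.filter(col("log_type") == "cloudtrail")  # per-consumer filter
```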

Challenges with the existing TIP log platform architecture

Some of the main challenges that TIP's existing architecture was facing include:

  • Heavy operational overhead and maintenance cost of managing the Kafka cluster
  • High cost to serve (CTS) to meet growing business needs
  • Compute threads limited by the number of partitions
  • Difficult to scale out when traffic increases
  • Weekly patching creates lags
  • Challenges with HMS scalability

All these challenges motivated the TIP team to embark on a journey to create a more optimized platform that's easier to scale, with less operational overhead and lower CTS.

New TIP log platform architecture

The Salesforce TIP log platform engineering team, in collaboration with AWS, started building the new architecture to replace the Kafka-based message bus solution with the fully managed AWS messaging and notification services Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS). In the new design, the upstream systems send their logs to a central Amazon S3 storage location, which invokes a process to partition the logs and store them in an S3 data lake. Consumer applications such as Splunk get the messages delivered to their systems using Amazon SQS. Similarly, the partitioned log data, through Amazon SQS events, initiates a log normalization process that delivers the normalized log data to open source Delta Lake tables on an S3 data lake. One of the major changes in the new architecture is the use of an AWS Glue Data Catalog to replace the earlier Hive Metastore. The ad hoc analysis applications use Apache Trino on an Amazon EMR cluster to query the Delta tables cataloged in AWS Glue. Other consumer applications also read the data from the S3 data lake stored in Delta table format. A minimal sketch of the S3, SNS, and SQS wiring is shown below, followed by more details on some of the important processes.
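
The following boto3 sketch shows one possible wiring of the new fan-out: the central bucket publishes object-created events to an SNS topic, and each consumer subscribes its own SQS queue to that topic. The bucket, topic, and queue names are hypothetical, and the access policies that allow S3 to publish to SNS and SNS to deliver to SQS are omitted for brevity.

```python
# Minimal boto3 sketch of the S3 -> SNS -> SQS fan-out that replaces the Kafka bus.
# Names are illustrative; the topic, queue, and bucket policies are omitted.
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
sqs = boto3.client("sqs")

topic_arn = sns.create_topic(Name="tip-raw-logs")["TopicArn"]               # assumed topic name
queue_url = sqs.create_queue(QueueName="tip-log-partitioner")["QueueUrl"]   # assumed queue name
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Each consumer (Splunk ingestor, log partitioner, ...) subscribes its own queue
# to the same topic, so consumers scale independently of one another.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint=queue_arn,
    Attributes={"RawMessageDelivery": "true"},
)

# New objects landing in the central log bucket publish ObjectCreated events to SNS.
s3.put_bucket_notification_configuration(
    Bucket="tip-central-logs",                                              # assumed bucket name
    NotificationConfiguration={
        "TopicConfigurations": [
            {"TopicArn": topic_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```

Because each downstream application owns its own queue on a shared topic, new consumers can be added and scaled without repartitioning or rebalancing a shared cluster.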

Log partitioner (Spark structured stream)

This service ingests logs from the Amazon S3 SNS SQS-based store and stores them in a partitioned (by log type) format in S3 for further downstream consumption through the Amazon SNS SQS subscription. This is the bronze layer of the TIP data lake.
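
A minimal Structured Streaming sketch of this step follows; Spark's built-in file streaming source stands in for the SQS-notification-driven file listing described above, and the S3 paths, checkpoint location, and log_type field are assumptions for illustration.

```python
# Minimal sketch of the log partitioner; Spark's file streaming source stands in
# for the SQS-notification-driven listing, and all paths and fields are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("tip-log-partitioner-sketch").getOrCreate()

# Raw JSON log lines landing in the central bucket (assumed path).
raw = spark.readStream.text("s3://tip-central-logs/raw/")

# Derive the log type from each record and write the output partitioned by it,
# producing the bronze layer of the TIP data lake.
partitioned = raw.withColumn("log_type", get_json_object(col("value"), "$.log_type"))

query = (
    partitioned.writeStream.format("parquet")
    .option("checkpointLocation", "s3://tip-bronze/_checkpoints/partitioner/")  # assumed
    .partitionBy("log_type")
    .start("s3://tip-bronze/logs/")  # assumed bronze path
)
query.awaitTermination()
```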

Log normalizer (Spark structured stream)

One of the downstream consumers of the log partitioner (the Splunk ingestor is another one), the log normalizer ingests the data from the partitioned output on S3, using Amazon SNS SQS notifications, and enriches it using Salesforce custom parsers and tags. Finally, this enriched data lands in the data lake on S3. This is the silver layer of the TIP data lake.
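
A minimal sketch of the normalizer follows, assuming the delta-spark package is configured on the cluster; the bronze schema, S3 paths, and the added columns are simple stand-ins for Salesforce's custom parsers and tags.

```python
# Minimal sketch of the log normalizer, assuming delta-spark is available; the
# schema, paths, and enrichment columns are illustrative stand-ins.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.appName("tip-log-normalizer-sketch").getOrCreate()

# Read the partitioned bronze output (assumed schema and path).
bronze = (
    spark.readStream.format("parquet")
    .schema("value STRING, log_type STRING")
    .load("s3://tip-bronze/logs/")
)

# Placeholder enrichment; a real deployment applies per-log-type parsers here.
silver = (
    bronze
    .withColumn("normalized_at", current_timestamp())
    .withColumn("source_platform", lit("tip"))
)

query = (
    silver.writeStream.format("delta")
    .option("checkpointLocation", "s3://tip-silver/_checkpoints/normalizer/")  # assumed
    .partitionBy("log_type")
    .start("s3://tip-silver/normalized_logs/")  # assumed silver Delta table path
)
query.awaitTermination()
```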

Machine learning and other data analytics consumers (Trino, Flink, and Spark jobs)

These consumers read from the silver layer of the TIP data lake and run analytics for security detection use cases. The earlier Kafka interface is now converted to Delta streams ingestion, which completes the full removal of the Kafka bus from the TIP data pipeline.
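
For illustration, a detection job might consume the silver table as a Delta stream roughly as follows; the table path, the filter standing in for a detection rule, and the output location are assumptions.

```python
# Minimal sketch of a detection consumer reading the silver Delta table as a
# stream; the table path, filter, and output location are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("tip-detection-consumer-sketch").getOrCreate()

# Streaming reads from a Delta table pick up new normalized records as they land,
# which is the role the Kafka consumers used to play.
normalized = spark.readStream.format("delta").load("s3://tip-silver/normalized_logs/")

findings = normalized.filter(col("log_type") == "auth")  # placeholder detection rule

query = (
    findings.writeStream.format("delta")
    .option("checkpointLocation", "s3://tip-detections/_checkpoints/auth/")  # assumed
    .start("s3://tip-detections/auth_findings/")
)
query.awaitTermination()
```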

Advantages of the new TIP log platform architecture

The main advantages realized by the Salesforce TIP team based on this new architecture using Amazon S3, Amazon SNS, and Amazon SQS include:

  • Cost savings of approximately $400 thousand per month
  • Auto scaling to meet growing business needs
  • Zero DevOps maintenance overhead
  • No mapping of partitions to compute threads
  • Compute resources can be scaled up and down independently
  • Fully managed Data Catalog to reduce the operational overhead of managing HMS

Summary

In this blog post, we discussed how the Salesforce Trust Intelligence Platform (TIP) optimized its data pipeline by replacing the Kafka-based message bus solution with the fully managed AWS messaging and notification services Amazon SQS and Amazon SNS. The Salesforce and AWS teams worked together to make sure this new platform seamlessly scales to ingest more than 1 PB of data per day, more than 10 million events per second, and more than 200 different log types. Reach out to your AWS account team if you have similar use cases and need help architecting your platform to achieve operational efficiencies and scale.


About the authors

Atul Khare is a Director of Engineering at Salesforce Security, where he spearheads the Security Log Platform and Data Lakehouse initiatives. He supports diverse security customers by building robust big data ETL pipelines that are elastic, resilient, and easy to use, providing uniform and consistent security datasets for threat detection and response operations, AI, forensic analysis, analytics, and compliance needs across all Salesforce clouds. Beyond his professional endeavors, Atul enjoys performing music with his band to raise funds for local charities.

Bhupender Panwar is a Big Data Architect at Salesforce and a seasoned advocate for big data and cloud computing. His background encompasses the development of data-intensive applications and pipelines, solving intricate architectural and scalability challenges, and extracting valuable insights from extensive datasets within the technology industry. Outside of his big data work, Bhupender likes to hike and bike, enjoys traveling, and is a great foodie.

Avijit Goswami is a Principal Solutions Architect at AWS specializing in data and analytics. He helps AWS strategic customers build high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike the San Francisco Bay Area trails, watch sports, and listen to music.

Vikas Panghal is the Principal Product Manager leading the product management team for Amazon SNS and Amazon SQS. He has deep expertise in event-driven and messaging applications and brings a wealth of knowledge and experience to his role, shaping the future of messaging services. He is passionate about helping customers build highly scalable, fault-tolerant, and loosely coupled systems. Outside of work, he enjoys spending time with his family outdoors, playing chess, and running.
