
Mastering market dynamics: Transforming transaction cost analytics with ultra-precise Tick History – PCAP and Amazon Athena for Apache Spark


This post is cowritten with Pramod Nayak, LakshmiKanth Mannem, and Vivek Aggarwal from the Low Latency Group of LSEG.

Transaction cost analysis (TCA) is widely used by traders, portfolio managers, and brokers for pre-trade and post-trade analysis, and helps them measure and optimize transaction costs and the effectiveness of their trading strategies. In this post, we analyze options bid-ask spreads from the LSEG Tick History – PCAP dataset using Amazon Athena for Apache Spark. We show you how to access the data, define custom functions to apply to it, query and filter the dataset, and visualize the results of the analysis, all without having to worry about setting up infrastructure or configuring Spark, even for large datasets.

Background

The Options Price Reporting Authority (OPRA) serves as a crucial securities information processor, collecting, consolidating, and disseminating last sale reports, quotes, and pertinent information for US options. With 18 active US options exchanges and over 1.5 million eligible contracts, OPRA plays a pivotal role in providing comprehensive market data.

On February 5, 2024, the Securities Industry Automation Corporation (SIAC) is set to upgrade the OPRA feed from 48 to 96 multicast channels. This enhancement aims to optimize symbol distribution and line capacity utilization in response to escalating trading activity and volatility in the US options market. SIAC has recommended that firms prepare for peak data rates of up to 37.3 GBits per second.

Although the upgrade doesn't immediately alter the total volume of published data, it enables OPRA to disseminate data at a significantly faster rate. This transition is crucial for addressing the demands of the dynamic options market.

OPRA stands out as one of the most voluminous feeds, with a peak of 150.4 billion messages in a single day in Q3 2023 and a capacity headroom requirement of 400 billion messages over a single day. Capturing every single message is critical for transaction cost analytics, market liquidity monitoring, trading strategy evaluation, and market research.

About the data

LSEG Tick History – PCAP is a cloud-based repository, exceeding 30 PB, housing ultra-high-quality global market data. The data is meticulously captured directly within the exchange data centers, using redundant capture processes strategically positioned in major primary and backup exchange data centers worldwide. LSEG's capture technology ensures lossless data capture and uses a GPS time source for nanosecond timestamp precision. Additionally, sophisticated data arbitrage techniques are employed to seamlessly fill any data gaps. Subsequent to capture, the data undergoes meticulous processing and arbitration, and is then normalized into Parquet format using LSEG's Real Time Ultra Direct (RTUD) feed handlers.

The normalization process, which is integral to preparing the data for analysis, generates up to 6 TB of compressed Parquet files per day. The massive volume of data is attributed to the encompassing nature of OPRA, spanning multiple exchanges and featuring numerous options contracts characterized by diverse attributes. Increased market volatility and market making activity on the options exchanges further contribute to the volume of data published on OPRA.

The attributes of Tick History – PCAP enable firms to conduct various analyses, including the following:

  • Pre-trade analysis – Evaluate potential trade impact and explore different execution strategies based on historical data
  • Post-trade evaluation – Measure actual execution costs against benchmarks to assess the performance of execution strategies
  • Optimized execution – Fine-tune execution strategies based on historical market patterns to minimize market impact and reduce overall trading costs
  • Risk management – Identify slippage patterns, detect outliers, and proactively manage risks associated with trading activities
  • Performance attribution – Separate the impact of trading decisions from investment decisions when analyzing portfolio performance

The LSEG Tick History – PCAP dataset is available in AWS Data Exchange and can be accessed through AWS Marketplace. With AWS Data Exchange for Amazon S3, you can access PCAP data directly from LSEG's Amazon Simple Storage Service (Amazon S3) buckets, eliminating the need for firms to store their own copy of the data. This approach streamlines data management and storage, giving clients immediate access to high-quality PCAP or normalized data with ease of use, integration, and substantial data storage savings.

Athena for Apache Spark

For analytical workloads, Athena for Apache Spark offers a simplified notebook experience accessible through the Athena console or Athena APIs, allowing you to build interactive Apache Spark applications. With an optimized Spark runtime, Athena supports the analysis of petabytes of data by dynamically scaling the number of Spark engines in less than a second. Moreover, common Python libraries such as pandas and NumPy are seamlessly integrated, allowing for the creation of intricate application logic. The flexibility extends to importing custom libraries for use in notebooks. Athena for Spark accommodates most open-data formats and is seamlessly integrated with the AWS Glue Data Catalog.
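As a quick orientation, the following minimal cell (not part of the original walkthrough) runs as-is in an Athena for Apache Spark notebook; it assumes only the pre-initialized `spark` session and the bundled pandas and NumPy libraries described above.

import pandas as pd
import numpy as np

# The Athena Spark notebook provides a ready-to-use SparkSession named `spark`
print(spark.version)

# pandas and NumPy are available without any additional setup
print(pd.__version__, np.__version__)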

Dataset

For this analysis, we used the LSEG Tick History – PCAP OPRA dataset from May 17, 2023. The dataset comprises the following components:

  • Best bid and offer (BBO) – Reports the highest bid and lowest ask for a security at a given exchange
  • National best bid and offer (NBBO) – Reports the highest bid and lowest ask for a security across all exchanges
  • Trades – Records completed trades across all exchanges

The dataset includes the following data volumes:

  • Trades – 160 MB distributed across approximately 60 compressed Parquet files
  • BBO – 2.4 TB distributed across approximately 300 compressed Parquet files
  • NBBO – 2.8 TB distributed across approximately 200 compressed Parquet files

Analysis overview

Analyzing OPRA Tick History data for transaction cost analysis (TCA) involves scrutinizing market quotes and trades around a specific trade event. We use the following metrics as part of this study:

  • Quoted spread (QS) – Calculated as the difference between the BBO ask and the BBO bid
  • Effective spread (ES) – Calculated as the difference between the trade price and the midpoint of the BBO (BBO bid + (BBO ask – BBO bid)/2)
  • Effective/quoted spread (EQF) – Calculated as (ES / QS) * 100

We calculate these spreads before the trade and additionally at four intervals after the trade (just after, 1 second, 10 seconds, and 60 seconds after the trade).
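To make these formulas concrete, here is a small worked example with illustrative numbers (not taken from the dataset):

# Illustrative values only
bid, ask = 1.10, 1.20                   # BBO bid and ask
trade_price = 1.16

qs = ask - bid                          # Quoted spread: ~0.10
midpoint = bid + (ask - bid) / 2        # BBO midpoint: 1.15
es = trade_price - midpoint             # Effective spread: ~0.01
eqf = (es / qs) * 100                   # Effective/quoted spread: ~10%

print(qs, midpoint, es, eqf)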

Configure Athena for Apache Spark

To configure Athena for Apache Spark, complete the following steps (a scripted sketch of the same setup follows the list):

  1. On the Athena console, under Get started, select Analyze your data using PySpark and Spark SQL.
  2. If this is your first time using Athena Spark, choose Create workgroup.
  3. For Workgroup name, enter a name for the workgroup, such as tca-analysis.
  4. In the Analytics engine section, select Apache Spark.
  5. In the Additional configurations section, you can choose Use defaults or provide a custom AWS Identity and Access Management (IAM) role and Amazon S3 location for calculation results.
  6. Choose Create workgroup.
  7. After you create the workgroup, navigate to the Notebooks tab and choose Create notebook.
  8. Enter a name for your notebook, such as tca-analysis-with-tick-history.
  9. Choose Create to create your notebook.
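The same workgroup can also be created programmatically. The following is a hedged sketch using boto3 (not from the original post); the workgroup name, IAM role ARN, and S3 output location are placeholders, and you should verify the parameter names against the current Athena API documentation before relying on it.

import boto3

# Hypothetical values; replace the role ARN and S3 location with your own
athena = boto3.client("athena")
athena.create_work_group(
    Name="tca-analysis",
    Configuration={
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        "ExecutionRole": "arn:aws:iam::<account-id>:role/<athena-spark-execution-role>",
        "ResultConfiguration": {"OutputLocation": "s3://<bucket>/athena-spark-results/"},
    },
)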

Launch your notebook

If you have already created a Spark workgroup, select Launch notebook editor under Get started.


After your notebook is created, you're redirected to the interactive notebook editor.


Now we can add and run the following code in our notebook.

Create an analysis

Complete the following steps to create an analysis:

  • Import the libraries we need for the analysis:

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

  • Create our data frames for BBO, NBBO, and trades:
bbo_quote = spark.read.parquet(f"s3://<bucket>/mt=bbo_quote/f=opra/dt=2023-05-17/*")
bbo_quote.createOrReplaceTempView("bbo_quote")
nbbo_quote = spark.read.parquet(f"s3://<bucket>/mt=nbbo_quote/f=opra/dt=2023-05-17/*")
nbbo_quote.createOrReplaceTempView("nbbo_quote")
trades = spark.read.parquet(f"s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/29_1.parquet")
trades.createOrReplaceTempView("trades")
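Before querying the temporary views, it can help to confirm that the columns the rest of the analysis relies on (Product, Price, Quantity, ReceiptTimestamp, BidPrice, AskPrice, MarketParticipant) are present. This optional check is not part of the original walkthrough:

# Optional sanity check on the loaded data
trades.printSchema()
bbo_quote.printSchema()
print(f"{trades.count()} trade records loaded")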

  • Now we can identify a trade to use for transaction cost analysis:
filtered_trades = spark.sql("select Product, Price, Quantity, ReceiptTimestamp, MarketParticipant from trades")

We get the following output:

+-------------------+---------------------+----------------------+-------------------+-------------------+
|Product            |Price                |Quantity              |ReceiptTimestamp   |MarketParticipant  |
+-------------------+---------------------+----------------------+-------------------+-------------------+
|QQQ 230518C00329000|1.1700000000000000000|10.0000000000000000000|1684338565538021907|NYSEArca           |
|QQQ 230518C00329000|1.1700000000000000000|20.0000000000000000000|1684338576071397557|NASDAQOMXPHLX      |
|QQQ 230518C00329000|1.1600000000000000000|1.0000000000000000000 |1684338579104713924|ISE                |
|QQQ 230518C00329000|1.1400000000000000000|1.0000000000000000000 |1684338580263307057|NASDAQOMXBX_Options|
|QQQ 230518C00329000|1.1200000000000000000|1.0000000000000000000 |1684338581025332599|ISE                |
+-------------------+---------------------+----------------------+-------------------+-------------------+

We use the third trade in the output above (the ISE trade at 1.16) going forward for the trade product (tp), trade price (tpr), and trade time (tt).
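Rather than copying those values by hand, you could also pull them from the trades view programmatically. The following variant is hypothetical (the WHERE clause is an assumption, not part of the original post):

# Hypothetical: select the chosen ISE trade at 1.16 directly from the trades view
chosen = spark.sql(
    "SELECT Product, Price, ReceiptTimestamp FROM trades "
    "WHERE Product = 'QQQ 230518C00329000' AND MarketParticipant = 'ISE' "
    "ORDER BY ReceiptTimestamp LIMIT 1"
).collect()[0]
tp, tpr, tt = chosen["Product"], float(chosen["Price"]), int(chosen["ReceiptTimestamp"])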

  • Here we create a number of helper functions for our analysis:
def calculate_es_qs_eqf(df, trade_price):
    # QS is the BBO ask minus the BBO bid; ES is the trade price minus the BBO midpoint,
    # following the definitions given earlier in this post
    df['BidPrice'] = df['BidPrice'].astype('double')
    df['AskPrice'] = df['AskPrice'].astype('double')
    df["QS"] = df["AskPrice"] - df["BidPrice"]
    df["ES"] = trade_price - (df["BidPrice"] + (df["AskPrice"] - df["BidPrice"])/2)
    df["EQF"] = (df["ES"]/df["QS"])*100
    return df

def get_trade_before_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    # Last quote from each market published before trade_time + n seconds (timestamps in nanoseconds)
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].groupby(groupby_col).last()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_trade_after_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    # First quote from each market published after trade_time + n seconds
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].groupby(groupby_col).first()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_nbbo_trade_before_n_seconds(trade_time, df, seconds=0):
    # Last NBBO quote published before trade_time + n seconds
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].iloc[-1:]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

def get_nbbo_trade_after_n_seconds(trade_time, df, seconds=0):
    # First NBBO quote published after trade_time + n seconds
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].iloc[:1]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df
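To see how these helpers behave, here is a toy illustration with made-up quotes (not OPRA data); it simply shows that each market's last quote before the cutoff is returned:

# Toy data to illustrate the helpers above (values and timestamps are made up)
toy_quotes = pd.DataFrame({
    "ReceiptTimestamp": [100, 200, 300, 400],
    "MarketParticipant": ["ARCA", "ARCA", "ISE", "ISE"],
    "BidPrice": [1.10, 1.11, 1.09, 1.12],
    "AskPrice": [1.20, 1.21, 1.19, 1.22],
})

# Last quote per market strictly before timestamp 250 (seconds=0 adds no offset);
# ISE has no quote before 250, so only the ARCA row at timestamp 200 is returned
before = get_trade_before_n_seconds(trade_time=250, df=toy_quotes, seconds=0,
                                    groupby_col="MarketParticipant")
print(before[["MarketParticipant", "ReceiptTimestamp", "BidPrice", "AskPrice"]])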

  • In the following function, we create the dataset that contains all the quotes before and after the trade. Athena Spark automatically determines how many DPUs to launch for processing our dataset.
def get_tca_analysis_via_df_single_query(trade_product, trade_price, trade_time):
    # BBO quotes
    bbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, MarketParticipant FROM bbo_quote WHERE Product = '{trade_product}'")
    bbos = bbos.toPandas()

    bbo_just_before = get_trade_before_n_seconds(trade_time, bbos, seconds=0, groupby_col="MarketParticipant")
    bbo_just_after = get_trade_after_n_seconds(trade_time, bbos, seconds=0, groupby_col="MarketParticipant")
    bbo_1s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=1, groupby_col="MarketParticipant")
    bbo_10s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=10, groupby_col="MarketParticipant")
    bbo_60s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=60, groupby_col="MarketParticipant")
    
    all_bbos = pd.concat([bbo_just_before, bbo_just_after, bbo_1s_after, bbo_10s_after, bbo_60s_after], ignore_index=True, sort=False)
    bbos_calculated = calculate_es_qs_eqf(all_bbos, trade_price)

    #NBBO quotes
    nbbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, BestBidParticipant, BestAskParticipant FROM nbbo_quote WHERE Product = '{trade_product}'")
    nbbos = nbbos.toPandas()

    nbbo_just_before = get_nbbo_trade_before_n_seconds(trade_time,nbbos, seconds=0)
    nbbo_just_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_1s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=1)
    nbbo_10s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=10)
    nbbo_60s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=60)

    all_nbbos = pd.concat([nbbo_just_before, nbbo_just_after, nbbo_1s_after, nbbo_10s_after, nbbo_60s_after], ignore_index=True, sort=False)
    nbbos_calculated = calculate_es_qs_eqf(all_nbbos, trade_price)

    calc = pd.concat([bbos_calculated, nbbos_calculated], ignore_index=True, sort=False)
    
    return calc

  • Now let's call the TCA analysis function with the information from our chosen trade:
tp = "QQQ 230518C00329000"
tpr = 1.16
tt = 1684338579104713924
c = get_tca_analysis_via_df_single_query(tp, tpr, tt)
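Optionally, you can take a quick look at the combined BBO/NBBO result before plotting (this check is not in the original post):

# Optional: inspect the shape and spread columns of the combined result
print(c.shape)
print(c[["ReceiptTimestamp", "QS", "ES", "EQF"]].head())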

Visualize the analysis results

Now let's create the data frames we use for our visualization. Each data frame contains quotes for one of the five time intervals for each data feed (BBO, NBBO):

bbo = c[c['MarketParticipant'].isin(['BBO'])]
bbo_bef = bbo[bbo['ReceiptTimestamp'] < tt]
bbo_aft_0 = bbo[bbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
bbo_aft_1 = bbo[bbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
bbo_aft_10 = bbo[bbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
bbo_aft_60 = bbo[bbo['ReceiptTimestamp'] > (tt+60000000000)]

nbbo = c[~c['MarketParticipant'].isin(['BBO'])]
nbbo_bef = nbbo[nbbo['ReceiptTimestamp'] < tt]
nbbo_aft_0 = nbbo[nbbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
nbbo_aft_1 = nbbo[nbbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
nbbo_aft_10 = nbbo[nbbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
nbbo_aft_60 = nbbo[nbbo['ReceiptTimestamp'] > (tt+60000000000)]

In the following sections, we provide example code to create different visualizations.

Plot QS and NBBO before the trade

Use the following code to plot the quoted spread and NBBO before the trade:

fig = px.bar(title="Quoted Spread Before The Trade",
    x=bbo_bef.MarketParticipant,
    y=bbo_bef['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_bef.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each market and NBBO after the trade

Use the following code to plot the quoted spread for each market and NBBO immediately after the trade:

fig = px.bar(title="Quoted Spread After The Trade",
    x=bbo_aft_0.MarketParticipant,
    y=bbo_aft_0['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(
    y=nbbo_aft_0.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each time interval and each market for BBO

Use the following code to plot the quoted spread for each time interval and each market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['QS']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['QS']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['QS']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['QS']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['QS'])])
fig.update_layout(barmode="group", title="BBO Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Quoted Spread'})
%plotly fig

Plot ES for each time interval and market for BBO

Use the following code to plot the effective spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['ES']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['ES']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['ES']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['ES']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['ES'])])
fig.update_layout(barmode="group", title="BBO Effective Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective Spread'})
%plotly fig

Plot EQF for each time interval and market for BBO

Use the following code to plot the effective/quoted spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['EQF']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['EQF']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['EQF']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['EQF']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['EQF'])])
fig.update_layout(barmode="group", title="BBO Effective/Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Effective/Quoted Spread'})
%plotly fig

Athena Spark calculation performance

When you run a code block, Athena Spark automatically determines how many DPUs it requires to complete the calculation. In the last code block, where we call the TCA analysis function (get_tca_analysis_via_df_single_query), we are actually instructing Spark to process the data, and we then convert the resulting Spark data frames into pandas data frames. This constitutes the most intensive processing part of the analysis, and when Athena Spark runs this block, it shows the progress bar, elapsed time, and how many DPUs are currently processing data. For example, in the following calculation, Athena Spark is utilizing 18 DPUs.

When you configure your Athena Spark notebook, you have the option of setting the maximum number of DPUs that it can use. The default is 20 DPUs, but we tested this calculation with 10, 20, and 40 DPUs to demonstrate how Athena Spark automatically scales to run our analysis. We observed that Athena Spark scales linearly, taking 15 minutes and 21 seconds when the notebook was configured with a maximum of 10 DPUs, 8 minutes and 23 seconds when the notebook was configured with 20 DPUs, and 4 minutes and 44 seconds when the notebook was configured with 40 DPUs. Because Athena Spark charges based on DPU usage at per-second granularity, the cost of these calculations is similar: the three runs work out to roughly 9,200, 10,100, and 11,400 DPU-seconds, respectively. However, if you set a higher maximum DPU value, Athena Spark can return the result of the analysis much faster. For more details on Athena Spark pricing, refer to the Amazon Athena pricing page.

Conclusion

In this post, we demonstrated how you can use high-fidelity OPRA data from LSEG's Tick History – PCAP to perform transaction cost analytics using Athena Spark. The timely availability of OPRA data, complemented by the accessibility innovations of AWS Data Exchange for Amazon S3, strategically reduces the time to analytics for firms looking to create actionable insights for critical trading decisions. OPRA generates about 7 TB of normalized Parquet data each day, and managing the infrastructure to provide analytics based on OPRA data is challenging.

Athena's scalability in handling large-scale data processing for Tick History – PCAP OPRA data makes it a compelling choice for organizations seeking swift and scalable analytics solutions in AWS. This post shows the seamless interaction between the AWS ecosystem and Tick History – PCAP data and how financial institutions can take advantage of this synergy to drive data-driven decision-making for critical trading and investment strategies.


About the Authors

Pramod Nayak is the Director of Product Management of the Low Latency Group at LSEG. Pramod has over 10 years of experience in the financial technology industry, focusing on software development, analytics, and data management. Pramod is a former software engineer and is passionate about market data and quantitative trading.

LakshmiKanth Mannem is a Product Manager in the Low Latency Group of LSEG. He focuses on data and platform products for the low-latency market data industry. LakshmiKanth helps customers build the most optimal solutions for their market data needs.

Vivek Aggarwal is a Senior Data Engineer in the Low Latency Group of LSEG. Vivek works on developing and maintaining data pipelines for processing and delivery of captured market data feeds and reference data feeds.

Alket Memushaj is a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy, working with partners and customers to deploy even the most demanding capital markets workloads to the AWS Cloud.
