Tuesday, December 12, 2023

Faster Rust Toolchains for Android



Posted by Chris Wailes – Senior Software Engineer

The performance, safety, and developer productivity provided by Rust has led to rapid adoption in the Android Platform. Since slower build times are a concern when using Rust, particularly within a massive project like Android, we've worked to ship the fastest version of the Rust toolchain that we can. To do this we leverage multiple forms of profiling and optimization, as well as tuning C/C++, linker, and Rust flags. Much of what I'm about to describe is similar to the build process for the official releases of the Rust toolchain, but tailored to the specific needs of the Android codebase. I hope that this post will be generally informative and, if you are a maintainer of a Rust toolchain, may make your life easier.

Android’s Compilers

While Android is certainly not unique in its need for a performant cross-compiling toolchain, this fact, combined with the large number of daily Android build invocations, means that we must carefully balance tradeoffs between the time it takes to build a toolchain, the toolchain's size, and the produced compiler's performance.

Our Build Process

To be clear, the optimizations listed below are also present in the versions of rustc that are obtainable using rustup. What differentiates the Android toolchain from the official releases, besides the provenance, are the cross-compilation targets available and the codebase used for profiling. All performance numbers listed below are the time it takes to build the Rust components of an Android image and are not reflective of the speedup when compiling other codebases with our toolchain.

Codegen Units (CGU1)

When Rust compiles a crate it will break it into some number of code generation units. Each independent chunk of code is generated and optimized concurrently and then later re-combined. This approach allows LLVM to process each code generation unit separately and improves compile time but can reduce the performance of the generated code. Some of this performance can be recovered via the use of Link Time Optimization (LTO), but this isn't guaranteed to achieve the same performance as if the crate were compiled in a single codegen unit.

To expose as many opportunities for optimization as possible, and to ensure reproducible builds, we add the -C codegen-units=1 option to the RUSTFLAGS environment variable. This reduces the size of the toolchain by ~5.5% while increasing performance by ~1.8%.
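
If you want to experiment with this in your own environment, the sketch below shows one way to apply the same setting; where exactly you set the variable depends on your build system, and the Cargo profile shown in the comment is just the per-project equivalent:

# append the option to whatever flags the build already uses
export RUSTFLAGS="${RUSTFLAGS} -C codegen-units=1"

# the equivalent Cargo profile setting, if you prefer per-project config:
# [profile.release]
# codegen-units = 1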

Be aware that setting this option will slow down the time it takes to build the toolchain by ~2x (measured on our workstations).

GC Sections

Many projects, including the Rust toolchain, have functions, classes, or even entire namespaces that are not needed in certain contexts. The safest and easiest option is to leave these code objects in the final product. This will increase code size and may decrease performance (due to caching and layout issues), but it should never produce a miscompiled or mislinked binary.

It is possible, however, to ask the linker to remove code objects that aren't transitively referenced from the main() function using the --gc-sections linker argument. The linker can only operate on a per-section basis, so, if any object in a section is referenced, the entire section must be retained. Because of this it is also common to pass the -ffunction-sections and -fdata-sections options to the compiler or code generation backend. These ensure that each code object is given an independent section, allowing the linker's garbage collection pass to collect objects individually.
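
For a C/C++ build, the combination described above looks roughly like the following; the Makefile-style variables are placeholders for however your build system passes compiler and linker flags:

# give every function and data object its own section
CFLAGS   += -ffunction-sections -fdata-sections
CXXFLAGS += -ffunction-sections -fdata-sections

# let the linker discard sections that are never referenced
LDFLAGS  += -Wl,--gc-sections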

This is one of the first optimizations we implemented and, at the time, it produced significant size savings (on the order of 100s of MiBs). However, most of these gains have been subsumed by those produced from setting -C codegen-units=1 when they are used in combination, and there is now no difference between the two produced toolchains in size or performance. That said, due to the extra overhead, we don't always use CGU1 when building the toolchain. When testing for correctness the final speed of the compiler is less important and, as such, we allow the toolchain to be built with the default number of codegen units. In these situations we still run section GC during linking as it yields some performance and size benefits at a very low cost.

Link-Time Optimization (LTO)

A compiler can only optimize the functions and data it can see. Building a library or executable from independent object files or libraries can speed up compilation, but at the cost of optimizations that depend on information that is only available once the final binary is assembled. Link-Time Optimization gives the compiler another opportunity to analyze and modify the binary during linking.

For the Android Rust toolchain we perform thin LTO on both the C++ code in LLVM and the Rust code that makes up the Rust compiler and tools. Because the IR emitted by our clang might be a different version than the IR emitted by rustc, we can't perform cross-language LTO or statically link against libLLVM. However, the performance gains from using an LTO optimized shared library are greater than those from using a non-LTO optimized static library, so we've opted to use shared linking.
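
If you're building a Rust toolchain from source, both halves of this can be switched on in the bootstrap configuration. The snippet below is a minimal sketch assuming a recent rust-lang/rust checkout, where these knobs live in config.toml; it illustrates the settings rather than reproducing the actual Android build configuration:

# config.toml for the rustc bootstrap

[llvm]
# thin LTO for the C++ code in LLVM
thin-lto = true

[rust]
# thin LTO for the Rust code in the compiler and tools
lto = "thin"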

Using CGU1, GC sections, and LTO produces a speedup of ~7.7% and a size improvement of ~5.4% over the baseline. This works out to a speedup of ~6% over the previous stage in the pipeline due solely to LTO.

Profile-Guided Optimization (PGO)

Command line arguments, environment variables, and the contents of input files can all influence how a program executes. Some blocks of code might be used frequently while other branches and functions may only be used when an error occurs. By profiling an application as it executes we can collect data on how often these code blocks are executed. This data can then be used to guide optimizations when recompiling the program.

We use instrumented binaries to collect profiles from both building the Rust toolchain itself and from building the Rust components of Android images for x86_64, aarch64, and riscv64. These four profiles are then combined and the toolchain is recompiled with profile-guided optimizations.
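
The general shape of this cycle, sketched here with clang's generic PGO flags as an illustration rather than the actual Android build scripts, is instrument, run representative workloads, merge the raw profiles, and rebuild:

# 1) build the toolchain with instrumentation enabled
CFLAGS += -fprofile-generate=/tmp/rustc-pgo

# 2) run the workloads (build the toolchain itself and the Rust
#    components of the Android images), producing *.profraw files

# 3) merge the raw profiles into a single combined profile
llvm-profdata merge -o combined.profdata /tmp/rustc-pgo/*.profraw

# 4) rebuild the toolchain using the merged profile
CFLAGS += -fprofile-use=combined.profdata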

As a result, the toolchain achieves a ~19.8% speedup and a 5.3% reduction in size over the baseline compiler. This is a 13.2% speedup over the previous stage in the pipeline.

BOLT: Binary Optimization and Layout Tool

Even with LTO enabled, the linker is still responsible for the layout of the final binary. Because it isn't being guided by any profiling information, the linker might accidentally place a function that is frequently called (hot) next to a function that is rarely called (cold). When the hot function is later called, all functions on the same memory page will be loaded. The cold functions are now taking up space that could be allocated to other hot functions, forcing the additional pages that do contain those functions to be loaded as well.

BOLT mitigates this problem by using an additional set of layout-focused profiling information to re-organize functions and data. For the purposes of speeding up rustc we profiled libLLVM, libstd, and librustc_driver, which are the compiler's main dependencies. These libraries are then BOLT optimized using the following options:

--peepholes=all
--data=<path-to-profile>
--reorder-blocks=ext-tsp
--reorder-functions=hfsort
--split-functions
--split-all-cold
--split-eh
--dyno-stats

Any additional libraries matching lib/*.so are optimized without profiles using only --peepholes=all.

Applying BOLT to our toolchain produces a speedup over the baseline compiler of ~24.7% at a size increase of ~10.9%. This is a speedup of ~6.1% over the PGOed compiler without BOLT.

If you are interested in using BOLT in your own project/build I offer these two bits of advice: 1) you'll need to emit additional relocation information into your binaries using the -Wl,--emit-relocs linker argument and 2) use the same input library when invoking BOLT to produce both the instrumented and the optimized versions.
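
Putting that advice together, a minimal sketch of the workflow looks like the following; libexample.so and profile.fdata are placeholder names, and the final invocation reuses the options listed above:

# 1) link the library with relocations preserved so BOLT can rewrite it
LDFLAGS += -Wl,--emit-relocs

# 2) create an instrumented copy and run your workload with it to
#    collect a layout profile (written out as an .fdata file)
llvm-bolt libexample.so -instrument -o libexample.inst.so

# 3) optimize the ORIGINAL library using the collected profile
llvm-bolt libexample.so -o libexample.bolt.so --data=profile.fdata \
    --peepholes=all --reorder-blocks=ext-tsp --reorder-functions=hfsort \
    --split-functions --split-all-cold --split-eh --dyno-stats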

Conclusion

[Figure: normalized comparison of toolchain size and Android Rust build time across the optimization stages]

Optimizations                     Speedup vs Baseline
Monolithic                        1.8%
Mono + GC Sections                1.9%
Mono + GC + LTO                   7.7%
Mono + GC + LTO + PGO             19.8%
Mono + GC + LTO + PGO + BOLT      24.7%

By compiling as a single code generation unit, garbage collecting our data objects, performing both link-time and profile-guided optimizations, and leveraging the BOLT tool, we were able to speed up the time it takes to compile the Rust components of Android by 24.8%. For every 50k Android builds per day run in our CI infrastructure we save ~10K hours of serial execution.

Our industry is not one to stand still, and there will surely be another tool and another set of profiles in need of collecting in the near future. Until then we'll continue making incremental improvements in search of additional performance. Happy coding!
