Building Rust Code - Current Issues

As rustpkg is still in its infancy, most Rust code tends to be built with make, other tools, or by hand. I've been working on updating Servo's build system to something more reliable and faster, so I've been giving a lot of thought to build tooling for Rust.

In this post, I want to cover what the current issues are with building Rust code, especially with regards to external tooling. I'll also describe some recent work I did to address these issues. In the future, I want to cover specific ways to integrate Rust with a few different build tools.

Current Issues

Building Rust with existing build tools is a little difficult at the moment. The main issues are related to Rust's attempt to be a better systems language than the existing options.

For example, Rust uses a larger compilation unit than C and C++ compilers, and existing build tools are designed around single file compilation. Rust libraries are output with unpredictable names. And dependency information must be tracked manually.

Compilation Unit

Many programming languages compile one source file to one output file and then collect the results into some final product. In C, you compile .c files to .o files, then archive or link them into .lib, .a, .dylib, and so on depending on the platform and whether you are building an executable, static library, or shared library. Even Java compiles .java inputs to one or more .class outputs, which are then normally packaged into a .jar.

In Rust, the unit of compilation is the crate, which is a collection of modules and items. A crate may consist of a single source file or an arbitrary number of them in some directory hierarchy, but its output is a single executable or library.

Using crates as the compilation unit makes sense from a compiler point of view, as it has more knowledge during compilation to work from. It also makes sense from a versioning point of view, as all of the crate's contents go together. Using crates as the compilation unit allows for cyclic dependencies between modules in the same crate, which is useful for expressing some things. It also means that separate declaration and implementation pieces, such as the header files in C and C++, are not needed.

Most build tools assume a model similar to that of a typical C compiler. For example, make has pattern rules that map an input to an output based on filename transformations. These work great if one input produces one output, but they don't work well in other cases.

Rust still has a main input file, the one you pass to the compiler, so this difference doesn't have a lot of ramifications when using existing build tools.

Output Names

Compilers generally have an option for what to name their output files, or else they derive the output name with some simple formula. C compilers use the -o option to name the output; Java just names the files after the classes they contain. Rust also has a -o option, which works like you expect, except in the case of libraries where it is ignored.

Libraries in Rust are special in order to avoid naming collisions. Since libraries often end up stored centrally, only one library can have a given name. If I create a library called libgeom it will conflict with someone else's libgeom. Operating systems and distributions end up resolving these conflicts by changing the names slightly, but it's a huge annoyance. To avoid collisions, Rust includes a unique identifier called the crate hash in the name. Now my Rust library libgeom-f32ab99 doesn't conflict with libgeom-00a9edc.

Unfortunately, the current Rust compiler computes the crate hash by hashing the link metadata, such as name and version, along with the link metadata of its dependencies. This results in a crate hash that only the Rust compiler is realistically able to compute, making it seem pseudo-random. This causes a huge problem for build tooling, as the output filename for libraries is unknown.

To work around this problem when using make, the Rust and Servo build systems use a dummy target called libfoo.dummy for a library called foo. After running rustc to build the library, they create the libfoo.dummy file so that make has some well-known output to reason about. This workaround is a bit messy and pollutes the build files.

Here's an example of what a Makefile looks like with this .dummy workaround:

RUSTC ?= rustc

SOURCES = $(shell find . -name '*.rs')

all: librust-geom.dummy

librust-geom.dummy: lib.rs $(SOURCES)
    @$(RUSTC) --lib $<
    @touch $@

clean:
    @rm -f *.dummy *.so *.dylib *.dll

While this works, it also has some drawbacks. For example, if you edit a file during a long compile, the libfoo.dummy will get updated after the compile is finished, and rerunning the build won't detect any changes. The timestamp of the input file will be older than the final output file that the build tool is checking. If the build system knew the real output file name, it could compare the correct timestamps, but that information has been locked inside the Rust compiler.
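The failure mode can be sketched with a toy version of the timestamp check that make performs. The function name and the timeline values below are made up purely for illustration; they are not measurements from a real build.

```python
def is_up_to_date(target_mtime, input_mtimes):
    """Mimic make's check: a target is current if no input is newer than it."""
    if target_mtime is None:  # the target does not exist yet
        return False
    return all(m <= target_mtime for m in input_mtimes)


# Illustrative timeline in seconds (hypothetical values).
compile_started = 100       # rustc begins reading the sources
edit_during_compile = 130   # point.rs is saved while rustc is still running
dummy_touched = 160         # libfoo.dummy is touched after rustc exits

input_mtimes = [90, edit_during_compile]  # lib.rs, point.rs

# The edit happened after the compile started, so it may not be in the output.
print(edit_during_compile > compile_started)       # True
# Yet the dummy file is newer than every input, so nothing gets rebuilt.
print(is_up_to_date(dummy_touched, input_mtimes))  # True
```

Both lines print True: the edit may have been missed by the compiler, and the build tool still considers the target current.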

Dependency Information

Build systems need to be reliable. When you edit a file, it should trigger the correct things to get rebuilt. If nothing changes, nothing should get rebuilt. It's extremely frustrating if you edit a file, rebuild the library, and find that your code changes aren't reflected in the new output for some reason or that the library is not rebuilt at all. Reliable builds need accurate dependency information in order to accomplish this.

There's currently no way for external build tools to get dependency information about Rust crates. This means that developers tend to list dependencies by hand, which is pretty fragile.

One quick way to approximate dependency info is just to recursively find every *.rs in the crate's source directory. This can be wrong for multiple reasons: the include! or include_str! macros may pull in files that aren't named *.rs, or conditional compilation may omit several files.
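As a minimal sketch of that approximation, assuming the crate's sources all live under one directory (the function name is hypothetical):

```python
from pathlib import Path


def approximate_sources(crate_dir):
    """Return every *.rs file under crate_dir, sorted for stable output.

    This is only an approximation: it misses files pulled in through
    include! or include_str! that are not named *.rs, and it lists files
    that conditional compilation would leave out.
    """
    return sorted(str(p) for p in Path(crate_dir).rglob("*.rs"))
```

Calling approximate_sources(".") from the crate root gives roughly the list that a find-based Makefile variable would produce.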

This is similar to dealing with header dependencies by hand when working with C and C++ code. C compilers have options to generate dependency info to deal with this, which is used by tools like CMake.

The price of inaccurate or missing dependency info is an unreliable build and a frustrated developer. If you find yourself reaching for make clean, you're probably suffering from this.

Making It Better

It's possible to solve these problems without sacrificing the things we want and falling back to doing exactly what C compilers do. By making the output file knowable and handling dependencies automatically, we make build tool integration easy and the resulting builds reliable. This is exactly what I've been working on for the last few weeks.

Stable and Computable Hashes

The first thing we need is to make the crate hash stable and easily computable by external tools. Internally, the Rust compiler uses SipHash to compute the crate hash, and takes into account arbitrary link metadata as well as the link metadata of its dependencies. SipHash is not something easily computed from a Makefile and the link metadata is not so easy to slurp and normalize from some dependency graph.

I've just landed a pull request that replaces the link metadata with a package identifier, which is a crate level attribute called pkgid. You declare it like #[pkgid="github.com/mozilla-servo/rust-geom#0.1"]; at the top of your lib.rs. The first part, github.com/mozilla-servo, is a path, which serves as both a namespace for your crate and a location hint as to where it can be obtained (for use by rustpkg for example). Then comes the crate's name, rust-geom. Following that is the version identifier 0.1. If no pkgid attribute is provided, one is inferred with an empty path, a 0.0 version, and a name based on the name of the input file.
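To illustrate the three parts, here is a rough sketch of how an external tool might pull the pkgid out of a source file and split it. The regular expression and both function names are assumptions made for this example; the compiler's own parsing rules may differ in edge cases.

```python
import re

# Assumed shape of the attribute, based on the example in the text.
PKGID_RE = re.compile(r'#\[pkgid\s*=\s*"([^"]*)"\]')


def find_pkgid(source_text):
    """Return the pkgid string from the main input file, or None if absent."""
    match = PKGID_RE.search(source_text)
    return match.group(1) if match else None


def split_pkgid(pkgid):
    """Split a pkgid into (path, name, version)."""
    rest, _, version = pkgid.partition("#")
    version = version or "0.0"           # default version when none is given
    path, _, name = rest.rpartition("/")  # path is empty if there is no '/'
    return path, name, version


source = '#[pkgid="github.com/mozilla-servo/rust-geom#0.1"];\n'
print(split_pkgid(find_pkgid(source)))
# ('github.com/mozilla-servo', 'rust-geom', '0.1')
```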

To generate a crate hash, we take the SHA256 digest of the pkgid attribute. SHA256 is readily available in most languages or on the command line, and the pkgid attribute is very easy to find by running a regular expression over the main input file. The first eight digits of this hash are used for the filename, but the full hash is stored in the crate metadata and used as part of the symbol hashes.
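A build script could compute the same prefix along these lines. This sketch assumes the digest is taken over the pkgid string exactly as written in the attribute; if the compiler normalizes the string first, the result would differ. The function names and the default extension are hypothetical.

```python
import hashlib


def crate_hash_prefix(pkgid, digits=8):
    """Return the first `digits` hex digits of the SHA256 digest of pkgid."""
    return hashlib.sha256(pkgid.encode("utf-8")).hexdigest()[:digits]


def library_filename(name, hash_prefix, version, ext=".dylib"):
    """Assemble a name in the lib<name>-<hash>-<version><ext> pattern."""
    return f"lib{name}-{hash_prefix}-{version}{ext}"


prefix = crate_hash_prefix("github.com/mozilla-servo/rust-geom#0.1")
print(library_filename("rust-geom", prefix, "0.1"))
```

Because the input is just a string from the source file, the same computation works from a shell one-liner or any other build tool.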

Since the crate hash no longer depends on the crate's dependencies, it is stable so long as the pkgid attribute doesn't change. Such changes should be very infrequent, for instance when the library changes versions.

This makes the crate hash computable by pretty much any build tool you can find, and means rustc generates predictable output filenames for libraries.

Dependency Management

I've also got a pull request, which should land soon, to enable rustc to output make-compatible dependency information similar to the -MMD flag of gcc. To use it, you give rustc the --dep-info option and for an input file of lib.rs it will create a lib.d which can be used by make or other tools to learn the true dependencies.

The lib.d file will look something like this:

librust-geom-da91df73-0.0.dylib: lib.rs matrix.rs matrix2d.rs point.rs rect.rs side_offsets.rs size.rs

Note that this list of dependencies will include code pulled in via the include! and include_str! macros as well.
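Tools other than make can read this format too. The following is a simplified sketch of a parser for such files; the function name is hypothetical, and it deliberately ignores complications such as escaped spaces in paths or colons in Windows drive letters.

```python
def parse_dep_info(text):
    """Parse 'target: dep1 dep2 ...' rules into a dict of target -> deps."""
    text = text.replace("\\\n", " ")  # join backslash-continued lines
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        target, sep, deps = line.partition(":")
        if not sep:
            continue  # not a rule line
        rules.setdefault(target.strip(), []).extend(deps.split())
    return rules


dep_text = "librust-geom-da91df73-0.0.dylib: lib.rs matrix.rs point.rs\n"
print(parse_dep_info(dep_text))
# {'librust-geom-da91df73-0.0.dylib': ['lib.rs', 'matrix.rs', 'point.rs']}
```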

Here's an example of a handwritten Makefile using dependency info. Note that this uses a hard-coded output file name, which works because the crate hash is stable unless the pkgid attribute is changed:

RUSTC ?= rustc

all: librust-geom-851fed20-0.1.dylib

librust-geom-851fed20-0.1.dylib: lib.rs
    @$(RUSTC) --dep-info --lib $<

-include lib.d

Now make will notice when you change any of the .rs files without needing to list them explicitly, and the dependency list is updated automatically as your code changes. A little Makefile abstraction on top of this can make it quite nice and portable.

Next Up

In the next few posts, I'll show examples of integrating the improved Rust compiler with some existing build systems like make, CMake, and tup.

(Update: the next post covers building Rust with Make.)
