# Using Pulley: Wasmtime's Portable, Optimizing Interpreter

On architectures such as x86\_64 or aarch64 Wasmtime will by default use the
Cranelift compiler to translate WebAssembly to native machine code and execute
it. Cranelift does not support all architectures, however; for example, i686
(32-bit Intel machines) is not supported at this time. To help execute
WebAssembly on these architectures Wasmtime comes with an interpreter called
Pulley.

Pulley is a bytecode interpreter originally proposed [in an RFC][rfc] which is
intended primarily to be portable. Pulley is a loose backronym for "Portable,
Universal, Low-Level Execution strategY" but mostly just a theme on
machines/tools (Cranelift, Winch, Pulley, ...). Pulley is a distinct target and
execution environment for Wasmtime.

## Enabling Pulley

The Pulley interpreter is enabled via one of two means:

1. On architectures which have Cranelift support, Pulley must be enabled via the
   `pulley` crate feature of the `wasmtime` crate. This feature is otherwise
   off-by-default.

2. On architectures which do NOT have Cranelift support, Pulley is already
   enabled by default. This means that Wasmtime can execute WebAssembly by
   default on any platform; it'll just be faster on Cranelift-supported
   platforms.

For platforms in category (2) there is no opt-in necessary to execute Pulley as
that's already the default target. Platforms in category (1), such as
`x86_64-unknown-linux-gnu`, may still want to execute Pulley to run tests,
evaluate the implementation, benchmark, etc.

To force execution of Pulley on any platform the `pulley` crate feature of
the `wasmtime` crate must be enabled in addition to configuring a target.
Specifying a target is done with the `--target` CLI option to the `wasmtime`
executable, the [`Config::target`] method in Rust, or the
[`wasmtime_config_target_set`] C API.
The target string for Pulley must be one
of:

[`Config::target`]: https://docs.rs/wasmtime/latest/wasmtime/struct.Config.html#method.target
[`wasmtime_config_target_set`]: https://docs.wasmtime.dev/c-api/config_8h.html#ae68a2737ba1680e75cddb6ede08d682a

* `pulley32` - for 32-bit little-endian hosts
* `pulley32be` - for 32-bit big-endian hosts
* `pulley64` - for 64-bit little-endian hosts
* `pulley64be` - for 64-bit big-endian hosts

The Pulley target string must match the environment that the Pulley bytecode
will be executing in. Some examples of Pulley targets are:

| Host target                | Pulley target |
|----------------------------|---------------|
| `x86_64-unknown-linux-gnu` | `pulley64`    |
| `i686-unknown-linux-gnu`   | `pulley32`    |
| `s390x-unknown-linux-gnu`  | `pulley64be`  |

Wasmtime will return an error when trying to load bytecode compiled for the
wrong Pulley target. When Pulley is the default target for a particular host
then the correct Pulley target will be selected automatically. Specifying the
Pulley target may still be necessary when cross-compiling from one platform to
another, however.

## Using Pulley

Using Pulley in Wasmtime requires no further configuration beyond specifying the
target for Pulley. Once that is done all of the Wasmtime crate's Rust APIs or C
API work as usual. For example, specifying `wasmtime run --target pulley64`
on the CLI will execute all WebAssembly in the interpreter rather than via
Cranelift.

Pulley at this time has feature parity with Cranelift for WebAssembly. This
means that all WebAssembly proposals and features supported by Wasmtime are
supported by Pulley.

If you notice anything awry, however, please feel free to file an issue.

## Impact of using Pulley

Pulley is an interpreter for its own bytecode format.
While the design of Pulley
is optimized for speed you should still expect roughly an order-of-magnitude
(~10x) slowdown relative to native code or Cranelift. This means that Pulley is
likely not suitable for compute-intensive tasks that must run in as little time
as possible.

The primary goal of Pulley is to enable using and embedding Wasmtime across a
variety of platforms simultaneously. The same API/interface is used to interact
with the runtime and to load WebAssembly modules regardless of the host
architecture.

Pulley bytecode is produced by the Cranelift compiler today in a similar manner
to native platforms. Pulley is not designed for quickly loading WebAssembly
modules as Cranelift is an optimizing compiler. Compiling WebAssembly to Pulley
bytecode should be expected to take about the same time as compiling to native
platforms.

## Disabling SIMD in Pulley

By default all Pulley opcodes are enabled in the interpreter, meaning it's
possible to execute any Pulley bytecode created by Cranelift and Wasmtime. This
includes, for example, SIMD opcodes for all of the WebAssembly SIMD proposal.
Not all WebAssembly modules use these opcodes, though, nor do all embeddings
want to enable them, so Pulley supports a custom Rust flag that can be specified
at compile time to compile out the SIMD opcodes:

```text
RUSTFLAGS=--cfg=pulley_disable_interp_simd
```

When specified the Pulley interpreter will no longer include code to execute
SIMD opcodes. Attempting to execute any such opcode will instead raise a
"disabled opcode" trap. If doing this it's recommended to pair it with
`Config::wasm_simd(false)` to ensure that SIMD-using modules do not pass
validation.

## High-level Design of Pulley

This section is not necessary for users of Pulley but for those interested this
is a description of the high-level design of Pulley.
The Pulley virtual machine
consists of:

* 32 "X" integer registers each of which is 64 bits large. (`XReg`)
* 32 "F" float registers each of which is 64 bits large. (`FReg`)
* 32 "V" vector registers each of which is 128 bits large. (`VReg`)
* A dynamically allocated "stack" on the host's heap.
* A frame pointer register.
* A link register to store the return address for the current function.

This state lives in [`MachineState`] which is in turn stored in a [`Vm`].
Pulley's source code lives in `pulley/` in the Wasmtime repository.

Pulley's bytecode is defined in `pulley/src/lib.rs` with a combination of the
`for_each_op!` and `for_each_extended_op!` macros. Opcode numbers and opcode
layout are defined by the structure of these macros. The macros are used to
"derive" encoding/decoding/traits/etc used throughout the `pulley_interpreter`
crate.

Pulley opcodes are a single discriminator byte followed by any immediates.
Immediates are not aligned and require unaligned loads/stores to work with them.
Pulley has more than 256 opcodes, however, which is where "extended" opcodes
come into play. The final Pulley opcode is reserved to indicate that an extended
opcode is being used. Extended opcodes follow this initial discriminator with a
16-bit integer which further indicates which extended opcode is being used. This
design is intended to allow common operations to be encoded more compactly while
less common operations can still be packed in effectively without limit.

Pulley opcode assignment happens through the order of the `for_each_op!` macro
which means that it's not portable across multiple versions of Wasmtime.

The interpreter is an implementation of the [`OpVisitor`] and
[`ExtendedOpVisitor`] traits. This is located at `pulley/src/interp.rs`. Notably
this means that there's a method per opcode, which is how the interpreter is
implemented.
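
The encoding and dispatch scheme described above can be sketched in a few dozen
lines of Rust. This is a toy model only, not code from the `pulley_interpreter`
crate: the opcode numbers, the `SketchVisitor` trait, the `SketchVm` type, and
the helper names are all hypothetical, and reading immediates as little-endian
is an assumption of the sketch. It shows a single discriminator byte followed by
unaligned immediates, a reserved final opcode that introduces a 16-bit extended
opcode, and a visitor with one method per opcode.

```rust
// Hypothetical opcode numbers; Pulley's real numbering comes from macro order.
const OP_XMOV: u8 = 0x00;
const OP_XCONST32: u8 = 0x01;
// Stand-in for the reserved final opcode that introduces an extended opcode.
const OP_EXTENDED: u8 = 0xff;
// Hypothetical extended opcode number.
const EXT_TRAP: u16 = 0x0000;

/// One method per opcode (a hypothetical trait, not the real `OpVisitor`).
trait SketchVisitor {
    fn xmov(&mut self, dst: u8, src: u8);
    fn xconst32(&mut self, dst: u8, imm: i32);
    fn trap(&mut self);
}

/// Toy machine state: 32 integer registers of 64 bits each, plus a trap flag.
struct SketchVm {
    x: [u64; 32],
    trapped: bool,
}

impl SketchVisitor for SketchVm {
    fn xmov(&mut self, dst: u8, src: u8) {
        // Register indices are wrapped to 0..32 to keep the sketch panic-free.
        self.x[usize::from(dst % 32)] = self.x[usize::from(src % 32)];
    }
    fn xconst32(&mut self, dst: u8, imm: i32) {
        // Sign-extend the 32-bit immediate into the 64-bit register.
        self.x[usize::from(dst % 32)] = i64::from(imm) as u64;
    }
    fn trap(&mut self) {
        self.trapped = true;
    }
}

/// Reads `N` bytes at `*pc` with no alignment requirement and advances `*pc`.
fn take<const N: usize>(bytes: &[u8], pc: &mut usize) -> Option<[u8; N]> {
    let end = (*pc).checked_add(N)?;
    let chunk: [u8; N] = bytes.get(*pc..end)?.try_into().ok()?;
    *pc = end;
    Some(chunk)
}

/// Decodes one instruction and calls the matching visitor method.
/// Returns `None` on truncated input or an unknown opcode.
fn decode_one<V: SketchVisitor>(bytes: &[u8], pc: &mut usize, v: &mut V) -> Option<()> {
    let [opcode] = take::<1>(bytes, pc)?;
    match opcode {
        OP_XMOV => {
            let [dst, src] = take::<2>(bytes, pc)?;
            v.xmov(dst, src);
        }
        OP_XCONST32 => {
            let [dst] = take::<1>(bytes, pc)?;
            let imm = i32::from_le_bytes(take::<4>(bytes, pc)?);
            v.xconst32(dst, imm);
        }
        OP_EXTENDED => {
            // Extended opcodes carry a 16-bit number after the discriminator.
            match u16::from_le_bytes(take::<2>(bytes, pc)?) {
                EXT_TRAP => v.trap(),
                _ => return None,
            }
        }
        _ => return None,
    }
    Some(())
}

/// Runs a three-instruction sample program and returns `(x1, x2, trapped)`.
fn run_sample() -> (u64, u64, bool) {
    let program = [
        OP_XCONST32, 1, 0x2a, 0, 0, 0, // x1 = 42
        OP_XMOV, 2, 1,                 // x2 = x1
        OP_EXTENDED, 0x00, 0x00,       // extended opcode: trap
    ];
    let mut vm = SketchVm { x: [0; 32], trapped: false };
    let mut pc = 0;
    while pc < program.len() && !vm.trapped {
        if decode_one(&program, &mut pc, &mut vm).is_none() {
            break;
        }
    }
    (vm.x[1], vm.x[2], vm.trapped)
}

fn main() {
    let (x1, x2, trapped) = run_sample();
    println!("x1={x1} x2={x2} trapped={trapped}");
}
```

Running the sketch prints `x1=42 x2=42 trapped=true`. Note how the common
operations cost one byte of opcode while the extended one costs three, which is
the trade-off the extended-opcode design makes.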

The interpreter loop itself is implemented in one of two ways:

1. A "match loop" which is a Rust `loop { ... }` which internally uses the
   [`Decode`] trait on each opcode. This is not literally written as, but
   compiles down to something that looks like, `loop { match .. { ... } }`. This
   interpreter loop is located at `pulley/src/interp/match_loop.rs`.

2. A "tail loop" where each opcode handler is a Rust function. Control flow
   between opcodes continues with tail-calls and exiting the interpreter is done
   by returning from the function. Tail calls are not available in stable Rust
   so this interpreter loop is not used by default. It can be enabled, though,
   with `RUSTFLAGS=--cfg=pulley_assume_llvm_makes_tail_calls` to rely on LLVM's
   tail-call-optimization pass to implement the loop.

The "match loop" is the default interpreter loop as it's portable and works on
stable Rust. The "tail loop" is thought to probably perform better than the
"match loop" but it's neither available on stable Rust (`become` in Rust is an
unfinished nightly feature at this time) nor portable (tail-call-optimization
doesn't happen the same in LLVM on all architectures).

### Inspecting Pulley Bytecode

As when compiling to native code, the `wasmtime objdump` command can be used to
inspect compiled bytecode:

```shell-session
$ wasmtime compile --target pulley64 foo.wat
$ wasmtime objdump foo.cwasm --addresses --bytes
0x000000: wasm[0]::function[20]:
       0: 9f 10 00 08 00                  push_frame_save 16, x19
       5: 40 13 00                        xmov x19, x0
       8: 03 13 13 3f cb 89 00            call2 x19, x19, 0x89cb3f    // target = 0x89cb47
       f: 03 13 13 8c ab 84 00            call2 x19, x19, 0x84ab8c    // target = 0x84ab9b
      16: 03 13 13 5b 12 00 00            call2 x19, x19, 0x125b    // target = 0x1271
      1d: 03 13 13 9f 12 00 00            call2 x19, x19, 0x129f    // target = 0x12bc
      24: 03 13 13 e0 45 00 00            call2 x19, x19, 0x45e0    // target = 0x4604
...
```

### Profiling Pulley

Profiling the Pulley interpreter can be done with a native profiler such as
`perf` but this has a few downsides:

* When profiling the "match loop" it's not clear what machine code corresponds
  to which Pulley opcode. Most of the time all the samples are just in the one
  big "run" function.

* When profiling with the "tail loop" you can see hot opcodes much more clearly,
  but it can be difficult to understand why a particular opcode was chosen.

It can sometimes be more beneficial to see time spent per Pulley opcode itself
in the context of all Pulley opcodes. In a similar manner as you can look at
instruction-level profiling in `perf` it can be useful to look at opcode-level
profiling of Pulley.

Pulley has limited support for opcode-level profiling. This is off-by-default as
it has a performance hit for the interpreter. To collect a profile with the
`wasmtime` CLI you'll have to build from source and enable the `profile-pulley`
feature:

```console
cargo run --features profile-pulley --release run --profile pulley --target pulley64 foo.wat
```

This will compile an optimized `wasmtime` executable with the `profile-pulley`
Cargo feature enabled. The `--profile pulley` flag can then be passed to the
`wasmtime` CLI to enable the profiler at runtime.

The command will emit a `pulley-$pid.data` file which contains raw data about
Pulley opcodes and samples taken. To view this file you can use:

```console
cargo run -p pulley-interpreter --example profiler-html --all-features ./pulley-$pid.data
```

This will load the `pulley-*.data` file, parse it, collate the results, and
display the hottest functions. The hottest function is emitted last and
instructions are annotated with the `%` of samples taken that were executing at
that instruction.
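
To make the collation step concrete, here is a small, self-contained sketch of
turning raw samples into the kind of report described above. It assumes a
hypothetical in-memory sample of (function name, bytecode offset) and does not
model the real `pulley-*.data` format or the actual `profiler-html` example;
`FuncProfile` and `collate` are illustrative names only.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory sample: the function being executed and the
/// bytecode offset within it. The real profile data format is not modeled.
type Sample = (&'static str, u32);

/// Collated view of one function (illustrative, not a real Pulley type).
struct FuncProfile {
    name: String,
    samples: usize,
    /// (offset, percent of all samples taken at that offset)
    by_offset: Vec<(u32, f64)>,
}

/// Groups samples by function and offset, then orders functions so that the
/// one with the most samples comes last.
fn collate(samples: &[Sample]) -> Vec<FuncProfile> {
    let total = samples.len();
    let mut by_func: BTreeMap<&str, BTreeMap<u32, usize>> = BTreeMap::new();
    for &(func, offset) in samples {
        *by_func.entry(func).or_default().entry(offset).or_default() += 1;
    }
    let mut rows: Vec<FuncProfile> = by_func
        .into_iter()
        .map(|(func, offsets)| FuncProfile {
            name: func.to_string(),
            samples: offsets.values().sum(),
            by_offset: offsets
                .into_iter()
                .map(|(off, n)| (off, 100.0 * n as f64 / total as f64))
                .collect(),
        })
        .collect();
    // Ascending sort, so the hottest function is emitted last.
    rows.sort_by_key(|row| row.samples);
    rows
}

fn main() {
    // Made-up samples purely for illustration.
    let samples = [("f0", 0), ("f1", 5), ("f1", 5), ("f1", 8)];
    for f in collate(&samples) {
        println!("{} ({} samples)", f.name, f.samples);
        for (offset, pct) in &f.by_offset {
            println!("  {pct:5.1}%  {offset:#x}");
        }
    }
}
```

With the made-up samples above, `f1` is listed after `f0` because it has more
samples, and its two offsets account for 50% and 25% of all samples.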

Some more information can be found in [the PR that implemented Pulley profiling
support][profile-pr].

[`OpVisitor`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.OpVisitor.html
[`MachineState`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/interp/struct.MachineState.html
[`Vm`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/interp/struct.Vm.html
[rfc]: https://github.com/bytecodealliance/rfcs/blob/main/accepted/pulley.md
[`ExtendedOpVisitor`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.ExtendedOpVisitor.html
[`Decode`]: https://docs.rs/pulley-interpreter/latest/pulley_interpreter/decode/trait.Decode.html
[profile-pr]: https://github.com/bytecodealliance/wasmtime/pull/10034