Platform Agnostic Bytecode


How to use the slides - Full screen (new tab)
Slides Content
--- title: Platform Agnostic Bytecode description: What are PABs and why they exist? duration: 1 hour ---

Platform Agnostic Bytecode


Review of Compilers

🀯 Fun Side Reading: Reflections on Trusting Trust

Notes:

Just a very quick reminder of how compilers work. Humans write programs in some human readable language like Lauren talked about. Then the compiler translates the semantics of that program into an equivalent program in a much lower more machine-readable language called a bytecode.

CLICK

Whenever I show this diagram or talk about compilers, I always like to mention one of my favorite essays ever. Ken Thompson's 1984 Turing Award lecture.


Definition

A PAB is a bytecode that follows two main principles:

  • Turing Completeness, as a standard bytecode would respect
  • Support for tooling that makes it executable on every machine

Notes:

Ideally a bytecode like this is designed to be executed on a virtual machine that follows general known patterns.


High Level Languages
PABs
Architecture's bytecode

Notes:

From left to right you can see different levels of abstraction over the program that will ultimately be run on some machine. Generally, from a high level language you need two compilation step if you want to pass through a PAB.

Other examples of PABs used right now:

  • Inside the Linux Kernel -> eBPF
  • Inside browsers -> Wasm
  • Inside Blockchains -> Wasm
    • Full nodes
    • Light nodes (Wasm inside Wasm)
  • LLVM Toolchain -> LLVM IR

---v

Compiling in a PAB

Notes:

So when we are using a PAB, we need to compile twice. This is, of course, the cost to using a PAB. In this lesson we'll also explore the advantages.


What a PAB allows is:

  • Portability
    - Avoid Hardware Centralization
    
  • Determinism
    - Make consensus possible
    

Notes:

The main goal of a PAB is to make the code portable, you should be able to compile it once and then share it around without caring about the architecture on which will be executed. Of course in a decentralized network we want that different nodes, with different architectures came up to the same result if the input are the same, that's called determinism, if a PAB would not have determinism then reaching consensus is impossible.

---v

That's why PABs are so important

Desireable Features

  • Hardware Independence
  • Efficiency
  • Tool Simplicity
  • Support as Compilation Target
  • Sandboxing

Notes:

  • Hardware Independence: It should not be tightly related to a specific architecture, otherwise the execution on different machine could be convoluted
  • Efficiency: the execution of a PAB should be efficient, the problem for a PAB is that in the execution time is also considered the "translation" to the machine's bytecode or the interpretation
  • Support as Compilation Target: The PAB should be possible to be compiled by as many as possible High Level languages
  • Tool Simplicity: If the tools that makes the PAB executable are extremely complex then nobody will use it

---v

Sandboxing?

An environment for running untrusted code without affecting the host.

A SmartContract is Arbitrary Code that may be executed on other people's infrastructure, we don't want SmartContracts capable of destroying the nodes on which they are executed

Notes:

CLICK read definition

The term sandbox is an analogy to kids playing in a sandbox. The parent puts the kid in the sandbox and tells them they can play in the sandbox and they are safe as long as they stay in. Don't go in the woods and get bitten by a snake or in the road and get hit by a car. Just stay in the sandbox.

Of course the analogy isn't perfect. The children in the sandbox stay there because the parent asked them to. They could leave anytime they wanted to. For actual untrusted code, a better analogy would be a walled garden or a Jail

---v

Sandboxing?

A sandboxed environment must be created by the executor of the PAB.

Notes:

Of course the security can be seen by various point of view and some examples are:

  • Compilation takes too much time -> compiling bomb
  • Access to the environment -> "buffer overflow" techniques

Those things can't be addressed by the PAB itself but they can give good guidelines and code design to make an 100% secure implementation of the executor possible.


PAB's lifecycle example



WebAssembly


Wasm's key points

  • Hardware-independent
    • Binary instruction format for a stack-based virtual machine
  • Supported as compilation target by many languages
    • Rust, C, C++ and many others
  • Fast (with near-native performance)
  • Safe (executed in a sandboxed environment)
  • Open (programs can interoperate with their environment)

Notes:

Wasm seems to respect every rating points we defined before


Stack-Based Virtual Machine Example

Adding two number in wasm text representation (.wat)

(module
  (import "console" "log" (func $log (param i32)))
  (func $main
    ;; load `10` and `3` onto the stack
    i32.const 10
    i32.const 3

    i32.add ;; add up both numbers
    call $log ;; log the result
  )
  (start $main)
)

Notes:

Wasm has also a text representation, Wat has some features that allow for better readability:

  • Stack push operations can be grouped to its consuming instruction.
  • Labels can be applied to elements.
  • Blocks can enclosed with parenthesis instead of explicit start/end instructions.

Instructions push results to the stack and use values on the stack as arguments, the compilation process generally translate this stack-based bytecode to register based, where registers are used to pass values to instructions as a primary mechanism. The compilation will try to elide the wasm stack and work with only the architecture registers.

There is another type of stack used in wasm and that's called: shadow stack, resource to learn more: https://hackmd.io/RNp7oBzKQmmaGvssJDHxrw


Wasm seems to be a perfect PAB, but

  • How does communication with the environment work?
  • How the memory is managed?
  • How is it executed?

Notes:

Assuming all the things we said before wasm seems to be perfect but how those things really works?


Communication with the Environment

Let's call Embedder the program that will take the wasm blob as input and execute it

  • the wasm blob may expect parameters from the embedder
    • embedder -> wasm
  • the embedder may act on a return value from the wasm
    • wasm -> embedder

---v

Problem

Wasm has no ambient access to the computing environment in which code is executed


Solution

Notes:

  • Every interaction with the environment can be done only by a set of functions, called Host Functions, provided by the embedder and imported in wasm
  • The embedder is able to call the functions defined in wasm blob, called Runtime API, and pass arguments through a shared memory

Memory

In addition to the stack Wasm has also access to memory provided by the embedder, the Linear Memory.


  • This area will be used also used as a frontier for data sharing
  • To make everything secure the Embedder is doing incredibly convoluted things

Notes:

From Wasm the Linear Memory is byte addressable Linear Memory can be manipulated using functions called 'store' and 'load'

The Rust compiler uses for dynamic/heap memory and to pass non primitives values to functions by emulating an additional stack within the linear memory, this emulated stack (the shadow stack) is what we would understand as stack in other architectures

---v

Example

Notes:

Here's an example, wasm sees linear memory like a byte array and if it tries to access the second byte, it would use an index 1. When it's time to execute it the embedder will see this access and translate the linear memory access at index 1 to a standard memory access to base_linear_memory + 1.

Buffer overflow? Wasm uses 32 bit, this makes impossible to have an offset bigger then 4GiB, this means that the embedder can leave those 4GiB free in its virtual memory to makes impossible to the wasm blob to access any environment information. Even if the offset is only positive there are embedded that are defining as protected the 2GiB before the BLM so that if for some reason the wasm code trick the embedder to treat the offset as a signed number that would cause an Operating System error.


How Wasm is executed

There are multiple ways to execute wasm:

  • Ahead Of Time Compilation
  • Just in Time Compilation
  • Single Pass Compilation
  • Interpretation
  • ...

Notes:

AOT: Compile all the code at the beginning, this allows to makes a lot of improvement to the final code efficiency JIT: The code is compiled only when needed, examples are functions that are compiled only when called, this leave space only to partials improvements SPC: This is a specific technique of compilation that is made in linear time, the compilation is done only passing once on the code Interpretation: The wasm blob is treated as any other interpreted language and executed in a Virtual Machine

---v

Wasmtime

  • It is a stand alone wasm environment
  • Wasmtime is built on the optimizing Cranelift code generator to quickly generate high-quality machine code either at runtime (JIT) or ahead-of-time (AOT)
  • It executes the compiled wasm blob in sandboxed environment while keeping everything extremely secure

Notes:

Cranelift is a fast, secure, relatively simple and innovative compiler backend. It takes an intermediate representation of a program generated by some frontend and compiles it to executable machine code

---v

Wasm lifecycle in Wasmtime

---v

Wasmi

  • It is a wasm environment with support for embedded environment such as WebAssembly itself
  • Focus on simple, correct and deterministic WebAssembly execution
  • The technique of execution is interpretation but:
    • The wasm code is transpiled to WasmI IR, another stack-based bytecode
    • The WasmI IR is then interpreted by a Virtual Machine

Notes:

proposal to switch from a stack based ir to registry based ir https://github.com/paritytech/wasmi/issues/361

paper explaining the efficiency of translating wasm to registry based code https://www.intel.com/content/www/us/en/developer/articles/technical/webassembly-interpreter-design-wasm-micro-runtime.html

Due to it's characteristics it is mainly used to execute SmartContracts on chain

---v

Wasm lifecycle in Wasmi


Alternatives

---v

EVM

  • The Ethereum Virtual Machine executes a stack machine
    • Interesting: here the bytecode was create to be executed in a blockchain, so instructions are not hardware-dependent but there are instruction tightly related to Cryptography and others blockchain instructions

---v

CosmWasm

  • Wasm is always used but with different tools
  • They use CosmWasm as Embedder and internally is used Wasmer, a Single Pass Compiler

---v

Solana eBPF

  • eBPF is used as PAB, but intrinsically eBPF has a lot of restrictions
  • Solana forked the eBPF backend of LLVM to makes every program to be compiled in eBPF
  • The Embedder is rBFP, a virtual machine for eBPF programs

Notes:

https://forum.polkadot.network/t/ebpf-contracts-hackathon/1084

---v

RISC-V ?!

  • RISC-V is a new instruction-set architecture
  • main goals are:
    • real ISA suitable for direct native hardware implementation
    • avoids β€œover-architecting”

Being so simple and "Hardware-Independent" there are work in progress experiments to test if it is suitable to become the new polkadot smart contract language

Notes:

Discussion about using RISC-V as smart contract language: https://forum.polkadot.network/t/exploring-alternatives-to-wasm-for-smart-contracts/2434

RISC-V Instruction Set Manual, Unprivileged ISA: https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf


Activity: Compiling Rust to Wasm

  • Let's make a simple Rust crate that compiles to Wasm!
  • Clone the repo

---v

Activity: Compiling Rust to Wasm

  • A target triple consists of three strings separated by a hyphen, with a possible fourth string at the end preceded by a hyphen.
  • The first is the architecture, the second is the "vendor", the third is the OS type, and the optional fourth is environment type.
  • wasm32-unknown-emscripten: Legacy, provides some kind of std-like environment
  • wasm32-unknown-unknown βœ“ WebAssembly: Can compile anywhere, can run anywhere, no std
  • wasm32-wasi βœ“ WebAssembly with WASI

---v

Rust -> Wasm Details

#![allow(unused)]
fn main() {
#[no_mangle] // don't re-name symbols while linking
pub extern "C" fn add_one() { // use C-style ABI
  ...
}
}

and if a library:

[lib]
crate-type = ["cdylib"]

---v

Activity: Compiling Rust to Wasm

rustup target add wasm32-unknown-unknown

cargo build --target wasm32-unknown-unknown --release

wasmtime ./target/wasm32-unknown-unknown/release/wasm-crate.wasm --invoke <func_name> <arg1> <arg2> ...

---v

Additional Resources! πŸ˜‹

Check speaker notes (click "s" πŸ˜‰)

Notes: