Crash-Resilient Wayland Compositing

A common misconception about migrating from X11 to Wayland is that the system as a whole becomes more brittle because the display server and the window manager are merged into a single ‘compositor’. The argument goes that a bug in the higher-level window management parts would kill the “display server” part and, subsequently, its clients. This is well illustrated by this issue from camp GNOME. To quote jadahl’s comment in the thread:

I know I have been talking about this before but I think eventually
we really want to do the compositing/UI split. Not now, since it's a
huge undertaking and will involve a massive amount of work, but
eventually it's the solution that I think we should work towards -
long term. The reason is not only because of stability, but for
responsiveness and other things as well.

I think for this bug, a WONTFIX is appropriate. There is no single
"fix" for stability to do here. We just need to hunt down as many
mutter/gnome-shell crashes as we can.

Fortunately, the protocol does not have any provisions that would make a more resilient design impossible – although, to the best of my knowledge, Arcan is the only* implementation that can readily demonstrate this fully. This article will therefore go into details as to how that is achieved, and show just how much work and planning that takes.

The following illustration shows the layers in place, and the remainder of this article will go through them one at a time, divided as follows:

Crash Recovery Stages

The two dashed boxes indicate some form of strict code, memory or symbol namespace separation. To quickly navigate to each section, use the links below:

  1. Surviving Window Manager crashes
  2. Surviving Protocol implementation crashes
  3. Surviving Compositor level crashes
  4. Surviving Display System crashes

A combined video clip (for those who dislike the WordPress-embedded video approach) can be found here:

Important: The setup here uses two separate instances of Arcan for demonstration purposes. The left screen is running on GPU1 and is used as a monitoring console for checking the right screen (GPU2), where all the demo stuff will happen. It has been set up so that CTRL-ALT-F1 acts as an on/off toggle for ignoring input on the instance controlling the left screen, with CTRL-ALT-F2 doing the same for the instance controlling the right screen.

Notes: It is worthwhile to note that KDE Plasma and KWin do have a separation that works to the extent of stage 1. It could be argued that MIR and Enlightenment have something to this effect, with the latter described in Wayland Recovery: A Journey Of Discovery – but I disagree; re-pairing client server-side metadata by agreeing on unique identifiers that persist across sessions is but one small and quite optional part of the solution, and there are much stronger things that can be done. That said, as is visible at the end of the video, this tactic is indirectly used here as well. The underlying implementation detail is simply that each resource bundle (“segment”) allocated across shmif always gets assigned a unique identifier, unless one is provided upon connection, allowing the window manager to store that identifier combined with layout information. Durden, for instance, does that periodically after something relevant in the layout has changed. When clients reconnect via the recovery mechanism covered later, this identifier is provided upon registration and the window layout can be restored.

1. Surviving Window Manager crashes

For this demo, I have added a ‘Crash WM’ menu command that tries to call a function which doesn’t exist (a fatal scripting error). You can see the highlighted error message on the left screen (this_function_will_crash_the_wm()).

The individual steps are, roughly:

  1. VM script error handler -> longjmp back into a part of main().
  2. Drop all ‘non-externally connected’ resources.
  3. Reinitialise VM using new or previous set of scripts.
  4. Expose externally connected resources through special ‘adopt’ event handler.
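The four steps can be illustrated with a small simulation. This is only a sketch in Python, not the engine's C and Lua code: a Python exception stands in for the longjmp, and the names Resource, Engine, crashing_wm and adopt are hypothetical rather than Arcan API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ScriptError(Exception):
    """Stands in for a fatal error raised inside the scripting VM."""


@dataclass
class Resource:
    name: str
    external: bool  # True if backed by a live external client connection


@dataclass
class Engine:
    resources: List[Resource] = field(default_factory=list)

    def run(self, scripts: Callable[["Engine"], None],
            adopt: Callable[[Resource], bool]) -> None:
        try:
            scripts(self)
        except ScriptError:
            # Step 1: control returns here, the analogue of longjmp into main().
            # Step 2: drop everything that is not keeping a client alive.
            survivors = [r for r in self.resources if r.external]
            # Step 3: a fresh VM would be created here (new or previous scripts).
            # Step 4: offer each surviving resource to the new scripts.
            self.resources = [r for r in survivors if adopt(r)]


def crashing_wm(engine: Engine) -> None:
    engine.resources.append(Resource("titlebar", external=False))
    engine.resources.append(Resource("gtk3-demo", external=True))
    engine.resources.append(Resource("terminal", external=True))
    raise ScriptError("this_function_will_crash_the_wm")


engine = Engine()
# The adopt handler accepts everything except the terminal, to show rejection.
engine.run(crashing_wm, adopt=lambda r: r.name != "terminal")
print([r.name for r in engine.resources])
```

The window-manager-owned titlebar is dropped unconditionally, while the two client-backed resources are offered for adoption and only the accepted one remains.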

The first part of the strategy here is separation between the core engine and the policy layer (window manager) by a scripting interpreter (the Lua language). Lua was chosen for its strong track record, the quality of the available JIT, the way the VM integrates with C code, the detailed control over dependencies, the small code and memory footprint, its low “barrier to entry” and for being “mostly harmless, with few surprises” as a language. Blizzard, for instance, used Lua as the UI scripting layer in World of Warcraft for good reasons and to great effect.

Note: A more subtle detail that is currently being implemented, is that, due to the way the engine binds to the Lua layer, it is easy to make the bindings double as a privileged protocol, using the documentation format and preexisting documentation as an interface description language. By wrapping the Lua VM integration API with a line format and building the binding code twice, with one round as a normal scripted build and another as a ‘protocol build’, we will be able to fully replicate the X model with a separate Window Manager ‘client’ process written – but without the synchronisation problems, security drawbacks or complex legacy.

When the Lua virtual machine encounters an error, it invokes an error handler. This error handler will deallocate all resources that are not designated for keeping clients alive, but keep the rest of the engine intact (the longjmp step).

When the virtual machine is rebuilt, the scripts will be invoked with a special ‘adoption’ event handler – a way to communicate with the window management scripts to determine if it accepts a specific external client connection, or if it should be discarded. The nifty part about this design is that it can also be used to swap out window manager schemes at runtime, or ‘reset’ the current one – all using the same execution path.

The active script also has the option to ‘tag’ a connection handle with metadata that can be accessed from the adoption event handler in order to reconstruct properties such as window position, size and workspace assignment. The DE in the video, Durden, uses the tag for this very purpose. It also periodically flushes these tags to disk, which works both as ‘remember client settings’ and for rebuilding workspace state on harder crashes (see section 3).
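A rough sketch of the tagging idea, in Python for brevity. The tag format (a JSON object with x, y, w, h and workspace), the file name and the function names are all assumptions made for illustration; they do not reflect how Durden actually stores its state.

```python
import json
import os
import tempfile
from typing import Dict, Optional


def flush_tags(path: str, tags: Dict[str, dict]) -> None:
    """Write all tags via a temp file so a crash mid-write cannot corrupt them."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(tags, fh)
    os.replace(tmp, path)  # atomic rename on POSIX


def load_tags(path: str) -> Dict[str, dict]:
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def adopt(client_id: str, tags: Dict[str, dict]) -> Optional[dict]:
    """Return the stored layout for a known client, or None to use defaults."""
    return tags.get(client_id)


state_file = os.path.join(tempfile.mkdtemp(), "layout.json")
flush_tags(state_file, {
    "client-1": {"x": 10, "y": 20, "w": 640, "h": 480, "workspace": 2},
})

# ... the window manager crashes and is rebuilt here ...
restored = adopt("client-1", load_tags(state_file))
unknown = adopt("client-2", load_tags(state_file))
print(restored, unknown)
```

Writing to a temporary file and renaming it is a deliberate choice in this sketch: state that exists to survive crashes should not itself be lost to a crash during the write.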

2. Surviving Protocol implementation crashes

For this demo, I have modified the wayland server side implementation to crash when a client tries to spawn a popup. There are two gtk3-demo instances up and running, where I trigger the popup-crash on one, and you can see that the other is left intact.

Wayland support is implemented via a protocol bridge, waybridge – or by the name of its binary: arcan-wayland. It communicates with the main process via the internal ‘SHMIF’ API as a means of segmenting the engine processing pipeline into multiple processes, as trying to perform normal multithreading inside the engine itself is a terrible idea with very few benefits.

At the time of the earlier presentations on Arcan development, the strategy for Wayland support was simply to let it live inside the main compositor process – and this is what others are also doing. After experimenting with the wayland-server API however, it ticked off all the check-boxes on my “danger, Will Robinson!” chart for working in memory-unsafe languages (including it being multithread-unsafe), placing it firmly in a resounding “not a chance, should not be used inside a privileged process” territory; there are worse dangers than The Wayland Zombie Apocalypse lurking here. Thus, the wayland support was implemented in front of SHMIF rather than behind it, but I digress.

Waybridge has 2.5 execution modes:

  • simple mode – all clients share the same resource and process compartment.
  • single-exec mode – only one client will be bridged, when it disconnects, we terminate.

With simple mode, things work as expected: one process is responsible for mediating connections, and if that process happens to crash, everyone else goes down with it. The single-exec mode, on the other hand, is neat because a protocol level implementation crash will only affect a single client. It can also achieve better parallelism for shared-memory type buffers, as the cost for synchronising and transferring the contents to a GPU will be absorbed by the bridge process assigned to the responsible client at times where the compositor may be prevented from doing so itself.

What happens in single-exec mode, i.e.

./arcan-wayland -exec gtk3-demo

is that the Wayland abstract ‘display’ will be created with all the supported protocols. The related environment variables (XDG_RUNTIME_DIR, …) and temporary directories will be generated, and the specified client is then launched via inherit+fork+exec.
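The launch sequence can be approximated as follows. This is a hedged sketch, not the arcan-wayland source: it only shows the private runtime directory and the environment handed to the child. XDG_RUNTIME_DIR and WAYLAND_DISPLAY are the standard variables Wayland clients consult; the function name spawn_wrapped and the display name wayland-0 are assumptions, and a small Python child stands in for the real client.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def spawn_wrapped(argv: List[str], display: str = "wayland-0") -> subprocess.Popen:
    """Launch one client with its own runtime directory and display name."""
    # mkdtemp creates the directory accessible only by the current user,
    # which matches what XDG_RUNTIME_DIR expects.
    runtime_dir = tempfile.mkdtemp(prefix="bridge-")
    env = dict(os.environ)
    env["XDG_RUNTIME_DIR"] = runtime_dir
    env["WAYLAND_DISPLAY"] = display
    # A real bridge would create its listening socket inside runtime_dir
    # before starting the client; that part is omitted here.
    return subprocess.Popen(argv, env=env, stdout=subprocess.PIPE, text=True)


# Stand-in client that reports which display it was told to use.
child = spawn_wrapped([
    sys.executable, "-c", "import os; print(os.environ['WAYLAND_DISPLAY'])",
])
out, _ = child.communicate()
print(out.strip(), child.returncode)
```

Because every wrapped client gets its own directory, anything else that expects to find files in the user's usual runtime directory will not see them, which is the kind of side effect the note below refers to.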

Note: This does not currently “play well” with some other software that relies on the folder (notably pulseaudio), though that will be fixed shortly.

The missing “.5” execution mode is a slightly cheaper hybrid that will use the fork-and-continue pattern in order to multiplex on the same listening socket, thus removing the need to ‘wrap’ clients while still keeping them in separate processes. Unfortunately, this mode still lacks some ugly tricks in order to work around issues with the design of the wayland-server API.
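For readers unfamiliar with the fork-and-continue pattern, here is the generic fork-per-connection shape of it on a Unix domain socket. It is a plain Python sketch with a toy echo exchange instead of Wayland traffic, it assumes a POSIX system, and it does not show the wayland-server workarounds mentioned above; serve_one and the socket name are illustrative.

```python
import os
import socket
import tempfile


def serve_one(listener: socket.socket) -> int:
    """Accept one client, hand it to a forked child, return the child's pid."""
    conn, _ = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: owns this client only, so a crash here cannot affect siblings.
        listener.close()
        data = conn.recv(64)
        conn.sendall(b"bridged:" + data)
        conn.close()
        os._exit(0)
    # Parent: drop its copy of the client socket and go back to listening.
    conn.close()
    return pid


path = os.path.join(tempfile.mkdtemp(), "wayland-0")
listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
listener.bind(path)
listener.listen(8)

# A client connects to the shared socket without being wrapped by anything.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
client.sendall(b"hello")

pid = serve_one(listener)
reply = client.recv(64)
_, status = os.waitpid(pid, 0)
print(reply, os.WEXITSTATUS(status))

client.close()
listener.close()
```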

Another neat gain with this approach is that each client can be provided with a different set of supported protocols; not all clients need to be allowed direct GPU access, for instance. The decoupling also means that the implementation can be upgraded live and that we can load balance. Other than that, if you, like me, mostly work with terminals or special-purpose graphics programs that do not speak Wayland, you can opt in and out of the attack surface dynamically. Per-client sandboxing or interposition tricks in order to stub or tamper with parasitic dependencies (D-Bus, …) that tend to come with Wayland clients due to the barebones nature of the protocol also become relatively easy.
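As a small illustration of per-client protocol sets, consider the following sketch. The interface names are real Wayland interfaces, but the profile names, the table and the advertised function are invented for this example and are not how arcan-wayland is configured.

```python
from typing import Dict, FrozenSet

# Interfaces a bridge instance could advertise to its client.
ALL_PROTOCOLS: FrozenSet[str] = frozenset({
    "wl_compositor", "wl_shm", "wl_seat", "xdg_wm_base", "zwp_linux_dmabuf_v1",
})

# Hypothetical profiles, one per trust level.
PROFILES: Dict[str, FrozenSet[str]] = {
    "trusted": ALL_PROTOCOLS,
    # No dmabuf: the client has to fall back to shared-memory buffers.
    "untrusted": ALL_PROTOCOLS - {"zwp_linux_dmabuf_v1"},
}


def advertised(profile: str) -> FrozenSet[str]:
    """Return the interfaces to expose; unknown profiles get the strictest set."""
    return PROFILES.get(profile, PROFILES["untrusted"])


print(sorted(advertised("untrusted")))
```

Since each bridge process serves one client, the chosen set only has to be decided once, when that process is started.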

3. Surviving Compositor level crashes

Time to increase the level of difficulty. Normally, I run Arcan with something like the following:

while true; do arcan -b durden durden; done

This means that if the program terminates, it is simply restarted. The -b argument relates to “1. Surviving Window Manager crashes”: in the event of a crash, reload the last set of scripts. It could also be the name of another window manager or, when doing development work, be combined with a git stash to fall back to the ‘safe’ version in the case of a repeatable error during startup.

In this clip, you see me manually sending SIGKILL to the display server controlling the right display, and the screen goes black. I then start it again, and you can see the clients come back without any intervention, preserving position, size and hierarchy.

This one requires cooperation throughout the stack, with most work on the ‘client’ side, and here there are some Wayland woes, in that there isn’t any protocol in place to let the client know what to do in the event of a crash. However, due to the decoupling between protocol implementation and server process, that task is outsourced to the IPC subsystem (SHMIF) which ties the two together.

What happens first is that the window manager API has commands in place to instruct a client where to connect in the event of the connection dying unexpectedly. This covers two use cases, namely crash recovery, and for migrating a client between multiple server instances.

When waybridge detects that the main process has died, it goes into a reconnect loop, sleeping on failure, against a connection primitive that was provided as a ‘fallback’. When the new connection is established, waybridge enumerates all surfaces tied to a client, renegotiates new server-side resources, and updates buffers and hierarchy based on the last known state.
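The reconnect behaviour can be sketched with ordinary Unix sockets. This is a simplified Python model, not SHMIF: the fallback is a socket path, ‘surfaces’ are just names sent line by line, and a thread that starts listening after a short delay plays the restarted display server. All function names are made up for the example.

```python
import os
import socket
import tempfile
import threading
import time
from typing import List


def reconnect(path: str, delay: float = 0.05, attempts: int = 100) -> socket.socket:
    """Retry the fallback connection point until a server is listening again."""
    for _ in range(attempts):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return sock
        except (FileNotFoundError, ConnectionRefusedError):
            sock.close()
            time.sleep(delay)  # sleep on failure, then try again
    raise TimeoutError("fallback connection point never came back")


def restore(sock: socket.socket, surfaces: List[str]) -> None:
    """Re-announce every known surface, parents first, from last known state."""
    for name in surfaces:
        sock.sendall(name.encode() + b"\n")


fallback = os.path.join(tempfile.mkdtemp(), "fallback")
received: List[str] = []


def restarted_server() -> None:
    time.sleep(0.2)  # the display server is down for a moment
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(fallback)
    srv.listen(1)
    conn, _ = srv.accept()
    with conn, conn.makefile() as fh:
        received.extend(line.strip() for line in fh)
    srv.close()


server = threading.Thread(target=restarted_server)
server.start()
link = reconnect(fallback)           # fails a few times, then succeeds
restore(link, ["toplevel", "popup"])
link.close()
server.join()
print(received)
```

Sending parents before children matters in this model because a popup can only be re-created once the surface it belongs to exists on the new server.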

The ugly caveat to this solution right now is that connections can be granted additional privileges on an individual basis, normally based on whether or not they have been launched from a trusted path of execution (Arcan spawned the process with inherited connection primitives). Since this chain is broken when the display server is restarted, such privileges are lost and the user (or scripts acting on behalf of the user) needs to interactively re-enable them.

The big (huge) upside, on top of the added resilience, is that it allows clients to jump between different display server instances, allowing configurations like multi-GPU via one-instance-per-GPU patterns, or selectively moving certain clients to other specialised server instances, e.g. a network proxy.

4. Surviving Display System crashes

This one is somewhat more difficult to demonstrate – and the full story would require more cooperation from the lower graphics stack and access to features that are scheduled for a later release, so it is only marginally interesting in the current state of things.

The basic display server level requirement is being able to, at any time, generate local copies of all GPU-bound assets, and then rebuild itself on another GPU and, failing that, tell clients to go somewhere else. On the other hand, this is, to some extent, also required by the CTRL+ALT+Fn style of virtual terminal switching.
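The control flow of that requirement, stripped of everything graphics related, might look like the sketch below. Devices are modelled as plain upload callbacks and assets as byte strings; DeviceLost, rebuild and the two fake GPUs are hypothetical names, and nothing here reflects the actual engine internals.

```python
from typing import Callable, Dict, List, Optional


class DeviceLost(Exception):
    """Raised by a (hypothetical) device when an upload fails."""


def rebuild(assets: Dict[str, bytes],
            devices: List[Callable[[str, bytes], None]]) -> Optional[int]:
    """Upload the CPU-side copies to the first device that accepts them all.

    Returns the index of the device used, or None if every device failed,
    in which case the caller should tell clients to go somewhere else.
    """
    for index, upload in enumerate(devices):
        try:
            for name, pixels in assets.items():
                upload(name, pixels)
            return index
        except DeviceLost:
            continue
    return None


def broken_gpu(name: str, pixels: bytes) -> None:
    raise DeviceLost(name)


uploaded: Dict[str, bytes] = {}


def working_gpu(name: str, pixels: bytes) -> None:
    uploaded[name] = pixels


# Local copies that were read back while the original device still worked.
local_copies = {"cursor": b"\x00" * 4, "wallpaper": b"\xff" * 16}
print(rebuild(local_copies, [broken_gpu, working_gpu]))
print(rebuild(local_copies, [broken_gpu]))
```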

The ‘go somewhere else’ step is easy: it just reuses the same reconnect mechanism that was shown in ‘Surviving Compositor level crashes’ and combines it with a network proxy, through which the client can either remote-render or migrate state.

The rebuild step starts out simple since, initially, the rendering pipeline is typically only drawing a bunch of quadrilaterals with external texture sources – easy enough. With window manager specific resources and custom drawing, multi-pass, render-to-texture effects, shaders and so on, the difficulty level rises considerably, as the supported feature set between the devices might not match. It becomes worse still when clients that draw using accelerated devices, which might have been irreversibly lost, also need to be able to do this – along with a protocol for transferring new device primitives and dynamically reloading related support libraries.

The upside is that when the feature is robust, all major prerequisites for proper multi-vendor, multi-GPU scan-out, load-balancing and synthesis, live driver updates and GPU/display-to-client handover have been fulfilled – and the volatility is mitigated by the crash level recovery from section 3, but that is a topic for another time.
