Our reference desktop environment, Durden, rarely gets covered here these days. This is mostly because the major features have long been in place and that part of the project is biding its time with smaller fixes while waiting for improvements in the rest of the stack.
Recently the stars aligned and I had some time to spare for working on the accessibility story, particularly the ‘no vision’ parts. Other forms (e.g. limited mobility, low vision) will be covered in future articles, but most of the pieces are already in place, including eye tracking and bespoke devices like stream decks.
Here is a short recording of the first run of a clean installation, setting things up.
The no-vision support is enabled by default during first setup. The first question presented is whether to disable it; there is no hidden trapdoor combination or setting needed to enable it the first time around.
The following recording shows the built-in menu system being used to start a terminal emulator and open a PDF, and the mouse cursor being used to navigate, with OCR results spoken as I go. These will be elaborated on further in this article.
There is a previous, higher-level article which covered a rougher outline of what is intended, but this one is more specific about work that can be used today, albeit with some rough edges still.
One detail to take away from that article is how the architecture splits user data processing into specific one-purpose replaceable programs. These can be isolated and restricted to a much stronger degree than a more generic application.
These are referred to as frameservers. They have a role (archetype) and the ones of interest here are decode and encode. Decode translates from a computer-native representation to a human-presentable one, like loading an image into a pixel soup or turning annotated text into sound via synthesised speech. Encode goes the other way, a potentially lossy translation from the human-presentable form back to something computer-native, such as pixel soup to text via OCR, image description or audio transcription.
Another detail is that the “screen reader” here is not the traditionally detached component that tries to stitch narration together through multiple sidebands. Instead, it is an always-present, first-class mechanism that the window manager should leverage. There is no disconnect between how we provide visual information and how we provide aural information, and both blend naturally with the extensive network transparency that is part of our ‘many devices, one desktop’ design target.
Key Features
Here is a short list of the things currently in place:
- Multiple simultaneous positional voice profiles for different types of information
- On-demand, client-requested accessibility windows
- Force-injected client accessibility fallback with default OCR latched to frame updates
- Buffered Text Input with lookup oracle
- All desktop interaction controls as a file-system
- Command-line shell aware
- Keyboard controlled OCR
- Content-aware mouse audio feedback
- Special controls for text-only windows tracking changes
There is a lot more planned or in progress:
- Premade bootable live / VM image and generator
- Bindings to additional client support (AccessKit)
- Compatibility preset profiles to help transition from NVDA/Orca
- Extend accessibility fallback with description oracles (LLM:offline, …)
- Extend speech process format with waveform/tone/samples for formatting
- Extended lookup oracle
- Stronger language controls/toggles
- Scope based ‘few-line’ text editor extension to shell
- Haptic support
- Braille Terminal output (lacks hardware access currently)
- Indexer for extracting or generating alt-text descriptions
Screen Reading Basics
Let’s unpack some of what happens during setup.
The TTS tool in Durden uses one or many speech profiles that can be found in the durden/devmaps/tts folder. Each profile describes one voice: how its speech synthesis should work, which kinds of information it should convey, and any input controls.
They allow multiple voices to carry different kinds of information at different positions to form a soundscape, so you can use ‘clean’ voices for important notifications and ‘fast robotic’ ones for dense information, where actions like flushing the pending speech buffer don't accidentally cancel out something important.
A compact form starts something like this:
model = "English (Great Britain)",
gain = 1.0, gap = 10, pitch = 60, rate = 180, range = 60, channel = "l", name = "basic_eng", punct = 1, cappitch = 5000
These are just the basic voice synthesis parameters one would expect. Then it gets better.
actions = {
    select = {"title", "title_text"},
    menu = "menu",
    clipboard = "clip",
    clipboard_paste = "clip-paste",
    notification = "notify"
}
bindings = {
    m1_r = "/global/tools/tts/voices/basic_eng/flush",
    m1_t = "/global/tools/tts/voices/basic_eng/slow_replay"
}
These tell what this voice gets to do. The action keys mark which subsystems the voice should jack into, with the value acting as a custom prefix announcement. The profile above would present all menu navigation, system notifications, clipboard access and the window title on selection.
The bindings are keyboard binding overlays that take priority when the voice is active. Just like any other binding, they map to paths in the virtual filesystem that Durden is structured around. The two examples shown either cancel all pending speech for this specific voice, or turn the speech rate down low, repeat the last message and then return to the voice default. There are, of course, many others to choose from, including things like key echo and so on.
Holding down any of the meta or accessibility-bound buttons for a few seconds without activating a specific keybinding will play the current bindings back to you, to ease learning or refresh your memory.
By adding a position = {-10, 0, -10} attribute to the profile, the system switches to 3D positional audio and, in this example, positions the voice behind you to your left. This feature also introduced the /target/audio/position=x,y,z path, which lets you move any audio from the selected window to a specific position around you, along with /target/audio/move=x,y,z,dt which slides it there over time.
With an event trigger, e.g. /target/triggers/select/add=/target/audio/position=0,0,0 and /target/triggers/deselect/add=/target/audio/position=10,0,-10, the soundscape around you also matches the window management state itself.
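Pieced together, a voice profile in durden/devmaps/tts then looks roughly like the sketch below. The keys are the ones from the fragments above; the return-table wrapper and the exact values are illustrative rather than a verbatim copy of a shipped profile:

return {
    name = "basic_eng",
    model = "English (Great Britain)",
    gain = 1.0, gap = 10, pitch = 60, rate = 180, range = 60,
    channel = "l", punct = 1, cappitch = 5000,
    position = {-10, 0, -10}, -- switch to 3D positional audio, back left
    actions = {
        select = {"title", "title_text"},
        menu = "menu",
        clipboard = "clip",
        clipboard_paste = "clip-paste",
        notification = "notify"
    },
    bindings = {
        m1_r = "/global/tools/tts/voices/basic_eng/flush",
        m1_t = "/global/tools/tts/voices/basic_eng/slow_replay"
    }
}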
The cursor profile has a few more things to it:
cursor = {
    alt_text = "over ",
    xy_beep = {65.41, 523.25},
    xy_beep_tone = "sine",
    xy_tuitone = "square",
    xy_tuitone_empty = "triangle"
    ...
}
The alt_text option reads any tagged UI elements with their text description. The xy_beep option specifies the frequency range that the mouse cursor coordinates map to, so that pitch and gain change as you slide the cursor across the screen.
The waveform used to generate the tone also changes with the type of content the cursor is over, so that text windows such as terminal emulators get a distinct tone that distinguishes between empty and populated cells.
The following clip shows navigating over UI elements, a browser window and a terminal window. You can also hear how the ‘select text to copy to clipboard’ action doubles as a more reliable means of hearing text contents in uncooperative windows.
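To make the mapping concrete, here is a minimal sketch of how a cursor position could translate into a beep frequency, assuming a simple linear interpolation between the two xy_beep endpoints; the actual curve used may well differ:

-- illustrative only: map a normalised cursor position (0..1) onto
-- the xy_beep frequency range from the cursor profile above
local function beep_frequency(fract, low_hz, high_hz)
    return low_hz + (high_hz - low_hz) * fract
end

print(beep_frequency(0.5, 65.41, 523.25)) -- cursor mid-screen, ~294 Hz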
There are also more experimental parts to the cursor, such as using a GPU preprocessing stage to attenuate edge features and then convert a custom region beneath the cursor into sounds. While it takes some training to decipher, this is another form of seeing with sound and applies to any graphical content, including webcam feeds. After some hours I (barely) managed to play some graphical adventure games with it.
Kicking it up a notch
Time to bring out the spice weasel and get back to the OCR part I briefly mentioned earlier.
The astute reader of this blog will recall the post on Leveraging the “Display Server” to Improve Debugging. A main point of that article is that the IPC system is designed such that the window manager can push a typed data container (window) and the client can map and populate it. If a client doesn’t, there is an implementation that comes along with the IPC library. That is not only true for the debug type but for the accessibility one as well.
This means that with the push of a button we can probe the accessibility support for a window, and if none exists, substitute our own. This is partly intended to provide support for AccessKit which will complete the solution with cooperative navigation of the data model of an application.
The current fallback spawns an encode session which latches frame delivery to the OCR engine. The client doesn’t get to continue rendering until the OCR pass has completed, so the contents are forced to stay in sync.
This is also where our terminal replacement comes in, particularly the TUI library used to provide something infinitely better than ncurses. The current implementation turns the accessibility implementation into a window attachment with a larger font (for the ‘low vision’ case), the shell populates it with the most important content, and the text-to-speech tool kicks in.
Such text-only windows also get added controls, /global/tools/tts/voices/basic_eng/text_window/(speak_new, at_cursor, cursor_row, synch_cursor, step_row=n), for a dedicated reading cursor that you move around separately.
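Wiring that probe up to a key follows the same bindings pattern as the voice profiles. The exact menu path is left out here, so treat the one below as a placeholder rather than the real entry:

bindings = {
    -- hypothetical path and chord: probe / force the accessibility window
    m1_a = "/target/accessibility"
}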
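Bound into a voice profile, those controls could look something like this; the key chords are only examples:

bindings = {
    -- reading-cursor controls for text-only windows
    m1_n = "/global/tools/tts/voices/basic_eng/text_window/speak_new",
    m1_c = "/global/tools/tts/voices/basic_eng/text_window/at_cursor",
    m1_j = "/global/tools/tts/voices/basic_eng/text_window/step_row=1",
    m1_s = "/global/tools/tts/voices/basic_eng/text_window/synch_cursor"
}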
Our custom CLI shell, Cat9, probes for accessibility at startup. If found, it adjusts its presentation, layout and input controls to match, so that there is instant feedback. There is still the normal view to explore with keyboard controls, but with added aural cues. The following clip demonstrates this both visually and aurally:
All this is, of course, network transparent.
A final related nugget covers both accessibility and security. The path /target/input/text lets you prepare a text locally that will then be sent as simulated discrete characters. These are spaced based on a typing model, meaning that for the many poor clients that stream over a network, someone armed with signal processing 101 doing side-channel analysis to reconstruct plaintext from encrypted channel metadata will be none the wiser.
This is useful for other things as well. For an input prompt one can set an oracle which provides suggested completions. This is provided by the decode frameserver through hunspell, though others are just around the corner to fill the role of input method engines, password manager integration, more complex grammar suggestions and offline LLMs.
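Since this is just another path, it can also be bound into a profile like everything else; the chord here is again only an example:

bindings = {
    -- open the local text preparation prompt for the selected window
    m1_i = "/target/input/text"
}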
The following clip shows how I first type something into this tool, then read it back from the content of the window itself.
The current caveat is that it still does not work with X11 and Wayland clients due to their embarrassingly poor input models. Some workarounds are on their way, but there are a lot of problems to work around, especially for non-latin languages.