Trying to understand Atomics

As multi-core processing continues to become more relevant, having a good understanding of the primitives used to achieve fast and correct concurrent (and ideally parallel) processing is very important.

Personally, I’m pretty confident with Mutexes and channels, especially in Rust where the data protected by a Mutex is actually, you know, protected by it. There is however another set of primitives: Atomics. These primitives have enticed me ever since I watched Herb Sutter’s excellent talk: Lock-Free Programming (or, Juggling Razor Blades).

The Promise of Atomics

Atomics sound great: they allow you to safely mutate data shared between threads without having to do any locking, so no blocking on OS calls or whatnot.

The basic idea is that any instruction on an atomic type happens in a single transaction. So other threads will always either see the full result of a transaction, or the state before the transaction, never an in-between value.

Available operations

Usually, the operations supported on these Atomic types are:

  • load – Loads the value of the given atomic, “atomic” in the sense that only a real stored value is read, never an in-between value caused by a computation or partial update.
  • store – Stores the given value, “atomic” in the sense that other threads will either see the full written value, or the old one.
  • fetch_add, fetch_sub, fetch_and, fetch_or, fetch_max, fetch_min, etc. (for int types) – Add, subtract, etc. operations. “Atomic” in the sense that only the full result of the operation is stored, and no updates can be missed. I.e. if 10 threads each increment an atomic with value 0 once, the result will definitely be 10. Usually these operations return the value the atomic previously held.
  • compare_exchange – One of the more advanced instructions, checks that the atomic is of a given value, and only if this is the case, stores a new value into it. This is “atomic” in the sense that there may not be an update to the value in-between checking that it is of the right value and storing something in it.
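
To make the “no updates can be missed” claim a bit more tangible, here is a minimal sketch using only the standard library. The function name count_with_threads is my own invention for illustration, and I'm using scoped threads simply to avoid wrapping the counter in an Arc.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Spawns `threads` threads that each increment a shared counter once
/// and returns the final value of the counter.
/// (Hypothetical helper, not taken from any library.)
fn count_with_threads(threads: usize) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                // fetch_add returns the previous value; we ignore it here.
                counter.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // thread::scope joins all spawned threads before returning.
    counter.load(Ordering::Relaxed)
}
```

Calling count_with_threads(10) should always yield 10, no matter how the threads are scheduled.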

These operations are only individually atomic. A sequence of them is not in itself atomic. So trying to replace a compare_exchange call with a manual check will not be an atomic operation!
In the example below, the output may be any of the following:

  • Thread 1 is first!
  • Thread 2 is first!
  • Thread 1 is first!
    Thread 2 is first!
  • Thread 2 is first!
    Thread 1 is first!

// Thread 1
if my_atomic.load(Ordering::Relaxed) == 0 {
    // Thread 2 may store "2" into the atomic now
    my_atomic.store(1, Ordering::Relaxed);
    println!("Thread 1 is first!");
}
// Thread 2
if my_atomic.load(Ordering::Relaxed) == 0 {
    // Thread 1 may store "1" into the atomic now
    my_atomic.store(2, Ordering::Relaxed);
    println!("Thread 2 is first!");
}

So there may still be operations of other threads in-between the atomic operations of one thread! There can easily be race-conditions with atomics!

The correct Rust implementation would look like this:

// Thread 1
if atomic
    .compare_exchange(0, 1, Ordering::Relaxed, Ordering::Relaxed)
    .is_ok()
{
    println!("Thread 1 is first!");
}
// Thread 2
if atomic
    .compare_exchange(0, 2, Ordering::Relaxed, Ordering::Relaxed)
    .is_ok()
{
    println!("Thread 2 is first!");
}
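
If you want to try this out yourself, the following is a small self-contained sketch of the same idea with an arbitrary number of threads. The name race_for_first and the convention that 0 means “nobody yet” are assumptions of mine, not part of any API.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Lets `contenders` threads race to claim a shared slot and returns
/// the ids of all threads whose compare_exchange succeeded.
/// (Hypothetical helper for illustration.)
fn race_for_first(contenders: u32) -> Vec<u32> {
    let slot = AtomicU32::new(0); // 0 means "nobody was first yet"
    thread::scope(|s| {
        let handles: Vec<_> = (1..=contenders)
            .map(|id| {
                let slot = &slot;
                s.spawn(move || {
                    // Check for 0 and store our id in one indivisible step.
                    slot.compare_exchange(0, id, Ordering::Relaxed, Ordering::Relaxed)
                        .is_ok()
                        .then_some(id)
                })
            })
            .collect();
        handles
            .into_iter()
            .filter_map(|handle| handle.join().unwrap())
            .collect()
    })
}
```

The returned Vec should always contain exactly one id, as the slot never goes back to 0 once a thread has claimed it. Which id that is depends on scheduling.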

Memory Order

You may have noticed that the previous examples have all used this extra Ordering::Relaxed argument. These Orderings are part of the C++20 memory model for atomics, which Rust has simply copied with a few minor tweaks.

But what do these orderings actually do? Wasn’t the point of Atomics, that they provide atomic operations by themselves, so what is this extra argument for? This is the part that I’m still trying to wrap my head around. I’ll try to explain my current understanding of these. This understanding may not be correct! You have been warned! But I will keep updating this post to reflect my current understanding and to keep this as a reference.

So far as I can tell, the memory Ordering is actually pretty much irrelevant, as long as we’re talking about synchronizing through a single atomic variable. As mentioned earlier, the atomic operations are all indeed atomic, no matter what ordering is used. And they will eventually be seen by all other threads. So as long as no other shared memory depends on the value of an atomic, Ordering::Relaxed is fine to use!

What does this mean in practice? In our earlier examples, the threads communicated entirely through the given “atomic” variable. They only exchanged the value inside the atomic, nothing else.
Therefore Ordering::Relaxed is fine to use there.

However, what happens if we also change another value (atomic or not), maybe in a producer-consumer fashion?

let value = AtomicI32::new(0);
let has_value = AtomicBool::new(false);

// Thread 1 (Producer)
value.store(5, Ordering::Relaxed);
has_value.store(true, Ordering::Relaxed);

// Thread 2 (Consumer)
while !has_value.load(Ordering::Relaxed) { std::hint::spin_loop(); }
let value = value.load(Ordering::Relaxed);
println!("The value is {value}");

From what I’ve gathered, this code seems correct at first, but actually isn’t! It’s entirely possible that the consumer thread may read the value as 0. But how can this be? We’re assigning “value” to 5 before we assign “has_value”. Right? Right?!

Unfortunately, the hardware may actually not do as requested. Modern CPUs (and optimizing compilers, for that matter) can reorder instructions to achieve a better utilization of their individual cores, as long as the executing thread itself cannot tell the difference. As the stores to “value” and “has_value” don’t have a data-dependency between them (i.e. the store to “has_value” doesn’t need to know what’s inside “value” to succeed), they may be reordered so that the store to “has_value” executes first.

These loads and stores will still be atomic, so the consumer thread will still always read either “0” or “5” and never “1” or another undefined value. But the point is that even with atomics, operations that happen to two different memory locations (atomic or not) are not synchronized.

Solving these problems is what these memory ordering arguments are for. They’re for specifying what should happen to other memory accesses before/after an access to the atomic.

There currently exist 5 of these orderings in Rust:

  • Relaxed
  • Acquire
  • Release
  • Acquire Release (AcqRel)
  • Sequentially Consistent (SeqCst)

There is an additional one in C++, which I won’t cover.

Sequentially Consistent

This is the easiest-to-use and strongest ordering. It basically means “do not reorder anything around this instruction”. Anything that happens after this instruction will happen after it and anything that happens before it will happen before it.

If in doubt, this is the go-to ordering, as it’s the strongest. If you use this for all your atomic operations, your program should execute in the intended and intuitive order.
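
One case where, as far as I understand it, SeqCst really is needed is the classic “both threads raise a flag, then look at the other flag” pattern. The sketch below uses names of my own choosing (raise_and_look, flag_a, flag_b). With SeqCst on all four operations, at least one of the two threads must see the other's flag. To my understanding, weaker orderings would allow both threads to read false.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Each thread raises its own flag and then looks at the other one.
/// Returns what thread A and thread B observed, in that order.
/// (Hypothetical helper for illustration.)
fn raise_and_look() -> (bool, bool) {
    let flag_a = AtomicBool::new(false);
    let flag_b = AtomicBool::new(false);
    thread::scope(|s| {
        let a = s.spawn(|| {
            flag_a.store(true, Ordering::SeqCst);
            flag_b.load(Ordering::SeqCst)
        });
        let b = s.spawn(|| {
            flag_b.store(true, Ordering::SeqCst);
            flag_a.load(Ordering::SeqCst)
        });
        (a.join().unwrap(), b.join().unwrap())
    })
}
```

Because all SeqCst operations take part in one single global order, the result (false, false) should never come back from this function.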

Acquire & Release

These orderings are for “acquiring” data from another thread and “releasing” data to another thread respectively.

So if we update an atomic that signals to another thread that it may read data we modified, we’ll want to use “Release”. In the previous example, this would be the “has_value” variable.

More formally, a store with “Release” ordering guarantees that any memory access that happened before it (in the example the store to “value”) does indeed happen before, even for non-atomic variables or atomics that used “Relaxed” ordering. However, any access that happens after the “Release” operation may actually happen before it.

Especially this last part has been counter-intuitive for me. How is it okay to move another operation before this store? Imagine this example code:

value.store(5, Ordering::Relaxed);
has_value.store(true, Ordering::Release);
value.store(10, Ordering::Relaxed);

Here the store to set “value” to 10 may actually happen before the store to has_value. But this doesn’t change the behavior of another thread observing the “has_value” variable. After “has_value” is true, the value can be either 5 or 10, the other thread can always read either value. As previously stated, there are no atomic guarantees between atomic operations. So from another thread’s perspective, no matter what we do, the “value” will be set to 10 at “some point” and there’s no way for us to know when exactly by looking at “has_value”. Whenever we check “has_value” the “value” may have already changed to 10. Basically any access to “value” before checking “has_value” may be 0, 5, or 10. The only thing we know after checking that “has_value” is true is that it’s 5 or 10, but never 0. So the stores may effectively execute in this order:

value.store(5, Ordering::Relaxed);
value.store(10, Ordering::Relaxed);
has_value.store(true, Ordering::Release);

But even if we check has_value, that doesn’t mean that our access to “value” will actually happen “after” that check. The CPU may still freely reorder these two calls to load “value” first. That is where “Acquire” comes in. Acquire is used to make sure that any operation done after this one actually stays after it. Instructions that happened before may still be reordered to happen after, using the same logic as described for Release.

Acquire also only makes sense when paired with a previous store with “Release”, as only then do we know that certain operations have definitely happened before we read the “has_value” variable and that we’re definitely reading the results after this has happened.

The thing to keep in mind with all of this is that these orderings affect all memory accesses around the atomic operation, not just the atomic itself. That’s where the real use-case is: making sure that “other memory” is synchronized correctly, not the atomic itself.

This blog post also does a great job of explaining how this happens on a CPU level:

So if we were to rewrite the previous example correctly, we would have to use “Release” when setting the “has_value” variable and “Acquire” when reading from it.

let value = AtomicI32::new(0);
let has_value = AtomicBool::new(false);

// Thread 1 (Producer)
value.store(5, Ordering::Relaxed);
has_value.store(true, Ordering::Release);

// Thread 2 (Consumer)
while !has_value.load(Ordering::Acquire) { std::hint::spin_loop(); }
let value = value.load(Ordering::Relaxed);
println!("The value is {value}");

And there we have it, a consumer-producer system that correctly uses atomics.
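
For completeness, here is the same producer-consumer pair as a runnable sketch with real threads. The wrapper function produce_and_consume is a name I made up, and scoped threads are just a convenient way to share the two atomics without an Arc.

```rust
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::thread;

/// Runs one producer and one consumer thread and returns the value
/// the consumer read. (Hypothetical helper for illustration.)
fn produce_and_consume() -> i32 {
    let value = AtomicI32::new(0);
    let has_value = AtomicBool::new(false);
    thread::scope(|s| {
        // Producer: write the data first, then publish it with Release.
        s.spawn(|| {
            value.store(5, Ordering::Relaxed);
            has_value.store(true, Ordering::Release);
        });
        // Consumer: wait for the flag with Acquire, then read the data.
        let consumer = s.spawn(|| {
            while !has_value.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            value.load(Ordering::Relaxed)
        });
        consumer.join().unwrap()
    })
}
```

Thanks to the Release/Acquire pair, the consumer should always return 5 here and never 0.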

An OpenGL preprocessor for Rust

At the moment I’m working on a game project, written in Rust, using pure OpenGL for the graphics backend.

Whilst I’ve become far more confident with OpenGL once I found the amazing RenderDoc, writing plain GLSL code is still annoying. Code is often duplicated, libraries don’t really exist, and sometimes constants need to be known at compile time (like the size of an array).

This is especially problematic if these constants actually originate in your game’s logic (like the number of player types). Updating these values manually in your shader code is repetitive and prone to both error and simple forgetfulness.

For all these reasons it’s really helpful to build some kind of preprocessor for your GLSL code that can include other files, so you can organize your code into manageable chunks.

Enter: Tera

Thanks to the amazing Rust ecosystem, we don’t actually have to write our own preprocessor. Because Rust is also often used for web projects, which need a lot of templated web-pages, a Jinja-like templating engine already exists: Tera.

A Tera template doesn’t just allow you to include other files, but can also receive a context from Rust which allows you to provide values to your OpenGL code at compile time. It even runs a simple scripting language for dynamic template instantiation. Which means we could even write simple macros. This is just what we need! So how do we integrate Tera into our own OpenGL engine?

Well, this depends on what your requirements for the preprocessing are…

To or not to

There are basically two points in our program’s lifetime at which we can run our preprocessor:

  • cargo build i.e. when we compile our Rust program
  • cargo run i.e. when we run our program and compile our OpenGL shaders

For me this boils down to one question:

Why do at runtime, what you can do at compile time?

This question prods at the fact that, as developers, we sometimes tend to calculate a lot of stuff at runtime that we could have just as well hard-coded at compile-time. Whilst the idea of hard-coding might sound vile to some, it actually has a lot of merits. It’s much less error-prone: the result is already there and can be double-checked. It’s faster for the end user, as no processing is needed at runtime. Furthermore we don’t have to test nearly as much, as there’s simply less to go wrong at runtime. This blog post goes into more detail on this idea.

So why would we do our preprocessing at runtime? Sometimes we might have to. Especially if we want to pass Rust values to our preprocessor that we either only know at runtime, or only know once part of our game is already compiled. This is most likely the case if your game’s game logic and rendering aren’t separated into multiple crates. If so, I recommend you just run the GLSL preprocessing right before compiling your GLSL shaders during the runtime of your game.

However in my case, the client’s rendering and game logic are in separate crates, so I can access the game logic even before compiling the client. So I’ll use the power of Cargo’s build scripts to precompile my shaders at compile time.

A build script is simply a small Rust script that is compiled and run by Cargo before compiling your actual crate. In my case, it looks like this:

// the shared crate contains my game logic
use shared::game::components::modules::MODULE_DATA;
use std::{env, error::Error, fs};
use tera::*;

// All my shaders reside in the 'src/shaders' directory
fn generate_shaders() -> std::result::Result<(), Box<dyn Error>> {
    let tera = Tera::new("src/shaders/*")?;

    let mut context = Context::new();
    // You can basically insert any data you want here, as long as it implements Serialize.
    // In my case this is the data for all types of unit modules.
    context.insert("module_data", &MODULE_DATA);

    let output_path = env::var("OUT_DIR")?;
    fs::create_dir_all(format!("{}/shaders/", output_path))?;

    // Also watch the directory itself, so added/removed files may trigger a rebuild
    println!("cargo:rerun-if-changed=src/shaders");
    for file in fs::read_dir("src/shaders")? {
        let file = file?.file_name();
        let file_name = file.to_str().unwrap();

        let result = tera.render(file_name, &context)?;
        fs::write(format!("{}/shaders/{}", output_path, file_name), result)?;
        println!("cargo:rerun-if-changed=src/shaders/{}", file_name);
    }

    Ok(())
}

fn main() {
    if let Err(err) = generate_shaders() {
        // panic here for a nicer error message, otherwise it will
        // be flattened to one line for some reason
        panic!("Unable to generate shaders\n{}", err);
    }
}

And to make Tera available in the script, we have to add this to our Cargo.toml:

# Shader preprocessing
# Game logic (use your own crate here)

This code will iterate over all files in the src/shaders directory, instantiate (render) the file as a Tera template and write it to wherever the OUT_DIR environment variable points to. Unfortunately we can’t write our generated templates back out into our src directory. This is mandated by convention as described in the Cargo docs. As a consequence we can’t easily inspect the generated source code though. We’ll fix this later.

Also note the generous amount of ? operators in this code. Because we’re doing this at compile-time, we can freely abort the build should something go wrong. We don’t have to try to fix the problem at runtime or to build a nice error message for the user. This output is for the developer and can be a bit crude. If you are using a similar system at runtime, you should probably build something more sophisticated though.

Using the preprocessor

In any shader in the src/shaders directory, we can now use Tera’s {% include "myfile.glsl" %} directive to include other glsl code. We can also access our game’s constants at compile time using the Tera Context. In my case I use {{ module_data | length }} to define the length of some of my uniform arrays. I’ll never have to update that number by hand again!

And there’s a lot more we can do, as we have the full power of Tera available. We can generate code conditionally, in loops, and even inherit code from other templates. Really, your imagination is the limit here!

So let’s get those shaders compiling, shall we…

Getting the shaders into Rust

Similar to how we preprocess the shaders at compile time, we can also take care of this step during compile time. Rust provides the include_str! macro, to include an external file as a &'static str. To make this even easier, I wrote a small wrapper macro:

macro_rules! include_shader {
    ($path:literal) => {
        include_str!(concat!(env!("OUT_DIR"), "/shaders/", $path))
    };
}

Now we can just write include_shader!("myshader.glsl") to get our shader source into our Rust program. No external file loading required!

As this will hard-code our shaders into our Rust executable, we don’t even have to ship them externally with the project. But this also means we have to recompile our Rust code every time our shader code changes. Have you noticed the cargo:rerun-if-changed= prints in the build script yet?

Printing these lines for every shader file won’t actually be visible to the developer, but notifies cargo that it needs to recompile, should the file at the provided path change. We also add such a print for the entire directory as this might notify cargo of newly added/removed files, but this behavior is platform-dependent.

Now that our shader source is in Rust, we can finally compile it:

use std::ffi::{CStr, CString};

pub fn create_whitespace_cstring_with_len(len: usize) -> CString {
    // allocate buffer of correct size
    let mut buffer: Vec<u8> = Vec::with_capacity(len + 1);
    // fill it with len spaces
    buffer.extend([b' '].iter().cycle().take(len));
    // convert buffer to CString
    unsafe { CString::from_vec_unchecked(buffer) }
}

fn shader_from_source(source: &CStr, kind: gl::types::GLenum) -> Result<gl::types::GLuint, String> {
    let id = unsafe { gl::CreateShader(kind) };
    unsafe {
        gl::ShaderSource(id, 1, &source.as_ptr(), std::ptr::null());
        gl::CompileShader(id);
    }

    let mut success: gl::types::GLint = 1;
    unsafe {
        gl::GetShaderiv(id, gl::COMPILE_STATUS, &mut success);
    }

    if success == 0 {
        let mut len: gl::types::GLint = 0;
        unsafe {
            gl::GetShaderiv(id, gl::INFO_LOG_LENGTH, &mut len);
        }

        let error = create_whitespace_cstring_with_len(len as usize);

        unsafe {
            gl::GetShaderInfoLog(
                id,
                len,
                std::ptr::null_mut(),
                error.as_ptr() as *mut gl::types::GLchar,
            );
        }

        // prefix every line of the shader source with its line number
        let shader_source = source
            .to_string_lossy()
            .lines()
            .enumerate()
            .map(|(i, line)| format!("{:03} {}", i + 1, line))
            .collect::<Vec<_>>();
        let shader_source = shader_source.join("\n");

        return Err(format!(
            "{}\n\n--- Shader source ---\n{}",
            error.to_string_lossy(),
            shader_source
        ));
    }

    Ok(id)
}

let vert_shader = shader_from_source(&CString::new(include_shader!("module.vert"))?, gl::VERTEX_SHADER)?;

You might notice that if we receive an error from the GLSL compiler, we append the shader source to the error message. The GLSL code we compile might now be very different from the files in our shaders directory, with all the included files and instantiated templates, so the line numbers in the OpenGL error message are very likely to be off, making them even more useless.

And as I mentioned earlier, our generated GLSL code is only available in the OUT_DIR, which can be hard to find.

For this reason, we split the GLSL source into lines, add line numbers to it and add it to the error message. So instead of trying to find our error here:

#version 330 core

uniform mat3 mvp;

layout (location = 0) in vec2 Position;
layout (location = 1) in vec3 UV;

out vec3 uv;

{% include "test.glsl" %}

void main()
{
    vec2 position = (mvp * vec3(Position, 1.0f)).xy;
    gl_Position = vec4(position, 0.5, 1.0);
    uv = UV;
}

We can clearly see what went wrong:

thread 'main' panicked at 'Unable to create unit renderer!
0(11) : error C0000: syntax error, unexpected reserved word "this" at token "this"

--- Shader source ---
001 #version 330 core
003 uniform mat3 mvp;
005 layout (location = 0) in vec2 Position;
006 layout (location = 1) in vec3 UV;
008 out vec3 uv;
010 // test.glsl here!
011 this is definitely invalid GLSL code.
013 It is here to prove a point, not to be useful.
014 // End of test.glsl
017 void main()
018 {
019     vec2 position = (mvp * vec3(Position, 1.0f)).xy;
020     gl_Position = vec4(position, 0.5, 1.0);
021     uv = UV;
022 }', client/src/gui/editor/
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
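
The line-numbering step itself has nothing to do with OpenGL, so it can be pulled out into a tiny helper and tested on its own. This is just a sketch, and the name number_lines is my own.

```rust
/// Prefixes every line of `source` with a zero-padded, 1-based line number,
/// in the same "{:03} {}" format used for the shader error message.
/// (Hypothetical helper for illustration.)
fn number_lines(source: &str) -> String {
    source
        .lines()
        .enumerate()
        .map(|(i, line)| format!("{:03} {}", i + 1, line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Note that empty lines keep their number (followed by a single space), so the numbering stays in sync with what the GLSL compiler reports.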


Combining Cargo’s build system and Tera to precompile GLSL shader code gives us a lot of flexibility when writing shaders. Code can be organized better, is easier to maintain, and is easier to integrate with data available in Rust, whilst not compromising runtime performance or complexity.

And if your requirements are different, it should be easy to modify this system to work at runtime instead. Simply don’t write your shaders back into the file system, but directly compile them after the tera.render call and maybe add more thorough error checking/handling. And maybe leave a comment showing off your code 😉.

Lastly I’d be very interested in how you write/organize your shaders. Do you use a similar system? Plain GLSL? What drawbacks did you find? (Pretty sure this will confuse some tooling, e.g. syntax highlighters). What would you improve? Has this post helped you out?

Leave a comment, I’d love to hear from you.

Understanding Rust slices

Recently I was passing ownership of a Rust array to a C function from a Box<[T]> and found that I actually didn’t know whether my code would lead to a memory leak.
This led me down a rabbit hole to understand how exactly slices work in Rust.

The problem

Copying a Rust array to C is pretty easy:

let my_vec: Vec<u8> = vec![1, 2, 3];
my_c_function(my_vec.as_ptr(), my_vec.len());

However, this will not pass ownership of the array to C! When the vec is dropped, it will drop the array, invalidating the pointer. We could just std::mem::forget the vec, however, Vectors may allocate more space than their len(). C interfaces often don’t expect this capacity, so to make sure excess capacity is dropped, we could use a Boxed slice:

let my_vec: Vec<u8> = vec![1, 2, 3];
let slice: Box<[u8]> = my_vec.into_boxed_slice();
my_c_function(slice.as_ptr(), slice.len());

(Note that the memory that is passed to C still needs to be freed by Rust. C and Rust use different allocators, so C’s free function will lead to undefined behavior! This is not today’s topic though.)
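
To sketch what such a hand-over and the later hand-back to Rust could look like, here is a small pair of helpers. The names into_raw_parts and from_raw_parts are assumptions of mine (they are free functions defined here, not library APIs). The actual C call is left out.

```rust
/// Gives up ownership of a boxed slice and returns the thin pointer and
/// length that a C API would typically expect.
/// (Hypothetical helper for illustration.)
fn into_raw_parts(slice: Box<[u8]>) -> (*mut u8, usize) {
    let len = slice.len();
    // Box::into_raw yields a *mut [u8]; the cast keeps only the data address.
    let ptr = Box::into_raw(slice) as *mut u8;
    (ptr, len)
}

/// Rebuilds the Box so that Rust's allocator frees the memory when it is dropped.
///
/// # Safety
/// `ptr` and `len` must come from `into_raw_parts` and must not be used
/// again afterwards.
unsafe fn from_raw_parts(ptr: *mut u8, len: usize) -> Box<[u8]> {
    unsafe { Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len)) }
}
```

The idea is that C hands the pointer and length back once it is done, and Rust then reconstructs and drops the Box instead of C calling free.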

But will leaking this box create a memory leak? After all, the Box is just a pointer to some data on the heap, and slices in Rust don’t just store a pointer to a sequence, they also store their length. One way this could be implemented can be seen below:

Figure: A possible memory layout for a Box<[T]>. It interprets a Box as a container storing a single pointer onto the heap. It points to a memory cell on the heap, containing both the pointer to the sequence and its length.

If the layout above is used, using Box::leak would leave the sequence pointer and length on the heap, causing a memory leak. In the context of interacting with C, memory layout and implementation details quickly become important, and we need to know specifically how a Boxed slice is laid out.

Before we talk about Boxed slices though, what even is a slice?

A dynamically-sized view into a contiguous sequence, [T]. Contiguous here means that elements are laid out so that every element is the same distance from its neighbors.

Conceptually, a slice is easy to grasp. Like the documentation states, it allows you to view into a sequence of memory. The power of slices is that you “view” into sequences of memory that do not belong to you, like parts of the contents of a Vec<T>. Slices are one of the best ways to pass around sequences of data that you might read from or write to, but do not need to own (e.g. you’re creating a copy of it to store somewhere else, etc.).

Whilst their concept is simple to understand, their associated syntax can be somewhat confusing. Especially when dealing with foreign function interfaces, where memory layout is suddenly important.

Slice Types

Or maybe types of slices? Anyways, the slice type you are probably most familiar with, is &[T]. So let’s take that type apart shall we!
The syntax &[T] suggests that this slice is an immutable reference to a [T]. Which is absolutely correct. However, what exactly is this [T] type anyways? The Rust docs don’t seem to define this type apart from the quoted paragraph earlier, mentioning [T] as a “contiguous sequence”. So let’s just call this a sequence for now. This does make a lot of sense though, since that’s what we want, a reference to a sequence of values of type T.
But how large is this sequence? If we want to stay safe when accessing it, we absolutely need to know this. So let’s try asking the compiler:

println!("{}", std::mem::size_of::<[u8]>());

Which results in:
error[E0277]: the size for values of type [u8] cannot be known at compilation time
Together with the helpful link to the book’s chapter on Dynamically sized types (DSTs). Whilst this chapter only talks about string slices (&str and str), the same applies to generic slices. A sequence [T] is a dynamically sized type, so we may only ever reference it. The slices chapter goes into more detail on how this works and reveals that the length of a DST is actually saved in its reference/pointer itself. This is a radical break from the way C/C++ handle pointers where a pointer is always just one address in memory, nothing more, nothing less.
However, we can clearly see this isn’t the case for Rust by comparing different kinds of pointers:

println!("{}", std::mem::size_of::<&u8>()); // 8
println!("{}", std::mem::size_of::<[u8]>()); // compiler error
println!("{}", std::mem::size_of::<&[u8]>()); // 16
println!("{}", std::mem::size_of::<&mut [u8]>()); // 16
println!("{}", std::mem::size_of::<*mut [u8]>()); // 16
println!("{}", std::mem::size_of::<*const [u8]>()); // 16
println!("{}", std::mem::size_of::< Box<[u8]> >()); // 16
println!("{}", std::mem::size_of::< Rc<[u8]> >()); // 16

A reference to a normal Sized type like u8 is 8 bytes on my x86-64 machine. However, all types of references/pointers to a [u8] (slices) are twice this big, at 16 bytes. They are made up of the pointer to the data, and its length as a usize. This includes references, Box, Rc, and curiously enough: the pointer types *mut and *const. This is often referred to as a “fat pointer”.
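
If you’d rather have this checked than printed, and without hard-coding 8 and 16, a sketch like the following compares against the size of a usize instead. The helper names is_fat and slice_pointers_are_fat are my own.

```rust
use std::mem::size_of;
use std::rc::Rc;

/// Returns true if `T` is exactly as large as two machine words,
/// i.e. a data pointer plus a length.
/// (Hypothetical helper for illustration.)
fn is_fat<T>() -> bool {
    size_of::<T>() == 2 * size_of::<usize>()
}

/// Checks all the slice pointer types from the listing above.
fn slice_pointers_are_fat() -> bool {
    is_fat::<&[u8]>()
        && is_fat::<&mut [u8]>()
        && is_fat::<*const [u8]>()
        && is_fat::<*mut [u8]>()
        && is_fat::<Box<[u8]>>()
        && is_fat::<Rc<[u8]>>()
}
```

On the targets I know of, slice_pointers_are_fat() returns true, whilst is_fat::<&u8>() returns false, as a reference to a Sized type is a single machine word.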

What’s noteworthy about this is that, for this reason, pointers to a [T] cannot be passed to a C function, as a pointer in C is always exactly one usize long. This is quite curious, as pointers to ordinary Sized types can easily be passed to the FFI without problems.
However, this actually answers our question of how a Box<[T]> is laid out in memory. A Box<X> is internally a pointer to X, meaning our Box<[T]> includes a *mut [T], which is a fat pointer made up of both the pointer to the sequence, as well as the length of the sequence. The length and pointer therefore both lie on the stack, inside the Box.

The actual memory layout of a Box<[T]>

And because Box::leak, as well as std::mem::forget, simply discard what lies on the stack (the fat pointer) without running the destructor, whilst leaving the heap alone, we really only leak the sequence, nothing else.

Great, now we know that we can safely transfer ownership of a Boxed slice to C without leaking memory! Success!

Actually, whilst you’re here, we can also indulge in some of the curious syntax created by this “fat pointer” implementation.

Slice syntax details

If we take another look at the slice documentation, we can see that the length of a slice is implemented for [T] as fn len(&self). For a reference, that makes a lot of sense, we can just pass in the reference, which knows its own length.

However, at the time of writing, there is no such implementation for the pointer types *mut [T] and *const [T]. So code like this fails to compile:

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", slice_pointer.len());

So how do we fix this? Well, it’s a pointer, so there isn’t much left to do other than dereferencing it. And because Rust will automagically create a reference for you when calling a member function, we can access len.

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", (*slice_pointer).len());

Ah, right, dereferencing a pointer is unsafe…

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", unsafe { (*slice_pointer).len() });

And finally, it compiles!

This code is pretty strange, we dereference a pointer, which accesses the underlying sequence, to in the end access a value that isn’t stored in that sequence at all, but in the pointer we came from. Let’s just be glad that the compiler can figure this all out for us!
And let’s also celebrate the fact that we actually know something the compiler doesn’t! The above call is not unsafe, as we never really have to dereference the pointer. The length is already stored in it.
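
As far as I’m aware, newer toolchains (Rust 1.79 and later) have since gained a safe len method directly on raw slice pointers, which confirms this hunch. The sketch below assumes such a toolchain. The wrapper raw_len is just a name I picked.

```rust
/// Reads the length of a raw slice pointer without dereferencing it.
/// Assumes Rust 1.79 or newer, where `len` is available on `*mut [T]`.
/// (Hypothetical helper for illustration.)
fn raw_len(ptr: *mut [u8]) -> usize {
    // No unsafe needed: the length is part of the fat pointer itself.
    ptr.len()
}
```

This even works for a pointer that doesn’t point to anything valid, e.g. one built with std::ptr::slice_from_raw_parts_mut from a null pointer and a length of 7, because the sequence itself is never touched.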

So, I hope you learned something today. Maybe you’ll never have to worry about this again, but still, isn’t it nice to have a look under the hood every once in a while? Small language design decisions like this are so interesting, they have so many implications regarding syntax, memory layout, FFI, etc. And they might even cause some confused developer to start a blog 😄️


Thanks to @lcnr and @bjorn3 for their time and expertise on the Rust Zulip stream.