August 2020 – CodeCrash

Recently I was passing ownership of a Rust array to a C function from a Box<[T]> and found that I actually didn’t know whether my code would lead to a memory leak.
This lead me down a rabbit hole to understand how exactly slices work in Rust.

The problem

Copying a Rust array to C is pretty easy:

let my_vec: Vec<u8> = vec![1, 2, 3];
my_c_function(my_vec.as_ptr(), my_vec.len());

However, this will not pass ownership of the array to C! When the vec is dropped, it will drop the array, invalidating the pointer. We could just std::mem::forget the vec, however, Vectors may allocate more space than their len(). C interfaces often don’t expect this capacity, so to make sure excess capacity is dropped, we could use a Boxed slice:

let my_vec: Vec<u8> = vec![1, 2, 3]; let slice: Box<[u8]> = my_vec.into_boxed_slice();
my_c_function(slice.as_ptr(), slice.len());
Box::leak(slice);

(Note that the memory that is passed to C still needs to be freed by Rust. C and Rust use different allocators, so C’s free function will lead to undefined behavior! This is not today’s topic tough.)

But will leaking this box create a memory leak? After all, the Box is just a pointer to some data on the heap, and slices in rust don’t just store a pointer to a sequence, they also store their length. One way this could be implemented can be seen in below:

Possible memory layout for a Box<[T]>, interpretes a Box as a container storing a single pointer onto the heap. It points to a memory cell on the heap, containing both the pointer to the sequence and it's length. — A possible memory layout for a `Box<[T]>`.

If the layout above is used, using Box::leak would leave the sequence pointer and length on the heap, causing a memory leak. In the context of interacting with C, memory layout and implementation details quickly become important, and we need to know specifically how a Boxed slice is layed out.

Before we talk about Boxed slices though, what even is a slice?

A dynamically-sized view into a contiguous sequence, [T]. Contiguous here means that elements are laid out so that every element is the same distance from its neighbors.
https://doc.rust-lang.org/std/primitive.slice.html

Conceptually, a slice is easy to grasp. Like the documentation states, it allows you to view into a sequence of memory. The power of slices is that you “view” into sequences of memories that do not belong to you, like parts of the contents of a Vec<T> ¹. Slices are one of the best ways to pass around sequences of data that you might read from or write to, but do not need to own (e.g. you’re creating a copy of it to store somewhere else, etc.).

Whilst their concept is simple to understand, their associated syntax can be somewhat confusing. Especially when dealing with foreign function interfaces, where memory layout is suddenly important.

Slice Types

Or maybe types of slices? Anyways, the slice type you are probably most familiar with, is &[T]. So let’s take that type apart shall we!
The syntax &[T] suggests that this slice is an immutable reference to a [T]. Which is absolutely correct. However, what exactly is this [T] type anyways? The rust docs don’t seem to define this type apart from the quoted paragraph earlier, mentioning [T] as a “contiguous sequence”. So let’s just call this a sequence for now². This does make a lot of sense though, since that’s what we want, a reference to a sequence of values of type T.
But how large is this sequence? If we want to stay safe when accessing it, we absolutely need to know this. So let’s try asking the compiler:

println!("{}", std::mem::size_of::<[u8]>());

Which results in:
error[E0277]: the size for values of type [u8] cannot be known at compilation time
Together with the helpful link to the books chapter on Dynamically sized types (DSTs). Whilst this chapter only talks about string slices (&str and str), the same applies to generic slices. A sequence [T] is a dynamically sized type, so we may only ever reference it. The slices chapter goes into more detail on how this works and reveals that the length of a DST is actually saved in it’s reference/pointer itself. This is a radical break from the way C/C++ handle pointers where a pointer is always just one address in memory, nothing more, nothing less.
However, we can clearly see this isn’t the case for Rust by comparing different kinds of pointers:

println!("{}", std::mem::size_of::<&u8>()); // 8
println!("{}", std::mem::size_of::<[u8]>()); // compiler error
println!("{}", std::mem::size_of::<&[u8]>()); // 16
println!("{}", std::mem::size_of::<&mut [u8]>()); // 16
println!("{}", std::mem::size_of::<*mut [u8]>()); // 16
println!("{}", std::mem::size_of::<*const [u8]>()); // 16
println!("{}", std::mem::size_of::< Box<[u8]> >()); // 16
println!("{}", std::mem::size_of::< Rc<[u8]> >()); // 16

A reference to a normal Sized type like u8 is 8 bytes on my x86-64 machine. However, all types of references/pointers to a [u8] (slices) are twice this big, at 16 bytes. They are made up of the pointer to the data, and it’s length as a usize. This includes references, Box, Rc, and curiously enough: the pointer types *mut and *const. This is often referred to as a “fat pointer”³.

What’s noteworthy about this is, that for this reason pointers to a [T] cannot be passed to a C function, as a pointer in C is always exactly one usize long. This is quite curious, as all other pointer types can easily be passed to the FFI without problems.
However, this actually answers our question of how a Box<[T]> is layed out in memory. A Box<X> is internally a pointer to X, meaning our Box<[T]> includes a *mut [X], which is a fat pointer made up of both the pointer to the sequence, as well as the length of the sequence. The length and pointer therefore both lie on the stack, inside the Box.

And because Box::leak, as well as std::mem::forget drop everything on the stack, whilst leaving the heap alone, we really only leak the sequence, nothing else.

Great, now we know than we can safely transfer ownership of a Boxed slice to C without leaking memory! Success!

Actually, whilst you’re here, we can also indulge in some of the curious syntax created by this “fat pointer” implementation.

Slice syntax details

If we take another look at the slice documentation, we can see that the length of a slice is implemented for [T] as fn len(&self). For a reference, that makes a lot of sense, we can just pass in the reference, which knows it’s own length.

However, there is currently⁴ no such implementation for the pointer types *mut [T] and *const [T]⁵. So if you run code like this:

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", slice_pointer.len());

So how do we fix this? Well, it’s a pointer, so there isn’t much left to do other than dereferencing it. And because Rust will automagically create a reference for you when calling a member function, we can access len.

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", (*slice_pointer).len());

Ah, right, dereferencing a pointer is unsafe…

let slice_pointer: *mut [u8] = &mut [1, 2, 3];
println!("{}", unsafe { (*slice_pointer).len() });

And finally, it compiles!

This code is pretty strange, we dereference a pointer, which accesses the underlying sequence, to in the end access a value that isn’t stored in that sequence at all, but in the pointer we came from. Let’s just be glad that the compiler can figure this all out for us!
And let’s also celebrate the fact that we actually know something the compiler doesn’t! The above call is not unsafe, as we never really have to dereference the pointer. The length is already stored in it.

So, I hope you learned something today. Maybe you’ll never have to worry about this again, but still, isn’t it nice to have a look under the hood every once in a while? Small language design decisions like this are so interesting, they have so many implications regarding syntax, memory layout, FFI, etc. And they might even cause some confused developer to start a blog 😄️

Acknowledgements

Thanks to @lcnr and @bjorn3 for their time and expertise on the Rust Zulip stream.

Month: August 2020

Understanding Rust slices

The problem

Slice Types

Slice syntax details

Acknowledgements