Impractical(-ish) Encoding: Intro & Futhark Runes

If you’re reading this, maybe you’re someone with a game development hobby, like me. Maybe you’re like me and you’ve participated in game jams, watched a respectable number of GDC talks, and spend most of your day wishing for an interesting problem to solve.

And maybe you’re none of those things. Either way, that’s who I am, and that’s how we got here. I think the problem I solve here is interesting: how could you make sharing data fun?

If you’re reading this, I hope you’re ready to learn some dubiously useful encoding methods, since that’s what I’ll be sharing in this blog series. For the code portions, I’ll be using Rust; you can find the reference crate on GitHub. That said, this is not intended to be a Rust-focused endeavor. I’ll try to keep things general and simple enough that you can understand the meat of each encoding method, and maybe implement it yourself in your own language if you’re so inclined.

So, if you’re looking for interesting ways to turn data into slightly more usable data and learn some things along the way, settle in! This series will be focusing on text-based encoding, perfect for sharing over messaging or microblogging applications. In this post, we’re going to be using some ancient symbols supported in Unicode to share arbitrary data in a simple but fun way.

Remember: Encoding is NOT encryption, and encoded values can be easily recovered from the encoded text. Use this encoding with some form of encryption for uses which require some sort of secrecy.

Futhark Encoding

Without further ado, I present to you an encoded message:

ᚲᚠᛖᛁᚷᛞᛒᛈᛊᛗᛁᚠ

For the uninitiated, these strange symbols are mostly from the elder futhark, an ancient writing system associated with magic and mysticism. For reasons we’ll get into in a moment, there are some non-futhark symbols mixed in to round out our alphabet.

It’s been said that being able to look at these symbols and form words from the syllables they represent was akin to magic before literacy was common. The hope with this encoding method is that it will seem just as magical to have a user share some quasi-mystical symbols and see results in whatever program they’re using.

You and I are going to be in on the illusion, though. At its core, there’s not much separating the encoded string above from this one, in a much plainer JSON format:

{"comments": "Hello", "code": 42}

In fact, that is the data that’s been encoded in the runes above. Let’s see how we get there!

Before we start: Our data structure

For this demonstration, we’ll use a simple data structure with two fields: comments and code. We’ll give it a suitable name and define it:

struct TestStruct {
    comments: String,
    code: u32, // A 32-bit positive integer.
}

When we're talking about encoding, it's worth considering what the computer thinks of this structure. For that, let’s strip out a few words it stops caring about very quickly in its process:

struct _a {
    _b: String,
    _c: u32,
}

All of these names are useful for us, the readers of this code, but in Rust, as in most languages, the computer stops caring about them at some point in compilation. When you use TestStruct in your code, all it cares about is the data. All the data here is a String (some series of characters) and a number (here, a 32-bit unsigned integer).

First step: Bytes and Such

Digging a little deeper, the data that the computer cares about most often comes in the form of bytes: 8-bit chunks of information, gathered together to form bigger, meaningful chunks of information. In practical terms, that means that everything a computer sees is composed of values between 0 and 255 (2, taken to the 8th power, minus one).

For example, here’s what “Hello” looks like (sometimes) in byte form:

[72, 101, 108, 108, 111]

And what the number 42 looks like:

[42]

Okay, that last one is a little underwhelming. There are some more complicated things to do with “byte-size” and “endian” and “two’s complement” and “IEEE” when we’re talking about computers looking at numbers, but we’re going to firmly ignore those for now. For our purposes, it’s enough to consider that if you had a bigger number, you would use more bytes. For example, 8675309 would be:

[132, 95, 237]
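If you want to see these bytes for yourself, here’s a quick Rust sketch. (The number’s leading zero byte is trimmed by hand, just to match the compact form shown above; a u32 normally occupies a full 4 bytes.)

```rust
fn main() {
    // A string's bytes are directly accessible in Rust.
    assert_eq!("Hello".as_bytes(), &[72, 101, 108, 108, 111]);

    // A u32 always occupies 4 bytes in big-endian form;
    // trimming the leading zero byte gives the compact form above.
    let bytes = 8675309u32.to_be_bytes(); // [0, 132, 95, 237]
    assert_eq!(&bytes[1..], &[132, 95, 237]);
}
```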

If we were to naively join “Hello” and 8675309’s bytes into one, big structure, it might look like this:

[72, 101, 108, 108, 111, 132, 95, 237]

Easy, right? Just stick the two sets of bytes together and call it a day. Except, what if we were to create a new string, “Hello„_í”? Let’s see:

[72, 101, 108, 108, 111, 132, 95, 237]

Oh, oh dear. It’s the same. This could be a problem, but thankfully there’s also a simple solution: just remember how long the string is. Here’s “Hello” and 8675309 together, once more:

[5, 72, 101, 108, 108, 111, 132, 95, 237]
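As a sketch (not the format we’ll actually end up using), the naive length-prefixed join could look like this. naive_encode is a hypothetical helper, and it assumes the string is shorter than 256 bytes so the length fits in one byte:

```rust
// Hypothetical sketch of the naive length-prefixed join described above.
// Assumes text.len() < 256 so the length fits in a single byte.
fn naive_encode(text: &str, number: u32) -> Vec<u8> {
    let mut out = Vec::new();
    // Length prefix, so the string's end is unambiguous.
    out.push(text.len() as u8);
    out.extend_from_slice(text.as_bytes());
    // Big-endian bytes of the number, with leading zero bytes trimmed.
    out.extend(number.to_be_bytes().into_iter().skip_while(|&b| b == 0));
    out
}
```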

This is a big win! We know how to translate a data structure in our code into an array of bytes. This is the first thing we need on our journey to our mystical-looking encoding. However, there are many more things to consider when distilling data structures down to bytes, and this naive, simple approach would quickly fail us in more complex situations. For that reason, we’re going to lean on an existing binary format: postcard. If you’re not using Rust, you can grab a different binary format; there’s an ample supply of them in any major language.

Note: You could use a non-binary format to distill your data structure down, but binary formats are much more compact. That’s important for the steps we have later. For comparison, the same data structure converted to JSON and then to our runes would look like this: ᚪᚡᚪᚱᛖᛒᚩᛈᛈᛃᚥᛁᚷᛟᛒᛉᛗᛗᚺᛚᚨᛒᚠᚾᚲᚨᚥᚡᛞᛟᚾᚱᛇᛒᚺᚷᛞᛟᛒᛇᚲᛗᚺᛚᚨᚤᚺᚷᚩᚨ. That’s more than four times as long as the original.

In Rust, we can easily add this functionality to our structure with a couple derived traits:

#[derive(Serialize, Deserialize)]
struct TestStruct {
    comments: String,
    code: u32,
}

That, and some associated functions to call, is all it will take for us to turn our data into bytes and back again! We’re not to the runes yet, though. For that, let’s reach into our toolbag for an already existing encoding method…
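One housekeeping note: those derives come from the serde crate, so following along in Rust means adding serde and postcard as dependencies. A Cargo.toml fragment might look something like this (versions are illustrative):

```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
postcard = { version = "1", features = ["alloc"] }
```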

Second step: Getting the “Points”

We have our bytes, but we don’t have 256 runes. If we stretch our futhark a little bit, we can get a nice round 32 runes for our encoding. All we have to do is repackage our data into smaller numbers without losing any information. We do this by cutting up the bits of our bytes and rearranging them into smaller packages.

From a set of 8-bit bytes, we’re going to construct a set of 5-bit (32 is 2 to the 5th power) “points”. Instead of 256 (0-255) different values, we’re going to use just 32 (0-31). This is necessarily going to increase the number of values we’re handling. Our 5 bytes for “Hello” ([72, 101, 108, 108, 111]) will become 8 new values:

[8, 10, 25, 24, 6, 22, 29, 13]

Notice that no value is over 31. Just like an 8-bit value will be between 0 and 255, a 5-bit value will be between 0 and 31. If we go back to our original structure, with comments of “Hello” and code of 42, distill that to bytes as described above, and then further construct our points for Base32, we get:

[5, 0, 18, 10, 6, 22, 17, 13, 15, 19, 10, 0]

Notice that we don’t see any of the points from “Hello” in this array. That’s because of the string length (5) we put at the front. We’re shifting some bits around between values and they’ve become obscured by that movement. The bits are still there, though, and it’ll be easy to get them back.

We’ve also turned 56 bits (7 8-bit values) into 60 bits (12 5-bit values). The extra bits get discarded when we later decode these points.
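The point count is just a ceiling division: n bytes hold 8n bits, and we need one point per 5 bits. A one-liner sketch (point_count is a hypothetical helper, not part of the crate):

```rust
// ceil(8 * byte_count / 5): one 5-bit point per 5 bits, rounding up.
fn point_count(byte_count: usize) -> usize {
    (byte_count * 8 + 4) / 5
}
```

For our examples: 5 bytes of “Hello” become 8 points, and the full 7-byte structure becomes 12 points.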

The code to do this is a bit intimidating, but I’ll break it down for you here:

pub fn bytes_to_points(bytes: &[u8]) -> Vec<u8> {
    let mut results = Vec::new();
    let mut bits: u16 = 0;
    let mut offset = 0;
    for byte in bytes {
        bits |= (*byte as u16) << offset;
        offset += 8;
        while offset >= 5 {
            results.push((bits & 0x1f) as u8);
            bits >>= 5;
            offset -= 5;
        }
    }
    if offset != 0 {
        results.push(bits as u8);
    }
    results
}

All right, ready? Let's start from the top!

pub fn bytes_to_points(bytes: &[u8]) -> Vec<u8>

We’re defining a function which takes a slice (a view into an array) and returns a Vec (a growable array). The items in both are u8, which is the smallest unsigned integer type Rust knows about: a byte. The result is still a set of values between 0 and 31, but we have no smaller way to represent them in Rust. Other languages may be more relaxed about number types, but Rust requires us to decide how much space we want to use for numbers all the time.

let mut results = Vec::new();

We’re going to be constructing this set of 5-bit “points”, one “point” at a time, so let’s get ready to store our results.

let mut bits: u16 = 0;
let mut offset = 0;

We’re going to be doing some bit-level math. We create a register where we can remember the bits we have seen so far, bits, and a counter so we know how many more bits we need, offset.

for byte in bytes {
    bits |= (*byte as u16) << offset;
    offset += 8;
    …
}

We’re looping through our bytes, OR-ing each byte into our bits register at the current offset, then increasing the offset by 8 (the number of bits we just added).

To understand this step, it’s useful to imagine bits as an array in itself! To start, it looks like this, where the offset remembers the length of the bit array:

Bits (length 0): []

Empty, boring, waiting to be filled. In the first section of this loop, we get our first byte, 5 ([0, 0, 0, 0, 0, 1, 0, 1]) and add it to the left side:

Bits (length 8): [0, 0, 0, 0, 0, 1, 0, 1]

Then comes our next section:

while offset >= 5 {
    results.push((bits & 0x1f) as u8);
    bits >>= 5;
    offset -= 5;
}

While we still have 5 bits or more, we pop the last 5 bits off into a new “point”! As arrays again, we go from this:

Results: [] Bits (length of 8): [0, 0, 0, 0, 0, 1, 0, 1]

To this:

Results: [5] Bits (length of 3): [0, 0, 0]

The last 5 bits become our first “point”, and we do it over again. Next loop, we progress like so:

Add “H” (72) to the left:
Bits (length of 11): [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Remove 5 bits from the right and make it a point:
Results: [5, 0]
Bits (length of 6): [0, 1, 0, 0, 1, 0]
Remove another 5 from the right and make it a point:
Results: [5, 0, 18]
Bits (length of 1): [0]

Just like that, we have our first 3 points: [5, 0, 18]. After we’ve looped through our bytes, we might have some bits left over, as we do above. If we were done looping here, we’d move on to this final part:

if offset != 0 {
    results.push(bits as u8);
}
results

We just make sure we don’t lose any bits and push the last of them to the results, and then return our results. That’s really all there is to it. We are one step closer to runic mastery!

Third step: Touching Home Base(32)

We have our 12 points:

[5, 0, 18, 10, 6, 22, 17, 13, 15, 19, 10, 0]

Aren’t they nice? But where are the runes? Well, to get there, we’re going to leverage an existing encoding scheme: Base32. In simple terms, this involves doing much of what we’ve just handled (turning bytes into 5-bit points) and then translating those points into some easily-readable set of characters.

In one Base32 format, you’ll use the capital letters A through Z for 0-25, and then numbers 1-6 for 26-31. For our 12 points, that looks like this:

FASKGWRNPTKA

Faskgwarniptaka! It certainly sounds like a magic word! Let’s take a look at the simple code that does this for us:

fn simple_generate_runes_ascii(bytes: &[u8]) -> String {
    let alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ123456";
    let points = bytes_to_points(bytes);
    points
        .iter()
        .map(|point| alphabet.bytes().nth(*point as usize).unwrap() as char)
        .collect::<String>()
}

Taking this again, step by step…

fn simple_generate_runes_ascii(bytes: &[u8]) -> String

We’re defining a function that takes our already-generated array of bytes (not points yet) and returns a plain old String.

let alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ123456";

We’re going to define a reference string with all the characters we’ll use for our encoding, in the order of their value. A is first, for 0, and 6 is last, for 31.

let points = bytes_to_points(bytes);

We take our 8-bit bytes and repackage them into 5-bit points.

points
    .iter()
    .map(…)
    .collect::<String>()

We’re taking each point, mapping them to a new value, and then collecting them into a single String.

.map(|point| alphabet.bytes().nth(*point as usize).unwrap() as char)

What are we mapping each point to? Well, we take the bytes from our alphabet, grab the byte corresponding to our point, and then recast it as a character so it can go into our string! We’re almost to our mystical mastery, but we should address the elephant in the room first.
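For reference, the whole ASCII pipeline can be checked end to end with a self-contained sketch. Both functions are repeated from above so this compiles on its own, with the alphabet lookup written as a direct byte index (an equivalent form of the .nth() lookup):

```rust
// Repack 8-bit bytes into 5-bit points (same function as earlier).
pub fn bytes_to_points(bytes: &[u8]) -> Vec<u8> {
    let mut results = Vec::new();
    let mut bits: u16 = 0;
    let mut offset = 0;
    for byte in bytes {
        bits |= (*byte as u16) << offset;
        offset += 8;
        while offset >= 5 {
            results.push((bits & 0x1f) as u8);
            bits >>= 5;
            offset -= 5;
        }
    }
    if offset != 0 {
        results.push(bits as u8);
    }
    results
}

fn simple_generate_runes_ascii(bytes: &[u8]) -> String {
    let alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ123456";
    let points = bytes_to_points(bytes);
    points
        .iter()
        // Direct byte indexing, equivalent to .bytes().nth(...) above.
        .map(|point| alphabet.as_bytes()[*point as usize] as char)
        .collect::<String>()
}

fn main() {
    // The serialized bytes for our struct: length-prefixed "Hello", then 42.
    let bytes = [5, 72, 101, 108, 108, 111, 42];
    println!("{}", simple_generate_runes_ascii(&bytes)); // FASKGWRNPTKA
}
```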

Dark Magic: Impracticalities of Unicode for Encoding

We have our Base32 encoding code and it should be very easy to just swap in the elder futhark and start cooking up spells, right? Well, the elder futhark contains only 24 distinct runes. To get to 32, we can just add a few other runes from other alphabets, and that’s easy enough. Doing so, we end up with an “alphabet” that looks something like this:

ᚠᚢᚦᚨᚱᚲᚷᚹᚺᚾᛁᛃᛇᛈᛉᛊᛏᛒᛖᛗᛚᛜᛞᛟᚡᚣᚤᚥᚧᚩᚪᚫ

Beneath these runes lies a danger, though. Have you stopped to wonder how big this alphabet really is, in bytes? Paste it into any byte counter and see: 96 bytes? Where did that come from?

For our simple, alphanumeric “alphabet”, it’s easy to know what you’re using! UTF-8, the most common encoding used for text, supports English letters and Arabic numerals in just one byte each. However, to support more than just the English-standard character set, UTF-8 starts using more and more bytes. As many as 4 bytes are needed to represent a single Unicode code point, and don’t even get me started on grapheme clusters.
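You can verify those widths with nothing but the standard library; str::len in Rust counts bytes, not characters:

```rust
fn main() {
    assert_eq!("A".len(), 1);           // ASCII letter: one byte in UTF-8
    assert_eq!("ᚠ".len(), 3);           // A runic code point: three bytes
    assert_eq!("ᚠ".chars().count(), 1); // ...but still a single character
}
```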

For Futhark Encoding, we’re using some rarely used Unicode code points (all of those ancient runes). And so, our 12 runes become 36 bytes. If you’ve lost track, here’s the score:

We packed our data into 7 bytes. We turned it into 12 points, which we could have encoded with 12 bytes in ASCII. Instead, we converted those Base32 points into 12 Unicode characters, totaling 36 bytes.

That’s over four times the original size! How impractical.

In terms of our code, we need to handle all those extra bytes. We’ll use the unicode_segmentation crate, which gives us the graphemes method. This breaks the text apart by grapheme instead of by byte. We just slot it in in place of the old mapping, like so:

.map(|point| alphabet.graphemes(true).nth(*point as usize).unwrap())

We’ll also add the alphabet as a parameter to generate_runes, and pass in our futhark “alphabet”. We’re almost there, so let’s put it all together!

Final step: Putting it all together

So, we have our struct, a strategy for turning that struct into bytes (postcard), a method for turning those bytes into points and those points into runes. What’s left? Let’s stick it all together:

pub const FUTHARK: &str = "ᚠᚢᚦᚨᚱᚲᚷᚹᚺᚾᛁᛃᛇᛈᛉᛊᛏᛒᛖᛗᛚᛜᛞᛟᚡᚣᚤᚥᚧᚩᚪᚫᛌ";

pub fn create_runes<T: Serialize>(t: &T, alphabet: &str) -> String {
    let data = postcard::to_allocvec(t).unwrap();
    generate_runes(data.as_slice(), alphabet)
}

Pretty straightforward: we serialize the value to bytes with postcard, then hand those bytes and our chosen alphabet to generate_runes.

Conclusion: Decoding and why though?

We haven’t covered decoding the runes and turning them back into usable data. You can refer to the code on GitHub for how to do that, but you basically do the encoding in reverse: look up each rune’s point in the alphabet, repack the 5-bit points into 8-bit bytes, and deserialize those bytes with postcard.
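As a taste, here’s a sketch of the bit-repacking half of that reversal, mirroring bytes_to_points from earlier (the actual crate’s implementation may differ in detail):

```rust
// Sketch of the decoding direction: repack 5-bit points into 8-bit bytes.
// This mirrors bytes_to_points, run in reverse.
fn points_to_bytes(points: &[u8]) -> Vec<u8> {
    let mut results = Vec::new();
    let mut bits: u16 = 0;
    let mut offset = 0;
    for point in points {
        bits |= (*point as u16) << offset;
        offset += 5;
        while offset >= 8 {
            results.push((bits & 0xff) as u8);
            bits >>= 8;
            offset -= 8;
        }
    }
    // Any leftover bits (fewer than 8) are padding from encoding; discard them.
    results
}
```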

The better question is: why bother with all of this? What’s the point of an encoding method that inflates the data to several times its original size?

Well, it’s fun.

JSON is great for passing data in text form reliably, but it’s also boring. It can play into the aesthetic of a creative work to have users share runes instead of logging into a server that exchanges the data for them. I’ve used this in two game jams.

This style of correspondence gameplay is what started me down this path of impractical but stylish encoding methods. Hopefully this post inspires others to make sharing a fun part of their game or other creative work! Stay tuned for the next encoding method, where we’ll learn what makes this tick:

┍╼───━┐
╽ C+c ╽
┝┯┰┱┭┲┪
┖┻┺┸┶┵┘