Id30 – magnushoff.com

Id30 is an encoding scheme for identifiers I have designed for use in URLs, and implemented in Rust. It is also a good fit for single-use keys meant to be copied to another device, such as login codes sent via SMS or displayed on a TV screen. Id30 looks like the following: bpv3uq, zvaec2, rfmbyz, jwygvk or even 000000 and zzzzzz. Example URLs could be:

https://example.com/item/bpv3uq or
https://example.com/watch?v=zvaec2

While the encoding is simply base 32 with a specific alphabet, decoding additionally handles some confusable characters as well as case insensitivity. The short IDs along with the leniency in decoding makes Id30 well-suited for reading, writing and transmitting verbally. This is an apparent niche use-case that I found worth designing for when I discovered I had users struggling to verbally transmit URLs I had given them containing UUIDs, like https://example.com/item/{4d5edfae-a082-4b8f-a278-0fb1d43d504b} ..! It is easy for the technically savvy to know to refuse typing such strings by hand, but other users aren't necessarily as sophisticated. And it's not always easy to copy and paste a URL across all sorts of devices. In addition to computers, mobile phones and tablets, there are set-top boxes, smart TVs and other smart devices that may need to display or input a URL, or even transmit one by speech recognition or synthesized speech.

To keep the identifiers short, the key space is limited to 30 bits. A base 32 encoding was chosen to allow case insensitivity and correction of confusables, and to work well with binary numbers. Each character then encodes five bits (since 2⁵ = 32), giving six characters for 30 bits. 30 bits fit well within a 32 bit integer, wasting only two bits.

Is a 30 bit key space practical? That's just over 10⁹ identifiers. For comparison, YouTube's count of videos passed this threshold only in 2019 and is now up to four times as much (source). There's the old adage that you should dress for the job you want, not the job you have. I think a lot of good could come of designing web services for the scale you have, or are realistically going to have, instead of the scale you're dreaming of.

Id30 looks its best when used for IDs that are randomly generated, otherwise you'll end up with runs of very similar-looking IDs, like 00014a, 00014b, 00014c, … and so on. There are other reasons for using randomly generated IDs as well: user-facing sequential IDs can leak business sensitive information, such as count of users or other items in the database, and they can exacerbate security breaches. Id30 leans into this and assumes that the IDs are randomly generated. How does that work out with a 30 bit key space? With random generation, a collision could occur at any time, so verifying that a new ID is unused is a prudent step. From the birthday paradox, we learn that the probability of a collision having ever occured reaches 50% as the count of IDs reaches the square root of the key space size: √2³⁰ = 2¹⁵ = 32.768. So until reaching this number of IDs, the likelihood of having had a collision stays below 50%. That's a good anchor point, but what would be more interesting is the function that gives you the expected number of attempts for successfully generating a new ID given the count of IDs already in use. Maybe I'll take the time to figure out that function one day.

Another idea that's been floated to me is to rely on encryption to conceal sequential IDs. The gist of that idea is to use sequential IDs in the database, which is typically helpful for database performance, and symmetric encryption (such as AES) to transform the IDs when exposing them to the user, decrypting them when going the other way. This has some good properties, including that it is stateless. However, it unfortunately does not combine well with Id30, because the ID size in this scheme must equal (or be an integral multiple of) the block size of the cipher. Something to look into another time, I suppose.

Encoding and decoding

Encoding is simple. Given a 30 bit integer, split it into six five-bit chunks and encode each chunk individually by table-lookup:

Given number (base 10)	378038020
In base 2	010110100010000110011100000100
Five-bit chunks (base 2)	01011	01000	10000	11001	11000	00100
Five-bit chunks (base 10)	11	8	16	25	24	4
Five-bit chunks, encoded	b	8	g	t	r	4
Id30	✨ b8gtr4 ✨

This encoding scheme allows fast implementations and gives strings that sort in the same sequence as the integer representation.

Decoding is more involved to allow for alternative encodings, including case insensitivity, and for input validation. Let's first go through the simple case of decoding a valid Id30 in canonical encoding, which is just the above encoding process in reverse:

Given Id30	b8gtr4
Each character	b	8	g	t	r	4
Decoded by lookup table	11	8	16	25	24	4
In base 2	01011	01000	10000	11001	11000	00100
Composed	010110100010000110011100000100
As base 10	378038020

Decoding should detect whether the given ID is canonical or not. We implement this by adding an additional set bit, 0x20 = 32, for all alternative characters in the decoding table. The alternative characters are:

All upper-case letters
o/O as alternatives to 0, for visual similarity
i/I and l/L as alternatives to 1, for visual similarity
s/S as alternatives to f/F, for verbal similarity

Let's say we were given an upper-case version of the same ID as above:

Given Id30	B8GTR4
Each character	B	8	G	T	R	4
Decoded by lookup table	43	8	48	57	56	4
In base 2	101011	001000	110000	111001	111000	000100

Now, let is_canonical be true if, and only if, all the 0x20-bits are zero. We can find is_canonical by simple bit-testing, maybe after combining all the values with binary OR. is_canonical should be used for redirecting users to canonical versions of your URLs, otherwise you will end up serving the same resource from multiple URLs, and that's both frowned upon by search engines and bad for caching. Note that the redirection does not require any database lookup or permissions checking. Since it is a syntactic fix only, it doesn't matter if the target URL is present, absent, behind login or whatever else: first redirect to the canonical URL, then let that URL handler deal with the complexities.

For input validation, first verify that the length of the input is six bytes. If it is, we still have to verify that all the bytes are valid code bytes, ie that they are letters or numbers. A simple implementation is to use a lookup table of size 256 to account for all possible input bytes, fill in all the canonical and alternative encodings, and fill the rest with 0x40 = 64, a hitherto unused bit. If any of the decoded values have this bit set, we reject the input as invalid.

The encoding and decoding tables are included as JSON at the bottom of this post.

Q&A

How does this compare to Sqids?

Sqids is another ID encoding scheme which also yields short strings, but there are numerous small differences beyond that, due to different design decisions.

	Id30	Sqids
Case sensitivity	Case insensitive	Case sensitive
Confusable character handling	Yes	No
Sequences look…	sequential	random¹
Preserves sort order	Yes	No
Built-in word filtering	No	Yes
Key space size	30 bits ≈ 10⁹ IDs	Arbitrary
Encoded ID size	6 chars	Variable
Decoded ID size	4 bytes = 32 bits	Arbitrary
Alphabet	Fixed	Configurable
Complexity	Simpler	More complex

¹ While sequential IDs look random in Sqids, the documentation calls out that this should not be relied on to conceal the sequence: they can still be decoded to reveal the underlying numbers.

But what about rude IDs in Id30?

All six-char sequences of the encoding alphabet are valid Id30s, and some of these will be considered rude or offensive by some subset of the population. The inherent difficulty in this is that what might be considered inappropriate varies by population and time. Handling this is out of scope for Id30, so to keep your IDs safe in this sense, you have to make use of a library designed for that, such as rustrict for Rust, and filter IDs at generation time.

For reference, these are the encoding and decoding tables as JSON:

ENCODING, as a string: "0123456789abcdefghjkmnpqrtuvwxyz"

ENCODING, to ASCII values: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 103, 104, 106, 107, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122]

DECODING: [64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 64, 64, 64, 64, 64, 64, 64, 42, 43, 44, 45, 46, 47, 48, 49, 33, 50, 51, 33, 52, 53, 32, 54, 55, 56, 47, 57, 58, 59, 60, 61, 62, 63, 64, 64, 64, 64, 64, 64, 10, 11, 12, 13, 14, 15, 16, 17, 33, 18, 19, 33, 20, 21, 32, 22, 23, 24, 47, 25, 26, 27, 28, 29, 30, 31, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64]

Thanks to Haakon Nilsen for discussing the drafts to this blog post with me!

Magnus Hovland Hoff, 2024

Encoding table
Value	Encodings
0	`0`	`o`, `O`
1	`1`	`i`, `I`, `l`, `L`
2	`2`
3	`3`
4	`4`
5	`5`
6	`6`
7	`7`
8	`8`
9	`9`
10	`a`	`A`
11	`b`	`B`
12	`c`	`C`
13	`d`	`D`
14	`e`	`E`
15	`f`	`F`, `s`, `S`
16	`g`	`G`
17	`h`	`H`
18	`j`	`J`
19	`k`	`K`
20	`m`	`M`
21	`n`	`N`
22	`p`	`P`
23	`q`	`Q`
24	`r`	`R`
25	`t`	`T`
26	`u`	`U`
27	`v`	`V`
28	`w`	`W`
29	`x`	`X`
30	`y`	`Y`
31	`z`	`Z`