Id30 is an encoding scheme for identifiers I have designed for use in URLs, and
implemented in Rust. It is also a good fit for single-use keys
meant to be copied to another device, such as login codes sent via SMS or
displayed on a TV screen. Id30 looks like the following: bpv3uq
, zvaec2
,
rfmbyz
, jwygvk
or even 000000
and zzzzzz
. Example URLs could be:
- https://example.com/item/bpv3uq or
- https://example.com/watch?v=zvaec2
While the encoding is simply base 32 with a specific alphabet, decoding additionally handles some confusable characters as well as case insensitivity. The short IDs along with the leniency in decoding makes Id30 well-suited for reading, writing and transmitting verbally. This is an apparent niche use-case that I found worth designing for when I discovered I had users struggling to verbally transmit URLs I had given them containing UUIDs, like https://example.com/item/{4d5edfae-a082-4b8f-a278-0fb1d43d504b} ..! It is easy for the technically savvy to know to refuse typing such strings by hand, but other users aren't necessarily as sophisticated. And it's not always easy to copy and paste a URL across all sorts of devices. In addition to computers, mobile phones and tablets, there are set-top boxes, smart TVs and other smart devices that may need to display or input a URL, or even transmit one by speech recognition or synthesized speech.
To keep the identifiers short, the key space is limited to 30 bits. A base 32 encoding was chosen to allow case insensitivity and correction of confusables, and to work well with binary numbers. Each character then encodes five bits (since 2⁵ = 32), giving six characters for 30 bits. 30 bits fit well within a 32 bit integer, wasting only two bits.
Is a 30 bit key space practical? That's just over 10⁹ identifiers. For comparison, YouTube's count of videos passed this threshold only in 2019 and is now up to four times as much (source). There's the old adage that you should dress for the job you want, not the job you have. I think a lot of good could come of designing web services for the scale you have, or are realistically going to have, instead of the scale you're dreaming of.
Id30 looks its best when used for IDs that are randomly generated, otherwise
you'll end up with runs of very similar-looking IDs, like 00014a
, 00014b
,
00014c
, … and so on. There are other reasons for using randomly generated IDs
as well: user-facing sequential IDs can leak business sensitive information,
such as count of users or other items in the database, and they can exacerbate
security breaches. Id30 leans into this and assumes that the IDs are randomly
generated. How does that work out with a 30 bit key space? With random
generation, a collision could occur at any time, so verifying that a new ID is
unused is a prudent step. From the birthday paradox, we learn that the
probability of a collision having ever occured reaches 50% as the count of IDs
reaches the square root of the key space size: √230 = 215
= 32.768. So until
reaching this number of IDs, the likelihood of having had a collision stays
below 50%. That's a good anchor point, but what would be more interesting is the
function that gives you the expected number of attempts for successfully
generating a new ID given the count of IDs already in use. Maybe I'll take the
time to figure out that function one day.
Another idea that's been floated to me is to rely on encryption to conceal sequential IDs. The gist of that idea is to use sequential IDs in the database, which is typically helpful for database performance, and symmetric encryption (such as AES) to transform the IDs when exposing them to the user, decrypting them when going the other way. This has some good properties, including that it is stateless. However, it unfortunately does not combine well with Id30, because the ID size in this scheme must equal (or be an integral multiple of) the block size of the cipher. Something to look into another time, I suppose.
Encoding and decoding
Encoding is simple. Given a 30 bit integer, split it into six five-bit chunks and encode each chunk individually by table-lookup:
Given number (base 10) | 378038020 | |||||
---|---|---|---|---|---|---|
In base 2 | 010110100010000110011100000100 | |||||
Five-bit chunks (base 2) | 01011 | 01000 | 10000 | 11001 | 11000 | 00100 |
Five-bit chunks (base 10) | 11 | 8 | 16 | 25 | 24 | 4 |
Five-bit chunks, encoded | b | 8 | g | t | r | 4 |
Id30 | ✨ b8gtr4 ✨ |
This encoding scheme allows fast implementations and gives strings that sort in the same sequence as the integer representation.
Decoding is more involved to allow for alternative encodings, including case insensitivity, and for input validation. Let's first go through the simple case of decoding a valid Id30 in canonical encoding, which is just the above encoding process in reverse:
Given Id30 | b8gtr4 | |||||
---|---|---|---|---|---|---|
Each character | b | 8 | g | t | r | 4 |
Decoded by lookup table | 11 | 8 | 16 | 25 | 24 | 4 |
In base 2 | 01011 | 01000 | 10000 | 11001 | 11000 | 00100 |
Composed | 010110100010000110011100000100 | |||||
As base 10 | 378038020 |
Decoding should detect whether the given ID is canonical or not. We implement
this by adding an additional set bit, 0x20 = 32
, for all alternative
characters in the decoding table. The alternative characters are:
- All upper-case letters
o
/O
as alternatives to0
, for visual similarityi
/I
andl
/L
as alternatives to1
, for visual similaritys
/S
as alternatives tof
/F
, for verbal similarity
Let's say we were given an upper-case version of the same ID as above:
Given Id30 | B8GTR4 | |||||
---|---|---|---|---|---|---|
Each character | B | 8 | G | T | R | 4 |
Decoded by lookup table | 43 | 8 | 48 | 57 | 56 | 4 |
In base 2 | 101011 | 001000 | 110000 | 111001 | 111000 | 000100 |
Now, let is_canonical
be true
if, and only if, all the 0x20
-bits are zero.
We can find is_canonical
by simple bit-testing, maybe after combining all the
values with binary OR. is_canonical
should be used for redirecting users to
canonical versions of your URLs, otherwise you will end up serving the same
resource from multiple URLs, and that's both frowned upon by search engines and
bad for caching. Note that the redirection does not require any database lookup
or permissions checking. Since it is a syntactic fix only, it doesn't matter if
the target URL is present, absent, behind login or whatever else: first redirect
to the canonical URL, then let that URL handler deal with the complexities.
For input validation, first verify that the length of the input is six bytes.
If it is, we still have to verify that all the bytes are valid code bytes, ie
that they are letters or numbers. A simple implementation is to use a lookup
table of size 256 to account for all possible input bytes, fill in all the
canonical and alternative encodings, and fill the rest with 0x40 = 64
, a
hitherto unused bit. If any of the decoded values have this bit set, we reject
the input as invalid.
The encoding and decoding tables are included as JSON at the bottom of this post.
Q&A
How does this compare to Sqids?
Sqids is another ID encoding scheme which also yields short strings, but there are numerous small differences beyond that, due to different design decisions.
Id30 | Sqids | |
---|---|---|
Case sensitivity | Case insensitive | Case sensitive |
Confusable character handling | Yes | No |
Sequences look… | sequential | random¹ |
Preserves sort order | Yes | No |
Built-in word filtering | No | Yes |
Key space size | 30 bits ≈ 10⁹ IDs | Arbitrary |
Encoded ID size | 6 chars | Variable |
Decoded ID size | 4 bytes = 32 bits | Arbitrary |
Alphabet | Fixed | Configurable |
Complexity | Simpler | More complex |
¹ While sequential IDs look random in Sqids, the documentation calls out that this should not be relied on to conceal the sequence: they can still be decoded to reveal the underlying numbers.
But what about rude IDs in Id30?
All six-char sequences of the encoding alphabet are valid Id30s, and some of these will be considered rude or offensive by some subset of the population. The inherent difficulty in this is that what might be considered inappropriate varies by population and time. Handling this is out of scope for Id30, so to keep your IDs safe in this sense, you have to make use of a library designed for that, such as rustrict for Rust, and filter IDs at generation time.
For reference, these are the encoding and decoding tables as JSON:
ENCODING
, as a string: "0123456789abcdefghjkmnpqrtuvwxyz"
ENCODING
, to ASCII values: [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102, 103, 104, 106, 107, 109, 110, 112, 113, 114, 116, 117, 118, 119, 120, 121, 122]
DECODING
: [64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 64, 64, 64, 64, 64, 64, 64, 42, 43, 44, 45, 46, 47, 48, 49, 33, 50, 51, 33, 52, 53, 32, 54, 55, 56, 47, 57, 58, 59, 60, 61, 62, 63, 64, 64, 64, 64, 64, 64, 10, 11, 12, 13, 14, 15, 16, 17, 33, 18, 19, 33, 20, 21, 32, 22, 23, 24, 47, 25, 26, 27, 28, 29, 30, 31, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64]
Thanks to Haakon Nilsen for discussing the drafts to this blog post with me!