A base32 checksum
(Cue: in the voice of a badly-acted TV commercial.)
Do you always mistakenly wire money to or from the wrong account?
I used to do that all day, every day!
But then, I found base32check, and my life turned upside-down.
base32check is there for me on my daily purchases, watching out for my
typing errors online and offline! And now, for some reason,
I smell of roses and there are sparkling AfterEffects™ all around.
Download the base32check whitepaper. Symptoms may include an accelerated heart
rate, a shortness of breath, and acute depression. Please consult your
healthcare professional in case of severe brain injury.
That was weird, let’s be boring instead
When credit cards sprang into our lives, they were designed to prevent exactly
that kind of mistake. The last digit served no other purpose than to act as a
checksum, computed from the other digits, using an algorithm by Hans Luhn from 1959.
In his words:
It is commonly known that in copying a number comprised of a plurality of
digits it often happens that an error occurs by transposing two of the
digits.
Luhn’s goal was to detect all single-digit errors and most transpositions,
with a single check digit. However, it missed some transpositions, such as
09 / 90.
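As a quick illustration, here is a minimal sketch of the usual Luhn procedure, showing how the 09 / 90 transposition slips through:

```js
// Sketch of the Luhn check: double every second digit from the right,
// sum the digits of the results, and accept when the total is divisible by 10.
function luhnValid(number) {
  let sum = 0;
  [...number].reverse().forEach((ch, i) => {
    let d = Number(ch);
    if (i % 2 === 1) { d *= 2; if (d > 9) d -= 9; }  // doubling; e.g. 14 contributes 1+4 = 5
    sum += d;
  });
  return sum % 10 === 0;
}
console.log(luhnValid('091'), luhnValid('901'));  // true, true: the transposition goes undetected
```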
Whether a single check digit could also detect all adjacent transpositions
was an open question, until 1969, when Verhoeff designed another
algorithm that did just that.
Furthermore, he gathered the results of a study into a thorough analysis
of the causes of human errors.
Meanwhile, the International Organization for Standardization published
ISO/IEC 7064 in 1983. It includes algorithms for one- and two-character
checksums of numerical and alphanumerical inputs, with the goal
of detecting all single substitutions and most adjacent transpositions.
They intended it for use in future standards.
For instance, IBAN (the International Bank Account Number),
published in 1997, needed a checksum to lower the risk of accidentally
transferring funds to the wrong account number.
The designers created IBAN as a sequence of alphanumeric characters.
Surprisingly, for the check digits, they relied on ISO/IEC 7064’s MOD 97-10,
which was designed only for numeric inputs.
Our analysis indicates that this was a BAD MOVE™.
A core promise of MOD 97-10 as specified in ISO/IEC 7064 is to detect
all single substitution errors. Well, not anymore! Not with that
alphabetic-to-numeric conversion. For instance, IIIIIII80 conflicts with
IIIII1I80, and UDODGOP17 with UDODG0P17.
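A quick way to see those collisions is to apply the letter-to-number conversion
(A = 10, B = 11, … Z = 35) and compare the raw remainders modulo 97; this
simplified sketch skips IBAN’s rearrangement of the country code and check digits:

```js
// Expand letters to numbers the IBAN way, then take the remainder modulo 97.
const expand = s => [...s].map(c => /[0-9]/.test(c) ? c : String(c.charCodeAt(0) - 55)).join('');
const mod97 = s => [...s].reduce((n, d) => (n * 10 + Number(d)) % 97, 0);
console.log(mod97(expand('IIIIIII80')));  // 29
console.log(mod97(expand('IIIII1I80')));  // 29 — same remainder, hence the same check digits
```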
Furthermore, while IBAN technically allows mixed case, every uppercase letter
conflicts with its lowercase counterpart, essentially making lowercase too risky
to use: all inputs must be uppercased.
On the other hand, MOD 97-10 checksums never produce letters, which leaves on
the table some information content that could otherwise strengthen the
checksum.
A Blank Slate
If we were to redesign the global banking system, how would we improve this?
The goals we want for our imaginary account number are:
- Be easy to transmit. So, text.
- Be short, to simplify copying.
- Contain enough space to scale globally. We don’t want to run out of
account numbers as the population grows. This, along with the previous
constraint, means that we need to encode as many bits of information as
we can within each character. base64 maybe? That’d be 6 bits per char.
- Minimize human transcription errors. base64 is notably poor in this
respect, because of the 0/O and I/l/1/i visual confusion. On the other
hand, base32 has more information density than just digits, but none of
the ambiguity. Finally, while base64 contains a dash or a dot that can
be cut off by text processors, base32 will be selected as a single word
even when double-clicking.
- Allow early detection of errors. We don’t want to inform the user
of their mistake asynchronously, so the algorithm should be easy to
implement in the user interface.
Obviously, I am not the first to look into this. Notably, Bitcoin
created a solution for this. How does it work, and how does it fare?
In Bitcoin, the account number is roughly a (cryptographic) hash of the user’s
public key, cut and concatenated with a checksum that also uses a cryptographic
hash.
Bitcoin account numbers are usually represented in text as base58. Base58 is
very similar to base64, except that it removes the 0/O and l/I ambiguity,
but still has the i/1, Z/7, B/8 and S/5 confusion. Besides, transcribing or
communicating mixed-case letters is tedious: for each letter, you must spell
out whether it is uppercase or lowercase. That is why fourwordsalluppercase is a
great WiFi password.
Bitcoin’s account number checksum uses roughly 5 characters (four bytes of
base58 is 32 ÷ log2(58) ≈ 5). It is fairly big, which at least ensures that
there won’t be mistakes… unless the account number goes unchecked. If that is
the case, neither the sender nor the recipient may use the funds transferred.
Sadly, the necessity of a bignum library for base58 decoding, and SHA256 as the
checksum algorithm, makes verifying the checksum laborious to implement.
Few user interfaces embed a validator as a result.
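To get a sense of the work involved, here is a rough sketch of Base58Check
validation (assuming Node.js for the SHA-256; a sketch, not a production
validator):

```js
// Decode base58 with bignum arithmetic, then compare the last 4 bytes
// with the double SHA-256 of the rest.
const { createHash } = require('crypto');
const ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz';

function base58Decode(s) {
  let n = 0n;
  for (const c of s) {
    const v = ALPHABET.indexOf(c);
    if (v < 0) throw new Error(`invalid base58 character: ${c}`);
    n = n * 58n + BigInt(v);
  }
  const bytes = [];
  for (; n > 0n; n /= 256n) bytes.unshift(Number(n % 256n));
  for (const c of s) { if (c === '1') bytes.unshift(0); else break; }  // leading '1's are zero bytes
  return Buffer.from(bytes);
}

function validBase58Check(address) {
  const bytes = base58Decode(address);
  const payload = bytes.slice(0, -4), check = bytes.slice(-4);
  const sha256 = b => createHash('sha256').update(b).digest();
  return sha256(sha256(payload)).slice(0, 4).equals(check);
}

console.log(validBase58Check('1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'));  // true for a well-formed address
```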
Finally, even if we wanted to replace the checksum with an easier-to-use check
digit, it is not immediately easy to design a base58 check character.
base32 avoids many of these inconveniences, especially in lowercase, where
the varied heights are easy to distinguish and where there are no S/5 and Z/7
ambiguities.
However, there isn’t a particularly good, dedicated, check character algorithm
for it — yet.
Let’s create one.
base32check1
Luhn, Verhoeff, Damm — some pretty nifty designs are already on the shelves.
Sadly, they mostly focus on digits, which have poor information capacity (3.3
bits per char).
While they can be extended to larger alphabets, doing so is neither trivial nor
standard. (Ah, the joy of tweaking the Verhoeff parameters.) Worse, they are no
longer the coolest kids on the block.
I found a 2016 paper referencing a 2014 one (which I couldn’t get
my hands on. Curse the IEEE overseers and their labyrinthine subscription
models! Please pay an $84.50 membership fee, and the paper is $14.95. Oh, and
would you mind donating because IEEE helps improve the human condition? Also
good news, we offer free shipping for PDF documents.)
No joke. But I digress. Missing the 2014 paper does not matter much anyway: the
principle is described in full in the 2016 paper, which is available online.
The main insight is this: if your check digit aₙ₊₁ solves Σ aᵢ·Pⁱ = 0
on a finite field where P is the companion matrix of a primitive polynomial,
it suddenly detects substitutions, transpositions, and twin substitutions with
up to R characters between them, where R grows with the cardinal. Magic!
(We will dig more into exactly what errors it detects later.)
You must pick two numbers carefully to design your checksum: the cardinal and
the primitive.
- The cardinal is the number of elements in the finite field. It is
therefore also the number of possible checksums. For a single-character check
digit, it must be no larger than the size of your alphabet.
- The primitive does not matter all that much. The most significant part
about it is that you need to compute one. Unfortunately, brute-force is usually
the simplest way to find some.
There are two kinds of possible cardinals: primes and prime powers. Prime
cardinals are neat, because operations are scalar. On the other hand, additions
and multiplications on fields of prime-power cardinal are matrix operations.
I first wanted a single-character checksum. In order to use all the bits that
this checksum had at its disposal, I needed it to support the full base32
alphabet. That meant the cardinal needed to be 32. Sadly, 32 is not a prime; so
it needed to be a prime power. If 2 is the prime, 2⁵ is the cardinal.
Since the cardinal is not a prime, P is a matrix. Here, its size is 5×5.
To build it, we can pick any degree-5 primitive polynomial; I chose 1+x²+x⁵.
To get the companion matrix, set the subdiagonal to 1s, and fill the rightmost
column with the polynomial coefficients, starting with x⁰ at the top:
⎛ 0 0 0 0 1 ⎞
⎜ 1 0 0 0 0 ⎟
⎜ 0 1 0 0 1 ⎟
⎜ 0 0 1 0 0 ⎟
⎝ 0 0 0 1 0 ⎠
Each aᵢ term is a representation of a base32 character as a GF(2⁵) (finite field
of order 2⁵) polynomial: a is 0, which we represent as (0 0 0 0 0); b is 1 or (0 0
0 0 1); f is 1+x² or (0 0 1 0 1).
Validating a check character is straightforward: compute Σ aᵢ·Pⁱ and verify that
the result is a zero vector. It does involve writing custom matrix
operations to ensure that every coefficient is computed modulo 2.
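Here is a sketch of that naive version, assuming an alphabet that maps a…z and
2…7 to 0…31 (which matches the a = 0, b = 1, f = 5 convention above):

```js
// Naive sketch: matrices and vectors over GF(2) as plain arrays of bits
// (index i is the coefficient of xⁱ), every operation reduced modulo 2.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';  // assumed ordering: a = 0 … 7 = 31
const P = [                 // companion matrix of 1 + x² + x⁵, as shown above
  [0, 0, 0, 0, 1],
  [1, 0, 0, 0, 0],
  [0, 1, 0, 0, 1],
  [0, 0, 1, 0, 0],
  [0, 0, 0, 1, 0],
];
const matMul = (a, b) => a.map(row =>
  b[0].map((_, k) => row.reduce((acc, aji, i) => (acc + aji * b[i][k]) % 2, 0)));
const matVec = (m, v) => m.map(row => row.reduce((acc, mji, i) => (acc + mji * v[i]) % 2, 0));
const toVec = ch => {
  const value = ALPHABET.indexOf(ch);
  return [0, 1, 2, 3, 4].map(i => (value >> i) & 1);
};

// Valid when Σ aᵢ·Pⁱ, over the payload followed by the check character, is zero.
function validates(str) {
  let p = P, sum = [0, 0, 0, 0, 0];
  for (const ch of str) {
    const term = matVec(p, toVec(ch));
    sum = sum.map((bit, i) => (bit + term[i]) % 2);
    p = matMul(p, P);       // move on to the next power of P
  }
  return sum.every(bit => bit === 0);
}
```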
Unsurprisingly, the performance is not all that great:
$ node test/perf
10000 runs.
base32check1: 243.500ms
There is a neat trick to speed up computation on a finite field whose
cardinal is a power of 2: instead of implementing vectors as arrays, use
integers whose binary representation is the list of polynomial coefficients. For
instance, 1+x² (which represents f) is 0b00101, or 5.
Then addition is XOR, and matrix multiplication is a conditional XOR on the
successive bits of the current row. With aᵢ and bᵢ the rows of two matrices,
each packed into a 5-bit integer:
⎛ a₀ ⎞ ⎛ b₀ ⎞ ⎛ Σ a₀[i]×bᵢ ⎞
⎜ a₁ ⎟ ⎜ b₁ ⎟ ⎜ Σ a₁[i]×bᵢ ⎟
⎜ a₂ ⎟ × ⎜ b₂ ⎟ = ⎜ Σ a₂[i]×bᵢ ⎟
⎜ a₃ ⎟ ⎜ b₃ ⎟ ⎜ Σ a₃[i]×bᵢ ⎟
⎝ a₄ ⎠ ⎝ b₄ ⎠ ⎝ Σ a₄[i]×bᵢ ⎠
It is 7× faster.
$ node test/perf
10000 runs.
base32check1: 34.120ms
Now, how do you compute the check character?
First, compute S = Σ aᵢ·Pⁱ for the data you have.
Then, we need to solve S + c·Pⁿ⁺¹ = 0 for c.
Let’s compute c·Pⁿ⁺¹ = -S.
Opposites are equal in GF(2ᵏ), so -S = S.
Now, we need to invert Pⁿ⁺¹.
To do that efficiently, we need the following insight: since a primitive element
is a generator of its finite field, its powers loop around through all non-zero
values. Therefore, P^(2ᵏ-1) = 1 (with 2ᵏ the cardinal, here 32), and so,
Pⁿ⁺¹·P^(2ᵏ-n-2) = 1. This gives us the inverse of
Pⁿ⁺¹, which we can get by generating all the powers of P when
initializing the system.
Then we have c = S·P^(2ᵏ-n-2).
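Putting it together, here is a sketch of the bit-packed computation (same
assumed alphabet as before; an illustration of the technique, not the reference
implementation):

```js
// Bit-packed sketch: field elements and matrix rows are 5-bit integers.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';     // assumed ordering: a = 0 … 7 = 31
const P = [0b10000, 0b00001, 0b10010, 0b00100, 0b01000]; // rows of the companion matrix; bit i is column i

const matMul = (a, b) => a.map(row => {
  let out = 0;
  for (let i = 0; i < 5; i++) if ((row >> i) & 1) out ^= b[i];  // conditional XOR per bit of the row
  return out;
});
const matVec = (m, v) => m.reduce((out, row, j) => {
  let parity = 0;
  for (let x = row & v; x; x >>= 1) parity ^= x & 1;            // dot product modulo 2
  return out | (parity << j);
}, 0);

// Precompute P⁰ … P³⁰; P³¹ loops back to the identity.
const POWERS = [[1, 2, 4, 8, 16]];
for (let i = 1; i < 31; i++) POWERS.push(matMul(POWERS[i - 1], P));

// S = Σ aᵢ·Pⁱ over the payload (i starting at 1), then c = S·P^(2⁵-n-2).
function checkChar(payload) {
  let s = 0;
  [...payload].forEach((ch, idx) => { s ^= matVec(POWERS[(idx + 1) % 31], ALPHABET.indexOf(ch)); });
  const n = payload.length;
  return ALPHABET[matVec(POWERS[(31 - ((n + 1) % 31)) % 31], s)];
}

function validates(str) {   // the payload followed by its check character
  let s = 0;
  [...str].forEach((ch, idx) => { s ^= matVec(POWERS[(idx + 1) % 31], ALPHABET.indexOf(ch)); });
  return s === 0;
}

console.log(validates('hello' + checkChar('hello')));  // true
```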
base32check2
I chose to also design a 2-character checksum alongside the first, on the
off-chance that the first was not strong enough. (After all, the single-char
checksum needed to make the most of 5 bits, while MOD 97-10’s two check digits
had a roomier 6.6 bits to play with.)
There are 2^(2×5) = 1024 possible combinations of two characters of
base32. Using the same design as above, it required working on 10×10 matrices. I
chose not to do that for the following reasons:
- The initialization matrix starts to look as scary as Verhoeff’s!
- There was opportunity for a design without matrices by instead picking a prime
cardinal, which would enormously reduce implementation complexity.
- Using a prime instead of a prime power improves your multi-substitution
detection. It doesn’t matter for the 1-character checksum, because improving
your 2-substitution score at the expense of your single-substitution score is
a net loss.
So, I picked the largest prime below 1024, 1021, to limit the number of
sequences of two characters with the same value in GF(p). Reading pairs of
characters as 10-bit values, it does mean that replacing any 75 by aa on even
positions will give the same checksum, and the same is true of 76 / ab and 77 / ac.
I brute-forced a primitive element, and again I took the largest, 1011, although
this time, there is no justification, just sheer superstition. (I tried
computing the statistics of detection of all primitives, and they seemed
equivalent.)
The computation of the check characters is identical. Since the matrices are now
1×1, all operations are scalar. Counter-intuitively, it is faster, although
still in the same ballpark as our optimized base32check1 design above:
$ node test/perf
10000 runs.
base32check2: 23.701ms
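Here is a sketch of that scalar computation; the pairing of characters into
10-bit values and the exponent convention are assumptions of this sketch, not
the published design:

```js
// Scalar sketch: pairs of base32 characters become 10-bit values, arithmetic is
// modulo the prime 1021, and the primitive element is 1011.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';  // assumed ordering: a = 0 … 7 = 31
const CARD = 1021, PRIM = 1011, ORDER = CARD - 1;     // ORDER of the multiplicative group

const powMod = (base, exp) => {
  let r = 1;
  for (let i = 0; i < exp; i++) r = (r * base) % CARD;
  return r;
};

// The check pair c solves Σ bⱼ·PRIMʲ + c·PRIM^(m+1) ≡ 0 (mod 1021), j = 1…m.
function checkPair(payload) {                          // payload length assumed even here
  const pairs = [];
  for (let i = 0; i < payload.length; i += 2) {
    pairs.push(ALPHABET.indexOf(payload[i]) * 32 + ALPHABET.indexOf(payload[i + 1]));
  }
  let s = 0;
  pairs.forEach((b, idx) => { s = (s + b * powMod(PRIM, idx + 1)) % CARD; });
  const inverse = powMod(PRIM, (ORDER - ((pairs.length + 1) % ORDER)) % ORDER);
  const c = ((CARD - s) % CARD) * inverse % CARD;      // c = (−S)·PRIM^(−(m+1))
  return ALPHABET[Math.floor(c / 32)] + ALPHABET[c % 32];
}

console.log(checkPair('hello2'));  // two check characters for an even-length payload
```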
The last question we may have is: how well does it compare?
Results
| Error type | Frequency | MOD 11-10 | MOD 97-10 | MOD 37-36 | MOD 1271-36 | base32check1 | base32check2 |
|---|---|---|---|---|---|---|---|
| 1sub | 79.05% | 7.6% | 0.291% | 0% | 0% | 0% | 0% |
| 0-trans | 10.21% | 9.503% | 0.405% | 0.195% | 0% | 0% | 0% |
| 0-2sub | 1.92% | 10.167% | 1.084% | 2.892% | 0% | 3.175% | 0.025% |
| 5sub | 1.81% | 10.083% | 0.996% | 2.769% | 0.086% | 3.101% | 0.101% |
| 3sub | 1.4% | 9.975% | 1.048% | 2.714% | 0.072% | 3.087% | 0.088% |
| 6sub | 1.34% | 9.958% | 1.013% | 2.775% | 0.074% | 3.099% | 0.099% |
| 4sub | 0.97% | 10.042% | 1.033% | 2.737% | 0.069% | 3.133% | 0.1% |
| 1-trans | 0.82% | 11.689% | 0.298% | 1.931% | 0% | 0% | 0% |
| 2sub | 0.81% | 9.909% | 0.971% | 2.689% | 0.05% | 3.022% | 0.055% |
| 0-twin | 0.55% | 9.858% | 0.351% | 1.805% | 0% | 0% | 0% |
| phonetic | 0.49% | 11.199% | 0% | 11.055% | 0% | 0% | 0% |
| 1-2sub | 0.36% | 9.982% | 1% | 2.765% | 0.04% | 3.229% | 0.208% |
| 1-twin | 0.29% | 12.871% | 0.331% | 3.755% | 0% | 0% | 0% |
| Format | N/A | 1-digit | 2-digit | 1-alnum | 2-alnum | 1-base32 | 2-base32 |
| Detection rate | N/A | 91.994% | 99.629% | 99.654% | 99.995% | 99.732% | 99.993% |
| Det. factor | N/A | 0.729 | 0.807 | 1.581 | 1.382 | 1.709 | 1.380 |
Error types:
- nsub: a sequence of n single-character substitutions, e.g. a..b / c..d is a 2sub.
- m-trans: a single transposition between characters that have m other
characters separating them, e.g. a.c / c.a is a 1-trans.
- m-nsub: a sequence of n single-character substitutions that have m
other characters separating them, e.g. a.b / c.d is a 1-2sub.
- m-twin: two identical characters that are both substituted to the same
character, with m other characters separating them, e.g. a.a / b.b is
a 1-twin.
- phonetic: 1a / a0 with a between 1 and 9, e.g. thirteen / thirty.
The frequency column indicates how often a human makes a particular type of
error, based on Verhoeff’s findings.
We use it to compute an approximate detection rate, relying on the
statistical proportion of random inputs for which the checksum does not
detect a given error type. You can see the proportion of undetected errors
for each checksum in the table above.
We added an interesting figure, the detection factor. After all,
we want our checksum to make the most of the space it takes. Each bit should
contribute to improving the detection rate. Since we expect a law of diminishing
returns for every additional bit, the detection rate of a given design should
improve with the number of checksum bits as 1 - 2^(-k·b), with b the
number of checksum bits, and k the detection factor, which we compute as
-log2(1-detection) ÷ b.
Each check character occupies a base32 slot, so it costs 5 bits of
information, except for the alphanumerical ones, which cheat by producing
values that cannot be mapped to base32 characters without increasing the
error rate. Those take log2(36) ≈ 5.17 bits each.
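As a sanity check of the formula, the factors in the table can be recomputed
from the detection rates and these bit costs:

```js
// Detection factor k = -log2(1 - detection) ÷ b, with b the checksum's bit cost.
const factor = (detection, bits) => -Math.log2(1 - detection) / bits;
console.log(factor(0.91994, 5).toFixed(3));                  // MOD 11-10    → 0.729
console.log(factor(0.99732, 5).toFixed(3));                  // base32check1 → 1.709
console.log(factor(0.99995, 2 * Math.log2(36)).toFixed(3));  // MOD 1271-36  → 1.382
```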
(The detection factor is not a perfect model, as the ISO/IEC 7064
algorithms don’t all have the same factor, despite being the same design with
tweaked parameters. That said, MOD 1007-32 is between base32check2 and MOD
1271-36.)
You will notice two positive things:
- Both base32check algorithms have a detection rate similar or superior to that of
all others for a given number of check characters. (MOD 1271-36 is essentially
identical, apart from the 0/O, 1/I and 8/B confusion risk, which is
unfortunately not computed here, for lack of empirical data.)
- base32check1 has the best overall detection factor. base32check2 has one
comparable to MOD 1271-36’s. It is slightly lower, implying that it could
potentially use its bits better. Indeed, there are a few tweaks that could
help, although I would prefer to await more statistical human error data
before changing the design, as the detection rate varies significantly if
we label the 3sub through 6sub errors as insertions and deletions instead, which
they probably are; Verhoeff’s study does not distinguish them.
While writing the algorithm, I worried that perhaps I would not be able to
beat the detection rate of IBAN with a single check character (half the
amount!). The whole point of designing base32check2 was the off-chance
that base32check1 could not. Fortunately, even base32check1 is better than IBAN’s MOD
97-10, mostly thanks to IBAN’s character conversion procedure.
User Study
The computed scores depend on second-hand information from Verhoeff’s study. It
may not map accurately to other alphabets, or alphabet sizes.
Moreover, it does not help determine which alphabet is least error-prone for
human transcription.
Down to the core, we want to minimize the number of times you send money to the
wrong account.
prob_oops = 1 - prob_fine
prob_fine = Σ_{error types} prob_error·prob_detect_error + (1 - prob_error)
prob_error = 1 - (1 - prob_char_error) ^ #chars
- prob_error is the probability that the human committed at least one error of
that type over the whole text.
- prob_char_error is the probability of committing that type of error on a
given character. It depends on the type of error, and α, the size of the
alphabet used: the more symbols there are, the more they look alike.
- prob_detect_error is the probability that the error gets detected. It
depends on the type of error. It is 1 - prob_collision, the probability that
the checksum yields the same result for two inputs separated by an error of a
given type.
- #chars is the payload size. To transmit b bits of information, it is
⌈b ÷ log2(α)⌉ + checksum size.
We can get the probability of a single-character substitution, for instance,
from the probability p of a substitution on a text holding b bits:
prob_char_error = 1 - 2 ^ (log2(1 - p) ÷ ⌈b ÷ log2(α)⌉)
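To make the model concrete, here is a tiny sketch for a single error type; the
values of p and b are made up for illustration, and the detection probability is
base32check1’s 2sub figure from the table:

```js
// Illustrative numbers only: probability of an undetected 2sub error when
// transmitting b payload bits in base32 with one check character.
const p = 0.001;                    // assumed probability of a 2sub over the whole text
const b = 80, alpha = 32;           // payload bits and alphabet size
const payloadChars = Math.ceil(b / Math.log2(alpha));
const chars = payloadChars + 1;     // plus one check character
const probCharError = 1 - 2 ** (Math.log2(1 - p) / payloadChars);
const probError = 1 - (1 - probCharError) ** chars;
const probDetect = 1 - 0.03022;     // base32check1 on 2sub errors, from the table
const probFine = probError * probDetect + (1 - probError);
console.log({ probCharError, probError, probOops: 1 - probFine });
```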
So, all we need, to find an optimal α and checksum size, is a user study
estimating the probability of human mistakes.
We will look at A = 7 alphabets:
- base10 (digits) is found in most identifiers, including credit card
numbers. Checksums include Luhn, MOD 11-10 and MOD 97-10,
- base16 is found in most binary representations, including IPv6. A
checksum for it is presented in the paper we based our work on,
- base32 (our main focus),
- base36, seen in IBAN, has MOD 37-36 and MOD 1271-36,
- base58 as used by Bitcoin addresses; we don’t have a checksum for it, but
one could be designed based on either MOD algorithms or the paper’s, although the latter would be more complicated, as it would use a
combination of GF(2) and GF(29),
- base64, mostly to see how much base58 improves on it, can also easily
have a checksum designed for it,
- base1024, a fresh design, used nowhere, where 10 bits are represented as 3
alternating consonants and vowels, which means base32check2 would work just
fine. It has roughly the same information density as base10, which makes it an
interesting comparison. Would all base10 identifiers be better served by being
shown as base1024 instead?
Let’s ask each participant to submit S = 50 entries.
To get error estimations accurate to the percent, we need 100÷(S÷A) = 14
participants; we need ten times that to remove most of the estimation error.
All participant information is anonymous and sent securely.
Give it a try! Contribute to the study! Statistics will appear as you make
progress. Plus, it is pretty relaxing.
References