A base32 checksum
(Cue: in the voice of a badly-acted TV commercial.)
Do you always mistakenly wire money to or from the wrong account?
I used to do that all day, every day!
But then, I found base32check, and my life turned upside-down.
base32check is there for me on my daily purchases, watching out for my
typing errors online and offline! And now, for some reason,
I smell of roses and there are sparkling AfterEffects™ all around.
Download the base32check whitepaper. Symptoms may include an accelerated heart
rate, a shortness of breath, and acute depression. Please consult your
healthcare professional in case of severe brain injury.
That was weird, let’s be boring instead
When credit cards sprang into our lives, they were designed to prevent exactly
that kind of mistake. The last digit served no other purpose than to act as a
checksum, computed from the other digits, using an algorithm by Hans Luhn from 1959.
In his words:
It is commonly known that in copying a number comprised of a plurality of
digits it often happens that an error occurs by transposing two of the
digits.
Luhn’s goal was to detect all single-digit errors and most transpositions,
with a single check digit. However, it missed some transpositions, such as
09 / 90.
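As a quick illustration, here is a minimal sketch of the usual Luhn procedure, showing how the 09 / 90 transposition slips through:

```js
// Sketch of the Luhn check: double every second digit from the right,
// sum the digits of the results, and accept when the total is divisible by 10.
function luhnValid(number) {
  let sum = 0;
  [...number].reverse().forEach((ch, i) => {
    let d = Number(ch);
    if (i % 2 === 1) { d *= 2; if (d > 9) d -= 9; }  // doubling; e.g. 14 contributes 1+4 = 5
    sum += d;
  });
  return sum % 10 === 0;
}
console.log(luhnValid('091'), luhnValid('901'));  // true, true: the transposition goes undetected
```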
Whether a single check digit could also detect all adjacent transpositions
was an open question, until 1969, when Verhoeff designed another
algorithm that did just that.
Furthermore, he gathered the results of a study into a thorough analysis
of the causes of human errors.
Meanwhile, the International Organization for Standardization published
ISO/IEC 7064 in 1983. It includes algorithms for one- and two-character
checksums of numerical and alphanumerical inputs, with the goal
of detecting all single substitutions and most adjacent transpositions.
They intended it for use in future standards.
For instance, IBAN (the International Bank Account Number),
published in 1997, needed a checksum to lower the risk of accidentally
transferring funds to the wrong account number.
The designers created IBAN as a sequence of alphanumeric characters.
Surprisingly, for the check digits, they relied on ISO/IEC 7064’s MOD 97-10,
which was designed only for numeric inputs.
Our analysis indicates that this was a BAD MOVE™.
A core promise of MOD 97-10 as specified in ISO/IEC 7064 is to detect
all single substitution errors. Well, not anymore! Not with that
alphabetic-to-numeric conversion. For instance, IIIIIII80 conflicts with
IIIII1I80, and UDODGOP17 with UDODG0P17.
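A quick way to see those collisions is to apply the letter-to-number conversion
(A = 10, B = 11, … Z = 35) and compare the raw remainders modulo 97; this
simplified sketch skips IBAN’s rearrangement of the country code and check digits:

```js
// Expand letters to numbers the IBAN way, then take the remainder modulo 97.
const expand = s => [...s].map(c => /[0-9]/.test(c) ? c : String(c.charCodeAt(0) - 55)).join('');
const mod97 = s => [...s].reduce((n, d) => (n * 10 + Number(d)) % 97, 0);
console.log(mod97(expand('IIIIIII80')));  // 29
console.log(mod97(expand('IIIII1I80')));  // 29 — same remainder, hence the same check digits
```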
Furthermore, while IBAN technically allows mixed case, every uppercase letter
conflicts with its lowercase counterpart, essentially making lowercase too risky
to use: all inputs must be uppercased.
On the other hand, MOD 97-10 checksums never produce letters, which leaves on
the table some information content that could otherwise strengthen the
checksum.
A Blank Slate
If we were to redesign the global banking system, how would we improve this?
The goals we want for our imaginary account number are:
- Be easy to transmit. So, text.
- Be short, to simplify copying.
- Contain enough space to scale globally. We don’t want to run out of
account numbers as the population grows. This, along with the previous
constraint, means that we need to encode as many bits of information as
we can within each character. base64 maybe? That’d be 6 bits per char.
- Minimize human transcription errors. base64 is notably poor in this
respect, because of the 0/O and I/l/1/i visual confusion. On the other
hand, base32 has more information density than just digits, but none of
the ambiguity. Finally, while base64 contains a dash or a dot that can
be cut off by text processors, base32 will be selected as a single word
even when double-clicking.
- Allow early detection of errors. We don’t want to inform the user
of their mistake asynchronously, so the algorithm should be easy to
implement in the user interface.
Obviously, I am not the first to look into this. Notably, Bitcoin
created a solution for this. How does it work, and how does it fare?
In Bitcoin, the account number is roughly a (cryptographic) hash of the user’s
public key, cut and concatenated with a checksum that also uses a cryptographic
hash.
Bitcoin account numbers are usually represented in text as base58. Base58 is
very similar to base64, except that it removes the 0/O and l/I ambiguity,
but still has the i/1, Z/7, B/8 and S/5 confusion. Besides, transcribing or
communicating mixed-case letters is tedious: for each letter, you must spell
out whether it is uppercase or lowercase. That is why fourwordsalluppercase is a
great WiFi password.
Bitcoin’s account number checksum uses roughly 5 characters (four bytes of
base58 is 32 ÷ log2(58) ≈ 5). It is fairly big, which at least ensures that
there won’t be mistakes… unless the account number goes unchecked. If that is
the case, neither the sender nor the recipient may use the funds transferred.
Sadly, the necessity of a bignum library for base58 decoding, and SHA256 as the
checksum algorithm, makes verifying the checksum laborious to implement.
Few user interfaces embed a validator as a result.
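To get a sense of the work involved, here is a rough sketch of Base58Check
validation (assuming Node.js for the SHA-256; a sketch, not a production
validator):

```js
// Decode base58 with bignum arithmetic, then compare the last 4 bytes
// with the double SHA-256 of the rest.
const { createHash } = require('crypto');
const ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz';

function base58Decode(s) {
  let n = 0n;
  for (const c of s) {
    const v = ALPHABET.indexOf(c);
    if (v < 0) throw new Error(`invalid base58 character: ${c}`);
    n = n * 58n + BigInt(v);
  }
  const bytes = [];
  for (; n > 0n; n /= 256n) bytes.unshift(Number(n % 256n));
  for (const c of s) { if (c === '1') bytes.unshift(0); else break; }  // leading '1's are zero bytes
  return Buffer.from(bytes);
}

function validBase58Check(address) {
  const bytes = base58Decode(address);
  const payload = bytes.slice(0, -4), check = bytes.slice(-4);
  const sha256 = b => createHash('sha256').update(b).digest();
  return sha256(sha256(payload)).slice(0, 4).equals(check);
}

console.log(validBase58Check('1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'));  // true for a well-formed address
```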
Finally, even if we wanted to replace the checksum with an easier-to-use check
digit, it is not immediately easy to design a base58 check character.
base32 avoids many of these inconveniences, especially in lowercase, where
the varied heights are easy to distinguish and where there are no S/5 and Z/7
ambiguities.
However, there isn’t a particularly good, dedicated, check character algorithm
for it — yet.
Let’s create one.
base32check1
Luhn, Verhoeff, Damm — some pretty nifty designs are already on the shelves.
Sadly, they mostly focus on digits, which have poor information capacity (3.3
bits per char).
While they can be extended to larger alphabets, doing so is neither trivial nor
standard. (Ah, the joy of tweaking the Verhoeff parameters.) Worse, they are no
longer the coolest kids on the block.
I found a 2016 paper referencing a 2014 one (which I couldn’t get
my hands on. Curse the IEEE overseers and their labyrinthine subscription
models! Please pay an $84.50 membership fee, and the paper is $14.95. Oh, and
would you mind donating because IEEE helps improve the human condition? Also
good news, we offer free shipping for PDF documents.)
No joke. But I digress. Missing the 2014 paper does not matter much anyway: the
principle is described in full in the 2016 paper, which is available online.
The main insight is this: if your check digit aₙ₊₁ solves Σ aᵢ·Pⁱ = 0
on a finite field where P is the companion matrix of a primitive polynomial,
it suddenly detects substitutions, transpositions, and twin substitutions with
up to R characters between them, where R grows with the cardinal. Magic!
(We will dig more into exactly what errors it detects later.)
You must pick two numbers carefully to design your checksum: the cardinal and
the primitive.
- The cardinal is the number of elements in the finite field. It is
therefore also the number of possible checksums. For a single-character check
digit, it must be no larger than the size of your alphabet.
- The primitive does not matter all that much. The most significant part
about it is that you need to compute one. Unfortunately, brute-force is usually
the simplest way to find some.
There are two kinds of possible cardinals: primes and prime powers. Prime
cardinals are neat, because operations are scalar. On the other hand, additions
and multiplications on fields of prime-power cardinal are matrix operations.
I first wanted a single-character checksum. In order to use all the bits that
this checksum had at its disposal, I needed it to support the full base32
alphabet. That meant the cardinal needed to be 32. Sadly, 32 is not a prime; so
it needed to be a prime power. If 2 is the prime, 2⁵ is the cardinal.
Since the cardinal is not a prime, P is a matrix. Here, its size is 5×5.
To build it, we can pick any degree-5 primitive polynomial; I chose 1+x²+x⁵.
To get the companion matrix, set the subdiagonal to 1s, and fill the rightmost
column with the polynomial coefficients, starting with x⁰ at the top:
⎛ 0 0 0 0 1 ⎞
⎜ 1 0 0 0 0 ⎟
⎜ 0 1 0 0 1 ⎟
⎜ 0 0 1 0 0 ⎟
⎝ 0 0 0 1 0 ⎠
Each aᵢ term is a representation of a base32 character as a GF(2⁵) (finite field
of order 2⁵) polynomial: a is 0, which we represent as (0 0 0 0 0); b is 1 or (0 0
0 0 1); f is 1+x² or (0 0 1 0 1).
Validating a check character is straightforward: compute Σ aᵢ·Pⁱ and verify that
the result is a zero vector. It does involve writing custom matrix
operations to ensure that every coefficient is computed modulo 2.
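Here is a sketch of that naive version, assuming an alphabet that maps a…z and
2…7 to 0…31 (which matches the a = 0, b = 1, f = 5 convention above):

```js
// Naive sketch: matrices and vectors over GF(2) as plain arrays of bits
// (index i is the coefficient of xⁱ), every operation reduced modulo 2.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';  // assumed ordering: a = 0 … 7 = 31
const P = [                 // companion matrix of 1 + x² + x⁵, as shown above
  [0, 0, 0, 0, 1],
  [1, 0, 0, 0, 0],
  [0, 1, 0, 0, 1],
  [0, 0, 1, 0, 0],
  [0, 0, 0, 1, 0],
];
const matMul = (a, b) => a.map(row =>
  b[0].map((_, k) => row.reduce((acc, aji, i) => (acc + aji * b[i][k]) % 2, 0)));
const matVec = (m, v) => m.map(row => row.reduce((acc, mji, i) => (acc + mji * v[i]) % 2, 0));
const toVec = ch => {
  const value = ALPHABET.indexOf(ch);
  return [0, 1, 2, 3, 4].map(i => (value >> i) & 1);
};

// Valid when Σ aᵢ·Pⁱ, over the payload followed by the check character, is zero.
function validates(str) {
  let p = P, sum = [0, 0, 0, 0, 0];
  for (const ch of str) {
    const term = matVec(p, toVec(ch));
    sum = sum.map((bit, i) => (bit + term[i]) % 2);
    p = matMul(p, P);       // move on to the next power of P
  }
  return sum.every(bit => bit === 0);
}
```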
Unsurprisingly, the performance is not all that great:
$ node test/perf
10000 runs.
base32check1: 243.500ms
There is a neat trick to speed up computation on a finite field whose
cardinal is a power of 2: instead of implementing vectors as arrays, use
integers whose binary representation is the list of polynomial coefficients. For
instance, 1+x² (which represents f) is 0b00101, or 5.
Then addition is XOR, and matrix multiplication is a conditional XOR on the
successive bits of the current row. With aᵢ and bᵢ the rows of two matrices,
each packed into a 5-bit integer:
⎛ a₀ ⎞ ⎛ b₀ ⎞ ⎛ Σ a₀[i]×bᵢ ⎞
⎜ a₁ ⎟ ⎜ b₁ ⎟ ⎜ Σ a₁[i]×bᵢ ⎟
⎜ a₂ ⎟ × ⎜ b₂ ⎟ = ⎜ Σ a₂[i]×bᵢ ⎟
⎜ a₃ ⎟ ⎜ b₃ ⎟ ⎜ Σ a₃[i]×bᵢ ⎟
⎝ a₄ ⎠ ⎝ b₄ ⎠ ⎝ Σ a₄[i]×bᵢ ⎠
It is 7× faster.
$ node test/perf
10000 runs.
base32check1: 34.120ms
Now, how do you compute the check character?
First, compute S = Σ aᵢ·Pⁱ for the data you have.
Then, we need to solve S + c·Pⁿ⁺¹ = 0 for c.
Let’s compute c·Pⁿ⁺¹ = -S.
Opposites are equal in GF(2ᵏ), so -S = S.
Now, we need to invert Pⁿ⁺¹.
To do that efficiently, we need the following insight: since a primitive element
is a generator of its finite field, its powers loop around through all non-zero
values. Therefore, P^(2ᵏ-1) = 1 (with 2ᵏ the cardinal, here 32), and so,
Pⁿ⁺¹·P^(2ᵏ-n-2) = 1. This gives us the inverse of
Pⁿ⁺¹, which we can get by generating all the powers of P when
initializing the system.
Then we have c = S·P^(2ᵏ-n-2).
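Putting it together, here is a sketch of the bit-packed computation (same
assumed alphabet as before; an illustration of the technique, not the reference
implementation):

```js
// Bit-packed sketch: field elements and matrix rows are 5-bit integers.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';     // assumed ordering: a = 0 … 7 = 31
const P = [0b10000, 0b00001, 0b10010, 0b00100, 0b01000]; // rows of the companion matrix; bit i is column i

const matMul = (a, b) => a.map(row => {
  let out = 0;
  for (let i = 0; i < 5; i++) if ((row >> i) & 1) out ^= b[i];  // conditional XOR per bit of the row
  return out;
});
const matVec = (m, v) => m.reduce((out, row, j) => {
  let parity = 0;
  for (let x = row & v; x; x >>= 1) parity ^= x & 1;            // dot product modulo 2
  return out | (parity << j);
}, 0);

// Precompute P⁰ … P³⁰; P³¹ loops back to the identity.
const POWERS = [[1, 2, 4, 8, 16]];
for (let i = 1; i < 31; i++) POWERS.push(matMul(POWERS[i - 1], P));

// S = Σ aᵢ·Pⁱ over the payload (i starting at 1), then c = S·P^(2⁵-n-2).
function checkChar(payload) {
  let s = 0;
  [...payload].forEach((ch, idx) => { s ^= matVec(POWERS[(idx + 1) % 31], ALPHABET.indexOf(ch)); });
  const n = payload.length;
  return ALPHABET[matVec(POWERS[(31 - ((n + 1) % 31)) % 31], s)];
}

function validates(str) {   // the payload followed by its check character
  let s = 0;
  [...str].forEach((ch, idx) => { s ^= matVec(POWERS[(idx + 1) % 31], ALPHABET.indexOf(ch)); });
  return s === 0;
}

console.log(validates('hello' + checkChar('hello')));  // true
```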
base32check2
I chose to also design a 2-character checksum alongside the first, on the
off-chance that the first was not strong enough. (After all, the single-char
checksum needed to make the most of 5 bits, while MOD 97-10’s two check digits
had a roomier 6.6 bits to play with.)
There are 2^(2×5) = 1024 possible combinations of two characters of
base32. Using the same design as above, it required working on 10×10 matrices. I
chose not to do that for the following reasons:
- The initialization matrix starts to look as scary as Verhoeff’s!
- There was opportunity for a design without matrices by instead picking a prime
cardinal, which would enormously reduce implementation complexity.
- Using a prime instead of a prime power improves your multi-substitution
detection. It doesn’t matter for the 1-character checksum, because improving
your 2-substitution score at the expense of your single-substitution score is
a net loss.
So, I picked the largest prime below 1024, 1021, to limit the number of
sequences of two characters with the same value in GF(p). Reading pairs of
characters as 10-bit values, it does mean that replacing any 75 by aa on even
positions will give the same checksum, and the same is true of 76 / ab and 77 / ac.
I brute-forced a primitive element, and again I took the largest, 1011, although
this time, there is no justification, just sheer superstition. (I tried
computing the statistics of detection of all primitives, and they seemed
equivalent.)
The computation of the check characters is identical. Since the matrices are now
1×1, all operations are scalar. Counter-intuitively, it is faster, although
still in the same ballpark as our optimized base32check1 design above:
$ node test/perf
10000 runs.
base32check2: 23.701ms
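Here is a sketch of that scalar computation; the pairing of characters into
10-bit values and the exponent convention are assumptions of this sketch, not
the published design:

```js
// Scalar sketch: pairs of base32 characters become 10-bit values, arithmetic is
// modulo the prime 1021, and the primitive element is 1011.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz234567';  // assumed ordering: a = 0 … 7 = 31
const CARD = 1021, PRIM = 1011, ORDER = CARD - 1;     // ORDER of the multiplicative group

const powMod = (base, exp) => {
  let r = 1;
  for (let i = 0; i < exp; i++) r = (r * base) % CARD;
  return r;
};

// The check pair c solves Σ bⱼ·PRIMʲ + c·PRIM^(m+1) ≡ 0 (mod 1021), j = 1…m.
function checkPair(payload) {                          // payload length assumed even here
  const pairs = [];
  for (let i = 0; i < payload.length; i += 2) {
    pairs.push(ALPHABET.indexOf(payload[i]) * 32 + ALPHABET.indexOf(payload[i + 1]));
  }
  let s = 0;
  pairs.forEach((b, idx) => { s = (s + b * powMod(PRIM, idx + 1)) % CARD; });
  const inverse = powMod(PRIM, (ORDER - ((pairs.length + 1) % ORDER)) % ORDER);
  const c = ((CARD - s) % CARD) * inverse % CARD;      // c = (−S)·PRIM^(−(m+1))
  return ALPHABET[Math.floor(c / 32)] + ALPHABET[c % 32];
}

console.log(checkPair('hello2'));  // two check characters for an even-length payload
```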
The last question we may have is: how well does it compare?
Results
| Error type | Frequency | MOD 11-10 | MOD 97-10 | MOD 37-36 | MOD 1271-36 | base32check1 | base32check2 |
|---|---|---|---|---|---|---|---|
| 1sub | 79.05% | 7.6% | 0.291% | 0% | 0% | 0% | 0% |
| 0-trans | 10.21% | 9.503% | 0.405% | 0.195% | 0% | 0% | 0% |
| 0-2sub | 1.92% | 10.167% | 1.084% | 2.892% | 0% | 3.175% | 0.025% |
| 5sub | 1.81% | 10.083% | 0.996% | 2.769% | 0.086% | 3.101% | 0.101% |
| 3sub | 1.4% | 9.975% | 1.048% | 2.714% | 0.072% | 3.087% | 0.088% |
| 6sub | 1.34% | 9.958% | 1.013% | 2.775% | 0.074% | 3.099% | 0.099% |
| 4sub | 0.97% | 10.042% | 1.033% | 2.737% | 0.069% | 3.133% | 0.1% |
| 1-trans | 0.82% | 11.689% | 0.298% | 1.931% | 0% | 0% | 0% |
| 2sub | 0.81% | 9.909% | 0.971% | 2.689% | 0.05% | 3.022% | 0.055% |
| 0-twin | 0.55% | 9.858% | 0.351% | 1.805% | 0% | 0% | 0% |
| phonetic | 0.49% | 11.199% | 0% | 11.055% | 0% | 0% | 0% |
| 1-2sub | 0.36% | 9.982% | 1% | 2.765% | 0.04% | 3.229% | 0.208% |
| 1-twin | 0.29% | 12.871% | 0.331% | 3.755% | 0% | 0% | 0% |
| Format | N/A | 1-digit | 2-digit | 1-alnum | 2-alnum | 1-base32 | 2-base32 |
| Detection rate | N/A | 91.994% | 99.629% | 99.654% | 99.995% | 99.732% | 99.993% |
| Det. factor | N/A | 0.729 | 0.807 | 1.581 | 1.382 | 1.709 | 1.380 |
Error types:
- nsub: a sequence of n single-character substitutions, e.g. a..b / c..d is a 2sub.
- m-trans: a single transposition between characters that have m other
characters separating them, e.g. a.c / c.a is a 1-trans.
- m-nsub: a sequence of n single-character substitutions that have m
other characters separating them, e.g. a.b / c.d is a 1-2sub.
- m-twin: two identical characters that are both substituted to the same
character, with m other characters separating them, e.g. a.a / b.b is
a 1-twin.
- phonetic: 1a / a0 with a between 1 and 9, e.g. thirteen / thirty.
The frequency column indicates how often a human makes a particular type of
error, based on Verhoeff’s findings.
We use it to compute an approximate detection rate, relying on the
statistical proportion of random inputs for which the checksum does not
detect a given error type. You can see the proportion of undetected errors
for each checksum in the table above.
We added an interesting figure, the detection factor. After all,
we want our checksum to make the most of the space it takes. Each bit should
contribute to improving the detection rate. Since we expect a law of diminishing
returns for every additional bit, the detection rate of a given design should
improve with the number of checksum bits as 1 - 2^(-k·b), with b the
number of checksum bits, and k the detection factor, which we compute as
-log2(1-detection) ÷ b.
Each check character occupies a base32 slot, so it costs 5 bits of
information, except for the alphanumerical ones, which cheat by producing
values that cannot be mapped to base32 characters without increasing the
error rate. Those take log2(36) ≈ 5.17 bits each.
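As a sanity check of the formula, the factors in the table can be recomputed
from the detection rates and these bit costs:

```js
// Detection factor k = -log2(1 - detection) ÷ b, with b the checksum's bit cost.
const factor = (detection, bits) => -Math.log2(1 - detection) / bits;
console.log(factor(0.91994, 5).toFixed(3));                  // MOD 11-10    → 0.729
console.log(factor(0.99732, 5).toFixed(3));                  // base32check1 → 1.709
console.log(factor(0.99995, 2 * Math.log2(36)).toFixed(3));  // MOD 1271-36  → 1.382
```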
(The detection factor is not a perfect model, as the ISO/IEC 7064
algorithms don’t all have the same factor, despite being the same design with
tweaked parameters. That said, MOD 1007-32 is between base32check2 and MOD
1271-36.)
You will notice two positive things:
- Both base32check algorithms have a detection rate similar or superior to that of
all others for a given number of check characters. (MOD 1271-36 is essentially
identical, apart from the 0/O, 1/I and 8/B confusion risk, which is
unfortunately not computed here, for lack of empirical data.)
- base32check1 has the best overall detection factor. base32check2 has one
comparable to MOD 1271-36’s. It is slightly lower, implying that it could
potentially use its bits better. Indeed, there are a few tweaks that could
help, although I would prefer to await more statistical human error data
before changing the design, as the detection rate varies significantly if
we label the 3sub through 6sub errors as insertions and deletions instead, which
they probably are; Verhoeff’s study does not distinguish them.
While writing the algorithm, I worried that perhaps I would not be able to
beat the detection rate of IBAN with a single check character (half the
amount!). The whole point of designing base32check2 was the off-chance
that base32check1 could not. Fortunately, even base32check1 is better than IBAN’s MOD
97-10, mostly thanks to IBAN’s character conversion procedure.
User Study
The computed scores depend on second-hand information from Verhoeff’s study. It
may not map accurately to other alphabets, or alphabet sizes.
Moreover, it does not help determine which alphabet is least error-prone for
human transcription.
Down to the core, we want to minimize the number of times you send money to the
wrong account.
prob_oops = 1 - prob_fine
prob_fine = Σ_{error types} prob_error·prob_detect_error + (1 - prob_error)
prob_error = 1 - (1 - prob_char_error) ^ #chars
- prob_error is the probability that the human committed at least one error of
that type over the whole text.
- prob_char_error is the probability of committing that type of error on a
given character. It depends on the type of error, and α, the size of the
alphabet used: the more symbols there are, the more they look alike.
- prob_detect_error is the probability that the error gets detected. It
depends on the type of error. It is 1 - prob_collision, the probability that
the checksum yields the same result for two inputs separated by an error of a
given type.
- #chars is the payload size. To transmit b bits of information, it is
⌈b ÷ log2(α)⌉ + checksum size.
We can get the probability of a single-character substitution, for instance,
from the probability p of a substitution on a text holding b bits:
prob_char_error = 1 - 2 ^ (log2(1 - p) ÷ ⌈b ÷ log2(α)⌉)
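To make the model concrete, here is a tiny sketch for a single error type; the
values of p and b are made up for illustration, and the detection probability is
base32check1’s 2sub figure from the table:

```js
// Illustrative numbers only: probability of an undetected 2sub error when
// transmitting b payload bits in base32 with one check character.
const p = 0.001;                    // assumed probability of a 2sub over the whole text
const b = 80, alpha = 32;           // payload bits and alphabet size
const payloadChars = Math.ceil(b / Math.log2(alpha));
const chars = payloadChars + 1;     // plus one check character
const probCharError = 1 - 2 ** (Math.log2(1 - p) / payloadChars);
const probError = 1 - (1 - probCharError) ** chars;
const probDetect = 1 - 0.03022;     // base32check1 on 2sub errors, from the table
const probFine = probError * probDetect + (1 - probError);
console.log({ probCharError, probError, probOops: 1 - probFine });
```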
So, all we need, to find an optimal α and checksum size, is a user study
estimating the probability of human mistakes.
We will look at A = 7 alphabets:
- base10 (digits) is found in most identifiers, including credit card
numbers. Checksums include Luhn, MOD 11-10 and MOD 97-10,
- base16 is found in most binary representations, including IPv6. A
checksum for it is presented in the paper we based our work on,
- base32 (our main focus),
- base36, seen in IBAN, has MOD 37-36 and MOD 1271-36,
- base58 as used by Bitcoin addresses; we don’t have a checksum for it, but
one could be designed based on either MOD algorithms or the paper’s, although the latter would be more complicated, as it would use a
combination of GF(2) and GF(29),
- base64, mostly to see how much base58 improves on it, can also easily
have a checksum designed for it,
- base1024, a fresh design, used nowhere, where 10 bits are represented as 3
alternating consonants and vowels, which means base32check2 would work just
fine. It has roughly the same information density as base10, which makes it an
interesting comparison. Would all base10 identifiers be better served by being
shown as base1024 instead?
Let’s ask each participant to submit S = 50 entries.
To get error estimations accurate to the percent, we need 100÷(S÷A) = 14
participants; we need ten times that to remove most of the estimation error.
All participant information is anonymous and sent securely.
Give it a try! Contribute to the study! Statistics will appear as you make
progress. Plus, it is pretty relaxing.
References