The espadrine blog. Tech deep dives, discoveries, and analyses. https://espadrine.github.io/blog/feed.xml 2022-01-27T16:36:40Z https://espadrine.github.io/blog/posts/sometimes-rewriting-in-another-language-works.html Sometimes, rewriting in another language works 2022-01-27T16:36:40Z Sometimes, rewriting in another language works

Computing…

The constraint should be 5 characters:
o = letter at right spot, x = wrong spot, . = not in the word.

This month, the game of Wordle has exploded in popularity. The principle is simple: every day, a secret 5-letter word must be guessed by everyone on the planet from the comfort of their confinement. We each have 6 guesses: every time we make one, the website lists all correctly positioned letters, all incorrectly positioned ones, and the letters that are not in the solution.

## Strategy §

Playing the game, there are many strategies you quickly pick up.

1. English letter frequencies help choose guesses that divide the search space faster.
2. You can make guesses that don’t follow previously reveiled constraints: a completely different guess can narrow down the possibilities much more.
3. Using common vowels on the first guess and common consonants on the second, helps narrow down possibilities very fast.

Many bloggers have tried their hand at optimal strategies. At the time, While skimming through them, it felt like each approach had issues. That motivated me to add to the pile of suboptimal techniques with my own, hopefully less flawed, algorithm.

## Principles §

I like to start tackling optimality problems by betting on brute force. It sometimes works right off the bat, and when it does not, I learn a lot from it on what to improve.

Many players focus on letter frequency, but while it can feel like brute force, it does not brute-force the right solution. Indeed, the guess with the most likely letters does not segment the search space fairly. A lot of words have the most likely letters, after all.

One (admittedly suboptimal, but superior) target, is to take the list of possible solutions, and find the guess which, on average, reduces the search space the most, by eliminating the most candidate words.

This is where we get our hands dirty with the nitty-gritty details.

There are two different lists in the JS code of the game’s webpage.

1. One contains the 2315 potential solutions,
2. and the other contains 10657 other words that will never be solutions, but that can still be used as guesses. Red herrings, if you will.

You may think the red herrings are useless, but in fact, they can achieve a greater elimination rate.

So, the algorithm needs to loop over all guesses.

For each one, it must compute the number of eliminated solutions, for each possible solution.

To do that, it must loop over all possible solutions, compute the constraint information that Wordle would produce for that guess, and then loop over all possible solutions again to count how many those would reject.

There are five letters in each guess, producing five constraints to check, and some constraints require looping over all letters of the solution.

Overall, the number of operations is (10657+2315) × 2315 × 2315 × 5 × 5. That is 1,737,996,667,500 steps.

That number is in that weird range of brute-force expectations: it is not unreachable, but it is not trivial either. Converting to nanoseconds, it is about half an hour; but we can just feel our gut yelling that each operation will take a bit more than a nanosecond.

## Optimism is healthy, right? §

Off I went, candidly implementing the algorithm in Julia.

(I like Julia for its type system and extensive standard library, which makes it my go-to desk calculator. It also cares about performance, relying on an LLVM-based JIT compiler.)

Since I knew I was in for a long compute time, I added a fancy duration estimator based on the average time the top loop took.

Sadly, this is what I saw:

$julia play-optimally.jl Ranking guesses... 1/12972 words (24950 min left)  Hang on, that is not half an hour… It is closer to 17 days! At that point, I really had three options: 1. Optimizing, 2. Parallelizing, 3. Rewriting in a faster language. ## Pessimism is healthy, too § I felt I could rewrite it in Rust within two hours. My mental model of how the compiled assembly should be, made me believe that Julia’s compiler was doing something wrong, and that Rust would indeed be much faster. However, I had no certainty, and a common maxim states that rewriting is usually less fruitful than redesigning with better algorithms. Parallelizing was another option, but I did not jump with joy at the idea of bringing my laptop to its knees by squeezing each of my 8 threads for… let me calculate… upwards of 50 continuous hours, at best?! So I made a compromise: I would spend 2 hours trying to fight the Julia compiler and figure out how to get it going fast. But the volume of possible compiler bailouts is large, and the sweet spot of JIT performance is a needle in a haystack. Reader, you know where the story is going: those two hours were fruitless, and the next two successfully rewrote it all in Rust. $ time cargo run
Ranking guesses... 11/12972 words (441 min left)


And just like that, I was granted the best Wordle guess in 7h, 15% of the time that parallelizing Julia would have given me.

(Edit: using time cargo run --release even brings it all the way down to 30 minutes!)

(Edit 2: a Julia thread investigated the code. A big part of the time difference came from a mistake in the Julia code, which caused the algorithm to search through all possible guesses as potential solutions, instead of only the accepted solutions. That brings it down to about 4h on the same machine.

The other improvements which allow Julia to match Rust performance involve using byte arrays instead of strings, using integer types with a fixed number of bits, and avoiding non-constant globals.

Along with a few other fascinating tricks that I recommend you read, the Julia version takes a stunning 20 minutes, better than the simple Rust version (which, to be fair, can likely achieve it too with similar tricks.))

## A warning on what the moral is §

Sure, the main conclusion is this:

Sometimes, rewriting in a more suitable language will get you the answer faster than any time spent on optimization or parallelization can give you.

(Edit: and, as someone from the Julia thread puts it, sometimes you might not make the same bug in both implementations!)

Had I rewritten the code directly instead of spending two hours vainly optimizing Julia code, I would have got the answer two hours earlier.

But more importantly, this is not about the language itself. It is about the quality of the compiler’s optimizers. Rewriting it in JS would have given a similar speedup.

Julia has outstanding qualities, and it can be fast. It just has not had the sheer magnitude of optimizational grind that V8 and Rust have accumulated over the years.

And in case you were hoping to know why I called this algorithm suboptimal: consider that a first guess could leave only 5 possible solutions which all next guesses would narrow down to 2; whilst a better first guess could leave 6 possible solutions which the next guesses narrow down to 1.

Algorithms do matter in the end too.

]]>
https://espadrine.github.io/blog/posts/mean-range-of-bell-curve-distributions.html Mean Range of Bell Curve Distributions 2021-05-02T14:52:56Z Mean Range of Bell Curve Distributions

Abstract: When sampling several data points from a known statistical distribution, a valuable indication of the spread is the range of the set of values obtained. Since the sampling is probabilistic, the best estimate we can hope for is the expected value of the range. That mean range, along with the expected maximum and minimum values of the sampling set, are traditionally difficult to compute with existing means. We present a method to perform that computation, and its implications on the correct computation of the balls-into-bins problem.

Randomly placing this number of balls: = 340282366920938463463374607431768211456,
in this number of bins: = 340282366920938463463374607431768211456,
causes the least filled bin to contain 0 balls, whilst the most filled bin contains 33 balls, which is 33 more.

Introduction: Currently, there does not seem to be any exact, computable formula for the expected range of obtaining samples from a distribution. The best known formula is:

$n \int_{0}^{1} x(G)[G^{n-1} - (1-G)^{n-1}] \, \mathrm{d}G$

which requires integrating with respect to the the inversed distribution.

In the case of discrete random variables, such as the Binomial distribution, a solution can be computed in a finite number of steps: $\sum_{x=1}^{N-t} [G(x+t)-G(x-1)]^n - [G(x+t)-G(x)]^n - [G(x+t-1)-G(x-1)]^n + [G(x+t-1)-G(x)]^n, t > 0$.

However, the number of iterations scales with $n$, and I wish to compute exact results for values of $n$ that cannot be iterated through within the timespan of the universe, such as $n = 2^{256}$.

As a consequence, solving the problem of the balls into bins is done inexactly, through approximations such as $\frac{\log(n)}{\log(\log(n))}$. We wish to compute exact results here.

## 1. Generic Derivation §

Consider a distribution with probability density function $\varphi$. Its associated random variable, $X$, can be either real-valued or discrete.

We observe a sample of $N$ independent values taken from that distribution.

The question we ask is: What is the range of values that have a probability ≥ $\gamma$ (across samplings of $N$ values) of appearing in the sample? For instance, for a mean range, one would pick $\gamma = \frac{1}{2}$.

Despite being potentially continuous, we can research the probability that a given value appears at least once in the sample. That is $1 - P_{excluded}$, where $P_{excluded}$ is the probability that the value does not appear in the sample.

In turn, given that the sample is independently drawn each time, the probability that a value is not drawn once, is $P_{excluded} = (1 - \varphi(x))^N$.

Thus, the probability that a given value is in the sample, is $1 - (1 - \varphi(x))^N$. By definition, that probability is equal to $\gamma$.

We can therefore derive that the values $x$ that are in range, follow the equation:

$\varphi(x) \geq 1 - \sqrt[N]{1 - \gamma}$

When $\varphi$ is a bell curve distribution, the corresponding equality has two solutions for $x$.

## 2. Application to the Normal §

Some bell distributions are more easily invertible. Thankfully, this is true of the Normal distribution, which will enable us to produce good estimations for all distributions, thanks to the central limit theorem.

First, let us derive the exact Normal solution. We have $\varphi(x) : \mathcal{N}(\mu, \sigma^2)$:

$\varphi(x) = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\sigma^2\pi}}$

Thus the solution to the general inequality is:

$x \in \left[ \mu \pm \sqrt{-2\sigma^2 \ln(\sqrt{2\sigma^2\pi}(1-\sqrt[N]{1-\gamma}))} \right]$

From this, we can compute the maximum and minimum exactly, along with the mean range, which follows this formula:

$2\sqrt{-2\sigma^2 \ln(\sqrt{2\sigma^2\pi}(1-\sqrt[N]{1-\gamma}))}$

## 3. Application to the Binomial §

The PDF of a binomial distribution $\beta(x) : \mathcal{B}(m, p)$, the probability of a number $x$ of positive events among $m$ events with probability $p$ of positivity, follows this equation:

$\beta(x) = {m \choose x} p^x (1-p)^{m-x}$

While $x$ is a discrete integer, the distribution of $\mathcal{B}(m, p)$ is also bell-shaped. Thus the generic derivation above can also be applied.

Two issues arise when using that derivation, however:

• Unlike the Normal, the binomial coefficient cannot be elegantly inverted, which prevents us from producing an exact formula.
• For large values of $m - x$ (around $2^{128}$), calculating that binomial coefficient exactly is too computationally expensive to yield a result within a lifetime.

We can however devise an algorithmic method by which we obtain an exact answer regardless.

The first issue can be solved by computing $\beta(x)$ for all values of $x$ until the bell curve plummets back below $\tau = 1-\sqrt[N]{1-\gamma}$. However, that method is impractical when $x_{max}$ is too large.

Instead of going through each value of $x$, our algorithm can search for the right value through increasingly accurate approximations, similar to the way Newton’s method works.

This convergence works by:

1. Using the best model we have of the distribution,
2. Gathering information from the estimated root,
3. Updating the model to be even more precise,
4. Iterating, similar to an interpolation search, until eventually, we find two consecutive integers $x_{max}$ and $x_{max}+1$ where the first is above the limit (obtained from the generic derivation), and the other is not.

The two challenges in implementing this algorithm are:

• Problem 1: Evaluating $\beta(x)$ is too expensive for large $x$ using integer arithmetic operations,
• Problem 2: Establishing a good and computable model for the distribution, and updating it in such a way that ensures eventual and fast convergence.

### 3.1. Evaluating the PDF §

We use the classic solution: first, convert the binomial coefficient formula to use the Gamma function.

$\beta(x) = \frac{\Gamma(m+1)}{\Gamma(x+1)\Gamma(m-x+1)} p^x (1-p)^{m-x}$

Then, to avoid handling large gamma results, we rely on an exact computation of the log-gamma. We can use an arbitrary-precision library to ensure we get an error below the integer result we end up with. (To find the right precision to set for the algorithm, we can rely on exponential binary search.)

$\beta(x) = e^{ \ln\Gamma(m+1) - \ln\Gamma(x+1) - \ln\Gamma(m-x+1) + x \ln(p) + (m-x) \ln(1-p) }$

### 3.2. Converging to the range extrema §

Given the shape of the PDF, and its reflectional symmetry, we can bound the expected maximum sample to be between the mean and the end of the curve.

$mp \leq x_{max} \leq m$

We set those bounds as $x_{low}$ and $x_{high}$, and estimate the value of $x_{max}$ from its Gaussian approximation:

$\hat{x}_{max} = \left\lceil mp + \sqrt{-2mp(1-p) \ln(\sqrt{2mp(1-p)\pi}(1-\sqrt[N]{1-\gamma}))} \right\rceil$

We can then compute the accurate value of $\beta(\hat{x}_{max})$. If that value is below $\tau$, we are too far: we set the upper bound $x_{high}$ to our $\hat{x}_{max}$ estimate. Otherwise, we set $x_{low}$ to it instead. Plot of 𝔅(200, 10-1) in red, and its approximating Gaussian in blue. τ is shown in green. Note how the Normal approximation is off by one on the minimum, but the shape of its curve is a good fit locally, apart from being horizontally off.

Then, we must improve our estimated model.

Newton’s method is insufficient, because it does not guarantee convergence, and because its convergence is comparatively slow as a result of the flatness of the curve.

Binary search, taking the average of $x_{low}$ and $x_{high}$, produces a reliable convergence in $O(\log(m))$, but it does not use our existing knowledge of the shape of the curve.

The normal curve is quite a good approximation, especially with large values. (With small values, the convergence is fast anyway.)

However, past the first estimation, the normal curve is too far from where the binomial curve intersects $\tau$. Thus we must slide it, either to the left or to the right, so that it coincides laterally with the real point $\{\hat{x}_{max}, \beta(\hat{x}_{max})\}$ whose abscissa is an estimate of $x_{max}$.

That new curve is another Gaussian distribution, with a mean that solves the equation $\varphi_{\mu, \sigma^2}(\hat{x}_{max}) = \beta(\hat{x}_{max})$:

$\mu = \hat{x}_{max} - \sqrt{-2\sigma^2\ln( \beta(\hat{x}_{max}) \sqrt{2\sigma^2\pi} )}$

However, there is no guarantee that it will intersect $\tau$ between $x_{low}$ and $x_{high}$. As a fallback, if it is out of bounds, we ignore the normal estimate and use the average of $x_{low}$ and $x_{high}$, just like binary search.

Once the bounds $x_{low}$ and $x_{high}$ have converged into adjacent integers, we have found $x_{max} = x_{low}$.

As for $x_{min}$, the process is symmetrically identical, except it occurs within the bounds:

$0 \leq x_{min} \leq mp$

and using the following, reminiscent mean update:

$\mu = \hat{x}_{min} + \sqrt{-2\sigma^2\ln( \beta(\hat{x}_{min}) \sqrt{2\sigma^2\pi} )}$

The algorithmic complexity of the convergence is in $O(\log(m))$ worst-case, degrading to binary search, but is empirically $O(\log(\log(m)))$ on average: Plot of the number of iterations necessary to reach convergence, when computing the maximum of a sample of 2i elements from a Binomial 𝔅(2i, 2-2i) distribution, in blue. In red is log2(i), which matches the shape of convergence speed asymptotically. Not shown is the number of iterations for binary search. It would be a straight diagonal: 2i takes i iterations.

## 4. Balls Into Bins §

This result allows exact computation of solutions for a well-known problem: “Given $N$ balls each randomly placed into $R$ bins, how many do the most and least filled bin have?”

The problem is a sampling of $N$ values of the Binomial distribution $\mathcal{B}(N, \frac{1}{R})$. Thus, the mean maximum and minimum are its solutions.

The widget at the top of the page gives an instant and exact result for this problem, for values below $2^{1024}$.

### 4.1. Hash tables §

One use for this problem is in assessing the worst-case complexity of hash table operations to match operational requirements. Indeed, the hash output is meant to be uniformly distributed; in other words, a PRF: one such example is SipHash.

Since the implementation of hash collisions typically require linear probing, library developers strive for a bounded maximum number of hashes that map to the same table entry. Typically, they use a load factor: if more than 87.5% of the table is filled, the table size is doubled and rehashed.

The widget above can help show that this approach does not yield a bounded maximum, by inputting 0.875*2^i balls into 2^i bins:

Table size Max bucket size
28 4
216 7
232 11
264 19

As you can see, the growth is very slow, which satisfies engineering constraints. If there was some imperative to be bounded below a certain value, the table algorithm could use the principles laid out in this article to dynamically compute the load factor that keeps the maximum bucket size below the imposed limit.

(A notable exception to this result is Cuckoo Hashing, whose maximum bucket size has a different formula.)

### 4.2. Hash chains §

Another situation where this problem finds relevance is in cryptography. First, in the field of collision-resistant functions. In a hash chain, the root hash has a single hash as input. The 256-bit input of the SHA-256 primitive randomly maps to its 256-bit output. There will be one particular hash output that 57 distinct inputs produce. The pigeonhole principle dictates that this removes possible outputs; and indeed, about 38% of the $2^{256}$ possible output hashes cannot be produced. In other words, if you take a random 256-bit hex string, it will not be a valid output in one case out of three.

Indeed, the probability that a bin has no balls after the first link in the chain is $\beta_{n = 2^{b},\,p = 2^{-b}}(0) = (1 - 2^{-b})^{2^{b}} \xrightarrow{\, b \rightarrow \infty \,} \frac{1}{e}$ for a $b$-bit hash. On the $i$-th chain of the link, the same phenomenon strikes again, and only $h_i = 1 - (1 - 2^{-b})^{2^{b}h_{i-1}}$ remain (with $h_0 = 1$ since we start with 100%).

Of course, after that initial 38% loss, the subsequent losses are lesser, but $h_i \xrightarrow{\, i \rightarrow \infty \,} 0$. After just 100 iterations, only 2% of possible hashes remain. After the typical 10k iterations of PBKDF2, only 0.02% remain.

It is not a vulnerability per se: it only removes about 13 bits off a 256-bit space, or 7 bits of security against collision resistance. Technically, when the number of iterations reaches $2^{\frac{b}{2}}$, the probability formula breaks down; a hash chain will loop around after an average of $2^{\frac{b}{2}}$ operations. This does showcase yet how simple designs can yield subtle consequences.

### 4.3. Block ciphers §

Consider an ideal $b$-bit block cipher (PRP) using a $b$-bit key, as with AES-128. We are an attacker examining a $b$-bit ciphertext block generated from the encryption of a low-entropy plaintext block of the same size.

While it is true that for a given key, each plaintext block produces a single ciphertext block and vice-versa, for a given ciphertext block, each key maps to a random plaintext block. Keys can be seen as balls, and plaintext blocks as bins.

Thus, conversely, about $\frac{100}{e} \approx 37\%$ of plaintext blocks have zero keys that decrypt the ciphertext to them. Thus, if the plaintext block contained a single bit of information, such as a presidential vote in a majoritarian election, and if finding the number of valid keys was computationally feasible, the adversary could decrypt the vote with probability $\frac{1}{e}$.

Yet again, it is not a vulnerability, since the only currently-known way to do so is to brute-force the key, but it creates an unintuitive attack that a larger key would dismiss.

Similarly, in an indistinguishability under chosen-plaintext attack (IND-CPA) against an ideal $b$-bit block cipher with a $2b$-bit key, as AES-256 is assumed to be, the set of possible couples of plaintext/ciphertext blocks has $2^{b+b} = 2^{2b}$, while the set of keys has $2^{2b}$. Since the latter randomly map to the former, when sending two plaintexts and receiving the encryption of one of them, there is no possible key that encrypts the secretly discarded plaintext to the provided ciphertext, with probability $\frac{1}{e}$. Assuming again that we can simply compute whether there exists no valid key that encrypts a plaintext to a ciphertext in polynomial time, we would expect to need a mere three plaintext queries before we can tell which of the two plaintexts sent every time, the first or the second, is the one that gets encrypted, with an advantage of 1.

Also, we can now say precisely that, in a know-plaintext attack (KPA) with AES-256, the single known plaintext/ciphertext pair can have at most 57 valid keys that do encrypt the plaintext block to the ciphertext block, on average. That is the number of balls in the most filled of $2^{256}$ bins receiving $2^{256}$ balls. Having that many keys is unlikely however: 58% of all valid pairs will have a single valid key, 29% will have two, 10% will have three, and so forth.

Finally, still in AES-256, for a given ciphertext block, the plaintext block with the most keys decrypting to it ($K_{max}$) has about 600 quintillion more keys than the plaintext block with the least keys ($K_{min}$), making it more likely. That may superficially seem like a disadvantage of using larger keys than the block size. However, the advantage gained compared to the least likely plaintext block is only an increased probability by $\frac{K_{max}-K_{min}}{2^{256}} = 2^{-187}$, which is not only extremely negligible, but also smaller than in AES-128 ($\frac{K_{max_{128}}-K_{min_{128}}}{2^{128}} = 2^{-123}$).

## Conclusion §

We described an exact formula for the mean range and extrema of a Normal distribution. We used it to produce an algorithm that can compute the extrema of any bell curve distribution, with the example of the Binomial distribution. We optimized that algorithm to run in $O(\log(\log(m)))$ average time, and discussed a few exact consequences that used to be unconfirmed approximations.

Bibliography:

]]>