Mean Range of Bell Curve Distributions

Abstract: When sampling several data points from a known statistical distribution, a valuable indication of the spread is the range of the set of values obtained. Since the sampling is probabilistic, the best estimate we can hope for is the expected value of the range. That mean range, along with the expected maximum and minimum values of the sampling set, is traditionally difficult to compute with existing means. We present a method to perform that computation, and its implications for the correct computation of the balls-into-bins problem.

[Interactive widget] Randomly placing this number of balls: 340282366920938463463374607431768211456, into this number of bins: 340282366920938463463374607431768211456, causes the least filled bin to contain 0 balls, whilst the most filled bin contains 33 balls, which is 33 more.

Introduction: Currently, there does not seem to be any exact, computable formula for the expected range of a set of samples drawn from a distribution. The best-known formula is:

$$n \int_{0}^{1} x(G)\left[G^{n-1} - (1-G)^{n-1}\right] \mathrm{d}G$$

which requires integrating with respect to the inverse distribution $x(G)$ (the quantile function).

In the case of discrete random variables, such as the Binomial distribution, a solution can be computed in a finite number of steps: the probability of observing a range of exactly $t > 0$ is $\sum_{x=1}^{N-t} \left([G(x+t)-G(x-1)]^n - [G(x+t)-G(x)]^n - [G(x+t-1)-G(x-1)]^n + [G(x+t-1)-G(x)]^n\right)$, and the mean range follows by weighing each $t$ by that probability.
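As a minimal sketch (function names are ours), here is a direct Python transcription of that summation for the Binomial case, where $G$ is the CDF and $n$ the sample count:

```python
from math import comb

def binom_cdf(x, m, p):
    # CDF G(x) of B(m, p) by direct summation; G(-1) = 0.
    if x < 0:
        return 0.0
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(min(x, m) + 1))

def mean_range(n, m, p):
    # E[range] of n samples of B(m, p): sum over t of t * P(range = t).
    # O(m²) work: fine for small m, hopeless for m ~ 2^256.
    G = [binom_cdf(x, m, p) for x in range(-1, m + 1)]  # G[i] = CDF(i - 1)
    expected = 0.0
    for t in range(1, m + 1):
        p_t = 0.0
        for x in range(0, m - t + 1):
            p_t += ((G[x+t+1] - G[x])**n - (G[x+t+1] - G[x+1])**n
                    - (G[x+t] - G[x])**n + (G[x+t] - G[x+1])**n)
        expected += t * p_t
    return expected

# e.g. mean_range(10, 20, 0.5): expected range of 10 draws from B(20, 1/2).
```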

However, the number of iterations scales with $n$, and we wish to compute exact results for values of $n$ that cannot be iterated through within the timespan of the universe, such as $n = 2^{256}$.

As a consequence, the balls-into-bins problem is solved inexactly, through approximations such as $\frac{\log(n)}{\log(\log(n))}$. We wish to compute exact results here.

1. Generic Derivation

Consider a distribution with probability density function $\varphi$. Its associated random variable, $X$, can be either real-valued or discrete.

We observe a sample of $N$ independent values taken from that distribution.

The question we ask is: what is the range of values that have a probability $\geq \gamma$ (across samplings of $N$ values) of appearing in the sample? For instance, for a mean range, one would pick $\gamma = \frac{1}{2}$.

Even though the distribution may be continuous, we can study the probability that a given value appears at least once in the sample. That is $1 - P_{excluded}$, where $P_{excluded}$ is the probability that the value does not appear in the sample.

In turn, given that the sample is independently drawn each time, the probability that a value is never drawn is $P_{excluded} = (1 - \varphi(x))^N$.

Thus, the probability that a given value is in the sample is $1 - (1 - \varphi(x))^N$. The edge of the range is where that probability equals $\gamma$.

We can therefore derive that the values $x$ that are in range satisfy the inequality:

$$\varphi(x) \geq 1 - \sqrt[N]{1 - \gamma}$$
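Numerically, the threshold $\tau = 1 - \sqrt[N]{1-\gamma}$ underflows double precision for the huge $N$ we target, so an arbitrary-precision computation helps. A minimal sketch with Python's mpmath (the precision setting is an assumption, to be tuned):

```python
import mpmath as mp

mp.mp.prec = 300  # bits of working precision; enough to resolve τ for N = 2^256

def threshold(N, gamma=mp.mpf(1)/2):
    # τ = 1 - (1-γ)^(1/N), computed as -expm1(ln(1-γ)/N) to avoid cancellation.
    return -mp.expm1(mp.log(1 - mp.mpf(gamma)) / N)

# A value x is in range when φ(x) >= threshold(N).
# For the mean range (γ = 1/2), threshold(N) ≈ ln(2)/N for large N.
```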

When $\varphi$ is a bell curve distribution, the corresponding equality has two solutions for $x$.

2. Application to the Normal

Some bell distributions are more easily invertible than others. Thankfully, this is true of the Normal distribution, which will enable us to produce good estimates for all distributions, thanks to the central limit theorem.

First, let us derive the exact Normal solution. We have $\varphi(x) : \mathcal{N}(\mu, \sigma^2)$:

$$\varphi(x) = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\sigma^2\pi}}$$

Thus the solution to the general inequality is:

$$x \in \left[ \mu \pm \sqrt{-2\sigma^2 \ln\left(\sqrt{2\sigma^2\pi}\,\left(1-\sqrt[N]{1-\gamma}\right)\right)} \right]$$

From this, we can compute the maximum and minimum exactly, along with the mean range, which follows this formula:

$$2\sqrt{-2\sigma^2 \ln\left(\sqrt{2\sigma^2\pi}\,\left(1-\sqrt[N]{1-\gamma}\right)\right)}$$
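A direct transcription (function names are ours; it assumes $\sqrt{2\sigma^2\pi}\,\tau < 1$, so that the logarithm is negative and the square root real):

```python
import mpmath as mp

def normal_extrema(N, mu, sigma2, gamma=mp.mpf(1)/2):
    # Solves φ(x) = 1 - (1-γ)^(1/N) for N(mu, sigma2): the expected minimum
    # and maximum of a sample of N values. Assumes sqrt(2σ²π)·τ < 1.
    tau = -mp.expm1(mp.log(1 - gamma) / N)
    half = mp.sqrt(-2 * sigma2 * mp.log(mp.sqrt(2 * mp.pi * sigma2) * tau))
    return mu - half, mu + half

def normal_mean_range(N, sigma2, gamma=mp.mpf(1)/2):
    lo, hi = normal_extrema(N, 0, sigma2, gamma)
    return hi - lo  # = 2·sqrt(-2σ²·ln(sqrt(2σ²π)·τ))
```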

3. Application to the Binomial

The PDF of a binomial distribution $\beta(x) : \mathcal{B}(m, p)$, the probability of a number $x$ of positive events among $m$ events with probability $p$ of positivity, follows this equation:

$$\beta(x) = {m \choose x} p^x (1-p)^{m-x}$$

While $x$ is a discrete integer, the distribution of $\mathcal{B}(m, p)$ is also bell-shaped. Thus the generic derivation above can also be applied.

Two issues arise when using that derivation, however: unlike the Normal case, the inequality has no closed-form solution for $x$, and the factorials involved are far too large to evaluate naively.

We can however devise an algorithmic method by which we obtain an exact answer regardless.

The first issue can be solved by computing $\beta(x)$ for all values of $x$ until the bell curve plummets back below $\tau = 1-\sqrt[N]{1-\gamma}$. However, that method is impractical when $x_{max}$ is too large.
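For reference, that brute-force scan is a few lines (a float-based sketch, only viable for small $m$; names are ours):

```python
from math import comb, expm1, log

def scan_max(n, m, p, gamma=0.5):
    # Walk x upward from the mode until β(x) falls back below τ.
    tau = -expm1(log(1 - gamma) / n)
    beta = lambda x: comb(m, x) * p**x * (1 - p)**(m - x)
    x = int(m * p)  # the mode, where β is assumed to be above τ
    while x <= m and beta(x) >= tau:
        x += 1
    return x - 1

# scan_max(10**6, 1000, 0.5) finishes in a few dozen steps;
# for m ~ 2^256, this loop could never finish.
```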

Instead of going through each value of xx, our algorithm can search for the right value through increasingly accurate approximations, similar to the way Newton’s method works.

This convergence works by:

  1. Using the best model we have of the distribution,
  2. Gathering information from the estimated root,
  3. Updating the model to be even more precise,
  4. Iterating, similar to an interpolation search, until eventually, we find two consecutive integers $x_{max}$ and $x_{max}+1$ where the first is above the limit (obtained from the generic derivation), and the other is not.

The two challenges in implementing this algorithm are evaluating the PDF with enough precision, and converging to the range extrema; the following two subsections address each in turn.

3.1. Evaluating the PDF

We use the classic solution: first, convert the binomial coefficient formula to use the Gamma function.

$$\beta(x) = \frac{\Gamma(m+1)}{\Gamma(x+1)\Gamma(m-x+1)} p^x (1-p)^{m-x}$$

Then, to avoid handling large Gamma results, we rely on an exact computation of the log-gamma. We can use an arbitrary-precision library to keep the error too small to affect the final integer result. (To find the right precision to set for the algorithm, we can rely on exponential binary search.)

$$\beta(x) = e^{\ln\Gamma(m+1) - \ln\Gamma(x+1) - \ln\Gamma(m-x+1) + x \ln(p) + (m-x) \ln(1-p)}$$
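A sketch with Python's mpmath, whose loggamma is arbitrary-precision (the precision value is an assumed starting point; the exponential binary search mentioned above would tune it):

```python
import mpmath as mp

mp.mp.prec = 512  # bits; an assumed starting precision, raised if insufficient

def binom_pdf(x, m, p):
    # β(x) = exp(lnΓ(m+1) - lnΓ(x+1) - lnΓ(m-x+1) + x·ln(p) + (m-x)·ln(1-p))
    p = mp.mpf(p)
    return mp.exp(mp.loggamma(m + 1) - mp.loggamma(x + 1)
                  - mp.loggamma(m - x + 1)
                  + x * mp.log(p) + (m - x) * mp.log(1 - p))

# Sanity check on small inputs:
# binom_pdf(3, 200, mp.mpf(1)/10) ≈ comb(200, 3) * 0.1**3 * 0.9**197
```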

3.2. Converging to the range extrema

Given the shape of the PDF, and its reflectional symmetry, we can bound the expected maximum sample to be between the mean and the end of the curve.

$$mp \leq x_{max} \leq m$$

We set those bounds as $x_{low}$ and $x_{high}$, and estimate the value of $x_{max}$ from its Gaussian approximation:

$$\hat{x}_{max} = \left\lceil mp + \sqrt{-2mp(1-p) \ln\left(\sqrt{2mp(1-p)\pi}\,\left(1-\sqrt[N]{1-\gamma}\right)\right)} \right\rceil$$

We can then compute the accurate value of $\beta(\hat{x}_{max})$. If that value is below $\tau$, we are too far: we set the upper bound $x_{high}$ to our $\hat{x}_{max}$ estimate. Otherwise, we set $x_{low}$ to it instead.

[Figure: plot of $\mathcal{B}(200, 10^{-1})$ in red, and its approximating Gaussian in blue; $\tau$ is shown in green. Note how the Normal approximation is off by one on the minimum, but the shape of its curve is a good fit locally, apart from being horizontally off.]

Then, we must improve our estimated model.

Newton’s method is insufficient, because it does not guarantee convergence, and because its convergence is comparatively slow as a result of the flatness of the curve.

Binary search, taking the average of $x_{low}$ and $x_{high}$, produces a reliable convergence in $O(\log(m))$, but it does not use our existing knowledge of the shape of the curve.

The normal curve is quite a good approximation, especially with large values. (With small values, the convergence is fast anyway.)

However, past the first estimation, the normal curve is too far from where the binomial curve intersects $\tau$. Thus we must slide it, either to the left or to the right, so that it coincides laterally with the real point $\{\hat{x}_{max}, \beta(\hat{x}_{max})\}$ whose abscissa is an estimate of $x_{max}$.

That new curve is another Gaussian distribution, with a mean that solves the equation $\varphi_{\mu, \sigma^2}(\hat{x}_{max}) = \beta(\hat{x}_{max})$:

$$\mu = \hat{x}_{max} - \sqrt{-2\sigma^2\ln\left( \beta(\hat{x}_{max}) \sqrt{2\sigma^2\pi} \right)}$$

However, there is no guarantee that it will intersect $\tau$ between $x_{low}$ and $x_{high}$. As a fallback, if it is out of bounds, we ignore the normal estimate and use the average of $x_{low}$ and $x_{high}$, just like binary search.

Once the bounds $x_{low}$ and $x_{high}$ have converged into adjacent integers, we have found $x_{max} = x_{low}$.
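Putting sections 3.1 and 3.2 together, here is a self-contained sketch of the search for $x_{max}$ (names are ours; the fixed precision is an assumption where the article would instead tune it by exponential binary search, and it assumes $\tau$ lies below the curve's peak):

```python
import mpmath as mp

mp.mp.prec = 2048  # assumed fixed precision; the article tunes it instead

def log_binom_pdf(x, m, p):
    # ln β(x) via log-gamma, as in section 3.1.
    return (mp.loggamma(m + 1) - mp.loggamma(x + 1) - mp.loggamma(m - x + 1)
            + x * mp.log(p) + (m - x) * mp.log(1 - p))

def expected_max(n, m, p, gamma=mp.mpf(1)/2):
    # Largest integer x with β(x) >= τ, for n samples of B(m, p).
    p = mp.mpf(p)
    log_tau = mp.log(-mp.expm1(mp.log(1 - gamma) / n))  # ln τ
    var = m * p * (1 - p)                               # σ² of the Gaussian model
    half_log = mp.log(2 * mp.pi * var) / 2              # ln sqrt(2σ²π)
    width = mp.sqrt(-2 * var * (half_log + log_tau))    # where the Gaussian meets τ
    lo, hi = int(mp.floor(m * p)), int(m)               # mp <= x_max <= m
    est = int(mp.ceil(m * p + width))                   # initial Gaussian estimate
    while hi - lo > 1:
        if not (lo < est < hi):
            est = (lo + hi) // 2          # fallback: plain binary search step
        lb = log_binom_pdf(est, m, p)
        if lb < log_tau:
            hi = est                      # β(est) < τ: est is past the crossing
        else:
            lo = est                      # β(est) >= τ: est is still in range
        # Slide the Gaussian so it passes through (est, β(est)),
        # then intersect the slid curve with τ for the next estimate.
        offset2 = -2 * var * (lb + half_log)            # (est - μ)²
        if offset2 >= 0:
            est = int(mp.ceil(est - mp.sqrt(offset2) + width))
        else:
            est = lo  # est was left of the mode; force the fallback above
    return lo
```

With $n = m = 2^{128}$ and $p = 2^{-128}$, this should converge in a few iterations and reproduce the widget's most filled bin of 33 balls.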

As for $x_{min}$, the process is symmetrically identical, except it occurs within the bounds:

$$0 \leq x_{min} \leq mp$$

and using the following, mirrored mean update:

$$\mu = \hat{x}_{min} + \sqrt{-2\sigma^2\ln\left( \beta(\hat{x}_{min}) \sqrt{2\sigma^2\pi} \right)}$$

The algorithmic complexity of the convergence is $O(\log(m))$ in the worst case, degrading to binary search, but is empirically $O(\log(\log(m)))$ on average:

[Figure: plot of the number of iterations necessary to reach convergence, when computing the maximum of a sample of $2^i$ elements from a Binomial $\mathcal{B}(2^i, 2^{-2i})$ distribution, in blue. In red is $\log_2(i)$, which matches the shape of the convergence speed asymptotically.]

Not shown is the number of iterations for binary search: it would be a straight diagonal, as $2^i$ takes $i$ iterations.

4. Balls Into Bins

This result allows exact computation of solutions for a well-known problem: “Given $N$ balls each randomly placed into $R$ bins, how many balls do the most and least filled bins contain?”

The problem amounts to a sampling of $R$ values of the Binomial distribution $\mathcal{B}(N, \frac{1}{R})$: each bin's final count is one such value. Thus, the mean maximum and minimum are its solutions.

The widget at the top of the page gives an instant and exact result for this problem, for values below $2^{1024}$.
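In terms of the expected_max sketch from section 3.2 (together with a symmetric expected_min, not shown), the widget's computation would look like:

```python
import mpmath as mp

balls = bins = 2**128
# Each bin's load is one of `bins` samples of B(balls, 1/bins).
most = expected_max(bins, balls, mp.mpf(1) / bins)  # expected_max: section 3.2 sketch
print(most)  # should print 33, matching the widget at the top of the page
```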

4.1. Hash tables

One use for this problem is in assessing the worst-case complexity of hash table operations, to match operational requirements. Indeed, the hash output is meant to be uniformly distributed; in other words, to behave as a PRF, of which SipHash is one example.

Since hash collisions are typically resolved through linear probing, library developers strive for a bounded maximum number of hashes that map to the same table entry. Typically, they use a load factor: if more than 87.5% of the table is filled, the table size is doubled and rehashed.

The widget above can help show that this approach does not yield a bounded maximum, by inputting 0.875*2^i balls into 2^i bins:

| Table size | Max bucket size |
|-----------:|----------------:|
| $2^{8}$ | 4 |
| $2^{16}$ | 7 |
| $2^{32}$ | 11 |
| $2^{64}$ | 19 |
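Under the same assumptions as the section 3.2 sketch, the table above would be reproduced by:

```python
import mpmath as mp

for i in (8, 16, 32, 64):
    bins = 2**i
    balls = bins * 7 // 8  # 87.5% load factor
    # expected_max: the section 3.2 sketch; `bins` samples of B(balls, 1/bins).
    print(i, expected_max(bins, balls, mp.mpf(1) / bins))
```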

As you can see, the growth is very slow, which satisfies engineering constraints. If there were some imperative to stay bounded below a certain value, the table algorithm could use the principles laid out in this article to dynamically compute the load factor that keeps the maximum bucket size below the imposed limit.

(A notable exception to this result is Cuckoo Hashing, whose maximum bucket size has a different formula.)

4.2. Hash chains

Another situation where this problem finds relevance is in cryptography; first, in the field of collision-resistant functions. In a hash chain, the root hash has a single hash as input. The 256-bit input of the SHA-256 primitive randomly maps to its 256-bit output. There will be one particular hash output that 57 distinct inputs produce. The pigeonhole principle dictates that this removes possible outputs; and indeed, about 37% of the $2^{256}$ possible output hashes cannot be produced. In other words, if you take a random 256-bit hex string, it will not be a valid output in about one case out of three.

Indeed, for a $b$-bit hash, the probability that a bin has no balls after the first link in the chain is $\beta_{n = 2^{b},\,p = 2^{-b}}(0) = (1 - 2^{-b})^{2^{b}} \xrightarrow{\, b \rightarrow \infty \,} \frac{1}{e}$. On the $i$-th link of the chain, the same phenomenon strikes again, and only a fraction $h_i = 1 - (1 - 2^{-b})^{2^{b}h_{i-1}}$ of outputs remains (with $h_0 = 1$, since we start with 100%).

Of course, after that initial 37% loss, the subsequent losses are lesser, but $h_i \xrightarrow{\, i \rightarrow \infty \,} 0$. After just 100 iterations, only 2% of possible hashes remain. After the typical 10k iterations of PBKDF2, only 0.02% remain.
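In the large-$b$ limit, $(1 - 2^{-b})^{2^{b} h} \rightarrow e^{-h}$, so the recurrence simplifies to $h_i = 1 - e^{-h_{i-1}}$; a quick sketch to check those figures (the function name is ours):

```python
import math

def remaining_fraction(iterations):
    # h_0 = 1; h_i = 1 - (1 - 2^-b)^(2^b · h_{i-1}) ≈ 1 - exp(-h_{i-1}) for large b.
    h = 1.0
    for _ in range(iterations):
        h = -math.expm1(-h)  # 1 - exp(-h), without cancellation
    return h

# remaining_fraction(1)     ≈ 0.632 (the initial ~37% loss)
# remaining_fraction(100)   ≈ 0.02  (2% of hashes remain)
# remaining_fraction(10000) ≈ 2e-4  (0.02%, after typical PBKDF2 iteration counts)
```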

It is not a vulnerability per se: it only removes about 13 bits off a 256-bit space, or 7 bits of security against collision resistance. Technically, when the number of iterations reaches $2^{\frac{b}{2}}$, the probability formula breaks down; a hash chain will loop around after an average of $2^{\frac{b}{2}}$ operations. This does showcase how simple designs can yield subtle consequences.

4.3. Block ciphers

Consider an ideal $b$-bit block cipher (PRP) using a $b$-bit key, as with AES-128. We are an attacker examining a $b$-bit ciphertext block generated from the encryption of a low-entropy plaintext block of the same size.

While it is true that for a given key, each plaintext block produces a single ciphertext block and vice-versa, for a given ciphertext block, each key maps to a random plaintext block. Keys can be seen as balls, and plaintext blocks as bins.

Conversely then, about $\frac{100}{e} \approx 37\%$ of plaintext blocks have zero keys that decrypt the ciphertext to them. Thus, if the plaintext block contained a single bit of information, such as a presidential vote in a majoritarian election, and if finding the number of valid keys were computationally feasible, the adversary could decrypt the vote with probability $\frac{1}{e}$.

Yet again, it is not a vulnerability, since the only currently-known way to do so is to brute-force the key, but it creates an unintuitive attack that a larger key would prevent.

Similarly, in an indistinguishability under chosen-plaintext attack (IND-CPA) against an ideal $b$-bit block cipher with a $2b$-bit key, as AES-256 is assumed to be, the set of possible plaintext/ciphertext block pairs has $2^{b+b} = 2^{2b}$ elements, as does the set of $2^{2b}$ keys. Since the latter map randomly to the former, when sending two plaintexts and receiving the encryption of one of them, there is, with probability $\frac{1}{e}$, no possible key that encrypts the secretly discarded plaintext to the provided ciphertext. Assuming again that we can compute in polynomial time whether no valid key encrypts a given plaintext to a given ciphertext, each query rules out the discarded plaintext with probability $\frac{1}{e}$, so we would expect a mere three plaintext queries ($e \approx 2.7$) before we can tell which of the two plaintexts sent each time, the first or the second, is the one that gets encrypted, reaching an advantage of 1.

Also, we can now say precisely that, in a known-plaintext attack (KPA) with AES-256, the single known plaintext/ciphertext pair can have at most 57 valid keys that encrypt the plaintext block to the ciphertext block, in expectation: that is the mean number of balls in the most filled of $2^{256}$ bins receiving $2^{256}$ balls. Having that many keys is unlikely, however: 58% of all valid pairs will have a single valid key, 29% will have two, 10% will have three, and so forth.
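Those percentages follow from modelling a pair's key count as a Poisson(1)-distributed bin load, conditioned on the pair being valid (at least one key); a quick check (the function name is ours):

```python
import math

def valid_pair_fraction(k):
    # P(load = k | load >= 1) with load ~ Poisson(1): (e⁻¹/k!) / (1 - e⁻¹).
    return math.exp(-1) / math.factorial(k) / -math.expm1(-1)

# valid_pair_fraction(1) ≈ 0.58, valid_pair_fraction(2) ≈ 0.29,
# valid_pair_fraction(3) ≈ 0.10, matching the percentages above.
```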

Finally, still in AES-256, for a given ciphertext block, the plaintext block with the most keys decrypting to it ($K_{max}$) has about 600 quintillion more keys than the plaintext block with the least keys ($K_{min}$), making it more likely. That may superficially seem like a disadvantage of using keys larger than the block size. However, the advantage gained compared to the least likely plaintext block is only an increased probability of $\frac{K_{max}-K_{min}}{2^{256}} = 2^{-187}$, which is not only extremely negligible, but also smaller than in AES-128 ($\frac{K_{max_{128}}-K_{min_{128}}}{2^{128}} = 2^{-123}$).

Conclusion

We described an exact formula for the mean range and extrema of a Normal distribution. We used it to produce an algorithm that can compute the extrema of any bell curve distribution, with the example of the Binomial distribution. We optimized that algorithm to run in $O(\log(\log(m)))$ average time, and discussed a few exact consequences that used to be unconfirmed approximations.
