Mean Range of Bell Curve Distributions
Abstract:
When sampling several data points from a known statistical distribution,
a valuable indication of the spread is the range of the set of values obtained.
Since the sampling is probabilistic,
the best estimate we can hope for is the expected value of the range.
That mean range,
along with the expected maximum and minimum values of the sample,
is traditionally difficult to compute exactly.
We present a method to perform that computation,
and discuss its implications for exact solutions of the balls-into-bins problem.
Introduction:
Currently, there does not seem to be any exact, computable formula
for the expected range of a set of samples drawn from a distribution.
The best known formula is:
$$n\int_0^1 x(G)\left[G^{n-1}-(1-G)^{n-1}\right]\,dG$$
which requires integrating with respect to the inverse of the distribution.
In the case of discrete random variables, such as the Binomial distribution,
a solution can be computed in a finite number of steps:
$$\sum_{x=1}^{N-t}\left\{[G(x+t)-G(x-1)]^n-[G(x+t)-G(x)]^n-[G(x+t-1)-G(x-1)]^n+[G(x+t-1)-G(x)]^n\right\},\quad t>0.$$
However, the number of iterations scales with n,
and we wish to compute exact results for values of n
that cannot be iterated through within the timespan of the universe,
such as $n = 2^{256}$.
As a consequence,
the balls-into-bins problem is currently solved inexactly,
through approximations such as $\frac{\log(n)}{\log(\log(n))}$.
We wish to compute exact results here.
1. Generic Derivation §
Consider a distribution with probability density function φ.
Its associated random variable, X, can be either real-valued or discrete.
We observe a sample of N independent values taken from that distribution.
The question we ask is:
What is the range of values that have a probability ≥ γ
(across samplings of N values) of appearing in the sample?
For instance, for a mean range, one would pick $\gamma = \frac{1}{2}$.
Even when the distribution is continuous, we can study the probability
that a given value appears at least once in the sample.
That is $1-P_{\text{excluded}}$, where $P_{\text{excluded}}$ is the probability
that the value does not appear in the sample.
In turn, given that each sample is drawn independently,
the probability that a value is not drawn in any of the N draws
is $P_{\text{excluded}} = (1-\varphi(x))^N$.
Thus, the probability that a given value is in the sample
is $1-(1-\varphi(x))^N$.
By definition, that probability must be at least γ.
We can therefore derive that the values x that are in range
follow the equation:
$$\varphi(x) \ge 1-\sqrt[N]{1-\gamma}$$
When φ is a bell curve distribution,
the corresponding equality has two solutions for x.
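To make the derivation concrete, here is a minimal simulation sketch in Python (all parameters are illustrative choices of ours): it checks that the frequency at which a value x appears across samples of size N matches $1-(1-\varphi(x))^N$, using a small Binomial distribution for which φ is easy to evaluate.

```python
# Sketch: empirically verify P(x appears) = 1 - (1 - φ(x))^N.
# Parameters (m, p, N, x) are illustrative, not from the article.
import random
from math import comb

m, p, N = 20, 0.5, 10      # a small Binomial(m, p), samples of size N
x = 13                     # the value whose appearance probability we test

phi = comb(m, x) * p**x * (1 - p)**(m - x)   # φ(x) for the Binomial
predicted = 1 - (1 - phi)**N                 # 1 - P_excluded

trials = 20_000
hits = sum(
    any(sum(random.random() < p for _ in range(m)) == x for _ in range(N))
    for _ in range(trials)
)
print(f"predicted={predicted:.4f} simulated={hits / trials:.4f}")
```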
2. Application to the Normal §
Some bell distributions are more easily invertible.
Thankfully, this is true of the Normal distribution,
which will enable us to produce good estimations for all distributions,
thanks to the central limit theorem.
First, let us derive the exact Normal solution.
We have $\varphi(x) \sim \mathcal{N}(\mu, \sigma^2)$:
$$\varphi(x) = \frac{1}{\sqrt{2\sigma^2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Thus the solution to the general inequality is:
$$x \in \left[\mu \pm \sqrt{-2\sigma^2\ln\left(\sqrt{2\sigma^2\pi}\left(1-\sqrt[N]{1-\gamma}\right)\right)}\right]$$
From this, we can compute the maximum and minimum exactly,
along with the mean range, which follows this formula:
$$2\sqrt{-2\sigma^2\ln\left(\sqrt{2\sigma^2\pi}\left(1-\sqrt[N]{1-\gamma}\right)\right)}$$
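As a sketch, the formula translates directly to code (the function name and parameters are ours; for astronomically large N, an arbitrary-precision library such as mpmath avoids the double-precision underflow of $1-\sqrt[N]{1-\gamma}$):

```python
# Mean range of N samples from a Normal(mu, sigma^2), per the formula above.
from math import sqrt, log, pi

def normal_mean_range(sigma2: float, N: int, gamma: float = 0.5) -> float:
    tau = 1 - (1 - gamma) ** (1 / N)   # threshold from the generic derivation
    return 2 * sqrt(-2 * sigma2 * log(sqrt(2 * sigma2 * pi) * tau))

# Example: expected range of 1000 samples from a standard Normal.
print(normal_mean_range(sigma2=1.0, N=1000))  # ≈ 7.1
```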
3. Application to the Binomial §
The PDF of a binomial distribution $\beta(x) \sim B(m, p)$,
the probability of a number x of positive events
among m events with probability p of positivity,
follows this equation:
$$\beta(x) = \binom{m}{x}\,p^x(1-p)^{m-x}$$
While x only takes integer values,
the distribution of B(m,p) is also bell-shaped.
Thus the generic derivation above can also be applied.
Two issues arise when using that derivation, however:
- Unlike the Normal, the binomial coefficient cannot be elegantly inverted,
which prevents us from producing an exact formula.
- For large values of m−x (around $2^{128}$),
calculating that binomial coefficient exactly
is too computationally expensive to yield a result within a lifetime.
We can however devise an algorithmic method
by which we obtain an exact answer regardless.
The first issue can be solved by computing β(x) for all values of x
until the bell curve plummets back below $\tau = 1-\sqrt[N]{1-\gamma}$.
However, that method is impractical when $x_{\max}$ is too large.
Instead of going through each value of x,
our algorithm can search for the right value
through increasingly accurate approximations,
similar to the way Newton’s method works.
This convergence works by:
- Using the best model we have of the distribution,
- Gathering information from the estimated root,
- Updating the model to be even more precise,
- Iterating, similar to an interpolation search,
until eventually, we find two consecutive integers
$x_{\max}$ and $x_{\max}+1$ where the first is above the limit
(obtained from the generic derivation),
and the other is not.
The two challenges in implementing this algorithm are:
- Problem 1: Evaluating β(x) is too expensive for large x
using integer arithmetic operations,
- Problem 2: Establishing a good and computable model for the distribution,
and updating it in such a way that ensures eventual and fast convergence.
3.1. Evaluating the PDF §
We use the classic solution:
first, convert the binomial coefficient formula to use the Gamma function.
$$\beta(x) = \frac{\Gamma(m+1)}{\Gamma(x+1)\,\Gamma(m-x+1)}\,p^x(1-p)^{m-x}$$
Then, to avoid handling large gamma results,
we rely on an exact computation of the log-gamma.
We can use an arbitrary-precision library
to ensure we get an error below the integer result we end up with.
(To find the right precision to set for the algorithm,
we can rely on exponential binary search.)
$$\beta(x) = e^{\ln\Gamma(m+1)-\ln\Gamma(x+1)-\ln\Gamma(m-x+1)+x\ln(p)+(m-x)\ln(1-p)}$$
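As a sketch, here is how this evaluation might look with the mpmath arbitrary-precision library (the 60-digit precision is an illustrative choice; as noted above, the right setting can be found by exponential binary search):

```python
# Evaluate β(x) through log-gamma, avoiding huge intermediate factorials.
from mpmath import mp, mpf, loggamma, log, exp

mp.dps = 60  # decimal digits of working precision (illustrative)

def binom_pdf(x, m, p):
    x, m, p = mpf(x), mpf(m), mpf(p)
    return exp(loggamma(m + 1) - loggamma(x + 1) - loggamma(m - x + 1)
               + x * log(p) + (m - x) * log(1 - p))

# β at the mode of B(2^128, 1/2): tractable despite the enormous m.
print(binom_pdf(2**127, 2**128, 0.5))  # ≈ 4.3e-20
```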
3.2. Converging to the range extrema §
Given the shape of the PDF, and its reflectional symmetry,
we can bound the expected maximum sample to be between the mean
and the end of the curve.
$$mp \le x_{\max} \le m$$
We set those bounds as $x_{\text{low}}$ and $x_{\text{high}}$,
and estimate the value of $x_{\max}$ from its Gaussian approximation,
which has variance $\sigma^2 = mp(1-p)$:
$$\hat{x}_{\max} = \left\lceil mp + \sqrt{-2mp(1-p)\ln\left(\sqrt{2mp(1-p)\pi}\left(1-\sqrt[N]{1-\gamma}\right)\right)}\right\rceil$$
We can then compute the accurate value of $\beta(\hat{x}_{\max})$.
If that value is below τ, we are too far:
we set the upper bound $x_{\text{high}}$ to our $\hat{x}_{\max}$ estimate.
Otherwise, we set $x_{\text{low}}$ to it instead.
Then, we must improve our estimated model.
Newton’s method is insufficient,
because it does not guarantee convergence,
and because its convergence is comparatively slow
as a result of the flatness of the curve.
Binary search, taking the average of $x_{\text{low}}$ and $x_{\text{high}}$,
produces a reliable convergence in O(log(m)),
but it does not use our existing knowledge of the shape of the curve.
The normal curve is quite a good approximation,
especially with large values.
(With small values, the convergence is fast anyway.)
However, past the first estimation,
the normal curve is too far from where the binomial curve intersects τ.
Thus we must slide it, either to the left or to the right,
so that it coincides laterally
with the real point $(\hat{x}_{\max}, \beta(\hat{x}_{\max}))$,
whose abscissa is an estimate of $x_{\max}$.
That new curve is another Gaussian distribution,
with a mean that solves the equation
$\varphi_{\mu,\sigma^2}(\hat{x}_{\max}) = \beta(\hat{x}_{\max})$:
$$\mu = \hat{x}_{\max} - \sqrt{-2\sigma^2\ln\left(\beta(\hat{x}_{\max})\sqrt{2\sigma^2\pi}\right)}$$
However, there is no guarantee that it will intersect τ
between $x_{\text{low}}$ and $x_{\text{high}}$.
As a fallback, if it is out of bounds, we ignore the normal estimate
and use the average of $x_{\text{low}}$ and $x_{\text{high}}$,
just like binary search.
Once the bounds $x_{\text{low}}$ and $x_{\text{high}}$
have converged into adjacent integers,
we have found $x_{\max} = x_{\text{low}}$.
As for $x_{\min}$, the process is symmetrically identical,
except it occurs within the bounds:
$$0 \le x_{\min} \le mp$$
and uses the following, mirrored mean update:
$$\mu = \hat{x}_{\min} + \sqrt{-2\sigma^2\ln\left(\beta(\hat{x}_{\min})\sqrt{2\sigma^2\pi}\right)}$$
The algorithmic complexity of the convergence is O(log(m)) in the worst case,
degrading to binary search, but is empirically O(log(log(m))) on average.
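Here is a sketch of that convergence in Python with mpmath (the function names, the precision setting, and the stable computation of τ via expm1 are our choices, not prescribed above); it inlines the log-gamma evaluation from section 3.1:

```python
# Sketch of the interpolation-style convergence toward x_max.
# The Gaussian model is re-centered on each accurate evaluation of β,
# with a binary-search fallback that guarantees O(log(m)) worst case.
from mpmath import mp, mpf, loggamma, log, expm1, sqrt, pi

mp.dps = 120  # illustrative; should be tuned to the size of the inputs

def log_binom_pdf(x, m, p):
    x, m, p = mpf(x), mpf(m), mpf(p)
    return (loggamma(m + 1) - loggamma(x + 1) - loggamma(m - x + 1)
            + x * log(p) + (m - x) * log(1 - p))

def expected_max(m, p, N, gamma=0.5):
    """Largest x with probability >= gamma of appearing among N samples."""
    # ln τ, computed stably even for N as large as 2^256.
    log_tau = log(-expm1(log(mpf(1) - gamma) / mpf(N)))
    sigma2 = mpf(m) * mpf(p) * (1 - mpf(p))
    log_norm = log(sqrt(2 * sigma2 * pi))

    def dist(log_density):
        # Distance from the Gaussian mean at which that log-density occurs.
        return sqrt(max(mpf(0), -2 * sigma2 * (log_density + log_norm)))

    lo, hi = int(mpf(m) * mpf(p)), int(m)    # x_low, x_high
    guess = lo + int(dist(log_tau)) + 1      # first Gaussian estimate (ceiling)
    while hi - lo > 1:
        x = min(max(guess, lo + 1), hi - 1)  # keep the probe inside the bounds
        lb = log_binom_pdf(x, m, p)
        if lb >= log_tau:
            lo = x                           # still above τ: move the left bound up
        else:
            hi = x                           # below τ: we went too far
        # Slide the Gaussian through (x, β(x)), then intersect it with τ.
        g = x - dist(lb) + dist(log_tau)
        guess = int(g) if lo < g < hi else (lo + hi) // 2
    return lo
```

With these conventions, `expected_max(2**256, mpf(2)**-256, 2**256)` returns 57, the figure that section 4.2 relies on; in that near-Poisson regime the Gaussian guess is at its weakest, but the midpoint fallback still caps the cost at that of a binary search.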
4. Balls Into Bins §
This result allows exact computation of solutions for a well-known problem:
“Given N balls each randomly placed into R bins,
how many balls do the most and least filled bins contain?”
The problem is a sampling of N values
of the Binomial distribution $B(N, \frac{1}{R})$.
Thus, the mean maximum and minimum are its solutions.
The widget at the top of the page gives an instant and exact result
for this problem, for values below $2^{1024}$.
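For small instances, this can be cross-checked by simulation; here is a quick sketch (parameters are ours), using the direct scan from section 3, which is feasible at this scale. The two notions differ slightly: the scan finds the largest count with a ≥ 50% chance of appearing, while the simulation averages the fullest bin, so the numbers should be close but need not coincide.

```python
# Balls-into-bins sanity check for N = R = 1000 (illustrative parameters):
# the direct scan over x, against a Monte Carlo average of the fullest bin.
import random
from math import comb

N = R = 1000
p = 1 / R
tau = 1 - 0.5 ** (1 / N)  # threshold for γ = 1/2

def beta(x):  # PDF of B(N, 1/R)
    return comb(N, x) * p**x * (1 - p)**(N - x)

x_max = max(x for x in range(N + 1) if beta(x) >= tau)

sims = []
for _ in range(2_000):
    bins = [0] * R
    for _ in range(N):
        bins[random.randrange(R)] += 1
    sims.append(max(bins))
print(x_max, sum(sims) / len(sims))  # 5 and ≈ 5.5
```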
4.1. Hash tables §
One use for this problem is in assessing the worst-case complexity
of hash table operations to match operational requirements.
Indeed, the hash output is meant to be uniformly distributed;
in other words, a PRF: one such example is SipHash.
Since handling hash collisions typically requires linear probing,
library developers strive for a bounded maximum number of hashes
that map to the same table entry. Typically, they use a load factor:
if more than 87.5% of the table is filled,
the table size is doubled and rehashed.
The widget above can help show that
this approach does not yield a bounded maximum,
by inputting 0.875*2^i
balls into 2^i
bins:
| Table size | Max bucket size |
|---|---|
| 2^8 | 4 |
| 2^16 | 7 |
| 2^32 | 11 |
| 2^64 | 19 |
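These figures can be reproduced with a compact binary-search variant of section 3's algorithm, sketched below with mpmath (the helper name and precision are ours):

```python
# Reproduce the table: max bucket size for 0.875 * 2^i balls in 2^i bins.
from mpmath import mp, mpf, loggamma, log, expm1

mp.dps = 60  # illustrative precision

def max_bucket(balls, bins):
    m, p, N = mpf(balls), mpf(1) / bins, balls   # N samples of B(m, 1/bins)
    log_tau = log(-expm1(log(mpf(1) / 2) / N))   # ln τ for γ = 1/2
    def log_beta(x):
        return (loggamma(m + 1) - loggamma(mpf(x) + 1) - loggamma(m - x + 1)
                + x * log(p) + (m - x) * log(1 - p))
    lo, hi = int(m * p), balls                   # invariant: β(lo) ≥ τ > β(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lo, hi = (mid, hi) if log_beta(mid) >= log_tau else (lo, mid)
    return lo

for i in (8, 16, 32, 64):
    print(2**i, max_bucket(int(0.875 * 2**i), 2**i))  # 4, 7, 11, 19
```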
As you can see, the growth is very slow,
which satisfies engineering constraints.
If there were some imperative to be bounded below a certain value,
the table algorithm could use the principles laid out in this article
to dynamically compute the load factor
that keeps the maximum bucket size below the imposed limit.
(A notable exception to this result is Cuckoo Hashing,
whose maximum bucket size has a different formula.)
4.2. Hash chains §
Another situation where this problem finds relevance is in cryptography.
First, in the field of collision-resistant functions.
In a hash chain, the root hash has a single hash as input.
The 256-bit input of the SHA-256 primitive randomly maps to its 256-bit output.
There will be one particular hash output that 57 distinct inputs produce
(the mean maximum of $2^{256}$ balls into $2^{256}$ bins).
The pigeonhole principle dictates that this removes possible outputs;
and indeed, about 37% of the $2^{256}$ possible output hashes
cannot be produced.
In other words, if you take a random 256-bit hex string,
it will not be a valid output in one case out of three.
Indeed, the probability that a bin has no balls
after the first link in the chain is
$$\beta_{n=2^b,\,p=2^{-b}}(0) = (1-2^{-b})^{2^b} \xrightarrow{b\to\infty} \frac{1}{e}$$
for a b-bit hash.
On the i-th link of the chain, the same phenomenon strikes again,
and only $h_i = 1-(1-2^{-b})^{2^b h_{i-1}}$ remain
(with $h_0 = 1$, since we start with 100%).
Of course, after that initial 37% loss, the subsequent losses are lesser,
but $h_i \xrightarrow{i\to\infty} 0$.
After just 100 iterations, only 2% of possible hashes remain.
After the typical 10k iterations of PBKDF2, only 0.02% remain.
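In the large-b limit, $(1-2^{-b})^{2^b h}$ tends to $e^{-h}$, so those figures can be checked with the recurrence in a few lines (a sketch):

```python
# The recurrence above in the large-b limit, where (1 - 2**-b)**(2**b * h)
# tends to exp(-h); a quick check of the 100- and 10000-iteration figures.
from math import exp

h = 1.0                      # h_0: all outputs initially reachable
fractions = {}
for i in range(1, 10_001):
    h = 1 - exp(-h)          # h_i = 1 - (1 - 2^-b)^(2^b * h_{i-1})
    if i in (1, 100, 10_000):
        fractions[i] = h
print(fractions)  # ~0.63 after 1, ~0.02 after 100, ~0.0002 after 10000
```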
It is not a vulnerability per se:
it only removes about 13 bits off a 256-bit space,
or 7 bits of security against collision resistance.
Technically, when the number of iterations reaches $2^{\frac{b}{2}}$,
the probability formula breaks down;
a hash chain will loop around after an average of $2^{\frac{b}{2}}$ operations.
It does showcase, yet again, how simple designs can yield subtle consequences.
4.3. Block ciphers §
Consider an ideal b-bit block cipher (PRP) using a b-bit key,
as with AES-128.
We are an attacker examining a b-bit ciphertext block
generated from the encryption of a low-entropy plaintext block of the same size.
While it is true that for a given key,
each plaintext block produces a single ciphertext block and vice-versa,
for a given ciphertext block, each key maps to a random plaintext block.
Keys can be seen as balls, and plaintext blocks as bins.
Thus, conversely, about $\frac{1}{e} \approx 37\%$ of plaintext blocks
have zero keys that decrypt the ciphertext to them.
Thus, if the plaintext block contained a single bit of information,
such as a presidential vote in a majoritarian election,
and if finding the number of valid keys was computationally feasible,
the adversary could decrypt the vote with probability $\frac{1}{e}$.
Yet again, it is not a vulnerability,
since the only currently-known way to do so is to brute-force the key,
but it creates an unintuitive attack that a larger key would dismiss.
Similarly, in an indistinguishability under chosen-plaintext attack (IND-CPA)
against an ideal b-bit block cipher with a 2b-bit key,
as AES-256 is assumed to be,
the set of possible couples of plaintext/ciphertext blocks
has $2^{b+b} = 2^{2b}$ elements, while the set of keys also has $2^{2b}$.
Since the latter randomly map to the former,
when sending two plaintexts and receiving the encryption of one of them,
there is, with probability $\frac{1}{e}$, no key at all
that encrypts the secretly discarded plaintext to the provided ciphertext.
Assuming again that we can simply compute whether there exists no valid key
that encrypts a plaintext to a ciphertext in polynomial time,
we would expect to need a mere three plaintext queries before we can tell,
with an advantage of 1, which of the two plaintexts sent each time,
the first or the second, is the one that gets encrypted.
Also, we can now say precisely that,
in a known-plaintext attack (KPA) with AES-256,
the single known plaintext/ciphertext pair has, on average,
at most 57 valid keys that encrypt the plaintext block to the ciphertext block.
That is the number of balls
in the most filled of $2^{256}$ bins receiving $2^{256}$ balls.
Having that many keys is unlikely however:
58% of all valid pairs will have a single valid key, 29% will have two,
10% will have three, and so forth.
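These percentages are the Poisson(1) limit of the bucket loads, conditioned on having at least one valid key; a quick check:

```python
# Where the 58% / 29% / 10% figures come from: with 2^256 keys falling
# uniformly into 2^256 pairs, bucket loads follow a Poisson(1) law;
# conditioning on "at least one valid key" gives the split below.
from math import e, factorial

p_nonzero = 1 - 1 / e                      # share of pairs with >= 1 key
for k in (1, 2, 3):
    p_k = (1 / e) / factorial(k)           # Poisson(1) mass at k
    print(k, p_k / p_nonzero)              # ~0.58, ~0.29, ~0.10
```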
Finally, still in AES-256, for a given ciphertext block,
the plaintext block with the most keys decrypting to it ($K_{\max}$)
has about 600 quintillion more keys
than the plaintext block with the least keys ($K_{\min}$),
making it more likely.
That may superficially seem like a disadvantage
of using larger keys than the block size.
However, the advantage gained compared to the least likely plaintext block
is only an increased probability of
$$\frac{K_{\max}-K_{\min}}{2^{256}} = 2^{-187},$$
which is not only negligible, but also smaller than in AES-128
($\frac{K^{128}_{\max}-K^{128}_{\min}}{2^{128}} = 2^{-123}$).
Conclusion §
We described an exact formula for the mean range
and extrema of a Normal distribution.
We used it to produce an algorithm
that can compute the extrema of any bell curve distribution,
with the example of the Binomial distribution.
We optimized that algorithm to run in O(log(log(m))) average time,
and discussed a few exact consequences that used to be unconfirmed approximations.
Bibliography:
- Hartley, H. O., & David, H. A. (1954). Universal Bounds for Mean Range and Extreme Observation. The Annals of Mathematical Statistics, 25(1), 85–99. doi:10.1214/aoms/1177728848
- Abdel-Aty, S. H. (1954). Ordered variables in discontinuous distributions. Statistica Neerlandica, 8(2), 61–82. doi:10.1111/j.1467-9574.1954.tb00442.x
- Gonnet, G. H. (1981). Expected Length of the Longest Probe Sequence in Hash Code Searching. Journal of the ACM, 28(2), 289–304. doi:10.1145/322248.322254
- Lamport, L. (1981). Password authentication with insecure communication. Communications of the ACM, 24(11), 770–772. doi:10.1145/358790.358797