Why Randomness Becomes Predictable

The hidden mathematics that makes machine learning possible

Mar 05, 2026

Randomness feels chaotic at the level of individual events.

Flip a coin once and the result is pure uncertainty. Flip it twice and nothing meaningful has changed. But flip it 10,000 times, and something remarkable happens: the proportion of heads becomes so stable that a deviation of even a few percentage points from 50% starts to look like a violation of physics.

Somehow, randomness organises itself into predictability.

This transformation, from microscopic chaos to macroscopic stability, is one of the most important ideas in modern statistics, and the mathematical tools that explain it are called concentration inequalities.

They are the quiet engine behind reliable data science. Whenever we train a model on finite data and expect it to behave sensibly in the real world, we are implicitly relying on them.

To understand why, we need to answer a simple question:

Why do averages behave so well?

The Strange Stability of Averages

Suppose we observe independent random variables:

\(X_1, X_2, \dots, X_n\)

with common mean \(\mu\).

Their average is

\(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\)

Intuition tells us that \(\bar{X}_n\) should approach \(\mu\) as n grows.

This is the Law of Large Numbers.

But the law itself is not enough for real-world work.

It tells us convergence happens eventually, but says nothing about how fast.

In practice we need something sharper:

How likely is it that the average deviates significantly from its true value?

Concentration inequalities answer exactly this question.

They bound probabilities of the form

\(P(|\bar{X}_n - \mu| \ge \epsilon)\)

and reveal something surprising:

The probability of large deviations collapses extremely quickly.

In fact, in many cases it shrinks exponentially with the sample size.

This is the deep reason why large datasets behave so reliably.
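For a fair coin the deviation probability can be computed exactly rather than simulated. The sketch below is an illustration only: the function name `coin_deviation_prob` and the choice of a 0.1 tolerance are assumptions made for this example. It sums the binomial probabilities of every outcome whose fraction of heads is at least 0.1 away from 1/2.

```python
from fractions import Fraction
from math import comb

def coin_deviation_prob(n, eps):
    """Exact P(|X_bar_n - 1/2| >= eps) for n fair coin flips.

    eps should be a Fraction (or int) so the boundary comparison is exact.
    """
    eps = Fraction(eps)
    half = Fraction(1, 2)
    # Count the sequences whose fraction of heads is at least eps away from 1/2.
    bad = sum(
        comb(n, k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - half) >= eps
    )
    return bad / 2**n  # each of the 2^n sequences is equally likely

for n in (10, 100, 1000):
    print(n, coin_deviation_prob(n, Fraction(1, 10)))
```

With ten flips a deviation of this size is the typical outcome; with a thousand flips it is rarer than one in a hundred million.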

A Ladder of Guarantees

The mathematics of concentration evolved gradually. Each step answers the same question with increasing precision.

Think of it as a ladder of guarantees.

Markov’s Inequality: The Bare Minimum

Markov’s inequality requires almost no information about a random variable.

If X is non-negative, then

\(P(X \ge a) \le \frac{E[X]}{a}\)

This bound holds for every a > 0 and is astonishingly general: it works for any non-negative random variable with a finite mean.

But the guarantee is weak. It does not improve with sample size.

Markov tells us deviations are possible, but not how quickly they disappear.
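To see how loose the guarantee can be, here is a small sketch comparing Markov's bound with a tail we can compute exactly. The exponential distribution with mean 1 is chosen purely for illustration, and `markov_bound` is a hypothetical helper name.

```python
from math import exp

def markov_bound(mean, a):
    """Markov's bound on P(X >= a) for a non-negative X with the given mean."""
    return min(1.0, mean / a)  # a probability never exceeds 1

for a in (2, 5, 10):
    exact_tail = exp(-a)  # true P(X >= a) when X is Exponential with mean 1
    print(a, exact_tail, markov_bound(1.0, a))
```

The true tail falls exponentially in a, while the Markov bound only falls like 1/a. The bound is valid, but it uses nothing beyond the mean.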

Chebyshev’s Inequality: Using Variance

If we know the variance \(\sigma^2\), we can do better.

Chebyshev’s inequality states

\(P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}\)

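Applied to the average of n independent variables that share the variance \(\sigma^2\), so that the average has variance \(\sigma^2/n\), this implies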

\(P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}\)

Now we see the first hint of stability: the bound on the deviation probability shrinks at rate 1/n.

The average becomes increasingly reliable.

But in modern machine learning, 1/n is still too slow. For high-confidence guarantees we need something dramatically stronger.
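One way to feel how slow 1/n is: solve the Chebyshev bound for the sample size needed to reach a target failure probability. The sketch below assumes independent samples with known variance; the name `chebyshev_sample_size` and the symbol `delta` for the allowed failure probability are introduced here for illustration.

```python
from math import ceil

def chebyshev_sample_size(variance, eps, delta):
    """Smallest n with variance / (n * eps^2) <= delta."""
    return ceil(variance / (delta * eps**2))

# Fair coin (variance 1/4), estimate within 0.01 of the truth.
for delta in (0.1, 0.01, 0.001):
    print(delta, chebyshev_sample_size(0.25, 0.01, delta))
```

Each tenfold increase in confidence costs ten times more data. For a failure probability of 0.001 the requirement is roughly 2.5 million flips.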

The Exponential Surprise

The real breakthrough arrives with Chernoff bounds.

Instead of polynomial decay, they produce exponential decay.

This means probabilities shrink like

\(\exp(-c n)\)

rather than 1/n.

The difference is enormous.

If a bound scales like 1/n, doubling the data halves the bound on the error probability.

If it scales like \(e^{-cn}\), doubling the data squares the bound: a failure probability of 1% becomes 0.01%.

Randomness does not merely average out. It becomes violently suppressed.

The Chernoff Trick

How does this exponential stability appear?

Through a clever transformation.

Suppose we want to bound

\(P(X \ge a)\)

Chernoff’s idea is to move the problem into the exponent.

For any s>0,

\(X \ge a \quad \Rightarrow \quad e^{sX} \ge e^{sa}\)

Now apply Markov’s inequality to the transformed variable:

\(P(X \ge a) \le \frac{E[e^{sX}]}{e^{sa}}\)

The quantity in the numerator is the moment generating function (MGF):

\(M_X(s) = E[e^{sX}]\)

The MGF is a remarkable object: when it is finite in a neighbourhood of s = 0, it encodes every moment of the distribution inside a single function.

Mean, variance, skewness, kurtosis, all of it lives there.

The Chernoff bound works by choosing the value of s that minimises the expression above. That optimisation produces exponentially small probabilities.

A simple trick, tilting the distribution exponentially, unlocks dramatically stronger guarantees.
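The optimisation over s can be carried out numerically. The sketch below is one possible illustration, not a canonical implementation: `chernoff_bound` is a hypothetical helper that minimises the exponent by ternary search, which is valid because the logarithm of an MGF is convex in s. The standard normal, whose MGF is \(e^{s^2/2}\), serves as the test case because its optimum is known in closed form (s = a, giving \(e^{-a^2/2}\)).

```python
from math import exp

def chernoff_bound(log_mgf, a, s_max=50.0, iters=200):
    """Minimise exp(log_mgf(s) - s * a) over 0 <= s <= s_max.

    Uses ternary search, assuming the exponent is convex in s.
    """
    lo, hi = 0.0, s_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_mgf(m1) - m1 * a < log_mgf(m2) - m2 * a:
            hi = m2  # the minimiser lies to the left of m2
        else:
            lo = m1  # the minimiser lies to the right of m1
    s = (lo + hi) / 2
    return exp(log_mgf(s) - s * a)

def gaussian_log_mgf(s):
    """log E[exp(s X)] for a standard normal X."""
    return s * s / 2

for a in (1.0, 2.0, 3.0):
    # Numerical optimum next to the closed form exp(-a^2 / 2).
    print(a, chernoff_bound(gaussian_log_mgf, a), exp(-a * a / 2))
```

The two printed columns should agree to many decimal places, which is a useful sanity check before applying the same routine to a distribution without a closed-form optimum.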

Independence: The Hidden Ingredient

Exponential concentration does not appear automatically.

It emerges from independence.

If variables are independent, their MGFs multiply:

\(M_{\sum X_i}(s) = \prod_i M_{X_i}(s)\)

This multiplicative structure is what produces the exponential decay.

Intuitively, independence means that random fluctuations cannot reinforce each other indefinitely. Each new sample acts as a stabilising force.

Randomness stops compounding and starts cancelling.

This is the moment where chaos begins to cohere.
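The product rule is easy to verify by brute force in a small case. The sketch below uses Rademacher variables (independent signs, each +1 or -1 with probability 1/2), whose individual MGF is \(\cosh(s)\). The function name and the values n = 5, s = 0.7 are arbitrary choices for illustration.

```python
from itertools import product
from math import cosh, exp

def mgf_of_rademacher_sum(n, s):
    """E[exp(s * (X_1 + ... + X_n))] for independent random signs.

    Computed by enumerating all 2^n equally likely sign patterns.
    """
    patterns = list(product((-1, 1), repeat=n))
    return sum(exp(s * sum(p)) for p in patterns) / len(patterns)

n, s = 5, 0.7
print(mgf_of_rademacher_sum(n, s))  # MGF of the sum, by enumeration
print(cosh(s) ** n)                 # product of the individual MGFs
```

Both lines print the same number up to floating-point rounding. Taking logarithms turns the product into a sum of n equal terms, which is where the factor of n in the exponent comes from.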

The Sub-Gaussian Universe

Some random variables concentrate particularly well.

These are called sub-Gaussian variables.

A mean-zero random variable Z is sub-Gaussian if its MGF satisfies

\(E[e^{\beta Z}] \le e^{C\beta^2}\)

for some constant C and every real \(\beta\).

This inequality implies the tails decay at least as fast as a Gaussian distribution.

Many common variables fall into this category:

bounded variables
Gaussian variables
Rademacher variables
averages of independent sub-Gaussian noise

Sub-Gaussian variables live in a world where large deviations are extremely unlikely.

They represent the ideal environment for statistical learning.
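As a concrete check of the definition, a Rademacher variable has MGF \(\cosh(\beta)\), and \(\cosh(\beta) \le e^{\beta^2/2}\) for every real \(\beta\), so the condition holds with C = 1/2. The sketch below confirms the inequality numerically on a grid; the helper names and the grid range are illustrative assumptions, and a grid check is evidence rather than a proof.

```python
from math import cosh, exp

def rademacher_mgf(beta):
    """E[exp(beta * Z)] for Z equal to +1 or -1 with probability 1/2 each."""
    return cosh(beta)

def sub_gaussian_envelope(beta, C=0.5):
    """Right-hand side of the sub-Gaussian condition, exp(C * beta^2)."""
    return exp(C * beta * beta)

betas = [i / 10 for i in range(-50, 51)]  # grid from -5.0 to 5.0
holds = all(rademacher_mgf(b) <= sub_gaussian_envelope(b) for b in betas)
print(holds)
```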

To measure tail behaviour formally, we use Orlicz norms:

For sub-Gaussian variables:

\(\|\cdot\|_{\psi_2}\)

For sub-exponential variables:

\(\|\cdot\|_{\psi_1} \)

An important subtlety arises here.

If Z is sub-Gaussian, then \(Z^2\) is sub-exponential.

This matters in machine learning, because squared errors, central to loss functions, naturally produce heavier tails than the original noise.

Fortunately, sub-exponential variables still exhibit exponential concentration.

Predictability survives, even in noisier settings.

Hoeffding vs Bernstein

When working with independent variables, two inequalities dominate practice.

Hoeffding’s Inequality

Hoeffding applies when variables are bounded:

\(a \le X_i \le b\)

It produces the bound

\(P(\bar{X}_n - \mu \ge \epsilon) \le \exp(-2n\epsilon^2/(b-a)^2)\)

The beauty of Hoeffding is its simplicity: only the range matters.

But this simplicity comes with a cost. The inequality ignores the internal structure of the distribution.

If even a tiny fraction of probability mass lies far from the mean, the range must expand to include it. The bound becomes pessimistic.
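Because only the range enters, the bound is easy to evaluate and to invert. The following is a minimal sketch, assuming independent samples confined to [a, b]; the function names and the symbol `delta` for the allowed failure probability are introduced here for illustration.

```python
from math import ceil, exp, log

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """One-sided Hoeffding bound on P(X_bar_n - mu >= eps) for X_i in [a, b]."""
    return exp(-2 * n * eps**2 / (b - a) ** 2)

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest n for which the one-sided bound is at most delta."""
    return ceil((b - a) ** 2 * log(1 / delta) / (2 * eps**2))

# Variables in [0, 1], estimate within 0.01 of the truth.
for delta in (0.1, 0.01, 0.001):
    n = hoeffding_sample_size(0.01, delta)
    print(delta, n, hoeffding_bound(n, 0.01))
```

The required sample size grows only with \(\log(1/\delta)\), so each tenfold increase in confidence adds a fixed number of samples instead of multiplying the total.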

Bernstein’s Inequality

Bernstein improves the situation by incorporating variance. If each \(X_i\) has variance \(\sigma^2\) and satisfies \(|X_i - \mu| \le M\), then

\(P(\bar{X}_n - \mu \ge \epsilon) \le \exp\left( -\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3} \right)\)

Now the behaviour of the distribution matters.

If most probability mass sits near the mean, the variance \(\sigma^2\) will be small even if the worst-case deviation M is large.

In such cases Bernstein produces dramatically tighter bounds.

The choice between Hoeffding and Bernstein reflects a recurring theme:

More information about the distribution yields stronger guarantees.
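A quick numerical comparison makes the trade-off concrete. The sketch below evaluates both bounds as stated above for variables in [0, 1]. The specific numbers (n = 1000, \(\epsilon\) = 0.05, variances 0.01 and 0.25) are illustrative assumptions, and M = 1 is used as a safe bound on \(|X_i - \mu|\).

```python
from math import exp

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """exp(-2 n eps^2 / (b - a)^2): uses only the range."""
    return exp(-2 * n * eps**2 / (b - a) ** 2)

def bernstein_bound(n, eps, variance, M):
    """exp(-n eps^2 / (2 sigma^2 + 2 M eps / 3)): uses variance and range."""
    return exp(-n * eps**2 / (2 * variance + 2 * M * eps / 3))

n, eps = 1000, 0.05
print("Hoeffding:              ", hoeffding_bound(n, eps))
print("Bernstein, var = 0.01:  ", bernstein_bound(n, eps, 0.01, 1.0))
print("Bernstein, var = 0.25:  ", bernstein_bound(n, eps, 0.25, 1.0))
```

With a small variance Bernstein is many orders of magnitude tighter. With the largest variance a [0, 1] variable can have (0.25), it is slightly worse than Hoeffding, so the extra information only helps when the variance really is small.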

Why Machine Learning Works

At the heart of machine learning lies a deceptively simple problem.

We train models on finite samples but expect them to perform well on an entire population.

Why should this be possible?

Formally, we compare two quantities:

Training error (empirical risk):

\(\hat{R}(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i),Y_i)\)

True error (population risk):

\(R(f) = E[\ell(f(X),Y)]\)

The difference between them is the generalisation gap.

For a model f chosen without looking at the sample, concentration inequalities guarantee that this gap is small with high probability.

They show that empirical averages of loss functions remain close to their expectations, even across entire classes of models.
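In the simplest setting this can be made fully explicit. The sketch below assumes an i.i.d. sample, a loss taking values in [0, 1], and a finite set of candidate models fixed before the data are seen. Under those assumptions, the two-sided form of Hoeffding's inequality combined with a union bound gives a gap width that holds for all candidates at once. The function name and parameters are hypothetical.

```python
from math import log, sqrt

def generalisation_gap_bound(n, delta, num_models=1):
    """Width t such that, with probability at least 1 - delta,
    |R_hat(f) - R(f)| <= t simultaneously for all num_models candidates.

    Assumes a loss in [0, 1] and n i.i.d. samples (Hoeffding + union bound).
    """
    return sqrt(log(2 * num_models / delta) / (2 * n))

n, delta = 10_000, 0.05
print(generalisation_gap_bound(n, delta))                     # one fixed model
print(generalisation_gap_bound(n, delta, num_models=10**6))   # a million models
```

Going from one model to a million only roughly doubles the width, because the number of models enters through a logarithm. Infinite hypothesis classes need the more refined tools, since this count-based argument no longer applies.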

This insight powers modern learning theory.

Tools like Rademacher complexity and entropy bounds extend concentration arguments from single functions to vast hypothesis spaces. They determine the sample complexity required to learn reliably.

Without concentration, the connection between training data and real-world performance would collapse.

Machine learning would be impossible.

When Data Misbehaves

Real data is not always well behaved.

Heavy-tailed distributions can sabotage the stability of averages. A few extreme observations can distort the sample mean dramatically.

To combat this, statisticians developed robust estimators.

One elegant example is the Median-of-Means method.

The idea is simple:

Split the data into several blocks.
Compute the mean of each block.
Take the median of those means.

This small modification restores exponential concentration even under heavy tails, provided the variance is finite.

Instead of collapsing under rare outliers, the estimator remains stable.

Randomness can still be tamed.
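The three steps above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name `median_of_means`, the equal-sized contiguous blocks, and the decision to drop leftover observations are all assumptions made here, and the data are assumed to arrive in no meaningful order.

```python
from statistics import mean, median

def median_of_means(data, num_blocks):
    """Median-of-Means estimate of the mean of `data`."""
    values = list(data)
    block_size = len(values) // num_blocks
    if block_size == 0:
        raise ValueError("need at least one observation per block")
    # Step 1: split into num_blocks equal blocks (leftover values are dropped).
    blocks = [
        values[i * block_size:(i + 1) * block_size]
        for i in range(num_blocks)
    ]
    # Step 2: compute the mean of each block.
    block_means = [mean(block) for block in blocks]
    # Step 3: take the median of those means.
    return median(block_means)

# 99 ordinary observations and one extreme outlier.
data = [1.0] * 99 + [1_000_000.0]
print(mean(data))                  # dragged far from 1 by the outlier
print(median_of_means(data, 10))   # the outlier spoils only one block
```

A single outlier can corrupt at most one block mean, and the median ignores a minority of corrupted blocks. More blocks tolerate more outliers but leave fewer observations per block, so the number of blocks is a trade-off.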

The Deeper Lesson

At first glance randomness appears hostile to prediction.

Individual outcomes fluctuate unpredictably. Noise seems unavoidable.

But concentration inequalities reveal a deeper truth.

Randomness is chaotic locally but rigid collectively.

As independent observations accumulate, large deviations become astronomically unlikely. Disorder averages itself into structure.

This is why large datasets behave so reliably.

It is why scientific experiments converge to stable measurements.

And it is why machine learning models trained on noisy samples can still capture real patterns.

The mathematics of concentration shows that randomness is not merely noise.

Under the right conditions, it becomes a powerful stabilising force.

Chaos, surprisingly, coheres.

Padraig's Substack
