This time, using the estimators we devised, let’s achieve perfection, and let’s do that with algorithms generic enough to be applicable to most games.

Free will lets anyone do anything. But if you have an objective in life, some choices get you closer to it than others. Can we study that difference mathematically?

We can certainly try! Earth is a big anthill, and we all have agency in this world. So let’s confront an agent with the randomness of its environment, and study the decisions it makes.

One pretty darn good tool in our mathematical toolbox sprouts from Markov chains. In Markov chains, an entity changes state in its environment based on transition probabilities between states, like a leaf in the wind.

The leaf decides nothing. An agent, on the other hand, makes a decision,
which will lead it into a different state.
Maybe the agent is a bit irrational or uncertain,
and it may not always take the same action in the same circumstances.
To study those decisions, we define them by the agent’s **Policy**, $\pi(a|s)$:
it is the probability that the agent takes a given action in a given state.

Then, just like a Markov chain,
the environment randomly moves the agent to another state,
based on **State Transition** probabilities $\tau(s'|s,a)$
which depend on the current state and the action of the agent.

By changing state, the agent may find its environment more desirable.
We model that as a **Reward**, $r(s,s')$:
a measurable value obtained by transitioning from one state to the next.
Its absolute value may be meaningless, but the higher, the better.
In the case of Wordle, we want to minimize the number of guesses:
thus we give ourselves a negative reward of -1 every time we make a guess.

That is all there is to a **Markov Decision Process**^{[MDP]}!
MDPs are more generic than Markov chains,
and more applicable to real-life situations like game optimization.
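To make those pieces concrete, here is a toy sketch of a single MDP step in Julia. The `policy`, `transition`, and `reward` tables and the `sample_categorical` helper are made-up illustrations, not part of the optimizer we build later:

```julia
# One MDP step: sample an action a ~ π(a|s), then a next state
# s' ~ τ(s'|s,a), and collect the reward r(s,s').
function sample_categorical(probs::Dict)
    r, acc = rand(), 0.0
    for (outcome, p) in probs
        acc += p
        r <= acc && return outcome
    end
    return first(keys(probs))  # guard against rounding
end

function mdp_step(policy, transition, reward, state)
    action = sample_categorical(policy[state])
    next_state = sample_categorical(transition[(state, action)])
    return next_state, reward[(state, next_state)]
end
```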

But how can we study agentic decision-making? Surely some policies are better than others. The optimal policy, perfect play, is written as $\pi^*(a|s)$, because it is the star of the show. It is the policy that maximizes the State Value, which is…

The **State Value** under a policy, $v_{\pi}(s)$,
is the expected rewards accumulated from the current state
onward to the infinite future if the agent acts according to the policy.

The **Action Value** under a policy, $q_{\pi}(s,a)$,
is a surprisingly useful metric,
defined as the expected rewards accumulated
from the current state onward to the infinite future,
if the agent makes a given action and then acts according to the policy.

Why is it useful? Well, thanks to a formidable theorem
called the **Bellman optimality equation**,
which recursively defines the action value of the optimal policy:

$q_{\pi^*}(s,a) = \sum_{s'} \tau(s'|s,a) \times \left(r(s,s') + \max_{a'} q_{\pi^*}(s',a')\right)$

You may be starting to connect the dots!

- To find the optimal policy, you simply need to define your policy as *picking the choice with a maximal optimal action value*.
- To find the optimal action value, you simply need to improve its estimation by *computing the Bellman optimality equation from the subactions taken by your policy*.

Suddenly, all you need to focus on is predicting your cumulative reward better.

You start with an initial prediction for the action value of each action you may take. At worst, you can give them equal value. You pick one action randomly using your policy, and simulate what happens after. That simulation samples an action value. Keep in mind that it is not the optimal action value:

- The policy you used for the simulation is not optimal yet, and
- The reward may have variance, because both the policy and the state transition are probabilistic.

Still, you have a new approximate action value, which you can use to improve your estimation, using the Bellman optimality equation.

Pretty soon, you end up with a tree of simulated futures.
Since we randomly picked actions to search through that tree,
we call this algorithm
**Monte-Carlo Tree Search**^{[MCTS]}.
With every new simulation, our predicted action values get more accurate,
and from that, we can execute the optimal policy.

```
function simulate!(state)
    action = sample_action(state)
    next_state, reward = sample_next_state(state, action)
    sampled_action_value = reward + next_state.best_action.value
    state.visits += 1
    action.visits += 1
    action.value = streamed_mean(action.value, sampled_action_value, action.visits)
    state.best_action = state.actions[argmax(a.value for a in state.actions)]
end

function sample_next_state(state, action)
    next_state, reward = sample_transition(state, action)
    if next_state.visits == 0
        next_action = sample_action(next_state)
        next_action.value = estimate_action_value(next_state, next_action)
        next_state.best_action = next_action
    else
        simulate!(next_state)
    end
    return next_state, reward
end
```

Notice we use `estimate_action_value()`,
which is implemented by the entropic estimator
we described in the previous article.
That weaker estimator, hidden within the stronger MCTS estimator,
like a Russian doll of estimations, helps bootstrap its accuracy.
Fear not how few dolls there are, for we will cheerfully add more later.

As a quick aside, `streamed_mean` is a neat algorithm from
Welford^{[WEL]} for computing a mean one value at a time:

```
function streamed_mean(old_mean, new_value, new_count)
    return old_mean + (new_value - old_mean) / new_count
end
```

Another aside: in Wordle and many other games, we can brute-force the state transition function, since it is simply the set of $3^5=243$ possible constraints associated with a guess (e.g. ⬛🟨⬛⬛🟩: the second letter is misplaced, the fifth is right, the others are wrong):

```
function simulate!(state)
    action = sample_action(state)
    sampled_action_value = 0.0
    for (next_state, prob_trans, reward) in enumerate_next_states(action)
        # This now looks exactly like the Bellman equation:
        sampled_action_value += (reward + next_state.best_action.value) * prob_trans
    end
    state.visits += 1
    action.visits += 1
    action.value = streamed_mean(action.value, sampled_action_value, action.visits)
    state.best_action = state.actions[argmax(a.value for a in state.actions)]
end
```

But we haven’t yet explained how we select an action to simulate. If our policy is to always take the action with the maximal estimated value, and the action with the highest initial estimate is actually below its true action value, then its estimate will increase with every simulation. It will thus always remain the highest, and we will never test out any other action, even though there may well be one that is much better!

Walking the tradeoff between
picking the action that is most likely to be the best,
and trying out other actions that might turn out to be even better,
is a classic problem called the **exploration-exploitation dilemma**.
The literature often focuses on a simplified setup,
the **multi-armed bandit**. The bandit is a casino slot machine:
you insert a coin, and pull one of its arms.
Each arm has a fixed, distinct probability of yielding a payout.
As a player, you wish to pull the arm with the highest expected payout,
but you can only tell which one it is
through the frequency and amount of the payouts you observe.
You want to minimize the regret of pulling the wrong arm.

There are hacks aplenty to achieve a reasonably low regret!

The oldest trick in the book is to pick an action uniformly at random,
with a small probability $\epsilon$,
and to otherwise pick the action with the highest estimated value.
It is called the **epsilon-greedy**^{[GREED]} policy for obvious reasons.
For non-obvious reasons,
as long as you decay the value of $\epsilon$ with the right schedule,
it keeps cumulative regret logarithmic^{[REG]}
as more simulations are performed!
Knowing which schedule to use is a dark art, however;
in other words: “I just tried a few values and this one looked OK I guess.”
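As a minimal sketch, epsilon-greedy arm selection on a bandit could look like the following (the $10/t$ decay schedule and the toy value estimates are made up for illustration, very much in the “looked OK I guess” spirit):

```julia
# Epsilon-greedy: explore uniformly with probability ε, otherwise
# exploit the arm with the highest estimated value.
function epsilon_greedy(estimated_values::Vector{Float64}, t::Int)
    epsilon = min(1.0, 10.0 / t)  # illustrative decay schedule
    if rand() < epsilon
        return rand(1:length(estimated_values))  # explore
    else
        return argmax(estimated_values)          # exploit
    end
end
```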

But epsilon-greedy can have quite a finicky behaviour,
especially when there are a huge number of arms,
which is the case with Wordle:
we can submit one of 14855 words as guesses.
That is a larger action space than in the game of Go,
which is already a challenge at ~250 actions per move^{[AG]}.

How can we do better? Here’s an insight: the problem stems from not knowing the exact action value. We can use the uncertainty we have about it to guide which action we pick.

The most popular trick simply computes a confidence interval,
and selects the action with the highest upper bound.
Just like the previous trick, its popular name is self-explanatory:
**Upper Confidence Bound (UCB)**.
However, how can you compute the confidence interval
on a random variable whose distribution you don’t even know?

You can derive a formula from **Hoeffding’s inequality**^{[HOEFF]},
a fun theorem that gives an upper bound for a probability
related to the sample mean $\hat{\mu} = \frac{\sum Q}{n}$
and the true mean $\mu = \mathbb{E}[\frac{\sum Q}{n}]$
of $n$ independent random variables $Q_i$ such that $L_i \le Q_i \le U_i$
(with $0 \le i \lt n$):

$\Pr\left(\hat{\mu} - \mu \ge \delta\right) \le e^{\frac{-2n^2\delta^2}{\sum_i (U_i-L_i)^2}}$

If all $Q_i$ have the same bounds and a symmetric distribution, we can derive from it this inequality:

$p \stackrel{\text{def}}{=} \Pr\left(\mu \ge \hat{\mu}+\delta\right) \le e^{\frac{-2n\delta^2}{(U-L)^2}}$

From that, you can compute a distribution-agnostic upper bound based on an arbitrary confidence interval size (e.g. $p = 0.999$):

$\delta = (U-L) \times \sqrt{\frac{-\log(p)}{2n}}$

We can now plug this formula into an action selector:

```
function sample_action(state)
    return argmax(action_value_upper_bound_hoeffding, state.actions)
end

function action_value_upper_bound_hoeffding(action)
    # I just tried a few values and this one looked OK I guess.
    p_value = 0.999
    upper_action_value = 0
    lower_action_value = -6  # We lose the game after 6 failed guesses.
    factor = (upper_action_value - lower_action_value) * sqrt(-log(p_value)/2)
    return action.value + factor * (action.visits+1)^-0.5
end
```

It may look OK at a glance, but sadly, allocating all 14855 actions in each state will blow our RAM budget. For now, when allocating a new state, we will only load in the best 100 actions, based on our action value estimator. You may justly object! Some action beyond that limit might be optimal. But we are already swimming in heuristics anyway at this point. However, I promise we will fix this oversight later.

Another famous action sampler appeared in the AlphaGo paper^{[AG]}, dubbed **PUCT**
(Predictor-based Upper Confidence Bound for Trees).
It is also in the UCB family,
but it makes use of a policy estimator $\hat{\pi}(a|s)$,
and of the number of simulations spent on each action, $n_{s,a}$.
The upper bound it computes has the following formula:

$q(s,a) + c \times \hat{\pi}(a|s) \times \frac{\sqrt{\sum_{a'} n_{s,a'}}}{1 + n_{s,a}}$

```
function sample_action(state)
    sum_exp_value = 0.0
    for action in state.actions
        sum_exp_value += exp(action.value_estimate)
    end
    return argmax(action ->
            action_value_upper_bound_puct(state, action, sum_exp_value),
        state.actions)
end

function action_value_upper_bound_puct(state, action, sum_exp_value)
    # I just tried a few values and this one looked OK I guess.
    coeff = 1.0
    # We set the policy as the softmax of the initial action value estimate.
    policy = exp(action.value_estimate) / sum_exp_value
    return action.value + coeff * policy * sqrt(state.visits) / (1 + action.visits)
end
```

On the plus side,
it was good enough that it beat the best Go player in the world.
On the other hand, while there is some recent analysis^{[PUCT]} linking the formula
to regularized policy optimization theory,
as best as we can tell, it was hand-tweaked haphazardly,
not the result of a mathematical derivation.
That implies that there may well be superior designs!

For the sake of comparing action samplers, let’s throw in one of my own invention, which I call the Laplace sampler, inspired by the rule of succession. The idea is: how much would a new simulated action value with a fixed offset affect the estimation? We compute that crudely:

```
function sample_action(state)
    return argmax(action_value_upper_bound_laplace, state.actions)
end

function action_value_upper_bound_laplace(action)
    # I just tried a few values and this one looked OK I guess.
    delta = 0.1
    return (action.value * action.visits + (action.value + delta)) / (action.visits + 1)
end
```

As it turns out, they can all find the optimal policy,
barely, through sheer brute force.
However, you may have noticed that, every time,
there was an arbitrary value we had to plug into the equation,
which had no justification, and which we had to guess barbarically.
Those are what we call **hyperparameters**:
values that our optimizer model needs,
so that it can find the parameters of the final model
(the optimal policy).

We have two hyperparameters right now:

- The **action sampler parameter** (the $\epsilon$ schedule for epsilon-greedy, the p-value for Hoeffding, the coefficient for PUCT, the value delta for Laplace…);
- The number of **top actions** we preselect to limit RAM use.

The values we picked might work for Wordle, but do they truly work in the general case? Do they always find the optimal policy?

I can tell you right now that there are some values for which the algorithm never finds the optimal policy. It is pretty obvious for the “top N actions” hyperparameter: if those preselected actions don’t include the optimal one, we will never find it. It is also true of the other hyperparameters.

The consequence is dire: we lose the guarantee that, with enough simulations,
the algorithm eventually yields the optimal strategy.
How can we tell, then, that we have likely found perfection,
apart from comparing to compute-intensive non-generic exact solvers^{[MIT]}?

Can we have a more generic solution?

By far the worst hyperparameter is the exclusion of all actions except the 100 most promising. But there really are too many actions for the RAM to hold across all simulated states.

How do we humans handle gigantic action spaces? Personally, I look at the most promising action, and when I feel like I need to study other approaches, I imagine another option and its outcome.

We can have a similar algorithm:

- When we **select** an action to simulate, we only pick within a subset of actions $\hat{\mathscr{A}}$ kept in the state’s memory. It contains all actions that we have simulated in the past, plus one more, the most promising among the rest: the one with the highest estimated action value.
- When we simulate a given state for the **first time**, we initialize the subset of actions with a single action: the most promising one of them all.
- Every time we simulate the **least simulated action**, we also add a new action to our list: among all the actions not yet in our subset, we pick the one that is most promising. That way, it can one day be explored.
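The bookkeeping above can be sketched as follows; the `Action` and `ActionSubset` containers are hypothetical stand-ins for the real optimizer’s state, with `candidates` sorted best-first by estimated action value:

```julia
mutable struct Action
    visits::Int
end

mutable struct ActionSubset
    subset::Vector{Action}      # actions we may select for simulation
    candidates::Vector{Action}  # remaining actions, best-first
end

# After simulating `simulated`, widen the subset if it was among the
# least simulated: promote the next most promising candidate.
function maybe_widen!(s::ActionSubset, simulated::Action)
    least = minimum(a.visits for a in s.subset)
    if !isempty(s.candidates) && simulated.visits <= least
        push!(s.subset, popfirst!(s.candidates))
    end
end
```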

That is not enough. As we ran the optimizer, a major issue reared its ugly head. You see, the entropic estimator is biased. It is too optimistic about the action values it predicts.

🔍 Let’s say it believes starting with “sater” *(that which satiates)*
will win in 2.9870 guesses on average.
That seems better than “raise”, which it believes takes 2.9925.

🤏 So the action selector picks “sater”.

📈 After simulating it, though, it revises its estimation to 3.3142 guesses.

🔍 Now, “raise” looks better (still at 2.9925).

🤏 So the action selector simulates “raise”.

📈 That in turn will revise it to 3.3498…

🔍 which makes both look worse than “roate” *(to learn by repetition)* estimated at 2.9942.

You see where this is going: no matter how good the action selector is, it will always pick the action that has not been visited yet, because that one always has a better estimated value than those that have been visited and whose estimation has thus been revised. Therefore, despite our efforts to limit RAM use by only holding a subset, we will quickly end up with that subset containing all actions.

What we wish we had was an estimator whose predictions could be compared fairly, regardless of how many simulations lie behind them.

So let’s remove the bias from the one we have. All it takes after all is to keep the history of estimations, and to compute, for each possible action in our subset, the difference between an estimation after a given number of simulations, and the true action value (approximated by the latest estimation for that action).

To sum up, we are building a tower of estimators of increasing accuracy:

- The **base predictor**, $\hat{q}_1(s,a)$, uses the entropic estimator.
- The **tree predictor** relies on the Bellman equation and the subactions explored after $v$ visits: $\hat{q}_2(s,a,v) = \begin{cases} \hat{q}_1(s,a) &\text{if } v=0 \\ \sum_{s'} \tau(s'|s,a) \times (r(s,s') + \max_{a'}(\hat{q}_3(s',a'))) &\text{otherwise} \end{cases}$
- The **debiased predictor** can be compared between actions with varying numbers of simulations $v$: $\begin{array}{ll} \hat{q}_3(s,a,v) &= \hat{q}_2(s,a,v) - \mathbb{E}[\hat{q}_2(s,a,v) - q_{\pi^*}(s,a)] \\ &\approx \hat{q}_2(s,a,v) - \frac{\sum_{a'\in\hat{\mathscr{A}}_s} \hat{q}_2(s,a',v) - \hat{q}_2(s,a')}{|\hat{\mathscr{A}}_s|} \end{array}$

(We note $\hat{q}_2(s,a)$ and $\hat{q}_3(s,a)$ to be the estimation with the largest $v$.)
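As a sketch, the debiased predictor can be computed from a hypothetical per-action `history` of $\hat{q}_2$ estimates, indexed by visit count, with the last entry standing in for the true action value:

```julia
# q̂₃(s,a,v): subtract from q̂₂(s,a,v) the mean bias observed at visit
# count v, averaged over the actions whose history reaches that count.
function debiased_value(actions, a, v)
    biases = [act.history[v] - act.history[end]
              for act in actions if length(act.history) >= v]
    return a.history[v] - sum(biases) / length(biases)
end
```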

That does the trick! We get convergence in a reasonable timeframe.

Laplace finds optimal play first, after 17735 iterations. PUCT is next, with 20539 iterations. Hoeffding comes in last, with 22954 iterations.

(Keep in mind, though, that this is very sensitive to their hyperparameters; we spent comparable time searching for a reasonable value for each, so luck played a bit of a role.)

Now we turn our sights to the other hyperparameter: the one used in UCB and similar simulation samplers.

What would the ideal action picker look like? There are a few properties we can determine:

- *If we know that one action is optimal for sure*, we must pick it.
- *If we know that one action is suboptimal for sure*, we must never pick it. We must not waste simulation time on worse actions.
- *If we know that two actions are equally optimal*, we must pick them with equal probability.

(This one may be less intuitive, but picture a game of Rochambeau: if picking either rock, paper, or scissors, is not done with equal probability, then that is a vulnerability that the adversary can exploit.)

Inferring from those properties, a straightforward principle shapes up:
*If an action has X% chance of being optimal,
it ought to be selected with X% probability*.
That is the spirit behind **Thompson sampling**^{[THOM]}.

It would effectively be like sampling from our policy. But how can we obtain our policy from the data we have so far?

Ultimately, the reason we don’t simply select the action whose estimated action value is highest is that we are uncertain about its exact value. If we knew the exact probability distribution of every action value, we could derive the probability that each action is optimal: either through mathematical derivation (if the distribution allows it), or through stochastic averaging: sampling each action value from its known distribution many times, and counting how many times each action comes out on top.

Stochastic averaging is particularly interesting for us: not only does it forgo complex formulae, it also works all the way down to a single sample! Sure, with a single sample, we won’t have a good estimation of the policy, since all actions will have a 0% probability except one at 100%. But the one that comes out on top can be explored right away: it already follows the principle of Thompson sampling! You can even parallelize the search, by picking the top K actions and exploring each of them in their own thread.

Still, we need to have a model for the probability distribution
of each action value. We can stay simple: a **Gaussian**,
with the estimated action value as its mean.

We now need its **variance**.
What does it represent? We are uncertain about the right value,
and the more simulations we have on an action, the more precise it becomes.
The earlier predictions that we kept in the action’s history
show just how wrong we were about the real value.
The error we made can be used to compute **the mean squared error**,
which is the variance we seek.
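Under that Gaussian model, the sampler itself is pleasantly short. Here is a sketch, assuming hypothetical `value` (mean estimate) and `variance` fields on each action:

```julia
# Thompson sampling: draw each action value from N(mean, variance)
# and select the action whose draw comes out on top.
function thompson_sample_action(actions)
    drawn(a) = a.value + sqrt(a.variance) * randn()
    return argmax(drawn, actions)
end
```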

Since we get more precise with more simulations, it makes sense to make the variance depend on the number of simulations. For each number, we can compute the mean squared error across actions:

$\begin{array}{ll} \sigma^2(s,v) &= \mathbb{E}\left[(\hat{q}_3(s,\cdot,v) - q_{\pi^*}(s,\cdot))^2\right] \\ &\approx \frac{\sum_{a \in \hat{\mathscr{A}}_s} (\hat{q}_3(s,a,v) - \hat{q}_3(s,a))^2}{|\hat{\mathscr{A}}_s|-1} \end{array}$

Is this system good though?

As it turns out, it blows other approaches out of the water, quickly finding optimal play in a couple hours:

Thompson sampling finds optimal play after 870 iterations, less than a tenth of the time!

Storing all of the history of our predictions in each action gobbles a gargantuan amount of memory. We couldn’t actually run the algorithm that produced the chart above, until we implemented a few more tricks.

Mathematical wizardry saved our bacon in the last article, so perhaps it can help here too!

We can aggregate the value bias information across actions within the state from which those actions are taken. There is a specific bias between the point where we have no simulation, to the estimate after the first simulation; there is a bias from the first simulation, to the estimate after two simulations, and so forth. Let’s average each of those biases across actions: then the bias we care about is the sum of per-simulation biases across all remaining simulations!

$\begin{array}{ll} \hat{q}_3(s,a,v) &= \hat{q}_2(s,a,v) - \mathbb{E}\left[\hat{q}_2(s,\cdot,v) - q_{\pi^*}(s,\cdot)\right] \\ &= \hat{q}_2(s,a,v) - \mathbb{E}\left[\sum_{i=v}^{\infty} \hat{q}_2(s,\cdot,i) - \hat{q}_2(s,\cdot,i+1)\right] \\ &= \hat{q}_2(s,a,v) - \sum_{i=v}^{\infty} \mathbb{E}\left[\hat{q}_2(s,\cdot,i) - \hat{q}_2(s,\cdot,i+1)\right] \\ &\approx \hat{q}_2(s,a,v) - \sum_{i=v}^{v_{\text{max}}-1} \frac{\sum_{a'\in\hat{\mathscr{A}}_s} \hat{q}_2(s,a',i) - \hat{q}_2(s,a',i+1)}{|\hat{\mathscr{A}}_s|} \end{array}$

That way, all we need to store are the per-simulation biases,
updated after each new simulation
using the streamed mean we saw before^{[WEL]}:

```
function update_per_visit_bias!(state, n_simulations, old_action_value, new_action_value)
    action_count = number_of_actions_with_at_least(state, n_simulations)
    state.per_simulation_bias[n_simulations] = streamed_mean(
        state.per_simulation_bias[n_simulations],
        new_action_value - old_action_value,
        action_count)
end
```

The mathematical derivation is a bit subtler here.

For the sake of succinctness, let’s define a function for the difference between two debiased estimations:

$\Delta_{\hat{q}_3}(s,a,v) \stackrel{\text{def}}{=} \hat{q}_3(s,a,v) - \hat{q}_3(s,a,v+1)$

Since $\hat{q}_3(s,a,v) \xrightarrow{v \to \infty} q_{\pi^*}(s,a)$, we have:

$\hat{q}_3(s,a,v) - \sum_{i=v}^{\infty} \Delta_{\hat{q}_3}(s,a,i) = q_{\pi^*}(s,a)$

$\hat{q}_3(s,a,v) - q_{\pi^*}(s,a) = \sum_{i=v}^{\infty} \Delta_{\hat{q}_3}(s,a,i)$

We can now derive:

$\begin{array}{ll} \sigma^2(s,v) &= \mathbb{E}\left[(\hat{q}_3(s,\cdot,v) - q_{\pi^*}(s,\cdot))^2\right] \\ &= \mathbb{E}\left[ \left(\sum_{i=v}^{\infty} \Delta_{\hat{q}_3}(s,\cdot,i)\right)^2 \right] \\ \end{array}$

Since $\hat{q}_3$ is debiased, $\mathbb{E}\left[\Delta_{\hat{q}_3}(s,\cdot,v)\right] = 0$. Let’s squint at the previous formula a bit and unearth a variance expression:

$\sigma^2(s,v) = \text{var}\left( \sum_{i=v}^{\infty} \Delta_{\hat{q}_3}(s,\cdot,i) \right)$

The changes in action value estimates at different visit counts are mostly uncorrelated with each other, so we can proudly say:

$\begin{array}{ll} \sigma^2(s,v) &= \sum_{i=v}^{\infty} \text{var}\left(\Delta_{\hat{q}_3}(s,\cdot,i)\right) \\ &\approx \sum_{i=v}^{v_{\text{max}}-1} \frac{\sum_{a \in \hat{\mathscr{A}}_s} \Delta_{\hat{q}_3}(s,a,i)^2}{|\hat{\mathscr{A}}_s|-1} \\ \end{array}$

We are now free to keep only the variance of action value changes after each simulation, in the state itself, without polluting each action. Updating the variance is easy if we keep the sum of squares in the state: the update is then just adding in the square of the delta. Whenever we want the variance, we divide that sum by the number of actions that reached this number of simulations.
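A sketch of that bookkeeping, with hypothetical per-state fields standing in for the real data structures:

```julia
mutable struct VisitStats
    sum_sq_delta::Vector{Float64}   # Σ Δ² at each visit count
    visits_per_action::Vector{Int}  # visit count of each action
end

# After a simulation moves an action's debiased value by `delta`
# at visit count v, accumulate the squared change...
record_delta!(s::VisitStats, v::Int, delta) = s.sum_sq_delta[v] += delta^2

# ...and recover the variance by dividing by the number of actions
# that reached that visit count (with Bessel's correction).
function variance_at(s::VisitStats, v::Int)
    n = count(>=(v), s.visits_per_action)
    return s.sum_sq_delta[v] / max(n - 1, 1)
end
```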

That in turn fully allows us to remove action value histories, and the RAM thanks us for it!

The algorithm’s code is open-source.

A major aspect of this optimizer is how generic it is. All it needs is an action value predictor, and it will converge to optimal play on its own. Overall, the structure of this optimizer is like this:

I call it the Action Value Tree Search (AVTS). It is far from the first generic MDP optimizer. AlphaGo is a famous example, although I’d argue it is a bit more complex, as it hinges on plugging in both a policy estimator and a value estimator (whereas we only plug in an action value estimator):

It uses PUCT as its simulation action picker,
which, as we mentioned, seems a bit flawed;
other groups within DeepMind^{[PUCT]}, and even the original authors^{[GMZ]}, investigated alternatives in later papers.

The main issue with relying on a policy estimator is that improving it from a single simulation of the future has no obvious formula.

Indeed, the policy learning algorithm used by AlphaGo
instead makes hundreds of simulations,
each of which updates an action value
which changes how soon the action will be simulated again.
That in turn changes the number of times the action has been visited
among all simulations, to which the refined policy value is proportional.
That is how AlphaGo Zero’s^{[AGZ]}
(and AlphaZero’s^{[AZ]},
and MuZero’s^{[MZ]})
tree policy estimator (in the diagram above) works:

If instead of a policy estimator, the building block we use is an action value estimator, then the formula for refining its value is straightforward and works even with a single simulation. Additionally, it allows the type of elegant Thompson sampling we described, which directly produces a more optimal policy estimation.

Using a policy network stems from a belief that learning the comparative value of actions is easier than learning their impact. I wonder, though, given how good our neural-network-based predictors get, if we can accomplish better results with direct action value estimation.

Moreover, policy estimators are built to compute probabilities over the entire set of a fixed number of possible actions. Soon enough, machine learning systems will operate on virtually unbounded sets of choices. Isn’t your world fairly open-ended, after all? When you go fetch groceries, do you ponder the policy value of driving to each building in the world, no matter how remote?

In order to ground them to true experiences and break the hallucination curse, I believe the next breed of fine-tuning will need to simulate random futures within their latent world model, and optimize for the action values they predict, without exhaustively estimating policies.

- [MDP]: Bellman, R. E. (1957b). A Markov decision process. Journal of Mathematics and Mechanics, 6(5):679–684
- [MCTS]: Rémi Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. 5th International Conference on Computer and Games, May 2006, Turin, Italy. inria-00116992
- [WEL]: Welford, B. P. (1962). “Note on a method for calculating corrected sums of squares and products”. Technometrics. 4 (3): 419–420. doi:10.2307/1266577
- [GREED]: Watkins, C.J.C.H., 1989. “Learning from Delayed Rewards”; cf. chapter 7, page 94.
- [REG]: Auer, P., Cesa-Bianchi, N. and Fischer, P., 2002. “Finite-time analysis of the multiarmed bandit problem”. Machine learning, 47, pp.235-256; cf. Theorem 3.
- [AG]: Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. “Mastering the game of Go with deep neural networks and tree search”. Nature, 529(7587), pp.484-489.
- [AGZ]: Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. “Mastering the game of Go without human knowledge”. Nature, 550(7676), pp.354-359.
- [AZ]: Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T. and Lillicrap, T., 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
- [MZ]: Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T. and Lillicrap, T., 2020. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839), pp.604-609.
- [HOEFF]: Hoeffding, Wassily (1963). “Probability inequalities for sums of bounded random variables”. Journal of the American Statistical Association. 58 (301): 13–30; cf. Theorem 2. doi:10.1080/01621459.1963.10500830
- [PUCT]: Grill, J.B., Altché, F., Tang, Y., Hubert, T., Valko, M., Antonoglou, I. and Munos, R., 2020, November. “Monte-Carlo tree search as regularized policy optimization”. In International Conference on Machine Learning (pp. 3769-3778). PMLR.
- [MIT]: Bertsimas, D. and Paskov, A., 2022. “An exact and interpretable solution to Wordle”. Preprint, submitted September 20.
- [THOM]: Thompson, W.R., 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4), pp.285-294.
- [GMZ]: Danihelka, I., Guez, A., Schrittwieser, J. and Silver, D., 2021, October. Policy improvement by planning with Gumbel. In International Conference on Learning Representations.

It would be nice if, instead of language-specific tweaks, we found an algorithm that made the runtime plummet from half an hour to a second.

It would be nicer if we created an algorithm that achieved perfect play.

It would be even neater if that algorithm was so generic, that it could solve a whole class of similar games.

The strategy we left off in the previous article had merit: it computed the average number of possible solutions left over after making a particular guess. Given the set of solutions $\mathscr{S}$, we compute for each guess $g$:

$\mathscr{R}(g) \stackrel{\text{def}}{=} \frac{ \sum_{s \in \mathscr{S}} |\{s'|s' \in \mathscr{S}, \mathscr{C}(s'|g)=\mathscr{C}(s|g)\}| }{|\mathscr{S}|}$

where $\mathscr{C}(s|g)$ is the constraint that Wordle unveils if the secret solution is $s$ and the guessed word is $g$. For instance, if the secret word is “craft”, and we input the guess “salet”, Wordle will yield the constraint ⬛🟨⬛⬛🟩 (second letter misplaced, last letter correct, other letters absent from the secret word).
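For reference, here is one way the `constraints` function used in the code below might be sketched, encoding the five marks as a base-3 number in $0..242$; the little-endian digit order and the 0-based return value are assumptions chosen to match the `counts[… + 1]` indexing:

```julia
# Computes Wordle's feedback for (guess, solution) as a base-3 integer:
# digit 0 = gray (absent), 1 = yellow (misplaced), 2 = green (correct).
# Duplicate letters follow the usual two-pass rule: greens first,
# then yellows consume the leftover letter counts.
function constraints(guess::Vector{UInt8}, solution::Vector{UInt8})
    marks = zeros(Int, 5)
    leftover = Dict{UInt8,Int}()
    for i in 1:5
        if guess[i] == solution[i]
            marks[i] = 2
        else
            leftover[solution[i]] = get(leftover, solution[i], 0) + 1
        end
    end
    for i in 1:5
        if marks[i] == 0 && get(leftover, guess[i], 0) > 0
            marks[i] = 1
            leftover[guess[i]] -= 1
        end
    end
    return sum(marks[i] * 3^(i - 1) for i in 1:5)
end
```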

We can rephrase the formula using the Iverson bracket (which is 1 if the condition is true, 0 otherwise):

$\mathscr{R}(g) = \frac{ \sum_{s \in \mathscr{S}} \sum_{s' \in \mathscr{S}} ⟦\mathscr{C}(s'|g)=\mathscr{C}(s|g)⟧ }{|\mathscr{S}|}$

Let’s now do a magic trick. We can insert a sum over constraints (defining $\mathscr{C}$ as the set of constraints) if that sum only contains a single term. It is the case below, since a guess always produces the same constraint for a given solution:

$\mathscr{R}(g) = \frac{ \sum_{s \in \mathscr{S}} \sum_{c \in \mathscr{C}} ⟦\mathscr{C}(s|g)=c⟧ \sum_{s' \in \mathscr{S}} ⟦\mathscr{C}(s'|g)=\mathscr{C}(s|g)⟧ }{|\mathscr{S}|}$

We can now swap the summations:

$\mathscr{R}(g) = \frac{ \sum_{c \in \mathscr{C}} \sum_{s \in \mathscr{S}} ⟦\mathscr{C}(s|g)=c⟧ \sum_{s' \in \mathscr{S}} ⟦\mathscr{C}(s'|g)=c⟧ }{|\mathscr{S}|}$

Finally, we notice that the innermost sum is independent of $s$, so we can factorize it:

$\mathscr{R}(g) = \frac{ \sum_{c \in \mathscr{C}} \left(\sum_{s \in \mathscr{S}} ⟦\mathscr{C}(s|g)=c⟧\right)^2 }{|\mathscr{S}|}$

That is much faster to compute: there are only $|\mathscr{C}| = 3^5=243$ possible constraints, while there are $|\mathscr{S}| = 3158$ word solutions. What used to take $|\mathscr{S}|^2$ operations now takes $|\mathscr{S}| + |\mathscr{C}|$. The computation that used to take hours in the previous article now takes a mere second:

```
function average_remaining_solutions_after_guess(guess::Vector{UInt8}, solutions::Vector{Vector{UInt8}})::Float64
    counts = zeros(Int, 243)
    for solution in solutions
        @inbounds counts[constraints(guess, solution) + 1] += 1
    end
    return sum(abs2, counts) / length(solutions)
end
```

Hat tip to Douglas Bates from the Julia community for figuring that out!
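The algebraic shortcut can also be sanity-checked numerically. Here is a Python toy (with a stand-in constraint function, not Wordle’s) confirming that the squared-counts form agrees with the naive double sum:

```python
from collections import Counter

def naive_R(guess, solutions, C):
    # O(|S|^2): average number of solutions compatible with the revealed constraint.
    return sum(sum(C(s2, guess) == C(s, guess) for s2 in solutions)
               for s in solutions) / len(solutions)

def fast_R(guess, solutions, C):
    # O(|S| + |C|): bucket the solutions by constraint, then sum squared bucket sizes.
    counts = Counter(C(s, guess) for s in solutions)
    return sum(n * n for n in counts.values()) / len(solutions)

# Toy stand-in for Wordle's constraint: does the first letter match?
C = lambda secret, guess: secret[0] == guess[0]
words = ["craft", "crane", "salet", "slate"]
assert naive_R("salet", words, C) == fast_R("salet", words, C) == 2.0
```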

From there, building an estimation of the number of remaining guesses $n$ is smooth sailing. We can assume that each guess divides the number of remaining solutions by a constant factor $\rho$. We have $|\mathscr{S}|$ solutions currently, and at the end we have only 1. So before the last guess, we had $\rho$ solutions, and before the guess before that, we had $\rho^2$… All the way back to now, where we have $\rho^{n-1} = |\mathscr{S}|$. Hence $n = 1 + \frac{\log|\mathscr{S}|}{\log(\rho)}$.

```
function estimate_guesses_remaining_statistically(guess::Vector{UInt8}, solutions::Vector{Vector{UInt8}})::Float64
    avg_remaining = average_remaining_solutions_after_guess(guess, solutions)
    nsols = length(solutions)
    # Probability that the guess wins directly,
    # avoiding having to do another guess.
    prob_sol = if guess in solutions
        1 / nsols  # If this pick is a winner, there are no more guesses to make.
    else
        0
    end
    expected_guesses = 1 + log(nsols) / log(nsols / avg_remaining)
    return prob_sol * 1 + (1-prob_sol) * expected_guesses
end
```

Of course, there are multiple ways to estimate the number of remaining guesses. Information theory provides a framework for analysing just how much information we gain from the constraint that Wordle reveals. In information theory, we deal with symbols and study the probability that they appear.

In our case, we have one symbol for each constraint. The probability that a constraint appears is $p(c) = \frac{|\{s|s \in \mathscr{S}, \mathscr{C}(s|g)=c\}|}{|\mathscr{S}|}$. The amount of information gained, in bits, by seeing a constraint is the entropy of Wordle:

$\mathscr{H} \stackrel{\text{def}}{=} -\sum_{c \in \mathscr{C}} p(c) \log_2 p(c)$

At the end, once there are zero bits of information, we know the solution for sure, but we still have to submit it as a guess in order to win. Before that, we were 2 guesses away from a win, and assuming we gain as many bits of information on each guess, we had $\mathscr{H}$ bits of information. Before that, we were 3 guesses away from the end, and had $2 \mathscr{H}$ bits of information, and so forth. At the start, the amount of information we had was $\mathscr{I} \stackrel{\text{def}}{=} \log_2|\mathscr{S}| = (n-1) \times \mathscr{H}$. Hence we obtain:

$n = \frac{-\log_2|\mathscr{S}|}{\sum_{c \in \mathscr{C}} p(c) \log_2 p(c)} + 1$

This approach is inspired by Grant Sanderson, and has the redeeming quality of being efficient to compute even without casting mathematical spells on the formula:

```
function estimate_guesses_remaining_entropically(guess::Vector{UInt8}, solutions::Vector{Vector{UInt8}})::Float64
    counts = zeros(Int, 243)  # Number of times the constraint appears across solutions.
    for solution in solutions
        @inbounds counts[constraints(guess, solution) + 1] += 1
    end
    nsols = length(solutions)
    entropy = 0.0
    for count in counts
        if count == 0
            continue
        end
        # Probability of a constraint appearing after we make this guess.
        prob = count / nsols
        entropy -= prob * log2(prob)
    end
    expected_guesses = log2(nsols) / entropy + 1
    # Probability that the guess wins directly,
    # avoiding having to do another guess.
    prob_sol = if guess in solutions
        1 / nsols
    else
        0
    end
    return prob_sol * 1 + (1-prob_sol) * expected_guesses
end
```

Are those estimations good, though?

Estimators always offer a tradeoff. What matters in the end is not how good an estimator is on its own, but how well it melds into the overall structure.

The first thought you might have on how to use the estimator is to pick its best suggested guess at each step of the game. Here is what we would get:

Estimator | Expected number of guesses to win |
---|---|
Statistical | 3.6450 |
Entropic | 3.5687 |

That is better than the human average of about 4!

The comparison also seems to favor entropic calculations. That could be due to chance, however.

(By the way, beware of comparing any of the figures I give to other articles.
Bot performances depend on the word lists,
which the *New York Times* has changed multiple times,
making the game slightly harder.
Meanwhile, human statistics often have selection bias,
considering that people that fail to solve in 6 guesses
cannot give the true number of guesses they would require,
on top of the obvious bias that people only publish good results.)

Heuristics give a good-enough strategy. We want to go further than “good enough” though. We target optimal play.

With that in mind, the true intent of the estimator is to predict the performance of perfect play. Let’s compare the estimation to the true optimal figure.

Estimator | Mean squared error |
---|---|
Statistical | 0.9932 |
Entropic | 0.1044 |

The entropic estimation wins again; that said, surprisingly, if we restrict the analysis to the best 100 guesses, which are most likely to be useful, we get a reversed picture, which may matter later:

Estimator | Mean squared error |
---|---|
Statistical | 0.0266 |
Entropic | 0.2533 |

Here’s a sneak peek at how each estimator fares within a self-optimizing algorithm.

That algorithm iterates until it finds the optimal choices throughout the whole game, for all possible solutions. What we plot is the optimizer’s best-known playing strategy. It is conservative, not an estimation: it starts out on iteration zero (not depicted) at 3158 guesses (brute-forcing through all possible words). Whenever it has yet to explore a guess, it assumes brute-force, thus offering an upper bound on the optimal strategy which eventually converges to it.

We want that algorithm to converge to the optimal strategy fast, and the number of iterations is a good proxy for how long it takes to achieve a certain level.

The statistical estimator starts out with an advantage, likely because it is more accurate for the better guesses. But that makes it overconfident in its accuracy; meanwhile, the entropic estimator’s evaluated uncertainty encourages it to search through a wider array of options, and it eventually overtakes the statistical one permanently, finding the optimal strategy first.

Was it obvious from the get-go that the entropic estimator would end up on top? It feels to me like there is a bit of luck involved:

- Its formula yields a fast algorithm without requiring a mathematical tour de force;
- Its higher accuracy overall ends up being beneficial, despite being less accurate on the guesses that matter, likely because it predicts pairwise rankings better;
- By sheer luck, on the current wordlist, its top guess is the optimal guess.

In the next article, let’s dig further into the weeds of building a fully game-agnostic self-optimizing algorithm that finds the best strategy fast, using the entropic estimator as a foundation.

“With more careful calculations, one can win; with less, one cannot”
— Sun Tzu, *The Art of War*.

Making extrapolations is crucial to avoid wasting our computing power on slow convergence. After all, if you had to walk to Mount Everest, you wouldn’t eyeball it: you would use a GPS.

Sometimes you have to look away from the GPS and onto the road, though. Sometimes things don’t extrapolate through simple formulae. It was true for 19th-century physicists with the ultraviolet catastrophe; it is true for LLMs too. What we estimate to be true near the center can deviate widely in the far lands…

Smaller models have fewer multiplications. Thus they run faster. Thus they train faster. However, the theory goes, they eventually reach the limit of their capacity for knowledge, and their learning slows, while that of a larger model, with a larger capacity, will overtake them and reach better performance past a given amount of training time.

While estimating how to get the best bang for the buck during training, both OpenAI and DeepMind attempted to draw the Pareto frontier. They don’t state explicitly that they use that theory to draw it; the closest quote that hints at this hidden assumption is from OpenAI:

We expect that larger models should always perform better than smaller models. […] A model with fixed size will be capacity-limited.

This presumption is the bedrock of how they compute the Pareto frontier. In the Chinchilla work, figure 2 shows the training loss of a large number of training runs for models with varying size. At a first glance, those curves follow the theory: the smaller models initially have a lower loss (good), but eventually it slows down, and gets overtaken by the curve from a larger model (bad).

In that chart, they drew grey dots every time they pinpointed the smaller model starting to lose out to a larger model. The grey line, the Pareto frontier, is how they computed their scaling laws.

The problem with this assumption is that we have no idea what would happen if we let the smaller model train for longer, since they stopped its training as soon as it was overtaken.

Enter the LLaMA paper.

Earlier this year, Meta trained four models with varying sizes. Unlike other works, they trained each of them for a very large amount of time; even the smaller ones.

They published the training run curves:

- Each curve first plummets in a **power law**,
- then seemingly enters a **nearly-linear** decrease in loss (corresponding to a fairly constant rate of knowledge acquisition),
- and at the very tip of the curve, they all break this line by **flattening** slightly.

Right off the bat, I want to tackle a subtle misconception that people can have related to the end-of-curve flattening. They are all trained with gradient descent using a variable learning rate (which is, roughly, a hyperparameter for how much to go in the direction of the gradient). To get a good training, they had to constantly decrease the learning rate, so that it can detect ever-subtler patterns in the source material. The formula they use for that decrease is the most widely used: the cosine schedule.

As you can see from the graph, towards the end of the training run, the cosine schedule stops decreasing the learning rate at the speed which yielded such a good, near-linear training loss curve. The slowdown in learning is an artefact of that. The model does not necessarily cease to have the capacity to learn at the same near-linear rate! In fact, if we had more text to give it, we would have stretched the cosine schedule, so its learning rate would have continued to go down at the same rate.
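For reference, here is a minimal sketch of a cosine schedule in Python (the warmup phase is omitted, and the 10% final-rate floor is a common choice, not a value taken from the papers):

```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float, floor: float = 0.1) -> float:
    # Decay from lr_max down to floor * lr_max along half a cosine period.
    lr_min = floor * lr_max
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Stretching the schedule (a larger total_steps) keeps the learning rate
# higher at any given step, which is the effect discussed above:
print(cosine_lr(500, 1000, 3e-4))  # mid-run rate of the short schedule
print(cosine_lr(500, 2000, 3e-4))  # higher: the stretched schedule decays slower
```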

The model’s fitness landscape does not depend on the amount of data we can feed its training; so the change in learning rate decrease is not well-justified.

That is not the main point of this article, though.

The training loss curve can be misleading in another way.
Sure, they are all trained on the same data;
but they don’t go through that data at the same speed.
What we want to know is **not** how sample-efficient the model is
(on this front, the larger model clearly learns more from what it saw).
Let’s picture instead a race:
all those models start at the same time,
and we want to know which one crosses the finish line first.
In other words, when throwing a fixed amount of compute at the training,
who learns the most in that time?

Thankfully, we can combine the loss curves with another piece of data that Meta provided: the amount of time that each model took to train.

Model | GPU-hours | Tokens/second |
---|---|---|
LLaMA1-7B | 82432 | 3384.3 |
LLaMA1-13B | 135168 | 2063.9 |
LLaMA1-33B | 530432 | 730.5 |
LLaMA1-65B | 1022362 | 379.0 |

*(Code for generating the graph here.)*
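These throughput figures can be cross-checked against the published dataset sizes, assuming the tokens/second column is per GPU (as in the LLaMA paper): GPU-hours × tokens/second recovers roughly 1 trillion tokens for the 7B and 13B, and 1.4 trillion for the 33B and 65B.

```python
def tokens_trained(gpu_hours: float, tokens_per_second_per_gpu: float) -> float:
    # Total tokens seen = wall-clock GPU time (in seconds) x per-GPU throughput.
    return gpu_hours * 3600 * tokens_per_second_per_gpu

print(tokens_trained(82432, 3384.3) / 1e12)    # ≈ 1.0 trillion (LLaMA1-7B)
print(tokens_trained(1022362, 379.0) / 1e12)   # ≈ 1.4 trillion (LLaMA1-65B)
```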

Let’s first mention that the whole Chinchilla graph that we saw covers only a small sliver on the left of this graph. In that sliver, we see the same behaviour that Chinchilla documents. Look at the 7B, for instance (which in the Chinchilla graph would actually be among the top two curves in terms of size): it initially drops its loss much faster than the bigger models, then slows down, and the 13B model overtakes it and reaches a loss of 1.9 first.

But then comes a far-lands, unexpected twist: the 7B enters a near-linear regime, with a steep downward trend, and seems on its way to maybe overtake the 13B again? It is hard to tell from that graph what would happen if the 7B were trained for longer.

However, the same behaviour holds between the 13B and the 33B: the initial Chinchilla slowdown again gives way to a near-linear regime, in which the 13B’s loss drops fast! The 33B only surpasses it unfairly, by being granted more than double the compute time.

And the same slowdown-then-speedup occurs between the 33B and the 65B,
to such an extent that the 33B never actually gets overtaken by the 65B.
What the graph shows breaks OpenAI’s and Chinchilla’s assumption:
**the bigger model hasn’t won** (yet).
The slowdown they detected is not actually caused by reaching some capacity limit!

Still, that 7B line is a bit unsatisfactory. If only Meta had trained it for longer…

Suspense over: they did! They released LLaMA 2 this week!

We also, again, got the training times:

Model | GPU-hours | Tokens/second |
---|---|---|
LLaMA2-7B | 184320 | 3031.9 |
LLaMA2-13B | 368640 | 1515.9 |
LLaMA2-34B | 1038336 | 533.7 |
LLaMA2-70B | 1720320 | 322.1 |

Immediately, at a glance, we notice that the training curves don’t match those of LLaMA 1, even when the models are identical. As it turns out, LLaMA 2 was trained on double the context size, and with a longer cosine schedule, which unfortunately has negatively impacted all model sizes. However, smaller models have been impacted worse than larger ones. As a result, the 34B model, which in LLaMA 1 always remained better than the 65B model for any amount of training time spent, now starts out slightly worse than the 70B model, before overtaking it:

More importantly, comparing the training speeds strongly confirms our suspicions from LLaMA 1:

- First, smaller models train faster than bigger ones;
- then, they slow down and are overtaken by larger models (as per Chinchilla);
- BUT THEN, they enter the near-linear regime, in which smaller models descend more steeply into superior knowledge, and they overtake larger models yet again!

A fascinating consequence ties into making the right choices
when starting a training run:
contrary to popular belief, **larger models yield worse results**.
If you had to pick a parameter size and dataset, you might be better off opting
for a 7B model and training for 7 epochs on trillions of tokens.

Look at the near-linear regime of the 7B model, and extrapolate its line to when the 70B model stopped: had the 70B computation been spent on the 7B instead, it would potentially have reached a lower perplexity!

Another thing we notice from LLaMA 2 is that the learning slowdown at the end of the LLaMA 1 curves was indeed an artefact of the cosine schedule. That slowdown is completely absent from the LLaMA 2 training run at the corresponding mark of 1 trillion tokens read.

In fact, maybe the reason that, at that same mark, the LLaMA 2 7B model has a
worse quality than the LLaMA 1 7B model had,
may be because *its cosine schedule is stretched*!

Let’s go back to the Chinchilla paper to argue that point. In appendix A, figure A1, they show an ablation study for various cosine schedule parameters (phrased another way: various ways to stretch the learning rate curve).

They make the point that the lowest loss is achieved when the curve is not stretched. That is supported by the graphs, but we notice something off. After reading 6 million tokens, the training loss at the top is below 2.8; meanwhile, at the same mark, the training loss of the bottom model is still above 2.8. Yet the only difference between the models is the cosine schedule! Because the bottom model was slated to go through more training data, the “unstretched” cosine schedule was computed for a bigger number of steps, which effectively stretches it. If the learning rate had instead followed the schedule assigned to fewer training steps, it would have had a better loss for the same amount of training time.

More broadly, that raises a question that I leave open: if the cosine schedule is not optimal, how should the shape of its tail be instead?

(For context, Hotz raised $5M to improve RX 7900 XTX support and sell a $15K prebuilt consumer computer that runs 65B-parameter LLMs. A plethora of driver crashes later, he almost gave up on AMD.)

There are quite a few issues to overcome, though. While that GPU is great (Stable Diffusion iteration speed per GPU cost is top-tier), a cursory study would be flawed: public GPU benchmarks like TechPowerUp, TomsHardware, etc. give:

- **RX 7900 XTX:** 123 TFLOPS
- **RTX 4090:** 82.58 TFLOPS

Where do the figures come from?

While there is no official breakdown, only official figures, people widely compute it this way:

- For **NVIDIA**: Boost Clock (THz) × CUDA Cores × 2, since the FMA instruction does two floating-point operations (a multiplication and an addition) in 1 CUDA core cycle.
- For **AMD** on RDNA3: Boost Frequency (THz) × Stream processors × 2 (dual issue) × 4 (dot product), as RDNA3 has `V_DUAL_DOT2ACC_F32_F16`, which does two dot products (a×b+c×d+e, 4 operations) in 1 processor cycle.
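Expressed as code, the two formulas look like this (clock in GHz, so the ÷1000 converts GFLOPS to TFLOPS):

```python
def fp16_tflops_nvidia(boost_ghz: float, cuda_cores: int) -> float:
    # FMA: 2 floating-point operations per CUDA core per cycle.
    return boost_ghz * cuda_cores * 2 / 1000

def fp16_tflops_rdna3(boost_ghz: float, stream_processors: int) -> float:
    # Dual issue (x2) of a packed dot product (x4 ops) per stream processor per cycle.
    return boost_ghz * stream_processors * 2 * 4 / 1000

print(fp16_tflops_rdna3(2.5, 6144))     # 122.88 (RX 7900 XTX)
print(fp16_tflops_nvidia(2.52, 16384))  # ≈ 82.58 (RTX 4090)
```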

Name | Price | Processors | Frequency | TFLOPS (FP16) | Perf/€ |
---|---|---|---|---|---|
RX 7900 XTX | €1110 | 6144 | 2.5 GHz | 122.88 | .1107 |
RX 7900 XT | €942 | 5376 | 2.4 GHz | 103.22 | .1096 |
RTX 4090 | €1770 | 16384 | 2.52 GHz | 82.58 | .0467 |
RTX 3060 | €314 | 3584 | 1.78 GHz | 12.76 | .0405 |
RTX 3080 | €905 | 8704 | 1.71 GHz | 29.76 | .0329 |
RTX 3090 | €1500 | 10496 | 1.70 GHz | 35.68 | .0238 |

That is an unjust comparison, though, because AMD’s instruction is more niche than FMA (hitting this performance sweet spot is thus uncommon), and because both of those GPUs have other tricks up their sleeves, yielding superior FLOPS.

The big one on NVIDIA is Tensor Cores. With them, you can run an instruction that multiplies a 4×4 matrix by a 4×8 matrix (page 25) in 1 cycle within a single Tensor Core (32 CUDA cores).

2×4^2×8 (matmul ops) ÷ 1 (cycles) = 256 ops/TC/cycle.

(There is some variation between NVIDIA GPUs on which matrix sizes are supported and on how many cycles the instruction takes, and NVIDIA keeps major aspects of their instruction set secret, but on recent 30- and 40-series, this 256 number seems fairly constant.)

That actually puts the RTX 4090 at 256 × 512 (Tensor Cores) × 2.52 (GHz) ÷ 1K (GHz per teracycle/s) = 330 TFLOPS in FP16… Much higher than the 123 TFLOPS that impressed Hotz on the RX 7900 XTX!

But AMD now has the same trick. In RDNA3, with WMMA, the RX 7900 XTX has an instruction, `V_WMMA_F16_16X16X16_F16`, that does two 16×16 matrix multiplications in 32 cycles, in a single Compute Unit (two sets of 32 threads).

2×16^3 (matmul ops) × 2 ÷ 32 (cycles) = 512 ops/CU/cycle.

This uses the same underlying silicon circuits as `V_DUAL_DOT2ACC_F32_F16`:
the architecture lays out the matrices in Vector General-Purpose Registers.
Each cell of the output matrix is computed by multiplying
one row from input matrix A with one column from input matrix B,
two input cells at a time
(two adjacent input A row cells packed inside the same VGPR,
and two adjacent input B column cells packed together inside another VGPR),
so they can be used by the packed dot product single-cycle instruction.
Within that same instruction, encoded in VOPQ
(a SIMD-like system to execute one operation
on an even register while it executes on an odd one at the same time),
an adjacent output cell also multiplies through its first two input cells
at the same time using dual issue.

The input row has size 16, so those two output cells are completed in 8 cycles. Each two adjacent output cells in their diagonal are computed with 16 parallel threads (on separate stream processors) within the same 8 cycles. We have done two diagonals (32 output cells); there are 14 diagonals left. Inside that Compute Unit, we still have 16 stream processors that we can use; they can handle two more output diagonals within the same 8 cycles.

Once our first four diagonals are computed, we sequentially compute the next 4 diagonals in the next 8 cycles. So forth for the next 4, and the last 4 after that. In total, we have computed the matrix multiplication in 32 cycles, which checks out.

Why can’t we do the matrix multiplication in 16 cycles by using all 64 threads inside of the Compute Unit? Section 7.6 of the instruction set manual indicates:

[Dual issue] is legal only for wave32.

WMMA supports both wave32 and wave64, but it sounds like dual issue is deactivated in wave64, so the instruction would still take 32 cycles, making wave64 an unfavorable (and poorly documented) proposition, I believe.

All in all, using WMMA, the RX 7900 XTX can crank through 512 × 96 (Compute Units) × 2.5 (GHz) ÷ 1K (GHz per teracycle/s) = 123 TFLOPS in FP16…
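Both headline numbers reduce to the same shape of formula — ops per cycle per unit × number of units × clock — which a few lines of Python can verify:

```python
def peak_tflops(ops_per_unit_per_cycle: int, units: int, clock_ghz: float) -> float:
    # ops/cycle/unit x units x GHz yields GFLOPS; divide by 1000 for TFLOPS.
    return ops_per_unit_per_cycle * units * clock_ghz / 1000

print(peak_tflops(256, 512, 2.52))  # ≈ 330.3 (RTX 4090: Tensor Cores)
print(peak_tflops(512, 96, 2.5))    # 122.88 (RX 7900 XTX: WMMA per Compute Unit)
```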

That ends up being less than half the performance of the RTX 4090. The superior number of operations per Compute Unit is offset by the crushingly lower number of cores. Perhaps the AMD strategy is to have the better circuit ready before migrating to the TSMC N5 (“5 nm”) process at a less affordable price.

In practice, the lower performance is less of an issue for AI training, because training workloads are famously limited in parallelization opportunities (even the best training runs typically achieve only 50% GPU utilization at a given time). VRAM bandwidth then matters a lot for large models, and the RX 7900 XTX, despite using GDDR6 instead of GDDR6X, has a higher bandwidth than the RTX 3090, thanks to its faster memory clock. Still, it is also lower than the RTX 4090 on that front (but at a lower price point).

Name | Price | TFLOPS (FP16) | Memory bandwidth (GB/s) | RAM (GB) | TFLOPS/€ | Value (TFLOPS·GB·MB/s/€³) |
---|---|---|---|---|---|---|
RTX 4090 | €1770 | 330 | 1008 | 24 | .186 | 1.4 |
RTX 3060 | €314 | 51 | 360 | 12 | .162 | 7.1 |
RTX 3080 | €905 | 119 | 760 | 10 | .131 | 1.2 |
RX 7900 XTX | €1110 | 123 | 960 | 24 | .111 | 2.1 |
RX 7900 XT | €942 | 103 | 800 | 20 | .109 | 2.0 |
RTX 3090 | €1500 | 143 | 936 | 24 | .095 | 1.0 |

*(The value unit was edited per Scheurneus’ suggestion.)*
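Going by the unit in the header, the value column appears to be TFLOPS × RAM (GB) × bandwidth (in MB/s) divided by the cubed price; a sketch under that reading, which reproduces the table’s figures:

```python
def value_metric(tflops: float, bandwidth_gb_s: float, ram_gb: float, price_eur: float) -> float:
    # TFLOPS x RAM (GB) x bandwidth (MB/s), per cubed euro.
    return tflops * ram_gb * (bandwidth_gb_s * 1000) / price_eur**3

print(round(value_metric(330, 1008, 24, 1770), 1))  # 1.4 (RTX 4090)
print(round(value_metric(51, 360, 12, 314), 1))     # 7.1 (RTX 3060)
```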

Thus the RX 7900 XTX does not technically offer the best TFLOPS per price, as was presumed in Hotz’s raise announcement. But that metric is not crucial for the purpose of building LLM machines; purely looking at hardware, that GPU is a fine choice, in part because it offers fairer RAM per dollar, so it can hold a large model without needing pricier GPUs, while likely reaching reasonable inference speeds.

The other thorns on the side of AMD in AI, though, rear their ugly heads:

- The compilers don’t produce great instructions;
- The drivers crash frequently: ML workloads feel experimental;
- Software adoption is getting there, but kernels are less optimized within frameworks, in particular because of the fracture between ROCm and CUDA. When you are a developer and you need to write code twice, one version won’t be as good, and it is the one with less adoption;
- StackOverflow mindshare is lesser. Debugging problems is thus harder, as fewer people have encountered them.

(I will note, however, that the wealth of information provided by AMD outshines that from NVIDIA tremendously, even though AMD could better popularize those subtleties and explain how to perform specific workloads like BERT training, into which NVIDIA puts welcome care. Just contrast NVIDIA’s matmul page to AMD’s. AMD doesn’t even recognize its own flagship GPUs as supported for ROCm, which is mind-boggling when compared to NVIDIA’s across-the-board CUDA support.)
