Counting N-Queens Solutions
Based on a report for Markov Chains and Applications at EPFL in 2022 with Jean-Baptiste Michel and Gaiëtan Renault. Original research from Cheng Zhang & Jianpeng Ma (2009). [1]
Got comments? Send them to oiliossatos@gmail.com
The 8-queens puzzle
Can you place eight queens on a chessboard such that no two queens attack each other? Click on the chessboard below to try the puzzle!
Click to add / remove a queen
Now that you've tried the puzzle, how many different solutions are possible? And more generally, how can we count \(Q(n)\), the number of solutions for \(n\) queens on an \(n\)-by-\(n\) board? That's the problem we'll explore in this post.
Contents
We'll look at three different approaches to counting solutions, each addressing the problems of the last:
1. Depth-first backtracking, which counts exactly but becomes far too slow as \(n\) grows.
2. Uniform Monte Carlo integration, which is fast but produces estimates too uncertain to trust.
3. Annealed Importance Sampling, which produces useful estimates in reasonable time, even for large \(n\).
This post contains many interactive visualisations to aid in understanding, accompanied by an implementation in Zig. Please let me know if something breaks on your end.
1. Depth-first backtracking
The following board shows the progression of a depth-first backtracking solution-counting algorithm for the 4-queens problem. Use the buttons below to step through the algorithm (➡️, left) or reset it (🔁, right).
Algorithm
The backtracking algorithm works by building the solution one column at a time. We place exactly one queen in each column so that we never have to worry about vertical attacks; only rows and diagonals need checking. When no spaces in a column are available, we backtrack to the previous column.
Here's what that would look like as a recursive function in Zig:
Click to reveal / hide.
fn countSolutions(board: []u8, c: usize, n: usize) usize {
    // All n columns hold a queen: we found one solution.
    if (c == n) return 1;
    var count: usize = 0;
    // Try every row of column c that is not attacked.
    for (0..n) |r| {
        if (isSafe(board, r, c)) {
            board[c] = @intCast(r);
            count += countSolutions(board, c + 1, n);
        }
    }
    return count;
}
Now, to determine whether a square is safe, recall that we don't need to worry about vertical attacks. And because we progress one column at a time, we only need to check attacks from previous columns. Since board stores the row position \(r_i\) of each queen \(i\), checking for horizontal attacks is easy, while a diagonal attack between queens in columns \(c_i\lt c_j\) is characterised by the following equality condition: \[|r_j - r_i| = c_j-c_i.\] For example, queens at \((r, c) = (1, 0)\) and \((3, 2)\) satisfy \(|3 - 1| = 2 - 0\), so they share a diagonal.
Again, here's a Zig function that implements the check:
Click to reveal / hide.
fn isSafe(board: []u8, rj: usize, cj: usize) bool {
    // Compare the candidate square (rj, cj) against the queens
    // already placed in columns 0..cj.
    for (board[0..cj], 0..) |ri, ci| {
        if (ri == rj or
            @max(ri, rj) - @min(ri, rj) == cj - ci)
            return false;
    }
    return true;
}
Results
Here is the performance I was able to achieve with the above Zig program, running on an Intel Core i5-5250U CPU at 1.6GHz.
| \(n\) | \(Q(n)\) | time (s) |
| --- | --- | --- |
| 8 | 92 | 0.002 |
| 9 | 352 | 0.010 |
| 10 | 724 | 0.041 |
| 11 | 2'680 | 0.239 |
| 12 | 14'200 | 1.227 |
| 13 | 73'712 | 6.861 |
| 14 | 365'596 | 44.773 |
| 15 | 2'279'184 | 330.301 |
It is pretty clear that the wall-clock time balloons as a function of \(n\). Solution counts for higher board sizes are collected in sequence A000170 of the OEIS, which tops out at \(n=27\).
Conclusion
Since exact results for very large \(n\) are evidently out of reach, the challenge for the rest of the post will be to get a good estimate for the solution count for larger \(n\) in reasonable time.
2. Uniform Monte Carlo integration
Suppose the queens are already on separate columns and rows, preventing horizontal and vertical threats. Picture all the ways of permuting the queens as a vast candidate space \(\mathcal{P}\). We wish to estimate the size of the solution set \(\mathcal{S}\) among the candidates.
Before we get into details, we can see how uniform Monte Carlo integration works with the following interactive simulation. Click the 🎲 buttons to generate random permutations of queens.
After enough samples, rounding the estimate to the closest integer gives 2 solutions, which agrees with depth-first backtracking.
Algorithm
Uniform Monte Carlo integration estimates \(p(n):=Q(n)/n!\), the fractional size of the solution set \(\mathcal{S}\), by independently and uniformly sampling random permutations \(X_1, X_2, \dots, X_T\) from the candidates \(\mathcal{P}\) and computing the estimator \[\hat{p}(n):=\frac{1}{T}\sum_{t=1}^T \mathbf{1}_\mathcal{S}(X_t).\] This estimator is unbiased because the expected value of the indicator function \(\mathbf{1}_\mathcal{S}\) is \(p(n)\). In the following we also refer to \(p(n)\) as \(p\) for brevity.
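Since each indicator \(\mathbf{1}_\mathcal{S}(X_t)\) is a Bernoulli variable with success probability \(p\), the variance of the estimator follows directly, and it is what controls the width of the confidence intervals reported below: \[\operatorname{Var}[\hat{p}]=\frac{p(1-p)}{T}\approx\frac{p}{T}\quad\text{for small } p,\] so the relative standard error is roughly \(1/\sqrt{pT}\): a tiny \(p\) demands a huge \(T\).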
In Zig, the outline of the algorithm looks like this.
Click to reveal / hide.
const std = @import("std");

fn fill(n: usize, board: []usize) void {
    // Identity permutation: queen i on row i.
    for (0..n) |i|
        board[i] = i;
}

fn uniform_monte_carlo(comptime n: usize, T: usize) usize {
    var rng = std.Random.DefaultPrng.init(0);
    const rand = rng.random();
    var board: [n]usize = undefined;
    fill(n, &board);
    // Enables the O(n) safe check; initialised so that no stale
    // entry can collide with a sample index t.
    var diags = [_]usize{std.math.maxInt(usize)} ** (4 * n - 2);
    var S: usize = 0;
    for (0..T) |t| {
        rand.shuffle(usize, &board);
        if (fastSafe(n, t, &board, &diags))
            S += 1;
    }
    // Number of solutions seen among the T samples.
    return S;
}
There is a big optimization hiding here, thanks to the fact that the queens are already on separate rows and columns. Recall the condition \(|r_j-r_i|=c_j-c_i\) for queens \(i \lt j\). Naïvely checking all pairs for attacks requires \(O(n^2)\) work per sample, which grows costly as \(n\) increases.
Verifying a solution in O(n)
Instead of checking all pairs, scan each queen and record which diagonals were occupied. A repeated diagonal means a conflict; no repeats means a valid solution. Using arrays of length \(2n-1\) for marking the major and minor diagonals, lookup is \(O(1)\) and memory use is \(O(n)\). With this method, time complexity for verification drops to \(O(n)\).
Here is what that would look like in Zig.
Click to reveal / hide.
inline fn fastSafe(
    n: usize,
    t: usize,
    board: []usize,
    diags: []usize,
) bool {
    // diags[i] stores the index of the last sample that touched
    // diagonal i; seeing the current index t twice within one sample
    // means two queens share a diagonal.
    for (0..n) |c| {
        const r = board[c];
        const a = r + c; // major diagonal, indices 0 .. 2n-2
        const b = 3 * n - 2 + r - c; // minor diagonal, indices 2n-1 .. 4n-3
        if (diags[a] == t or diags[b] == t) {
            return false;
        }
        diags[a] = t;
        diags[b] = t;
    }
    return true;
}
Results
Here are the estimates for a fixed \(T=1'000'000\) samples. Each estimate is accompanied by a 95% Wilson confidence interval \((L, U)\), rounded to the closest integer.
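For reference, the Wilson score interval for a proportion estimated from \(T\) samples at the 95% level (\(z\approx 1.96\)) is \[\frac{1}{1+z^2/T}\left(\hat{p}+\frac{z^2}{2T}\right)\;\pm\;\frac{z}{1+z^2/T}\sqrt{\frac{\hat{p}(1-\hat{p})}{T}+\frac{z^2}{4T^2}},\] and the bounds \(L\) and \(U\) below are these endpoints scaled by \(n!\).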
| \(n\) | solutions | \(\hat{p}\,n!\) | \(L\) | \(U\) | time (s) |
| --- | --- | --- | --- | --- | --- |
| 10 | 724 | 660 | 571 | 764 | 6.018 |
| 11 | 2'680 | 2'395 | 1'860 | 3'082 | 6.655 |
| 12 | 14'200 | 14'849 | 10'461 | 21'076 | 7.554 |
| 13 | 73'712 | 56'043 | 29'485 | 106'522 | 8.421 |
| 14 | 365'596 | 435'891 | 186'184 | 1'020'498 | 8.623 |
| 15 | 2'279'184 | 1'307'674 | 230'830 | 7'408'050 | 9.223 |
| 16 | ? | 41'845'579 | 11'475'330 | 152'592'188 | 10.400 |
While the estimates look reasonable and were computed very quickly, the confidence intervals for \(n\ge 15\) are so large that one is not really inclined to believe the estimate, especially as I was unable to compute the ground-truth number of solutions for \(n=16\) at all. To tighten the confidence intervals we need to increase \(T\), but how much?
Analysis
Say we were content with merely having high confidence in the estimated order of magnitude. That is, we would like \(T\) large enough that \(\hat{p}\) lies within a factor 10 of the true \(p\) with probability at least 95%. We can deduce how many samples we need from the Chernoff bound \[\operatorname{Pr}\left[|\hat{p} - p| \geq \epsilon p\right] \leq 2\exp\left(-\frac{\epsilon^2 p T}{3}\right) \leq \delta,\] with \(\epsilon = 9\) and \(\delta=0.05\). This tells us that the probably-correct-order-of-magnitude estimate is achievable with \[T\geq \frac{3}{\epsilon^2 p}\log\left(\frac{2}{\delta}\right)\] samples. But the probability \(p\) appears in the denominator of the lower bound, and it is precisely the thing we are trying to estimate anyway; it has no closed form outside of an asymptotic result by Simkin (2021): \[p(n)=\frac{\left((1+o(1))ne^{-\alpha}\right)^n}{n!}\approx \frac{(ne^{-\alpha})^n}{n!}\approx\frac{e^{-(\alpha-1)n}}{\sqrt{2\pi n}},\] with \(\alpha\approx 1.944\), where the last step uses Stirling's approximation. Unfortunately, it follows that we need an exponential number of samples to estimate the correct order of magnitude with uniform sampling.
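To get a feel for how fast the required \(T\) explodes, here is a small, purely illustrative and standalone Zig snippet (not part of the solver; the name requiredSamples and the printed values are my own) that plugs the Simkin approximation of \(p(n)\) into the sample-size bound above.

const std = @import("std");

// Illustrative only: plug the Simkin approximation of p(n) into the
// bound T >= 3/(eps^2 p) log(2/delta) to see how fast the required
// number of uniform samples grows with n.
fn requiredSamples(n: f64) f64 {
    const alpha: f64 = 1.944;
    const eps: f64 = 9.0; // within a factor 10
    const delta: f64 = 0.05; // 95% confidence
    // log p(n) ~ -(alpha - 1) n - (1/2) log(2 pi n)
    const log_p = -(alpha - 1.0) * n - 0.5 * @log(2.0 * std.math.pi * n);
    return 3.0 / (eps * eps) * @log(2.0 / delta) * @exp(-log_p);
}

pub fn main() void {
    std.debug.print("n = 16: T >= {e:.3}\n", .{requiredSamples(16)});
    std.debug.print("n = 27: T >= {e:.3}\n", .{requiredSamples(27)});
}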
Conclusion
The fundamental problem of integration via uniform random sampling for \(n\)-queens is that the proportion of solutions is so small that for very large \(n\) it becomes unlikely to even see a single solution at all. This is something importance sampling will address.
3. Annealed Importance Sampling
Let's recap. We wanted to find the number of ways of safely placing \(n\) queens on an \(n\)-by-\(n\) chessboard. Exhaustive counting is too slow, and so is uniform Monte Carlo integration of the indicator function of the solution set.
Motivation
There are a few important facts that will let us get around the problem of vanishing \(p(n)\). Here is fact number one, stated abstractly.
- Given positive numbers \(A\lt B \lt C\), the fractions involved in the equation \[\frac{A}{C}=\frac{A}{B}\frac{B}{C}\] are larger on the right hand side than the left hand side.
Applying this fact to our problem, we can estimate \(Q(n)/n!\) by constructing a sequence of intermediate fractions, each of which we can estimate more accurately on its own. To do that, we need a way to climb monotonically from \(Q(n)\) to \(n!\).
Interpolating between \(Q(n)\) and \(n!\)
Recall that \(Q(n)\) and \(n!\) may be understood as discrete integrals of the indicator functions of \(\mathcal{S}\) and \(\mathcal{P}\), respectively. This hint leads us to fact number two.
- An interpolation from \(\mathbf{1}_\mathcal{P}\) to \(\mathbf{1}_\mathcal{S}\) is provided by the Boltzmann function \[X \mapsto \exp{(-\beta f(X))}\] when the inverse temperature \(\beta\) varies from \(0\) to \(\infty\), and \(f\) is a non-negative energy function that is zero exactly on the solution set \(\mathcal{S}\).[2]
Play with the \(\beta\) slider to see how it tempers the original function, singling out the two solutions (\(n=4\)).
Our choice of energy function \(f\) simply returns the number of diagonal threats. This energy function is nice for a reason that will make sense later.
As a side note, if there are \(k\) queens in a diagonal, then that diagonal contributes \(k(k-1)/2\) threats to the energy function, and the total number of threats (energy) is the sum over all diagonals.
inline fn pairs(comptime T: type, k: T) T {
    // Attacking pairs among k queens sharing a diagonal.
    return @divExact(k * (k - 1), 2);
}

inline fn threats(diags: []isize) isize {
    // Total energy: attacking pairs summed over all diagonals.
    var sum: isize = 0;
    for (diags) |d|
        sum += pairs(isize, d);
    return sum;
}
Returning to the point, let's discuss how to use the intermediate functions to build a ladder from \(Q(n)\) to \(n!\). The discrete integral of the Boltzmann function is a normalisation constant \[Z(\beta)=\sum_{X\in \mathcal{P}}\exp{(-\beta f(X))}\] that yields the desired interpolation between \(Q(n)\) and \(n!\) since \(Q(n)=Z(\infty)\lt \dots \lt Z(0)= n!\). To apply this fact in our estimation, we create a temperature schedule \(0 = \beta_1 \lt \beta_2\lt \dots \lt \beta_K\approx \infty\), and compute the number of solutions from the telescoping product \[\frac{Q(n)}{n!}=\prod_{k=1}^{K-1} \frac{Z(\beta_{k+1})}{Z(\beta_k)}=: \prod_{k=1}^{K-1} w_k.\]
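To see why the endpoints are what we claim, note that at \(\beta=0\) every permutation contributes \(1\), while as \(\beta\to\infty\) only the zero-energy permutations, i.e. the solutions, survive: \[Z(0)=\sum_{X\in\mathcal{P}}1=n!,\qquad \lim_{\beta\to\infty}Z(\beta)=\sum_{X\in\mathcal{P}}\mathbf{1}_{\{f(X)=0\}}=|\mathcal{S}|=Q(n).\]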
Now, the schedule design matters. If \(\beta_k\) and \(\beta_{k+1}\) are close enough, then so are \(Z(\beta_k)\) and \(Z(\beta_{k+1})\), and the factor \(w_k\) stays close to \(1\) and is easy to estimate; we therefore benefit from balancing the factors. Too close, however, and we waste our time on factors with no impact.
Designing a well-balanced temperature schedule depends on how we estimate the fractions, so we turn our attention to this problem next.
Importance sampling
Instead of uniform sampling, we switch to Boltzmann sampling. To make the switch, introduce the Boltzmann distribution \[\pi_{\beta}(X)=\frac{\exp{(-\beta f(X))}} {Z(\beta)}\] in the formula as follows: \[w_k=\sum_{X\in\mathcal{P}}\exp{\left(-(\beta_{k+1} - \beta_k)f(X)\right)}\pi_{\beta_k}(X).\] This lets us estimate the very same ratio by drawing \(M\) independent samples \(X_1,\dots, X_M\) according to \(\pi_{\beta_k}\) and computing the estimator \[\hat{w}_k=\frac{1}{M}\sum_{m=1}^M \exp{(-(\beta_{k+1}-\beta_k)f(X_m))}.\]
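The identity above is nothing more than the ratio of normalisation constants rewritten as an expectation under \(\pi_{\beta_k}\): \[w_k=\frac{Z(\beta_{k+1})}{Z(\beta_k)}=\frac{1}{Z(\beta_k)}\sum_{X\in\mathcal{P}}e^{-\beta_{k+1}f(X)}=\sum_{X\in\mathcal{P}}e^{-(\beta_{k+1}-\beta_k)f(X)}\,\frac{e^{-\beta_k f(X)}}{Z(\beta_k)}=\mathbb{E}_{\pi_{\beta_k}}\left[e^{-(\beta_{k+1}-\beta_k)f}\right].\]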
This switch of sampling distributions is an example of importance sampling. The new estimator is unbiased in the sense that \[\mathbb{E}_{\pi_{\beta_k}}\left[\hat{w}_{k}\right]=w_k,\] and it has lower variance than the uniform estimator since lower-energy elements are sampled more often under \(\pi_\beta\) than under the uniform distribution. Either way, our final estimator of \(Q(n)/n!\) is \[\hat{p}(n)=\prod_{k=1}^{K-1}\hat{w}_{k}.\]
But doesn't this mean we have to incur the cost of computing the distribution \(\pi_\beta\) with its \(n!\)-sized support? Actually, the answer is no.
Metropolis algorithm
To sample according to \(\pi_\beta\), we set up a Markov chain whose limiting distribution is theoretically \(\pi_\beta\). Press the 🔥 button to make a random Markov chain transition: a column swap.
By making enough of these swaps, the permutation of queens becomes more and more shuffled. As a result, the limiting distribution of permutations is uniform, corresponding to \(\pi_0\).
The Metropolis algorithm augments the chain so that its limiting distribution becomes any \(\pi_\beta\) we choose. It consists of repeating the following steps for as long as required. The result is a sequence of samples that tends to \(\pi_\beta\) in distribution.
- Propose a Markov chain transition.
- Accept the transition with probability \(\min\{1, \exp{(-\beta\Delta f)}\}.\)
The energy change \(\Delta f\) is the difference in energy if the transition were accepted. For our number-of-diagonal-threats energy function \(f\), the energy change is computable in \(O(1)\). Additionally, this \(f\) acts as a search heuristic that naturally leads the chain to solutions.
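Why does this yield \(\pi_\beta\) in the limit? Because the column-swap proposal is symmetric, the acceptance rule enforces detailed balance with respect to \(\pi_\beta\): for any two permutations \(X, Y\) that differ by one swap, with \(\Delta f = f(Y)-f(X)\), \[\pi_\beta(X)\,\min\{1, e^{-\beta\Delta f}\} = \pi_\beta(Y)\,\min\{1, e^{+\beta\Delta f}\},\] so \(\pi_\beta\) is stationary for the augmented chain.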
Press the 🏛️ button to run the Metropolis algorithm.
Notice how the hot distribution converges quickly, while the cold one converges slowly (try ~50k steps). Convergence is slower because bad swaps are discouraged at lower temperatures, and potentially many bad swaps are required to move between solutions.
Designing a temperature schedule
Recall that we wanted a temperature schedule that balances the fractions \(w_k\) as fairly as possible. It turns out that balancing the KL divergence of successive \(\pi_{\beta_k}\) is a simple proxy for this goal.
KL-constant temperature schedule
Suppose we aim to keep the KL divergence fixed at some \(\alpha \gt 0\). Then after running Metropolis it suffices to choose \[\Delta\beta = \sqrt{\frac{2\alpha}{\operatorname{Var}[f]}},\]
and update \(\beta \leftarrow \beta + \Delta\beta\). Here, \(\operatorname{Var}\left[f\right]\) is estimated from the Metropolis samples with Bessel correction. [3] We stop once \(\beta\) surpasses a maximum \(\beta_K := \beta^*\).
Maximum inverse temperature \(\beta^*\)
Theoretically, \(\beta^*\) is a stand-in for \(\infty\), but in practice there is a tradeoff to consider when choosing finite \(\beta^*\).
- (1) Small \(\beta^*\) means \(\pi_{\beta^*}\) doesn't put enough weight on \(\mathcal{S}\). This leads to over-estimating \(p(n)\), equivalent to dropping the factors \(w_{k\gg 1}\).
- (2) Large \(\beta^*\) with finite \(M\), and the Metropolis chain restarting from a high-energy position, skews the energy distribution, so the extra factors \(\hat{w}_{k\gg 1}\lt1\) begin to drag the estimate down.
We can avoid situation (1) by increasing \(\beta\) until almost all (99.9%) of the Metropolis samples have zero energy. We can mitigate situation (2), where the Metropolis warm-up period keeps the extra factors \(\hat{w}_{k\gg1}\) below one, by re-using the final board state from the previous temperature.
Algorithm
This is the outline of annealed importance sampling.
- Decide on \(M\) and \(\alpha\). Initialize the board and set \(\beta=0\).
- For \(k=1,2,\dots\)
- With Metropolis, sample \(X_1,\dots,X_M\sim \pi_{\beta}\).
- Pick \(\Delta\beta\) and update \(\beta \leftarrow \beta + \Delta\beta\).
- Compute \(\hat{w}_k = \frac{1}{M}\sum_{m=1}^M \exp{(-\Delta\beta f(X_m))}\).
- If 99.9% of the samples have \(f(X_m)=0\), then break.
- Return \(n!\prod_k \hat{w}_k\).
Here's what the outline might look like in Zig.
Click to reveal / hide.
// convenient casting
inline fn F64(x: anytype) f64 {
    return @floatFromInt(x);
}

// annealed importance sampling
fn ais(
    comptime n: usize,
    comptime M: usize,
) f64 {
    // set up random number generator
    var rng = std.Random.DefaultPrng.init(0);
    const rand = rng.random();
    // set up board and diagonal occupancy counts
    var board: [n]usize = undefined;
    fill(n, &board);
    var diags: [4 * n - 2]isize = undefined;
    fillDiags(n, &board, &diags);
    // initialize schedule and log-estimate
    var beta: f64 = 0;
    var estimate: f64 = 0;
    while (true) {
        var energies: [M]usize = undefined;
        metropolis(n, M, rand, &board, &diags, beta, &energies);
        const delta_beta = compute_delta_beta(&energies);
        beta += delta_beta;
        // accumulate log w_k = log((1/M) sum exp(-delta_beta * f))
        var w: f64 = 0;
        for (energies) |f|
            w += @exp(-delta_beta * F64(f));
        estimate += @log(w) - @log(F64(M));
        // stop once 99.9% of samples are solutions (zero energy)
        var sum: usize = 0;
        for (energies) |f| {
            sum += if (f == 0) 1 else 0;
        }
        if (F64(sum) >= 0.999 * F64(M)) break;
    }
    // multiply by n! in log space, then exponentiate
    for (1..(n + 1)) |i|
        estimate += @log(F64(i));
    return @exp(estimate);
}
Now let's dive into the implementation of the Metropolis algorithm. We need to compute the change in energy \(\Delta f\) due to a column swap. This only involves looking at eight diagonal indices, so it can be done in \(O(1)\).
Click to reveal / hide
inline fn metropolis(
    comptime n: usize,
    comptime M: usize,
    rand: std.Random,
    board: []usize,
    diags: []isize,
    beta: f64,
    energies: []usize,
) void {
    var energy: usize = @intCast(threats(diags));
    for (0..M) |m| {
        energies[m] = energy;
        // Pick two distinct columns to swap.
        const c1 = rand.uintAtMost(usize, n - 1);
        const c2 = (c1 + 1 + rand.uintAtMost(usize, n - 2)) % n;
        const r1 = board[c1];
        const r2 = board[c2];
        // Note indices of the diagonals vacated (first four) and
        // entered (last four) by the swap.
        const offset = 3 * n - 2;
        const diag_idxs: [8]usize = .{
            r1 + c1,
            offset + r1 - c1,
            r2 + c2,
            offset + r2 - c2,
            r2 + c1,
            offset + r2 - c1,
            r1 + c2,
            offset + r1 - c2,
        };
        // Change in number of pieces on each diagonal due to the swap.
        const piece_deltas: [8]isize = .{ -1, -1, -1, -1, 1, 1, 1, 1 };
        // Compute energy change due to the swap.
        const energy_delta = step(diags, diag_idxs, piece_deltas);
        // Accept with probability min(1, exp(-beta * delta f));
        // if accepted, commit the swap and update energy and diagonals.
        if (@log(rand.float(f64)) < @min(0, -beta * F64(energy_delta))) {
            const tmp = board[c1];
            board[c1] = board[c2];
            board[c2] = tmp;
            energy = @intCast(@as(isize, @intCast(energy)) + energy_delta);
            for (0..8) |i|
                diags[diag_idxs[i]] += piece_deltas[i];
        }
    }
}
But some of those indices may overlap, for instance when the two queens start on the same diagonal. In that case we have to count the change in threats due to both queens leaving that diagonal simultaneously, which is what the look-ahead in the following function does.
Click to reveal / hide.
inline fn step(
    diags: []isize, // how many queens per diagonal
    diag_idxs: [8]usize, // diagonals involved in the swap
    piece_deltas: [8]isize, // whether those diagonals lose or gain a piece
) isize {
    var energy_delta: isize = 0;
    // Count the total gain or loss of pieces per diagonal
    var diag_read: [8]bool = .{false} ** 8;
    for (0..8) |i| {
        if (diag_read[i]) continue;
        const idx = diag_idxs[i];
        var total_piece_delta = piece_deltas[i];
        // Look ahead to see if diagonal is involved with other piece
        for (i + 1..8) |j| {
            if (diag_idxs[j] == idx and !diag_read[j]) {
                total_piece_delta += piece_deltas[j];
                diag_read[j] = true;
            }
        }
        // Continue if number of pieces stays the same.
        if (total_piece_delta == 0) continue;
        // Compute change in number of threats
        const d_old = diags[idx];
        const d_new = d_old + total_piece_delta;
        energy_delta += pairs(isize, d_new) - pairs(isize, d_old);
    }
    return energy_delta;
}
Missing here is the code computing \(\Delta\beta\), which is not very interesting.
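Still, for completeness, here is a minimal sketch of what it could look like, assuming \(\alpha\) is available as a constant; the small constant added to the variance (mentioned in the results below) keeps \(\Delta\beta\) finite once every sample has zero energy.

const alpha: f64 = 0.001; // KL target per step; a constant assumed for this sketch

fn compute_delta_beta(energies: []const usize) f64 {
    // Sample mean of the energies.
    const m = F64(energies.len);
    var mean: f64 = 0;
    for (energies) |f|
        mean += F64(f);
    mean /= m;
    // Sample variance with Bessel correction.
    var variance: f64 = 0;
    for (energies) |f| {
        const d = F64(f) - mean;
        variance += d * d;
    }
    variance /= m - 1;
    // Keep the estimated KL divergence between successive temperatures
    // near alpha; the small constant prevents a blow-up when the
    // variance is zero.
    return @sqrt(2.0 * alpha / (variance + 1e-9));
}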
Results
We deploy the algorithm as follows: fix \(\alpha=0.001\) and take \(M=25'000\) samples per temperature. The table also records \(K\) and \(\beta^*\) out of interest.
| \(n\) | solutions | AIS (balanced KL) | \(K\) | \(\beta^*\) | time (s) |
| --- | --- | --- | --- | --- | --- |
| 4 | 2 | 2 | 73 | 15.404 | 5.577 |
| 5 | 10 | 10 | 65 | 5.563 | 5.016 |
| 6 | 4 | 5 | 83 | 6.102 | 6.346 |
| 7 | 40 | 41 | 92 | 8.074 | 7.272 |
| 8 | 92 | 97 | 105 | 10.242 | 8.059 |
| 9 | 352 | 359 | 112 | \(\infty\) | 8.770 |
| 10 | 724 | 850 | 118 | 10.319 | 9.049 |
| 11 | 2'680 | 2'737 | 126 | 10.633 | 9.552 |
| 12 | 14'200 | 14'563 | 133 | \(\infty\) | 10.463 |
| 13 | 73'712 | 78'710 | 141 | \(\infty\) | 11.038 |
| 14 | 365'596 | 378'424 | 149 | 11.767 | 11.357 |
| 15 | 2'279'184 | 2'301'984 | 153 | \(\infty\) | 11.610 |
| 16 | ? | 15'307'300 | 159 | \(\infty\) | 12.003 |
Not bad, right? To clear up any confusion: \(\beta^*=\infty\) occurs when all samples have zero energy, which makes the estimated variance vanish and \(\Delta\beta\) blow up. In my code I added a small constant to the denominator to prevent this, so it actually shows up as \(\beta^*\approx 52\).
So far, we pretended \(Q(16)\) was unknown, but in fact it is known to be 14'772'512. That makes ours a pretty good guess for a 12-second calculation! Let's push our luck and compare results for \(n=27\), the known maximum.
| \(n\) | solutions | AIS (balanced KL) | \(K\) | time (s) |
| --- | --- | --- | --- | --- |
| 27 | 234'907'967'154'122'528 | 240'409'405'794'640'320 | 203 | 14.256 |
And now, into the unknown!
| \(n\) | solutions | AIS (balanced KL) | \(K\) | time (s) |
| --- | --- | --- | --- | --- |
| 50 | ? | 2.982 \(\times 10^{44}\) | 281 | 20.468 |
| 100 | ? | 3.079 \(\times 10^{117}\) | 391 | 27.809 |
| 200 | ? | 2.248 \(\times 10^{293}\) | 530 | 37.658 |
Conclusion
Lots of open questions here! In particular,
- Is there a more efficient balance of \(M\) and \(\alpha\)?
- Would a finer analysis for the choice of \(\Delta\beta\) result in better estimates?
- How should we adjust the estimate of \(\operatorname{Var}[f]\) given the autocorrelation inherent to Metropolis sampling?
Overall, annealed importance sampling works really well for this problem, allowing us to get estimates that would be impossible under uniform Monte Carlo sampling.
Ending note
Of course, you could swap the \(n\)-queens problem for any high-dimensional estimation problem with an energy heuristic that measures solution quality. Can you think of other places you'd apply this method? Is there another puzzle you have in mind? I'm happy to chat over email at oiliossatos@gmail.com.
Footnotes
- Cheng Zhang & Jianpeng Ma. 2009. Counting solutions for the N-queens and Latin-square problems by Monte Carlo simulations.
- The appearance of physics vocabulary is explained by the connection of this method to the theory of thermodynamics. This great video is relevant: Simulating and understanding phase change | Vilas Winstein.
- Consider consecutive distributions \(\pi_\beta\) and \(\pi_{\beta + \Delta\beta}\). Their forward KL divergence is \[D_{KL}(\pi_\beta \| \pi_{\beta+\Delta\beta}) = \mathbb{E}_{\pi_\beta}\left[\log\frac{\pi_\beta}{\pi_{\beta+\Delta\beta}}\right].\] Plugging in definitions, we can show that \[\log\frac{\pi_\beta(X)}{\pi_{\beta + \Delta\beta}(X)}=\Delta\beta f(X) + \log Z(\beta+\Delta\beta) - \log Z(\beta).\] We are interested in simplifying this expression. Note that \(Z(\beta + \Delta\beta)\) factors as \[Z(\beta + \Delta\beta) = Z(\beta)\mathbb{E}_{\pi_\beta}\left[\exp{(-\Delta\beta f)}\right].\]
Taking logs on both sides, we have \[\log{Z(\beta + \Delta\beta)} = \log{Z(\beta)} + \log{\mathbb{E}_{\pi_\beta}}\left[\exp{(-\Delta\beta f)}\right],\] where the second term is the cumulant generating function of \(-f\) under \(\pi_\beta\). Hence in terms of familiar statistics we have \[\log{Z(\beta + \Delta\beta)} \approx \log{Z(\beta)} - \Delta\beta \mathbb{E}_{\pi_\beta}\left[f\right] + \frac{1}{2}(\Delta\beta)^2\operatorname{Var}_{\pi_\beta}\left[f\right].\]
Substituting this into the previous formula yields the following approximation of the KL divergence: \[D_{KL}(\pi_\beta \| \pi_{\beta + \Delta\beta}) \approx \frac{1}{2}(\Delta\beta)^2\operatorname{Var}_{\pi_\beta}\left[f\right],\] which, set equal to \(\alpha\), can be solved for \(\Delta \beta\) to give the update rule above.
Last edited 12/09/2025