
Counting N-Queens Solutions


Based on a report for Markov Chains and Applications at EPFL in 2022 with Jean-Baptiste Michel and Gaiëtan Renault. Original research from Cheng Zhang & Jianpeng Ma (2009). [1]

Got comments? Send them to oiliossatos@gmail.com!

The 8-queens puzzle

Can you place eight queens on a chessboard such that no two queens attack each other? Click on the chessboard below to try the puzzle!

Click to add / remove a queen

Now that you've tried the puzzle, how many different solutions are possible? And more generally, how can we count \(Q(n)\), the number of solutions for \(n\) queens on an \(n\)-by-\(n\) board? That's the problem we'll explore in this post.

Contents

We'll look at three different approaches to counting solutions, each addressing the problems of the last:

  1. Depth-first backtracking
  2. Uniform Monte Carlo integration
  3. Annealed importance sampling

This post contains many interactive visualisations to aid in understanding, accompanied by an implementation in Zig. Please let me know if something breaks on your end.

1. Depth-first backtracking

The following board shows the progression of a depth-first backtracking solution-counting algorithm for the 4-queens problem. Use the buttons below to step through the algorithm (➡️, left) or reset it (🔁, right).

➡️
🔁
Solutions found: 0

Algorithm

The backtracking algorithm works by building the solution one column at a time. We place exactly one queen in each column so that we never have to worry about vertical attacks; only rows and diagonals need checking. When no safe square remains in a column, we backtrack to the previous column.

Here's what that would look like as a recursive function in Zig:


		
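A minimal sketch; the names countSolutions and isSafe are mine, and isSafe (the safety check) is shown in the next snippet.

const std = @import("std");

// Count complete placements by filling one column at a time.
// board[c] holds the row of the queen already placed in column c.
fn countSolutions(board: []usize, col: usize) u64 {
    const n = board.len;
    if (col == n) return 1; // a queen in every column: one full solution

    var count: u64 = 0;
    for (0..n) |row| {
        if (isSafe(board, col, row)) {
            board[col] = row; // place a queen ...
            count += countSolutions(board, col + 1); // ... and recurse
        }
        // nothing to undo: board[col] is simply overwritten on the next try
    }
    return count; // running out of rows is the backtracking step
}

pub fn main() void {
    var board: [8]usize = undefined;
    std.debug.print("Q(8) = {}\n", .{countSolutions(&board, 0)});
}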

Now, to determine whether a square is safe, recall that we don't need to worry about vertical attacks. Also, because we progress one column at a time, we only need to worry about attacks from previous columns. When board stores the row position \(r_i\) of each queen \(i\), checking for horizontal attacks is easy, while a diagonal attack between queens in columns \(c_i\lt c_j\) is characterised by the following equality condition: \[|r_j - r_i| = c_j-c_i.\]

Again, here's a Zig function that implements the check:


		
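A sketch of the check, assuming (as above) that board holds the rows of the queens already placed in columns 0 through col - 1.

// Can a queen be placed at (row, col), given the queens in earlier columns?
fn isSafe(board: []const usize, col: usize, row: usize) bool {
    for (0..col) |c| {
        const r = board[c];
        if (r == row) return false; // horizontal attack: same row
        const dr = if (row > r) row - r else r - row; // |r_j - r_i|
        if (dr == col - c) return false; // diagonal attack: |r_j - r_i| = c_j - c_i
    }
    return true;
}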

Results

Here is the performance I was able to achieve with the above Zig program, running on an Intel Core i5-5250U CPU at 1.6GHz.

\(n\) \(Q(n)\) time (s)
8 92 0.002
9 352 0.010
10 724 0.041
11 2'680 0.239
12 14'200 1.227
13 73'712 6.861
14 365'596 44.773
15 2'279'184 330.301

It is pretty clear that the wall clock time balloons as a function of \(n\). Solution counts for higher board sizes are collected in sequence A000170 of the OEIS, which tops out at \(n=27\).

Conclusion

Since exact results for very large \(n\) are evidently out of reach, the challenge for the rest of the post will be to get a good estimate for the solution count for larger \(n\) in reasonable time.

2. Uniform Monte Carlo integration

Suppose the queens are already on separate columns and rows, preventing horizontal and vertical threats. Picture all the ways of permuting the queens as a vast candidate space \(\mathcal{P}\). We wish to estimate the size of the solution set \(\mathcal{S}\) among the candidates.

Before we get into details, we can see how uniform Monte Carlo integration works with the following interactive simulation. Click the 🎲 buttons to generate random permutations of queens.

🎲
10🎲
100🎲

Rounding the estimate to the closest integer gives an estimate of 2 solutions after enough samples, which concurs with depth-first backtracking.

Algorithm

Uniform Monte Carlo integration estimates \(p(n):=Q(n)/n!\), the fractional size of the solution set \(\mathcal{S}\), by independently and uniformly sampling random permutations \(X_1, X_2, \dots, X_T\) from the candidates \(\mathcal{P}\) and computing the estimator \[\hat{p}(n):=\frac{1}{T}\sum_{t=1}^T \mathbf{1}_\mathcal{S}(X_t).\] This estimator is unbiased because the expected value of the indicator function \(\mathbf{1}_\mathcal{S}\) is \(p(n)\). In the following we also refer to \(p(n)\) as \(p\) for clarity.

In Zig the outline of the algorithm is like this.


    
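A rough sketch, assuming a recent Zig standard library (std.Random; older releases call it std.rand). The validity check isSolution is the O(n) verification described in the next section.

const std = @import("std");

// Uniform Monte Carlo: shuffle the board T times and count how often the
// random permutation happens to be a solution.
fn estimateQ(comptime n: usize, trials: usize, random: std.Random) f64 {
    var board: [n]usize = undefined;
    for (0..n) |c| board[c] = c; // queen in column c starts on row c

    var hits: usize = 0;
    for (0..trials) |_| {
        random.shuffle(usize, &board); // uniform random permutation
        if (isSolution(n, &board)) hits += 1;
    }

    // p-hat = hits / T, and the estimate of Q(n) is p-hat * n!
    var factorial: f64 = 1.0;
    for (1..n + 1) |k| factorial *= @as(f64, @floatFromInt(k));
    const p_hat = @as(f64, @floatFromInt(hits)) / @as(f64, @floatFromInt(trials));
    return factorial * p_hat;
}

pub fn main() void {
    var prng = std.Random.DefaultPrng.init(0);
    std.debug.print("Q(8) is roughly {d:.0}\n", .{estimateQ(8, 1_000_000, prng.random())});
}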

There is a big optimization hiding here thanks to the fact that the queens are already on separate rows and columns. Recall the condition \(|r_j-r_i|=c_j-c_i\) for queens \(i \lt j\). Naïvely checking all pairs for attacks requires \(O(n^2)\) work, but this grows costly as \(n\) increases.

Verifying a solution in O(n)

Instead of checking all pairs, scan over the queens and record which diagonals are occupied. A repeated diagonal means a conflict; no repeats means a valid solution. Using arrays of length \(2n-1\) to mark the major and minor diagonals, each lookup is \(O(1)\) and memory use is \(O(n)\). With this method, the time complexity of verification drops to \(O(n)\).

Here is what that would look like in Zig.


    
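A sketch using two marker arrays of length 2n - 1, one per diagonal direction (the name isSolution matches the sampling sketch above).

// A permutation is a solution iff no diagonal is occupied twice.
fn isSolution(comptime n: usize, board: *const [n]usize) bool {
    var major = [_]bool{false} ** (2 * n - 1); // "/" diagonals, indexed by row + col
    var minor = [_]bool{false} ** (2 * n - 1); // "\" diagonals, indexed by row - col + n - 1
    for (board, 0..) |row, col| {
        const a = row + col;
        const b = row + (n - 1) - col;
        if (major[a] or minor[b]) return false; // diagonal already occupied
        major[a] = true;
        minor[b] = true;
    }
    return true;
}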

Results

Here are the estimates for a fixed \(T=1'000'000\) candidates. The estimates are accompanied by a 95% Wilson confidence interval \((L, U)\) rounded to the closest integer.

\(n\) solutions \(\hat{p}n!\) L U time (s)
10 724 660 571 764 6.018
11 2'680 2'395 1'860 3'082 6.655
12 14'200 14'849 10'461 21'076 7.554
13 73'712 56'043 29'485 106'522 8.421
14 365'596 435'891 186'184 1'020'498 8.623
15 2'279'184 1'307'674 230'830 7'408'050 9.223
16 ? 41'845'579 11'475'330 152'592'188 10.400

While the estimates look reasonable and were computed very quickly, the confidence intervals for \(n\ge 15\) are so large that one is not really inclined to believe the estimate, especially as I was unable to compute the ground-truth number of solutions for \(n=16\) at all. To tighten the confidence intervals we need to increase \(T\), but how much?

Analysis

Say we were content with merely having high confidence in the estimated order of magnitude. That is, we would like \(T\) large enough so that \(\hat{p}\) is with at least 95% probability within a factor 10 of the true \(p\). We can deduce how many samples we need from the Chernoff bound \[\operatorname{Pr}\left[|\hat{p} - p| \geq \epsilon p\right] \leq 2\exp\left(-\frac{\epsilon^2 p T}{3}\right) \leq \delta,\] with \(\epsilon = 9\) and \(\delta=0.05\). This tells us that the probably-correct-order-of-magnitude estimate is achievable with \[T\geq \frac{3}{\epsilon^2 p}\log\left(\frac{2}{\delta}\right)\] samples. But the probability \(p\) appears in the denominator of the lower bound, and it is precisely the quantity we are trying to estimate in the first place; it has no closed form outside of an asymptotic result by Simkin (2021): \[p(n)=\frac{\left((1+o(1))\,ne^{-\alpha}\right)^n}{n!}\approx \frac{(ne^{-\alpha})^n}{n!}\approx\frac{e^{-(\alpha-1)n}}{\sqrt{2\pi n}},\] with \(\alpha\approx 1.944\), where the last approximation uses Stirling's formula. Unfortunately, it follows that we need an exponentially growing number of samples to estimate the correct order of magnitude with uniform sampling.
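To get a feel for the numbers, we can plug the asymptotic \(p(n)\) into the bound (rough, back-of-the-envelope values): \[T\gtrsim \frac{3\log(2/\delta)}{\epsilon^2\, p(n)}\approx\frac{11.1}{81\, p(n)},\] and with \(\alpha\approx 1.944\) this gives \(p(20)\approx 6\times 10^{-10}\), hence roughly \(2\times 10^{8}\) samples already for \(n=20\), and \(p(50)\approx 2\times 10^{-22}\), hence roughly \(10^{21}\) samples for \(n=50\).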

Conclusion

The fundamental problem of integration via uniform random sampling for \(n\)-queens is that the proportion of solutions is so small that for very large \(n\) it becomes unlikely to even see a single solution at all. This is something importance sampling will address.

3. Annealed Importance Sampling

Let's recap. We wanted to find the number of ways of safely placing \(n\) queens on an \(n\)-by-\(n\) chessboard. Exhaustive counting is too slow, and so is uniform Monte Carlo integration of the indicator function of the solution set.

Motivation

There are a few important facts that will let us get around the problem of vanishing \(p(n)\). Here is fact number one, stated abstractly.

  1. Given positive numbers \(A\lt B \lt C\), the fractions involved in the equation \[\frac{A}{C}=\frac{A}{B}\frac{B}{C}\] are larger on the right hand side than the left hand side.

Applying this fact to our problem, we can estimate \(Q(n)/n!\) by constructing a sequence of intermediate fractions whose sizes we can better estimate individually. To do that, we need a way to climb monotonically from \(Q(n)\) to \(n!\).

Interpolating between \(Q(n)\) and \(n!\)

Recall that \(Q(n)\) and \(n!\) may be understood as discrete integrals (sums over \(\mathcal{P}\)) of the indicator functions of \(\mathcal{S}\) and \(\mathcal{P}\), respectively. This hint leads us to fact number two.

  2. An interpolation from \(\mathbf{1}_\mathcal{P}\) to \(\mathbf{1}_\mathcal{S}\) is provided by the Boltzmann function \[X \mapsto \exp{(-\beta f(X))}\] when the inverse temperature \(\beta\) varies from \(0\) to \(\infty\), and \(f\) is a nonnegative energy function that is zero exactly on the solution set \(\mathcal{S}\).[2]

Play with the \(\beta\) slider to see how it tempers the original function, singling out the two solutions (\(n=4\)).


Our choice of energy function \(f\) simply returns the number of diagonal threats. This energy function is nice for a reason that will make sense later.

As a side note, if there are \(k\) queens in a diagonal, then that diagonal contributes \(k(k-1)/2\) threats to the energy function, and the total number of threats (energy) is the sum over all diagonals.

// Number of attacking pairs among k queens on one diagonal: k(k-1)/2.
inline fn pairs(comptime T: type, k: T) T {
    return @divExact(k * (k - 1), 2);
}

// Total number of threats: sum the pair counts over all diagonals,
// given the number of queens on each diagonal.
inline fn threats(diags: []const isize) isize {
    var sum: isize = 0;
    for (diags) |d|
        sum += pairs(isize, d);
    return sum;
}

Returning to the point, let's discuss how to use the intermediate functions to build a ladder from \(Q(n)\) to \(n!\). The discrete integral of the Boltzmann function is a normalisation constant \[Z(\beta)=\sum_{X\in \mathcal{P}}\exp{(-\beta f(X))}\] that yields the desired interpolation between \(Q(n)\) and \(n!\) since \(Q(n)=Z(\infty)\lt \dots \lt Z(0)= n!\). To apply this fact in our estimation, we create a temperature schedule \(0=\beta_1 \lt \beta_2\lt \dots \lt \beta_K\approx \infty\), and compute the number of solutions as follows: \[\frac{Q(n)}{n!}=\prod_{k=1}^{K-1} \frac{Z(\beta_{k+1})}{Z(\beta_k)}=: \prod_{k=1}^{K-1} w_k.\]
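As a tiny sanity check, take \(n=4\): \(Z(0)=4!=24\) and \(Z(\infty)=Q(4)=2\), so whatever schedule we pick, the factors must multiply to \[\prod_{k=1}^{K-1}w_k=\frac{2}{24}=\frac{1}{12};\] the schedule only decides how this overall ratio is split into pieces that are individually easy to estimate.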

Now, the schedule design matters. If \(\beta_k\) and \(\beta_{k+1}\) are close enough, then so are \(Z(\beta_{k})\) and \(Z(\beta_{k+1})\), so each factor \(w_k\) stays close to \(1\) and is easy to estimate; this is why we benefit from balancing the factors. Too close, however, and we waste our time on factors with no impact.

Designing a well-balanced temperature schedule depends on how we estimate the fractions, so we turn our attention to this problem next.

Importance sampling

Instead of uniform sampling, we switch to Boltzmann sampling. To make the switch, introduce the Boltzmann distribution \[\pi_{\beta}(X)=\frac{\exp{(-\beta f(X))}} {Z(\beta)}\] into the formula as follows: \[w_k=\sum_{X\in\mathcal{P}}\exp{\left(-(\beta_{k+1} - \beta_k)f(X)\right)}\pi_{\beta_k}(X).\] This lets us estimate the very same ratio by drawing \(M\) independent samples \(X_1,\dots, X_M\) according to \(\pi_{\beta_k}\) and computing the estimator \[\hat{w}_k=\frac{1}{M}\sum_{m=1}^M \exp{(-(\beta_{k+1}-\beta_k)f(X_m))}.\]

This switch of sampling distributions is an example of importance sampling. The new estimator is unbiased in the sense that \[\mathbb{E}_{\pi_{\beta_k}}\left[\hat{w}_{k}\right]=w_k,\] and it has lower variance than the uniform estimator since lower-energy elements are sampled more often under \(\pi_\beta\) than under the uniform distribution. Either way, our final estimator of \(Q(n)/n!\) is \[\hat{p}(n)=\prod_{k=1}^{K-1}\hat{w}_{k}.\]

But doesn't this mean we have to incur the cost of computing the distribution \(\pi_\beta\) with its \(n!\)-sized support? Actually, the answer is no.

Metropolis algorithm

To sample according to \(\pi_\beta\), we set up a Markov chain whose limiting distribution is theoretically \(\pi_\beta\). Press the 🔥 button to make a random Markov chain transition: a column swap.

🔥

By making enough of these swaps, the permutation of queens becomes more and more shuffled. As a result, the limiting distribution of permutations is uniform, corresponding to \(\pi_0\).

The Metropolis algorithm augments the chain so that the limiting distribution becomes whichever \(\pi_\beta\) we want. It consists of repeating the following two steps for as long as required. The result is a sequence of samples that tends to \(\pi_\beta\) in distribution.

  1. Propose a Markov chain transition.
  2. Accept the transition with probability \(\min\{1, \exp{(-\beta\Delta f)}\}.\)

The energy change \(\Delta f\) is the difference in energy if the transition were accepted. For our number-of-diagonal-threats energy function \(f\), the energy change is computable in \(O(1)\). Additionally, this \(f\) acts as a search heuristic that naturally leads the chain to solutions.
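As a sketch, here is what one such step could look like in Zig. The names metropolisStep and boardEnergy are mine, and this version recomputes the energy from scratch using the threats helper above; the O(1) computation of \(\Delta f\) is discussed further down.

// One Metropolis step at inverse temperature beta: propose a random column
// swap, then accept it with probability min(1, exp(-beta * df)).
fn metropolisStep(comptime n: usize, board: *[n]usize, beta: f64, random: std.Random) f64 {
    const before = boardEnergy(n, board);

    // 1. propose a transition: swap the rows of two random columns
    const i = random.uintLessThan(usize, n);
    const j = random.uintLessThan(usize, n);
    std.mem.swap(usize, &board[i], &board[j]);

    const after = boardEnergy(n, board);
    const df = after - before;

    // 2. accept with probability min(1, exp(-beta * df)); otherwise undo the swap
    if (df > 0.0 and random.float(f64) >= @exp(-beta * df)) {
        std.mem.swap(usize, &board[i], &board[j]);
        return before;
    }
    return after;
}

// Total diagonal-threat energy, recomputed from scratch in O(n) with the
// pairs/threats helpers above.
fn boardEnergy(comptime n: usize, board: *const [n]usize) f64 {
    var major = [_]isize{0} ** (2 * n - 1); // "/" diagonals, indexed by row + col
    var minor = [_]isize{0} ** (2 * n - 1); // "\" diagonals, indexed by row - col + n - 1
    for (board, 0..) |row, col| {
        major[row + col] += 1;
        minor[row + (n - 1) - col] += 1;
    }
    return @as(f64, @floatFromInt(threats(&major) + threats(&minor)));
}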

Press the 🏛️ button to run the Metropolis algorithm.

🏛️
1K🏛️

Notice how the hot distribution converges quickly, while the cold one converges slowly (try ~50k steps). Convergence is slower because bad swaps are discouraged at lower temperatures, and potentially many bad swaps are required to move between solutions.

Designing a temperature schedule

Recall that we wanted a temperature schedule that balances the fractions \(w_k\) as fairly as possible. It turns out that balancing the KL divergence of successive \(\pi_{\beta_k}\) is a simple proxy for this goal.

KL-constant temperature schedule

Suppose we aim to keep the KL divergence fixed at some \(\alpha \gt 0\). Then after running Metropolis it suffices to choose \[\Delta\beta = \sqrt{\frac{2\alpha}{\operatorname{Var}[f]}},\]

and update \(\beta \leftarrow \beta + \Delta\beta\). Here, \(\operatorname{Var}\left[f\right]\) is estimated from the Metropolis samples with Bessel correction. [3] We stop once \(\beta\) surpasses a maximum \(\beta_K := \beta^*\).

Maximum inverse temperature \(\beta^*\)

Theoretically, \(\beta^*\) is a stand-in for \(\infty\), but in practice there is a tradeoff to consider when choosing finite \(\beta^*\).

  1. Small \(\beta^*\) means \(\pi_{\beta^*}\) doesn't put enough weight on \(\mathcal{S}\). This leads to over-estimating \(p(n)\), equivalent to dropping factors \(w_{k\gg 1}\).
  2. Large \(\beta^*\) with finite \(M\): if the Metropolis chain restarts from a high-energy position at each temperature, the energy distribution is skewed upwards, and the extra factors \(\hat{w}_{k\gg 1}\lt1\) begin to drag the estimate down.

We can avoid situation (1) by increasing \(\beta\) until almost all (99.9%) of the Metropolis samples have energy zero. Additionally, we can mitigate situation (2), where the Metropolis warm-up period pushes \(\hat{w}_{k\gg1}\) below one, by re-using the final board state from the previous temperature.

Algorithm

This is the outline of annealed importance sampling.

  1. Decide on \(M\), \(\alpha\). Initialize the board and \(\beta_1=0\).
  2. For \(k=1,2,\dots\)
    1. With Metropolis, sample \(X_1,\dots,X_M\sim \pi_{\beta_k}\)
    2. Pick \(\Delta\beta\) and update \(\beta \leftarrow \beta + \Delta\beta\).
    3. Compute \(\hat{w}_k = \frac{1}{M}\sum_{m=1}^M \exp{(-\Delta\beta f(X_m))}\)
    4. If 99.9% of samples have \(f(X_m)=0\) then break.
  3. Return \(n!\prod_k \hat{w}_k\)

Here's what the outline might look like in Zig.


			
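A rough sketch, reusing the metropolisStep from earlier. The allocator plumbing and the small constant guarding against zero variance are my additions, and here each recorded sample is a single Metropolis step, which is cruder than spacing the samples out along the chain. Returning log Q(n) avoids overflowing n! at larger board sizes.

const std = @import("std");

// Annealed importance sampling: returns an estimate of log Q(n).
fn estimateLogQ(comptime n: usize, m: usize, alpha: f64, random: std.Random) !f64 {
    const allocator = std.heap.page_allocator;
    const energies = try allocator.alloc(f64, m);
    defer allocator.free(energies);

    var board: [n]usize = undefined;
    for (0..n) |c| board[c] = c;

    var beta: f64 = 0.0;
    var log_p: f64 = 0.0; // accumulates sum over k of log w_k

    while (true) {
        // 1. draw M (correlated) samples from roughly pi_beta
        var zeros: usize = 0;
        for (energies) |*e| {
            e.* = metropolisStep(n, &board, beta, random);
            if (e.* == 0.0) zeros += 1;
        }

        // 2. KL-constant step size: d_beta = sqrt(2 * alpha / Var[f])
        var mean: f64 = 0.0;
        for (energies) |e| mean += e;
        mean /= @as(f64, @floatFromInt(m));
        var variance: f64 = 0.0;
        for (energies) |e| variance += (e - mean) * (e - mean);
        variance /= @as(f64, @floatFromInt(m - 1)); // Bessel correction
        const d_beta = @sqrt(2.0 * alpha / (variance + 1e-9)); // guard against Var = 0

        // 3. w_k estimate: average of exp(-d_beta * f) over the samples
        var w: f64 = 0.0;
        for (energies) |e| w += @exp(-d_beta * e);
        log_p += @log(w / @as(f64, @floatFromInt(m)));

        beta += d_beta;

        // 4. stop once 99.9% of the samples are solutions
        if (@as(f64, @floatFromInt(zeros)) >= 0.999 * @as(f64, @floatFromInt(m))) break;
    }

    // log(n!) + sum over k of log w_k
    var log_factorial: f64 = 0.0;
    for (1..n + 1) |k| log_factorial += @log(@as(f64, @floatFromInt(k)));
    return log_factorial + log_p;
}

pub fn main() !void {
    var prng = std.Random.DefaultPrng.init(0);
    const log_q = try estimateLogQ(8, 25_000, 0.001, prng.random());
    std.debug.print("Q(8) is roughly {d:.0}\n", .{@exp(log_q)});
}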

Now let's dive into the implementation of the Metropolis algorithm. We need to compute the change of energy \(\Delta f\) due to a column swap. This only involves looking at eight diagonal indices, so it may be done in \(O(1)\).


		    
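A sketch (the names are mine): major and minor hold the number of queens currently on each diagonal, and the two helpers leave and enter, defined just below, return the change in threats on one diagonal while updating these counts in place.

// Change in the diagonal-threat energy caused by swapping the rows of the
// queens in columns i and j. The counts are updated in place, so a rejected
// move must be undone by swapping back and calling this again.
fn swapDelta(n: usize, board: []const usize, major: []isize, minor: []isize, i: usize, j: usize) isize {
    var delta: isize = 0;
    // both queens leave the diagonals of their current squares ...
    delta += leave(major, board[i] + i) + leave(minor, board[i] + (n - 1) - i);
    delta += leave(major, board[j] + j) + leave(minor, board[j] + (n - 1) - j);
    // ... and re-enter on the diagonals of the swapped squares
    delta += enter(major, board[j] + i) + enter(minor, board[j] + (n - 1) - i);
    delta += enter(major, board[i] + j) + enter(minor, board[i] + (n - 1) - j);
    return delta;
}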

But some of those indices may overlap, for instance when the two queens start on the same diagonal, so we have to count the change in threats from both queens leaving that diagonal simultaneously.


			
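One way to handle this is to let the helpers mutate the diagonal counts as they go, so that the second removal from a shared diagonal already sees the reduced count. A sketch:

// Removing one queen from a diagonal holding k queens changes the threat
// count by pairs(k - 1) - pairs(k) = -(k - 1); adding one changes it by +k.
fn leave(diags: []isize, d: usize) isize {
    diags[d] -= 1;
    return -diags[d]; // -(k - 1)
}

fn enter(diags: []isize, d: usize) isize {
    diags[d] += 1;
    return diags[d] - 1; // +k
}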

Missing here is the code computing \(\Delta\beta\), which is not very interesting.

Results

We deploy the algorithm as follows: fix \(\alpha=0.001\) and take \(M=25'000\) samples per temperature. The table also records \(K\) and \(\beta^*\) out of interest.

\(n\) solutions AIS (balanced KL) \(K\) \(\beta^*\) time (s)
4 2 2 73 15.404 5.577
5 10 10 65 5.563 5.016
6 4 5 83 6.102 6.346
7 40 41 92 8.074 7.272
8 92 97 105 10.242 8.059
9 352 359 112 \(\infty\) 8.770
10 724 850 118 10.319 9.049
11 2'680 2'737 126 10.633 9.552
12 14'200 14'563 133 \(\infty\) 10.463
13 73'712 78'710 141 \(\infty\) 11.038
14 365'596 378'424 149 11.767 11.357
15 2'279'184 2'301'984 153 \(\infty\) 11.610
16 ? 15'307'300 159 \(\infty\) 12.003

Not bad, right? To clear up any confusion: \(\beta^*=\infty\) occurs when all samples have energy zero, which would cause \(\Delta\beta\) to blow up. In my code I added a small constant in the denominator to prevent this, so it actually shows up as \(\beta^*\approx 52\).

So far, we pretended \(Q(16)\) was unknown, but in fact it is known to be 14'772'512. That makes ours a pretty good guess for a 12 second calculation! Let's push our luck and compare results for \(n=27\), the known maximum.

\(n\) solutions AIS (balanced KL) \(K\) time (s)
27 234'907'967'154'122'528 240'409'405'794'640'320 203 14.256

And now, into the unknown!

\(n\) solutions AIS (balanced KL) \(K\) time (s)
50 ? 2.982 \(\times 10^{44}\) 281 20.468
100 ? 3.079\(\times 10^{117}\) 391 27.809
200 ? 2.248 \(\times 10^{293}\) 530 37.658

Conclusion

Lots of open questions here! In particular,

  1. Is there a more efficient balance of \(M\) and \(\alpha\)?
  2. Would a finer analysis for the choice of \(\Delta\beta\) result in better estimates?
  3. How should we adjust the estimate of \(\operatorname{Var}[f]\) given the autocorrelation inherent to Metropolis sampling?

Overall, annealed importance sampling works really well for this problem, allowing us to get estimates that would be impossible under uniform Monte Carlo sampling.

Ending note

Of course, you could swap the \(n\)-queens problem for any high-dimensional estimation problem with an energy heuristic that measures solution quality. Can you think of other places you'd apply this method? Is there another puzzle you have in mind? I'm happy to chat over email at oiliossatos@gmail.com.

Footnotes


  1. Cheng Zhang & Jianpeng Ma. 2009. Counting solutions for the N-queens and Latin-square problems by Monte Carlo simulations.
  2. The appearance of physics vocabulary is explained by the connection of this method to the theory of thermodynamics. This great video is relevant: Simulating and understanding phase change | Vilas Winstein.

  3. Consider consecutive distributions \(\pi_\beta\) and \(\pi_{\beta + \Delta\beta}\). Their forward KL divergence is \[D_{KL}(\pi_\beta \| \pi_{\beta+\Delta\beta}) = \mathbb{E}_{\pi_\beta}\left[\log\frac{\pi_\beta}{\pi_{\beta+\Delta\beta}}\right].\] Plugging in definitions, we can show that \[\log\frac{\pi_\beta(X)}{\pi_{\beta + \Delta\beta}(X)}=\Delta\beta f(X) + \log Z(\beta+\Delta\beta) - \log Z(\beta).\] We are interested in simplifying this expression. Note that \(Z(\beta + \Delta\beta)\) factors as \[Z(\beta + \Delta\beta) = Z(\beta)\mathbb{E}_{\pi_\beta}\left[\exp{(-\Delta\beta f)}\right].\]

    Taking logs on both sides, we have \[\log{Z(\beta + \Delta\beta)} = \log{Z(\beta)} + \log\mathbb{E}_{\pi_\beta}\left[\exp{(-\Delta\beta f)}\right],\] where the second term is the cumulant generating function of \(-f\) under \(\pi_\beta\). Expanding it to second order in \(\Delta\beta\), in terms of familiar statistics we have \[\log{Z(\beta + \Delta\beta)} \approx \log{Z(\beta)} - \Delta\beta \mathbb{E}_{\pi_\beta}\left[f\right] + \frac{1}{2}(\Delta\beta)^2\operatorname{Var}_{\pi_\beta}\left[f\right].\]

    Substituting this into the previous formula yields the following approximation of the KL divergence: \[D_{KL}(\pi_\beta \| \pi_{\beta + \Delta\beta}) \approx \frac{1}{2}(\Delta\beta)^2\operatorname{Var}_{\pi_\beta}\left[f\right],\] which can be solved for \(\Delta \beta\).

Last edited 12/09/2025