Applying quantum algorithms to constraint satisfaction problems

Quantum algorithms can deliver asymptotic speedups over their classical counterparts. However, there are few cases where a substantial quantum speedup has been worked out in detail for reasonably-sized problems, when compared with the best classical algorithms and taking into account realistic hardware parameters and overheads for fault-tolerance. All known examples of such speedups correspond to problems related to simulation of quantum systems and cryptography. Here we apply general-purpose quantum algorithms for solving constraint satisfaction problems to two families of prototypical NP-complete problems: boolean satisfiability and graph colouring. We consider two quantum approaches: Grover's algorithm and a quantum algorithm for accelerating backtracking algorithms. We compare the performance of optimised versions of these algorithms, when applied to random problem instances, against leading classical algorithms. Even when considering only problem instances that can be solved within one day, we find that there are potentially large quantum speedups available. In the most optimistic parameter regime we consider, this could be a factor of over $10^5$ relative to a classical desktop computer; in the least optimistic regime, the speedup is reduced to a factor of over $10^3$. However, the number of physical qubits used is extremely large, and improved fault-tolerance methods will likely be needed to make these results practical. In particular, the quantum advantage disappears if one includes the cost of the classical processing power required to perform decoding of the surface code using current techniques.

Many quantum algorithms are known, for tasks as diverse as integer factorisation [92] and computing Jones polynomials [4]. Indeed, at the time of writing, the Quantum Algorithm Zoo website [62] cites 392 papers on quantum algorithms. However, there are relatively few cases known where quantum algorithms substantially outperform their classical counterparts for problems of practical importance, and the runtime of the quantum algorithm has been calculated in detail. Examples include simulating the chemical processes involved in biological nitrogen fixation [84]; breaking cryptosystems based on integer factorisation [58,75] and elliptic curves [86]; quantum simulation of spin systems [37] and electronic structure Hamiltonians [10]. In all of these cases, the underlying quantum algorithm achieves an exponential speedup over its classical counterpart, and under realistic assumptions about the performance of the quantum hardware, can solve a problem in days or weeks that might take decades or centuries on a fast classical computer.
Notwithstanding the extreme practical importance of some of these applications, they share the feature that they are rather special-purpose. While simulation of quantum systems, for example, has a large number of uses [55], there are many problem domains for which it is simply not relevant. Here we focus on problems in the general area of constraint satisfaction and optimisation -an area critical to many different industry sectors and applications -and aim to quantify the likely advantage that could be achieved by quantum computers. We seek to satisfy the following desiderata: 1. (Rigour) There should exist a quantum algorithm * ashley.montanaro@bristol.ac.uk which solves the problem with provable correctness and rigorous performance bounds.

(Broad utility)
The abstract problem solved by the algorithm should be broadly useful across many different applications.

(Performance bounds)
We should compute the performance of the quantum and classical algorithms explicitly for particular problem instances, including all relevant overheads.

(Runtime)
The problem instance used for comparison should be one that can be solved by the quantum computer within a reasonable time (e.g. < 1 day) under reasonable assumptions about the performance of the quantum hardware.

(Benchmarking)
The point of comparison should be one of the best classical algorithms known, running on modern-day hardware.
These points between them seem to put severe restrictions on the ability of quantum computing to achieve a significant performance enhancement. First, the requirement of rigour rules out heuristic algorithms running on current or near-term hardware, such as quantum annealing (e.g. [85]) or the quantum approximate optimisation algorithm [49].
Next, the requirement of broad utility rules out the exponential speedups discussed above. In general, quantum algorithms that are broadly applicable to accelerating classical algorithms tend to achieve at best quadratic speedups (that is, the scaling with problem size of the quantum algorithm's runtime is approximately the square root of its classical counterpart); one famous example is Grover's algorithm [57], which speeds up unstructured search. In other models, such as query complexity, it can even be proven that special problem structure is required to see an exponential quantum speedup [16]. Although even a quadratic speedup will become arbitrarily large for large enough input sizes, choosing an extremely large input size will make the execution time of the quantum algorithm unacceptably long for practical purposes. This motivates the runtime requirement, which is particularly challenging to satisfy because many quantum algorithms (e.g. Grover's algorithm [97]) are inherently serial: they cannot be parallelised without reducing the quantum speedup.
The requirement to compute accurate performance bounds implies that we should take into account not just the performance of the quantum hardware itself (which will in general be slower than modern-day classical hardware) but also the overhead from fault-tolerance, which could correspond to an increase of several orders of magnitude in the number of qubits required, and a concomitant increase in cost and runtime. Table I lists parameters for quantum hardware in various regimes ("Realistic", "Plausible" and "Optimistic"). "Realistic" is based on relatively modest improvements on parameters already demonstrated in experiment. For example, in superconducting qubit systems, 2-qubit gate times of 40ns [15] and measurement times of under 50ns [94] have been demonstrated, and numerical simulations suggest the possibility of 26ns gates [45]; 2-qubit gate error rates of 0.06 have been demonstrated [15] and 0.004 is predicted [45]. In ion traps, 2-qubit gate times of 480ns have been demonstrated [87], as have error rates of 0.001 (at the cost of increasing the gate duration) [13]. The other categories are based on the simple assumption that order-of-magnitude improvements are possible.
One may reasonably query whether this assumption is realistic. Considering gate times, there is quite a wide variation in leading results reported in the literature. Even considering superconducting qubits alone, these include 40ns in a 5-qubit system (2014) [15], 150ns in a 6-qubit system (2017) [63], and 100-250ns in a 19-qubit system (2017) [79]. Classically, within the period 1995-2000 Intel CPUs increased in clock speed by about a factor of 10. In the case of error rates, although error rates of 10 −3 combined with <100ns gate times have not yet been demonstrated, an ultimate error rate of 10 −5 may even be pessimistic, if more exotic technologies such as topological quantum computers come to fruition; an effective error rate of 10 −9 has been assumed elsewhere in the literature for such devices [84]. See [3] for a more detailed performance extrapolation.
Finally, the benchmarking requirement implies that we should not simply compare the quantum algorithm against the simplest or most obvious classical competitor, but should choose a competitor that is one of the fastest options actually used in practice. For example, Grover's algorithm can determine satisfiability of a boolean formula containing n variables with O(2 n/2 ) evaluations of the formula [57], whereas exhaustive search would use O(2 n ) evaluations. However, other algorithms are known which are much faster in practice, for example based on the DPLL method [41,42]. A fair quantumclassical comparison should test against these best algorithms.  corresponds to values reported in the literature as possible now, or predicted in the near future. "Cycle time" is the time required to perform one surface code cycle. Each such cycle comprises four 2qubit gates, possibly two 1-qubit gates and a measurement. These must be performed for each of X and Z, but this can be parallelised to an extent that depends on the relative times required to implement measurements and gates; we therefore only consider one X/Z cycle when estimating the cycle time, and assume that a 1-qubit gate can be implemented in half the time required for a 2-qubit gate.
It is worth pausing to check whether there is hope to achieve a substantial quantum speedup at all while satisfying all the above desiderata. Imagine there exists a family of problems which is exceptionally challenging to solve classically: for a problem instance involving n boolean variables, the best classical algorithm consists of simply evaluating some simple "oracle" function of the variables for ∼ 2 n different assignments. Further assume that this function can be evaluated efficiently on both a classical and quantum computer. For example, we could consider a cryptographic hash function. Such functions are designed to be easy to compute and hard to invert, and in some cases the hash of (e.g.) a 256-bit integer can be computed in under approximately 1000 CPU cycles [21]. Given the overhead required to implement a classical circuit reversibly, it is hard to imagine 1 performing an equivalently complex operation via a quantum circuit in circuit depth less than 1000 (and if this were possible, it is plausible that it would lead to a faster classical algorithm).
Therefore, assume that the quantum circuit depth required to solve an instance of size n is approximately 1000 × 2 n/2 , corresponding to approximately the depth required to execute Grover's algorithm, while the classical runtime is 1000 × 2 n clock cycles. For simplicity, assume the classical computer's clock speed is 1GHz. (This may appear unrealistic, as highperformance computing hardware could be used to solve a problem of this form via parallel computation. However, in the context of a cost comparison between quantum and classical computation, this would correspond to multiplying the cost of the classical computation by the number of parallel processors used. So computing the speedup over one classical processor can be used as a proxy for the cost advantage over multiple processors.) Given no overhead at all for fault-tolerance, considering the gate times in Table I and only problem instances that can be solved in 1 day, we obtain the middle row of Table II.  TABLE II. Likely upper bounds on speedup factors possible for square-root-type quantum algorithms running for at most one day in different regimes, assuming that there is no overhead for fault tolerance, so maximum circuit depths are only determined by gate times.
It is clear that the speedups achieved are very substantial in all cases. An example of a more realistic depth overhead is the quantum circuit for computing the SHA-256 hash function described in [8], which has depth ≈ 5 × 10 5 . Using this example, we achieve a speedup factor between roughly 2 × 10 2 and 4 × 10 6 , depending on assumptions, which is still quite substantial at the high end. Note that, counterintuitively, decreasing the quantum clock speed (equivalently, increasing the oracle circuit depth) by a factor of c reduces the largest speedup that can be achieved in a given time period by a factor of approximately c 2 . This strongly motivates the design of depth-efficient quantum circuits and hardware with high clock speeds. Table II represents an estimate of the best possible speedups for square-root-type quantum algorithms. It remains to attempt to show that significant speedups can actually be achieved for problems of practical interest, which is our focus in this work.

I. OUR RESULTS
In an attempt to satisfy all the above requirements, we focus on two prominent and fundamental NP-complete problems: graph colouring and boolean satisfiability. In the graph colouring problem, we are given a graph G with n vertices, and asked to assign one of k colours to each vertex, such that no pair of adjacent vertices shares the same colour. If no such colouring exists, we should detect this fact. In the boolean satisfiability problem, we are given a boolean formula φ on n variables in conjunctive normal form and asked to find a satisfying assignment to the formula, if one exists. That is, the formula is made up of clauses, where each clause is an OR function of some of the variables (each possibly appearing negated), and we are asked to find an assignment to the variables such that all of the clauses evaluate to true. Here we consider the special case k-SAT, where each clause contains exactly k variables.
Each of these problems has countless direct applications.
In the case of graph colouring, these include register allocation [38]; scheduling [65]; frequency assignment problems [2]; and many other problems in wireless networking [12]. In the case of boolean satisfiability, these include formal verification of electronic circuits [82]; planning [90]; and computer-aided mathematical proofs [64]. We seek a problem instance which can be solved using a quantum computer in one day, but would take substantially longer for a classical computer to solve. This raises the question of how to be confident that the runtime of the classical algorithm is indeed large (we cannot simply run the algorithm, as by definition it would take too long). A strategy to achieve this is to find a family of instances, parametrised by problem size, which can be solved for small problem sizes in a reasonable time, and where the runtime for larger problem sizes can be extrapolated from these.
A straightforward way to satisfy this criterion is to choose instances at random. Another advantage of using random instances is that they are likely to be hard for classical algorithms, as they have no structure that the algorithm can exploit. Indeed, in the case of graph colouring, even random instances on around 80 vertices are already challenging for the best classical algorithms [68]. We use the following models: • k-colouring: pick a uniformly random (Erdős-Rényi) graph on n vertices, where each edge is present with probability 0.5. As n → ∞, the chromatic number χ n,0.5 of such graphs has long been known to be (1 + o(1))n/(2 log 2 n) with high probability [23]. Empirically, the estimate which is based on a small modification to the asymptotic formula in [80], seems to be an excellent predictor of the mean chromatic number of a random graph (see Figure 1). For n ≤ 200, this estimate is at most 24.
• k-SAT: choose m clauses, each of which contains k variables. Each clause is picked independently and uniformly at random from the set of 2 k n k distinct clauses containing k distinct variables. We aim to fix m such that m/n ≈ α k , where α k is the threshold for k-SAT. The threshold is the point α k such that, as n → ∞, a random k-SAT formula on αn clauses will be satisfiable with probability approaching 1 for α < α k , and unsatisfiable with probability approaching 1 for α > α k . It has long been predicted theoretically, and verified experimentally, that random k-SAT instances around the satisfiability threshold will be very challenging [35].
Here we applied efficient quantum algorithms with rigorous performance bounds to these random instances of kcolouring and k-SAT problems, carried out a detailed analysis of their performance (including overheads for fault-tolerance), and compared them against leading classical competitors. We considered two (families of) general-purpose quantum algorithms: Grover's algorithm [57] for accelerating unstructured search, and a quantum algorithm for accelerating the general classical algorithmic technique known as backtracking [76]. Each of these algorithms achieves a near-quadratic reduction in computational complexity compared with its classical counterpart (that is, if the classical runtime is T , the dominant component of the quantum runtime scales like √ T ) and has a rigorous correctness proof.
In the case of k-SAT, we compared the performance of these two algorithms against the performance of the Maple LCM Dist SAT solver, which was the winner of the SAT Competition 2017 2 [14]. We evaluated the performance of this solver on many random instances, for different values of k, to estimate its runtime scaling with the number of variables n. We then calculated the complexity of highly optimised versions of Grover's algorithm and the quantum backtracking algorithm applied to this problem. In order to solve the largest instances possible while meeting the runtime requirement, the algorithms are optimised to perform as many operations in parallel as possible, and hence minimise their circuit depths.
In the case of graph k-colouring, we compared against the commonly used ("de facto standard" [91]) DSATUR algorithm [28] (see Section VI). This is a backtracking algorithm itself, so can be accelerated directly via the quantum backtracking algorithm. In this case, Grover's algorithm is not applicable, as for relevant values of k the runtime of DSATUR is empirically exponentially faster than the O(k n/2 ) runtime scaling that would be achieved by Grover's algorithm applied to k-colouring. In Figure 2 we illustrate the depth overhead of our optimised k-colouring algorithm versus the input size n. For reasonable graph sizes (e.g. n ∈ {100, . . . , 200}) it is less than 4 × 10 6 , and hence not substantially greater than the overhead of the SHA-256 hash function implemented in [8]. We stress, however, that the algorithm has been optimised for depth, and the number of logical qubits that it uses is large ( 10 5 for reasonable problem sizes).
We obtain the result that for both k-SAT and k-colouring, substantial quantum speedups could be possible: in the case of k-SAT, in the most optimistic regime we consider, the speedup could be as large as a factor of over 10 5 , compared with a standard desktop computer. (That is, to solve an instance solved by the quantum algorithm in one day, the classical algorithm running on a standard computer would need over 10 5 days.) In the case of k-colouring, speedups by a factor of over 10 4 could be possible. However, the extent of these speedups is strongly dependent on the details of the hardware parameters, and the overhead for error-correction. In other regimes for these, there is no quantum speedup at all. In addition, the number of physical qubits required to obtain these speedups is very large (e.g. over 10 12 ). This is largely caused by the need for many large "factories" to operate in parallel to produce high-quality magic states, which are used to implement T gates and Toffoli gates fault-tolerantly in the error-correcting code used (the surface code [53]). A related issue is that this speedup does not take into account the cost of classical processing in the quantum error-correction process, which should also be considered to obtain a true cost comparison (see Section VIII). When we include an estimate for the cost of the classical processing power required to perform decoding of the surface code using current techniques, the quantum advantage disappears. Thus, improvements to fault-tolerance methods are likely to be required if such speedups are to be realised in practice.  III. Representative spacetime costs (measured in units of surface code cycles × physical qubits) to implement one Toffoli gate, assuming gate error rate and a circuit of N Toffoli gates. Calculated using method described in Appendix A. 10 24 gates is a generous upper bound on the number of Toffoli gates that can be executed in 1 day (corresponding to > 10 9 qubits at a clock speed of 1GHz).
In order to state our results more precisely, we must describe the model and methodology used to calculate the cost of a quantum computation.

A. Timing and cost model
Here we outline our resource methdology, which follows a model developed by several previous works in this area [3,8,10,53,78]. The model assumes that the quantum computation is encoded using the surface code [53], a quantum errorcorrecting code with properties that make it an excellent candidate for implementation on near-term hardware platforms (e.g. high fault-tolerance threshold, implementability via local operations). Then the cost of a computation can be calculated via the following sequence of steps: 1. Determine the cost of the quantum circuit, in terms of the number of gates used and the circuit depth.
2. Calculate the number of physical qubits required for the logical computation, and the physical depth.
3. Insert hardware-dependent parameters for clock speed and other variables to compute a physical runtime. According to the runtime requirement, this should be at most 1 day, putting a limitation on the problem instance size that can be solved.
4. Use the above to make a comparison between the cost of quantum and classical computation.
When considering the cost of quantum circuits implemented using the surface code, it is helpful to divide the circuit into parts consisting of Clifford gates (which can be implemented relatively straightforwardly) and non-Clifford gates (which cannot). In the circuits that we consider, the non-Clifford gates used are Toffoli and T gates.
Toffoli and T gates can be implemented fault-tolerantly using a state injection technique where a special state is prepared offline (a Toffoli state [48,61] or T state [26]), and then used to implement the corresponding gate. We include in Appendix A an algorithm for computing the costs associated with this, based on the protocol of [48,61] and using the analysis of [78] (see also [3,53]). Some illustrative spacetime costs are shown in Table III for Toffoli gates, which dominate the complexity of the circuits we consider; the values for T gates are similar.
For reasonable parameter ranges for the error rate and the number N of Toffoli gates, and using standard protocols, the number of qubits used by a single Toffoli-factory is between 10 4 and 10 6 , and the depth of the factory is between 100 and 1000 surface code cycles. However, using more factories this process can be parallelised, such that each new magic state is available almost arbitrarily quickly. Using time-optimal methods [51], the limiting factor becomes only the time required to inject a magic state, which is the time taken for a single physical measurement 3 .
The time complexity of the circuit is then governed by its depth, considering only Toffoli or T gates. As a Toffoli gate can be implemented using a single layer of T gates [89] or injected directly from a Toffoli magic state, this is equal to the "T-depth" of the circuit. The T-depth is defined as the number of T-stages in a circuit, where a T-stage is a group of T gates that can be performed simultaneously [7]. Each time step corresponds to the cost of one measurement.
The parts of the circuit corresponding to Clifford gates can also be implemented using state injection by preparing a particular graph state offline, then measuring all the qubits of this state simultaneously. As the results of this measurement only affect the interpretation of subsequent measurement results, not which measurements are performed, it can be performed in parallel with the implementation of a subsequent Toffoli or T gate. Hence Clifford gates do not contribute to the time cost of the circuit.
The drawback of implementing the circuit in this way is that a large number of ancillas are used, though in practice this ancilla cost is still small compared to the size of the magic state factory. Making a detailed analysis of time-optimal implementations of a Grover oracle, we found that the factory comprised 95% − 99% of all physical qubits, so it is safe to assume factory-dominated costs. There is a space-time tradeoff here and we have chosen to minimise time over space.
Some additional aspects which we do not take into account in our cost calculations, for simplicity and because of their hardware-specific nature, are: • Any additional cost required to implement long-range gates. This cost will depend on the underlying hardware (e.g. certain architectures allow long-range gates, while others are restricted to nearest-neighbour interactions), and some apparently "long-range" gates can be implemented efficiently in the surface code (e.g. controlled-NOT gates with multiple targets).
• Any additional cost required to lay out and route qubits physically within a desired spacetime volume. A discussion of these issues can be found in [10].   One way to address point 4 above, and find a basis for comparing the cost of classical and quantum computation, is to consider the cost of the classical processing required to perform the quantum computation (and in particular to carry out the calculations required for fault-tolerance). We discuss this in Section VIII.

B. Summary of results
Now we have described the cost model, we summarise the results obtained in Tables IV to VI. Each table column corresponds to an extrapolation for the maximal instance size n that can be solved by a quantum algorithm in one day, and includes the parameters of this algorithm, and the speedup obtained. These speedups are expressed as a multiple of the likely performance of the DSATUR and Maple LCM Dist algorithms running on a standard desktop computer in the cases of graph colouring and SAT, respectively (see Section VII for more on the classical experimental results, and the assumptions made). We stress that these figures are sensitive to the precise assumptions made about the classical algorithm's scaling and hardware performance, as well as to certain assumptions (detailed below) about the quantum algorithms' performance on random instances 4 . However, any reduction in performance due to a change in these assumptions can be offset by allowing the quantum algorithm to run for longer. If there were no need for fault-tolerance at all, the runtime of the algorithm would be determined only by the time required for 2-qubit gates, which is somewhat faster than the measurement time in Table I, so the speedup factor would likely be somewhat larger.
The results in Tables IV to VI were obtained by using computer programs to calculate the complexity of the various algorithms used (in terms of T-depth and T-count) for different parameter values. We then chose parameters that produced the largest speedups, while respecting the constraint that the quantum algorithm should run for at most one day. For example, in the case of k-SAT and Grover's algorithm, choosing k = 14 led to the largest quantum speedup. All code developed, together with the experimental results for the classical algorithms, is available at [1].
The largest potential speedup factor found is reasonably large in the "Plausible" scenario, and very large in the "Optimistic" scenario; over 10 5 in the case of applying Grover's algorithm to random 14-SAT, and over 4 × 10 4 in the case of determining colourability of a random graph with 144 vertices. However, the number of physical qubits used is very large, which (as discussed in Section VIII) implies a concomitant increase in the cost of classical processing, which could erase this advantage. This strongly motivates the design of improved fault-tolerance strategies. Observe that this overhead could be mitigated somewhat at the expense of allowing a longer runtime.
It is interesting to note that, in the case of k-SAT, the quantum backtracking algorithm achieves worse performance than straightforward use of Grover's algorithm. This is because of lower-order terms in the runtime (cf. Tables VII and IX below); although the backtracking algorithm will be more efficient for large problems, Grover's algorithm is faster for the problem sizes that can be solved in one day.

C. Organisation and notation
In the remainder of this paper, we give the technical details behind the calculations reported in Tables IV to VI. First, in Sections II and III, we describe the variants of Grover's algorithm and the backtracking algorithm that we use. In Section IV, we discuss the detailed implementation decisions and optimisations that go into calculating the backtracking algorithm's complexity in the case of graph colouring. Section V describes the modifications that need to be made to apply the algorithm to k-SAT. Section VI describes the classical DSATUR algorithm, while Section VII gives the results of the classical experiments to determine the empirical complexity of Maple LCM Dist and DSATUR. Section VIII discusses how to estimate the cost of quantum computation in terms of classical processing. We finish in Section IX with conclusions and further discussion.
We use certain notation throughout the paper. All logs are base 2 and [n] denotes the set {1, . . . , n}. We use n for the number of variables in a constraint satisfaction problem, and m for the number of constraints (edges in the case of colouring problems, clauses in the case of k-SAT). In the case of the graph colouring problem, we also write r = log(k + 1) , s = log(n+1) . These represent the number of bits required to store an element of [k] ∪ { * }, {0, . . . , n} respectively.

II. GROVER'S ALGORITHM
Given access to an oracle function f : [N ] → {0, 1}, Grover's quantum search algorithm can be used to find x such that f (x) = 1, or output that no such x exists, using O( √ N ) evaluations of f [24,57], with arbitrarily small failure probability δ. The algorithm is based on "Grover iterations", each of which can be written as DO f , where D is a fixed "diffusion operator" and O f is an oracle operator performing the map |x |y → |x |y ⊕ f (x) . If the size S of the set {x : f (x) = 1} is known in advance, the optimal number of Grover iterations to maximise the success probability can be calculated in advance; otherwise, one can show that running the algorithm using varying numbers of iterations (e.g. exponentially increasing, or random) is sufficient to solve the unstructured search problem. A precise analysis by Zalka [96] of one variant of the algorithm showed that, to achieve failure probability δ, it is sufficient to carry out at most iterations 5 . This is close to optimal, as even under the promise that S = 1, Ω( √ N ) evaluations of f are required to find the unique x such that f (x) = 1 with high probability [19,97]. Here we will choose δ = 0.1, where we obtain an upper bound of 3.642 N iterations can be derived from the tight bound for the special case S = 1, also shown by Zalka [97].) Assuming that N = 2 n for some integer n (as is the case for k-SAT), the diffusion operation can be implemented using a layer of Hadamard gates on every qubit, followed by a Toffoli gate controlled on all n bits, with target an ancilla bit in the state 1 √ 2 (|0 − |1 ), and then another layer of Hadamard gates. The majority of the complexity in the algorithm therefore comes from the purely classical oracle operation f .
1. Fan-out x to m copies.
2. In parallel, for each clause c: set Ac to 1 if x satisfies clause c.
3. Set the output bit to 1 if Ac = 1 for all c.
5. Fan-in x back to 1 copy.
Algorithm 3. Check whether x violates any clause in a k-SAT formula φ with m clauses.
In the case of k-SAT, the oracle needs to output 1 if and only if the input x satisfies all clauses c in φ. This is an mwise AND function of OR functions of k bits each (and some additional NOT operations). There is a straightforward algorithm for implementing this depth-efficiently, which is stated as Algorithm 3. The key point is that to check all the clauses in parallel, x needs to be fanned out to multiple copies, because two gates cannot be performed on the same qubit at the same time. Note that for random formulae, m will usually be an overestimate of how many copies of each bit are required, because clauses that involve disjoint sets of variables can be checked at the same time. In the best case, the clauses could be grouped into approximately mk/n sets of n/k clauses, where the clauses in each set could be checked simultaneously. This would lead to mk/n copies of each bit being required.
We can now calculate the complexity of Grover's algorithm for particular choices of k, m and n. The algorithm consists of Toffoli gates and Clifford gates. We ultimately will be concerned with measuring the Toffoli depth and Toffoli count, with our primary goal being to minimise Toffoli depth (as this controls the runtime of the overall computation).
Each Toffoli gate controlled on c ≥ 2 bits can be implemented using a circuit containing 2c−3 standard Toffoli gates arranged into 2 log c − 1 layers in a tree structure. However, almost half of these gates can be replaced with classically controlled Clifford gates using the "uncompute AND" operation described in [56], to give a circuit containing c − 1 Toffoli gates. This does not change the measurement depth of the resulting circuit, as the classically controlled Clifford operation itself requires a measurement.
The operation of fanning out a single bit b to c copies can be implemented via a tree of CNOT operations of the following form, illustrated for c = 8:  The depth of the circuit is log c , and it uses c − 1 CNOT gates and no other gates. However, in the surface code a fanout operation (equivalently, a CNOT gate with multiple target bits) can be executed in the same time as required for one CNOT gate [53]. So we assign this operation the same cost as one CNOT gate, which in any case is 0, when considering Toffoli depth and Toffoli count. Combining all the components of Algorithm 3, the Toffoli depth is The complexities of the various parts of the oracle operation are summarised in Table VII for a particular choice of parameters. It is obvious that the Toffoli depth is very low, while the Toffoli count seems very high in comparison. It is not clear that this could be reduced by more than an order of magnitude or so, though, given that there needs to be at least one Toffoli gate per clause (controlled on k bits).

III. BACKTRACKING ALGORITHMS
Some of the most effective and well-known classical algorithms for graph colouring and boolean satisfiability are based on the technique known as backtracking [41,42,88,91]. This approach can be applied to any constraint satisfaction problem where we have the ability to rule out partial solutions that have already failed to satisfy some constraints (e.g. partially coloured graphs where two adjacent vertices share the same colour). The basic concept behind this algorithm, in the context of graph colouring, is to build up a colouring by trying to assign colours to vertices. If a conflict is found, the colour is erased and a different colour is tried.
The skeleton of a generic backtracking algorithm that solves a k-ary CSP defined by a predicate P and a heuristic h is given in Algorithm 4. P determines whether a partial assignment already violates a constraint, whereas h determines the next variable to assign a value in a given partial assignment. Define D := ([k] ∪ { * }) n , where the * 's represent the positions which are as yet unassigned values.
Backtracking algorithms can be seen as exploring a tree whose vertices correspond to partial solutions to the CSP. In the case where no solution is found, the runtime of the algorithm is determined by the size T of this tree (along with the cost of computing P and h). A quantum algorithm was Assume that we are given access to a predicate P : D → {true, false, indeterminate}, and a heuristic h : D → {1, . . . , n} which returns the next index to branch on from a given partial assignment.
Return bt( * n ), where bt is the following recursive procedure: bt(x): 1. If P (x) is true, output x and return.
2. If P (x) is false, or x is a complete assignment, return.
(a) Set y to x with the j'th entry replaced with w.
Algorithm 4. General classical backtracking algorithm (see [76]) described in [76] which solves the CSP with bounded failure probability in time scaling with √ T , up to some lower-order terms. Two variants of the algorithm were given, with differing runtimes and preconditions. The first detects the existence of a solution and requires an upper bound on T . The second outputs a solution (if one exists) and does not require an upper bound on T ; the price paid is a somewhat longer runtime.  76]). Let T be the number of nodes in the tree explored by Algorithm 4. Then there is a quantum algorithm which makes O( √ T n 3/2 log n) evaluations of each of P and h, and outputs x such that P (x) is true, or "not found" if no such x exists. If we are promised that there exists a unique x 0 such that P (x 0 ) is true, there is a quantum algorithm which outputs x 0 making O( √ T n log 3 n) evaluations of each of P and h. In both cases the algorithm uses poly(n) space, O(1) auxiliary operations per use of P and h, and fails with probability at most 0.01.
These algorithms are based on the use of a quantum walk algorithm of Belovs [17,18] to search efficiently within the backtracking tree. Their failure probabilities can be reduced to δ, for arbitrarily small δ > 0, at the cost of a runtime penalty of O(log 1/δ).
There have been some subsequent improvements to these algorithms. First, in the case where the classical algorithm finds a solution without exploring the whole tree, its runtime could be substantially lower than T . Ambainis and Kokainis [6] showed that the quantum runtime bound can be improved to O( where T is the number of nodes actually explored by the classical algorithm. Second, Jarret and Wan [60] showed that the runtime of Theorem 2 can be improved to O( √ T n) without the need for any promise on the uniqueness of the solution 6 . Also, the quantum backtracking algorithm has been applied to exact satisfability problems [69], the Travelling Salesman Problem [77], and attacking lattice-based cryptosystems [5,81]. However, none of these works precisely calculates the complexity of the algorithm for specific instances. Filling in the details of the backtracking algorithmic skeleton requires implementing the P and h operations. We now describe how this can be done in the cases of k-colouring and satisfiability: false if there is a pair of adjacent vertices in G that are assigned the same colour by x; and indeterminate otherwise. One natural choice for the heuristic h is to choose the uncoloured vertex which is the most constrained ("saturated"), i.e. the one which has the largest number of adjacent vertices with different colours [28].
• k-SAT: x ∈ {0, 1, * } n represents a partial assignment to variables in the formula. P (x) returns true if x is a complete, valid assignment; false if there exists a clause that x does not satisfy; and indeterminate otherwise. A simple choice of h is to order the variables in advance in some sensible way, and then to return the lowest index of a variable that has not yet been assigned a value.
Here we ordered variables in decreasing order of the number of times they appear in the formula.
In the case of k-SAT, one could also consider dynamic strategies for h (e.g. choosing the variable that occurs in the largest number of clauses in the formula produced by substituting in the assigned values to variables, then simplifying). However, although these could give improved complexities for large instances, they seem likely to lead to larger runtime overheads per operation, so we did not consider them here. Conversely, in the case of k-colouring, one could consider static strategies for h (e.g. choosing the highest-degree uncoloured vertex first). In our experiments, these seemed to be less efficient.
Undertaking a full analysis of the complexity of the backtracking algorithm then requires calculating the complexity of the P and h operations in detail, together with the complexity of the remaining operations in the algorithm. We do this in the next section.

IV. BACKTRACKING ALGORITHM COMPLEXITY OPTIMISATION
For simplicity in the analysis, and to allow for future theoretical developments that may improve the runtime of the algorithm, we consider the simplest version of the backtracking algorithm of [76]: the algorithm that detects the existence of a marked vertex, given an upper bound on the number of vertices T in the backtracking tree. (In practice, it may not be realistic to have access to such an upper bound; however, given a known distribution on problem instances, one could choose a bound that is expected to hold for most instances, for example.) The algorithm is based on the use of a quantum walk to detect a marked vertex within a tree containing T nodes. A marked node corresponds to a valid solution. Abstractly, the quantum walk operates on the Hilbert space H spanned by where r labels the root. The walk starts in the state |r . Let A be the set of nodes an even distance from the root (including the root itself), and let B be the set of nodes at an odd distance from the root. We write x → y to mean that y is a child of x in the tree. For each x, let d x be the degree of x as a vertex in an undirected graph.
The walk is based on a set of diffusion operators D x , where D x acts on the subspace H x spanned by {|x } ∪ {|y : x → y}. The diffusion operators are defined as follows: • If x is marked, then D x is the identity.
• If x is not marked, and x = r, then D |y .
A step of the walk consists of applying the operator Then the detection algorithm from [76] is presented as Algorithm 5.
We will mostly be interested in the complexity of the phase estimation step. The outer repetition step can be performed across multiple parallel runs of the algorithm, and hence does not affect the overall runtime (circuit depth). However, it does affect the overall cost of executing the algorithm, so we give an estimate of K below.

A. Optimisation of phase estimation step
We first observe that the full phase estimation procedure is not actually required in Algorithm 5; it is sufficient to distin-Input: Operators RA, RB, a failure probability δ, upper bounds on the depth n and the number of vertices T . Let β, K, L > 0 be universal constants to be determined.
1. Repeat the following subroutine K times: (a) Apply phase estimation to the operator RBRA on |r with precision β/ √ T n.
2. If the number of acceptances is at least L, return "marked vertex exists"; otherwise, return "no marked vertex".
If there is a marked vertex, there exists an eigenvector |φ of where P χ is the projector onto the span of eigenvectors of R B R A with eigenvalues e 2iθ such that |θ| ≤ χ.
To distinguish between these two cases, we can perform the following algorithm: 1. Attach an ancilla register of m qubits, initially in the state |0 ⊗m , to |r .

2.
Apply H ⊗m to the ancilla qubits.

Apply the operation
4. Apply H ⊗m to the ancilla qubits and measure them. 5. Accept if the outcome is 0 m . The quantum part of the algorithm is the same as the standard phase estimation algorithm, but with the final inverse quantum Fourier transform replaced with Hadamard gates, which are negligible when calculating the overall complexity of the algorithm. The total number of controlled-R B R A operations used is M −1. Expanding |r = j α j |φ j in terms of a basis |φ j of eigenvectors of R B R A with corresponding eigenvalues e 2iθj , where |φ 0 corresponds to eigenvalue 1, the final state before the measurement is so the probability p that the algorithm accepts is We see that, if |r were an eigenvector with eigenvalue 1 (i.e. α 0 = 1) the algorithm would accept with certainty. If there is a marked element, | φ 0 |r | 2 ≥ 1/2 by Claim 3, so the algorithm accepts with probability at least 1/2. In the case where there is no marked element, we split up the sum over k to obtain Upper-bounding these sums now proceeds via a similar argument to standard calculations for phase estimation [39], but we will attempt to find somewhat tighter bounds. We can evaluate by the formula for a geometric series. Clearly sin 2 (M θ j ) ≤ 1, and using sin x ≥ x − x 3 /6 we have .
We can now upper-bound p as where we upper-bound the first sum in (3) using Claim 3 to infer that j,|θj |≤χ |α j | 2 ≤ χ 2 T n, and that µ j ≤ 1; and upper-bound the second sum using j |α j | 2 = 1 and that the upper bound on µ j decreases with θ j in the range [0, π/2]. We now optimise χ to find the best possible bound on p. Assume that χ = a/ √ T n, M = √ T n/b for some constants a, b. Then the upper bound on p becomes .
The minimum over a of a+b/a is 2 √ b, so we obtain an overall bound that p ≤ 2 (1)). In order to achieve a separation from 1/2 in this case, it is sufficient to take b < 1/16; different choices of b allow a tradeoff between the complexity of the phase estimation step, and the number of times it needs to be repeated in Algorithm 5. For the calculations here, we choose b = 1/32. Given this, we can now calculate (numerically) the value of K in Algorithm 5 required to obtain a desired probability of success. For b = 1/32, to achieve failure probability δ = 0.1 it is sufficient to take K = 79. In this case the overall cost of the algorithm, in terms of uses of controlled-R B R A , is at most and the number of sequential uses of R B R A in the circuit is at most 32 √ T n. The above algorithm assumes that M is a power of 2, but the algorithm can easily be modified to handle the case where it is not, by replacing the use of H ⊗m with an operation creating a uniform superposition over M < 2 m elements. The cost of this operation and its inverse is negligible in the context of the overall algorithm. Note that the "optimal" variant of phase estimation where a different input state is used (rather than |+ ⊗m ) does not seem to achieve a significantly better bound when m is close to its minimal value (see e.g. the analysis in [36]).

B. Efficient parallel implementation of RA and RB
To implement the R A and R B operations efficiently requires some additional work. A full description of how these operations can be implemented was given in [76] (Algorithm 3). Here, in Algorithm 6 we describe an essentially equivalent algorithm for the case of k-colouring which can be implemented more efficiently, and which computes operations in parallel where possible. The algorithm as used for k-SAT is similar, but simpler, and the modifications required for this are described in Section V.
In Algorithm 6, D corresponds to inversion about the state |ψ = i∈{ * }∪[c+1] |i , and D corresponds to inversion about a state |ψ of the form |ψ = α| * + β √ k i∈[c+1] |i . The algorithm implements R A ; the operation R B is similar, but slightly simpler because the alternative diffusion map D does not need to be performed. Correctness can be checked by tracking the effect of the algorithm on a computational basis state; we omit the routine details.
The main points of comparison between Algorithm 6 and the algorithm presented in [76] are: • Both algorithms need to convert between two input representations: one of the form (i 1 , v 1 ), . . . , (i , v ), representing that variable i j is assigned value v j , and one of the form x ∈ D, representing a partial assignment with * 's indicating unassigned variables. This is necessary because the first form allows efficient assignment of a value to a variable, while the second allows efficient checking against constraints. In Algorithm 6, this conversion is performed once for both P and h, as opposed to being done internally to each of P and h separately.
• The most complicated operations in the algorithm are P and h. In Algorithm 6, they are executed in parallel, and each of them in turn contains some operations performed in parallel. To achieve this, copies of the input need to be produced using fan-out operations.
• At each step of the algorithm, it is necessary to diffuse over neighbours of nodes in the backtracking tree. The algorithm in [76] achieves this by checking whether P (x ) is false for each of the children x of the current node x in turn, and only diffusing over those children for which this is not the case. For simplicity and efficiency, Algorithm 6 defers this check to when the child nodes themselves are explored.
• The standard backtracking algorithm would have k children for each node in the backtracking tree, corresponding to the k colours available. However, to determine colourability, it is more efficient (and equivalent) to 1. Convert input to x ∈ D, stored in ancilla register. If is odd, ignore the pair (i , v ) and instead swap i with h and swap v with a.
3. In parallel: Evaluate the number of colours used in x, stored in c register, and if is even, evaluate P (x), stored in p register; if is even, evaluate h(x), stored in h register; 4. If a = * , subtract 1 from .
5. If p = F, invert the phase of the input.
6. If p = ?, and > 0, apply diffusion map D to a to mix over c + 1 colours; if p =?, and = 0, apply diffusion map D to a to mix over c + 1 colours.
9. Fan-in x back to 1 copy. 10. Unconvert x. If is odd, ignore the pair (i , v ) and instead swap a with v and swap h with i . Algorithm 6. RA operation for k-colouring (optimised and parallelised version of algorithm in [76]). Details of how the number of colours is stored are given in Section IV H.
only allow the algorithm to choose a colour between 1 and c + 1 for each subsequent node to be coloured, where c is the number of colours currently used. In particular, this enforces the constraint that all colourings considered include all colours between 1 and c , for some c . This optimisation is used in the classical DSATUR algorithm.
It remains to describe the steps of Algorithm 6 in more detail, including the implementation of the P and h operations for particular problems. The overall complexities of the various steps found are summarised in Table VIII and IX in the cases of graph colouring and SAT, respectively, for choices of parameters close to the limit of problem size that can be solved in one day. The total scaling of the T-depth of R A with problem size in the case of graph colouring is illustrated in Figure 2.

C. Calculating circuit complexities
With the exception of step 6, all of the operations in R A are completely classical, and can be described in terms of Toffoli gates controlled on m ≥ 0 bits (incorporating NOT and CNOT gates). We use the same depth-efficient construction of Toffoli gates controlled on m bits that was discussed in the context of Grover's algorithm. When R A is used, it is a con-   trolled operation itself, so we must add 1 to the number of control lines used by all of its gates. In particular, to compute the T-depth of the overall circuit we need to keep track of the "CNOT depth" of R A . In the description of R A below, for readability we do not include the additional control line required, but we do include it when discussing T-depth and Toffoli count. Also note that in some cases below, we do not give formulae for the T-depth and Toffoli count, but compute this via a program, available at [1]. Throughout the implementation, we represent an element of the set [k] ∪ { * } as a r := log(k + 1) -bit string, where * corresponds to 0 r . The ?, F, T values that the p register can take are represented as 00, 01, 10, respectively. Where algorithms for steps state usage of ancillas, this only encompasses those required to describe the algorithm (i.e. subroutines within the steps may use additional ancillas).

D. Conversion of input
In the first step of the algorithm, we need to convert an input of the form (i 1 , v 1 ), . . . , (i n , v n ) to a string x ∈ D. We also need to ignore a particular pair (i , v ) if is odd. An Input: n pairs (ij, vj), h, a; ij, h ∈ {0, . . . , n}, vj, a ∈ [k] ∪ { * }. Output: x ∈ D. Ancillae: n×(n−1)×(r +s)-qubit register A; n×n×r-qubit register B.
1. Fan-out each of the n pairs ij, vj to n copies, stored in register A and the input register.
2. In parallel, for each pair p, q ∈ [n]: (a) If is even or = p, and ip = q, copy vp to Bpq; (b) If is odd and = p, swap ip with h and swap vp with a.
3. In parallel, for each q: copy p Bpq to xq.

Uncompute B and then A.
Algorithm 7. Conversion of input.
algorithm to do all of this is described as Algorithm 7. The main idea behind the algorithm is to create an n × n array B where B pq stores v p if i p = q, and is otherwise zero (so, for each p, B pq is nonzero for at most one q). Then, for each q, if such a nonzero element B pq exists it is copied to x q . We now describe each of the operations in Algorithm 7 in more detail, and calculate their complexities. First, the fanout and fan-in operations are implemented in the same way as in Grover's algorithm. In step 1 of Algorithm 7, each bit can be fanned out in parallel, so this operation corresponds to n(r + s) parallel fan-outs of 1 bit to n bits.
To implement step 2, we can first use a Toffoli gate controlled on s bits with a target of one ancilla bit to store whether = p (in the case of odd p; for even p, this is omitted). Then to carry out step 2a, we perform r parallel bit copy operations controlled on this bit being 0 and on i p = q (corresponding to a Toffoli controlled on s+2 bits).
Step 2b then consists of two controlled-swap operations, each controlled on the ancilla bit, which can be performed in parallel. Swapping pairs of bits can be implemented as 3 CNOT gates. At the end of step 2, we uncompute the ancilla bit.
For step 3, we need to copy the one non-zero element of the n-element set S q := {B pq : q = q} to x q , for each q. We could achieve this by a binary tree of addition operations (qv) but a simpler method is to use a tree of CNOT gates. We consider each set S q in parallel, and each bit within elements of this set in parallel too, corresponding to a sequence of n bits containing at most one 1. For sequences of this form, their sum is just the same as their sum mod 2. So, for each pair of bits (b i , b j ), we can copy their sum into an ancilla qubit with the circuit Using a binary tree of such addition operations, we can set each bit of x q correctly in depth 4 log n (including the cost of uncomputing to reset the ancillas), where all gates are CNOT gates. As precisely n − 1 bitwise summation opera-Input: n copies of x ∈ D. Output: ?, F, T (represented as 00, 01, 10) Ancillae: m-qubit register A, n-qubit register B.
1. In parallel: for each edge e = (i, j) in G, check whether xi = xj and xi = * . Set Ae to 1 if so (corresponding to edge e being violated).
2. Set the second bit of the output register to 1 if any of the bits of A are equal to 1.
4. If the second bit of the output register is 0, and Bi = 0 for all i, set the first bit of the output register to 1.

Uncompute B and A.
Algorithm 8. P (x): Checking violation of a constraint.
tions take place, the circuit uses n − 1 ancillas and 4(n − 1) CNOT gates for each bit in each q. These quantities are multiplied by nr to obtain the overall ancilla and gate complexities.
In step 4 we finally need to uncompute the B and A registers, with the same complexities as computing them.
For large n, the T-depth of the circuit is dominated by steps 1, 3 and 4, and is approximately 6 log n . This will turn out to be negligible in the context of the overall algorithm.

E. Evaluation of P (x): checking violation of a constraint
The algorithm for evaluating the predicate P in the case of graph colouring is presented as Algorithm 8. The goal is to check whether, given a partial assignment of colours to vertices in a graph G, there exists an edge of G whose endpoints are assigned the same colour. The output should be false if so, true if there is no such edge and all vertices are assigned a colour, and undetermined otherwise. We can hard-code the edges of G into a quantum circuit for checking this, making the algorithm quite straightforward. Steps 1 and 2 check whether any constraint is violated, while steps 3 and 4 check whether the partial assignment is complete. As we have access to n copies of x, we can perform all the checks in parallel for all edges, using a decomposition of the m ≤ n 2 edges into at most n matchings, where the edges in each matching can be checked in parallel.
In step 1, checking equality of 2 bits a, b can be done using 3 ancillas and 3 Toffoli gates, via the following circuit: These equality checks can be done in parallel across all the bits of x i , x j , followed by an Toffoli gate controlled on the r Input: kn copies of x ∈ D. Output: Identity of the vertex labelled * whose adjacent vertices are labelled with the largest number of different colours. Ancillae: n × n × k-qubit register A; n r-qubit registers Bi.
1. In parallel: for each triple (i, j, c) such that i is adjacent to j, set Aijc = 1 if xj = c.
2. For each i, set Bi = c j Aijc. equality test bits. The check x i = * corresponds to at least one of the bits of x i not being equal to 0, which can be checked using a similar Toffoli, some NOT gates, and one more ancilla.

Copy maxi
We then use one more Toffoli gate to write the AND of these into bit A e . We do not need to uncompute the ancilla bits used in these steps, as we will do this in step 5.
Step 2 can be implemented using a Toffoli gate controlled on m qubits; step 3 can be implemented via n Toffoli gates in parallel, each controlled on r bits; step 4 can be implemented using a Toffoli gate controlled on n + 1 bits (together with some NOT gates in each case).

F. Evaluation of h(x): choosing the next vertex
The algorithm for computing h(x) in the case of kcolouring is presented as Algorithm 9. The goal of the algorithm is to output the uncoloured vertex that is the most constrained or "saturated" (has the largest number of adjacent vertices assigned distinct colours).
Each of the operations performed in step 1 of the algorithm corresponds to a Toffoli gate controlled on r bits (together with some NOT gates). Similarly to the case of the P operation, they can all be performed in parallel, using copies of the input x.
The most complicated part is the second and third steps. The second step can be split into two parts. First, we initialise an ancilla register C ic = j A ijc ; then we set B i = c C ic . Each bit C ic can be set in parallel using a Toffoli gate controlled on n qubits (and some NOT gates). Summing these bits efficiently in parallel can be achieved via a binary tree of addition operations, where the tree is of depth log n . The t'th level of the tree sums approximately n/2 t pairs of integers in parallel, producing (t + 1)-bit integers as output. The addition operations themselves can be carried out using remarkably efficient out-of-place addition circuits presented by Draper et al. [47]. Even for adding two 10-bit numbers, the depth of the circuit given in [47] is only 11 (8 layers of which are made up of Toffoli gates, the remainder of which contain CNOTs). Here this can be reduced further by simplifying the circuit in the earlier layers of the tree that compute smaller numbers, and by observing that some layers of the circuit in [47] are only used to uncompute ancilla bits, which are uncomputed

If
6. If B = 1, perform a Hadamard gate on C.
7. If B = C = 1 and a = 0 r , invert the phase of a.
8. If B = 1, perform a Hadamard gate on C.

Uncompute B and A.
Algorithm 10. Diffusion in backtracking tree (assuming k + 1 not a power of 2). U * , Vc, α, β are defined in the text. anyway in step 4. Hence these layers can be omitted.
The third step can be achieved via a binary tree of "compare-and-swap" operations. The n B i values are split into adjacent pairs. Then the efficient in-place comparator also presented by Draper et al. [47] is used to determine which element of each pair is larger. If they are out of order, they are swapped, using a decomposition of a controlled-SWAP operation in terms of 3 Toffoli gates. The result at the end of the tree is that the largest value is moved to the bottom, and we can then copy it into the output register. We then need to reverse the rest of step 3 to put the elements of B i back into their original order before uncomputing steps 1 and 2.

G. Diffusion operations
In step 6 of the R A operation, we need to apply a diffusion map D or D to the state of an ancilla register |a of r qubits, controlled on some other qubits. See [30] for a general discussion of how this can be achieved in quantum walk algorithms. Recall that D corresponds to inversion about the state |ψ = |i . Further recall that we represent * as 0 r . First assume that c is fixed throughout (we will handle c being given as input to the algorithm later).
The algorithm for achieving this map is described as Algorithm 10. We begin by using two ancilla qubits A and B to store whether (a) p = ?, and > 0; (b) = 0. If we set B first, these can be implemented using Toffoli and NOT gates controlled on 3 and s qubits, respectively. Controlled on A, we apply D; controlled on B, we apply D . Afterwards, we uncompute the ancilla qubits. Note that, once the ancilla qubits have been set, we do not need to access the overall R A control bit again.
We first describe how to implement D. To do this, it is suf-ficient to implement V such that V | * = |ψ . This is because the combined operation V U * V † , where U * | * = −| * and U * |x = |x if x = * , maps |ψ → −|ψ , |ψ ⊥ → −|ψ ⊥ as desired. U * can be implemented using a Toffoli gate controlled on r qubits and an ancilla qubit in the state Implementing V is easy if c = k − 1 and k + 1 is a power of 2 (and hence |ψ = |+ ⊗r ), by setting V = H ⊗r . In the context of the overall circuit, this corresponds to implementing r copies of a controlled-Hadamard gate. An efficient Clifford+T circuit for this gate is given in [20] which has T-depth 2 and T-count 2, and uses no ancillas. The total resources required to implement D are then only T-depth 2( log(r − 1) + 3, Tcount 10r − 9, and 2(r − 2) ancillas (excluding the ancilla in the state |− , which can be reused from step 5 of R A ).
If c < k − 1 or k + 1 is not a power of 2, we need to reflect about a state that is not |+ ⊗r . We can do this using just one iteration of the exact amplitude amplification trick of Brassard et al. [25]. Let i be the smallest integer such that c + 2 ≤ 2 i . Then define H = H ⊗ I ⊗r−i ⊗ H ⊗i , where H |0 = γ|0 + δ|1 , and |γ| 2 = 2 i−2 /(c + 2). Note that this is well-defined, because c + 2 ≥ 2 i−1 . Then V can be implemented using the following sequence of operations: where U ≤c+1 is an operation which inverts the phase of any computational basis state corresponding to a value less than or equal to c + 1, and which leaves all other computational basis states unchanged. By the standard analysis of amplitude amplification (where U ≤c+1 is the "verifier" and H is the guessing algorithm) this will produce the state |0 |ψ when applied to the state |0 |0 , as can be confirmed by direct calculation.
Many of the operations in V are quite efficient. To implement the less-than checking step U ≤c+1 , we can use the efficient comparison circuit of Draper et al. [47]. The other operations are (controlled) single-qubit gates. The only one which has not yet been discussed is H . This is not straightfoward to implement, as it requires approximate synthesis using Clifford and T operations. An arbitrary operation mapping |0 to cos θ|0 + sin θ|1 can be written as T HRH, for some rotation R in the Z plane. It was shown in [22] that the mean expected T-count to implement Z-rotations up to inaccuracy using "repeat-until-success" circuits is ≈ 1.15 log 2 (1/ )+9.2. In order to achieve a good overall level of accuracy for the whole algorithm, we require to be upper-bounded by the total number of uses of D in the circuit. This is at most 64 √ T n by (4) and the text below it. In the algorithm we actually need a controlled-H operation (controlled on A). Using a technique of Selinger [89], controlled-R and controlled-T can each be implemented at an additional cost of T-depth 2 and T-count 8. So the expected overall cost of implementing H is T-depth ≈ 1.15 log 2 (64 √ T n) + 17.2, T-count ≈ 1.15 log 2 (64 √ T n) + 29.2. It remains to implement D , which can be done as follows. First, add another ancilla qubit C in the state |0 and map it to the state α|0 + β|1 using a similar synthesis technique.
Then use a controlled-D operation (controlled on C) to produce α|0 | * + β|1 1 √ k i∈ [k] |i , followed by a Hadamard gate on C to produce |i .
Next, conditional on C being in the state 1 and the other qubits not being in the state 0 r , the phase is inverted to produce the state A final Hadamard gate on C restores it to its initial state. We finally describe how to handle dependence on c. When used in Algorithm 6, we can think of steps 3-8 of the overall diffusion operation as being provided with an input of the form |c |A |B |a , where c specifies the number of colours used in x. The goal is to apply an operation depending on c (call this D c ) to |a := |A |B |a . D c encompasses the controlled-D and controlled-D operations above. This can be achieved by attaching k − 1 ancilla registers of r + 2 qubits each, where the i'th register is initially in a state |e i which is an eigenvector of D i with eigenvalue 1. These states can be prepared in advance of the algorithm, so the cost of preparing them is negligible in terms of the overall complexity. Then, before step 3 in Algorithm 10, swap operations controlled on c are used to swap |a with |e c , before applying D i to each register i in parallel, and then swapping |a back into place. This has the effect of applying D c to |a and leaving the other registers unchanged.
Rather than applying k − 1 controlled-swap operations sequentially, we can achieve a reduction in depth by storing all binary prefixes of the r-bit string corresponding to c. We can then use an r-step procedure where, at step i, the state |a is swapped into a register labelled with the first i bits of c (with the final register corresponding to |e c ). These swap operations can all occur in parallel for different choices of prefixes of length i − 1. We also only need to perform swaps for the cases where c i = 1, because at the (i + 1)'th step we can associate each register labelled with an i-bit string with the corresponding (i + 1)-bit string ending in 0. The final result is a "controlled-swap tree" of depth r.

H. Miscellaneous parts and overall bound
In step 3, we need to count the number c of colours used in x, and also store whether each prefix of the binary string c (i.e. each substring of the form c 1 . . . c i ) is equal to each possible prefix (i.e. element of {0, 1} i ). The algorithm for this is given as Algorithm 11. The part that counts the number of colours used in x is essentially a simplified version of Algorithm 9. Then, to calculate the complexity of step 3 of Algorithm 11, observe that there are r i=1 2 i = 2 r+1 −1 non-trivial prefixes of x, and checking each i-bit prefix can be achieved using a Toffoli gate controlled on i bits.
Input: k copies of x ∈ D. Output: The number c of colours used in x, and a bit-string e checking equality to each prefix of the binary string corresponding to c. Ancillae: n × k-qubit register A. In steps 2 and 9, we perform fan-out and fan-in operations, whose complexities can be calculated similarly to step 1. In steps 4 and 7, we need to perform a controlled decrement and increment, controlled on a = * . These can be done by first setting an ancilla qubit based on the r bits of a, then using a controlled version of the in-place addition circuit of Draper et al. [47] (hardcoding the first input to 1 or −1), and then uncomputing the ancilla. We can reduce the complexity of the circuit stated in [47] slightly by restricting to 8-bit integers and observing that some of the gates can be removed because of the input hardcoding. The total depth is then 14 layers of Toffoli gates and 3 layers of CNOT gates, and the circuit uses 38 Toffoli gates and 25 CNOT gates.
Finally, in step 5, we need to invert the phase of the input if p = F. This can be done using a Toffoli gate and one ancilla qubit in the state 1 √ 2 (|0 − |1 ).

V. BACKTRACKING FOR k-SAT
The quantum backtracking algorithm can be applied to boolean satisfiability (SAT) problems in much the same way as for colouring problems. In the case of SAT, variables are boolean; within the backtracking algorithm, this corresponds to 2 bits being used to store each variable (we use * → 00, F → 01, T → 10). As the P and h operations are substantially simpler than in the case of graph colouring, the runtime of the algorithm is dominated by the diffusion operation, whose cost in turn is dominated by the gate synthesis step required. We can reduce the cost of this by observing that diffusion becomes much simpler if each variable represents an element of a set of size 2 c − 1, for some integer c, so allowing the two bits representing x i to take the fictitious value 11 makes this step more efficient. We must then modify the P operation to check and reject this value. This increases the cost of the P operation somewhat, and also increases the size of the backtracking tree by a factor of 3/2. However, it is still advantageous overall. Note that some gate synthesis is still Input: m copies of x ∈ D, where xi is represented by two bits. Output: ?, F, T (represented as 00, 01, 10) Ancillae: (m + 1)-qubit register A, n-qubit register B.
1. In parallel: for each clause c in φ, check whether x violates c. Set Ac to 1 if so.
3. Set the second bit of the output register to 1 if any of the bits of A are equal to 1.
5. If the second bit of the output register is 0, and Bi = 0 for all i, set the first bit of the output register to 1.
6. Uncompute B and A. The algorithm for checking whether any of the clauses is violated is given as Algorithm 12. Checking whether x violates an individual clause can be achieved with just one Toffoli gate, controlled on k bits (together with some additional NOT gates). This is because we use 01, 10 to represent a variable being set to false and true, respectively; so controlling on one or other of those bits is enough to check for an assignment being consistent with a literal in a clause.
Step 2 of the algorithm checks whether any of the bits of x are equal to the fictitious value represented by the binary string 11. This can be achieved using an n-qubit ancilla register y, such that bit y i is set to 1 if both of these bits are equal to 1, using a Toffoli gate. Then A m+1 is set to 1 if any of the bits y i are equal to 1, and then y is uncomputed.
Then steps 4 and 5 of the algorithm determine whether the partial assignment x is complete. If the assignment is complete, and not inconsistent with any of the clauses, the output is set to true.

B. Evaluation of h(x): choosing the next variable
Given our simple choice of heuristic for k-SAT, the h function is also very simple: it returns +1, which can be achieved via copying and then incrementing it in place.

VI. CLASSICAL GRAPH COLOURING ALGORITHMS
Graph colouring algorithms can be categorised as either heuristic or exact. Heuristic algorithms aim to output a "good" colouring (one with a low number of colours) without proving optimality, while exact algorithms allow the impossibility of colouring with a certain number of colours to be certified. In this work the focus is on exact algorithms, corresponding to problems for which it is crucial to determine optimality. For example, a wireless communication problem where it is extremely expensive to use unnecessarily many frequencies (we might imagine that there is a fixed allocation of bandwidth across all cells in a wireless communication system, so having more colours would correspond to each cell having lower bandwidth). A vast array of heuristic algorithms for graph colouring has been proposed; for a relatively recent survey of these (and exact algorithms), see [68].
The idea of choosing a vertex to colour that is the most "saturated" was proposed by Brélaz in 1979 [28] as a heuristic, called DSATUR, for finding a good colouring. This idea can also be used as part of a backtracking algorithm, also called DSATUR. The full algorithm, incorporating an optimisation due to Sewell [91], can be summarised as follows: 1. Find a large (though possibly non-maximal) clique in the graph via an efficient algorithm, and colour the vertices in that clique.
2. Perform the standard backtracking algorithm with the following heuristic h for choosing the next vertex to colour: (a) Choose the vertex which is adjacent to the largest number of vertices with distinct colours.
(b) In case of a tie in (a), choose the vertex with the largest degree.
(c) In case of a tie in (b), choose the lexicographically first vertex.
The P function is defined to reject any partial colouring that is invalid.
There are two simple further optimisations that are implemented in standard versions of DSATUR [71]. The first is, when the current partial colouring contains colours between 1 and c, to only consider colours between 1 and c + 1 for colouring the next vertex. The second is not to assign a vertex a colour that is already used by any of its neighbours. The former optimisation can be implemented in the quantum backtracking algorithm quite efficiently (see above), whereas the latter seems less straightforward to achieve without incurring a multiplicative loss of O( √ k) in the complexity. DSATUR can also be used to compute the chromatic number, rather than checking k-colourability. In this (more standard) variant, the number of colours used can be as large as n. When a complete and valid colouring of the graph is found that uses fewer colours than the best colouring found so far, the number of colours it uses is stored and any further partial colourings that use more than that number of colours are rejected. Then the algorithm finally outputs the stored (minimal) number of colours.
The clique-finding step, which affects the runtime very little, can be implemented classically before the rest of the algorithm is executed on the quantum computer. The tie-breaking steps can be built into the h function presented in Section IV F with little extra effort, by sorting the vertices according to degree before running the algorithm.
The DSATUR algorithm is very simple, but remains a competitive approach for colouring random graphs, called a "de facto standard" [91], performing "suprisingly well" and used on small subproblems as part of other algorithms [88]. (For large structured graphs, approaches based on expressing the colouring problem as an integer program and solving this via relaxations seem to be superior [68].) Further, a standard implementation is available 7 , and this algorithm is widely used as a benchmark in colouring competitions.
Additional modifications to the DSATUR algorithm were suggested by Sewell [91] and San Segundo [88] which achieve improved runtimes for certain graphs (e.g. up to about a factor of 2 in results reported in [88]). Mehrotra and Trick presented a new linear programming (LP) relaxation and compared it against DSATUR [71]. Results were mixed; on some random instances the LP relaxation substantially outperformed DSATUR, and on others the reverse was true. For structured instances, the LP approach was often (though not always) superior. Mixed results are also seen in the comparison given in [73] of DSATUR against a new branch-and-cut method: the new method (which also uses DSATUR as a subroutine) usually, but not always, outperforms DSATUR itself. By contrast, in the experimental results reported in [88], DSATUR always outperforms a competitor branch-and-price algorithm on random instances. Interestingly, despite significant progress on the development of fast SAT solvers over recent years, the approach of encoding a graph colouring as a SAT instance and solving it using a SAT solver does not seem to be particularly effective for random graphs [54].
Given all the above results, together with the algorithm's ubiquity, we believe that DSATUR is a reasonable choice of classical algorithm for benchmarking purposes.

VII. CLASSICAL EXPERIMENTAL RESULTS
In order to determine the likely performance on large instances of the classical algorithms considered, and to calculate the quantum backtracking algorithm's likely performance, we evaluated the classical algorithms on many random instances to determine their runtime scaling with problem size. We now describe the results of these experiments.

A. Satisfiability
We ran the Maple LCM Dist SAT solver, the winner of the "main track" of the SAT Competition 2017 [14], on randomly generated instances of k-SAT, for various choices of k. Experiments were run on an Intel Core i7-4790S CPU operating at 3.20GHz with 7GB RAM. We generated the random 7 http://mat.gsia.cmu.edu/COLOR/solvers/trick.c k α k Exponent nmax  instances using CNFgen, a standard tool which picks k-SAT instances according to the distribution described in Section I. For each k and n, we chose a number of clauses m such that m ≈ α k n, where α k is the threshold for k-SAT. The value of the threshold is not known precisely for all k. Indeed, it was only recently shown rigorously that a sharp threshold exists for large k [46]; for small k ≥ 3 this is still unknown. The tightest bound known for general k is that where the o(1) terms approach 0 as k → ∞ [40]. Nonrigorous predictions for the threshold in the range 3 ≤ k ≤ 7 obtained via the "cavity method" are given in [74]. Here we used these predictions as our estimates for α k for k ≤ 7. In the range 8 ≤ k ≤ 10, we used the (non-rigorous) upper bounds α (7) c given in [74, Table 1], while an approximate threshold in the range k > 10 can be found via the third-order approximation given in [74,Appendix]. For large k, this approximation rapidly approaches the upper bound in (5). This method does not guarantee that we have found a good approximation to the true threshold α k ; however, the level of accuracy of this approximation seems to be sufficient for the small input sizes that we were able to consider. It could also be possible that another choice of α k away from the threshold could produce even harder instances for the Maple LCM Dist solver.
For each value of n considered, we ran the solver on at least 100 random instances and took the median runtime. The estimated scaling parameters for each k are listed in Table X, and examples of the results for k ∈ {9, . . . , 12} are shown in Figure 13. Note that, given that we only consider relatively small values of n, we cannot rule out that the runtime does not simply scale as a function f (n) = 2 an+b for some a, b ∈ R. For example, in the case k = 3, the runtime behaviour seems to be more complex in the range that we considered (see Figure 14).
In the case of backtracking, we evaluated the size of the backtracking tree produced for the standard backtracking algorithm applied to random k-SAT instances where the heuristic h chooses the next variable from a variable ordering fixed in advance, in order of number of appearances (that is, the variable that appears in the largest number of clauses is chosen first). Some previous comparisons of static variable ordering strategies for solving CSPs have been made in [11,43]. A good static ordering strategy can produce substantially smaller backtracking trees than choosing the variables in a fixed order that does not depend on the problem instance (see [29,76]  TABLE XI. Estimated backtracking tree size on random k-SAT instances with n variables and ≈ α k n clauses, when variables are ordered in terms of appearance count. Table lists suspected exponents f (n) in tree sizes of the form 2 f (n) . Based on taking linear leastsquares fits to the log of median runtimes, from: 15 random instances and 10 ≤ n ≤ 30 for k ≤ 9; 7 random instances and 15 ≤ n ≤ 25 for 10 ≤ k ≤ 12; 5 random instances and 15 ≤ n ≤ 20 for 13 ≤ k ≤ 15 (omitting n = 15 in the case k = 15).  analytic expressions for the expected size of the latter).

B. Graph colouring
We experimentally evaluated the DSATUR algorithm for 1000 random graphs in the range n ∈ {10, . . . , 75}, where each edge is present with probability 0.5. As discussed in Section VI, the algorithm can either be used to calculate the chromatic number precisely, or to determine k-colourability for a fixed k (which matches what the quantum algorithm achieves and is, in principle, easier). The second variant can be obtained from the first by rejecting any colouring which uses more than k colours. In order to determine a challenging value of k for a given graph G, we first run the first variant of DSATUR to calculate the chromatic number χ(G), then the second variant with k = χ(G). In Figure 16, we verify that for most random graphs, χ(G)-colouring is not significantly easier than computing χ(G).
We also observe that, as expected, the number of nodes T in the backtracking tree has a strong linear correlation with the CPU time τ used (see Figure 17). Performing a least-squares fit, we obtain that on a 3.20GHz Intel Core i7-4790S processor, τ ≈ 2.50 × 10 −6 T − 0.05, where τ is measured in seconds. This corresponds to each node in the backtracking tree being evaluated in ≈ 8000 CPU cycles. Although the number of operations per node in the backtracking tree does depend on n, the dependence is linear, so the scaling should remain correct up to a factor of 3 or so for all reasonable graph sizes. Therefore, T is a good proxy for the actual runtime of the algorithm.
Next, we calculate how T scales with n. We compute both the median value and the 90th percentile, where the latter aims to provide an estimate for how the runtime will scale for the most difficult graphs. As expected, these quantities scale exponentially with n. Performing a least-squares fit on log T (omitting small values of n), we obtain that T ≈ 2 0.40n−7.43 for the median, and T ≈ 2 0.42n−6.20 for the 90th percentile; see Figure 18. Throughout this work, we use the median to compute the anticipated runtime of classical and quantum algorithms on random graphs.
The quantum algorithm presented here can be seen as accelerating a somewhat simplified version of DSATUR, without an optimisation to rule out colours that are used by each vertex's neighbours. So, when computing an estimate for the backtracking tree size of this simplified algorithm (to obtain a corresponding estimate of the quantum algorithm's runtime), we should take into account the cost of not including this optimisation. This cost will vary depending on the input. Further, DSATUR has two roles: for a graph with chromatic number k, it both finds a k-colouring, and rules out any (k − 1)colouring. The quantum algorithm, by contrast, only determines colourability. When comparing the two algorithms, we therefore measure DSATUR searching for a k-colouring against the simplified algorithm ruling out a (k−1)-colouring.
We expect the ratio of the tree sizes of the simplified algorithm and the standard algorithm to usually be relatively small (though perhaps growing slowly with n); it is always upper-bounded by the number of colours. In some cases, the simplified algorithm may have smaller tree size, because it aims only to prove that a (k − 1)-colouring does not exist. This effect is particularly significant in the case where the clique-finding preprocessing step finds a k-clique, enabling a (k − 1)-colouring to be ruled out immediately; this event is quite common for small random graphs. Experimentally, the median ratio of tree sizes varies for different choices of n; in all experiments run, the ratio was less than 15. The median ratio found for each n is plotted in Figure 19. It is apparent that this varies and, for some fairly large values of n, fluctuates below 4. As we aim to find a "best case" but fair scenario for the quantum algorithm outperforming its classical counterpart, we use a factor of 4 as our estimate of the penalty to the quantum algorithm's backtracking tree size. Put another way, we are assuming that the algorithms are run on random graphs for a "good" choice of n.   Figure 19. Median ratio between the number of nodes in the unoptimised DSATUR backtracking tree, as used in the quantum algorithm, and the number of nodes in the optimised DSATUR backtracking tree. Taken over 1000 random graphs for each n ∈ {10, . . . , 75}.

VIII. COST OF CLASSICAL COMPUTATION
Under some models, the quantum algorithms we consider seem to achieve a substantial speedup over their classical competitors. However, the depth-optimised algorithms we describe use substantial physical quantum resources. In order to determine if we obtain a real-world reduction in cost, we need to compare the cost of classical and quantum computation. Given that early quantum computers are likely to be accessed remotely as a cloud service, perhaps a fair model for comparison is to consider a cloud-based compute service which charges by hour of CPU time. Then one can calculate the number of CPU-hours required to solve the instance which can be solved by the quantum computer in 1 day, allowing a price to be put on "1 quantum CPU-hour". See [70] for a very recent example of a similar comparison in the context of quantum computational supremacy experiments. The classical algorithm used may not be perfectly parallelisable across multiple machines; however, for the algorithms that are typically used to solve hard constraint satisfaction problems in practice (e.g. backtracking, local search), a relatively high level of parallelisation may be possible [83].
In order to make a fair cost comparison between the quantum and classical machines, it is also necessary to take into account the cost of the classical processing used in the quantum computer. For example, if each physical qubit requires one classical CPU to carry out the error-correction computations for the surface code, there are 10 6 qubits, and the computation takes 1 day, we could instead use these classical CPUs to carry out a computation requiring 10 6 CPU-days. We therefore calculate the overall cost of the quantum computation in terms of the classical CPU time 8 , and use the ratio between this and the classical CPU time used as the quantum speedup factor. Although this measure is a lower bound on the true cost of the quantum computation (which should also take into account issues such as power consumption), it is independent of the details of the quantum hardware platform itself.
The classical processing required may not be carried out using a standard CPU, but instead using a GPU or specialised electronics (FPGAs or ASICs). The extent of the speedup that might be possible by this approach can be estimated using Bitcoin mining as a point of comparison. Current commercially available hardware platforms based on ASICs can achieve a hash rate of over 2 × 10 13 hash function evaluations per second 9 , as compared with a rate of around 2 × 10 9 for a GPU or 10 7 for a CPU. The best average runtime reported for a recent near-linear time decoding algorithm for the surface code [44], using a standard CPU, is approximately 440µs for a system of 5000 qubits, corresponding to 8.8 × 10 −8 s per qubit. (Note that this latter figure is likely to be an overoptimistic estimate, as this would imply that decoding the surface code could be perfectly parallelised; on the other hand, algorithmic and implementation optimisations to the algorithm of [44] could improve its runtime.) So one CPU could support around 2 qubits while still achieving a surface code cycle time of 200ns. Assuming that GPU/ASIC implementations of this decoding algorithm are indeed possible with roughly similar performance enhancements to the case of Bitcoin, we might hope that a GPUbased system could support 100 times as many qubits at this clock speed, and that an ASIC could support 10 6 times as many qubits. (Increased clock speeds would require a corresponding increase in classical performance.) In unpublished work [50], it has been estimated that a CPU core and FPGA   [44]. Measured in processor-days (where type of processor is CPU, GPU and ASIC respectively in realistic, plausible and optimistic regimes). Assumes that the speedup offered by GPUs and ASICs over CPUs is a factor of 100 and 10 6 respectively. could realistically perform error correction for around 100 physical qubits, which would be 1-2 orders of magnitude faster than a CPU alone; so this is comparable to what we here call a GPU.
An indication of the computational resources required to implement different numbers N of Toffoli gates is shown in Table XII (the figures would be similar for T gates). This table is calculated based on the cost of implementing Toffoli factories alone. First, the number of qubits required for each Toffoli factory is estimated. Then the processing time required by the decoder to support each Toffoli factory is determined by multiplying the estimated time for decoding each qubit (based on [44], scaled according to processor type) by the spacetime cost of the Toffoli factory. Finally, this is multiplied by the number of Toffoli gates to get an estimate of overall processing cost. This is clearly a very approximate calculation, yet may give an indication of the effect of this classical overhead. (For example, error-correction procedures may run more quickly if there are fewer errors, and we have not considered this effect here.) By comparing Tables VI and XII, one can see that under all of the regimes, and even if ASICs are used, the complexity of classical processing wipes out any significant advantage for the quantum algorithms. One reason for this is that the Toffoli counts of the algorithms are substantially higher than the Tdepths. This may motivate (especially in the Optimistic scenario) the use of an alternative error-correction scheme with lower overhead, albeit perhaps a worse threshold than the surface code.

IX. CONCLUSIONS AND FURTHER WORK
For the first time, we have given a detailed analysis of the complexity of quantum algorithms for graph colouring and boolean satisfiability, including overheads for faulttolerance, and have shown that in some scenarios the algorithm could substantially outperform leading classical competitors in terms of runtime. However, when one takes into account the cost of classical processing using current techniques, the speedup disappears. Also, the space usage of the algorithms is extremely high (sometimes over 10 13 physical qubits), although this could be reduced at the expense of a longer runtime, by changing the algorithms to perform fewer tasks in parallel. Simply allowing the algorithms to run for longer will increase the size of the potential speedup too; for example, allowing the backtracking algorithm for k-colouring to run for c days would increase the speedup by a factor of approximately c.
There are some theoretical improvements that could be considered to improve the complexity of these algorithms. Arunachalam and de Wolf have given a variant of Grover's algorithm which solves the unstructured search problem for a unique marked element in a set of N elements using only O(log(log * N )) gates between each oracle query [9], where log * N is the number of times the binary logarithm function must be applied to N to obtain a number that is at most 1. It might be possible to use these ideas to accelerate the variant of Grover's algorithm that we used here. (Note that the depth of the circuit between each oracle query in the variant of the algorithm that we use is only of size O(log log N ) already, so the gain might be relatively minor.) Some specific ways in which it might be possible to improve the quantum backtracking algorithm are: • The main component of the algorithm is a controlled quantum walk operation (controlled-R B R A ) used within phase estimation. The use of controlled gates throughout adds some overhead to the algorithm. If it were possible to just run the quantum walk directly, rather than needing to apply phase estimation to it, this would reduce the complexity (and could also give a more efficient algorithm for finding a solution, rather than detecting its existence). Preliminary calculations suggest that replacing controlled operations with uncontrolled ones could reduce the runtime by up to ∼20%.
• An automated circuit optimisation tool, such as Tpar [7], could be used to reduce the complexity of the quantum circuits developed, and to check correctness. It is not obvious how large a reduction could be achieved, given that some of the subroutines used (such as integer addition) are already low-depth and highly optimised.
• One could consider optimisations targeted at particular classes of graphs (e.g. sparse graphs) or boolean formulae, and could consider different heuristic functions h.
Using the quantum backtracking algorithm to actually find a k-colouring when one exists, rather than simply determining whether one exists, would increase its complexity. However, careful use and optimisation of the techniques of [6,60] could minimise this overhead.
In this work, the backtracking algorithm has been aggressively optimised for circuit depth. It is unclear how much further this process can be continued. Given that the depth overhead of the quantum circuit is within an order of magnitude of the quantum circuit for SHA-256 [8], it seems plausible that one or two further orders of magnitude improvement are the most that will be possible. However, it may be possible to reduce the Toffoli-count of the algorithm, which is extremely high at present. Also, even without considering the overhead for fault-tolerance, the space usage of the algorithm in general seems very large ( 10 5 logical qubits in the case of graph colouring), although it is worth bearing in mind that the input size itself is relatively high (e.g. to describe an arbitrary graph on 150 vertices requires over 10 4 bits). A further issue faced by the quantum backtracking algorithm is its inability to retain state across different evaluations of oracle functions, whereas this is available to the classical algorithm and can make it more efficient.
Our results suggest that improved fault-tolerance techniques will be required to make the algorithms presented here truly practical. (See [52,67] for some very recent developments in this direction.) In particular, if a significant quantum speedup is to be realised, it seems to be essential to have highly efficient decoding algorithms and/or specialised decoding hardware. Although the numbers reported here are daunting, there is scope for improvement. A potentially hopeful parallel is previous work on quantum algorithms for quantum chemistry, where initially reported complexities were high [95], but were rapidly improved by orders of magnitude (see [10] and references therein). This was achieved by a careful analysis of optimisations to the specific algorithms used, and numerical calculations to determine performance of the algorithms in practice, beyond the theoretical worstcase bounds. Each of these general strategies could be applied here.
The analysis here could be extended by attempting to find more realistic models for constraint satisfaction problems that occur in practice, rather than the completely unstructured families considered here. However, these can be seen as representing the "core" of a challenging problem, so are perhaps reasonable in themselves. A more detailed comparison could also be made with other classical solvers, including the use of special-purpose hardware [93]. tially set to the target error probability per Toffoli magic state. We aim for N p tol ≤ 1/3 so the total failure probability of the algorithm is less than a third. We work backwards from the last round of distillation to the first round, increasing p tol until p tol > p g . Therefore we are assuming the initial magic states must have an error probability of p g or less, which is justified due to Ref. [66]. The index i starts at 1 for the last round of distillation that uses the Toffoli protocol, and we increment i as we go to lower rounds of distillation. A key concept here is that of balanced investment [78]. That is, on the i th round of distillation we use a surface code of distance d i that suffices to reduce Clifford gate noise below the target error probability p tol .
In steps (3)-(5) we calculate the resource costs in the Toffoli routine [48,61,78]. The noise per logical Clifford gate is suppressed to ∼ d(100p g ) (d+1)/2 and there are 99 possible locations for a logical gate error (this will become clear below). Given the surface code distances, we can calculate the quantity of physical qubits needed. Each Toffoli routine uses 11 logical qubits when ancillae are included (see Fig. 9 of Ref. [78]). For each logical qubit we need 2d i (d i −1) physical qubits to realise the code and a further 2d i (d i − 1) qubits for syndrome extraction. Therefore we require 44d i (d i − 1) physical qubits as in step (4). The protocol is depth 9 in logical gates, with each logical gate requiring d i surface code cycles, giving step (5). For a depth 9 circuit with 11 logical qubits, the number of possible error locations is upper bounded by the value 99 that we used earlier. The Toffoli routine takes T magic states of error probability and outputs a Toffoli state with error probability 27 2 + O( 3 ), which for < 10 −3 is strictly upper bounded by 28 2 . Inverting this relationship, we obtain the p tol update in step 6. There is some finite failure probability but for p g ≤ 10 −4 this will not affect the order of magnitude of the calculation.
Next, we iterate through a suitable number of rounds of the 15 → 1 Bravyi-Kitaev protocol [26]. The noise per logical Clifford gate is suppressed to ∼ d i (100p g ) (di+1)/2 and there are 250 possible locations for a logical gate error (this will be-come clear below). Given the surface code distances, we can calculate the quantity of physical qubits needed. We use 25 logical qubits for the 15 → 1 protocol (see Fig. 8 of Ref. [78]). Therefore, (2 + 2) * 25d i (d i − 1) = 100d i (d i − 1) physical qubits are needed for each 15 → 1 protocol. On the distillation round with index i, we need (8 * 15 i−2 ) copies of the 15 → 1 protocol. The product of these numbers gives Q i in line 7(c). The 15 → 1 protocol can be executed in depth 10 logical gates each requiring d i surface code cycles, giving a total T i shown in step 7(d). Since there are 25 logical qubits in a depth 10 circuit, this gives the 250 potential error locations asserted earlier. We must update p tol . The 15 → 1 routine takes T magic states of error probability and outputs a T state with error probability 35 3 + O( 4 ), which for < 10 −3 is strictly upper bounded by 36 3 . Inverting this relationship, we obtain the p tol update in step 7(e). Again, the failure probabilities are negligible in the regimes considered here.
Finally, we combine the physical qubit cost and time cost into a single total space-time cost S per Toffoli state, shown in step (9). To deliver N of these states within time t algo requires a factory with a total number of qubits given by step (10). The constant t algo must be input in units of surface code cycles.
We offer some remarks on possible additional savings. There is potential to cut these numbers in half by using the rotated surface code [59] but it is currently unknown whether the error suppression still obeys ∼ d(100p g ) (d+1)/2 so we instead opt for a conservative, well-supported estimate. There are several additional protocols for T state distillation (including Refs. [27,33,72]) but optimising over all these protocols is much more involved and the above estimate will give a similar order of magnitude result. It is known that Toffoli states can be distilled using 6 T-states (asymptotically) rather than 8 T-states (see Refs. [31,32]) but we do not know the ancilla or time cost of implementing this protocol. It is often hoped that significant savings could be made by circumventing the need for magic state factories altogether, but the resource trade-offs are subtle; see Ref. [34] for a review.