Exact and efficient Lanczos method on a quantum computer

We present an algorithm that uses block encoding on a quantum computer to exactly construct a Krylov space, which can be used as the basis for the Lanczos method to estimate extremal eigenvalues of Hamiltonians. While the classical Lanczos method has exponential cost in the system size when representing the Krylov states of quantum systems, our quantum algorithm achieves this in polynomial time and memory. The construction is exact in the sense that the resulting Krylov space is identical to that of the Lanczos method, so the only approximation with respect to the exact method is due to finite sample noise. This is possible because, unlike previous quantum Krylov methods, our algorithm does not require simulating real or imaginary time evolution. We provide an explicit error bound for the resulting ground state energy estimate in the presence of noise. For our method to succeed efficiently, the only requirement on the input problem is that the overlap of the initial state with the true ground state be $\Omega(1/\text{poly}(n))$ for $n$ qubits.

The problem of finding ground state energies of local Hamiltonians is QMA-hard [30,31]. This means that any sufficiently general quantum algorithm for approximating ground state energies must rely on additional assumptions. Typical additional assumptions include finding an initial guess state that has $\Omega(1/\text{poly}(n))$ overlap with sufficiently low-lying energy states (for $n$ qubits) [7,20,28,29], finding a gapped adiabatic path from an efficiently solvable Hamiltonian to the target Hamiltonian [9], optimizing a variational quantum circuit [2], or assuming bounded correlation length in the ground state [16].
One design principle for quantum algorithms is to reduce the overhead of classical numerical techniques. This can be applied to classical algorithms that target ground states. A well-known classical technique for finding extremal eigenvalues is the Lanczos method [32, 33], which constructs a linear Krylov space as the span of states obtained by applying powers of the Hamiltonian to some reference state. The Hamiltonian is then diagonalized within the Krylov space, obtaining a variational estimate of the ground state energy. This method has been widely studied and applied, and for a given problem instance it converges exponentially quickly in the dimension of the Krylov space, in practice often reaching sufficient accuracy within only some tens of powers of the Hamiltonian [34, 35, 36, 37]. However, for quantum systems, the Krylov basis vectors have exponential dimension in the system size, which makes the classical Lanczos method exponentially costly in general.
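As a self-contained illustration of this classical procedure (our own sketch, using a small random symmetric matrix as a stand-in Hamiltonian), one can build the Krylov basis and diagonalize within it using NumPy:

```python
import numpy as np

def krylov_ground_energy(H, psi0, D):
    """Variational ground-energy estimate from the Krylov space
    span{psi0, H psi0, ..., H^(D-1) psi0}."""
    # Build the Krylov basis by repeatedly applying H, normalizing
    # each vector to keep the columns at a sane scale.
    vecs = [psi0 / np.linalg.norm(psi0)]
    for _ in range(D - 1):
        w = H @ vecs[-1]
        vecs.append(w / np.linalg.norm(w))
    # Orthonormalize (equivalent to the Lanczos orthogonalization)
    # and diagonalize H within the subspace.
    Q, _ = np.linalg.qr(np.column_stack(vecs))
    return np.linalg.eigvalsh(Q.conj().T @ H @ Q)[0]

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
H = (A + A.T) / 2                  # toy "Hamiltonian": 64 x 64 symmetric
psi0 = rng.standard_normal(64)     # random reference state
E0 = np.linalg.eigvalsh(H)[0]
errors = [krylov_ground_energy(H, psi0, D) - E0 for D in (5, 15, 25)]
```

For an $n$-qubit system the vectors here would have length $2^n$, which is precisely the exponential classical cost that the quantum algorithm described below avoids.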
A natural question is therefore whether a quantum version of the algorithm can make this cost polynomial. Applying a power of the Hamiltonian to a reference state is a nonunitary operation. Hence, some quantum versions of the Lanczos method have been proposed that replace powers of the Hamiltonian with unitary operators that approximate real [17,18,21,25,23] or imaginary [16] time evolutions generated by the Hamiltonian. Another strategy approximates powers of the Hamiltonian itself using linear combinations of time evolutions [24]. These methods avoid the exponential cost of representing quantum states, which have dimension $2^n$ for $n$ qubits. However, they only approximate the original Lanczos method due to the modification of the Krylov basis vectors. Furthermore, real and imaginary time evolutions can themselves only be approximated on a digital quantum computer.
In this paper, we propose a quantum algorithm that exactly reproduces the Krylov space used in the classical Lanczos method up to finite sample noise, without the exponential classical representation cost, thus avoiding the aforementioned approximation errors. Unlike in typical classical implementations, our method does not involve iteratively orthogonalizing the Krylov vectors, but it does yield the same Krylov space. The algorithm we propose is based on the observation that the simplest case of qubitization presented in [38] applies to a Hamiltonian expressed as a linear combination of Pauli operators and produces block encodings of Chebyshev polynomials of the Hamiltonian. It does not require quantum signal processing [39], which would be required to approximate time evolution using qubitization. Instead, we show that applications of the block-encoding unitaries followed by measurement, repeated in order to estimate a collection of expectation values, provide enough information for the classical computer to do the rest.
We also provide an explicit analysis of the ground state energy error from our algorithm in the presence of noise. The resulting Krylov space dimension (which is proportional to the maximum circuit depth) required to achieve a given target error scales either as the inverse of the target error or of the spectral gap, whichever is more favorable, times logarithmic factors: see (38). This means that our method can tolerate a vanishing spectral gap, but also that when the spectral gap is nonzero, the total number of measurements required to achieve a given target error scales asymptotically as the inverse-squared error (times logarithmic factors): see (44). This is the scaling one would expect in an algorithm that is based on repeated sampling.
Compared to prior work: first, relative to previous quantum subspace diagonalization methods [16,17,18,21,24,25,23], our algorithm is exactly equivalent to the classical Lanczos method up to sampling noise, and does not require approximate real or imaginary time evolution. Unlike imaginary-time evolution, our algorithm does not require a finite correlation length [16]. Compared to adiabatic state preparation [9], our algorithm does not depend on the spectral gap unless the gap is larger than the target error, and unlike variational quantum eigensolvers [2], it does not require optimizing a parameterized quantum circuit. Finally, compared to quantum phase estimation, our algorithm requires a number of shots that is inverse quadratic in the target error rather than merely inverse (the latter being preferable). However, our method has shorter circuits if the phase estimation algorithm is also based on block encoding, and it is robust to noise. Details of all of these comparisons are given in Section 6.
The paper is organized as follows. In Section 2 we give requisite background on block encoding, qubitization, and quantum subspace diagonalization. In Section 3, we describe and analyze our quantum algorithm for producing a Krylov space. In Section 4, we describe the classical post-processing (regularization) required for our method, provide an error bound for the resulting ground state energy estimate in the presence of noise, and discuss the resulting scalings of Krylov space dimension and measurements. In Section 5, we provide some numerical demonstrations of our method. Finally, we discuss our results and conclude in Section 6.

Block encoding

Definition 1 (Block encoding). A block encoding of a Hamiltonian $H$ acting on a system Hilbert space $\mathcal{H}_s$ (whose states are denoted $|\cdot\rangle_s$) is a pair $(U, G)$, where $U$ is a unitary acting on $\mathcal{H}_a \otimes \mathcal{H}_s$ (for some auxiliary Hilbert space $\mathcal{H}_a$ whose states are denoted $|\cdot\rangle_a$), and $|G\rangle_a := G|0\rangle_a$ is a state of the auxiliary qubits such that

$(\langle G|_a \otimes 1_s)\, U\, (|G\rangle_a \otimes 1_s) = H. \qquad (1)$

Here $1_s$ denotes the identity acting on $\mathcal{H}_s$.
As an example of block encoding, consider an $n$-qubit Hamiltonian expressed as a linear combination of Pauli operators:

$H = \sum_{i=0}^{T-1} \alpha_i P_i, \qquad (2)$

where $\alpha_i$ are real coefficients, $P_i$ are Pauli operators, and the number of terms is $T = O(\text{poly}(n))$. We assume that the coefficients $\alpha_i$ are nonnegative, and instead each Pauli $P_i$ carries a $\pm 1$ sign. We also require a normalization condition on $H$: $\sum_{i=0}^{T-1} |\alpha_i| = \sum_{i=0}^{T-1} \alpha_i = 1$. Given some nonnormalized input Hamiltonian whose coefficients are $\alpha_i^{(E)}$ (in units of energy), we obtain normalized coefficients as $\alpha_i = |\alpha_i^{(E)}| / \sum_j |\alpha_j^{(E)}|$. The block-encoding unitary is then

$U = \sum_{i=0}^{T-1} |i\rangle\langle i|_a \otimes P_i, \qquad (3)$

i.e., application of each Pauli term $P_i$ controlled on the state of the auxiliary register being $|i\rangle_a$. The corresponding block-encoding state is

$|G\rangle_a = \sum_{i=0}^{T-1} \sqrt{\alpha_i}\, |i\rangle_a. \qquad (4)$

Inserting $U$ as defined in (3) and $|G\rangle_a$ as defined in (4) into the left-hand side of (1) and using the fact that $\alpha_i \geq 0$ for all $i$, one can verify that (1) is satisfied, i.e., this $(U, G)$ forms a valid block encoding of $H$.
One option for encoding $|i\rangle_a$ is to use $\lceil \log_2 T \rceil$ auxiliary qubits and let each $|i\rangle_a$ be the computational basis state corresponding to the binary number $i$ [38]. In this case, to implement $U$ we must, for each of the $T$ Paulis $P_i$, apply $P_i^{(j)}$ (the $j$th single-qubit Pauli factor in $P_i$) to system qubit $j$, controlled on the auxiliary qubits being in state $|i\rangle_a$. Each $P_i$ is an $n$-qubit Pauli operator, so implementing $U$ requires applying at most $nT$ single-qubit Pauli operators, each controlled on all of the auxiliary qubits.
To prepare $|G\rangle_a = G|0\rangle_a$, we can use existing state preparation procedures, since there are only logarithmically many auxiliary qubits and thus preparing an arbitrary state on them is efficient. For example, one can use the state preparation based on binary data structures of [40, Theorem A.1]. To prepare an arbitrary real-amplitude state of $\lceil \log_2 T \rceil$ qubits using this method requires fewer than $2T$ single-qubit rotations, each of which is controlled on up to all of the remaining qubits.
The qubitization iteration step that we will discuss below in Section 2.2 also requires implementing $R$, a reflection around $|G\rangle_a$. One can do this by applying $G^\dagger$, the inverse of the state-preparation unitary, then applying the $\lceil\log_2 T\rceil$-controlled phase that reflects around $|0\rangle_a$, and finally reapplying $G$. Hence the total cost for $R$ is at most $4T$ $\lceil\log_2 T\rceil$-controlled gates. Methods for implementing the block encoding itself are not the main focus of this paper, but in Appendix C we present an alternative encoding based on the binary representation of the Pauli operators, which uses more qubits in exchange for shorter circuits and is well-adapted to local Pauli Hamiltonians such as spin models. When this block encoding is applied to a Heisenberg model containing arbitrary XX and ZZ interactions, $U$ requires $3n + T$ two-qubit gates, $G$ requires $4n^2 - 10n$ two-qubit gates plus 2 single-qubit gates, and $R$ requires $8n^2 + 14$ two-qubit gates plus 4 single-qubit gates. Various other block encodings have been proposed that may be advantageous depending on the Hamiltonian and the available quantum device [38,41,42,43].
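As a concrete sanity check of this construction (a toy sketch of ours, assuming the two-term Hamiltonian $H = \frac{1}{2}X + \frac{1}{2}Z$ rather than any specific model from the text), one can build $U$ and $|G\rangle_a$ as explicit matrices and verify the block-encoding condition (1):

```python
import numpy as np

# Two-term toy Hamiltonian H = 0.5 X + 0.5 Z (coefficients already
# normalized so that sum_i alpha_i = 1).
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])
alphas = np.array([0.5, 0.5])
paulis = [X, Z]
H = sum(a * P for a, P in zip(alphas, paulis))

T = len(paulis)
dim = T * 2
# Select unitary U = sum_i |i><i|_a (x) P_i -- a sum of controlled Paulis.
U = np.zeros((dim, dim))
for i, P in enumerate(paulis):
    proj = np.zeros((T, T))
    proj[i, i] = 1.0
    U += np.kron(proj, P)

# Prepare state |G>_a = sum_i sqrt(alpha_i) |i>_a.
G = np.sqrt(alphas)
B = np.kron(G.reshape(1, -1), np.eye(2))   # (<G|_a (x) 1_s)

assert np.allclose(U @ U, np.eye(dim))     # U is self-inverse
assert np.allclose(B @ U @ B.T, H)         # block-encoding condition (1)
```

The self-inverse property checked in the last two lines is exactly what the qubitization step below relies on.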

Qubitization
The other main component we will need is the simplest version of qubitization introduced in [38], which applies when the block-encoding unitary is self-inverse:

$U^2 = 1. \qquad (5)$
In the example above, U is the product of controlled Pauli operators given by (3), so (5) is satisfied. The following lemma is based directly on Section IV in [38]; for completeness, we give its proof in Appendix A.
Lemma 1 (Chebyshev polynomials from block encoding). Given a block encoding $(U, G)$ of a Hamiltonian $H$ such that $U^2 = 1$, let

$R = 2|G\rangle\langle G|_a \otimes 1_s - 1 \qquad (6)$

be the reflection around $|G\rangle_a$ in the auxiliary space. Then for any $k = 0, 1, 2, \ldots$,

$(\langle G|_a \otimes 1_s)\,(RU)^k\,(|G\rangle_a \otimes 1_s) = T_k(H), \qquad (7)$

where $T_k(\cdot)$ is the $k$th Chebyshev polynomial of the first kind. In other words, $(RU)^k$ is a block encoding of $T_k(H)$.
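Lemma 1 can be checked numerically on a toy encoding (our sketch; the two-term Hamiltonian $H = \frac{1}{2}X + \frac{1}{2}Z$ and the dense-matrix representation are assumptions for illustration):

```python
import numpy as np

# Toy block encoding of H = 0.5 X + 0.5 Z.
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])
alphas = np.array([0.5, 0.5])
paulis = [X, Z]
H = 0.5 * X + 0.5 * Z
T = 2
U = np.zeros((4, 4))
for i, P in enumerate(paulis):
    proj = np.zeros((T, T))
    proj[i, i] = 1.0
    U += np.kron(proj, P)
G = np.sqrt(alphas)
B = np.kron(G.reshape(1, -1), np.eye(2))                 # (<G|_a (x) 1_s)
R = 2 * np.kron(np.outer(G, G), np.eye(2)) - np.eye(4)   # reflection about |G>_a

# Chebyshev recursion T_{k+1}(H) = 2 H T_k(H) - T_{k-1}(H).
cheb = [np.eye(2), H]
for _ in range(2, 8):
    cheb.append(2 * H @ cheb[-1] - cheb[-2])

# (RU)^k block encodes T_k(H) for every k, as in the lemma.
W = np.eye(4)
for k in range(8):
    assert np.allclose(B @ W @ B.T, cheb[k])
    W = R @ U @ W
```

Note that no quantum signal processing is involved: the Chebyshev polynomials appear directly from powers of the single iterate $RU$.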

Quantum subspace diagonalization
Quantum subspace diagonalization [13,14,17,16,19,15,18,21,22,37,24,25,23,26,27] is a method for obtaining a variational estimate of the ground state energy of a Hamiltonian $H$. In the most general setting, we assume access to a set of states

$|\psi_k\rangle = U_k|\psi_0\rangle, \quad k = 0, 1, 2, \ldots, D-1, \qquad (8)$

that can be prepared on a quantum computer by quantum circuits $U_k$. First, we form the $D \times D$ matrices $H$ and $S$ whose entries are

$H_{jk} = \langle\psi_j|H|\psi_k\rangle, \qquad S_{jk} = \langle\psi_j|\psi_k\rangle; \qquad (9)$

$S$ is the overlap (Gram) matrix of the Krylov states (8). Both can be estimated by repeated swap [44] or Hadamard tests [45]. Then, having estimated $H$ and $S$, we classically solve the generalized eigenvalue problem

$H\vec{v} = E\,S\vec{v} \qquad (10)$

and find the lowest eigenvalue. This is our variational estimate of the ground state energy, and it corresponds to the lowest expected energy of any state in the subspace

$\text{span}\{|\psi_k\rangle = U_k|\psi_0\rangle \mid k = 0, 1, 2, \ldots, D-1\}. \qquad (11)$

Note that typically the generalized eigenvalue problem (10) must be regularized due to ill-conditioning of $S$ as $D$ increases [23,37]: this is discussed in detail in Section 4.

The Lanczos method [32,33] is a classical matrix diagonalization method that is closely related to quantum subspace diagonalization. In it, Krylov states $|\psi_i\rangle$ are obtained by multiplying some reference state $|\psi_0\rangle$ by powers of the Hamiltonian, and these are used to construct the matrices $H$ and $S$. This yields a ground state energy estimate that converges exponentially in $D$ [36], assuming infinite-precision arithmetic; the impact of noisy $H$ and $S$ is discussed in Section 4. In general, however, applying this method to quantum systems is exponentially costly in the number of qubits, since the Krylov states have exponential dimension.
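The classical side of this procedure can be sketched as follows (our illustration, with a small random symmetric matrix standing in for the Hamiltonian and exact inner products standing in for the measured estimates):

```python
import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(1)
A = rng.standard_normal((32, 32))
H = (A + A.T) / 2                      # toy Hamiltonian
psi0 = rng.standard_normal(32)
psi0 /= np.linalg.norm(psi0)

# Krylov states |psi_k> proportional to H^k |psi_0> (normalizing each
# changes neither the span nor the variational energy).
D = 5
states = [psi0]
for _ in range(D - 1):
    v = H @ states[-1]
    states.append(v / np.linalg.norm(v))
V = np.column_stack(states)

Hmat = V.T @ H @ V                     # H_jk = <psi_j|H|psi_k>
S = V.T @ V                            # S_jk = <psi_j|psi_k>
# Generalized eigenvalue problem H v = E S v; the lowest eigenvalue is
# the variational ground-energy estimate.
E_est = la.eigh(Hmat, S, subset_by_index=[0, 0])[0][0]
E0 = np.linalg.eigvalsh(H)[0]
```

Because the estimate is variational, `E_est` can never fall below the true ground energy, and it can never exceed the energy of the reference state, which lies in the subspace.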
To avoid the classical barrier of exponential dimensionality, as noted above, several methods have been proposed to permit a quantum variant of the algorithm. These include quantum Lanczos (QLanczos) [16].

Lanczos method on a quantum computer

Algorithm description
We propose a quantum algorithm that exactly produces the Krylov space used in the classical Lanczos method up to finite sample noise. We will focus on the case where the input block encoding is of a Hamiltonian encoded as a linear combination of Pauli operators as in the example in Definition 1, since in this case the measurement scheme is particularly simple. However, the method can be generalized to other block encodings, as we will describe.
The idea is to use as the Krylov basis vectors

$|\psi_k\rangle = T_k(H)|\psi_0\rangle, \quad k = 0, 1, \ldots, D-1. \qquad (13)$

The Chebyshev polynomials $T_k(x)$ for $k = 0, \ldots, D-1$ are a basis for the polynomials of degree less than $D$, so the Krylov space spanned by (13) is identical to the Krylov space obtained from powers of the Hamiltonian:

$\text{span}\{T_k(H)|\psi_0\rangle \mid k = 0, \ldots, D-1\} = \text{span}\{H^k|\psi_0\rangle \mid k = 0, \ldots, D-1\}. \qquad (14)$

Since the quantum subspace diagonalization method classically finds the state with lowest expected energy in this span, the performance of the Krylov space spanned by Chebyshev polynomials of the Hamiltonian applied to the initial state will be identical to that of powers of the Hamiltonian applied to the initial state, up to noise. Also, we reviewed in Section 2.2 (Lemma 1) how to implement block encodings of Chebyshev polynomials of a block-encoded Hamiltonian. There was no approximation in Lemma 1, so as long as the input block encoding is exact, the method will exactly produce the subspace (14).
It remains to show how to extract the matrices $H$ and $S$ from the block encodings of $T_k(H)$. For $S$ we have

$S_{ij} = \langle\psi_i|\psi_j\rangle = \langle T_i(H)\,T_j(H)\rangle_0, \qquad (15)$

where $\langle\cdot\rangle_0$ denotes the expectation value with respect to the initial state $|\psi_0\rangle$. Chebyshev polynomials satisfy the identity

$T_i(x)\,T_j(x) = \tfrac{1}{2}\left(T_{i+j}(x) + T_{|i-j|}(x)\right). \qquad (16)$

Inserting this into (15) yields

$S_{ij} = \tfrac{1}{2}\left(\langle T_{i+j}(H)\rangle_0 + \langle T_{|i-j|}(H)\rangle_0\right). \qquad (17)$

For $H$ we have

$H_{ij} = \langle\psi_i|H|\psi_j\rangle = \langle T_i(H)\,H\,T_j(H)\rangle_0. \qquad (18)$

Using the fact that $H = T_1(H)$ and applying (16) twice, we obtain

$H_{ij} = \tfrac{1}{4}\left(\langle T_{i+j+1}(H)\rangle_0 + \langle T_{|i+j-1|}(H)\rangle_0 + \langle T_{|i-j|+1}(H)\rangle_0 + \langle T_{||i-j|-1|}(H)\rangle_0\right). \qquad (19)$

Since $i, j = 0, 1, 2, \ldots, D-1$, all matrix elements of both matrices are linear combinations of the expectation values

$\langle T_k(H)\rangle_0, \quad k = 0, 1, 2, \ldots, 2D-1, \qquad (20)$

where the highest value $2D-1$ comes from the first term in (19) when $i = j = D-1$. Therefore, in order to construct the matrices $H$ and $S$ it is enough to estimate all of the expectation values (20). Given our block encoding $(U, G)$ of $H$, Lemma 1 yielded (7), which when we take the expectation value with respect to $|\psi_0\rangle$ in the system space becomes

$(\langle G|_a \otimes \langle\psi_0|)\,(RU)^k\,(|G\rangle_a \otimes |\psi_0\rangle) = \langle T_k(H)\rangle_0. \qquad (21)$

Since $R$ is a reflection about $|G\rangle_a$, it is Hermitian and also acts as the identity on $|G\rangle_a$, so the leftmost $R$ in (21) can be removed, yielding

$(\langle G|_a \otimes \langle\psi_0|)\,U(RU)^{k-1}\,(|G\rangle_a \otimes |\psi_0\rangle) = \langle T_k(H)\rangle_0. \qquad (22)$

The product of operators in (22) can be rewritten as

$U(RU)^{k-1} = \begin{cases} (UR)^{k/2}\,R\,(RU)^{k/2}, & k \text{ even}, \\ (UR)^{\lfloor k/2\rfloor}\,U\,(RU)^{\lfloor k/2\rfloor}, & k \text{ odd}. \end{cases} \qquad (23)$

Therefore, if we define the state

$|\psi_{\lfloor k/2\rfloor}\rangle := (RU)^{\lfloor k/2\rfloor}\,(|G\rangle_a \otimes |\psi_0\rangle), \qquad (24)$

whose adjoint is

$\langle\psi_{\lfloor k/2\rfloor}| = (\langle G|_a \otimes \langle\psi_0|)\,(UR)^{\lfloor k/2\rfloor} \qquad (25)$

because $U$ and $R$ are both Hermitian, we can rewrite (22) as

$\langle T_k(H)\rangle_0 = \begin{cases} \langle\psi_{k/2}|\,R\,|\psi_{k/2}\rangle, & k \text{ even}, \\ \langle\psi_{\lfloor k/2\rfloor}|\,U\,|\psi_{\lfloor k/2\rfloor}\rangle, & k \text{ odd}. \end{cases} \qquad (26)$

Hence, we can estimate all the matrix elements of $H$ and $S$ by estimating the expectation values in (26) for each $k = 0, 1, 2, \ldots, 2D-1$. We accomplish this as follows: 1. Prepare $|\psi_{\lfloor k/2\rfloor}\rangle$ by applying $RU$ $\lfloor k/2\rfloor$ times to $|G\rangle_a \otimes |\psi_0\rangle$.
2. If $k$ is even, we want to measure $R$. To do this, first apply $G^\dagger$ (recall that $|G\rangle_a = G|0\rangle_a$; see Definition 1). Then measure $2|0\rangle\langle 0|_a - 1$ on the auxiliary qubits, i.e., measure all auxiliary qubits in the computational basis and return $+1$ if all outcomes are $|0\rangle$, otherwise $-1$.
3. If $k$ is odd, we want to estimate the expectation value of $U = \sum_i |i\rangle\langle i|_a \otimes P_i$. We do this by separately estimating the expectation value of each term, so for each state preparation we measure some $|i\rangle\langle i|_a \otimes P_i$: measure the auxiliary qubits in the computational basis and the system qubits in a local Pauli basis compatible with $P_i$. If the outcome of this measurement is $|i\rangle_a$, return the system qubits' measurement outcome for $P_i$, and otherwise return 0.
4. Return to step 1 and repeat until enough measurements are obtained to estimate the desired expectation value to the desired precision.
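The classical post-processing implied by (17) and (19) can be sketched as follows (our own illustration: a small random symmetric matrix, rescaled so its spectral norm is below 1 in keeping with the normalization of Definition 1, stands in for the Hamiltonian, and the moments are computed exactly rather than sampled):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((16, 16))
Hfull = (A + A.T) / 2
Hn = Hfull / (1.1 * np.linalg.norm(Hfull, 2))   # normalize so ||H|| < 1
psi0 = rng.standard_normal(16)
psi0 /= np.linalg.norm(psi0)

D = 6
# Chebyshev recursion applied to the state: T_{k+1} = 2 H T_k - T_{k-1}.
cheb = [psi0, Hn @ psi0]
for _ in range(2 * D - 2):
    cheb.append(2 * Hn @ cheb[-1] - cheb[-2])
m = np.array([psi0 @ v for v in cheb])   # moments <T_k(H)>_0, k = 0..2D-1

# Subspace matrices from the moments, per (17) and (19).
S = np.empty((D, D))
Hk = np.empty((D, D))
for i in range(D):
    for j in range(D):
        S[i, j] = (m[i + j] + m[abs(i - j)]) / 2
        Hk[i, j] = (m[i + j + 1] + m[abs(i + j - 1)]
                    + m[abs(i - j) + 1] + m[abs(abs(i - j) - 1)]) / 4

# Cross-check against the Krylov vectors |psi_k> = T_k(H)|psi0> directly.
V = np.column_stack(cheb[:D])
assert np.allclose(S, V.T @ V)
assert np.allclose(Hk, V.T @ Hn @ V)
```

The cross-check at the end confirms that the $2D$ moments alone reproduce both subspace matrices, which is why only $2D$ distinct expectation values need to be estimated on the quantum computer.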
Note that in step 3 above, locally-compatible Pauli measurements can be grouped using any of the same methods as for standard Pauli-Hamiltonian expectation value estimation [46,47,48,49,50,51,52], by measuring the system qubits in the shared local Pauli basis. Also note that by employing one of these methods, the number of shots depends on the $\ell_1$-norm of the Pauli coefficients of the Hamiltonian, which in our case is fixed to 1 (see Definition 1), so the requirement of measuring in distinct Pauli bases does not contribute additional measurement overhead asymptotically. The above algorithm is displayed in Fig. 1.

Figure 1: For each $k = 0, 1, 2, \ldots, 2D-1$ we repeatedly sample from one of the above circuits, where $D$ is the desired Krylov space dimension. When $k$ is even, we apply the upper circuit. When $k$ is odd, we apply the lower circuit for each Pauli term $P_i$ in the Hamiltonian. All measurements are in the computational basis, and $C_{P_i}$ is a single layer of single-qubit Clifford gates that maps $P_i$ to a diagonal Pauli operator. As defined in the text, $U$ block encodes the Hamiltonian $H$, $G$ prepares the state that identifies the block containing $H$, and $R$ reflects about that state.
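The measurement identities in (26) can likewise be verified on a toy block encoding (our sketch; the two-term Hamiltonian and the exact statevector arithmetic, in place of sampled measurement outcomes, are assumptions for illustration):

```python
import numpy as np

# Toy block encoding of H = 0.5 X + 0.5 Z, used to check that for
# |psi_m> = (RU)^m |G>|psi_0>,
#   <psi_m| R |psi_m> = <T_{2m}(H)>_0    (even k = 2m),
#   <psi_m| U |psi_m> = <T_{2m+1}(H)>_0  (odd  k = 2m+1).
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])
H = 0.5 * X + 0.5 * Z
U = np.kron(np.diag([1., 0.]), X) + np.kron(np.diag([0., 1.]), Z)
G = np.sqrt([0.5, 0.5])
R = 2 * np.kron(np.outer(G, G), np.eye(2)) - np.eye(4)

psi0 = np.array([np.cos(0.3), np.sin(0.3)])     # arbitrary initial state
state = np.kron(G, psi0)                        # |G>_a (x) |psi_0>

# Reference Chebyshev moments <T_k(H)>_0 via the three-term recursion.
v_prev, v_curr = psi0.copy(), H @ psi0
moments = [psi0 @ v_prev, psi0 @ v_curr]
for _ in range(2, 10):
    v_prev, v_curr = v_curr, 2 * H @ v_curr - v_prev
    moments.append(psi0 @ v_curr)

for mm in range(5):
    assert np.isclose(state @ R @ state, moments[2 * mm])      # even k
    assert np.isclose(state @ U @ state, moments[2 * mm + 1])  # odd k
    state = R @ U @ state                                      # m -> m + 1
```

On hardware, the two expectation values per prepared state would of course be estimated from repeated shots of the two circuits in Fig. 1 rather than computed exactly.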

Runtime and qubit cost
Let us summarize the requirements and costs for our quantum implementation of the Lanczos method: 1. The required input is a block encoding $(U, G)$ of $H$: there are various options for constructing this, as discussed in Section 2.1. Using the example in Section 2.1, for $n$ qubits and $T$ Hamiltonian terms the cost of $U$ is $nT$, the cost of $G$ is $2T$, and the cost of $R$ is $4T$, all in $\lceil\log_2 T\rceil$-controlled single-qubit gates.
2. The number of measurements depends on the desired error and the classical method for regularizing and solving the generalized eigenvalue problem (10). This analysis is provided in Section 4.
3. Each state preparation [given in (24)] requires preparing $|G\rangle_a \otimes |\psi_0\rangle$, followed by at most $D-1$ applications of $RU$. We then either measure in a local Pauli basis, or apply $G^\dagger$ and then measure. The latter leads to the longest coherent sequence of operations required by the algorithm. Using the costs of the block encoding in point 1 above, we find that this sequence requires at most $(D-1)nT + 4DT$ $\lceil\log_2 T\rceil$-controlled single-qubit gates.
4. The required qubits are those encoding the system, i.e., those that $H$ acts upon, plus auxiliary qubits whose state $|G\rangle_a$ identifies the block in which $H$ is encoded. The number of auxiliary qubits required depends on the choice of block encoding, but can be chosen to be $\lceil\log_2 T\rceil$ when the Hamiltonian contains $T$ terms, as discussed in Section 2.1.

Other types of block encoding
The algorithm above can be applied for any block encoding of a linear combination of Pauli operators as in (3). The method is independent of how the indices of the Pauli operators are encoded in the auxiliary register, but the specifics of the measurement scheme do depend on the terms being Pauli operators. To see this, recall that when k is odd, we want to estimate the expectation value of the block encoding unitary U in the prepared state |ψ k/2 . When the terms are Pauli operators, we accomplish this by separately estimating each term using the lower circuit in Fig. 1. When a different block encoding is the input, however, we cannot necessarily measure U by repeatedly measuring in Pauli bases. In such a case, since by assumption U is a unitary that we have a quantum circuit for, we can estimate its expectation value using a Hadamard test or equivalent (see for example [25]). This would replace the Pauli basis measurement in the odd k circuit in Fig. 1. Note that this is also an option in the case of a Pauli Hamiltonian, where it represents trading off increased circuit depth (for the Hadamard test or equivalent) for a lower number of repetitions (not having to repeat for all Pauli bases).
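A minimal statevector sketch of such a Hadamard test (ours; exact probabilities in place of sampled shots) confirms that the ancilla statistics encode $\mathrm{Re}\langle\psi|U|\psi\rangle$:

```python
import numpy as np

def hadamard_test_p0(U, psi):
    """Probability of measuring the ancilla in |0> in a Hadamard test:
    |0>|psi> -> H on ancilla -> controlled-U -> H on ancilla -> measure."""
    n = len(psi)
    Hd = np.array([[1., 1.], [1., -1.]]) / np.sqrt(2)
    CU = np.block([[np.eye(n), np.zeros((n, n))],
                   [np.zeros((n, n)), U]])
    state = np.kron(np.array([1., 0.]), psi)       # ancilla |0>, system |psi>
    state = np.kron(Hd, np.eye(n)) @ state
    state = CU @ state
    state = np.kron(Hd, np.eye(n)) @ state
    return np.linalg.norm(state[:n]) ** 2          # weight on ancilla |0>

# p0 = (1 + Re<psi|U|psi>) / 2
U = np.diag([1, 1j]).astype(complex)
psi = np.array([1., 1.]) / np.sqrt(2)
p0 = hadamard_test_p0(U, psi)
assert np.isclose(2 * p0 - 1, np.real(psi.conj() @ U @ psi))
```

Here `2*p0 - 1` recovers the real part of the expectation value; an extra ancilla phase gate would give the imaginary part, though for the Hermitian observables in (26) the expectation value is real.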

Error analysis
In this section we analyze the error scaling of our algorithm subject to finite sample noise and to regularization of the overlap matrix involved in the generalized eigenvalue problem (10). Finite sample noise enters because the matrix elements of $H$ and $S$ are obtained from expectation values estimated by repeated measurements; this introduces a statistical error into the resulting energy estimates.
In addition, when executed on a real quantum computer any quantum algorithm will be subject to device noise. Since our error analysis applies to arbitrary perturbations of the measured quantities, it can also be used to obtain energy error bounds under such device noise, but since that is device specific, we focus on finite sample noise for simplicity.
The overlap matrix $S$ must be regularized because the stability of the generalized eigenvalue problem (10) depends on $S$ being well-conditioned. In practice the condition number of $S$ grows exponentially with the Krylov space dimension $D$; this has been proven for a Krylov space generated by powers of the Hamiltonian [53], and appears to hold numerically for Chebyshev polynomials of the Hamiltonian. This means that the generalized eigenvalue problem must be regularized, which may be done by thresholding the eigenvalues of $S$, i.e., projecting both $H$ and $S$ onto the span of the eigenvectors of $S$ whose eigenvalues are above some threshold $\epsilon > 0$ [23,37].
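Thresholding can be sketched in a few lines (our illustration; the function name and the choice of returning only the lowest eigenvalue are ours):

```python
import numpy as np
import scipy.linalg as la

def regularized_gevp(Hk, S, eps):
    """Threshold-regularized generalized eigenvalue problem: project
    (Hk, S) onto the eigenvectors of S with eigenvalues above eps,
    then solve for the lowest eigenvalue in that subspace."""
    w, v = np.linalg.eigh(S)
    keep = v[:, w > eps]               # retained eigenvectors of S
    Hp = keep.conj().T @ Hk @ keep
    Sp = keep.conj().T @ S @ keep      # diagonal with entries > eps
    return la.eigh(Hp, Sp, subset_by_index=[0, 0])[0][0]

# Sanity check: with a well-conditioned S, thresholding changes nothing.
assert np.isclose(regularized_gevp(np.diag([3., -2., 1.]), np.eye(3), 1e-12), -2.0)
```

In a realistic run `eps` would be set in proportion to the estimated noise level, as discussed below; a production implementation should also guard against the retained subspace being empty.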
For Krylov spaces generated by real time evolutions, the resulting error in the ground state energy estimate was analyzed in [37] (related techniques for handling noise and regularization are discussed in [23]). We adapt their proof to our Krylov space generated by Chebyshev polynomials of the Hamiltonian, with the following result.

Theorem 1. Suppose the Krylov space (13) is constructed using the method in Section 3, yielding noisy estimates $(\tilde{H}, \tilde{S}) = (H + \Delta_H, S + \Delta_S)$ of $(H, S)$, for some Hermitian perturbations $(\Delta_H, \Delta_S)$ whose spectral norms are $(\eta_H, \eta_S)$. Denote by $\eta$ the total noise rate, combining $\eta_H$ and $\eta_S$. Let

$\gamma_0 = \langle E_0|\psi_0\rangle \qquad (29)$

be the overlap of the initial reference state $|\psi_0\rangle$ with the true ground state $|E_0\rangle$. Let $\tilde{S}$ be regularized by projecting it (and $\tilde{H}$) onto the subspace spanned by its eigenvectors with eigenvalues above some threshold $\epsilon > 0$, and let $\epsilon_{\text{total}}$ be the sum of the eigenvalues of $S$ discarded by regularizing it according to $\epsilon$ in the same way. Then, provided the noise rate $\eta$ is sufficiently small (small enough that the assumptions of Lemma B.1 can be satisfied), there exists a choice of threshold $\epsilon$ (31), scaling as a power of $\eta$ determined by a constant $0 \leq \alpha \leq 1/2$, such that for any $\delta > 0$, the error $E$ in the ground state energy estimate coming from the regularized version of the noisy problem $(\tilde{H}, \tilde{S})$ is bounded as in (32).

The proof is given in Appendix B, which also gives a more precise characterization of what "provided the noise rate $\eta$ is sufficiently small" means. The noise rate is not required to be exponentially small in any of the problem parameters. Although (32) is given as an asymptotic scaling for simplicity, the two results it is based on (Lemma B.1 and Theorem 2 in Appendix B) give explicit bounds on the errors due to noise and thresholding, respectively, so the total error bound (32) could also be reformulated as an explicit (i.e., nonasymptotic) bound.
The result (32) may be interpreted as follows. The last term is the error due to the exact (i.e., noiseless and nonthresholded) Krylov space, which vanishes exponentially quickly with the Krylov space dimension D. The third term, δ, is an energy error tolerance that determines the rate of convergence of the last term. The error bound (32) holds for any δ > 0, i.e., δ is only relevant to the analysis, and does not actually impact the algorithm. It characterizes the rate of exponential decay of amplitudes of energies more than δ above the ground state energy, via the final term in (32). Hence if δ is chosen to be equal to the spectral gap ∆, then the third term in (32) can be removed, because in that case there are no excited states within δ of the ground state energy. However, if δ is chosen to be larger than the spectral gap, then although the ground state energy is approximated, the corresponding state will not approximate the true ground state in general: it will only approximate some arbitrary state in the low-energy subspace within δ of the ground state energy.
The second term in (32) is due to regularization of $S$ by $\epsilon$, i.e., to the discarding of eigenspaces of $S$ with eigenvalues smaller than $\epsilon$. Finally, the first term in (32) is due to finite sample noise. The polynomial factor in $D$ (with exponent involving $1+\alpha$) in this term results from the proof technique of Theorem 2.7 in [37] (reproduced in our notation as Lemma B.1 in Appendix B), in which the authors note that this factor might be reduced or eliminated by an improved proof; we discuss this further below. Numerics also indicate linear scaling $\epsilon = \Theta(\eta)$ (i.e., $\alpha = 0$) if various heuristics are adopted to select an optimal threshold [note that this scaling is compatible with (31)]. Such heuristics involve starting at a high threshold and decreasing it as long as the energy estimate continues to converge.
To understand the implications of Theorem 1 in practice, we can begin by observing that, since the eigenvalues of $S$ decrease exponentially as mentioned above, the total discarded weight satisfies $\epsilon_{\text{total}} = O(\epsilon)$ up to logarithmic factors (33). Also, when the noise $\eta$ comes from taking finite numbers of samples to estimate each matrix element of $(\tilde{H}, \tilde{S})$, we have

$\eta = O(1/\sqrt{M}) \qquad (34)$

when $M$ measurements are performed per matrix element.
Note that a complete expression for $\eta$ would also include some dependence on $D$, since $\eta$ is obtained from the spectral norms of the perturbation matrices $\Delta_H$ and $\Delta_S$. However, $\eta$ enters (32) in the first two terms via (31) and (33), and in the context of this total energy error bound, no strong dependence on $D$ is generally found in practice, whether coming from $\eta$ or from the explicit factors in (31). While in the presence of noise and thresholding it sometimes happens that the error increases over short ranges of $D$, these increases are relatively small and the overall behavior remains bounded below a monotonically decreasing envelope. See our numerics in Section 5 and the numerics in [23,37], for example. Providing a rigorous proof of this property is an interesting direction for future work.
Finally, although [37] provided numerics suggesting that $\alpha = 1/4$ is a good value, that was for the original bound through which $\alpha$ entered their proof; our numerics suggest that in practice, at least in the context of Theorem 1 for our Chebyshev polynomial subspace, $\alpha = 0$ is a better value (see Section 5). Hence, using (31) with $\alpha = 0$ together with (33) and (34), we obtain the "in practice" relations

$\epsilon = \Theta(\eta), \qquad \epsilon_{\text{total}} = O(\eta), \qquad \eta = O(1/\sqrt{M}), \qquad (35)$

and, subject to the above assumptions, (32) becomes the "in practice" error bound (36).

From (36), we can find the Krylov space dimension $D$ required to reach error $E$ in the ground state energy estimate: from the final term in (36), in the limit of small initial state overlap $|\gamma_0|$ and small $\delta$, we obtain the scaling (37). To simplify this scaling, note that since $\delta$ appears as a term in the error (36) whenever it is larger than the spectral gap $\Delta$ (as discussed above), if the target error $E$ is larger than $\Delta$, we must choose $\delta = O(E)$. Given this, there is no reason to choose it to be smaller, so we can choose $\delta = \Theta(E)$. Substituting this into (37) yields (38). If instead the target error $E$ is smaller than $\Delta$, then we should instead choose $\delta = \Delta$, since in that case the $\delta$ term in (36) [and in (32)] vanishes, as noted above. Hence in general

$\delta = \max(\Theta(E), \Delta), \qquad (39)$

so a general bound on the required Krylov space dimension $D$ is given by (40). Recall that $D$ is also the maximum circuit depth in terms of queries to the block-encoding operators [see (26) and Fig. 1]. The first two terms in (36) can then provide asymptotic lower bounds on the number of measurements $M$. From the first term, we find that

$M = \Omega(1/E^2) \qquad (41)$

up to logarithmic factors, which should be unsurprising, since this is always the scaling obtained from algorithms based on repeated sampling. From the second term, we obtain a lower bound (42) in terms of the initial state overlap $|\gamma_0|^2$, where $\delta = \Theta(E)$ is used as above. Since $M$ must satisfy both of these asymptotic lower bounds, we can combine them into (43), the total number of measurements required per matrix element.
Finally, since all matrix elements of $S$ and $H$ are calculated from the $2D$ expectation values (20), the total number of measurements required for the entire algorithm scales as given in (44), where the $\Theta$ indicates that the logarithmic factor in (40) has been suppressed. For a nonasymptotic analysis of this scaling, one would need to account for covariance of the resulting matrix element estimates, but this is unnecessary to obtain the asymptotic scaling (44). The first term in (36) might in principle also have some bearing on the number of measurements as a function of the Krylov space dimension, but as noted above, [37] points out that this scaling may be an artifact of their proof technique. Regardless, the number of measurements depends only polynomially on the Krylov space dimension as well as the problem parameters $|\gamma_0|$ and $E$.

Numerical demonstrations
As a demonstration of our method, we classically simulated its Krylov spaces for some example Hamiltonians. For each Hamiltonian, we directly calculated the expectation values in (20), which are the quantities we would actually estimate on the quantum computer in a real experiment, then used them to construct the subspace matrices H and S as in (17) and (19). We simulated the effect of finite sample noise by adding Gaussian noise with various standard deviations to those expectation values. We then regularized the generalized eigenvalue problem by thresholding as described in the previous section.
For zero-noise simulations, the threshold was set to $10^{-13}$. For simulations with nonzero noise, the threshold was set proportional to the noise rate, with proportionality constants of 30 for the spin models and 50 for the molecules. These constants were chosen to eliminate most spurious eigenvalues due to ill-conditioning: such spurious eigenvalues appear as dramatic downward jumps in the energy estimates with increasing subspace dimension. In an actual experiment one would want to choose the thresholds by detecting such jumps automatically, but for the purpose of our simple demonstration it was enough to choose constants by hand that eliminated most of the jumps. Since some spurious eigenvalues remained, we further eliminated outliers by taking as our datapoints the means of the middle 10% of 100 independent runs at each noise rate and subspace dimension.
Results of these simulations are shown in Fig. 2.

Figure 3: Energy error per site versus noise rate per measured quantity (upper panel) and total queries to the block encoding (lower panel) for the $J_1$-$J_2$ model on several different square lattices given in the legend. These energy errors were each averaged over a sequence of 10 dimensions once convergence was reached (i.e., once the contribution of the last term in (32) had become negligible); for example, the purple curve in the upper panel is the points on the right of the upper panel in Fig. 2, plotted against the corresponding noise rates. The dashed lines illustrate the expected scalings from the theoretical analysis. Note that the two panels show the same information, since the total queries to the block encoding is proportional to the total number of shots times the circuit depth, where the depths were chosen to be those required to reach convergence. For both plots we took those depths to be 5 for (2, 2), 40 for (2, 3) and (3, 3), and 50 for (3, 4) and (4, 4).

The Hamiltonian of the J1-J2 model is given by
$$H = J_1 \sum_{\langle i,j \rangle} \vec S_i \cdot \vec S_j + J_2 \sum_{\langle\langle i,j \rangle\rangle} \vec S_i \cdot \vec S_j,$$
where $\langle \cdot,\cdot \rangle$ denotes nearest-neighbor pairs, $\langle\langle \cdot,\cdot \rangle\rangle$ denotes next-nearest-neighbor pairs, and the couplings in our case were set to $J_1 = 1$ and $J_2 = 0.5$. The initial state was chosen to be the antiferromagnetic state, i.e., a checkerboard of alternating spins across the lattice, which yielded an initial state overlap of $|\gamma_0| = 0.179$. The upper panel shows the complete data, while for easier reading the lower panel shows a smoothed version of the same curves obtained by averaging each set of energy errors at 10 successive subspace dimensions. In both cases, after a period of ill-defined behavior at low Krylov space dimension, the overall trends at all noise rates settle into roughly exponential decay towards some minimum "converged" error level, with local fluctuations along the way. This is expected based on the theoretical error bounds in the previous section: the error decays exponentially with Krylov space dimension towards a minimum value set by the first three terms in (32). Fig. 3 shows energy error versus simulated noise rate for the J1-J2 model on a number of different square lattices. These energy errors were each averaged over a sequence of 10 dimensions once convergence was reached (i.e., once the contribution of the last term in (32) had become negligible). As we would expect, at this point energy error and noise rate are related approximately linearly (the dashed line shows the curve $y = x$).
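To make the setup concrete, the following numpy sketch builds a toy J1-J2 Hamiltonian on a hypothetical 2x2 open lattice (far smaller than the lattices in the figures; the lattice choice and all names are illustrative assumptions) and checks that a power-of-$H$ Krylov space started from the checkerboard state converges to the exact ground energy.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])

def two_site(op, i, j, n):
    """One Pauli-pair contribution to S_i . S_j on n spin-1/2 sites (S = sigma/2)."""
    mats = [I2] * n
    mats[i] = op
    mats[j] = op
    return reduce(np.kron, mats) / 4

def heisenberg(pairs, n):
    H = np.zeros((2**n, 2**n), dtype=complex)
    for (i, j) in pairs:
        for P in (X, Y, Z):
            H += two_site(P, i, j, n)
    return H

# 2x2 open square lattice, sites laid out as 0 1 / 2 3
nn  = [(0, 1), (2, 3), (0, 2), (1, 3)]   # nearest neighbors
nnn = [(0, 3), (1, 2)]                   # diagonals (next-nearest neighbors)
H = heisenberg(nn, 4) + 0.5 * heisenberg(nnn, 4)

E_exact = np.linalg.eigvalsh(H)[0].real

# Checkerboard (Neel) reference state |0110>
psi0 = np.zeros(16)
psi0[0b0110] = 1.0

# Krylov space from powers of H, diagonalized in the subspace
K = np.column_stack([np.linalg.matrix_power(H, k) @ psi0 for k in range(6)])
Q, _ = np.linalg.qr(K)
E_krylov = np.linalg.eigvalsh(Q.conj().T @ H @ Q)[0].real
```

For this tiny instance the Krylov estimate already agrees with exact diagonalization, since the reference state overlaps only a handful of distinct eigenvalues.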
Finally, Fig. 4 shows energy error versus Krylov space dimension for the electronic structure Hamiltonian of BeH$_2$ in the STO-3G basis. The initial state was chosen to be the Hartree-Fock state, which has an initial state overlap of 0.986. Unfortunately, numerical instability in our classical simulation prevented us from reaching as high a Krylov space dimension for the electronic structure Hamiltonian as for the spin models, and it is not clear that the energies had finished converging to their minimum values. This is possibly due to the choice of the Hartree-Fock state as the initial state: although it is a good initial approximation to the ground state in this case, the Hartree-Fock state contains no electron correlation. This means that all electron correlation of the lowest-energy state in the Krylov space must be contributed by the corresponding polynomial of the Hamiltonian, which may live in a poorly conditioned part of the space.
Since we have access only to classically simulable small models, we cannot draw strong conclusions from these simulations regarding the asymptotic behavior of our method. However, at least for these test cases, the behavior is roughly what we would expect from the theory in the previous section.

Conclusion
In this work we have shown how to exactly reproduce the Krylov space of the classical Lanczos method on a quantum computer, up to sampling error, while avoiding the exponential classical cost of representing quantum states. The required Krylov space dimension is inverse polynomial in the desired ground state energy error, and the sampling overhead is inverse polynomial in both the ground state energy error and the initial state overlap. To our knowledge, the present method is conceptually the simplest use of block encoding in quantum algorithms to date, since it only requires applying up to one local basis rotation per circuit in addition to the block encoding unitaries themselves (see Fig. 1).
We conclude by discussing how the present method compares with existing quantum algorithms for approximating ground states. As noted in the introduction, if we compare to previous quantum subspace diagonalization methods [16,17,18,21,23,24,25], our algorithm benefits from being exactly equivalent to the classical Lanczos method up to sampling noise, without requiring approximate real or imaginary time evolution. This offers two advantages. First, the error in approximating real or imaginary time evolution, even if it can be suppressed, can never be reduced to zero. Second, in quantum algorithms for simulating real time evolution, the circuit depths increase as the target error is reduced. There are many methods to approximate time evolutions, but we mention two to illustrate this point. First, if the time evolutions are approximated by Trotterization, then they require depth of order $\epsilon^{-1/k}$, where $k$ is the Trotter order and $\epsilon$ is the error. Hence deep circuits will be required merely to reach high accuracy in approximating the Krylov states: if $d$ digits of accuracy are desired, then the resulting circuit depths scale as $O(\exp(d/k))$.
On the other hand, if the time evolutions are approximated by qubitization [38], the resulting circuit depths scale much more favorably with error, as $O(\log(1/\epsilon))$. However, this case is easily compared to our algorithm, because qubitization uses the same block encoding unitaries that our method is based upon. Thus, while qubitization would require circuits using up to $O(D + \log(1/\epsilon))$ applications of the block encoding unitaries, as well as other controlled phases, to construct a Krylov space of dimension $D$, our method requires circuits of depth at most $D$ in the same block encoding unitaries, with the only additional gates being the local basis rotation mentioned above.
This means that in practice, the advantage of our method over prior quantum subspace expansions is that it avoids all question of the accuracy of approximation of the Krylov states themselves, completely eliminating this source of error and circuit depth. This reduces both depth and number of measurements required to obtain a result of a given accuracy. Since circuit depth and number of measurements are key barriers to the success of quantum algorithms and will remain so even into the fault-tolerant era, this advance is significant.
In comparison to using quantum imaginary time evolution to approximate the ground state, we do not require the underlying quantum state to have finite correlation length as it goes through the evolution. Even in the presence of a finite correlation length, there are significant prefactors in imaginary time evolution that lead to high computational cost (the cost is $4^{O(\log n)} = O(\mathrm{poly}(n))$, but this can be a high-degree polynomial [16]). Adiabatic state preparation [9] requires a gapped path from some efficiently-solvable Hamiltonian to the target Hamiltonian, while our method does not depend on the spectral gap even when it is small (see Section 4). Our method also avoids the typical pitfalls of parametric optimization that appear in variational quantum eigensolvers [2], such as local minima and barren plateaus [54,55,56,57,58], and admits rigorous error bounds (Section 4).
The most well-known algorithm for quantum computation of ground state energies is phase estimation. To be successful, both our method and phase estimation require an initial state overlap $|\gamma_0| = \Omega(1/\mathrm{poly}(n))$, for different reasons. For phase estimation, the state overlap determines the success probability of the algorithm, while for our method it impacts the number of measurements required to obtain small errors (only entering the circuit depth logarithmically). Similar to other quantum subspace diagonalization methods, our algorithm has quantifiable robustness to noise in the sense that it produces energy estimates with bounded error even if the subspace matrices are not exact. This robustness is not encountered in conventional phase estimation (although more recent ground-state preparation algorithms such as [

A Proof of Lemma 1
The following lemma is based directly on Section IV in [38]:

Lemma 1 (Chebyshev polynomials from block encoding). Given a block encoding $(U, G)$ of a Hamiltonian $H$, such that $U^2 = \mathbb{1}$, let
$$R = 2\,(|G\rangle\langle G|)_a \otimes \mathbb{1} - \mathbb{1} \qquad (46)$$
be the reflection around $|G\rangle_a$ in the auxiliary space. Then for any $k = 0, 1, 2, \ldots$,
$$\big(\langle G|_a \otimes \mathbb{1}\big)\,(RU)^k\,\big(|G\rangle_a \otimes \mathbb{1}\big) = T_k(H),$$
where $T_k(\cdot)$ is the $k$th Chebyshev polynomial of the first kind. In other words, $(RU)^k$ is a block encoding of $T_k(H)$.
Proof. Let $|\lambda\rangle$ denote an arbitrary eigenvector of $H$ with eigenvalue $\lambda$. We can evaluate the action of $U$ on $|G\rangle_a \otimes |\lambda\rangle$ as follows:
$$U\big(|G\rangle_a \otimes |\lambda\rangle\big) = |G\rangle_a \otimes H|\lambda\rangle + \eta_\lambda\,|{\perp_\lambda}\rangle, \qquad (48)$$
where $\eta_\lambda$ is some positive number that preserves normalization and $|{\perp_\lambda}\rangle$ is some (normalized) state of all of the qubits that is orthogonal to $|G\rangle_a$ in the auxiliary space, i.e., such that
$$\big(\langle G|_a \otimes \mathbb{1}\big)\,|{\perp_\lambda}\rangle = 0. \qquad (49)$$
Hence, (48) simply represents resolving the state $U(|G\rangle_a \otimes |\lambda\rangle)$ into its components parallel and orthogonal to $|G\rangle_a$ in the auxiliary space: the latter ($|{\perp_\lambda}\rangle$) is unknown, but the former is $|G\rangle_a \otimes H|\lambda\rangle$ by (1), which forms the first term in (48).
Since $|\lambda\rangle$ is an eigenstate of $H$, (48) becomes
$$U\big(|G\rangle_a \otimes |\lambda\rangle\big) = \lambda\,|G\rangle_a \otimes |\lambda\rangle + \sqrt{1-\lambda^2}\,|{\perp_\lambda}\rangle,$$
where the second step follows by identifying $\eta_\lambda = \sqrt{1-\lambda^2}$ as the required normalization factor for the second term. Hence, if we formally denote
$$|0_\lambda\rangle \equiv |G\rangle_a \otimes |\lambda\rangle, \qquad |1_\lambda\rangle \equiv |{\perp_\lambda}\rangle, \qquad (52)$$
we may write
$$U|0_\lambda\rangle = \lambda\,|0_\lambda\rangle + \sqrt{1-\lambda^2}\,|1_\lambda\rangle. \qquad (53)$$
Note that the vectors in (52) represent the states of the "virtual qubit" that is referred to in the name qubitization.
We also know that $U^2 = \mathbb{1}$ (5), which implies that within the two-dimensional subspace spanned by (52), the matrix representing $U^2$ must be
$$U^2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (54)$$
From this and (53), it follows that the matrix form of $U$ in this two-dimensional subspace must be
$$U = \begin{pmatrix} \lambda & \sqrt{1-\lambda^2} \\ \sqrt{1-\lambda^2} & -\lambda \end{pmatrix}, \qquad (55)$$
since (53) defines the first column of this matrix, and the second column is required to make it self-inverse.
The matrix for $U$ in (55) is a reflection. We can transform it into a rotation if we can flip the signs of the entries in the second row, since the result will have the form
$$\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \qquad (56)$$
for a rotation angle $\theta$ defined by
$$\cos\theta = \lambda. \qquad (57)$$
In order to flip the signs in the second row of (55), we need to multiply on the left by $Z$ within the two-dimensional subspace:
$$ZU = \begin{pmatrix} \lambda & \sqrt{1-\lambda^2} \\ -\sqrt{1-\lambda^2} & \lambda \end{pmatrix},$$
which has the desired form (56). One way to implement the action of $Z$ on the two-dimensional subspace spanned by (52) is to reflect around $|G\rangle_a$ in the auxiliary space, i.e., to implement $R$ as defined in (46). The actions of $R$ on the subspace are
$$R|0_\lambda\rangle = |0_\lambda\rangle, \qquad R|1_\lambda\rangle = -|1_\lambda\rangle,$$
where the second equation follows from (49). Hence, the matrix representation of $R$ within the two-dimensional subspace is
$$R = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} = Z.$$
Therefore, within the two-dimensional subspace,
$$RU = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$$
where $\theta$ is defined by (57).
As an aside, note that the form (46) for R is not actually a necessary condition, although it is sufficient. All that we strictly require is that R implements a reflection around |G a within each two-dimensional subspace associated to each λ: its action on the remainder of the auxiliary space is irrelevant. We will exploit this fact later on in constructing efficient implementations of R.
We can now use the fact that $k$ powers of a rotation by angle $\theta$ are simply the rotation by angle $k\theta$, so
$$(RU)^k = \begin{pmatrix} \cos k\theta & \sin k\theta \\ -\sin k\theta & \cos k\theta \end{pmatrix}.$$
But
$$\cos k\theta = \cos(k \arccos\lambda) = T_k(\lambda),$$
where $T_k(\cdot)$ is the $k$th Chebyshev polynomial of the first kind, so
$$\langle 0_\lambda | (RU)^k | 0_\lambda \rangle = T_k(\lambda).$$
By the definition (52) of the two-dimensional subspace associated to $\lambda$, this implies that
$$\big(\langle G|_a \otimes \mathbb{1}\big)(RU)^k\big(|G\rangle_a \otimes |\lambda\rangle\big) = T_k(\lambda)\,|\lambda\rangle.$$
The final step in the proof is to note that any arbitrary state $|\psi\rangle$ of the system can be expressed as a superposition of Hamiltonian eigenstates, so
$$\big(\langle G|_a \otimes \mathbb{1}\big)(RU)^k\big(|G\rangle_a \otimes |\psi\rangle\big) = T_k(H)|\psi\rangle.$$
Since this applies for any state $|\psi\rangle$, we have
$$\big(\langle G|_a \otimes \mathbb{1}\big)(RU)^k\big(|G\rangle_a \otimes \mathbb{1}\big) = T_k(H),$$
as desired.
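The two-dimensional argument above is easy to check numerically. The following sketch (our own illustration, not code from the paper) builds the 2x2 matrices for $U$ and $R = Z$ in the invariant subspace of an eigenvalue $\lambda$ and verifies that the $(0,0)$ entry of $(RU)^k$ equals $T_k(\lambda)$.

```python
import numpy as np

def qubitized_step(lam):
    """Return RU restricted to the 2D subspace {|0_lam>, |1_lam>}:
    U has first column (lam, sqrt(1-lam^2)) and is self-inverse; R acts as Z."""
    s = np.sqrt(1 - lam**2)
    U = np.array([[lam, s], [s, -lam]])   # matrix (55)
    R = np.diag([1.0, -1.0])              # Z in the virtual-qubit basis
    return R @ U

lam = 0.37
W = qubitized_step(lam)
for k in range(6):
    Tk = np.cos(k * np.arccos(lam))       # Chebyshev T_k(lam)
    entry = np.linalg.matrix_power(W, k)[0, 0]
    assert abs(entry - Tk) < 1e-12        # block-encoded value is T_k(lam)
```

The same check works for any $\lambda \in [-1, 1]$, mirroring the fact that $(RU)^k$ block-encodes $T_k(H)$ eigenvalue by eigenvalue.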

B Convergence of the noisy, thresholded Krylov method with Chebyshev polynomials
In this section we prove Theorem 1, which is stated in the main text and bounds the error of our method subject to noise and regularization via eigenvalue thresholding.

Theorem 1. Let $H$ be a Hamiltonian normalized so that its spectrum lies in $[-1, 1]$. Suppose we compute the $D$-dimensional Krylov space spanned by (13) using the method in Section 3, yielding noisy estimates $(\widetilde{\mathbf H}, \widetilde{\mathbf S})$ of $(\mathbf H, \mathbf S)$, for some Hermitian perturbations $(\Delta_H, \Delta_S)$ whose spectral norms are $(\eta_H, \eta_S)$. Denote the total noise rate by $\eta$. Let $\gamma_0$ be the overlap of the initial reference state $|\psi_0\rangle$ with the true ground state $|E_0\rangle$. Let $\widetilde{\mathbf S}$ be regularized by projecting it (and $\widetilde{\mathbf H}$) onto the subspace spanned by its eigenvectors with eigenvalues above some threshold $\epsilon > 0$. Let $\epsilon_{\text{total}}$ be the sum of the eigenvalues of $\mathbf S$ discarded by regularizing it according to $\epsilon$ in the same way. Then, provided the noise rate $\eta$ is sufficiently small (small enough that the assumptions of Lemma B.1 can be satisfied), there exists a choice of threshold $\epsilon$ satisfying (72), for some constant $0 \le \alpha \le 1/2$, such that for any $\delta > 0$, the error $\mathcal{E}$ in the ground state energy estimate coming from the regularized version of the noisy problem $(\widetilde{\mathbf H}, \widetilde{\mathbf S})$ is bounded as in (73). Note: the first term in (73) comes from Theorem 2.7 in [37] (reproduced below as Lemma B.1), and is subject to the additional assumptions given in the statement of Lemma B.1 below.
Proof. Let $E_0$ be the true ground state energy of $H$, let $E_0'$ be the energy estimate from the noiseless, thresholded Krylov space, and let $\tilde E_0$ be the energy estimate from the noisy, thresholded Krylov space. Then the total error is $\mathcal{E} = |\tilde E_0 - E_0|$, for which we obtain the bound (73) via the triangle inequality as a first step:
$$\mathcal{E} = |\tilde E_0 - E_0| \le |\tilde E_0 - E_0'| + |E_0' - E_0|. \qquad (74)$$
We will upper bound the first term, characterizing the error due to noise, using Theorem 2.7 of [37], which can be applied directly to our case. We will upper bound the second term in (74) using Theorem 2, proved below, which is based on Theorem 3.1 in [37], but which we had to modify to apply to our case, since the version in [37] is specific to subspaces generated by real time evolutions.
Theorem 2.7 in [37] may be restated in our notation as Lemma B.1, under the following assumptions.

(matching dimensions) Assume that during thresholding the same numbers of eigenvalues are discarded from the noisy and noiseless overlap matrices. This assumption is merely to ensure that the resulting thresholded matrix problems have the same dimensions.

(bound on $\mathbf H$ relative to $\mathbf S$) Assume that the bound (75) holds for all $i, j$, for some constants $\mu > 0$ and $0 \le \alpha \le 1/2$. This bound always holds if $\alpha = 1/2$, but smaller $\alpha$ may be more realistic in practice (see [37]).

(bound on noise rate relative to threshold) Assume that the noise rate $\eta$ and the threshold $\epsilon$ satisfy condition (77), where $\chi$ is defined in (78).

(gap condition) Assume the gap condition (79), where $E_0'$, $E_1'$ are the lowest and second-to-lowest eigenvalues of the noiseless, thresholded problem.
Then the difference between the lowest eigenvalue $E_0'$ of the noiseless thresholded problem and the lowest eigenvalue $\tilde E_0$ of the noisy thresholded problem is bounded as in (80), where $\kappa$ is the condition number of $\arctan(E_0)$.
[Note: Theorem 2.7 in [37] also includes one further assumption, but it is guaranteed to hold by (77).] Our input Hamiltonian will be normalized such that its eigenvalues lie inside the range $[-1, 1]$. Therefore, the condition number $\kappa$ of $\arctan(E_0)$ is bounded by a constant, and we can replace $\kappa$ by 1 in (80); the resulting bound is weaker but equivalent in scaling. Also, for $x \in [-1, 1]$, the minimum value of the first derivative of $\arctan(x)$ is $1/2$. Finally, for $x \in [0, 1]$, $\arcsin(x) \le \pi x/2$, so we can obtain a simplified energy bound, (85). Similarly, since the maximum value of the first derivative of $\arctan(x)$ on $[-1, 1]$ is 1 and $\arcsin(x) \ge x$ on $[0, 1]$, the condition (79) on the gap can be simplified accordingly. Hence we just need to choose the threshold $\epsilon$ as in (87), and (77) will also be guaranteed, so the only remaining assumption in the statement of Lemma B.1 will be that (75) holds.
Inserting the definition (78) of $\chi$ into (87) yields the threshold choice (88). If we now insert the definition of $\chi$ into (85) and simplify within asymptotic notation, we obtain (89), where the second step uses (88). Since $\Delta$ will be smaller than 1 except possibly at the lowest Krylov space dimensions, we can obtain a weaker upper bound by simply dropping the factor that depends on $\Delta$, obtaining (90); in practice, the inclusion of $\Delta$ in the denominator of the lower bound to $\epsilon$ as in (88) is probably unrealistic anyway, so we are likely not sacrificing any tightness of the bound by making this simplification. Inserting (90) in (74) provides the first term in (73), while (88) yields the condition (72). The remaining terms in (73), coming from the second term in (74), are provided by (91), which is the result of Theorem 2, below. When inserting this into (74), we assume that the threshold $\epsilon$ is small enough that the denominator in (91) scales as $\Theta(|\gamma_0|^2)$, allowing us to ignore the second term in the denominator in the asymptotic scaling. This is why the statement of Theorem 1 only claims that if the noise rate is sufficiently small then a threshold exists such that (73) is satisfied.

Theorem 2. Instate the prevailing notation. The error in the noiseless thresholded ground state energy estimate $E_0'$ is bounded as in (91), whose denominator involves the overlap of the initial state with the subspace whose energies are within $\delta$ of $E_0$. [Note: this form of the bound only holds when $|\gamma_0| > \sqrt{2D\epsilon}$, and is only tight when the denominator in (91) is close to its leading term.]

Proof. Let $K$ be the Krylov matrix whose columns are the Krylov vectors, i.e., as in (93). The projected Hamiltonian $\mathbf H$ and overlap matrix $\mathbf S$ are given by
$$\mathbf H = K^\dagger H K, \qquad \mathbf S = K^\dagger K.$$
Hence, the eigenvalues of $\mathbf S$ are the squares of the singular values of $K$, and the corresponding eigenvectors of $\mathbf S$ are the right singular vectors of $K$.
Therefore, applying thresholding that removes the eigenspaces of $\mathbf S$ with eigenvalues smaller than $\epsilon$ is equivalent to truncating the Krylov space to the span of the left singular vectors (which are states in the Hilbert space) of $K$ with singular values at least $\sqrt\epsilon$. We can factor $K$ by its singular value decomposition, $K = \Psi \Gamma F$, where $\Psi$ contains the left singular vectors, $\Gamma$ the singular values, and $F$ the right singular vectors. Let $K_\epsilon = \Psi \Gamma_\epsilon F$ denote the matrix obtained by setting all singular values of $K$ smaller than $\sqrt\epsilon$ to zero.
Let $\mathcal R$ be the range (column space) of $K_\epsilon$, i.e., the thresholded Krylov space, spanned by the left singular vectors of $K$ with singular values at least $\sqrt\epsilon$. Then by the Rayleigh-Ritz variational principle, the error in the ground state energy estimate in the thresholded Krylov space is
$$E_0' - E_0 = \min_{|\psi\rangle \in \mathcal R} \frac{\langle \psi | H - E_0 | \psi \rangle}{\langle \psi | \psi \rangle}.$$
We will upper bound the above minimization by inserting a particular state $|\psi\rangle \in \mathcal R$ into its argument.
To construct this state, we use the polynomial $p^*$ given in Lemma B.2, below, with the parameters $a, b, d$ in Lemma B.2 set to $a = \delta^2/4$, $b = 1$, and $d = \lfloor k/2 \rfloor$: in terms of this, define the function $f$ as
$$f(x) = p^*\!\left(\left(\frac{x - E_0}{2}\right)^2\right). \qquad (99)$$
By Lemma B.2, $f(E_0) = p^*(0) = 1$, but for any $x \in [-1, 1]$ such that $|x - E_0| \ge \delta$, the much smaller bound (100) holds. Note that this bound is only tight when it is much less than 1. Since $p^*$ has degree $\lfloor k/2 \rfloor$, $f(x)$ has degree either $k - 1$ or $k$. Let $c_j$ be the coefficients of the Chebyshev expansion of $f$:
$$f(x) = \sum_{j=0}^{k} c_j T_j(x). \qquad (101)$$
Since $|p^*(x)| \le 1$ for $0 \le x \le 1$, we have $|f(x)| \le 1$ for $-1 \le x \le 1$, and for $x$ outside the interval $[E_0 - \delta, E_0 + \delta]$ the much tighter bound (100) holds. Therefore we obtain (102), which holds for $\delta < 1/4$, so, using (103), we obtain the bound (104) on the coefficients $c_j$. The above is really just an application of Parseval's theorem, using the fact that the Chebyshev polynomials are orthogonal on $[-1, 1]$ with respect to the measure $dx/\sqrt{1 - x^2}$. As noted after (100), this upper bound $g(\delta, k)$ is only tight when its second term is much smaller than 1, which will always hold in our regime of interest, since, as we will see, this second term is proportional to the main energy error term we will end up with.
Let $|\psi\rangle$ be defined in terms of the coefficients $c_j$ as in (105) (note that this state is in $\mathcal R$ by construction). Let $I$ be the least value of $i$ such that $E_I - E_0 > \delta$. We separately upper bound the parts of (106) coming from terms in the numerator with $i \ge I$ and $i < I$. We start with the former, i.e., let $i \ge I$ and consider the term (107). We upper bound its numerator as in (108), using (100) and (104), where in the last line we defined $g(\delta, k)$ as in (109). For the denominator of this term, we use (110), where the second line follows from the reverse triangle inequality and (101), the third line follows from the triangle inequality and (99), and the last line follows from the fact that each $|c_j| \le 2$ (obtained by evaluating the integral in the first line of (104) with $f^2(x)$ replaced by its upper bound of 1). Extending the above bound using the spectral norm bound (111), the term (107) is upper bounded by (113). Now consider together all of the terms in (106) with $i < I$, collected in (114). The right-hand side of (114) is an expectation value of the difference between energy and ground state energy for some state in the subspace whose energies are within $\delta$ of the ground state energy. Hence, by the Rayleigh-Ritz variational principle, (114) is upper bounded by $\delta$, as in (115). Inserting (113) and (115) into (106), we obtain our desired result, where the last line is obtained by inserting the definition of $g(\delta, k)$ as in (104) and upper bounding $1 - \frac{2}{\pi}\sqrt{\delta} \le 1$, and in the second line we inserted $\epsilon$ and $\epsilon_{\text{total}}$, which we define to be the squared Frobenius distance from $K$ to $K_\epsilon$:
$$\epsilon_{\text{total}} = \| K - K_\epsilon \|_F^2 = \| \Gamma - \Gamma_\epsilon \|_F^2 \qquad (118)$$
(the second equality follows because $\Psi$ and $F$ are unitary). Since $K - K_\epsilon$ is a matrix whose only nonzero singular values are those singular values of $K$ that are smaller than $\sqrt\epsilon$ and are thus dropped in thresholding, and since the squared Frobenius norm is the sum of the squared singular values, $\epsilon_{\text{total}}$ as defined by (118) is equal to the sum of squares of the singular values of $K$ that are discarded in thresholding. Since $\mathbf S = K^\dagger K$, $\epsilon_{\text{total}}$ is the sum of the eigenvalues of $\mathbf S$ that are discarded in thresholding, as claimed in the theorem statement.
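The singular-value bookkeeping used at the end of the proof is easy to confirm numerically: the eigenvalues of $\mathbf S = K^\dagger K$ are the squared singular values of $K$, and the squared Frobenius distance to the truncated matrix equals the sum of the discarded eigenvalues. The example below is an illustrative sketch with a random matrix standing in for a Krylov matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))            # 4 "Krylov vectors" in an 8-dim space
S = K.T @ K                            # overlap (Gram) matrix
U_, s, Vt = np.linalg.svd(K, full_matrices=False)

# Eigenvalues of S are the squared singular values of K
assert np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(s**2))

# Truncate singular values below sqrt(eps); the squared Frobenius distance
# then equals the sum of the discarded eigenvalues of S
eps = np.median(s)**2
s_trunc = np.where(s**2 >= eps, s, 0.0)
K_eps = U_ @ np.diag(s_trunc) @ Vt
eps_total = np.linalg.norm(K - K_eps, 'fro')**2
assert np.isclose(eps_total, np.sum(s[s**2 < eps]**2))
```

This is exactly the identification of $\epsilon_{\text{total}}$ with discarded spectral weight used in the theorem statement.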

Lemma B.2 ([59], Theorem 4.1.11). Let $0 < a < b$ and let $\Pi^*_d$ be the space of residual polynomials (polynomials whose value at 0 is 1) of degree at most $d$. The solution to
$$\min_{p \in \Pi^*_d} \max_{x \in [a, b]} |p(x)|$$
is
$$p^*(x) = T_d\!\left(\frac{b + a - 2x}{b - a}\right) \Big/\; T_d\!\left(\frac{b + a}{b - a}\right),$$
and the corresponding minimal value is
$$\max_{x \in [a, b]} |p^*(x)| = 1 \Big/\; T_d\!\left(\frac{b + a}{b - a}\right).$$
Note that the resulting upper bound is only tight when it is much less than 1: bounding $T_d$ from below, with equality at either end of the domain (123), gives the exponential upper bound on the minimal value, where the last step follows after some straightforward but tedious algebra.
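The two defining properties used above, namely that $p^*$ is a residual polynomial and that its maximum over $[a, b]$ equals $1/T_d\!\left(\frac{b+a}{b-a}\right)$, can be checked numerically. The helper below is an illustrative sketch (our own naming).

```python
import numpy as np

def T(d, y):
    """Chebyshev polynomial T_d, valid also for |y| > 1 via the cosh form."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    inside = np.abs(y) <= 1
    out[inside] = np.cos(d * np.arccos(y[inside]))
    out[~inside] = (np.sign(y[~inside]) ** d
                    * np.cosh(d * np.arccosh(np.abs(y[~inside]))))
    return out

a, b, d = 0.01, 1.0, 8
scale = T(d, np.array([(b + a) / (b - a)]))[0]
p_star = lambda x: T(d, (b + a - 2 * x) / (b - a)) / scale

xs = np.linspace(a, b, 20001)
assert abs(p_star(np.array([0.0]))[0] - 1.0) < 1e-12      # p*(0) = 1
assert abs(np.max(np.abs(p_star(xs))) - 1 / scale) < 1e-9  # minimax value
```

The rapid growth of $T_d$ outside $[-1, 1]$ is what makes the minimal value, and hence the Krylov energy error, decay exponentially in the degree.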

C Block encoding based on binary representation of Pauli operators

C.1 Block encoding components
Recall that to implement a block encoding, we need a preparation procedure for the state $|G\rangle_a$ and an implementation of the unitary $U$. Suppose we let the auxiliary register encode each term by $(\vec x, \vec z)$, the binary representation of $P_i$ [60,61]. In this representation, both $\vec x$ and $\vec z$ are binary vectors, where a 1 in the $j$th position in $\vec x$ indicates that the Pauli operator contains an $X$ on qubit $j$, and a 1 in the $j$th position in $\vec z$ indicates that the Pauli operator contains a $Z$ on qubit $j$ (so the presence of both indicates $Y$). In other words, the Pauli operator that corresponds to $(\vec x, \vec z)$ is
$$P_{(\vec x, \vec z)} = i^{\vec x \cdot \vec z} \prod_j X_j^{x_j} Z_j^{z_j}$$
(the $i$ in the above expression is the imaginary unit, not the index of the Pauli operator). Hence, this block encoding requires $2n$ auxiliary qubits for $n$ system qubits, so it triples the qubit requirement compared to the system alone. In exchange, the gate costs are substantially lower, depending on the specific Hamiltonian to be encoded, so this block encoding option may be advantageous on near-term quantum computers whose noisy gates make shorter depth and fewer controls a higher priority than fewer qubits. We can now express $U$ as in (130). Inserting the binary representation, we can factor this as in (131), so $U$ can be implemented as only three layers of singly-controlled operations: a first layer that applies $Z_j$ to each qubit controlled on the corresponding auxiliary qubit encoding $z_j$, a second layer that applies $X_j$ to each qubit controlled on the corresponding auxiliary qubit encoding $x_j$, and a third layer that implements the two-qubit controlled phase on each pair of auxiliary qubits $x_j, z_j$. A circuit diagram for $U$ is shown in Fig. 5. The complexity of preparing $|G\rangle_a$, where $(\vec x, \vec z)_i$ is the binary representation of the $i$th term in the Hamiltonian, depends strongly on the form of the Hamiltonian. For each term $P_i$ in the Hamiltonian, the locality $l$ of $P_i$ (i.e., the number of qubits it acts upon) determines the Hamming weight of $(\vec x, \vec z)_i$, since each of $\vec x$ and $\vec z$ can have Hamming weight at most $l$.
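The sign convention in the binary representation (the $i^{\vec x \cdot \vec z}$ phase that makes each encoded operator Hermitian) can be checked directly; the snippet below is an illustrative sketch with our own function name.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])

def pauli_from_xz(x, z):
    """P = i^(x.z) * prod_j X_j^{x_j} Z_j^{z_j}  (binary symplectic form)."""
    phase = 1j ** int(np.dot(x, z))
    factors = [np.linalg.matrix_power(X, xj) @ np.linalg.matrix_power(Z, zj)
               for xj, zj in zip(x, z)]
    return phase * reduce(np.kron, factors)

# x_j = z_j = 1 encodes Y on that qubit: XZ = -iY, and the i^(x.z) phase fixes it
assert np.allclose(pauli_from_xz([1], [1]), Y)
assert np.allclose(pauli_from_xz([1, 0], [1, 1]), np.kron(Y, Z))
assert np.allclose(pauli_from_xz([0, 1], [0, 0]), np.kron(I2, X))
```

Without the $i^{\vec x \cdot \vec z}$ phase, every qubit carrying both bits would contribute $XZ = -iY$ and the encoded operator would fail to be Hermitian.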
Hence, one case in which this encoding may be advantageous is when the input Pauli Hamiltonian is local, i.e., $l$ is constant. For example, the $l = 2$ case includes nearest-neighbor spin models. In this case, the Hamiltonian is composed of one- and two-qubit terms with coefficients $\alpha_1, \alpha_2$, where $\sigma_j$ denotes a single-qubit Pauli operator acting on qubit $j$. The corresponding terms in $|G\rangle_a$ will be as in (136), in which the only potentially nonzero bits $x_j, x_k, z_j, z_k$ are in positions $j$ and $k$ in their respective bitstrings. In order to prepare such a state, it is enough if we can prepare an arbitrary real superposition of computational basis states with Hamming weight one or two. For weight one, we can accomplish this by the sequence (137), where after each step we only show the newest state in the superposition, and $\mathrm{PSWAP}_{ij}(\theta)$ is the two-qubit partial swap gate defined in (138), acting on qubits $i$ and $j$. If we desire the coefficient $\beta_i \ge 0$ for the bitstring with the one in the $i$th position, then the angles in (137) should be chosen as in (139). This procedure requires only linear qubit connectivity.
We can prepare an arbitrary real superposition of weight-two bitstrings by the sequence (140), starting from $|000\ldots0\rangle$, where again the state shown after each step is only the latest state to be added to the superposition, and one can check that each controlled partial swap acts nontrivially only on the state obtained in the preceding step. The controlled partial swap $C_k\text{-PSWAP}_{ij}(\theta)$ is simply the partial swap defined in (138), controlled on qubit $k$. Hence, if $\beta_i$ is the desired coefficient of the $i$th bitstring obtained in the order in (140), the associated angle $\theta_i$ is again simply given by (139). The above procedure requires arbitrary qubit connectivity, but this can be changed to linear connectivity by adding $O(n)$ SWAPs per step in (140) for $n$ qubits.
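The weight-one procedure can be sanity-checked in the weight-one subspace, where each partial swap acts as a 2x2 Givens rotation. The convention below, and the recursive angle choice standing in for (139), are assumptions for illustration, since the exact gate definition (138) is given only in the paper.

```python
import numpy as np

def prepare_weight_one(beta):
    """Build the real superposition sum_i beta[i]|0..1..0> (one at site i) by a
    chain of adjacent Givens rotations acting in the weight-one subspace.
    Assumes beta is normalized with nonnegative entries, as in the text."""
    beta = np.asarray(beta, dtype=float)
    n = len(beta)
    amp = np.zeros(n)
    amp[0] = 1.0                              # the initial X_0 gives |100...0>
    for i in range(n - 1):
        r = amp[i]                            # amplitude still to distribute
        theta = 0.0 if r == 0 else np.arccos(np.clip(beta[i] / r, -1.0, 1.0))
        # Givens rotation on positions (i, i+1): fix beta[i], pass the rest on
        amp[i], amp[i + 1] = r * np.cos(theta), r * np.sin(theta)
    return amp

beta = np.sqrt(np.array([0.1, 0.2, 0.3, 0.4]))
assert np.allclose(prepare_weight_one(beta), beta)
```

Each rotation freezes one target amplitude and forwards the remaining norm down the chain, which is why only linear connectivity is needed.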
In general, to prepare states containing terms of the forms (136), we need to be able to prepare an arbitrary real superposition of terms of Hamming weight one and two. Let $\beta^{(1)}_i$ be the desired coefficients for the weight-one terms and $\beta^{(2)}_i$ be the desired coefficients for the weight-two terms. To prepare the desired state, first prepare the weight-one part as in (137), with the coefficient of the first bitstring enlarged to also carry the total weight destined for the weight-two part. The partial CNOT (142) then maps the $|100\ldots0\rangle$ term to a superposition of $|100\ldots0\rangle$ and $|110\ldots0\rangle$. Finally, apply the procedure (140) to prepare the weight-two part of the superposition, skipping the initial $X_0 X_1$ step. Since the C-PSWAPs do not change states of weight one, this will not disturb the already prepared weight-one part of the superposition, but will simply distribute the amplitude $\sqrt{\sum_i (\beta^{(2)}_i)^2}$ placed on $|110\ldots0\rangle$ over the other weight-two states, as desired.
The above procedures suffice to prepare any superposition of weight-one and weight-two bitstrings, which captures all terms of the form (136) corresponding to $X$, $Z$, $XX$, $ZZ$, or $XZ$ terms in the Hamiltonian. To include terms containing $Y$s, we need to obtain ones in the same locations in the $\vec x$ and $\vec z$ registers. This can be accomplished by doubly- and triply-controlled partial CNOTs, starting from states prepared as above and using a similar strategy to distribute amplitude.
The one remaining piece we require for the block encoding is the implementation of the signs of the terms. Recall that the unitary $U$ in (127) is supposed to implement the signed Pauli operators, so that the coefficients $\alpha_i$ in the Hamiltonian can be taken nonnegative. However, our implementation of $U$ above simply applied each Pauli without a sign. To include the signs, we can first assume without loss of generality that all single-qubit terms in the Hamiltonian are $Z$s with positive signs (this amounts to a choice of $|0\rangle$ and $|1\rangle$ states for each qubit). Hence, we only have to implement signs on the interaction terms.
If the only interaction terms have the form $XX$, $ZZ$, or $XZ$, then the sign for each term can be implemented as a controlled-$Z$ on the pair of auxiliary qubits that are one in the binary representation of the term. This requires at most $T$ controlled-$Z$ gates for $T$ terms. If terms containing $Y$s are also present, then these phases must be triply-controlled, i.e., four-qubit gates instead. A circuit diagram for $G$ is given in Fig. 6.

C.2 Reflection about block encoding state
The above completes the construction of the block encoding itself, but implementing qubitization as in Lemma 1 also requires reflections around $|G\rangle_a$. Our construction so far requires at most four-qubit gates, so we would prefer to avoid the strategy of simply inverting $G$, reflecting around $|0\rangle_a$, and then reapplying $G$, since this requires a phase controlled on the entire auxiliary register. However, in the case of spin models, we can use the fact (pointed out in the proof of Lemma 1) that we do not in fact require a complete reflection around $|G\rangle_a$, but only within the subspace containing states of the auxiliary qubits that can be reached by applying $U$ to $|G\rangle_a$.
$U$ does not change the states of the auxiliary qubits directly, since it is only controlled by them, although it does entangle them with the system qubits. In particular, $U$ does not change the Hamming weight of the auxiliary states. To reflect around $|G\rangle_a$, we must invert $G$, i.e., apply $G^\dagger$ to the auxiliary qubits. This operation can increase the Hamming weight of each of the registers $|\vec x\rangle$ and $|\vec z\rangle$ to at most three, which we can see as follows. Prior to applying $G^\dagger$, the Hamming weights of $|\vec x\rangle$ and $|\vec z\rangle$ are each at most two. The only operation that can increase the Hamming weight beyond this is the partial CNOT in (142), applied to qubits 0 and 1 in each register to divide amplitude between the weight-one and weight-two terms. When the inverse of this operation is applied to a state of weight two, if qubit 0 is in state $|1\rangle$ and qubit 1 is in state $|0\rangle$, while the other $|1\rangle$ is somewhere else, then one of the terms resulting from the partial CNOT will have weight three, with qubit 1 now contributing the third $|1\rangle$. However, the only other operation to be applied to this state that does not conserve Hamming weight is the initial $X_0$ in (137), which decreases the weight of the state. Of course, this $X_0$ may increase the weight of other computational basis states in the superposition, but none of those could have weight higher than two prior to its application, so overall it is impossible to end up with states of weight higher than three.
Given this, after inverting $G$, we only need to reflect around $|0\rangle_a$, the all-zeroes state of the auxiliary qubits, within the subspace spanned by states of Hamming weight up to three (for both registers $|\vec x\rangle$ and $|\vec z\rangle$). This allows us to avoid applying a phase controlled on the entire register. Instead, we can simply use two additional auxiliary qubits to count the ones in $|\vec x\rangle$ and another two additional auxiliary qubits to count the ones in $|\vec z\rangle$. For example, to count the ones in $|\vec x\rangle$, we prepare its two auxiliary qubits in the state $|w\rangle$, where $w$ is the Hamming weight of $|\vec x\rangle$ written as a binary number. Once we have done this for both $|\vec x\rangle$ and $|\vec z\rangle$, we can simply apply a $-1$ phase controlled on the resulting four auxiliary qubits being in the state $|0000\rangle$. After that, we uncompute the auxiliary qubits used for counting, then reapply $G$, and our reflection about $|G\rangle_a$ is complete. A circuit diagram for $R$ is given in Fig. 9.
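Since Hamming weights of at most three fit into two bits, the phase condition reduces to checking a four-bit counter register. The snippet below is a small illustrative check of that classical logic (not circuit code).

```python
from itertools import product

def counter_bits(bits):
    """Two-bit binary encoding of the Hamming weight (assumed <= 3)."""
    w = sum(bits)
    assert w <= 3                  # guaranteed by the argument in the text
    return (w >> 1, w & 1)

# The -1 phase fires iff both two-bit counters read zero,
# i.e. iff both registers are in the all-zeroes state.
for x in product([0, 1], repeat=3):
    for z in product([0, 1], repeat=3):
        phase_fires = counter_bits(x) == (0, 0) and counter_bits(z) == (0, 0)
        assert phase_fires == (sum(x) == 0 and sum(z) == 0)
```

This is why a single phase controlled on four counter qubits suffices in place of a phase controlled on all $2n$ auxiliary qubits.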

C.3 Analysis
Finally, we can analyze the costs of U , G, and R. For simplicity, we will do this for the case where the Hamiltonian contains only Z, ZZ, and XX terms, i.e., it is a Heisenberg XYZ model with the YY interaction set to zero. For U , we require two main steps: 1. We argued above that the cost prior to implementing the signs of the terms is three layers of singly-controlled operations as in (131), for a total of 3n CNOTs and CPHASEs.
2. To implement the signs of the terms, we require at most T CZs for T terms (one CZ per term with a negative sign).
Hence the total cost to implement U is 3n + T two-qubit gates.
For $G$, we also require two main steps, as outlined above: 1. For each of $|\vec x\rangle$ and $|\vec z\rangle$, prepare the weight-one part of the superposition as in (137). This requires a single NOT followed by $n - 1$ PSWAPs for each, so in total requires 2 NOTs and $2(n - 1)$ PSWAPs.
2. For each of $|\vec x\rangle$ and $|\vec z\rangle$, prepare the weight-two part of the superposition as in (140), replacing the first step (the double NOT) with the partial CNOT as in (142). This requires the partial CNOT followed by
$$\sum_{i=2}^{n-2} i = \frac{1}{2}n^2 - \frac{3}{2}n \qquad (147)$$
C-PSWAPs, so in total requires 2 PCNOTs and $n^2 - 3n$ C-PSWAPs.
We obtain the total cost for $G$ by adding up the above gate counts, but first we note that the three-qubit gates, which are all C-PSWAPs, can be implemented in place via four two-qubit gates each. Defining
$$\hat\mu_{jk} = \tfrac{1}{2}\left(Y_j X_k - X_j Y_k\right), \qquad (149)$$
where $j, k$ denote the qubits acted upon, we can express a partial swap as
$$\mathrm{PSWAP}_{jk}(\theta) = e^{i\theta \hat\mu_{jk}}. \qquad (150)$$
We also have $\hat\mu_{kl} Z_k = -Z_k \hat\mu_{kl}$, so
$$C_j\text{-}\mathrm{PSWAP}_{kl}(\theta) = e^{i\theta\hat\mu_{kl}/2} \cdot C_j Z_k \cdot e^{-i\theta\hat\mu_{kl}/2} \cdot C_j Z_k, \qquad (151)$$
a product of four two-qubit gates, as desired. Using this, the total cost for $G$ comes to $4n^2 - 10n$ two-qubit and 2 single-qubit gates (152) (for $n \ge 3$). Finally, implementing $R$ requires an application of $G^\dagger$ and an application of $G$, with the reflection about $|0\rangle_a$ in between. The cost of the latter is $4n$ Toffoli gates and $4n$ CNOTs (to compute and uncompute the weight of the state), plus the four-qubit CPHASE. As in the implementation of $G$, we can replace each three-qubit gate (Toffolis in this case) with four two-qubit gates, this time using an analogous generator $\hat\nu_{kl}$ acting on the controlled qubits. As above, we also have the corresponding anticommutation relation for $\hat\nu$. Therefore,
$$\mathrm{Toffoli}_{jkl} = e^{i\pi\hat\nu_{kl}/4} \cdot C_j Z_k \cdot e^{-i\pi\hat\nu_{kl}/4} \cdot C_j Z_k \cdot C_j S^\dagger_k, \qquad (157)$$
and the $C_j S^\dagger_k$ can be absorbed into the second $C_j Z_k$, yielding a sequence of four two-qubit gates, as desired. The four-qubit CPHASE can be implemented in place via fourteen two-qubit gates as follows. For the analogous generator $\hat\omega$ we have the anticommutation relation $\hat\omega_{jk} X_j = -X_j \hat\omega_{jk}$.
Hence,
$$\mathrm{CCCZ}_{jklm} = e^{i\pi\hat\omega_{lm}/4} \cdot \mathrm{Toffoli}_{jkl} \cdot e^{-i\pi\hat\omega_{lm}/4} \cdot \mathrm{Toffoli}_{jkl} \cdot \mathrm{CCS}^\dagger_{jkl},$$
where the Toffolis can be implemented via (157) and the $\mathrm{CCS}^\dagger$ can be implemented similarly, so that the total cost for the four-qubit CPHASE is fourteen two-qubit gates. Hence the total cost of $R$ is twice the cost of $G$, plus $4n$ Toffoli gates and $4n$ CNOTs, plus the four-qubit CPHASE, which comes to a total of $8n^2 + 14$ two-qubit gates (plus the single-qubit gates from the two applications of $G$). Implementing $R$ also requires six additional auxiliary qubits in total.
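The bookkeeping above can be tallied mechanically; the helper below is an illustrative sketch (the function names are ours) that checks the quoted closed-form total.

```python
# Two-qubit gate tallies from the text: cost_G from (152); cost_R counts
# two applications of G, 4n Toffolis (4 two-qubit gates each via (157)),
# 4n CNOTs, and one four-qubit CPHASE (14 two-qubit gates).
def cost_G(n):
    return 4 * n**2 - 10 * n

def cost_R(n):
    return 2 * cost_G(n) + 4 * n * 4 + 4 * n + 14

# The n-dependent pieces combine to the quoted closed form 8n^2 + 14
assert all(cost_R(n) == 8 * n**2 + 14 for n in range(3, 50))
```

The linear terms cancel exactly ($-20n + 16n + 4n = 0$), which is why the total for $R$ has no term linear in $n$.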