Qubitization of Arbitrary Basis Quantum Chemistry Leveraging Sparsity and Low Rank Factorization

Recent work has dramatically reduced the gate complexity required to quantum simulate chemistry by using linear combinations of unitaries based methods to exploit structure in the plane wave basis Coulomb operator. Here, we show that one can achieve similar scaling even for arbitrary basis sets (which can be hundreds of times more compact than plane waves) by using qubitized quantum walks in a fashion that takes advantage of structure in the Coulomb operator, either by directly exploiting sparseness, or via a low rank tensor factorization. We provide circuits for several variants of our algorithm (which all improve over the scaling of prior methods) including one with e O ( N 3 / 2 λ ) T complexity, where N is number of orbitals and λ is the 1-norm of the chemistry Hamiltonian. We deploy our algorithms to simulate the FeMoco molecule (relevant to Nitrogen ﬁxation) and obtain circuits requiring about seven hundred times less surface code spacetime volume than prior quantum algorithms for this system, despite us using a larger and more accurate active space. The points here correspond to progressively larger basis sets between STO-6G and cc-pV5Z (the ﬁrst point seems to have a slightly diﬀerent trend, likely because STO-6G uses very diﬀerent primitive functions than the correlation consistent basis sets). The linear regressions suggest λ T = O ( N 1 . 9 ) , λ V = O ( N 2 . 7 ) and λ W = O ( N 3 . 0 ) .


Introduction
Quantum computers were originally proposed as special purpose tools for efficiently modeling physical quantum mechanical systems [1]. Ever since then quantum simulation has been central to the study of quantum computing [2] while also regarded as one of its most promising applications. In recent years, progress in quantum hardware has led to great optimism for the field. However, a large gap remains between expectations for the technology and the expected value of the relatively few known applications that appear viable on even a small fault-tolerant quantum computer [3]. This disparity has underscored the importance of estimating and reducing the resources required to implement quantum algorithms within a fault-tolerant cost model.
The most widely studied and anticipated application of quantum simulation is chemistry [4]. Most work on this topic has focused on providing solutions to the electronic structure problem by using phase estimation to sample molecular eigenstates and estimate eigenvalues [5,6]. Even at small problem sizes of around a hundred qubits, efficient and accurate solutions to this problem could prove transformative for various fields of study and the development of technologies such as better batteries, pharmaceuticals and industrial catalysts.
In order to represent molecular systems on a quantum computer one usually discretizes the many-body wavefunction using a basis of single-particle functions referred to as orbitals. The vast majority of quantum chemistry calculations use either plane wave orbitals, or more elaborate orbitals that are commonly composed of linear combinations of Gaussians. Plane waves are often chosen in calculations of periodic materials and lead to highly structured Hamiltonians. The work of [7] showed that exploiting this structure leads to asymptotic advantages for quantum query model [42]. Our analysis of the phase estimation algorithm is nearly identical to that in [8], which realizes a proposal suggested in [22,23] based on qubitization [25]. We make heavy use of the unary iteration technique introduced in [8] (see also a similar idea in [43]) as well as the QROM based state preparation and coherent alias sampling technique that was originally developed in [8] and then improved to lower T gate complexity in [44]. Finally, a key aspect of our algorithm is to leverage the sparse nature of the Coulomb operator, using a low rank representation recently discussed in [36].
In the case where we limit the number of ancilla qubits used, mostly using the system qubits as "dirty" ancilla, our algorithm can obtain chemical accuracy for FeMoco with about 2 × 10 13 Toffoli gates, using the active spaces of either Reiher, Wiebe, Svore, Wecker, and Troyer (RWSWT) [29] or Li, Li, Dattani, Umrigar, and Chan (LLDUC) [45]. If we allow a large number of ancilla then the number of Toffoli gates achieved with our most efficient approach is about 2 × 10 11 for the RWSWT orbitals, or 8 × 10 10 for the LLDUC orbitals. Throughout we focus on complexities in terms of Toffoli counts, because the non-Clifford gates we use are exclusively Toffolis. The cost in terms of T gates will be four times as large but since we are bottlenecked by Toffolis we can directly distill Toffoli states, which is possible with roughly the same cost as distilling two magic states for T gates [37]. Although we improve upon the distillation spacetime volume required by [29], at 10 −3 error rates we still require about three million qubitweeks of state distillation, which improves over previous results by roughly a factor of seven hundred, but is still substantial.
The paper is organized as follows. In Section 2 we review how it is possible to truncate the Coulomb operator to low rank, and establish notation. In Section 3 we describe the Hamiltonian as a linear combination of unitaries, and how to perform the state preparation and controlled unitary operations. We give calculations of the complexities obtained with the low rank truncation for FeMoco in Section 4. In Section 5 and Section 6 we discuss techniques that can be used to further lower the cost of qubitization based quantum chemistry simulations by leveraging unstructured sparsity that may exist in the Coulomb operator. We conclude in Section 7. In Appendix A, Appendix B, and Appendix C we discuss technical details relating to how the qubitization state preparation oracle is implemented. In Appendix D we discuss the scaling of the λ parameter for more general chemical systems. In Appendix E we give the details for minor contributions to the cost, and in Appendix H we give circuits and exact costings for arithmetic.

Low Rank Tensor Factorization of the Coulomb Operator
In this section we review representations of the Coulomb operator based on low rank tensor decompositions. These ideas have existed in some form in the classical electronic structure literature for decades [46][47][48][49][50][51], and were recently discussed in the context of Trotter based electronic structure simulations in [36].
We first define the electronic structure Hamiltonian in an arbitrary second quantized basis as where a † p and a p are fermionic creation and annihilation operators for spin-orbital φ p (r). The scalar coefficients h pq and h pqrs are the one-and two-electron integrals over the basis functions, h pqrs = dr 1 dr 2 φ * p (r 1 )φ * q (r 2 )V (r 1 , r 2 )φ r (r 2 )φ s (r 1 ), where U (r 1 ) and V (r 1 , r 2 ) are the nuclear and electron-electron potentials, respectively. In Eq. (2) we have implicitly defined T pq and V pqrs with respect to the usual integrals by rearranging the Coulomb operator in so-called "chemist notation" with a † a a † a instead of a † a † a a, and absorbing the factor of 1/2 into the coefficients. Reordering operators according to the fermionic anticommutation relations {a † p , a q } = δ pq also slightly changes the one-body coefficients. Assuming real basis functions (such as molecular orbitals), T pq and V pqrs are real and have the symmetries [52], which are important for properties of the tensor decompositions we will discuss. The rank-4 tensor V has N/2 elements along each axis and we can reshape V (e.g. using "numpy.reshape") into an N 2 /4 × N 2 /4 matrix W which has composite indices pq (representing the first electron) and rs (representing the second electron). This procedure is commonly referred to in the applied math literature as the matricization of a tensor. The W matrix is symmetric and positive semidefinite. It is important to emphasize here that we have focused on the spatial orbital representation of the two-electron integrals in the chemist ordering. In the physicist ordering, the resulting matrix has full rank, and no reduction of cost is possible. Similarly, introduction of fermionic symmetries induced in the full spin-orbital Hamiltonian into the coefficients (e.g. removing coefficients of terms like a † i a † j a k a k ) will destroy the required structure for efficient simulation. This is because we are exploiting the low rank nature of the underlying spatial Coulomb interaction through matrix factorization, and it is easy to lose this structure if one is not careful. We will diagonalize it as where g ( ) is the th eigenvector of W having size N 2 /4 and ω ≥ 0 is its associated eigenvalue.
Since we are taking V and hence W to be real, g ( ) will also be real. The rank of W is denoted L.
If W were of full rank then it would be the case that L = N 2 /4; however, the integrals that one encounters in molecular electronic structure applications contain considerable structure. As a consequence of this structure, it turns out that W is not full rank, and instead L ∈ O(N ). The physical basis for this result is the pairwise nature of the Hamiltonian interactions, arising from the Coulomb kernel in a real-space representation. This property is regularly exploited in classical approaches to electronic structure in techniques such as "density fitting" [47,48] which is commonly performed using a Cholesky decomposition [49][50][51] (which is similar to the diagonalization in Eq. (6) but is numerically more efficient and permits different left and right eigenvectors).
We use the notation g ( ) pq to denote the entry of g ( ) indexed by the composite index pq (the same composite index we used to flatten V into W ). The eigenvectors g ( ) inherit certain symmetry properties of V , with the result that g ( ) pq is symmetric between p and q. We can express the twoelectron operator in terms of g While common in electronic structure, this representation was first proposed in a quantum computing context in [15]; however, that work did not appear to appreciate the low rank aspect, which was first exploited for advantage in quantum computing in [36]. Whereas there are O(N 4 ) seemingly distinct coefficients on the left side of this equation, there are O(N 2 L) = O(N 3 ) distinct coefficients on the right side of the equation. Due to symmetry, the number of independent coefficients for each is N 2 /8 + N/4, giving a total number of L(N 2 /8 + N/4). The work of [36] also discussed further factorizations of the Hamiltonian based on the results of [53]. There, they showed that one can also diagonalize and truncate the matrix with elements g ( ) pq which turns out to have only O(log N ) significant eigenvalues in certain asymptotic limits. Furthermore, one can rotate into the basis where the operator is diagonal using O(N log N ) operations. One might think this would be useful for linear combinations of unitaries based quantum simulation, and with sufficient cleverness, that might be the case. However, we do not focus on that second factorization in this work because it adds many intricacies, is less well understood than the first factorization, and might not offer an asymptotic advantage in T complexity for technical reasons related to the improved scaling that comes from using the improved "QROAM" discussed later in this work.

LCU based simulation
A number of techniques for simulating Hamiltonian evolution are based upon the linear combination of unitaries approach [42,54]. This approach enables one to achieve a sum of unitaries that yields another unitary operation. Say the operation to perform is given in the form U = j w j U j , where w j are real and positive. First a control register is prepared in the state j w j /λ |j , where λ = j w j is needed for normalization. We call this preparation operation "prepare". Then a controlled U j operation is performed on the target system, an operation we will call "select". The inverse prepare operation is performed, then if the control system is measured in the state |0 , then the operation U will have been applied to the target system. This operation only has probability 1/λ 2 of success, so to achieve U with unit success probability one can use ∼ λ steps of oblivious amplitude amplification.
This formalism was generalized by the block encoding, or "qubitization", formalism of [25], where one can take the Hamiltonian to be a linear combination of unitaries, and use quantum signal processing [55] to obtain Hamiltonian evolution. For quantum chemistry, we are typically interested in the eigenvalues of the Hamiltonian. In that case, instead of performing the Hamiltonian evolution, one can instead consider performing phase estimation on a single step from the qubitization formalism of [25]. This step corresponds to expressing the Hamiltonian as a linear combination of unitaries using two prepare operations and one select operation, as well as a reflection as one would do for oblivious amplitude amplification. The eigenvalues of this step will then be e ±i arccos(E k /λ) , where E k are the eigenvalues of the Hamiltonian. The complexity is fundamentally dependent on the quantity λ = j w j . The overall complexity will be proportional to λ multiplied by the complexity of the select and prepare operations.

The Hamiltonian as a linear combination of unitaries
We can map a † p a q + a † q a p and the number operator n p = a † p a p to qubits using the Jordan-Wigner transformation as where X, Y and Z are the Pauli operators, the subscripts indicate the qubits these operators act on, and A p ZA q is shorthand for A p Z p+1 · · · Z q−1 A q . Thus, our Hamiltonian can be represented as the linear combination of unitaries: The terms in the first line in this expression correspond to the one-body operator T , and the terms in the second line correspond to the factorized two-body operator. Here the ranges of the summations are the same as for H. Given the Hamiltonian expressed as a linear combination of unitaries, we can now give the expression for λ. In the following we will use λ T to refer to the sum of weights for the one-body term as in Eq. (2), and use λ W to refer to the sum of weights for Coulomb operator in its factorized form, as on the right-hand side of Eq. (7). We have λ = λ T + λ W , with Here the factors of 2 and 4 in front of the sums are due to summation over the up and down spins. By simulating the factorized Hamiltonian we are slightly increasing λ over what it would be if we used the Hamiltonian in its original form from Eq. (2). Denoting by λ V the sum of weights for the potential term in the form Eq. (2), one has However, we do not expect any difference between the asymptotic scaling of λ V and λ W with respect to N . Because these quantities directly scale the complexity of our approach, techniques for reducing the effective value of λ are potentially important. Perhaps the simplest idea to reduce λ might be to optimize it under rotations of the single-particle basis (and to accordingly rotate the initial state using the Givens rotation technique in [56]). Another example of such a technique is to modify the Hamiltonian by adding to it a linear combination of n-representability [57] equality constraints that have provably zero expectation value. This strategy was introduced and shown to be effective in Section V of [58]. Other techniques that might be useful for this include mean-field background subtraction [59] and the use of soft pseudopotentials [60]. However, one should make sure that these methods are applied in a way that does not increase the rank of the Coulomb operator, which would be counterproductive for the overall complexity. Note finally that interaction picture techniques [10] do not appear to be helpful here because while λ V λ T , the V operator cannot be fast-forwarded.
In order to perform phase estimation via the qubitization/LCU approach, we need to be able to perform the state preparation prepare and controlled unitaries select. The techniques to achieve these operations are described in the following subsections.

State preparation
The state we would need to prepare is where λ is as defined in Eq. (10), |+ = (|0 + |1 )/ √ 2, α and β are spin labels, and θ ( ) pq are used to obtain the correct signs on the terms. They are given by The first register gives , with |0 reserved for the first term. The first term gives the first two terms in Eq. (9), with the first term in Eq. (9) obtained for p = q and the second term obtained for p = q. The |+ state on the second qubit selects between 1 and Z p,σ for p = q. We use p < q and p > q to select between X p,σ ZX q,σ and Y p,σ ZY q,σ for p = q, in a similar way as in [8]. The second term gives the third term in Eq. (9), with the sums over p, q and r, s yielding the square.
The two |+ states in registers 2 and 3 select between 1 and Z p,α for p = q and between 1 and Z r,β for r = s.
There are (L + 1)(N 2 /8 + N/4) = O(N 3 ) unique coefficients, which indicates the state preparation can be performed with similar complexity. Using the QROM and subsampling techniques from [8], the T complexity can be expected to be O(N 3 + log(1/ )), where is the required precision. By using a more sophisticated preparation scheme it will be possible to reduce the number of T gates, as will be described below.
To perform the state preparation, we start with all registers in the |0 state, then perform the following steps.
1. Prepare a superposition over the first register, to produce the state 2. Perform a Hadamard on the second register, and a Hadamard on the third register controlled on the state of the first register being | for > 0, giving 3. Prepare a superposition over register six controlled on the first register. For the first register in the state |0 , we prepare weights |T pq |, and for | with > 0 we prepare weights proportional to |g ( ) pq |. The state is then We will allow total error . Because there are a number of steps, each step will have an allowable error some fraction of . Here we aim to estimate the leading-order term in the complexity, and the allowable error will only appear in logarithms, so we will simply give log(1/ ), rather than subdividing the allowable error between the different steps. Throughout we will use log to indicate logarithms to base 2.
For the state preparation in step 1, the approach in [8] gives complexity in terms of Toffolis L + O(log(1/ )). The complexity in step 1 will be negligible compared to the complexity of later steps. The second step is just controlled operations on two qubits, and has negligible complexity compared to the other steps.
Steps 3 and 4 have the greatest complexity. A simple method is to use the unary iteration procedure as described in [8] (Section IIIA) combined with the state preparation procedure in [8] (Section IIID). The unary iteration procedure allows us to progressively perform an operation controlled on a register being |0 , then |1 , and so forth, with the overall complexity of the control in terms of the number of Toffoli gates being L (since there are here L + 1 possible values). That unary iteration procedure is performed on the first register, and for each value of this register, the state preparation is performed on the sixth register (for step 3) or the seventh register (for step 4).
There is a subtlety in that the values of p and q range over N/2 values, which need not be a power of 2. The result from [8] is for contiguous binary values. In the case where there are two subregisters, then we can iterate through the register for p, and use its output qubit as the control for the unary iteration over the register for q. The complexity for iterating over p is N/2 − 1, and for each of the N/2 values of p the complexity of iterating over q is N/2 − 1. That gives a total complexity for iterating over p and q that is N 2 /4 − 1.
As a result the complexity of the state preparation is N 2 /4 + O(log(1/ )) for each in both steps 3 and 4. In each case the complexity is N/2 + O(log(1/ )) for zero on the first register for step 3 only. The total complexity is then N/2 + LN 2 /2 + O(L log(1/ )). We can significantly reduce the complexity using three techniques.
A) Take advantage of the symmetry of g C) Use the QROM of [44] which allows one to further reduce Toffoli complexity at the cost of extra ancilla.
To take advantage of the symmetry, we can initially prepare a state proportional to Then we have this state tensored with a register in a |+ state. For preparation on register six, the |+ state is on register two (step 3), or for preparation on register seven the |+ state is on register three (step 4). We can then perform a swap between the registers storing p and q controlled by this register, giving state This gives the correct weighting for each of the terms in the superposition. As always with these state preparations for LCU, the prepared state is permitted to be entangled with junk registers. For p = q the additional ancilla may be regarded as a junk register, whereas for p = q this register will be used to distinguish between 1 and Z p,α operations. The To explain technique B for reducing the complexity, the state preparation is performed in the following way.
(i) Create an equal superposition over j for the register where we are performing the state preparation.
(iii) Perform an inequality test between the probability register and an ancilla in an equal superposition state.
(iv) Perform a controlled swap between the register where we are performing the state preparation and the alternate index register based on the result of the inequality test.
We also need to create superpositions over the spin registers, but that can be done trivially with Hadamards. If we were to iterate through and perform state preparation for each value of , we would be performing the entire procedure for each value of . The insight here is to note that we can call the QROM for all , then perform the inequality test and controlled swap. That means we only need to perform the inequality test and controlled swap once, instead of L times. Technique C for reducing the complexity is the most significant. The dominant cost in the procedure is that of the QROM, which has cost of (2L + 1)(N 2 /8 + N/4) Toffolis if we use the procedure of [8]. That is, we need to output (L + 1)(N 2 /8 + N/4) or L(N 2 /8 + N/4) numbers in QROM, with outputs of size log(N 2 ) + log(1/ ) + O(1). Here log(N 2 ) is the size of the register for the alternate values, and the size of the register for the probability is log(1/ ) + O(1).
The complexity in terms of Toffolis can be reduced by using a more advanced QROM based on that of [44]. This QROM uses a combination of the QROM of [8] and a technique for trading between spatial complexity and gate complexity in a fashion that accomplishes something reminiscent of what authors aspired to demonstrate with the original concept of "QRAM" [61]. Thus, here we will refer to the more advanced QROM of [44] as "QROAM". In the following we will use d for the number of entries that we must look up using the QROM (here we have d ≈ L(N 2 /8 + N/4)), k for an arbitrary power of two, and M for the size of the output in qubits (here we have M = log(N 2 / ) + O(1)). Then the complexity for computing the QROM is d/k + M (k − 1), and for uncomputing the QROM is d/k + k (see Appendix C). Moreover, it is possible to choose the k for the uncompute to be different from that for the compute step. The number of additional ancillae needed is (k − 1)M . It is also possible to use ancillae that are already being used for some other purpose, called "dirty" ancillae. Using these dirty ancillae, the compute cost is over twice as much, 2 d/k + 4M (k − 1) (see Appendix A), and the uncompute cost is changed to 2 d/k + 4k. The compute and uncompute are used for the state preparation and inverse state preparation, so the combined cost is what needs to be considered.
The results we use here are improved slightly over those in [44]. The Toffoli count achieved in [44] is 2d/k + 8M k (from the last column and last row of Table II, after dividing by 4 to account for the fact they are counting T gates and also after substituting d, k and M for N , λ and b respectively). Our corresponding Toffoli count is 2d/k + 4M (k − 1). The factor of two improvement in the M k term is because we use a linear depth swapping network instead of a logarithmic depth swapping network. A logarithmic depth network requires spreading control qubits for parallel CSWAPs over many ancillae, but because the ancillae are dirty each CSWAP must be toggled-controlled which involves repeating the operation twice. The small improvement from M k to M (k − 1) in the ancilla count is due to using |+ states instead of a spare register in order to ensure the output is only toggled once. There is also an improvement from M k to M (k−1) in the Toffoli count, but that is due to a more careful accounting of the number of controlled swaps needed.
The most significant improvement we make is the application of measurement based uncomputation, as described in Appendix C, which removes the dependence on M in the complexity when uncomputing a lookup. The principle is similar to that used to reduce the Toffoli complexity of addition in [62]. Instead of just reversing the circuit for a table lookup, you can perform X measurements on the output qubits. Based on the measurement result you can perform a classically-conditioned phase fixup. This procedure also means that the ancillae used by the forward QROAM only need to be used temporarily, and can be erased after the QROAM and reused.
There is a subtlety in using these results in that the QROAM is for a single control register which can take a contiguous set of values. In contrast, here we have three registers with , p, and q. In this case we can simply convert to a single contiguous register for the iteration. We can compute a value for a single contiguous register s from , p, and q as The p(p + 1)/2 term takes account of the fact that we are preparing p and q only for p ≥ q. We can use QROAM directly on s with just an additional logarithmic overhead for the arithmetic. We will consider two cases. First is that where we attempt to minimize the cost in terms of Toffolis, but use a large number of ancillae. In that case, for the compute we can take k ≈ d/M , in which case the cost of the compute step is approximately 2 √ dM . For the uncompute, we can take k ≈ √ d, which gives an uncompute cost of approximately 2 √ d, for a total cost of the compute and uncompute of 2 We find that this approach needs a number of extra ancillae As L = O(N ) the complexity in terms of Toffolis is O(N 3/2 log(N/ )), and a similar number of ancillae are needed.
Alternatively, if we are attempting to minimize the number of additional ancillae needed, we can use "dirty" ancillae instead (ones that are not initialized to zero). Fortunately, we happen to have N dirty ancilla lying around because the system register is not acted upon while implementing the state preparation operation. Moreover, there are multiple steps of state preparation that are performed, and there are qubits that will be used in some steps of state preparation that can be used as dirty qubits for the other steps of state preparation. We will find that we can take the number of dirty qubits to be somewhat larger than N , but not a lot larger. Assuming the number of dirty qubits is about N , we can take k ≈ N/M . Then we would have compute cost . For the uncompute step we can take k ≈ N , giving cost approximately LN/4. In both cases the costs need to be multiplied by 2 to account for steps 3 and 4.
Finally we consider the cost of outputting |θ ( ) pq in register four and |θ ( ) rs in register five. This use of QROM can simply be combined with that in steps 3 and 4. For example, for step 3, when calling the QROM for the state preparation, output the value of θ ( ) pq , as well as that for the alternate values of p and q. Then, when doing the controlled swap, also swap these registers. There is a net increase in the size of the output of 2 qubits, and one extra Toffoli for the controlled swaps. This cost is negligible compared to the overall cost in steps 3 and 4.

Controlled unitaries
For the controlled unitaries (the select circuit) in the case of using only the first diagonalization, we need to implement the terms in the Hamiltonian as in Eq. (9). The general principle is that we do a pair of operations, each of which has X p ZX q and Y p ZY q for p = q, with the term selected by whether p or q is larger. For p = q we use an ancilla qubit to select between 1 and Z p . The way the state preparation is chosen, this can be performed in the same way for = 0 and > 0, because the ancillae will only select the identity operation. The operations we need are Note that the selected operations we need are similar to those in [8,63], and we can use a similar approach. The complexity is linear in N , and will therefore be smaller than the complexity of the state preparation. The technique is shown in Figure 1 for select 1 , and select 2 is equivalent. If p < q then the Y operation in Z . . . ZY acts on the same qubit as one of the Zs in the Z . . . ZX operation. As a result, the Y gets multiplied by Z and becomes ZY = −iX. Therefore the operation implemented is of the form −iX p,α Z . . . ZX q,α . If p > q then the X operation in Z . . . ZX acts on the same qubit as one of the Zs in the Z . . . ZY operation. Thus we have X times Z on that qubit giving XZ = −iY . Therefore the operation implemented is of the form −iY p Z . . . ZY q . If p = q then all the Zs cancel leaving only X p,α Y p,α = iZ p,α . Now note that the register q 1 is 0 for p > q and 1 for p < q. Before and after the ranged operations, we perform an inequality test between p and q, controlled on q 1 , with the result that an extra ancilla is in the state |1 unless p = q and q 1 = 0. That register is used as a control for the ranged operations, so if p = q and q 1 = 0 then the identity is performed. We then apply an S gate on this ancilla, with the result that the operations performed are X p,α Z . . . ZX q,α for p < q, Y p,α Z . . . ZY q,α for p > q, and −Z p,α for p = q and q 1 = 0. For p = q and q 1 = 0 this ancilla is zero, so the phase on the identity is unchanged. This yields the desired operations with the correct phases, and lastly the controlled Z on the θ register gives the (−1) θ1 phase factor. Figure 1: The circuit needed to perform a controlled select1 operation. We have omitted the registers this operation does not act upon for simplicity. The unitaries labeled as − → Z Aj apply the operation Z0 · · · Zj−1Aj to the target register, depending on the value from the input register, using the technique shown in Figure 9 of [8]. This operation can be achieved using an inequality test, followed by a ranged operation via the technique shown in Figure 8 of [8]. The controlled select2 operation is completely equivalent except with different control registers.

Complexity
Let us denote the upper bound on the error required for the eigenvalue estimation by ∆E. Then following [8] we find the complexity of the estimation is the cost of each LCU step times 2 m , where Moreover, the error that is allowable for the implementation of each LCU step is Some minor costs are as follows.
1. The cost of the controlled operation in Figure 1 is 2N Toffoli gates for the two controlled ranged operations, and 2 log N for the two inequality tests. These need to be done twice for a total of 4N + 4 log N .
2. In the state preparation we initially need to prepare superpositions over , p, q, r, s, with ≤ L, N/2 > p ≥ q, N/2 > r ≥ s. This can be achieved by creating an equal superposition over ranges that are powers of two using Hadamards, then flagging success using inequality tests. The number of Toffolis needed for these inequality tests will correspond to the number of qubits. Amplitude amplification can give the desired result with amplitude close to 1. The reflection in the amplitude amplification also needs a number of Toffolis corresponding to the number of qubits, so for m steps of amplitude amplification the number of Toffolis is 3m + 1 times the number of qubits. This needs to be multiplied by two because there is preparation and inverse preparation at each step.
3. The state preparation needs an inequality test, which has cost µ for µ bits of precision in the keep probability, and a cost corresponding to the number of qubits for the controlled swap. A controlled swap on a pair of qubits can be performed with a single Toffoli and CNOTs.
4. For the state preparation we also do controlled swaps of the p and q, as well as r and s registers. These two controlled swaps cost 2 log(N/2) Toffolis. 5. Computing the function in Eq. (20) requires multiplication by a constant, a regular multiplication, and three additions. The division by 2 can be achieved by trivially shifting the bits. The multiplication can be achieved with 2 log(N/2) 2 Toffolis.
In the remainder of this section we quantify the costs of the QROM needed for the state preparation, which is the main contributor to the complexity, then give the total cost.

RWSWT orbitals
The prominence of the RWSWT paper [29], which was the first work to rigorously estimate the T complexity of any quantum algorithm for chemistry makes it an important benchmark. Unfortunately, LLDUC [45] later argued that there were substantial problems with the orbitals chosen for the RWSWT paper. For the reasons discussed in the paper by LLDUC, we believe that future papers should compare against this work using only the LLDUC integrals. But in order to more easily compare with past work, here we analyze the complexity of simulating both RWSWT and LLDUC FeMoco active spaces. Note that at 152 spin-orbitals the LLDUC active space is also significantly larger than the 108 spin-orbital RWSWT active space.
Our approach is to choose L by observing the effect of truncation on two different efficient classical correlated approximate methods for molecular electronic structure: Configuration Interaction at the singles and doubles level (CISD) and Møller-Plesset perturbation theory to second order (MP2). For a review of both methods, see [52].
We perform CISD/MP2 on the exact Hamiltonian and then perform CISD/MP2 on the truncated Hamiltonian for various truncations and track the discrepancy. In Figure 2 we plot how the energies converge for both the RWSWT [29] and LLDUC [45] integrals. Specifically, we plot the correlation energy (the difference between the mean-field energy and the exact energy) for the   Top: difference ∆Ec between truncated and untruncated correlation energy, for the FeMoco cluster with MP2 and CISD methods (red, blue). The grey shaded region represents chemical accuracy; thus, for both active spaces we expect L = 200 is sufficient for our purposes. Bottom: λW as a function of L. For RWSWT [29], λT = 1,490 a.u., λV = 8,373 a.u. and the maximum value of λW is 34,552 a.u.; thus, we take λ = 36,042 a.u. For LLDUC [45], λT = 3,446 a.u., λV = 4,168 a.u. and the maximum value of λW is 20,746 a.u.; thus, we take λ = 24,192 a.u.
reason that the mean-field energy converges much faster than the correlation energy and so it is easier to see the trend this way. In general, CISD and MP2 are very different methods and we would expect truncation to affect them differently. However, both methods appear well converged, for both RWSWT and LLDUC integrals, by L = 200. This gives us confidence that the exact ground state energy of the truncated Hamiltonian would also be consistent with the exact ground state energy of the untruncated Hamiltonian by this point. Accordingly, we choose 200 as the rank of our Coulomb operator. Note that the RWSWT and LLDUC integrals have very different properties, and so we assume it is a coincidence that both converge around the same value of L.
For the integrals of RWSWT with this truncation we obtain λ = 36,042 a.u. (see Figure 2). For both active spaces, we will focus on obtaining the standard "chemical accuracy", corresponding to ∆E = 0.0016 a.u. [52]. For the integrals of RWSWT, that gives ≈ 1.7 × 10 −8 . The output size for the QROM is approximately log(N 2 / ) for N = 108, giving about 40. More specifically, the output size for the probabilities in the QROM is given by Eq. (36) in [8]. In that equation, only the first term is significant, giving That would give µ = 26 bits, except we have three steps of state preparation, which means the number of qubits for the probabilities needs to be increased by 2 to µ = 28. With N = 108 we need two registers of size log(N/2) = 6, as well as two single qubit registers for the θ values, for a total of M = 42 qubits for steps 3 and 4. As we take L = 200, for step 1 (preparing the superposition over the register), only 8 qubits are needed for , for a total of 36 qubits output. We will use the QROAM for steps 3 and 4, as these steps have the dominant complexity.

Dirty ancillae
If we are attempting to minimize the number of qubits used, it is convenient to combine steps 1 and 3. That is, we use a state preparation over , p, and q simultaneously, and output alt values for these three indices. Then there are only two steps of state preparation, and we can reduce the number of qubits for the keep probabilities to µ = 27. That means the state preparation has an output size of M 1 = 8 + 12 + 2 + 27 = 49. There will be M 2 = 41 qubits used for the output in step 4, because there is one fewer qubit for the keep probability. If we use dirty qubits, then in the first state preparation we can use the N = 108 system registers as well as the M 2 = 41 ancilla registers that will be used as output in the next step. We can therefore take k = 4, which uses (k −1)M 1 = 147 qubits, and fits within that size. Similarly, for the second state preparation, we are able to use the output registers from the first state preparation as dirty qubits. The cost of the QROAM compute would be 2 d/4 + 12M . For the uncompute, we can take k = 128, giving cost 2 d/k + 4k = 2 d/128 + 512.
Taking L = 200 gives d 1 = (L + 1)(N 2 /8 + N/4) = 298,485 for the first preparation, and Toffoli cost For the second state preparation d 2 = L(N 2 /8 + N/4) = 297,000, giving Toffoli cost The total is 309,154. The minor costs result in another 1,534 Toffolis, for a total of 310,688 (see Appendix E). We find that the number of qubits for the phase estimation is so we obtain an overall complexity (in terms of Toffolis) There is a total of 378 logical qubits needed (see Appendix E).

Large ancilla count
Alternatively, we can use a large number of ancilla qubits in an attempt to minimize the Toffoli count. In the compute step it is optimal to take k = 64, and in the uncompute step it is optimal to take k = 512. The combined complexity of compute and uncompute for each step is then In this case it is better to use steps 3 and 4 as described above, with a separate state preparation for in step 1. Since there are three steps of state preparation, we should take M = 42. The Toffoli complexity for step 1 is only 200 using normal QROM. With the d values given above, we obtain a complexity 8,405 for step 3 and 8,379 for step 4. The minor costs are increased to 1,594.
That gives an overall Toffoli complexity of Altogether there are 3,024 qubits used (see Appendix E).

LLDUC orbitals
An alternative active space for FeMoco was advocated for in [45]. This work argued that the active space Hamiltonian from RWSWT did not properly capture the electronic structure of FeMoco and was classically solvable. LLDUC introduced an alternative Hamiltonian for the FeMoco active space with N = 152 spin-orbitals. There it is found that λ T = 3,446 a.u. and λ W = 20,746 a.u., for a total of λ = 24,192 a.u. The smaller value of λ means that the number of qubits for the probabilities should be µ = 27 regardless of whether we merge steps 1 and 3 or not. Since N is larger than before, we now need one additional qubit for each of the orbital numbers, for a total of M = 43 qubits.
The cost of the second preparation is approximately The total of the minor costs is 1,818 for a total of 608,968. This time we find that log( √ 2πλ)/(2∆E) is very slightly larger than 25 It would be unreasonably inefficient to round up to m = 26. Instead we can allow very slightly less error in other parts of the algorithm (which does not affect the complexity significantly because the algorithm depends on that error logarithmically) and take m = 25. Then we get a total cost The total number of logical qubits is 437.

Large ancilla count
If we use a large number of ancilla qubits, we should again take the k sizes as 64 and 512 in the compute and uncompute steps, respectively. We again perform steps 1 and 3 separately for this approach. The M will be 43 for both steps 3 and 4. Then using d/64 + 63M + d/512 + 512 gives Toffoli costs 13,560 and 13,508 for steps 3 and 4, respectively. This time the additional step of preparation needs 200 Toffolis for QROM. The minor costs are increased to 1,872, for a total of 29,140. That gives an overall complexity in terms of Toffolis The total number of qubits is 3,143.
In summary, the Toffoli costs are as given in Table 1. In this table we have given approximate formulae including only the leading terms, and taken k = 4 for the QROM compute circuits.

Exploiting sparsity in the Coulomb operator
Next we provide a strategy for further reducing constant factors when qubitizing the non-factorized quantum chemistry Hamiltonians. It is also possible to apply this strategy to the factorized form, but the result is worse for the case of FeMoco, so we will not address it here. This strategy will likely reduce the T complexity in practical situations, but only by constant factors. We focus on the form of the Coulomb operator which has the truncated form The purpose of this truncation is to induce sparsity in the operators by removing near-zero elements. The idea of reducing quantum simulation costs by exploiting sparsity in the Coulomb operator was first explored in [64], but in the context of Trotter based methods. The idea of the approach here is to choose the value of c to be as large as possible while still leaving classical correlated approximations such as CISD within chemical accuracy (essentially the same procedure we used for choosing L shown in Figure 2). This is shown in Figure 3. We will define L pqrs . In general, we expect that L (c) ; which is to say that this truncation should not asymptotically change the sparsity of the operators. However, in practice we do expect to find that L (c) V < N 4 /16; which is to say that we do expect there to be additional sparsity in these operators. While the matter of exactly how much sparsity exists is highly system dependent, we might sometimes desire an algorithm that exploits this sparsity, even if it is highly unstructured.
It is possible to perform a state preparation that has cost dependent on the number of nonzero elements, at the cost of a slightly larger number of ancillae. Consider the state preparation of [8], which has the following steps.
1. Create an equal superposition over the system registers |j .
2. Use QROM indexed on the system registers to output alt values and keep values  The total number of ancillae used by this state preparation is 2 log d + 2µ + 1.
In order to perform the preparation in the sparse case, instead of iterating over the index register, we can output the index register in the same way as the alternate index register. That is, we have a register which iterates over the number of nonzero amplitudes in the state we aim to prepare. Denoting that number d, our steps are as follows.
1. Create an equal superposition over the register indexing the nonzero entries.
2. Use QROM indexed on this first register to output ind, alt, and keep values 3. Use another ancilla in an equal superposition over 2 µ values, and perform an inequality test between this register and the keep register. Based on the result of this inequality test, swap the contents of the ind and alt registers.
The desired state will then be produced in the second register where we output ind. The value ind j is simply the index for the jth nonzero amplitude. The correctness of this state preparation routine follows immediately from the correctness of the routine in [8], because after the QROM lookup the state in the registers excluding the first is equivalent to that in [8]. The Toffoli complexity of this modified state preparation procedure for sparse states depends on the number of nonzero entries rather than the dimension. For our application, the non-factorized form of the Hamiltonian with truncation of the Coulomb operator can be expressed using the Jordan-Wigner representation. Using Eq. (2) and the sym- where Here we have again used the symmetries T pq = T qp , V pqrs = V qprs = V pqsr . Using p < q versus p > q to distinguish between X p,σ ZX q,σ and Y p,σ ZY q,σ is a convenient way to perform the controlled operations as described in Figure 1. This is of the form of a linear combination of unitaries, and the state we need to prepare is of the form The first register is just a convenient replacement for the | register, and is used to distinguish between the one and two body terms. The second register is used to distinguish between 1 and Z p,α for p = q, and the third register is used to distinguish between 1 and Z r,β for r = s. The second, third, and fourth registers will be used to generate the symmetries of the state. The fifth register is used to give the appropriate sign of the T pq and V (c) pqrs weightings, which are now combined instead of having two separate registers as before.
We can again use symmetry to reduce the number of coefficients that need to be prepared. For V pqrs there is symmetry between exchanging p and q, between exchanging r and s, and between exchanging the p, q and r, s pairs, as described in Eq. (5). For T pq there is symmetry between p and q. For this reason we will initially only prepare amplitudes for p ≤ q, r ≤ s, and p ≤ r for V , and p ≤ q for T .
To simplify the description, we will introduce some notation, We will also allow ζ to be used with four subscripts, which means where the notation pq < rs indicates that either p < r or p = r and q < s, and pq = rs indicates that p = r and q = s. That is, if these are the composite indices for the matrix W , then pq < rs indicates the upper triangle, and pq = rs indicates the diagonal. In terms of this notation, the state produced will initially be |0 |+ |0 |0 Using ζ ensures that the number of nonzero terms is about 1/2 as much for T , and about 1/8 as much for V . These sparse entries can be prepared by the technique described above for sparse state preparation. For T the only terms prepared here are those where p ≤ q, and for V only where p ≤ q, r ≤ s, and pq ≤ rs. Next we perform three steps.
1. Swap the p, q and r, s registers controlled on the state of the fourth register.
2. Swap the p and q controlled on the state of the second register.
3. Swap the r and s controlled on the state of the third register.
To show the effect of this, we will first show it for the first component of the state, with 0 in the first register and weightings dependent on T . The first and third controlled swaps have no effect because third and fourth registers are zero.
where we have defined the state labelling Using similar notation we can describe the effect of the controlled swaps between p, q and r, s in a compact way.
The reasoning is exactly the same as for the T component of the state. The net effect is that we remove the ζ pqrs and the control register is now in the state |κ pqrs . The controlled swaps with p, q and r, s have similar effect, giving the state as in Eq. (48), except the second, third and fourth registers are not in |+ states. Nevertheless, for p = q the second register is in an equal superposition of |0 and |1 , so we can we use it to select between 1 and Z p,σ for the controlled operations. Similarly for r = s the third register is in an equal superposition of |0 and |1 . These features are sufficient to give the linear combination of unitaries required. Hence we can prepare the desired state with about 1/8 as many nonzero entries using the sparse preparation procedure, and controlled swaps to generate the entries that are identical due to symmetries. For the controlled operations that need to be performed, note that the form of the Hamiltonian without factorization has a † p,σ a q,σ and a † r,σ a s,σ operations that are mapped to the Pauli operators in exactly the same way as for the factorized form. This means that these controlled operations can be performed using exactly the same select 1 and select 2 operations as for the factorized form.
6 Complexity for sparse preparation 6

.1 RWSWT orbitals
Next we use the results of Section 5 to determine the complexity in the case of using a sparse preparation with the non-factorized Hamiltonian. For the integrals of RWSWT we have λ T = 1,490 a.u. and λ V = 8,373 a.u. for a total of λ = 9,863 a.u. The number of qubits for the phase estimation is and the output size for the keep probabilities for the QROAM is There are eight registers of size log(N/2) , because the sparse preparation scheme needs to output ind values and alt values of p, q, r, s. There are also 2 qubits needed for the two output values of θ (one for the ind and one for the alt values of p, q, r, s), as well as 2 qubits used for ind and alt values of the first register which distinguishes between T and V . As a result the QROAM output size is M = µ + 8 log(N/2) + 4. The number of orbitals is N = 108, giving log(N/2) = 6 and therefore M = 77.
Here we just consider the use of a large number of ancillae to minimize the Toffoli cost. The cost of the preparation is then and of the inverse preparation is Using a truncation threshold of 0.0002 a.u., we find that 3,300,568 nonzero values of V (c) pqrs are needed. From the symmetries, the number of unique nonzero values is about 435,023. Adding to that the N 2 /8 + N/4 = 1485 unique terms for T , we get 436,508. Then the optimal values are k 1 = 2 6 and k 2 = 2 9 , giving a preparation cost of 11,672 and an inverse preparation cost of 1,365, for a total of 13,037. There are 746 Toffolis for the minor costs (see Appendix E) giving a total of 13,783. That gives an overall complexity in terms of Toffolis The total number of logical qubits needed is 5,103.

LLDUC orbitals
The number of logical qubits used is 2,904. The Toffoli performance of this algorithm is the best of the alternatives we have so far considered. As this is the most promising for implementation, we consider a further improvement by better allocating the allowable error between the different parts of the algorithm. In this case In [8] the error is given in Eq. (23) which is equivalent to Here prep is the error allowed in the state preparation and QFT is the gate synthesis error allowed in the inverse quantum Fourier transform. The formula for m is obtained by equally allocating the allowable (squared) error between π/2 m+1 2 and ( prep + π QFT ) 2 . If we instead allow approximately 26% more error from the phase estimation, then it is possible to use m = 23 (which reduces the number of logical qubits to 2,903). The allowable error in QFT can be reduced to compensate, which has a negligible gate cost because it is not multiplied by 2 m . As a result, the complexity in terms of Toffolis then becomes 2 m × 9995 ≈ 8.4 × 10 10 .
This is a full order of magnitude improvement over the low rank factorization method, and three orders of magnitude improvement over the approach of [29].

Discussion
The cost of performing Toffoli gates may be estimated as follows. The efficient CCZ factory from [37] has a rectangular footprint of 12 × 6 logical qubit patches. The paper specifies a code distance of d = 31 which, in the rotated surface code, means each patch covers 2 · 31 2 ≈ 2,000 physical qubits. The factory outputs a CCZ state every 5.5d cycles which, assuming a surface code cycle time of 1 microsecond, is once every 170 microseconds. Thus, the spacetime volume of the factory is 2,000 · (12 × 6) physical qubits times 170 microseconds which equals roughly 24 qubitseconds. Every Toffoli we perform requires at least this much spacetime volume. With the same overhead one can use these techniques to produce two magic states.  Table 1: The leading order Toffoli and spatial complexities for the algorithms using dirty ancilla or allowing a large ancilla count. The formulae given are approximate, with the counts determined in a more exact way explained in the text. Here N is the number of spin-orbitals, L is the rank of the Coulomb operator, λ is the 1-norm of the Hamiltonian as defined in Eq. (10), ∆E is the target precision for phase estimation, M is the output size for the QROAM, µ is the number of qubits used for "keep" probabilities, and m is the number of qubits used for the phase estimation.
The lowest Toffoli count that we report in Table 1 is 8×10 10 . Combined with the 24 qubitsecond spacetime volume for distilling a CCZ state, the spacetime volume of the algorithm is about three megaqubitweeks. Three megaqubitweeks of spacetime volume means that if we use "only" three million physical qubits then the computation must run for at least a week; if we want the computation to finish in a day, we need at least 23 million qubits. We want computations to finish in less than a day, and we don't want to use 23 million qubits. This implies that, when attempting to move our algorithm into the regime of practical computations, we should focus on optimizations that reduce spacetime volume (such as exploiting symmetries in the Hamiltonian to reduce the size of QROM reads) instead of optimizations that convert spacetime volume (such as performing parallel distillation of states, which reduces time at the cost of space).
For the purpose of simulating arbitrary basis chemistry Hamiltonians, our approach is the best scaling and also the most practical shown to date. Despite us using a larger and more accurate active space, we have certainly improved over the 10 14 T gates required by [29]. Based on the numbers above, a lower bound on the spacetime volume of the [29] algorithm is roughly two gigaqubitweeks of distillation, which is about seven hundred times more than our approach. This ignores storage, routing, and Clifford operations, but this level of comparison seems appropriate considering that the large spacetime volume of state distillation is still likely to dominate the overall cost.
While our many ancilla algorithm has O(N 3/2 ) spatial complexity and would require a few thousand logical qubits to simulate FeMoco, at three megaqubitweeks of spacetime volume required just for state distillation, our algorithm is bottlenecked by the Toffoli complexity, and not by the logical qubit costs. Thus, despite the increased spatial costs, we regard the many qubit algorithm as more practical than the dirty ancilla algorithm. However, recent work [65] has suggested that by distinguishing between qubits that are, and are not, used for error detection, one can use a lower code distance for most of the logical qubits required for magic state distillation. While explicit factories realizing these improvements have not yet been laid out, these improvements could significantly reduce our estimates of the required spacetime volume for distillation, perhaps to less than a megaqubitweek.
One might wonder if there are other techniques that could be used to reduce the Toffoli count at the expense of extra ancilla. In our algorithm, the Toffolis are coming from our use of QROM. Unfortunately, a lower bound proven in [44] suggests that no further tradeoffs of this type will be possible that asymptotically reduce the Toffoli count of the QROM we are using. A more promising approach might be to utilize additional structure in order to reduce the effective number of coefficients that must be read from QROM. We presented one such strategy based on leveraging sparsity in the Coulomb operator, and another based on leveraging the rank deficiency of the Coulomb operator.
In some cases, sparsity in the Hamiltonian arises due to locality of interactions (especially for large systems) [64], but sparsity also arises for reasons having to do with the symmetry point groups of molecules in real space and the symmetry point groups of molecular orbitals in an active space (this would likely be the origin of sparsity in the FeMoco active space, for instance). It might also be possible to exploit these symmetries (which are embedded in the coefficient tensors) using techniques from group theory. Likewise, it might be possible to further exploit the low rankness of the Coulomb operator by using the second tensor factorization discussed in [36].
Another avenue for reducing the total cost would be to reduce the size of the λ parameter. For example, a technique described in Section V of [58] may help reduce the λ parameter by exploiting n-representability conditions. Yet another approach might be to incorporate meanfield background subtraction [59]. One could also explore options for selecting the active space differently. For instance, by using different orbitals that are more local, or more symmetric, one could potentially induce more sparsity in the Hamiltonian. Or, one could try to select active space orbitals with a goal of reducing λ. In both of these contexts, the use of pseudopotentials might help [60].
Of course, it might be the case that the best path forward uses a different algorithm or representation entirely. For instance, it is possible that Trotterization based on the Trotter step of [36] would provide a significant improvement over the Trotter results in [29]. In practice, Trotter errors (which determine the number of Trotter steps required) depend sensitively on the structure of the Trotter step, and this has not yet been analyzed for the approach of [36]. However, this seems like a promising direction. In that context one might also wonder if using higher-order Trotter formulae could further bring down the costs, as observed for simulating simpler models in [43].
Given the high overhead that appear to be required for simulating FeMoco in a molecular orbital basis, we might wonder how practical it would be to perform the simulation in the plane wave basis. No approach based on second quantization is sensible here because one would need perhaps N = 10 6 plane waves to simulate a FeMoco active space to sufficiently high accuracy. Nevertheless, the first quantized algorithm of [11] continues to look competitive. While the constant factors still need to be worked out, that approach scales as O(N 1/3 η 8/3 /∆E) where the number of electrons is η. If N = 10 6 , η = 54 (as in the RWSWT active space) and ∆E = 0.0016 a.u., the quantity N 1/3 η 8/3 /∆E is roughly 10 9 . Thus, if further symmetries can be exploited in this simulation to further reduce costs (given that there will be other overheads), we expect this might be a viable way forward; however, more work is clearly needed.

A Cost of computing table lookups assisted by dirty ancillae
In [44] it is explained how to perform an efficient table lookup (a QROAM read) assisted by dirty ancillae. Here "dirty" ancillae are used to mean ancillae that need not be initialized to some known state before the procedure, and are returned to their initial state at the end. We provide an equivalent technique, and prove the following result. Proof. The technique proceeds as follows (see Figure 4).
• Allocate a register r 0 with M clean qubits in the |+ state. This register will ultimately store the output.
• Let l be the superposed integer value of the bottom log k qubits in the address register. Using a series of M (k − 1) controlled swaps, permute the registers r 0 , ..., r k−1 such that r 0 ends up where r l was. The other registers can be permuted in any order. Call this swapping procedure S.
• Let h be the superposed integer value of the top log(d/k) qubits of the address register. Perform a table lookup with address h targeting the r 0 , ..., r k−1 registers. The data for register r l at address h is equal to the data from the original table at address h · k + l. In effect, this is reading many possible outputs at once. Call this lookup process T .
• Perform the inverse of the swapping procedure S.
• Because the qubits in r 0 were in the |+ state, they were not affected by T (which only targeted them with controlled bit flips). Apply Hadamard operations to r 0 so that it will be affected by the next T , which will also uncompute the dirt XOR'ed into the other registers.
• Perform S then T then S −1 again.
• r 0 is now storing the output. The other registers are restored. Return the borrowed r 1 , . . . , r k−1 registers.
The swapping subroutine S has a Toffoli count of M (k − 1). The table lookup subroutine T has a Toffoli count of d/k . We compute/uncompute S four times and perform T twice. Therefore the total Toffoli count is 2 d/k + 4M (k − 1). The space cost of the procedure is (k − 1)M dirty ancillae for workspace and log(d/k) clean ancillae hidden in the implementation of T .
Note that the transformation also uses M qubits to store the output. The value of k that minimizes the Toffoli count is approximately 2d/M . In practice the number of available dirty qubits often bounds k to be a much smaller value. The value of k must be greater than 2 in order to have a Toffoli count lower than a standard lookup. Figure 4: The sequence of operations used for computing table lookups with dirty ancillae. The output register is r0, and the registers r1 to r k−1 are the dirty ancillae. Note that the scheme in [44] (see Fig. 1(d)) uses "Swap" for the operation we here call S, and "Select" for the operation we here call T . The sequence of operations we use here is different than in [44], resulting in a reduction in the number of registers needed.

B Cost of computing table lookups assisted by clean ancillae
When clean ancillae are available, the optimization from Appendix A can be performed more efficiently. The result is as follows.

Theorem 2.
Given a function f : Z d → Z M 2 and k a power of 2 satisfying 1 < k < d, it is possible to apply the transformation using d/k + M (k − 1) Toffoli gates and (k − 1)M + log(d/k) clean ancillae.
Proof. The steps are as follows.
• Let h be the superposed integer value of the top log(d/k) qubits of the address register. Perform a table lookup with address h targeting the r 0 , . . . , r k−1 registers. The data for register r l at address h is equal to the data from the original table at address h · k + l. In effect, this is reading many possible outputs at once.
• Let l be the superposed integer value of the bottom log k qubits in the address register. Using a series of M k controlled swaps, permute the registers r 0 , . . . , r k−1 such that r l ends up where r 0 was. The other registers can be permuted in any order.
• r 0 is now storing the output.
• Note that every computational basis value of the address register results in a specific computational basis value for registers r 0 , ..., r k−1 at this point. We have performed the equivalent of table lookup targeting r 0 , ..., r k−1 . We can use the uncomputation strategy from Appendix C to uncompute this effective lookup. The first thing done by that strategy is to measure all output qubits in the X basis. We will not be using any of the qubits from registers r 1 , ..., r k−1 until that point so they can be measured now instead of later. This frees the clean ancillae for other uses. Keep the measurement results so they can be used by the uncomputation process.
The swapping subroutine of this algorithm has a Toffoli count of M (k − 1). The table lookup subroutine has a Toffoli count of d/k . We perform each exactly once, therefore the total Toffoli count is d/k + M (k − 1). The space cost of the procedure is (k − 1)M clean ancillae for workspace and log(d/k) clean ancillae hidden in the implementation of the table lookup subroutine.
Again the transformation uses M qubits to store the output. The value of k that minimizes the Toffoli count is approximately d/M . In practice the number of available clean qubits may bound k to be a much smaller value.

C Efficient uncomputation of table lookups using measurement based uncomputation
Next we consider the cost of uncomputing a table lookup, reversing the procedure described in the preceding two appendices. The result is as follows. Proof. Whenever a quantum circuit uncomputes a qubit q by performing a series of unitary operations that result in the qubit being in the |0 state, the circuit can be optimized by application of the deferred measurement principle. For example, suppose the last operation involving q is a CNOT targeting q. Then the circuit can be optimized by measuring q in the X basis before the CNOT, then replacing CNOT with a Z gate conditioned on the measurement result. This is an optimization because, in the surface code, Z gates require no spacetime volume. This general pattern of "take an X-axis interaction that clears a qubit and use an X basis measurement and a classically-conditioned phase fixup operation instead" applies to many constructions, including table lookups. Let's consider the nature of the phase fixup task we must perform when we eagerly measure the output qubits of a table lookup in the X basis.
We can think of each entry in the table as corresponding to a multi-control multi-target CNOT operation. There is one control (or anti-control) for each address qubit, and one target (or skip) for each output qubit. Because we will be measuring the output qubits, it is helpful to flip our perspective and think of the output qubits as the controls and the address qubits as the targets. For example, suppose the entry at address 2 of the table is the bit string 10011000. This means that the table lookup must toggle the qubits at offset 0, 3, and 4 of the target register conditioned on the address register storing 2. Or, equivalently, this means that the table lookup must negate the amplitude of the |2 state of the address register, conditioned on the X 0 · X 3 · X 4 observable of the qubits in the target register. Because we measured the output qubits in the X basis, we know the value of the X 0 · X 3 · X 4 observable and thus know whether or not we need to negate the amplitude of the |2 state of the address register. In fact, using the observables we measured, we can figure out for every state |j of the address register whether or not |j needs to be negated. Performing the necessary negations uncomputes the table lookup.
Let S be the set of all address register states that need to have their amplitude negated. This will be some arbitrary subset of the states from |0 to |d − 1 . We now transform the task of negating all the states in S into the task of performing a table lookup of size d/2 with output size 2. Let q be the least significant qubit of the address register, and u be a clean ancilla qubit in the |1 state. Apply a CNOT from q onto u, then apply a Hadamard transform to each. Now define a "fixup table" F with entries F j each specifying two output bits defined by S: After preparing q and u, perform a table lookup from F onto q and u. The address of the lookup is all of the qubits of the address register, except for the least significant qubit. Because of how we defined F and how we prepared q and u, this negates the phase of all states from S. This uncomputes the table lookup, and we finish the uncomputation by uncomputing the preparation of q and u. See Figure 5 for a quantum circuit showing an overview of the process.
This technique implements the uncomputation of a table lookup with address size d and output size M in terms of computing a table lookup with address size d/2 and output size 2. Reducing the address and output size increase the effectiveness of the techniques explained in Appendix B and Appendix A, since a larger value of k can be used.
By combining this technique and Appendix B, if k additional clean ancillae are available, the Toffoli cost of uncomputing a table lookup with address size d and output size M can be reduced to d/k + k. The procedure to use is shown in Figure 6, where r 0 is initially |0 , which is flipped to |1 by the X gate. The controlled swap S shifts that |1 to position l, giving a one-hot unary encoding of the value in the bottom log k qubits in the address register. The Hadamards and T then yield the correct phase factor. The reverse controlled swap then erases the unary encoding. The first controlled swap has cost k − 1, with the cost reduced as compared to the table lookup because the outputs are single qubits instead of M -qubit registers. The T has cost d/k , and the final controlled swap may be performed with Clifford gates. The reason is that a unary encoding may be erased using measurements and Clifford gates. There are k ancillae used for the unary encoding, and as before there are log(d/k) ancillae needed for the implementation of T .
Combining this technique with Appendix A instead, if k additional dirty ancillae are available, the Toffoli cost of uncomputing a table lookup with address size d and output size M can be reduced to 2 d/k + 4k. The procedure to use is shown in Figure 7, which is a slight modification over that for the table lookup. As before r 0 is a clean register, and r 1 to r k−1 can be dirty registers, though now they need just be qubits. The first S, T , and S † have no effect on the target qubit r 0 . Then the Z gate changes the state of this qubit to (|0 − |1 )/ √ 2, so the next T yields the correct phase factor. At the end the Z and H return the state of r 0 to |0 . There are k − 1 dirty ancillae used, one clean ancilla for r 0 , and log(d/k) clean ancillae needed for the implementation of T .
In the case of clean ancillae, the Toffoli cost is minimized for k ≈ √ d, where the Toffoli count is 2 √ d. For dirty ancillae, the cost is minimized at k ≈ d/2, where the Toffoli count is √ 32d.
To explain in more detail how to erase the unary register using Clifford gates, see Figure 8 to Figure 10. Figure 8 shows how to map binary to unary using controlled swaps. The sequence of controlled swaps is the same as in [44]. If we first flip the top qubit of the unary register to 1, the controlled swap network moves the 1 to the correct position. To uncompute the unary register, we do this process in reverse. We expand each controlled swap into a CNOT-Toffoli-CNOT construction, as shown in Figure 9. The construction is considerably simplified because for the second CNOT in each group the control is known to be in the |0 state, so the CNOT can be omitted entirely. Furthermore, when the output of a Toffoli is known to be zero we can use the uncomputation trick instead, leaving us with a circuit using no non-Clifford gates, as shown in Figure 10. Figure 9: The inverse of the circuit for mapping binary to unary, with the controlled swaps implemented using CNOTs and Toffolis. Figure 10: The circuit in Figure 9 with the Toffolis replaced with measurements and controlled operations, and the second CNOT in each group omitted.

D The scaling λ in general contexts
Here we briefly discuss how we expect λ might scale for more general systems, beyond the FeMoco system considered in the rest of this work. In the plane wave basis, one can obtain a clean bound of λ = O(N 2 ) [7]. One might roughly expect similar scaling in a more general context, based on intuition about electrons interacting pairwise which would imply the spectral norm of the Hamiltonian should go roughly quadratically in N . However, for arbitrary basis sets the relationships between λ, basis size, basis type, molecular structure, etc. are very complex and difficult to rigorously bound. Here we use numerics to provide insight into how λ scales. In Figure 11(a) we show how λ scales for a chain of Hydrogen atoms when basis resolution is fixed and the system grows towards the thermodynamic (large system size) limit (e.g. a chain with many Hydrogen atoms). We focus on an atomic chain because chains approach their thermodynamic limit much faster than other configurations of atoms. We see in all of our numerics that λ T ≤ λ V ≤ λ W . For the hydrogen chains we see that λ V = O(N 2.2 ) and λ W = O(N 2.5 ). It is interesting that λ W has slightly asymptotically worse scaling than λ V ; however, the difference is far less than the factor of N that we save by performing the low rank truncation (resulting in us scaling like λ W instead of like λ V ).
Slightly different behavior occurs when we fix the system (in this case the H 4 system) and grow towards the continuum limit (the limit of an arbitrarily large basis), shown in Figure 11(b). Here, we see that λ V = O(N 2.7 ) and λ W = O(N 3 ). Overall the scaling of λ is a bit worse but λ W /λ V still scales as roughly O(N 0.3 ).
In both of these numerical experiments we find that λ is scaling worse than Ω(N 1.5 ), which is the condition we require for our O(N 3/2 λ) scaling to be better than the O(λ 2 ) scaling of [27]. Still, it is clear that more research is required to fully understand the relationship between λ and molecular structure, and how the size of λ might be reduced.

E Detailed costings
Here we give the details for the minor Toffoli costs and the numbers of logical qubits used.

E.1 RWSWT orbitals
For the case where the low rank factorization approach is used, and the small number of dirty qubits is used, the minor costs are as follows.
2. For the initial equal superposition state preparation, we have 8 qubits for , and 6 qubits for each of the p, q, r, s registers for a total of 32. To obtain a final amplitude close to one it is convenient to also prepare an ancilla in an equal superposition of 15 out of 16 basis states (on 4 qubits). Then the number of qubits is increased to 36, and two steps of amplitude amplification takes 454 Toffolis (see Appendix F). The final amplitude is close enough to one that it does not affect the complexity at the precision we are giving.
3. The inequality tests and controlled swaps for the state preparation need to be done twice for each of the two preparations, because there is a cost to prepare and a cost to unprepare. We are preparing 21 qubits in the first step, and 13 qubits in the second step. The total cost is therefore 2(2µ + 21 + 13) = 176.
4. The controlled swaps used for the symmetries have a cost of 4 log(N/2) = 24 Toffolis, taking into account preparation and inverse preparation.

5.
A careful accounting of the number of Toffolis needed to evaluate s in Eq. (20) gives 105 (see Appendix G). We incur this cost four times, because we have two state preparations and two inverse preparations. The total is therefore 4 × 105 = 420.
The total of these costs is 1,534. The numbers of qubits used are as follows. Adding all these qubit counts together gives 378.
Next we consider the case where the large number of ancilla qubits is used. The minor costs are the same as before, except now there are three inequality tests and controlled swaps needed for the state preparation, and µ is increased to 28. In preparing the superposition over , the size of the registers acted upon by the controlled swap is log L = 8. In preparing the superposition over p and q, there are two registers of size log(N/2) = 6, as well as the qubit register storing θ ( ) pq , for a total of 13. The number of qubits in preparing the superposition over r and s is the same. Therefore the Toffoli cost of the inequality tests and controlled swaps is increased from 176 to 6µ + 2(8 + 13 + 13) = 236. The minor costs therefore are increased to 1,594.
Evaluating the number of qubits used is similar to the case for the dirty ancillae considered above. 1. The system is represented on N = 108 qubits.
2. We are preparing a state with a number of qubits log L + 6 + 4 log(N/2) = 38. 6. The QROAM uses (k − 1)M + log(d/k) = 2,659 qubits. These are erased, so we need only count the cost of those used in step 4 (since those may reuse the qubits used for step 3).
7. Each state preparation needs µ = 28 qubits to store a superposition state to perform the inequality comparison, as well as a single qubit to store the result of the inequality comparison. We need not count the cost of the qubits used in step 4, because we may reuse qubits from the QROAM.
8. The number of qubits needed for the phase estimation m = 26.
4. The controlled swaps used to generate the symmetries have a cost of 4 log(N/2) = 28.
These minor costs have a total of 918. For the number of qubits, we have the following.
3. Preparation of the equal superposition state uses 3 ancilla qubits, plus there is a flag qubit for success.
4. The 18 qubits that we iterate over for the QROAM.
5. The qubits needed for the QROAM k 1 M , which includes the output M registers and another (k 1 − 1)M working output registers. We need to subtract 2 + 4 log(N/2) = 30 qubits that are part of the state being prepared that we have already accounted for. Here k = 32 and M = 84 giving 2,658.
7. Each state preparation needs µ = 24 qubits to store a superposition state to perform the inequality comparison. These do not add to the cost because we can reuse qubits used in the QROAM.
8. The number of qubits needed for the phase estimation m = 24.
These give a total of 2,904.

F Preparation of equal superposition states
Here we explain in more detail how the number of Toffolis to prepare equal superposition states were determined. In the case of low rank factorization, we aim to prepare an equal superposition Here we take all variables to start at zero, because that is how they would be encoded in practice.
(In the body of the paper we took p, q, r, s to start from 1 for simplicity.) A way to prepare this state is to initially use Hadmards to prepare equal superpositions over all registers, over larger ranges that are powers of 2. Then we perform inequality tests to check that ≤ L, p ≥ q, p < N/2, r ≥ s, r < N/2 are satisfied. The amplitude for success can then be brought close to 1 by amplitude amplification.
With N = 108 (the RWSWT orbitals), there are 6 qubits for each of p, q, r, s, and 8 qubits for , for a total of 32. That means the state flagged by success is This has amplitude of sin φ = (N 2 /8 + N/4) √ L + 1 2 16 ≈ 0.321.
If we were to use k = 2 steps of amplitude amplification, then the amplitude would be sin((2k + 2)φ) ≈ 0.998.
qubits. For the preparation over , p, q, we prepare an equal superposition over all registers, then perform inequality tests to give a state flagged on success This has amplitude of sin φ = (N 2 /8 + N/4)(L + 1) 2 11 ≈ 0.374.
The strategy to compute this function efficiently is as follows.
1. Copy p into the output register.
2. For qubit k encoding p, use it to control addition of 2 k p to the output register. After this the value in the output register is p(p + 1).
3. Discard the least significant qubit in the output register. This qubit must be zero, and the effect of discarding it is to divide by 2, giving p(p + 1)/2.

4.
Add q to the output register.
5. For bit k of N 2 /8 + N/4, if it is nonzero add 2 k to the output register.
For the controlled addition we can control copying of the register to be added to a new register, then control addition of that ancilla register to the output register. Normally, if there are n qubits then that will result in n Toffolis in addition to the Toffolis for the addition. The ancilla register can be erased without further non-Clifford gates by measurement in the X basis and classically controlled Clifford gates [62]. In our case, we can save one Toffoli because we are controlling copying p by a qubit from p. Copying that qubit of p to the ancilla controlled on itself is just achieved by a CNOT gate. The cost of the additions just corresponds to the number of qubits that need be acted upon minus one.
In Eq. (88) we have combined the terms like p 0 p 1 , p 1 p 0 and given a factor of 2.
1. Copy p 1 to p 5 into the output ancilla, which has no Toffoli cost. That gives the first term in square brackets in Eq. (89).
Adding all these costs together gives a total Toffoli cost of 5 + 6 + 7 + 4 + 6 + 3 + 5 + 2 + 4 + 1 + 2 + 1 + 10 + 9 + 8 × 5 = 105. Next consider N = 152, so p and q are encoded in n = 7 qubits and have maximum possible values of 75. Then we have N 2 /8 + N/4 = 2926. In this case it is more efficient to use the fact that 2926 = 2048 + 1024 − 128 − 16 − 2 and perform some subtractions. The expression p(p + 1)/2 can be written in terms of p j as p(p + 1)/2 = [p 1 + 2p 2 + 2 2 p 3 + 2 3 p 4 + 2 4 p 5 + 2 5 p 6 ] + [p 0 + 2p 0 p 1 + 2 2 p 0 p 2 + 2 3 p 0 p 3 + 2 4 p 0 p 4 + 2 5 p 0 p 5 + 2 6 p 0 p 6 ] + 2[p 1 + 2 2 p 1 p 2 + 2 3 p 1 p 3 + 2 4 p 1 p 4 + 2 5 p 1 p 5 + 2 6 p 1 p 6 ] + 2 3 [p 2 + 2 2 p 2 p 3 + 2 3 p 2 p 4 + 2 4 p 2 p 5 + 2 5 p 2 p 6 ] + 2 5 [p 3 + 2 2 p 3 p 4 + 2 3 p 3 p 5 + 2 4 p 3 p 6 ] + 2 7 [p 4 + 2 2 p 4 p 5 + 2 3 p 4 p 6 ] + 2 9 [p 5 + 2 2 p 5 p 6 ] + 2 11 p 6 . (90) Then the sequence of elementary steps we need to perform is as follows. first carry qubit (the third line in Figure 12 and Figure 13) is zero, and the CNOTs it controls have no effect and may be omitted. As a result, we may perform the circuit as before except starting from i 1 and t 1 instead of i 0 and t 0 . This means that we save a Toffoli. More generally, if there are k trailing zeros for one of the numbers (it is a multiple of 2 k ), then we may save k Toffolis. Next consider the case where one of the numbers is given classically, so is a constant. We can use the same circuit, and use NOT gates to prepare i in the desired state corresponding to our classically given number. In practice we could make further simplifications to reduce the number of gates, but in most cases they do not reduce the number of Toffolis. One which does is to note the value of i 0 . If i 0 = 1, then the first Toffoli can be replaced with a CNOT, saving a single Toffoli. That means that, for adding a constant, we always have a Toffoli cost 1 less than adding a variable. Moreover, if i 0 = 0, then we can make the same simplification as described above for variables. That means that if the constant is a multiple of 2 k then we can save k Toffolis. This saving is in addition to the saving of 1 because we are adding a constant.
In order to perform subtraction, one can simply reverse the circuit for addition, as shown in Figure 14. This circuit is for modular subtraction of n-qubit variables, and can also be used for non-modular subtraction if it is known that t ≥ i. The cost to subtract two n-qubit numbers is n−1. Again we can make a saving if it is known that i is a multiple of a power of 2. If i is a multiple of 2 k , then we can save k Toffolis, via exactly the same reasoning as for addition. Similarly, if i is a constant, then we can save a further Toffoli.
As was noted in [66], subtraction can be used for inequality testing as well. That is, if we perform modular subtraction t − i, then the carry qubit will carry the information about whether i > t. In Figure 16, we show the circuit for an inequality test on two 4-qubit variables. We have