Improved Fault-Tolerant Quantum Simulation of Condensed-Phase Correlated Electrons via Trotterization

Recent work has deployed linear combinations of unitaries techniques to reduce the cost of fault-tolerant quantum simulations of correlated electron models. Here, we show that one can sometimes improve upon those results with optimized implementations of Trotter-Suzuki-based product formulas. We show that low-order Trotter methods perform surprisingly well when used with phase estimation to compute relative precision quantities (e.g. energies per unit cell), as is often the goal for condensed-phase systems. In this context, simulations of the Hubbard and plane-wave electronic structure models with $N<10^5$ fermionic modes can be performed with roughly $O(1)$ and $O(N^2)$ T complexities. We also perform numerics revealing tradeoffs between the error and gate complexity of a Trotter step; e.g., we show that split-operator techniques have less Trotter error than popular alternatives. By compiling to surface code fault-tolerant gates using lattice surgery and assuming error rates of one part per thousand, we show that one can error-correct quantum simulations of interesting, classically intractable instances with only a few hundred thousand physical qubits.


I. INTRODUCTION
The physics of interacting electrons in the presence of external fields predicts the properties of many materials as well as the dynamics of most chemical reactions. While the dynamics of many such systems appear intractable for classical computers, quantum computers were originally introduced as universal simulators capable of efficient quantum dynamical simulation [1,2]. Later work [3] combined the quantum phase estimation algorithm [4] with quantum simulation of fermions [5] to show that quantum computers can efficiently prepare ground states and estimate ground state energies of these fermionic systems whenever an initial state can be prepared with non-vanishing overlap on the ground state (as is often the case for systems of interest [6]).
Of particular interest has been quantum simulations of the molecular electronic structure problem [7]. The first quantum circuits for these simulations [8] had O(N 10 ) scaling where N is the number of single-particle basis functions [9]. Most early work in this area focused on performing time evolution of Hamiltonians in a Gaussian basis, by means of Trotter-Suzuki-based methods [10,11]. By tightening bounds on the Trotter error [9,[12][13][14] and optimizing the implementation of Trotter steps [15,16] this cost was brought down to an empirically observed O(N 6 ) scaling, which would likely be even lower using the techniques of [17] and [18].
Recent work has further reduced these scalings via application of linear combinations of unitaries [19] methods such as Taylor series [20,21] and qubitization [22] either in combination with quantum signal processing [23] or used directly in phase estimation [24][25][26][27]. Another important innovation in the quantum simulation context has been the use of plane wave, rather than Gaussian, bases [28,29]. Using plane waves simplifies the quantum computation and also extends simulations to the condensed-phase. The best quantum algorithms for plane wave basis electronic structure achieve either O(N 2 ) gates with O(N log N ) space [30] or O(N 3 ) gates with O(N ) space [26]. There are also many papers which focus on these simulations in first quantization [31][32][33][34][35], the most efficient of which [36] obtains O(η 8/3 N 1/3 ) gate complexity and O(η log N ) space complexity, where η is number of particles.
Here, we will also explore quantum simulations of the Hubbard model [37], one of the simplest models of interacting electrons. Despite this simplicity, the Hubbard model demonstrates a range of correlated electron behaviors, and is In Appendix F, we construct circuit primitives for phase estimation from Trotter steps for the electronic structure problem. In Appendix G, we numerically evaluate the commutators which contribute to the Trotter error, allowing us to upper bound the shift in the ground state energy using the results of Appendix E. We then combine these results in Section III to compute upper bounds on the fault-tolerant costs of simulation. We use our Trotter analysis to place a numerical upper bound on the number of T gates required for fault-tolerant phase estimation for a range of system sizes. Together with the maximum number of logical qubits required, this allows us to determine the number of surface code physical qubits needed for distilling T states, the number of physical qubits to store and route data, and the execution time of these algorithms. Additional details on the precision targets we use, further numerical data, as well as some nuances of the minimization procedure, are detailed in Appendix H. Finally, we discuss and summarize our results in Section IV.

II. HAMILTONIANS AND TROTTER STEPS
This paper will analyze and reduce the cost of several approaches to simulating fermionic Hamiltonians of the form where a † p and a p are fermionic creation and annihilation operators and n p = a † p a p is the number operator for the corresponding spin-orbital. Mapping to qubits under the Jordan-Wigner transformation [62,63], Eq. (1) becomes This form includes a range of Hamiltonians. Note that there has also been a large body of work on using different fermionic encodings [64][65][66][67][68][69][70]; however, these methods do not appear to offer significant advantages in terms of the T complexity of simulations because they do not reduce the number of rotations that must be simulated. First, we consider this Hamiltonian when the coefficients correspond to the plane wave electronic structure Hamiltonian. In the plane wave dual basis [28], the coefficients of Eq. (1) are T pq = δ σp,σq ν k 2 ν cos(k ν ·r q − k ν ·r p ) 2 N U p = − j,ν =0 4π ζ j cos(k ν ·R j − k ν ·r p ) Ω k 2 2π cos(k ν ·r p − k ν ·r q ) where each orbital index p is associated with a spin label σ ∈ {↑, ↓} and an orbital centroid r p = p (2 Ω/N ) 1/d defined in terms of the number of spatial dimensions d and the computational cell volume Ω [28]. The momentum modes are defined as k ν = 2πν/Ω 1/d with ν ∈ [−(N/4) 1/d , (N/4) 1/d ) d . Throughout, N is the number of spin-orbitals. We will focus on instances of this Hamiltonian corresponding to periodic materials such as lithium hydride metal, diamond, graphite, crystalline silicon, etc., as well as the special case of U p = 0, which is the uniform electron gas ("jellium"). When dealing with molecular potentials, R j and ζ j are the position and charge, respectively, of the j th nucleus. Because these only change Eq. (2) through the terms U p Z p , they enter our algorithms as a layer of singlequbit rotations per Trotter step, adding essentially no cost to the Trotter step circuit. However, this layer changes the Trotter error and hence the total simulation cost.
The Hubbard model corresponds to the case where T pq are nonzero only for nearest neighbors on a lattice, V pq is the on-site interaction U if p and q correspond to the same orbital (with opposite spin), and U p = µ is the chemical potential. Ignoring the chemical potential, the Hubbard Hamiltonian is [37] The first term describes hopping between adjacent lattice sites ( i, j denotes a sum over nearest neighbor lattice sites) and the second is the on-site interaction. Despite its simplicity, the Hubbard model demonstrates a wide range of correlated electron behaviors, including metal-insulator transitions and a superconducting phase, and further is believed to be a candidate model of high-temperature superconductivity in cuprates [38,39]. Both these Hamiltonians can be simulated using the algorithms of [28] and [29]. The first of these is a splitoperator method: it uses a network of Givens rotations or the fast fermionic Fourier transform (FFFT) to alternate between simulating evolution under the kinetic energy terms in the momentum basis and simulating evolution under the potential energy terms in the position basis. (The FFFT was first described in [42] and has also been studied under the name of the "spectral tensor network" [71,72]; the Givens rotation network was first studied in [43] and further developed in [29]). In the momentum basis, the kinetic energy terms can be simulated using only single-qubit rotations; in the position basis, the potential energy terms can be simulated in linear depth using a swap network [28]. This algorithm operates on a planar array of qubits. By comparison, the second algorithm uses a linear array of qubits, and constructs a swap network which simulates the kinetic and potential operators without the FFFT [29].
We discuss the split-operator algorithm, including the generalization of the fermionic Fourier transform using Givens rotations introduced in [29], in greater depth in Appendix C, and the fermionic swap network in Appendix B. In Appendix C 2 we show how to reduce the cost of simulating the potential energy operator within the split-operator method from O(N 2 log(1/ )) to O(N 2 + N log N log(1/ )), where is the rotation synthesis precision, using Hamming weight phasing. We will discuss in the next subsection the fault-tolerant costs of each Trotter step in terms of the number of arbitrary rotations, Toffoli, and T gates; we describe this in greater detail in Appendix D.
We do not make use of several recent techniques that could lead to improvements in the asymptotic limit, but that have large constant factors and are thus not helpful for the system sizes we consider. For instance, the work of [30] introduced a technique for computing the diagonal part of the plane wave basis electronic structure Hamiltonian with only O(N ) gates; however, the need to compute the discrete Fourier transform of the potential introduces very large constant factors and increases the spatial complexity to O(N log N ). Haah et al. have shown an improved algorithm based on locality of lattice models by dividing simulation into blocks of evolution in disjoint regions, though the number of blocks becomes large in two and especially in three dimensions [45]. Tran et al. [73] determined the gate count of the Haah et al. algorithm when the interactions decay as a power law, though there is no asymptotic speedup for the Coulomb interaction. Likewise, the work of [46] achieves nearly N o(1) scaling for Hubbard models but by using arbitrarily high-order formulas which increases gate complexity by a factor of 5 k for order k. For the extensive error case we achieve the same scaling using low-order formulas. While randomized Trotter orderings can yield significant improvements for simulation [18,74], these would come at the cost of the improvements we gain from Hamming weight phasing, which requires specific Trotter orderings for full utility.
Finally, we note that while the Hamiltonian of Eq. (1) exactly describes any electronic structure system (including molecules) with basis set discretization error asymptotically equivalent to Gaussian-based molecular orbitals, there are other discretization schemes (such as finite difference discretization) which also take the form of Eq. (1), but with different coefficients from those in Eq. (3). Continuing research on basis functions may yield further examples compatible with these algorithms while also being better suited to molecules. For instance, the basis sets described in [75] are compatible with our algorithms while also being significantly more accurate for molecules than plane waves.

A. Gate costs per Trotter step
We now summarize how we compute the costs, in terms of the number of arbitrary rotations, Toffoli, and T gates, of the different Trotter steps. We include a more complete description in Appendix D, including upper bounds on the number of the different costly gates, as well as more complete descriptions of the two Trotter steps following Appendix B and Appendix C. To begin, we remove zero-angle rotations from both Trotter steps, and additionally neglect the cost of the final layer of operations from each Trotter step, since this final layer may be merged with the first layer of operations in the next Trotter step. Now, we compute the cost of each Trotter step. This must be performed numerically so that we can do it for a range of numbers of ancilla qubits used for Hamming weight phasing. The core idea behind Hamming weight phasing [60] is that the total phasing operation from a group of n equiangular rotations can instead be applied with only log 2 n + 1 arbitrary rotations and at most n − 1 ancilla qubits and n − 1 Toffoli gates (or equivalently, 4n − 4 T gates). We review Hamming weight phasing in greater detail in Appendix A.
We compute the gate costs per Trotter step with fixed numbers of ancilla qubits assigned for Hamming weight phasing. Specifically, we only consider using 2, 14, 30, or 126 ancillae for this phasing. This means that we can combine at most 3, 15, 31, or 127 simultaneous equiangular rotations, respectively, into 2, 4, 5, or 7 arbitrary rotations, at a cost of 2, 14, 30, or 126 Toffoli gates. This is efficient because the cost of an arbitrary rotation is much greater than the cost of a Toffoli gate. We then determine the numbers of arbitrary rotations that must be synthesized, T, and Toffoli gates, by iterating through the terms in the Hamiltonian in the order specified by the particular Trotter step. Where possible, we combine arbitrary rotations in each layer. The number of arbitrary rotations and Toffoli/T gates in each Trotter step are then totally determined by the simulation order, which we describe next. For all systems and for both Trotter steps, we generate the Hamiltonians and determine simulation order using OpenFermion [76].
First, for the Hubbard model, we simulate all the on-site interactions in a single layer in which we apply Hamming weight phasing. Then, we run through the fermionic swap network (see Appendix B) until we reverse the initial spin-orbital ordering, applying rotations corresponding to hopping terms as we go. More specifically, as we apply the fermionic swaps, we accumulate a set of qubits which need to be acted on. This set is made up of those spin-orbitals whose Hamiltonian terms appear in that section of the fermionic swap network. We continue accumulating qubits in this way until one qubit is about to be acted on a second time: at that point, we apply all the arbitrary rotations to the accumulated qubits, and reset the set of qubits. We continue this until we reverse the initial mode ordering, and then reverse the process to complete the second-order Trotter step. The final set of rotations from the hopping operators (in the middle of the Trotter step) can be merged, and the rotations from the on-site interactions at the end of one Trotter step can be merged with the same terms at the beginning of the next Trotter step. Second, for jellium (the uniform electron gas), we run through the fermionic swap network. Unlike in the Hubbard model, there are sufficiently many terms that we must apply the rotations from each fermionic simulation gate immediately, and can only apply Hamming weight phasing within that layer of fermionic simulation gates. Finally, for the various materials, we generate the Hamiltonians and run the same simulation procedures as for jellium, but with all the single-qubit rotations repeated in the middle of the Trotter step.
The gate counting for the split-operator Trotter step is slightly easier to describe. For jellium and the materials, we develop a new scheme using the translation-invariance of the interaction part of the Hamiltonian to significantly reduce the Trotter step cost using Hamming weight phasing (Appendix C 2). For the Hubbard model, because the interaction is exclusively on-site, the split-operator Trotter step simulates it exactly as the fermionic swap network Trotter step does, but without the swaps. The single-qubit rotations which simulate evolution under the kinetic energy operator are grouped together as much as possible into layers for Hamming weight phasing, given the number of ancilla qubits assigned for that task. Finally, we need to change from the position to the momentum basis and vice versa: using Hamming weight phasing, arbitrary rotations which might appear in the FFFT are catalyzed rather than synthesized (see Appendix A 3) using only a handful of Toffoli gates, and when the Givens rotation procedure is used we apply Hamming weight phasing within it as much as possible given the number of ancilla qubits.

A. Trotter-Suzuki errors
Consistent with past work in molecular simulation [12,14], we focus on the second-order Trotter formula given by which holds for sufficiently small t. The algorithms of Section II naturally implement the second-order Trotter formula if run forward for time t/2 and then again in reverse. In Appendix D as well as the previous subsection, we describe how to compute the T-costs of optimized circuits for second-order Trotter steps with both the split-operator and fermionic swap network algorithms. Low-order Trotter decompositions may be useful in variational algorithms to approximate the ground state energy [28,77]. However, for non-heuristic algorithms using Trotterization to simulate time dynamics or to prepare eigenstates via phase estimation, we must ensure that the discretization errors incurred by the Trotter-Suzuki decomposition can be controlled.
In Appendix E we show that for a second-order Trotter step, the difference between the exact and effective unitary evaluations is bounded by where we call W the "Trotter error norm", and the sums run over the terms in the Hamiltonian. Unlike a similar bound in [12], the result above is non-perturbative and holds for all values of t. Our derivation in Appendix E is similar to a derivation in Appendix B of [9]; however, our result is tighter by constant factors, matching the factor of 1/12 from the Baker-Campbell-Hausdorff expansion discussed in [12]. Additionally, we are able to keep the norm on the outside of some of the sums, which is another reason it is tighter than the bound in [9]. Computing the Trotter error norm W allows us to bound the number of applications of the Trotter step circuit, and hence the gate count, required for phase estimation to a desired precision. Of particular interest to us is the problem of sampling eigenvalues using phase estimation. As we show in Appendix E, the maximum shift in the unitary eigenphases (which encode the energies) from Trotterization is for all n, where E n and E eff n are corresponding eigenvalues of H and H eff , respectively. Thus, the error in the eigenphases from Trotterization is roughly W t 3 whenever W t 3 W 3 t 9 /24. For simplicity, we will assume we can approximate the errors by W t 3 so long as W t 3 ≤ 1, implying the first-order term is at least 24 times larger than the next order correction. As we shall see, the Trotterized evolution time is t = ∆E T S /W , where ∆E T S accounts for energy errors from the Trotter-Suzuki approximation, so for the condition to hold we need (∆E T S ) 3 ≤ W .
Rather than approximating the error as W t 3 , we could instead use the exact expression with the arctan, which holds for W t 3 ≤ √ 2; however, here we focus on the simpler form from the series expansion. The first reason for this is that the condition W t 3 ≤ 1 is strongly satisfied for all the numerics in this paper. The second reason is that it is much easier to perform the optimizations of Appendix G if we can assume the error is given by W t 3 . Unlike the work of [47] which uses Monte Carlo sampling to estimate Trotter errors, we exactly numerically evaluate W for all systems we consider, and are not subject to sampling errors. While the work of [14] performed similar numerics, that work focused on arbitrary basis chemistry (which involved much more complicated error operators), and as a result, was not able to make calculations for larger than N = 30, whereas here we are able to go as high as N = 512. Still, the bounds in Appendix E can dramatically overestimate the error in the limit of small t. For this reason, it is important to bear in mind that the estimates we provide for the cost are almost certainly quite pessimistic.
As discussed in Section II, up to a single layer of rotation gates implementing the local external potential terms, the same circuit simulates Trotter steps of any molecule in the plane wave basis. For simplicity, much of our study focuses on the case when these gates are dropped (U p = 0), corresponding to simulation of the uniform electron gas (jellium) discussed in detail in [28]. There, it is argued that the scientific importance of jellium, the classical difficulty of its simulation, and its history as a benchmark for classical electronic structure methods, position it as an intriguing system through which to contrast quantum and classical simulations. There are open questions about the physics of jellium (especially pertaining to the nature of the biased errors introduced to control the sign problem in quantum Monte Carlo) which one could begin to study on a quantum computer with fewer than one hundred logical qubits. Additionally, we compute the Trotter errors after re-introducing the external potential, so as to determine the costs of simulating several different periodic materials.
Jellium has a single phase parameter, the density given by η/Ω in Eq. (3), which scales the system between the limits of strong and weak correlation. Classically, jellium is particularly challenging to study near half-filling for densities where the average electron radius (the Wigner-Seitz radius) is approximately r s = 10 Bohr radii. We conduct most of our numerics for jellium in this regime, computing the Trotter errors for spinful jellium with and without spin in two and three dimensions at r s = 10. Additionally, we compute the Trotter error for jellium in 3D, varying the Wigner-Seitz radius in logarithmically spaced steps from r s = 0.5 to r s = 32 Bohr radii. For the Hubbard model, we determine the costs of simulation in the intermediate and strongly coupled regimes, U/τ = 4 and U/τ = 8, respectively, with the hopping integral τ = 1. We base our energy precision requirements on classical state-of-the-art ground state energy estimates for the Hubbard model at these parameter values for weakly doped systems (filling fraction 0.875) [39].
For all systems studied, our numerics were performed using code which we have contributed to the open-source package OpenFermion [76]. However, the larger system sizes studied here may be difficult to access without distributed calculations. We discuss full details of these calculations in Appendix G.

B. T gate requirements for Trotterized phase estimation
In order to give concrete T gate counts for the cost of phase estimation, we need to discuss how errors propagate throughout the algorithm. We focus here on quantum simulation within a fault-tolerant architecture, similar to [47], but provide further details about the optimal balance between these errors. Neglecting individual gate errors, the sources of error for estimating the energy in the simulation are: 1. Trotter-Suzuki errors ∆E T S , from the Trotter approximation to e −iHt due to Hamiltonian terms not commuting.
From Eq. (6) and Eq. (7), we know that the error in the eigenvalue of the simulated Hamiltonian obeys We numerically computed the Trotter error norm W as shown in Section III A for a variety of systems for the fermionic swap and split-operator algorithms outlined in Section II.
2. Phase estimation errors ∆E P E , due to uncertainty in the value returned by phase estimation; i.e. errors due to not computing enough bits of the phase. We use the adaptive phase estimation techniques of [58,59] to reach a root mean squared error of ∆E P E t using applications of the fundamental simulation circuit. This relies on using directionally-controlled evolution [9] to reduce prior estimates on the cost of phase estimation [58] by a factor 2. This approach uses a single control qubit. It is possible to match the ultimate lower bound of π/(2∆E P E t) using multiple control qubits, as in [26]. The number of control qubits needed would be the log of the dynamic range needed. The difference in performance can be accounted for by dividing the gate counts we present by a factor 1.52. Alternatively, the approach of [78] allows median error performance of ∆E P E t using 1.65/(∆E P E t) applications of the fundamental simulation circuit and just a single control qubit; this could similarly be accounted for by dividing our gate counts by 1.45.
3. Circuit synthesis errors ∆E HT , due to the fact that there is typically some approximation in compiling arbitrary rotations R z (θ) into single-qubit Clifford and T gates. Relative to the other two sources of error, the impact of circuit synthesis error can be reduced at substantially lower cost. If one wishes to synthesize a single-qubit rotation, R z (θ) within error then the number of T gates required using repeat-until-success synthesis is on average T synth ≈ 1.15 log 2 (1/ ) + 9.2 [79]. Following the arguments in [47], such errors in the individual rotations add at most linearly to the error in the phase estimated. The phase estimated is the energy eigenvalue multiplied by the time step used for the Trotter decomposition. Thus, if the error desired in the estimate of the energy eigenstate is ∆E HT , the cost of circuit synthesis is approximately T gates per arbitrary rotation, where N r is the number of rotations used in a single Trotter step.
In the worst case, these errors add linearly [47]. Thus, to guarantee that the total error is at most ∆E, we assume The total T-cost of phase estimation is the product of these, plus the number of "direct" T and Toffoli gates N d (those T/Toffoli gates which appear in the circuit before synthesis) multiplied by N P E . Again, we convert Toffoli to T gates in computing N d at a cost of 2 T gates each [80]. Of the different variables described so far, N r and N d are the only fixed values: irrespective of the desired total precision ∆E, N r and N d are set by the Trotter step circuit and number of ancillae used for Hamming weight phasing. N P E depends on ∆E P E and ∆E T S , and N HT depends directly on ∆E HT and through the time t on ∆E T S . The T count for phase estimation is thus We numerically minimize this cost for each system, subject to the constraint 0 < ∆E T S + ∆E P E + ∆E HT ≤ ∆E, where each ∆E is positive. We perform this minimization for two different precision targets: relative precision, where ∆E is set as a fraction of a proxy for the ground state energyẼ 0 , and thus scales with system size, and absolute precision, where ∆E is set as some fixed constant irrespective of the system size. We discuss these precision targets, as well as some subtleties of the minimization and other numerical results, in more detail in Appendix H. In both cases, in minimizing and plotting the cost, we convert Toffoli to T gates at a cost of 2 T gates per Toffoli. This is because using the techniques of [80], one can distill a Toffoli state at twice the cost of a magic state.
We plot the minimized T-count in Figure 1a for the uniform electron gas in two and three dimensions to relative precision ∆E = 0.005Ẽ 0 (comparable to errors associated with the sign problem in quantum Monte Carlo [81,82]) and absolute precision ∆E = 0.0016 Hartree (i.e. "chemical accuracy"), which shows that we can realistically expect to simulate classically intractable instances of two-and three-dimensional jellium (up to ∼300 qubits) with Trotter methods using fewer than one billion T gates. Counts for the Hubbard model to the same relative precision and to absolute precision ∆E = τ /100 are shown in Figure 2. In Figure 3 we plot the minimized T-counts for phase estimation of several different periodic materials.
We separately plot data for the fermionic swap network and split-operator Trotter steps; within the split-operator Trotter step, we separate data at system side lengths which are a power of two from the rest of the data. When the side length is a power of two, the fermionic fast Fourier transform (FFFT) can be applied allowing significantly more efficient basis changes than the Givens rotation procedure. Because this is not a significant component of the cost for the uniform electron gas, we do not separate these data points for that algorithm. We discuss this difference in costs further in Appendix C. As observed in [12,14,27,47], estimates based on bounds on the Trotter error operator norm are often extremely loose, sometimes by several orders of magnitude [14]. 16

FIG. 1.
Loose but rigorous upper bound on the number of T gates in the circuit which performs our Trotterized phase estimation simulation of the uniform electron gas (UEG, or jellium) at Wigner-Seitz radius of 10 Bohr radii at half filling η = N/2 based on Eq. (12). (a) shows data for the relative precision target (to within half a percent of the total system energy) as a function of the number of spin-orbitals (qubits); (b) shows data for the absolute precision target of 0.0016 Hartree ("chemical accuracy"). Bounds were computed for jellium both in 2D and in 3D, using both the fermionic swap network (FS) and split-operator (SO) Trotter steps. Exponents of a power law fit are shown in the caption for each system. 14 ancilla qubits are used for Hamming weight phasing for all systems. We can see that for the extensive error target the T cost scales between O(N 3/2 ) and O(N 2 ) and for the intensive error target the T cost scales between O(N 3 ) and O(N 7/2 ). denotes the split-operator Trotter step using the fast fermionic Fourier transform to change bases (only for side lengths which are binary powers), SO (red circles (triangles) for U/τ = 4 (8)) denotes the split-operator Trotter step using Givens rotations to change bases, and FS (blue circles (triangles) for U/τ = 4 (8)) denotes the fermionic swap Trotter step. The fermionic swap Trotter step more efficiently combines repeated rotations for the Hubbard model than for jellium. Lines show power law fits to the data, with the exponents of the fits shown in the caption for each system. 14 ancilla qubits are used for Hamming weight phasing for all systems. We see scaling between O(1) and O(N 1/2 ) for the extensive error simulations and scaling between O(N 3/2 ) and O(N 2 ) for the intensive error simulations.
An interesting observation from Figure 2a is that the required number of T gates scales sublinearly in the number of Hamiltonian terms. This is because the number of Trotter steps required to achieve the extensive error target  FIG. 3. Loose but rigorous upper bound on the number of T gates in the circuit which performs our Trotterized phase estimation simulation for a range of materials (a) to within half a percent of the total system energy, and (b) to absolute precision ∆E = 0.0016 Hartree. Exponents of a power law fit are shown in the legend for each system in the form (xF S /xSO), where xF S is the exponent for the fermionic swap network Trotter step and xSO is the exponent for the split-operator Trotter step. Lines show power law fits to the data, with the exponents of the fits shown in the caption for each system. 14 ancilla qubits are used for Hamming weight phasing for all systems.
actually decreases with N , implying that the allowable error ∆E = Θ(N ) grows at a rate comparable to the Trotter error norm. As discussed in Section I, fixed relative error is often a sensible target. Despite this, for sufficiently large system sizes, the number of circuit repetitions in phase estimation will eventually reach one (it cannot go lower), at which point the number of gates will return to linear scaling in the number of terms in the Hamiltonian. We analyze this further in Appendix H and show that the number of phase estimation repetitions will not reach this limit until system sizes of hundreds of thousands of spin-orbitals. The full T-cost (Eq. (12)) scales as O(N r √ W /∆E 3/2 ) in terms of the Trotter error norm W , the number of arbitrary rotations per Trotter step N r , and the target precision ∆E. In Appendix G we show that the Trotter error norm W scales cubically with N (the number of spin-orbitals) for the uniform electron gas and linearly with N for the Hubbard model. Our relative precision target is ∆E = Θ(N ) for both the uniform electron gas and the Hubbard model and our absolute precision target is ∆E = Θ(1). The number of arbitrary rotations in a Trotter step N r scales linearly for the Hubbard model with the fermionic swap network Trotter step and either quasilinearly or quadratically for the uniform electron gas and the Hubbard model with the split-operator Trotter step (due to the gates required for changing bases). From these scalings, we expect the T-costs in the absolute precision case to be O(N 7/2 ), O(N 2 ), and O(N 3/2 ) for the uniform electron gas, the Hubbard model with fermionic swap network Trotter steps, and the Hubbard model with the split-operator step, respectively. Changing to relative precision reduces these scalings by N 3/2 in all cases. This is consistent with what we see in Figure 1 and Figure 2.
We can understand the condition from the previous subsection W t 3 = (∆E T S ) 3 /W ≤ 1, required for the error to be well estimated by W t 3 as in Eq. (7), in terms of these scalings. In the absolute precision case the condition becomes more strongly satisfied with growing system size; however, in the relative precision case the quantity W t 3 will grow. In particular, for the Hubbard model W is only O(N ), so (∆E T S ) 3 /W = O(N ) and we must be careful to work with systems where the value remains small. For the Hubbard model with U/τ = 4, (∆E T S ) 3 /W will not reach unity until hundreds of thousands of spin-orbitals, far larger than any system we study.

C. Surface code resource estimate analysis
Any resource estimate depends sensitively on the assumed hardware. We shall assume the availability of a large planar square array of qubits with nearest-neighbor interactions capable of implementing all physical gates with nonuniform error rates sufficiently low that the overall performance is approximated by a uniform gate error rate of p = 10 −3 or p = 10 −4 . We will assume the gates are sufficiently fast to implement a full round of surface code error detection in 1 µs. Finally, we will assume that a decoder with error suppression performance comparable to [83] is available that is capable of extracting the logical measurement value from potentially many physical measurements associated with a given logical qubit in no more than 10 µs. This is 100 times slower than has been previously assumed [48], as recent (unpublished) research suggests that it will be challenging to achieve even this level of performance.

Physical qubits Execution time (hours) System
Ancillae Logical qubits Toffoli gates T gates p = 10 −3 p = 10 −4 p = 10 −3 p = 10 −4 for the relative precision target, assuming physical operation error rates of p = 10 −3 and p = 10 −4 . We include data varying the number of ancilla qubits used for Hamming weight phasing. FH indicates Hubbard model data with U/τ = 4, and UEG indicates data for the spinful uniform electron gas with a Wigner-Seitz radius of 10 Bohr radii. For all systems, the total number of logical qubits for the system is double the number of grid points plus the number of ancillae. When generating T states, superscript a indicates that a single round of 15-1 distillation was used, superscript b indicates 15-1 followed by CCZ state generation and catalyzation was used [80]. Scheme a requires fewer qubits, but can be slower than scheme b.

Physical qubits Execution time (hours) System
Ancillae Logical qubits Toffoli gates T gates p = 10 −3 p = 10 −4 p = 10 −3 p = 10 −4  TABLE II. Fault-tolerant resource estimates (number of logical ancillae, Toffoli gates, T gates, physical qubits, and execution time) for the absolute precision target, assuming physical operation error rates of p = 10 −3 and p = 10 −4 . We include data varying the number of ancilla qubits used for Hamming weight phasing. FH indicates Hubbard model data with U/τ = 4, and UEG indicates data for the spinful uniform electron gas with a Wigner-Seitz radius of 10 Bohr radii. For all systems, the total number of logical qubits for the system is double the number of grid points in it plus the number of ancillae. When generating T states, superscript a indicates a single round of 15-1 distillation was used, superscript b indicates 15-1 followed by CCZ state generation and catalyzation was used, superscript c indicates two rounds of 15-1 distillation were used [80]. Note that scheme a, while requiring fewer qubits, can be slower than scheme b.
We will use lattice surgery [49,50,80,84] to protect our computations, and will make use of distillation to produce |CCZ states and catalyzation to efficiently produce |T states. These states enable Toffoli and T gates, respectively. Only a single factory producing states serially will be assumed, to minimize qubit overhead. We use the approach described in [80], and fall back to the methods of [50] if a total logical error probability below 0.3 is not attainable during the execution of the phase estimation algorithm. For both techniques, we iterate through level-1 and level-2 distillation code distances of 15 through 51 in steps of 2 to find the scheme and code distances that minimize the total number of physical qubits.
Our resource estimates for phase estimation with the relative precision target (0.5% of the approximate system energy) are in Table I. We include the total number of logical qubits (including system qubits, and ancilla qubits used for Hamming weight phasing and phase estimation), numbers of Toffoli and T gates, and the computational volume in physical qubit-hours for error rates p = 10 −3 and p = 10 −4 . We only give results for whichever Trotter step algorithm gives better performance -the fermionic swap network is superior for the Hubbard model when the side length is not a power of two, and the split-operator step superior in the other cases. Table II shows the same data with absolute precision targets of 0.0016 Hartree for the uniform electron gas (jellium) and τ /100 for the Hubbard model. The error-corrected data clarify the time-space tradeoff in Hamming weight phasing: by increasing the number of logical ancilla qubits, we can reduce the execution time by as much as an order of magnitude. Because the number of physical qubits used for state distillation is generally comparable to the total number of physical qubits, the increase in the number of physical qubits to support the additional ancillae is typically small, only ∼ 25% of the original requirement.

IV. DISCUSSION
Prior to this work, the most practical techniques for performing fault-tolerant quantum simulations of condensedphase correlated electron models were those introduced in [26]. That work deployed methods based on linear combinations of unitaries which scaled roughly as O(λN/∆E) where λ is the sum of the absolute values of the Hamiltonian terms; λ = O(N ) for the Hubbard model and λ = O(N 2 ) for the plane wave basis electronic structure Hamiltonian. In this paper we demonstrate low-order Trotter methods for these problems scaling as roughly O(N 3/2 /∆E 3/2 ) for the Hubbard model and O(N 7/2 /∆E 3/2 ) for jellium. While the methods of [26] are more practical when targeting an intensive error, ∆E = Θ(1), the methods here are more practical when targeting an extensive error, ∆E = Θ(N ). For example, for Hubbard model simulations with extensive error the methods of [26] have O(N ) T complexity whereas the methods here have O(1) T complexity (at least until N > 10 5 at which point the scaling increases).
It is also interesting to compare these asymptotics to other recent work on Trotter based simulations. For instance, the work of [46] will also have effective O(1) scaling for the Hubbard model in the extensive error target regime, but at the cost of very large constant factors. Another comparison is the method of [18] which essentially scales as O(λ 2 /∆E 2 ) for the purposes of phase estimation. For extensive error simulations that method will have T complexities of roughly O(N 2 ) and O(1) for jellium and the Hubbard model, respectively, which matches our scaling here.
Our work has also made significant progress towards reducing the resources required to perform the first classically intractable electronic structure calculation on a fault-tolerant device that could be realized with the error rates and connectivities accessible to superconducting qubits. In addition to the low scaling of our methods, as discussed above, we put significant effort into optimizing constant factors through strategies such as the Hamming weight phasing techniques introduced here. We also further reduced surface code compilation overheads by using lattice surgery [50,80,84] rather than topological braiding (as in [26]). Ultimately, we found that even at 10 −3 error rates one can error-correct classically intractable instances of these problems with only a few hundred thousand physical qubits.
We conclude with a brief discussion of open research questions relevant to this work. First, one could investigate the T-costs of higher-order Trotter formulas. Other work has shown these to be advantageous even for small system sizes [51,52]. Numerically evaluating the Trotter error norm would be challenging for higher-order formulas, but even analytic bounds would be interesting to compare with those presented here. Second, rather than looking at upper bounds, one could study the true Trotter error in circuit simulations (using inefficient exact classical calculations and extrapolating, or with efficient approximate classical calculations): this could reduce the number of Trotter steps, and hence our T-count estimates, by orders of magnitude. Finally, one could attempt to extrapolate these quantum simulations to zero Trotter error with even fewer steps than we use here by interpolating the trend between Trotter errors and Trotter number, as is common in quantum Monte Carlo electronic structure simulations.
is, multiple rotations R z (θ) can be performed simultaneously in the circuit. For n parallel equiangular rotations, the technique reduces the number of arbitrary rotations which must be synthesized to log 2 n + 1 , at the cost of at most 4n − 4 additional T gates (or n − 1 Toffoli gates) and n − 1 ancilla qubits. Because synthesizing arbitrary rotations is very costly on a fault-tolerant quantum computer, this can lead to large savings. Following that, we briefly discuss how the method works with a limited number of ancillae r < n − 1 as well as some possible extensions.

Combining arbitrary parallelizable rotations by Hamming weight phasing
Consider a circuit where the same single-qubit rotation R z (θ) = exp(−iθZ/2) is simultaneously applied to three different qubits. The action on the logical states is a different phase, depending only on the Hamming weight of that logical state. To be explicit, 1. the all-zero state |000 picks up the phase −3θ/2, 2. the three states of Hamming weight one |001 , |010 , and |100 each pick up a phase −θ/2 3. the three states of Hamming weight two |011 , |101 , and |110 are all phased by θ/2, and finally 4. the all-one state |111 picks up the phase 3θ/2.
The phases on each logical state are identical for the two procedures. However, because arbitrary rotations must be synthesized using costly T gates, reducing the number of rotation gates in the circuit reduces its fault-tolerant cost.
This idea readily generalizes to the case of n repeated equiangular rotations R z (θ) appearing in parallel in a circuit: rather than applying the n original arbitrary rotations, we can compute the Hamming weight of the relevant qubits, and instead apply log 2 n + 1 arbitrary rotations R z (θ), R z (2θ), R z (4θ), . . ., to the Hamming weight [60]. We call this technique Hamming weight phasing.
At this point, Hamming weight phasing would appear to give an improvement for free. However, there are two sources of cost we have not mentioned. These costs are an additional number of T gates and ancilla qubit requirement, both arising from the use of adder circuits to compute the Hamming weight. Using the adder circuits of [60], we can compute and uncompute the Hamming weight using at most n − 1 Toffoli gates (or equivalently, 4n − 4 additional T gates) and n − 1 ancilla qubits, corresponding to n − 1 single-bit adders. (Each single-bit adder uses one ancilla qubit, and costs one Toffoli gate or 4 T gates [60].) The strategy for this summation is as follows.
We begin with n qubits to which a rotation R z (θ) must be applied. We call these initial qubits "weight-1" qubits: our goal is to combine them such that we are left with one qubit which captures each digit of the Hamming weight, i.e., one qubit each of weight 1, 2, 4, 8, . . ., 2 log 2 (n) . Then, we apply the rotation R z (wθ) to each qubit depending on its weight w, and uncompute the Hamming weight, returning each qubit to its initial state.
We combine the qubits to form the Hamming weight in stages. First, we pick a group of 3 weight-1 qubits: call these three qubits a, b, and c. We initialize one ancilla qubit in the |0 state, and apply the adder building block of [60] to the three qubits and the ancilla. The first two qubits are unaffected (their final state remains a and b), but the third qubit and the ancilla qubit are changed to the 1s and 2s bits of (a + b + c): (a + b + c) 0 and (a + b + c) 1 , respectively. After the adder, we have introduced one ancilla qubit (now a weight-2 qubit, storing (a + b + c) 1 ), and accounted for the weight-1 qubits a and b, thereby decreasing the number of weight-1 qubits unaccounted for in the Hamming weight by two. We repeat this process of choosing groups of 3 weight-1 qubits and combining them until either two or one weight-1 qubits remain. At least one weight-1 qubit must remain because the adder primitive always leaves a qubit of weight 1.
At this point, we have applied (n − 1)/2 temporary adders, thereby introducing (n − 1)/2 ancilla qubits (now of weight 2) and incurring a cost of (n − 1)/2 Toffoli gates. n − 2 (n − 1)/2 weight-1 qubits remain. If n is odd, only one weight-1 qubit remains, and we are done: this is the 1s digit qubit in the Hamming weight. If n is even, two weight-1 qubits (call them a and b) still have to be accounted for. We introduce an ancilla and apply the temporary adder to the two remaining weight-1 qubits. We are left with the input a, (a + b) 0 -now the weight-1 bit of the Hamming weight-and the ancilla in the state (a + b) 1 , which now has weight 2. In the worst case (the case where n is even), we applied n/2 adders, introduced n/2 ancilla qubits, and used n/2 Toffoli gates to reduce to a single weight-1 qubit. We repeat this procedure for the weight-2 qubits. There are at most n/2 weight-2 qubits, so the cost of reducing to a single weight-2 qubit for the Hamming weight register is at most n/2 /2 temporary adders.
We continue to repeat this procedure for the higher-weight qubits, determining the Hamming weight bit-by-bit, until we reach a stage where there are exactly two qubits of weight 2 log 2 n −1 . At this point, we apply the final adder, and introduce a final ancilla qubit for the most significant bit of the Hamming weight. The worst case for the entire procedure is when n is a power of 2. When n is a power of 2, the number of weight-k qubits is always even up to the final weight to be combined, weight k = 2 log 2 n −1 = n/2, for which there are 2 qubits. In general, the number of qubits of weight k is n/k for each k, except for the single k = n qubit. So in the worst case the number of adders required is half this expression summed over k, The total cost to form the Hamming weight register is thus at most n − 1 Toffoli gates (equivalently, 4n − 4 T gates) and n − 1 ancilla. This allows us to reduce the number of parallel equiangular rotations from n to log 2 n + 1 .
Uncomputing the Hamming weight requires no further T gates or ancillae [60]. Finally, note if θ = mπ/2 k for integers k and m that there is no reason to continue the addition past weight 2 k−2 : at that point, each qubit only needs a T gate applied to it, and further adders eliminate only one T gate, versus the 4 T gates used by the adder. We discuss a specific application of this idea later in Appendix A 3.
The total tradeoff for Hamming weight phasing depends on the cost of synthesizing arbitrary rotations. With T synth the number of T gates required to synthesize a single-qubit arbitrary rotation R z (θ), Hamming weight phasing reduces the number of T gates required by Even for small n, the break-even point for Hamming weight phasing to yield an improvement is T synth 10. For comparison, the number of T gates required to synthesize the single-qubit rotation R z (θ) within error synth using repeat-until-success circuits is approximately T synth ≈ 1.15 log 2 (1/ synth ) + 9.2 [79].
The additional cost that must be considered is that of allocating n − 1 ancilla qubits for use in Hamming weight phasing. If n is sufficiently large, these qubits could alternatively be put to use distilling T gates. Together with the additional space requirement, this must be balanced against the savings of Hamming weight phasing.
In the following subsection, we consider the problem of Hamming weight phasing with stronger restrictions on the number of ancilla qubits we can introduce. We present two schemes: one with a constant number of ancillae r, and one using √ n + log 2 n ancilla qubits. In both cases, the T gate requirement increases while the ancilla qubit requirement becomes more favourable.

Hamming weight phasing with limited ancilla
The ancilla requirement of Hamming weight phasing may be prohibitive. We discuss two possible modifications with lower ancilla requirements in this subsection. First, we consider Hamming weight phasing when limiting to a constant number of ancilla qubits r < n − 1. Second, we discuss a method where we compute the Hamming weight of size-√ n subsets of the qubits at a time, and then sum those subset Hamming weights, rather than directly computing the full Hamming weight. This second method reduces the number of ancilla qubits required to √ n + log 2 n . For both modifications, the number of ancilla is reduced at the cost of requiring more Toffoli or T gates. First, we consider limiting to a constant number of ancilla qubits r < n − 1. Because the number of parallel equiangular rotations n is greater than r + 1, it is no longer possible to combine all the rotations at once. The problem is that we do not have enough ancilla qubits to compute the Hamming weight. However, we can still combine r + 1 rotations at a time into log 2 (r + 1) + 1 arbitrary rotations, and repeat the process until we have combined all n arbitrary rotations. Within each repetition, there are at most r adders using all r ancilla qubits, reducing r + 1 arbitrary rotations to to log 2 (r + 1) + 1 at a cost of r Toffoli (4r T) gates. The number of repetitions required to include every rotation is n/(r + 1) . The total cost is at most 4r n/(r + 1) T gates and n/(r + 1) log 2 (r + 1) + 1 arbitrary rotations. So even with only a constant r ancilla qubits, we reduce the number of T gates by at least T synth n − n r + 1 log 2 (r + 1) + 1 − 4r n r + 1 , where T synth is the number of T gates per arbitrary rotation required for synthesis. This is the simplest approach to reducing the number of rotations with a limited number of ancillae. Second, we present a method for reducing the number of arbitrary rotations by computing the Hamming weight of groups of qubits, and adding it to a second register storing the total Hamming weight. We divide the original n rotations into groups of size √ n . For each group, we compute the Hamming weight using √ n − 1 ancillae, add it to a second register of log 2 n + 1 qubits, and then uncompute the Hamming weight of the group. The same √ n − 1 ancillae are used to compute the Hamming weight of each group. After the Hamming weight of all n/ √ n groups has been computed and added to the total, we apply phases to the total Hamming weight as before. Then, we recompute and subtract each of the group Hamming weights to uncompute the total Hamming weight register. This reduces the number of ancillae required from n − 1 as in the previous section, where the Hamming weight of all n initial rotation qubits was computed at once, to √ n + log 2 n .
The number of T gates required when we divide into groups of size √ n , and add to a total Hamming weight register, is as follows. Computing the Hamming weight within each group can be done at a cost of at most 4 √ n − 1 T gates. Adding that group Hamming weight to the total Hamming weight register costs at most 4 log 2 n T gates.
Uncomputing the group Hamming weight after adding it to the total register (to continue on to the next group) costs no T gates; however, to uncompute the total Hamming weight we must subtract the Hamming weight of each group. Because we uncompute the group Hamming weights to free up the √ n ancillae, this subtraction requires us to sequentially recompute and subtract each group Hamming weight. For the final group added to the total, we do not need to recompute the group Hamming weight because we can subtract immediately after phasing the total Hamming weight, without uncomputation. Thus, we compute the Hamming weight of n/ √ n − 1 groups twice, at a total cost of 8 n/ √ n − 1 √ n − 1 T gates, and we compute the Hamming weight of the final group once at a cost of 4 √ n−1 T gates. We add and subtract the Hamming weight of each group to the total Hamming weight register once, at a total cost of 8 log 2 n n/ √ n T gates. All of these values are upper bounds on the cost. The full T-cost of the √ n -grouping method is then We can get a better understanding of this by simplifying using the inequality n/ √ n < √ n + 3/2 for n ∈ N >0 .
This shows that the number of T gates required to reduce the original n rotations to log 2 n + 1 rotations is (loosely) upper bounded by 8n + 8 √ n log 2 n + 12 log 2 n − 8.
The grouping method thus requires slightly more than double the 4n − 4 T gates of the method described in the previous section. On the other hand, it only requires √ n + log 2 n ancilla qubits, compared with the n required for that method. We can use the different methods presented in this section within the same circuit, depending on how many rotations we wish to combine at a given point. Say we have at most n = 500 arbitrary repeated rotations to combine in a circuit. Then we assign √ n + log 2 n = 30 ancilla qubits for combining those 500 rotations via Hamming weight phasing.
If at some other point in the circuit we have 50 rotations to combine, we can do so using the existing 30 ancillae and 4r n/(r + 1) = 4 × 24 × 2 = 192 T gates, reducing the number of arbitrary rotations to be synthesized from 50 to 10. Further improvements in the number of ancilla qubits required are possible using other grouping strategies, but are beyond the scope of this work. We refer the interested reader to work on pebble games (see, e.g., Chapter 10 of [85], or [86]).

Catalyzing two √ T gates using 5 T gates
In this subsection, we present an application of Hamming weight phasing discussed in [80]. We mentioned in Appendix A 1 that when the rotation angles are θ = mπ/2 k for integers k, m, it is most efficient to only combine until the rotations are T gates. These rotations appear in some of the circuits in the paper (specifically, in the FFFT for side lengths which are a power of two), and we present a circuit tailored to √ T and T √ T gates in this section. This circuit uses Hamming weight phasing with an ancilla qubit to catalyze √ T gates on two other qubits using only one Toffoli and one T gate, or 5 T gates, given the seed state √ T |+ . The same circuit can catalyze two √ T 3 = T √ T gates for the same fault-tolerant cost by changing the seed state and applying an S gate on the ancilla qubit in addition to the T gate. The circuit is constructed by moving the √ T gate of the standard adder-based Hamming weight phasing circuit through to the beginning to create the seed state √ T |+ . The seed state must be synthesized at the beginning of the entire computation at full cost, but can be used anywhere after that point-it is not consumed as we generate √ T gates. We show the catalysis circuit and its adder-based counterpart in Figure 4.
Circuit to catalyze two √ T gates using 5 T gates (left), and equivalent adder-based circuit (right). The left circuit is constructed from the right by moving the √ T gate through to the beginning using circuit identities. While the right circuit applies three √ T gates using a √ T and 5 T gates, the left circuit takes one of the √ T gates as input in the form of a seed state, and applies the other two √ T gates using 5 T gates. We use the "cornering" notation for temporary logical and of [60] (see Figure 5). The same circuit catalyzes two √ T 3 = T √ T gates if we replace the seed state with √ T 3 |+ and add an S gate on the ancilla qubit after the T gate.
We use the "cornering" notation of [60,80] for computing and uncomputing the logical and to indicate that the ancilla qubit is free for use elsewhere in the circuit outside that part of the computation. We review this notation in Figure 5 and present equivalent Clifford+T circuits.
FIG. 5. Circuits with cornering notation for computing and uncomputing logical and operations, together with the equivalent circuits in the Clifford+T basis [60]. The cornering notation indicates that the qubit is only needed during that section of the circuit. In Figure 4, the ancilla qubit is available for use before computing and after uncomputing the logical and. (a) Computing the logical and requires one Toffoli or 4 T gates; (b) uncomputing it requires none.

Appendix B: Trotter steps by fermionic swap network
We review in this appendix the fermionic swap network Trotter step developed in [29]. The idea for this Trotter step is that the Hamiltonian Eq. (1) can be simulated in linear depth in a network which changes the Jordan-Wigner ordering as it runs, such that the qubits corresponding to each spin-orbital are adjacent to the qubits corresponding to each other spin-orbital in the system at some point in the network.
The fermionic swap network does this by reversing the initial Jordan-Wigner ordering of the spin-orbitals with fermionic swaps using odd-even transposition sort (parallel bubble sort), simulating terms in the Hamiltonian that act on neighboring qubits as it goes. The sorting algorithm alternates between two different layers of swaps. In the first layer, each qubit in an odd-numbered position is fermionically swapped with the even-numbered qubit to its right. In the second layer, each qubit in an even-numbered position is fermionically swapped with the odd-numbered qubit to its right. This is shown for 5 qubits in the leftmost column of Table III. After the final set of swaps, the original Jordan-Wigner ordering has been reversed: because this reversal was performed using only nearest-neighbor operations, for each term in the Hamiltonian there must have been a point in the network where the spin-orbitals the term acts on were mapped to adjacent qubits. Now, suppose in a particular layer of the swap network that spin-orbital p (encoded by qubit i p in the Jordan-Wigner ordering) undergoes a fermionic swap with spin-orbital q (encoded by qubit i q ). Since all fermionic swaps occur between adjacent qubits, either i q = i p + 1 or i p = i q + 1. Thus, the fermionic operator V pq n p n q can be simulated by the qubit operators V pq (1 1 − Z ip − Z ip+1 + Z ip Z ip+1 )/4, and the fermionic operators T pq (a † p a q + a † q a p ) can be simulated by the qubit operators T pq ( This shows how to simulate all the terms in the Hamiltonian involving spin-orbitals p and q. We integrate these evolutions with the fermionic swap, and refer to the combined unitary as the "fermionic simulation gate" following [29]. The fermionic simulation gate performs three operations: it simulates evolution for time t under T pq (X ip X ip+1 + Canonical Ordering 1 2 (X1X2 + Y1Y2) 1 2 (X2X3 + Y2Y3) 1 2 (X3X4 + Y3Y4) 1 2 (X4X5 + Y4Y5) ϕ1ϕ2 ϕ3ϕ4 ϕ5 TABLE III. The five layers of swaps for the five spin-orbital fermionic swap network. The boxes in the leftmost column of each row indicate the pairs of qubits involved in a fermionic simulation gate in a given layer. The right columns show which kinetic energy fermion operators the different nearest-neighbor qubit operators correspond to at that layer. The final ordering is shown in the last row of the table for completeness. All 5 2 = 10 possible kinetic energy interactions appear at some point in the table, and are thus applied during the steps shown. The fermionic simulation gate also applies the potential energy interactions Vpqnpnq at that time. Thus, these five layers implement an entire Trotter step of the Hamiltonian. A slash through a cell indicates that no interaction is performed between that pair of qubits in the layer.
, and finally fermionically swaps the two spin-orbitals. In matrix form, this gate is We can see then that Table III actually depicts an entire first-order Trotter step if we interpret the boxes in the canonical ordering as the unitary F t (i p , i q ). If they are nonzero, the external potential terms U p n p can be simulated by applying single-qubit rotations after any layer of fermionic simulation gates.
The second-order (symmetric) Trotter step can be constructed from the Trotter step described so far by running it once forward, and symmetrizing by running it again in reverse. The spin-orbitals are restored to their original ordering by the second-order Trotter step. This Trotter step can simulate any Hamiltonian of the form Eq. (1), including the Fermi-Hubbard model, the uniform electron gas, and the electronic structure problem. For the Fermi-Hubbard model on an n × n lattice without periodic boundary conditions, we only need to run through the first O( √ n) layers of fermionic simulation gates [29].
With periodic boundary conditions, the full network must be run to ensure that interactions between the top and bottom rows of the lattice are simulated. For the electronic structure problem, the external potential terms U p n p arise from the charges of the nuclei ζ j . These nuclear charges are all that distinguish various molecules and materials from the uniform electron gas. However, because they contribute only to the terms U p n p , they enter the simulation only as a layer of single-qubit rotations appended to the Trotter step circuit.

Appendix C: Trotter steps of the split-operator algorithm
In this appendix, we review how to simulate Trotter steps of the Hamiltonian Eq. (1) using a variation of the split-operator algorithm introduced in [28]. The general principle of a split-operator algorithm is to divide ("split") the Hamiltonian into parts which can be more easily simulated separately. In this case, the algorithm separately simulates the kinetic energy operator in the momentum basis, and the potential energy operator in the position basis. We use an additional circuit to change back and forth between the position and momentum bases. In the momentum basis, the kinetic energy operator is diagonal, and moreover can be simulated using only single-qubit rotations. In the position basis, the potential energy operator is diagonal, containing only terms of the form Z p Z q . While this can be simulated using a swap network as in [28] or [29], we will instead develop a specialized network tailored to take full advantage of Hamming weight phasing.
We begin this appendix by describing how to construct the split-operator Trotter step. We then introduce our specialized network for taking full advantage of Hamming weight phasing when simulating the potential energy terms. Following that, we present the circuits used in the split-operator algorithm to change between the position and momentum bases. When the system side length is a power of 2, we use the fermionic fast Fourier transform (FFFT) as in [28,42]. We compute the T-cost of the FFFT using rotation catalysis as discussed in Appendix A 3. For other system side lengths, we use an alternative approach to diagonalizing the kinetic energy operator using Givens rotations [29,43]. While this second approach has fewer restrictions on the kinetic energy operators it can diagonalize than the FFFT, in the fault-tolerant setting, it is more costly because it uses more arbitrary rotations.

Constructing the split-operator Trotter step
We discuss here how to construct the split-operator Trotter step. Specifically, we will describe the circuit for the second-order (first-order symmetric) split-operator Trotter step.
Recall from Eq. (1) that we wish to simulate evolution under a fermionic Hamiltonian of the form In the context of the electronic structure problem, T pq , U p , and V pq are real numbers defined by integrals over molecular orbitals, arising from the kinetic energy, external potential, and electron-electron Coulomb repulsion, respectively. The algorithms we discuss can simulate arbitrary Hamiltonians of this form, not necessarily arising from any physical system. The creation and annihilation operators index the spin-orbitals through vectors p and q. This secondquantized Hamiltonian can be mapped by the Jordan-Wigner transform to a qubit Hamiltonian of the form where we have omitted the identity term because it is trivial to simulate. Here, X p , Y p , and Z p are the Pauli X, Y , and Z gates, respectively, acting on the qubit indexed by p in the Jordan-Wigner ordering. Z ri · · · Z r i+k is the string of Pauli Z gates between qubit p and qubit q in the Jordan-Wigner order, needed to preserve the appropriate commutation relations for the transformed fermionic creation and annihilation operators. The coefficientsT pq ,Ũ p , andṼ pq in the qubit Hamiltonian can be directly computed from the coefficients without tildes in the second-quantized Hamiltonian.
The split-operator Trotter step splits evolution e −iHt under the Hamiltonian H into separate evolutions under the kinetic and potential energy parts of the Hamiltonian. The reason this is useful is that the terms in the kinetic energy part of the Hamiltonian (those with coefficientsT pq ) can be diagonalized by an efficient circuit transformation C, leading to a sum of single-qubit terms, that is, The diagonalizing circuits C and C † change between the position and momentum bases using either the FFFT or Givens rotations. The idea of the split-operator method is that the Trotter-Suzuki approximation can be applied to divide evolution under the total Hamiltonian H into two evolutions: one under the kinetic energy terms that is e −iT t , and one under the potential energy terms that is e −i(U +V )t . Symmetrizing these evolutions to give the second-order Trotter step, we approximate time evolution under the full Hamiltonian by blocks of evolutions separated by the circuit C and its inverse C † , The equation above describes a single second-order Trotter step: the approximation can be refined by dividing into t such steps, each of time t/r.
In fact, we have two choices for how to symmetrize the split-operator Trotter step: as opposed to Eq. (C6), in which we approximate evolution for r Trotter steps as e −iHt ≈ (e −iT t/2r e −i(U +V )t/r e −iT t/2r ) r , we can instead approximate it as e −iHt ≈ (e −i(U +V )t/2r e −iT t/r e −i(U +V )t/2r ) r . Including the circuits for diagonalizing the kinetic energy operator, a single Trotter step of this form is At first glance, the number of gates required for the two Trotter steps might appear to be different. The number of terms in T is linear in the number of spin-orbitals, whereas the number of terms in U + V is quadratic in the number of spin-orbitals, which would lead us to expect that Eq. (C7)-in which e −i(U +V )t appears twice-might require many more gates than Eq. (C6), in which e −i(U +V )t only appears once. On the other hand, C or C † appear four times in Eq. (C6) but only twice in Eq. (C7), which might lead us to believe that Eq. (C6) could use more gates. However, this intuition is deceptive: so long as the number of Trotter steps r is large, since the beginning and end of both split-operator steps are the same and hence can be merged together, the difference in the costs is negligible. The result of this merging is that the cost of a simulation with r Trotter steps is nearly identical for the two competing orderings of terms. Only in the final Trotter step do the two orderings differ in cost: Eq. (C6) applies C one more time and e −i(U +V )t one fewer time than Eq. (C7). Both variants apply C, C † , e −iT t , and e −i(U +V )t once in each of the remaining r − 1 Trotter steps, far outweighing this final difference.
Rather, it is differences in the number of Trotter steps r required for accurate simulation-dictated by the Trotter error of the two steps-that dominate the difference in cost. For all systems considered, we construct the Trotter step depending on which split-operator step yields the smaller Trotter error. For the system parameters chosen, simulating V followed by T yields a smaller Trotter error norm for the uniform electron gas with Wigner-Seitz radius r s 1, whereas T followed by V yields a smaller Trotter error for the Hubbard model and the uniform electron gas with Wigner-Seitz radius r s 1. There is little significance to this beyond the relative sizes of the norms T and V (see Eq. (G2), Eq. (G3), and the following discussion).
Implementing the different circuits in the split-operator steps is straightforward. Each term e −iTpZpt , e −iŨpZpt , or e −iṼpqZpZqt requires one arbitrary rotation, whether we implement time evolution, or the evolution directionally controlled on an ancilla |c |ψ → |c e −iHt(−1) c |ψ that we shall use for phase estimation (Appendix F). Thus, with N spin-orbitals, at most N (N + 3)/2 arbitrary rotations are needed to simulate all the terms in the Hamiltonian, and the Trotter step circuit can be completed with one application each of the basis change circuit C and its inverse C † . We address two remaining questions in this appendix: first, how to order the simulation of the terms in pq e −iṼpqZpZqt so as to take full advantage of Hamming weight phasing, thereby reducing the number of arbitrary rotations required; and second, how to construct the circuit C using either the FFFT or Givens rotations.

Reducing the cost of simulating the potential energy operator using Hamming weight phasing
In this subsection we present a technique for reducing the number of arbitrary rotations needed to implement evolution under the interaction part of the potential energy operator pq e −iṼpqZpZqt in the split-operator Trotter step. Evolution under the terms p e −iŨpZpt can be done using only a single arbitrary rotation on each qubit. By comparison, pq e −iṼpqZpZqt in general appears to require O(N 2 ) arbitrary rotations for N qubits. We will show here how to use Hamming weight phasing to reduce this cost for the electronic structure problem to O(N log N ) arbitrary rotations, at a cost of (N − 1)(N − 2) additional Toffoli or 4(N − 1)(N − 2) additional T gates. Because the number of T gates needed to synthesize each arbitrary rotation is much larger than 4, this greatly reduces the fault-tolerant cost of simulating these terms.
The idea behind our technique is that in the basis of [28], the coefficientsṼ p,q are translation-invariant for the electronic structure Hamiltonian, that is, for any index vector s, V p,q =Ṽ p+s,q+s . (C8) We will harness this property so that we can apply Hamming weight phasing to groups of angles that are as large as possible. Hamming weight phasing allows M parallel rotations by the same angle to be reduced to log 2 M + 1 rotations by different angles, and 4M − 4 additional T gates (alternatively, M − 1 additional Toffoli gates), using M − 1 ancilla qubits. We can fully make use of Hamming weight phasing if we can order the terms e −iṼpqZpZqt in the simulation such that equiangular rotations are grouped together as much as possible.
We do this by performing the evolution pq e −iṼpqZpZqt in layers such that the value of everyṼ pq simulated within a given layer is the same. In this case, the arbitrary rotations used in each layer are all by the same angle, allowing them to be fully combined using Hamming weight phasing. We accomplish this as follows. We choose the interactions simulated in each layer according to the set of pairs of qubits (p, p + s) with fixed s, with the index vector p running over all qubits. The translation-invariance ofṼ p,q guarantees, for all qubit indices p, that V 0,s =Ṽ p,p+s . (C9) We first consider the case where the number of qubits N is even; when N is odd the layers differ only slightly from the even N case. For N qubits (with N even), there are N such pairs of qubits (p, p + s), given by the N − 1 different vectors s on the grid. However, because each of these interactions is between two qubits, at most N/2 of them can be performed simultaneously in a layer. We show a diagram of the two layers used for s = (0, 1) in Figure 6 for the case of 16 qubits on a 4 × 4 grid. Note that the vector indices on the qubits are evaluated modulo 4 for the 4 × 4 grid. In each layer, we are able to simulate N/2 interactions of equal strength, yielding N/2 equiangular rotations combinable using Hamming weight phasing. In general, there are N − 1 different index vectors s that must be iterated over. (The final vector, s = 0, does not appear because it corresponds to a qubit interacting with itself.) Each index vector is applied in two layers, leading to 2N − 2 layers each simulating N/2 interactions. In each of the 2N − 2 layers, Hamming weight phasing can be applied over every simulated term e −iṼp,p+sZp+sZst , because each coefficientṼ p,p+s is the same, and hence each of the N/2 rotation angles is the same.
This allows Hamming weight phasing to be applied over every rotation in each layer, reducing the number of arbitrary rotations required per layer from N/2 to log 2 (N/2) + 1 , at a cost of an additional 2N − 4 T gates or N/2 − 1 Toffoli gates. Over all the 2N − 2 layers, the T-cost to simulate all the two-qubit terms in a step of Trotter evolution is where T synth = O(log(1/ )) is the number of T gates to synthesize each arbitrary rotation, with the required synthesis precision. When the number of qubits N is odd, we instead must simulate each of the N − 1 index vectors using three rather than two layers. Because the number of qubits is odd, one qubit is left out in each layer as compared to the even case-its partner is already matched with another qubit. As a result there are 2N − 2 layers of (N − 1)/2 equiangular rotations and N − 1 layers with a lone rotation in the simulation. We use Hamming weight phasing to reduce the (N − 1)/2 equiangular rotations per layer to log 2 (N − 1)/2 + 1 at a cost of an additional (N − 1)/2 Toffoli or 2N − 2 T gates. Over all the layers, the T-cost to simulate all the two-qubit terms in a step of Trotter evolution is for the case of odd N . For both even and odd N , we reduce the cost of the arbitrary rotations to O(N log N log(1/ )) using at most 4(N − 1) 2 additional T gates or (N − 1) 2 Toffoli gates. This is to be compared with the O(N 2 log(1/ )) cost were we to not apply Hamming weight phasing, or were the interaction terms not translation-invariant. This improvement applies to any quantum simulation of electronic structure using the basis of [28], or more generally to any simulation of a two-qubit interaction with the translation-invariance property Eq. (C8).

The fast fermionic Fourier transform
The fermionic fast Fourier transform (FFFT) is one way of implementing the circuit C which diagonalizes the kinetic energy operator. In this appendix, we define the FFFT, present circuits for it, and compute the T-cost of using it to change from the position to momentum bases using rotation catalysis.
The effect of the fermionic Fourier transform is a single-particle rotation with r p = p(2Ω/N ) 1/d and k ν = 2πν/Ω 1/d the centroid of the orbital given by the index vector p, and k ν a momentum mode as in Section II. The circuit for the FFFT is composed of two types of local gates. The first is the fermionic swap gate, which has the effect of exchanging two orbitals, while maintaining proper anti-symmetrization. Under the Jordan-Wigner transform, the fermionic swap gate f swap has the matrix representation i.e. it swaps two qubits and applies a sign −1 to the |11 state. The second gate used for the FFFT has the matrix representation where n is the number of qubits the FFFT acts on, and k is an integer. This has the effect of a rotation on the first qubit followed by a Hadamard transform in the single-particle subspace. Our notation for this gate is slightly different from other works discussing the FFFT [28,42] which denote the same gate by F k (with the number of qubits n left implicit). In this sense our presentation is closer to that in [71]. The FFFT on n qubits can be implemented recursively in depth O(n log n) using only the fermionic swap gate f swap and the gates F k,n for integer k [28]. This recursive structure constrains the FFFT to operate on systems whose side lengths are powers of 2. The strategy for this recursive circuit parallels the recursive structure of the classical fast Fourier transform, and can be broken into four stages.  We present the circuit for the 8-qubit FFFT on a nearest neighbor architecture in Figure 7, emphasizing the different stages in the circuit. Stage 3 is only a single layer of gates and we merge it into stage 2.
So far, we have discussed the structure of the FFFT circuit. Next we will compute the number of T gates required to implement it to determine the cost of the FFFT on a fault-tolerant quantum computer. We do this by presenting explicit circuits for the gate F k,n , and counting how many of them appear in the circuit.
The recursive structure of the n-qubit FFFT (where n is necessarily a power of two) means that it uses n/2 F k,n gates in log 2 n recursive layers, yielding (n log 2 n)/2 F k,n gates in total. The second gate used in the circuit, the fermionic swap gate, appears n(n − 2)/4 times directly in the interleave and de-interleave steps of the n-qubit FFFT, as well as in the interleave/de-interleave steps of the two recursive n/2-qubit FFFTs. In total, this is f swap gates. In fact, f swap gates are free in this model, requiring only Clifford gates, and we will not discuss them further here. By comparison, any fault-tolerant circuit construction for F k,n will require T gates. We present one such construction in Figure 8-note that T † = ZST . F k,n requires two T gates when 4k/n is an integer, three T gates if instead 8k/n is an integer, and two T gates and an arbitrary rotation otherwise. In the particular case that 16k/n is an integer, the arbitrary rotation is a √ T gate. Recall from Appendix A 3 that provided that √ T gates appear two at a time in the circuit, we can implement them using only 2.5 T gates each.
Fault-tolerant circuit for the two-qubit fermionic Fourier transform gate F k,n in the n-mode FFFT, up to global phase. The T and S gates are T = Rz(π/4) and S = T 2 . If 4k/n is an integer, the rotation at the beginning is at most an S gate; if instead 8k/n is an integer, it is an additional (third) T gate. The largest case we consider is applying the FFFT to a system with side length n = 16: in that case, √ T gates can appear in the rotation. These can be synthesized in pairs using only 5 T gates per pair, rather than the cost of a full arbitrary rotation.
This means that the T-cost of the FFFT circuit is 8 for the 4-qubit case, 26 for the 8-qubit case, and 71 T gates + 4 √ T gates = 81 T gates for the 16-qubit case, not accounting for the cost of preparing the seed √ T |+ state. Next, we will shift our attention to the FFFT in multiple dimensions. Following that, we will show how to minimize the T-cost of the 2D FFFT using rotation catalysis, when the system being Fourier transformed is large enough that the FFFT has non-trivial rotations (rotations which must be synthesized).

a. Changing basis in multiple dimensions
We use the fermionic Fourier transform or Givens rotations to implement the circuit C to diagonalize the kinetic energy terms in the Hamiltonian in the split-operator algorithm. Earlier in this section, we discussed the number of arbitrary rotations required for the fast fermionic Fourier transform on n qubits in one dimension. For a d-dimensional system, however, the 1D FFFT circuit must be applied along each of the system dimensions. Though we have not yet discussed it, this also holds if we implement C using Givens rotations, in the sense that whether we use the FFFT or Givens rotations, the d-dimensional basis change is composed of the same pattern of 1D basis changes. The discussion in this section applies identically to both approaches.
In each dimension, the 1D basis change must be applied once for each value of the other coordinates. For example, for a hypercube with side length n in d dimensions, transforming along one dimension requires n d−1 applications of the 1D basis change. Repeating this once for each dimension, the 1D basis changing circuit must be applied dn d−1 times, in d layers of n d−1 basis changing circuits.
More concretely, for a 4 × 4 × 4 grid, we apply 4-qubit FFFT is applied 16 times to Fourier transform the x dimension. The 1D 4-qubit FFFT is applied to the 4 qubits with fixed (y, z) and varying x coordinate. There are 16 pairs (y, z) so the 4-qubit FFFT is applied 16 times. This applies the fermionic Fourier transform in the x dimension: the procedure is repeated for the y dimension (to each of the 16 groups of 4 qubits with the same x and z coordinates, with the y coordinate varying), and again for the z dimension (to groups with the x and y coordinates now fixed, and the z coordinate varying). In total, the 4-qubit FFFT is applied dn d−1 = 48 times for this system.
For a system with spin, there are still d parallel stages in which the 1D basis changing circuit is applied multiple times. However, in each stage, the 1D circuit is applied 2n d−1 times, twice as many as in the corresponding spinless system. This doubling occurs because of the two spin sectors.
In general, this means that the 1D basis transformation circuits are applied many times in parallel, whether we use the FFFT or Givens rotations. Because the Givens rotation circuits contain arbitrary rotations, this allows us to apply Hamming weight phasing (Appendix A) to reduce the number of these arbitrary rotations which must be synthesized. For the FFFT, because the rotations are always of the form R z (2πk/n) with n a power of two, Hamming weight phasing is less efficient. Instead, as we discuss in the next section, we can use rotation catalysis as in Appendix A 3 to apply e.g.
√ T and T √ T gates using only T gates, without the usual costs of rotation synthesis. We discuss this in the next subsection for the case of the 16 × 16 grid, both with and without spin.

b. Rotation catalysis and the 2D FFFT
Of the systems we study, there is only one grid size for which rotation catalysis (see Appendix A 3) is beneficial for the FFFT. This is because the FFFT only has a non-trivial rotation-a rotation which is not entirely composed of T, S, and Z gates-for systems with side lengths greater than 8. Given that the FFFT can only be applied when the side length is a power of 2, the first such side length is 16. At this size, the FFFT circuit has several √ T and √ T 3 = T √ T gates. The 16 × 16 grid appears for the largest spinless uniform electron gas system we consider, and with spin for the Fermi-Hubbard model.
Fourier transforming with side length n = 16 in d = 2 dimensions applies the 16-qubit (1D) FFFT 32 times without spin, and 64 times with spin. This represents two layers each of 16 (32) parallel applications of the 1D circuit without (with) spin. For the 16-qubit FFFT, the rotation R z (2πk/n) in the gate F k, 16 T} gate for k ∈ {1, 3, 5, 7}, respectively. Ignoring the Clifford gates, the 16-qubit FFFT uses two √ T and two T √ T gates. Recall from Figure 8 that every F k,n gate costs two T gates in addition to these, and that there are (n log 2 n)/2 = 32 F k,n gates in the 16-qubit FFFT. Further, for k = 2 and k = 6, the additional rotation in F k,16 contains a third T gate, and F 1,8 and F 3,8 contribute another 4 T gates (these gates appear twice, since the 8-qubit FFFT appears twice in the recursive circuit for the 16-qubit FFFT). There are thus 70 T gates in the 16-qubit FFFT.
Because the 16-qubit FFFT is applied 32 (64) times without (with) spin, changing between the position and momentum basis using it would appear to require us to distill 2240 (4480) T gates and additionally synthesize 128 (256) √ T gates. Instead, by using rotation catalysis, we can reduce this to two arbitrary rotations to create the seed states at the beginning of the computation, regardless of how many split-operator Trotter steps we apply. We then use these seed states to catalyze all the √ T and √ T 3 = T √ T gates at a cost of half a Toffoli and half a T gate each, rather than the ∼ 40-50 T gates per rotation required for synthesis. The only downside is the need for three ancilla qubits: two to store the seed states √ T |+ and T √ T |+ , and one to temporarily compute logical ands as in Figure 4. In total, converting between the position and momentum bases using the 2D fermionic Fourier transform on a 16 × 16 grid requires 2304 T and 64 Toffoli gates without spin and 4608 T and 128 Toffoli gates with spin. Because the other cases for which we use the FFFT (systems with side length 4 and 8) do not require arbitrary rotations in the FFFT, there is no advantage to these catalysis techniques. Finally, for side lengths that are not a power of two, we cannot construct the diagonalizing circuit C using the above FFFT construction. For these cases, we instead diagonalize using Givens rotations. We elaborate upon this diagonalization process in the next subsection.

Diagonalizing the kinetic energy operator using Givens rotations
Whereas the FFFT allows transformations of the fermionic mode operators based on the Fourier transform, Givens rotations allow arbitrary unitary transformations to be performed on the mode operators [29] for unitary U , with a † i and a i the initial creation and annihilation operators, respectively. This means we can construct a basis change circuit C capable of diagonalizing any number-conserving quadratic Hamiltonian pq T pq a † p a q . By comparison, implementations of the diagonalizing circuit C using the FFFT are limited to the case whereã † i is the Fourier transform of a † i ; the recursive structure (radix-2 decimation) further restricts the FFFT to act on a number of qubits n which is a power of 2. Other radices in the fermionic Fourier transform are possible in principle (and in fact their recursive base cases could be implemented using Givens rotations), but the corresponding circuits require more rotation gates than the radix-2 FFFT. Whereas the most challenging gates in the FFFT are rotation gates R z (2πk/n) with n a power of 2, as we shall see the majority of the rotation gates used when we construct C with Givens rotations are by arbitrary angles.
The reasoning behind the diagonalization of the kinetic energy operator using Givens rotations is as follows. By the Thouless theorem, applying the single-particle (n × n) unitary transformation U is equivalent to applying the 2 n × 2 n unitary This unitary can be implemented using a sequence of rotations of the form [29] The action of R ij (θ) is a Givens rotation by θ in the single-particle subspace. Appropriate choices of the angles θ allows us to diagonalize the matrix U , leaving n phases along the diagonal. The unitary U can be applied by applying these n phases to the qubit corresponding to that spin-orbital, followed by the generated sequence of Givens rotations in reverse. For n orbitals, n 2 Givens rotations are sufficient for each spin sector. The spinless case thus requires exactly n(n − 1)/2 Givens rotations, whereas for the case with spin n(n − 1) Givens rotations are required [29].
We show a fault-tolerant circuit implementing R ij (θ) between neighboring qubits in Figure 9. Each Givens rotation R ij (θ) requires 2 arbitrary rotations, so the total number of arbitrary rotations required per spin sector is n(n − 1). 9. Fault-tolerant circuit for Rij(θ) = exp[θij(a † i aj − a † j ai)] on neighboring qubits, corresponding to a Givens rotation in the single-particle subspace. Two arbitrary rotations Rz(−θ) are required. Because these rotations are in parallel, when the same Givens rotation is applied simultaneously many times (i.e. when changing bases in multiple dimensions) they can all be combined using Hamming weight phasing. For n orbitals, the circuit must be applied n(n − 1)/2 times in each spin sector.
In total, without spin, the Givens rotations require n(n − 1) arbitrary rotations for any single-particle basis change using this procedure, while with spin 2n(n − 1) are needed. The diagonal phases require n arbitrary rotations without spin and 2n arbitrary rotations with spin.
For diagonalizing the kinetic energy operator of a system with side length n in d dimensions, the discussion of Appendix C 3 a exactly carries over. The 1D (n-qubit) basis change circuit is applied several times in parallel, in d separate stages. Within each of the d stages, the 1D circuit is applied n d−1 times in parallel for the spinless system and 2n d−1 times in parallel for the spinful system. Hamming weight phasing can be applied across all these parallel circuits to reduce the number of arbitrary rotations.

Appendix D: Costs per Trotter step
The number of T gates and arbitrary rotations per Trotter step figure significantly in the total costs of the two algorithms. For the fermionic swap network-based Trotter step, no T gates are directly required: instead, each application of the fermionic simulation gate requires up to 4 arbitrary rotations which must be compiled into T gates. On the other hand, for the split-operator algorithm, changing basis using either the FFFT or Givens rotations may itself require T gates and arbitrary rotations. While the split-operator algorithm is in the momentum basis, evolution under the entire kinetic energy part of the Hamiltonian can be implemented by an arbitrary rotation on each qubit. While the algorithm is in the position basis, each interaction between spin-orbitals requires an arbitrary rotation, as well as an additional arbitrary rotation on each qubit to implement evolution under the single-qubit Z terms.
In this section, we discuss how to count the number of T gates and arbitrary rotations required per Trotter step for both the fermionic swap-based Trotter step and the split-operator Trotter step for all systems discussed in the paper.

The fermionic swap network
For the fermionic swap network simulating the uniform electron gas (jellium), the second-order (symmetric) Trotter step requires 4(N − 1) 2 arbitrary rotations for N qubits in the spinless case. Each application of the fermionic swap gate requires 4 arbitrary rotations (see Figure 11), and the fermionic swap gate is applied 2 N 2 = N (N − 1) times, twice for each pair of orbitals because of the symmetric step. This would imply a cost at most 4N (N − 1) arbitrary rotations. However, we can slightly reduce this cost by noting that (i) the arbitrary rotations in the first and final layers of the fermionic swap network can be merged in the symmetric step, and (ii) for the case with spin, many interactions do not occur between electrons in different spin sectors. If there are an equal number of interactions between same-spin and opposite-spin sectors at the start and end of the Trotter step, there are 3N − 4 arbitrary rotation gates total in the first and last layer of the step which may be merged.
The first of these observations allows us to eliminate 4(N − 1) arbitrary rotations from the fermionic swap network Trotter step. This reduces the number of arbitrary rotations to 4(N − 1) 2 for the case of the uniform electron gas without spin. In the case of jellium with spin, the key for the second observation is that the kinetic energy terms T pq are zero if σ p = σ q , i.e. if p and q belong to different spin sectors. In this case, F t (i p , i q ) only requires 2 arbitrary rotations. Only considering the potential energy terms, the fermionic swap network requires 2N (N − 1) arbitrary rotations. The kinetic energy only acts within the same spin sector. There are N/2 2 = N 2 /8 − N/4 pairs of orbitals within each spin sector requiring 2 arbitrary rotations for the kinetic energy terms, two spin sectors, and the symmetric Trotter step applies the corresponding rotations twice. In total then, the kinetic energy operator requires at most N 2 − 2N arbitrary rotations. Combining this with the 2N 2 − 2N arbitrary rotations for the potential energy operator, the fermionic swap network requires at most 3N 2 − 4N arbitrary rotations for N qubits for the system with spin. Using the improvement of the first observation, there are at most (N − 1)(3N − 4) arbitrary rotations to perform. Materials are nearly identical to jellium, but add up to N arbitrary rotations (one on each qubit) to the Trotter step.
For simulating the Hubbard model with spin, there are N/2 on-site interactions in the Hamiltonian (one between the two spin sectors for each of the N/2 spatial orbitals), and 2N hopping terms (two for each of the N spin-orbitals in the system). Each on-site interaction and each hopping term requires two arbitrary rotations to simulate. Symmetrizing for the second-order Trotter step adds an additional 4N arbitrary rotations for the hopping terms; however, the interaction terms in the symmetric step can be combined. Totaling these costs, at most 9N arbitrary rotations are required to simulate the Hubbard model.

The split-operator Trotter step
The costs for the split-operator Trotter step for the uniform electron gas and Hubbard model are as follows. For both systems, applying the kinetic energy operator in the momentum basis requires 2N arbitrary rotations, two for each qubit for the symmetric Trotter step. Because all terms in the potential energy operator commute, each rotation can be performed just once even for the symmetric step: N 2 + N arbitrary rotations are required at most. While the uniform electron gas saturates this requirement, the Hubbard model only needs 3N/2 arbitrary rotations for the potential. As discussed in Appendix C, because we work with a large number of Trotter steps, the cost of simulating terms in the symmetric split-operator Trotter step is approximately the sum of the cost of simulating the kinetic terms and the cost of simulating the potential terms. (For a single symmetric split-operator step, either the kinetic or the potential terms would be simulated twice, i.e. exp(−iHt) ≈ exp(−iT t/2) exp(−iV t) exp(−iT t/2) or exp(−iHt) ≈ exp(−iV t/2) exp(−iT t) exp(−iV t/2): when the number of steps is large, we can merge the halved term with the same term in the next Trotter step. We are thus neglecting in our counts the cost of the final simulation term, which cannot be merged in this way, but this is negligible when the number of Trotter steps is large.) In total, then, implementing the terms in the uniform electron gas Hamiltonian requires N 2 + 3N arbitrary rotations while the Hubbard model requires 7N/2 arbitrary rotations.
We have not yet discussed the cost of changing bases. Changing between the position and momentum bases represents an additional cost for the split-operator Trotter step. Recall from Appendix C that the Fourier transform must be applied dM d−1 times to transform a spinless system of side length M in d dimensions, and 2dM d−1 times for the same system with spin. When the side length is a power of two, the FFFT can be used, with a cost of 8 T gates for the 4-qubit case, 26 for the 8-qubit case, and 54 T gates and 4 arbitrary rotations for the 16-qubit case. Otherwise, the Givens rotation-based basis change algorithm must be used [29]. The basis change algorithm on M qubits requires 2 M 2 arbitrary rotations. In either case, the rotation from the position to the momentum basis and its reverse must be applied twice for the symmetric Trotter step.
Recall from Appendix A that Hamming weight phasing may be used to reduce the number of arbitrary rotations required for changing the basis, whether the FFFT or Givens rotation algorithm is used. The basis must be applied twice in the symmetric split-operator Trotter step, i.e. each of the following costs will be doubled in our final accounting.

Gate counts per Trotter step with Hamming weight phasing
The Trotter step gate counts we use are actually slightly smaller than those discussed in the previous two subsections. There are three reasons for this, which we describe in order of increasing complexity. First, we do not include zerorotations. Second, for both the fermionic swap and split-operator Trotter steps, we neglect the cost of the final term in the simulation (the final layer, for the fermionic swap Trotter step), since those arbitrary rotations can be merged with the first term or layer of the next Trotter step. Third, we use Hamming weight phasing with a fixed number of ancilla qubits to combine equiangular rotations where possible. We discuss these aspects further in Section II.

Appendix E: Improved second-order Trotter-Suzuki error bounds
Bounds on the error in Trotterized evolution typically assume that the evolution time t is small. Here, we present a tighter bound on the second-order Trotter error than that in Appendix B of [9], which, similar to the proof in that paper, does not require t to be small. Broadly, our proof is quite similar overall to that in [9], and we adopt similar notation so as to make it easier to identify where they differ. It has been pointed out that the bound described here can in principle be reached from the work in [46] using slightly different methods.
The second-order Trotter formula for e −iHt is given by Eq. (5), for the Hamiltonian H = L =1 H . We first consider the error when H is a sum of only two terms. We write Ht = A + B, and let H(x) = B + (1 − x)A. Then the error in the Trotter formula for A and B can be written as In bounding this, [9] use Additionally, A exp(−iH(x)) = exp(−iH(x))A(1, x). That then gives the expression for the derivative Next, using A(0, x) = A and where A and A are the respectively the first and second derivatives of A(u, x) with respect to u, we have that the terms in parentheses in Eq. (E5) are equivalent to Evaluating the derivatives gives and Substituting in the expression for H(x) gives Combining these results in Eq. (E5), we therefore have This has given a tighter bound by a factor of 1/12 compared to that given in [9], which matches the factor from a BCH expansion. Lastly, integrating over x gives Next, one can use the iterative approach from [9]. Let Using the triangle inequality Noting that U 1 = U and U T S 1 = U T S , and U L = U T S L , we have Now, using the bounds above with Lastly, since we are using the evolution to measure the energy, we need to bound that in terms of the difference in the unitaries. The difference in the eigenvalues of unitaries may be upper bounded as [87] Appendix F: Phase estimation circuit primitives from a Trotter step Phase estimation typically is assumed to use a controlled unitary rotation, however, [43] showed that phase estimation can be made more efficient by eschewing controlled quantum evolutions in exchange for circuits that control the direction of the time evolution. That is to say, we want a circuit such that |c |ψ → |c e −iHt(−1) c |ψ . We build such a circuit by exploiting the structure of the Trotter decomposition. If the symmetric Trotter formula is used then we can simulate the inverse time evolution by flipping the sign of each evolution in the decomposition. That is to say, if This does not hold for non-symmetric splittings such as the lowest-order Trotter formula. Thus, we simply need to find a circuit for controllably reversing the direction of each H j . Such circuits are simple to produce at no extra cost (compared to the Trotter step circuits) for the potential terms; this allows implementation of the directionally controlled circuit for the split-operator Trotter step. For that case, the directionally-controlled circuits can be constructed by adding a controlled-not gate (controlled on the phase estimation ancilla, targeting the qubit in question) on either side of each of the arbitrary-angle R z gates in the circuit. We show the directionally-controlled version of V t , the circuit for simulating the interaction terms −V pq Z p Z q /4, in Figure 10. The number of arbitrary rotations and T gates for the directionally-controlled circuit is thus the same as for the bare Trotter step. However, integrating the kinetic energy terms between electrons in a constant spin manifold into this, and hence constructing the directionally-controlled fermionic swap Trotter step, is more challenging.
FIG. 11. How to implement simultaneous F operations followed by fermionic swaps. Note that the phasing operation on the control has negligible cost. Although it may appear that there are O(M k · n 2 ) phasing operations applied to the control in the course of evaluating a Trotter step, all of the control phasing operations can be merged into a single phasing operation.
The simulation circuit in Figure 11 achieves this directional control for the fermionic swap network. It follows from the circuit for F t (i p , i q ) after noting that conjugating R z gates with not gates flips the direction of rotation. The result given in the circuit follows from this observation and elementary identities for simplifying chains of cnot gates. Importantly, the number of arbitrary rotations required for the circuit is identical to that of the corresponding uncontrolled circuit.
On the other hand, when the potential is simulated first, the resulting Trotter error norm is Depending on the norms of T and V this can yield different bounds on the Trotter error. So long as the number of Trotter steps r is large, the number of gates in the simulation does not depend significantly on whether T or V is simulated first, and the user can freely choose the case yielding the smaller error. The fermionic swap network is less flexible in this regard. Given these ordered sums, we then compute the double commutators, and total them to find the Trotter error norm. The order in the sums is specified by the Trotter step. Computing the error norms for larger sizes requires significant computational resources. For example, the Hamiltonian for one of the larger system sizes we study for the uniform electron gas (12 × 12 with spin) has 47,952 terms in the Hamiltonian and 44,496 terms in the Trotter ordering for the fermionic swap network.
This means that computing W requires the evaluation of as many as 10 14 double commutators. We partially ameliorate this by performing preprocessing to determine which double commutators [[H b , H c ], H a ] are trivially zero and do not need to be evaluated. The Trotter error norm for the larger systems can require significant memory to load into memory (> 100 GB). We work around this issue by parallelizing computation of across orthogonal components as follows. Writing the error norm as a sum of Pauli strings W = i a i P i , let the smallest qubit index P i acts on be q i . Then we can compute the sum of |o i | for which q i is some fixed value (i.e., the triangle inequality bound for a given qubit index), and the sum over q i of these values is equal to the full Trotter error norm. This allows us to control the maximum memory requirement of any partial computation.
Trotter error norms in units of cubed Hartree (Ha 3 ) are included for the uniform electron gas in 2D in Table IV and in 3D in Table V. We additionally plot the Trotter error norm for these systems in Figure 12. Table VI shows Trotter error norm data for the Hubbard model in 2D (with spin). The same data is plotted in Figure 13 with a power law fit. We include the same data for a range of Wigner-Seitz radii in 3D in Table VII and for a series of materials in  Table VIII. For the materials, we include in the table the Hartree-Fock energies computed in PySCF [88] which we use for our relative precision energy target.
All numerics (for all systems and for the different Trotter steps) were performed using code which we have contributed to the open-source package OpenFermion [76]. TABLE V. Uniform electron gas (UEG) Trotter error data, for both the fermionic swap (FS) and split-operator (SO) Trotter steps, at Wigner-Seitz radius rs = 10 in 3D with varying system side length. WX denotes the Trotter error norm for algorithm X. Hartree-Fock energies EHF serve as an upper bound on the true ground state energies for the systems considered. Energies are in units of Hartree (Ha), Trotter error norms are in cubed Hartree (Ha 3 ).

Appendix H: T-count minimization
Continuing from the discussion in Section III, we present in this appendix our final numerics to compute the total number of T and Toffoli gates required for our simulations. We wish to minimize the total number of T and Toffoli gates needed to determine the ground state energy to precision ∆E ≥ ∆E T S +∆E P E +∆E HT using phase estimation. ∆E T S accounts for Trotter-Suzuki errors, ∆E P E for phase estimation errors, and ∆E HT for circuit synthesis errors. Throughout, we convert Toffoli to T gates at a cost of 2 T gates per Toffoli gate [80].
The tradeoff between these error sources is non-trivial because the T-costs differ significantly as a function of the desired accuracy for each type of error. We discuss how to compute the T-cost as a function of these errors, and give the results of its numerical minimization subject to Eq. (11) for a variety of system sizes, in Section III B. Our full objective function there was given in Eq. (12) where W is the Trotter error norm, N r is the number of arbitrary rotations per Trotter step, and N d is the number of "direct" T and Toffoli gates (those T/Toffoli gates which appear in the circuit before synthesis) multiplied by N P E . Data on the Trotter error norms W for all systems and algorithms are given in Appendix G, and our approach to computing the numbers of arbitrary rotations N r and T/Toffoli gates which appear in the circuit before compiling arbitrary rotations ("direct" T/Toffoli gates) N d is discussed in Section II as well as in Appendix D. We work with two total precision targets ∆E for all the systems we analyze. In the first, we consider a relative precision, with the total precision ∆E set as a fraction (0.5%) of the ground state energy. This value is comparable to errors associated with the sign problem in quantum Monte Carlo [81,82]. The precision ∆E in this case scales with the system size. In the second case, we consider an absolute precision target set depending on the type of system, but independently of the particular system size.
For the relative precision case for the jellium systems, since we cannot directly compute the ground state energy, we take ∆E to be 0.5% of the system energy, computed as the number of electrons multiplied by the energy per electron using the correlation energy estimates of [89]. In atomic units, the energy per electron takes the form quantum Monte Carlo simulations in the low-density regime, and lower-bounds the magnitude of the ferromagnetic energy in the high-density regime. We then take ∆E to be 0.5% of this value, after multiplying by the number of electrons (η in the expression above). We use these energies in computing the T-cost for the uniform electron gas at a Wigner-Seitz radius of 10 Bohr radii in Figure 1, as well as for a range of Wigner-Seitz radii below in Figure 14. For the material systems, we take as our relative precision target the Hartree-Fock energy E HF (given in Appendix G). We compute these using PySCF [88]. The true ground state energy is E 0 ≤Ẽ 0 = E HF in general, and E HF < 0 for all material systems considered, so choosing ∆E = 0.005 E HF ensures that the precision better than 0.5% of the magnitude of the ground state energy. Loose but rigorous upper bound on the number of T gates in the circuit which performs our Trotterized phase estimation simulation of the 3D uniform electron gas, for a range of values of the Wigner-Seitz radius, to (a) within half a percent of the total system energy and to (b) absolute precision ∆E = 0.0016 Hartree. The colors of the rainbow (red through violet) correspond to logarithmically spaced Wigner-Seitz radii from 0.5 to 32 Bohr radii (0.5, 1, 2, . . ., 16,32). Lines show power law fits to the data. 14 ancilla qubits are used for Hamming weight phasing for all systems.
Finally, for our relative precision target for the Hubbard model, we choose 1.02 as an upper bound on the magnitude of the energy per site at U/τ = 4 and 0.74 as an upper bound on the magnitude of the energy per site at U/τ = 8, following Table V of [39]; again, this ensures that the precision tolerance is smaller than 0.5% of the total ground state energy. For our absolute precision targets, we take ∆E = 0.0016 Hartree for jellium and the material systems (this precision is known as "chemical accuracy"), and ∆E = τ /100 for the Hubbard model.
A possible criticism of the results of the optimization for the Hubbard model is that the number of Trotter steps decreases with system size in the relative precision case. Because of this, the scalings we present in Figure 2a do not hold in the asymptotic limit; rather, they only apply for a finite range of system sizes. We perform a power-law fit TABLE VII. Uniform electron gas (UEG) Trotter error data, for both the fermionic swap (FS) and split-operator (SO) Trotter steps, for a range of systems in 3D with varying Wigner-Seitz radius and side length. WX denotes the Trotter error norm for algorithm X. Hartree-Fock energies EHF serve as an upper bound on the true ground state energies for the systems considered. Energies are in units of Hartree (Ha), Trotter error norms are in cubed Hartree (Ha 3 ). For those systems for which EHF is positive, it is either exact or sufficiently large such that it is still a good proxy for E0 for the purposes of relative precision. on the number of Trotter steps for the Hubbard model. This indicates that the system size at which the number of Trotter steps required goes to one is hundreds of thousands of spin-orbitals, even in the smallest case. We plot the number of Trotter steps with the power-law fit for the Hubbard model in Figure 15.  The exponents of the fit are included in the caption. The fits intercept 1 well above 100,000 system qubits, far above the range of system sizes we consider.