Quantum Algorithm for Simulating Hamiltonian Dynamics with an Off-diagonal Series Expansion

We propose an efficient quantum algorithm for simulating the dynamics of general Hamiltonian systems. Our technique is based on a power series expansion of the time-evolution operator in its off-diagonal terms. The expansion decouples the dynamics due to the diagonal component of the Hamiltonian from the dynamics generated by its off-diagonal part, which we encode using the linear combination of unitaries technique. Our method has an optimal dependence on the desired precision and, as we illustrate, generally requires considerably fewer resources than the current state-of-the-art. We provide an analysis of resource costs for several sample models.


1 Introduction
Simulating the dynamics of quantum many-body systems is a central challenge in physics, chemistry and the materials sciences, as well as in other areas of science and technology. While for classical algorithms this task is in general intractable, quantum circuits offer a way around the classical bottlenecks by 'circuitizing' the time evolution of the system in question. However, present-day quantum computing devices allow for the programming of only small and noisy quantum circuits, a state of affairs that places severe constraints on the types of applications these devices may be used for in practice. The qubit and gate costs of circuitization procedures have therefore rightfully become key factors in determining the feasibility of any potential application, and increasingly efficient algorithms are continuously being devised.
We propose a novel approach to resource-efficient Hamiltonian dynamics simulations on quantum circuits that, we argue, offers certain advantages over state-of-the-art quantum simulation algorithms [1,2], advantages that directly translate to a shorter algorithm runtime (see Sec. 4 for a detailed comparison). We accomplish this by utilizing a series expansion of the quantum time-evolution operator in its off-diagonal elements, wherein the operator is expanded around its diagonal component [3][4][5]. This expansion allows one to effectively integrate out the diagonal component of the evolution, thereby reducing the overall gate and qubit complexities of the algorithm as compared to existing methods.
In our approach, the time evolution is broken up into identical short-time segments, each of which is accurately approximated using a number of terms in the off-diagonal series that is logarithmic in the inverse of the required precision. Each segment is then executed with the help of the linear combination of unitaries (LCU) lemma [1]. Our algorithm enables the simulation of a wide range of realistic models, including systems of spins, bosons or fermions.
The paper is organized as follows. In Sec. 2, we introduce the off-diagonal expansion insofar as it applies to the time-evolution operator. In Sec. 3, we present the Hamiltonian dynamics algorithm that we construct based on the expansion and in Sec. 4 we provide a comparison between the present scheme and two of the leading approaches to quantum simulations, the Taylor-series based approach of Berry et al. [1] and the interaction-picture representation approach devised by Low and Wiebe [2]. We examine several examples in some detail. A summary and some conclusions are given in Sec. 5.
2 Off-diagonal series expansion of the time-evolution operator

We next derive an expansion of the time-evolution operator based on the off-diagonal series expansion recently introduced in Refs. [3][4][5] in the context of quantum Monte Carlo simulations. While we focus in what follows on time-independent Hamiltonians for simplicity, we note that an extension of the following derivation to time-dependent Hamiltonians also exists [6].

Permutation matrix representation of the Hamiltonian
We begin by casting the Hamiltonian in the form

H = Σ_{i=0}^{M} D_i P_i,  (1)

where the D_i operators are diagonal in some known basis, which we will refer to as the computational basis and denote by {|z⟩}, P_0 := 1, and the P_i operators (for i > 0) are permutation operators, i.e., P_i|z⟩ = |z'(i,z)⟩ with z' ≠ z; that is, they do not have any fixed points (equivalently, their diagonal elements are all zero). While the above formulation may appear restrictive, it is important to note that any Hamiltonian can be written in this form. In particular, for models of spin-1/2 particles (qubits), the D_i's are diagonal in the Pauli-Z basis, and the P_i's are tensor products of Pauli-X operators, P_i ∈ {1, X}^{⊗N}, where N is the number of spins. We will refer to the principal diagonal matrix D_0 as the diagonal component of the Hamiltonian, while the set {D_i P_i}_{i=1}^{M} of off-diagonal operators (in the computational basis) gives the system its 'off-diagonal dimension'. We will call 'diagonal energies' the (real) numbers obtained by acting with D_0 on computational basis states: D_0|z⟩ = E_z|z⟩. Similarly, by applying the generalized permutation operator D_i P_i to a basis state, we obtain D_i P_i|z⟩ = d_i(z')|z'⟩, where d_i(z') is in general a complex number (z' depends on z and i). With these notations in hand, we move on to discuss the off-diagonal series expansion of the time-evolution operator.
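As a concrete illustration, the sketch below (plain NumPy; the toy two-qubit Hamiltonian and its coefficients are our own assumptions, not taken from the text) casts a small Hamiltonian in the form of Eq. (1) and checks the defining properties of the decomposition:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def kron(*ops):
    out = np.array([[1.0]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Toy 2-qubit Hamiltonian (illustrative assumption):
#   H = Z1 Z2 + g (X1 + X1 Z2), noting X1 + X1 Z2 = X ⊗ (1 + Z)
g = 0.5
H = kron(Z, Z) + g * (kron(X, I2) + kron(X, Z))

# Permutation-matrix representation H = D0 + D1 P1:
#   P1 = X ⊗ 1 flips the first qubit (no fixed points),
#   D0 and D1 are diagonal in the computational basis.
D0 = kron(Z, Z)
D1 = g * kron(I2, I2 + Z)
P1 = kron(X, I2)

assert np.allclose(H, D0 + D1 @ P1)    # the two forms agree
assert np.allclose(np.diag(P1), 0.0)   # P1 has all-zero diagonal
E = np.real(np.diag(D0))               # diagonal energies E_z
```

The grouping of all Pauli terms that flip the same subset of qubits into a single D_i P_i pair is exactly the compression the text exploits later when counting LCU terms.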

Expansion of the time-evolution operator
We next consider the evolution of a state under a time-independent Hamiltonian H for time t. We expand the time-evolution operator e^{-iHt} using the off-diagonal series expansion.
We first consider the action of e^{-iHt} on a basis state |z⟩:

e^{-iHt}|z⟩ = Σ_{n=0}^{∞} ((-it)^n/n!) H^n |z⟩ = Σ_{n=0}^{∞} ((-it)^n/n!) Σ_{S^{(n)}_j ∈ S_n} S^{(n)}_j |z⟩,  (2)

where in the last step we have also expanded the multinomial (Σ_i D_i P_i)^n, and S_n denotes the set of all (M + 1)^n operator sequences that appear in the expansion of the multinomial (Σ_i D_i P_i)^n. We proceed by 'stripping' all the diagonal operators off the sequences S^{(n)}_j. We do so by evaluating their action on the relevant basis states, leaving only the off-diagonal operators unevaluated inside the sequence (for example, for the n = 2 sequence D_{i_2} P_{i_2} D_{i_1} P_{i_1}, we have D_{i_2} P_{i_2} D_{i_1} P_{i_1} |z⟩ = d_{i_1}(z_1) d_{i_2}(z_2) P_{i_2} P_{i_1} |z⟩). Collecting all terms together, we arrive at:

e^{-iHt}|z⟩ = Σ_{q=0}^{∞} Σ_{i_q} ( Σ_{n=q}^{∞} ((-it)^n/n!) Σ_{k_0+...+k_q = n-q} Π_{j=0}^{q} E_{z_j}^{k_j} ) d_{i_q}(z) P_{i_q} |z⟩,  (3)

where the boldfaced index i_q = (i_1, ..., i_q) is a tuple of indices i_j, with j = 1, ..., q, each ranging from 1 to M, and P_{i_q} := P_{i_q} ··· P_{i_2} P_{i_1}. In addition, similar to the diagonal energy E_z = ⟨z|D_0|z⟩, the E_{z_j} = ⟨z_j|D_0|z_j⟩ are the energies of the states |z_1⟩, ..., |z_q⟩ obtained from the ordered action of the P_{i_j} operators appearing in the sequence P_{i_q} on |z⟩, then on |z_1⟩, and so forth. Explicitly, P_{i_1}|z⟩ = |z_1⟩, P_{i_2}|z_1⟩ = |z_2⟩, etc.
(Note that the sequence of states, and similarly the energies, should actually carry the full tuple label, e.g., |z_1^{(i_q)}⟩; for conciseness we will be using the abbreviations |z_1⟩, ..., |z_q⟩.) The coefficient d_{i_q}(z) := Π_{j=1}^{q} d_{i_j}(z_j), where each factor d_{i_j}(z_j) can be considered the 'hopping strength' of P_{i_j} with respect to |z_j⟩ (see Ref. [3] for a complete and detailed derivation). The infinite sum in parentheses in Eq. (3) evaluates to the efficiently calculable divided-differences representation [7,8]

Σ_{n=q}^{∞} ((-it)^n/n!) Σ_{k_0+...+k_q = n-q} Π_{j=0}^{q} E_{z_j}^{k_j} = e^{-it[E_z,...,E_{z_q}]},  (4)

where the complex coefficient e^{-it[E_z,...,E_{z_q}]} is the divided difference of the exponential function over the multi-set of the energies {E_z, ..., E_{z_q}} [7,8] (more details can be found in Appendix A.1). We may therefore write

e^{-iHt}|z⟩ = Σ_{q=0}^{∞} Σ_{i_q} α^{(z)}_{i_q}(t) P_{i_q}|z⟩ = V_z|z⟩,  (5)

where

α^{(z)}_{i_q}(t) := d_{i_q}(z) e^{-it[E_z,...,E_{z_q}]},  (6)

and where we have denoted

V_z := Σ_{q=0}^{∞} Σ_{i_q} α^{(z)}_{i_q}(t) P_{i_q}.  (7)

(In the special case of q = 0, α^{(z)}_0(t) = e^{-itE_z}.) In Appendix A.2, we show that one can pull out a global phase from e^{-it[E_z,...,E_{z_q}]} to obtain e^{-itE_z} e^{-it[ΔE_z,...,ΔE_{z_q}]}, where ΔE_{z_j} = E_{z_j} − E_z (and specifically ΔE_z = 0). Therefore, we can write α^{(z)}_{i_q}(t) as:

α^{(z)}_{i_q}(t) = e^{-itE_z} d_{i_q}(z) e^{-it[ΔE_z,...,ΔE_{z_q}]},  (9)

where the divided-difference inputs are now energy differences rather than total diagonal energies.
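The expansion can be checked numerically on a toy example. The sketch below (our own construction; the single-qubit Hamiltonian and all parameter values are illustrative assumptions) evaluates the divided differences e^{-it[E_z,...,E_{z_q}]} via the standard bidiagonal-matrix (Opitz) representation, which handles repeated inputs, and verifies that the off-diagonal series reproduces e^{-iHt} for a Hamiltonian with a single off-diagonal term (M = 1):

```python
import numpy as np

def dd_exp(t, xs):
    """Divided difference e^{-it[x_0,...,x_q]} over a multi-set of inputs:
    the (0, q) entry of exp(-itJ), where J is bidiagonal with the inputs on
    its diagonal and ones on its superdiagonal (Opitz's formula)."""
    q = len(xs) - 1
    J = np.diag(np.asarray(xs, dtype=complex)) + np.diag(np.ones(q), 1)
    term = np.eye(q + 1, dtype=complex)
    total = term.copy()
    for n in range(1, 80):            # Taylor series of exp(-itJ); ample for small t*||J||
        term = term @ (-1j * t * J) / n
        total += term
    return total[0, q]

# Single-qubit check: H = D0 + D1 P1 with D0 = diag(E0, E1), P1 = X, D1 = g*1
# (numbers are illustrative assumptions).
E, g, t = np.array([0.2, 1.1]), 0.6, 0.7
H = np.diag(E) + g * np.array([[0.0, 1.0], [1.0, 0.0]])

U_series = np.zeros((2, 2), dtype=complex)
for z in (0, 1):                      # build column |z> of e^{-iHt} from Eq. (5)
    for q in range(25):               # truncate the q-sum
        path = [E[(z + j) % 2] for j in range(q + 1)]   # energies E_z, E_{z_1}, ..., E_{z_q}
        U_series[(z + q) % 2, z] += g**q * dd_exp(t, path)

w, V = np.linalg.eigh(H)              # exact evolution for comparison
U_exact = V @ np.diag(np.exp(-1j * t * w)) @ V.conj().T
assert np.allclose(U_series, U_exact, atol=1e-8)
```

With M = 1 each path of length q alternates between the two basis states, so the 'hopping strength' product d_{i_q}(z) is simply g^q; the factorially decaying divided differences make the q-sum converge rapidly, which is the basis of the truncation used in Sec. 3.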
3 The Hamiltonian dynamics algorithm

Preliminaries
We first set some definitions and notations that will be used in the description of the algorithm. We denote the max norm of a matrix A by ‖A‖_max := max_{ij} |A_{ij}|. In addition, we denote Γ_i := ‖D_i‖_max for i = 1, ..., M, and for a tuple i_q = (i_1, ..., i_q) we write Γ_{i_q} := Π_{j=1}^{q} Γ_{i_j}.

Decomposition into short-time evolutions
To simulate the time evolution e^{-iHt}, we execute r times in succession a short-time circuit for the segment operator U := e^{-iHΔt}, with Δt := t/r. Hereafter we omit the explicit dependence on Δt for brevity. We write

U = Σ_z V_z |z⟩⟨z|,

where V_z is given by Eq. (7) upon replacing t with Δt. Using Eq. (9), we can rewrite U as the product U = U_od · e^{-iΔtD_0}, where

U_od := Σ_z ( Σ_{q=0}^{∞} Σ_{i_q} d_{i_q}(z) e^{-iΔt[ΔE_z,...,ΔE_{z_q}]} P_{i_q} ) |z⟩⟨z|.

We thus find that the off-diagonal expansion enables the effective decoupling of the evolution due to the diagonal part of the Hamiltonian from the evolution due to its off-diagonal part, allowing us to write U as a product of U_od and e^{-iΔtD_0}. In the special case where the off-diagonal part of the Hamiltonian is zero (thus, d_{i_q} = 0 for all i_q), our method reduces directly to simulating diagonal Hamiltonians on a quantum computer. The circuit implementation of the diagonal unitary e^{-iΔtD_0} can be done with a gate cost O(C_{D_0}), where C_{D_0} is the gate cost of calculating a matrix element of D_0 [9] (see Appendix B for more details). This cost depends only on the locality of D_0, and is independent of its norm.
To simulate U_od we will use the LCU technique [1], starting with writing U_od as a sum of unitary operators. To do that, we first note that |e^{-iΔt[ΔE_z,...,ΔE_{z_q}]}| ≤ Δt^q/q! (this follows from the mean-value theorem for divided differences [8]). In addition, d_{i_q}(z)/Γ_{i_q} is a complex number lying inside the unit circle. Therefore, the norm of the complex number

β^{(z)}_{i_q} := (q!/Δt^q) (d_{i_q}(z)/Γ_{i_q}) e^{-iΔt[ΔE_z,...,ΔE_{z_q}]}  (13)

is not larger than 1. We can thus write β^{(z)}_{i_q} as the average of two phases,

β^{(z)}_{i_q} = cos φ^{(z)}_{i_q} e^{iχ^{(z)}_{i_q}} = (1/2) Σ_{k=0,1} e^{iχ^{(k,z)}_{i_q}},  with χ^{(k,z)}_{i_q} := χ^{(z)}_{i_q} + (−1)^k φ^{(z)}_{i_q}.

Using this notation, we can write U_od as

U_od = Σ_{q=0}^{∞} Σ_{i_q} (Δt^q Γ_{i_q}/q!) (1/2) Σ_{k=0,1} Φ^{(k)}_{i_q},  (15)

where

Φ^{(k)}_{i_q} := Σ_z e^{iχ^{(k,z)}_{i_q}} P_{i_q} |z⟩⟨z|,

and each Φ^{(k)}_{i_q} is a unitary transformation. Thus, Eq. (15) is the short-time off-diagonal evolution operator U_od represented as a linear combination of unitary transformations.
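The splitting of a complex number in the unit disk into the average of two pure phases is elementary; a minimal sketch (the function name is ours):

```python
import math
import cmath

def two_phases(beta):
    """Decompose beta with |beta| <= 1 as the average of two phases:
    beta = cos(phi) e^{i chi} = (e^{i(chi+phi)} + e^{i(chi-phi)}) / 2."""
    chi = cmath.phase(beta)
    phi = math.acos(abs(beta))
    return chi + phi, chi - phi

beta = 0.3 - 0.4j                         # an arbitrary point in the unit disk
a, b = two_phases(beta)
reconstructed = (cmath.exp(1j * a) + cmath.exp(1j * b)) / 2
assert abs(reconstructed - beta) < 1e-12
```

This is what allows every (sub-unit-norm) coefficient in the expansion to be absorbed into the pair of unitaries Φ^{(0)}, Φ^{(1)}.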

The LCU setup
To simulate the evolution under U_od on a finite-size circuit, we truncate the series, Eq. (15), at some maximal order Q, which leads to the approximate (truncated) operator

Ũ_od := Σ_{q=0}^{Q} Σ_{i_q} (Δt^q Γ_{i_q}/q!) (1/2) Σ_{k=0,1} Φ^{(k)}_{i_q}.

Since the coefficients of the off-diagonal operator expansion fall factorially with q (similar to the truncation of the Taylor series in Ref. [1]), setting

Q = O( log(r/ε) / log log(r/ε) )  (18)

ensures that the error per evolution segment is smaller than ε/r:

‖U_od − Ũ_od‖ ≤ Σ_{q=Q+1}^{∞} (Δt Σ_{i=1}^{M} Γ_i)^q / q! ≤ Σ_{q=Q+1}^{∞} (e Δt Σ_{i=1}^{M} Γ_i / q)^q ≤ ε/r,

where the last step follows from the inequality q! ≥ (q/e)^q. This choice ensures that the overall error is bounded by ε (as measured by the spectral norm of the difference between the approximation and the true dynamics). We next provide the details of the circuit we implement to execute the LCU routine and the resource costs associated with it.
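The choice of cutoff can be sketched numerically as follows (our own illustration, assuming the standard tail bound Σ_{q>Q} T_1^q/q! ≤ e^{T_1} T_1^{Q+1}/(Q+1)!, where T_1 = Δt Σ_i Γ_i is the dimensionless time per segment; all names are ours):

```python
import math

def truncation_order(T1, eps_segment):
    """Smallest Q whose factorially decaying tail falls below the per-segment
    error budget; a numerical sketch of the choice in Eq. (18)."""
    Q, tail_term = 0, T1                      # tail_term = T1^(Q+1)/(Q+1)!
    while tail_term * math.exp(T1) > eps_segment:
        Q += 1
        tail_term = tail_term * T1 / (Q + 1)
    return Q

# Example: segments of dimensionless length ln 2, total error 1e-10 over r = 100 segments.
Q = truncation_order(math.log(2), 1e-10 / 100)
assert math.log(2) ** (Q + 1) / math.factorial(Q + 1) <= 1e-12
assert Q < 20                                 # Q grows only logarithmically in r/eps
```

The loop makes the factorial decay explicit: doubling the precision requirement increases Q only additively, which is the source of the optimal precision dependence claimed in the abstract.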

State preparation
The first ingredient of the LCU is the preparation of the state

|ψ_0⟩ := (1/√s) Σ_{q=0}^{Q} Σ_{i_q} √(Δt^q Γ_{i_q}/q!) |i_q⟩ ⊗ (1/√2)(|0⟩ + |1⟩),

where |i_q⟩ = |i_1⟩ ··· |i_q⟩|0⟩^{⊗(Q−q)} is shorthand for Q quantum registers, each of dimension M + 1 (equivalently, a quantum register with Q⌈log(M + 1)⌉ qubits), and s is the normalization constant [cf. Eq. (21)]. We construct |ψ_0⟩ in two steps: starting with the state |0⟩^{⊗Q}, we transform the q-th register to the normalized version of |0⟩ + √(ΔtΓ/q)|1⟩, where Γ := Σ_{i=1}^{M} Γ_i, conditioned on the (q − 1)-th register being in the |1⟩ state. The resulting state, up to normalization, is

Σ_{q=0}^{Q} √((ΔtΓ)^q/q!) |1⟩^{⊗q}|0⟩^{⊗(Q−q)}.

The gate cost of this step is O(Q). Next, we act on each of the registers with a unitary transformation that takes the |1⟩ state to the normalized version of Σ_{i=1}^{M} √Γ_i |i⟩. Finally, we apply a Hadamard transformation on the last (qubit) register, resulting in the state |ψ_0⟩. The gate cost of this step is O(M) [11]. We denote by B the unitary transformation that takes |0⟩^{⊗Q}|0⟩ to |ψ_0⟩.

Controlled-unitary transformation
The second ingredient of the LCU routine is the construction of the controlled operation

U_C |k⟩|i_q⟩|z⟩ = e^{iχ^{(k,z)}_{i_q}} |k⟩|i_q⟩ P_{i_q}|z⟩,  (25)

where |k⟩ is a single-qubit ancillary state in the computational basis. The number of ancilla qubits here is Q⌈log(M + 1)⌉ + 1. Equation (25) indicates that U_C can be carried out in two steps: a controlled-phase operation (U_CΦ) followed by a controlled-permutation operation (U_CP). The controlled-phase operation U_CΦ requires a somewhat intricate calculation of nontrivial phases. We therefore carry out the required algebra with the help of additional ancillary registers and then 'push' the results into phases. The latter step is done by employing the unitary

U_ph |ϕ⟩ = e^{iϕ} |ϕ⟩,

whose implementation cost depends only on the precision with which we specify ϕ and is independent of Hamiltonian parameters [9] (for completeness we provide an explicit construction of U_ph in Appendix C). With the help of the (controlled) unitary transformation

U_χφ |k⟩|i_q⟩|z⟩|0⟩ = |k⟩|i_q⟩|z⟩|χ^{(k,z)}_{i_q}⟩,

we can write U_CΦ = U_χφ^† (1 ⊗ U_ph) U_χφ. This is illustrated in Fig. 1. Note that U_χφ is a 'classical' calculation sending computational

Figure 1: A circuit description of the controlled phase U_CΦ in terms of U_χφ and U_ph.
basis states to computational basis states. We provide an explicit construction of U_χφ in Appendix D. We find that its gate and qubit costs are O(Q² + QM(C_{ΔD_0} + k_od + log M)) and O(Q), respectively, where C_{ΔD_0} is the cost of calculating the change in diagonal energy due to the action of a permutation operator, and k_od is an upper bound on the 'off-diagonal locality', i.e., the locality of the P_i's [1,12]. The construction of U_CP is carried out by a repeated execution of the simpler unitary transformation U_P|i⟩|z⟩ = |i⟩P_i|z⟩. Recall that the P_i are the off-diagonal permutation operators that appear in the Hamiltonian. The gate cost of U_P is therefore O(M(k_od + log M)). For spin models, each P_i is a tensor product of up to k_od Pauli-X operators. Applying this transformation to the Q ancilla quantum registers, we obtain |i_q⟩|z⟩ → |i_q⟩P_{i_q}|z⟩ with a gate cost of O(QM(k_od + log M)). A sketch of the circuit is given in Fig. 2. We can thus conclude that the total gate cost of implementing U_C is O(Q² + QM(C_{ΔD_0} + k_od + log M)).

Oblivious amplitude amplification
To realize U od , the LCU technique calls for the execution of a combination of the state preparation unitary B and the controlled-unitary transformation U C which together form an oblivious amplitude amplification (OAA) procedure [1].
Let |ψ⟩ be the current state of the system. Under the action of W = B† U_C B, the state becomes

W |0⟩|ψ⟩ = (1/s) |0⟩ U_od |ψ⟩ + √(1 − 1/s²) |Ψ⊥⟩,

such that |Ψ⊥⟩ is supported on a subspace orthogonal to |0⟩^{⊗(Q+1)}. If s = 2 and U_od is unitary, then the OAA step A := −W(1 − 2P_0)W†(1 − 2P_0)W, with P_0 the projector onto the all-zero ancilla state, ensures that A|0⟩|ψ⟩ = |0⟩U_od|ψ⟩. Under these conditions, the action of W(1 ⊗ e^{-iΔtD_0}) on the state at time t, namely |ψ(t)⟩, advances it by one time step to |ψ(t + Δt)⟩. This is illustrated in Fig. 3. In Ref. [1], a robust version of OAA was given for the case of a non-unitary U_od and s ≈ 2.
The robust analysis bounds the error of each segment: the state produced after tracing out the ancilla registers (Tr_anc denotes this trace) is within distance δ of the ideal evolved state [recall that s ≈ 2 as per Eq. (21)]. Thus the overall error after r repetitions is O(rδ), so we require δ = O(ε/r) to obtain an overall error of O(ε). These conditions are satisfied by setting Δt as in Sec. 3.1 and choosing Q as in Eq. (18). For convenience, we provide a glossary of symbols in Table 1. A summary of the gate and qubit costs of the simulation circuit and the various sub-routines used to construct it is given in Table 2.

4 Comparison to existing approaches and examples
In this section, we compare the resource costs of our algorithm against those of two state-of-the-art approaches, and further provide a brief analysis of the complexity of our algorithm for a number of physical models.
Since our approach is based on an application of the LCU technique, we first compare the resource costs of our algorithm's LCU sub-routine against that of the Taylor series-based method of Berry et al. [1]. One of the main differences in costs between the two methods lies in how the Hamiltonian is decomposed: the Taylor series-based approach decomposes the Hamiltonian into a linear combination of L unitary operators. For qubit Hamiltonians, these unitary operators will generally be tensor products of single-qubit Pauli operators (although of course in some cases, more compact decompositions can be found). The off-diagonal decomposition, on the other hand, casts the Hamiltonian as a sum of generalized permutation operators, as given in Eq. (1); a representation that is generally considerably more compact. (For example, for qubit Hamiltonians, all operators that flip the same subset of qubits are grouped together.) This in turn implies that the number of terms in the decomposition of the Hamiltonian will generally be considerably smaller in the off-diagonal representation (i.e., M ≪ L). This difference directly translates to reduced gate and qubit costs (a summary is given in Table 2).

Table 1 (excerpt): glossary of cost symbols.
C_{D_0}: cost of calculating a diagonal energy (a single D_0 matrix element)
C_{ΔD_0}: cost of calculating the change to a diagonal energy due to the action of a P_i
C_D: cost of calculating a single D_i matrix element (i ≠ 0)
Another key difference is in the respective dimensionless time constants. In the off-diagonal expansion approach, the dimensionless time constant is given by T := t Σ_{i=1}^{M} Γ_i, whereas in the Taylor series-based approach it is proportional to t Σ_{i=1}^{L} |c_i|, where the c_i are the coefficients of the L unitaries in the decomposition of the Hamiltonian. In both approaches the dimensionless time determines the cutoff of the respective expansions, and controls the overall gate and qubit costs of the algorithm. Indeed, as we show below, in general one has Σ_{i=1}^{M} Γ_i ≪ Σ_{i=1}^{L} |c_i|, which directly translates to a reduced simulation cost in favor of the off-diagonal expansion. To be more quantitative, we provide an explicit comparison between the off-diagonal and Taylor expansions for a few spin models in Table 3. The 'price' we pay for the above savings is the additional O(Q²) operations per time step required for calculating the divided-difference coefficients. However, we note that since Q scales

Table 3: A comparison between the proposed method and the Taylor series-based approach [1]. In the table, N denotes the number of qubits. The table illustrates two important features of the proposed method as compared to the Taylor series-based approach for Hamiltonians written in the Pauli basis. Firstly, the Taylor series-based approach treats the diagonal and off-diagonal components of the Hamiltonian in the LCU algorithm on an equal footing, while our method requires only the off-diagonal part as an input to the LCU algorithm. This is shown in the row labeled 'No. of LCU unitaries'. Secondly, in the Taylor series-based approach, each Pauli operator in the decomposition of H is considered as a unitary, leading to a dimensionless time that is proportional to the sum of the absolute values of all the coefficients in the decomposition. In our approach, on the other hand, all the diagonal operators that act in the same way on basis states are grouped into a single diagonal operator (D_j in the table). Therefore, in our algorithm, the dimensionless time is proportional to the sum of the norms of all 'grouped' diagonal operators (sans the diagonal component of the Hamiltonian).
Due to this grouping, the dimensionless time of the present method will in general be substantially smaller than that of the Taylor series-based method. Having a smaller dimensionless time translates to savings in gate and qubit resources, as well as to a shorter runtime of the algorithm.
logarithmically with T, and T is typically much smaller than its Taylor series-based counterpart, the advantages arising from the use of divided differences asymptotically outweigh this added complexity.
As an alternative to the Taylor series-based algorithm, Low and Wiebe [2] have recently proposed a framework within which the dynamics is formulated in the interaction picture using a (truncated) Dyson series expansion. There, the time-ordered multi-dimensional integrals of the Dyson series are approximated via Riemann sums and implemented using control registers, ridding the simulation cost of most of its dependence on the diagonal component of the Hamiltonian. Our algorithm is similar in this respect to the interaction-picture approach, as the off-diagonal series expansion may be viewed as explicitly integrating the Dyson integrals (the reader is referred to Refs. [4,13] for more details pertaining to the relation between the off-diagonal series expansion and the Dyson series). There are, however, a few notable differences between the two algorithms that translate to differences in resource scaling. The main difference is that the cost of the interaction-picture approach still has a poly-logarithmic dependence on the norm of the diagonal part of the Hamiltonian, while in our method the dynamics due to the diagonal part of the Hamiltonian is completely decoupled from that of its off-diagonal part. This decoupling ensures that our algorithm has no dependence on the norm of the diagonal component of the Hamiltonian. In addition, the poly-logarithmic dependence on the various problem parameters in the interaction-picture algorithm is obtained under certain assumptions on the gate cost of implementing specific unitary oracles; the power of the logarithmic polynomial (γ in Ref. [2]) is left undetermined in general. As mentioned above in the context of the Taylor series-based algorithm, the price paid for the decoupling is an additional O(Q²) operations per time step, with Q, the expansion order, scaling logarithmically with the algorithm's dimensionless time T, which does not depend on the diagonal norm of the Hamiltonian.
In the next subsections, we briefly analyze the off-diagonal circuit complexity for three models of scientific interest: the (Fermi-)Hubbard model, that of electronic structure and the Schwinger model.

The Fermi-Hubbard model
We first examine the asymptotic cost of simulating the Fermi-Hubbard model [14], which serves as a model of high-temperature superconductors. The Fermi-Hubbard Hamiltonian is given by

H = −t_h Σ_{⟨ij⟩,σ} (a†_{iσ} a_{jσ} + a†_{jσ} a_{iσ}) + U Σ_{j} n_{j↑} n_{j↓},  (33)

describing N electrons with spin σ ∈ {↑, ↓} hopping, with hopping strength t_h, between neighboring sites ⟨ij⟩ of a d-dimensional hyper-cubic lattice. In addition, the model has an on-site interaction term with strength U between opposite-spin electrons occupying the same site (n_{jσ} := a†_{jσ} a_{jσ} is the number operator).
The Fermi-Hubbard model can be mapped to qubits in a number of different ways [15][16][17][18]. For concreteness, we consider the Jordan-Wigner transformation (JWT) [15], which maps the second-quantized operator a_{jσ} to an operator on j qubits according to

a_{jσ} ↦ ( Π_{k<j} Z_{kσ} ) (X_{jσ} − iY_{jσ})/2,

so that a†_{jσ} a_{jσ} = (1 + Z_{jσ})/2. To write the Fermi-Hubbard Hamiltonian in the form of Eq. (1), we rewrite each JWT-transformed hopping term as a diagonal operator multiplying a permutation operator. Applying the transformation to the Hamiltonian, Eq. (33), we arrive at:

H = D_0 + Σ_{⟨ij⟩,σ} D_{ijσ} P_{ijσ},  (36)

where we have identified D_0 = (U/4) Σ_j (1 + Z_{j↑})(1 + Z_{j↓}), P_{ijσ} = X_{iσ} X_{jσ}, and D_{ijσ} a diagonal operator that carries the hopping amplitude t_h together with the Jordan-Wigner Z-string. The product structure of the D_{ijσ} implies that their max-norm is simply given by t_h for all i, j, σ. The number of off-diagonal terms is M = N d. Therefore, the dimensionless time T of the simulation algorithm is T = tM t_h = tN d t_h. For comparison, in the Taylor series decomposition, the number of terms in the Hamiltonian is L = 3N + 2M, and the dimensionless time parameter is T' ∼ 3N U t + 2T. Note that due to the independence of T of the on-site repulsion strength U, the off-diagonal expansion algorithm offers a favorable scaling as compared to the Taylor series-based LCU in the Mott-insulating regime U ≫ t_h.
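The counting above can be sketched in a few lines (illustrative parameter values; the formulas are the ones quoted in the text, with T' denoting the Taylor series-based dimensionless time):

```python
def hubbard_times(N, d, t_h, U, t):
    """Dimensionless times for the Fermi-Hubbard model, per the text:
    off-diagonal T = t*N*d*t_h (independent of U), Taylor T' ~ 3*N*U*t + 2*T."""
    T_od = t * N * d * t_h
    T_taylor = 3 * N * U * t + 2 * T_od
    return T_od, T_taylor

# Mott-insulating regime U >> t_h: the off-diagonal dimensionless time does not
# grow with U, so the advantage widens with the interaction strength.
T_od, T_taylor = hubbard_times(N=100, d=2, t_h=1.0, U=8.0, t=1.0)
assert T_od == 200.0 and T_taylor == 2800.0
assert T_od < T_taylor
```

Since the gate count of each segment scales with the dimensionless time, the ratio T'/T (here 14x) is a direct proxy for the runtime saving in this regime.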

Hamiltonian simulation of electronic structure
Another model of major practical relevance is the simulation of electronic structure, in the framework of which the stationary properties of electrons interacting via Coulomb forces in an external potential are of interest. This problem was recently analyzed in detail in Ref. [19], where a 'plane-wave dual basis Hamiltonian' formulation was proposed, which diagonalizes the potential operator, leading to a Hamiltonian representation with O(N²) second-quantized terms, where N is the number of basis functions. Using the JWT to map the model to qubits, one arrives at

H = (1/2N) Σ_{ν,p,q,σ} k_ν² cos[k_ν · r_{q−p}] a†_{pσ} a_{qσ} − (4π/Ω) Σ_{ν≠0,j,p,σ} (ζ_j cos[k_ν · (R_j − r_p)]/k_ν²) n_{pσ} + (2π/Ω) Σ_{ν≠0,(p,σ)≠(q,σ′)} (cos[k_ν · r_{p−q}]/k_ν²) n_{pσ} n_{qσ′},

where R_j and r_p denote nuclei and electron coordinates, respectively, ζ_j are the nuclei charges, and k_ν is a vector of the plane-wave frequencies at the ν-th harmonic of the computational cell in three dimensions, whose volume we denote by Ω (see Ref. [19]). The permutation matrix representation dictates that we write the Hamiltonian above as

H = D_0 + Σ_{σ, p≠q} D_{pqσ} P_{pqσ},

where all the diagonal terms (the potential terms together with the p = q kinetic terms) are grouped together to form D_0 and are integrated out of the LCU. The off-diagonal (p ≠ q) terms are likewise grouped as D_{pqσ} P_{pqσ}, with P_{pqσ} = X_{pσ} X_{qσ}. We notice that in the off-diagonal representation, the Hamiltonian has a structure similar to that of Eq. (36), with k_od = 2. Similar to the Fermi-Hubbard model, each D_{pqσ} has a product structure, and their max-norm is simply given by (1/2N)|Σ_ν k_ν² cos[k_ν · r_{q−p}]| for all p, q, σ. The number of terms in the off-diagonal part of the Hamiltonian in this representation is M = 2(N² − N), and thus the dimensionless time T of the simulation algorithm is

T = (t/2N) Σ_{σ, p≠q} | Σ_ν k_ν² cos[k_ν · r_{q−p}] |.

For comparison, in the Taylor series-based LCU approach, the number of terms in the Hamiltonian is L = 2N + 6(N² − N), and the dimensionless time parameter T' is proportional to the sum of the absolute values of all L coefficients, including those of the diagonal (potential) terms. In particular, the dimensionless parameter in the current scheme depends only on the magnitude of the kinetic (hopping) coefficients and can take values much smaller than T' due to a 'destructive interference' of the cosine terms evaluated at different values of k_ν · r_{q−p}.

The Schwinger model
The Schwinger model [20] is an Abelian low-dimensional gauge theory describing two-dimensional (one spatial plus time) Euclidean quantum electrodynamics with a Dirac fermion. Despite being a simplified model, the theory exhibits rich properties, similar to those seen in more complex theories such as QCD (e.g., confinement and spontaneous symmetry breaking). The model can be converted to an equivalent spin model [21][22][23] whose Hamiltonian comprises a nearest-neighbor fermion hopping term, a staggered mass term, and a long-range diagonal interaction generated by the gauge field (see Refs. [21][22][23] for the explicit form); here ε_0 (the background electric field) is a constant that can be set to zero, g, m and a are the fermion-gauge field coupling, mass and lattice spacing, respectively, and N is the number of lattice sites. In the permutation matrix representation, each hopping bond is written as a single generalized permutation operator D_n P_n with P_n = X_n X_{n+1}, while the mass and gauge-field terms constitute the diagonal component D_0. It follows then that the number of off-diagonal terms is M = N, and the off-diagonal dimensionless time is T = tN/(2a²g²). For comparison, in the Taylor series-based LCU approach, the number of terms L into which the Hamiltonian is decomposed is proportional to N² due to the long-range diagonal term, and the dimensionless time parameter T' scales as O(t(N² + mN/(ag²) + N/(a²g²))). We thus find that the off-diagonal formulation provides in this case a scaling advantage over a Taylor series-based approach.

5 Summary and conclusions
We proposed a quantum algorithm for simulating the dynamics of general time-independent Hamiltonians. Our approach consisted of expanding the time-evolution operator using an off-diagonal series, a parameter-free method, free of Trotter error, that was recently developed in the context of quantum Monte Carlo simulations [3][4][5]. This expansion enabled us to simulate the time evolution of states under general Hamiltonians using alternating segments of diagonal and off-diagonal evolutions, with the latter implemented using the LCU technique [1].
We argued that our scheme provides considerable savings in gate and qubit costs for certain classes of Hamiltonians, specifically Hamiltonians that are represented in a basis in which the diagonal component is dominant. In fact, we find that for optimal savings one should choose the basis of representation such that the norm of the off-diagonal component of the Hamiltonian is minimal.
In this work, we focused only on time-independent Hamiltonians. The algorithm can be extended to the time-dependent case by writing the time-evolution operator in a Dyson series and appropriately discretizing the Dyson time integrals [6].
We believe that further improvements to our algorithm can likely be made. In Appendix E, we provide a slightly modified representation of the Hamiltonian which simplifies, to an extent, the circuit construction, specifically the implementation of the 'classical' calculation U_χφ, which requires O(Q) ancillas beyond those required by the LCU. It is not unreasonable to assume that it is possible to encode the entire classical calculation directly into phases, eliminating this extra cost.
The above expression can be further simplified to the divided-difference form e^{-it[E_z,...,E_{z_q}]}, as was asserted in the main text.
A.2 Proof of Eq. (9)

Given a list of inputs x_0, ..., x_q, we prove that

e^{-it[x_0,...,x_q]} = e^{-itx} e^{-it[Δ_0,...,Δ_q]},

where x is an arbitrary constant and Δ_j = x_j − x. By definition [8],

f[x_0,...,x_q] = Σ_{j=0}^{q} f(x_j) / Π_{k≠j} (x_j − x_k)

(assuming for now that all inputs are distinct). It follows then that

e^{-it[x_0,...,x_q]} = Σ_{j=0}^{q} e^{-itx_j} / Π_{k≠j} (x_j − x_k) = e^{-itx} Σ_{j=0}^{q} e^{-itΔ_j} / Π_{k≠j} (Δ_j − Δ_k) = e^{-itx} e^{-it[Δ_0,...,Δ_q]},

where we used x_j − x_k = Δ_j − Δ_k. This result holds for arbitrarily close inputs and can be easily generalized to the case where inputs have repeated values.

B Circuit construction of e −i∆tD 0
Assume that D_0 is a k_d-local Hamiltonian on n qubits, i.e., that it can be written as a sum of terms, each of which acts on at most k_d qubits:

D_0 = Σ_i J_i Z_i,

where J_i ∈ ℝ and Z_i is shorthand for a specific tensor product of (at most) k_d single-qubit Pauli-Z operators. Given this form of D_0, we can write

e^{-iΔtD_0} = Π_i e^{-iΔtJ_i Z_i}.

Each unitary operator in the above product, e^{-iΔtJ_i Z_i}, can be further simplified as

e^{-iΔtJ_i Z_i} |z⟩ = e^{-iΔtJ_i (−1)^{z_1 ⊕ ··· ⊕ z_m}} |z⟩,

where |z⟩ is the computational basis state of the m qubits on which e^{-iΔtJ_i Z_i} acts (m ≤ k_d), i.e., z ∈ {0,1}^m and z_l ∈ {0,1} is the l-th bit of z. Therefore, e^{-iΔtJ_i Z_i} can be implemented using a single ancilla qubit and 2m CNOT gates [9]. A diagram is provided in Fig. 4.

C Circuit construction of U_ph

We define a single-qubit rotation

R_j := |0⟩⟨0| + e^{i2π/2^{j+1}} |1⟩⟨1|.

By applying R_j on an additional single qubit in the state |1⟩, controlled by the j-th register of |ϕ⟩ = |x_0, x_1, ..., x_{b−1}⟩ (with ϕ = 2π Σ_{j=0}^{b−1} x_j 2^{−(j+1)} given in binary), we obtain the accumulated phase e^{iϕ}|ϕ⟩|1⟩. This is illustrated in Fig. 5.
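The phase computed by the CNOT-ladder construction of Appendix B can be mimicked classically; the sketch below (our own check) builds exp(-iΔt·J·Z⊗...⊗Z) from bit parities and compares it with direct exponentiation of the diagonal:

```python
import numpy as np

def exp_z_string(dt, J, m):
    """Diagonal unitary exp(-i*dt*J*Z^{tensor m}) on m qubits: basis state |z>
    acquires the phase exp(-i*dt*J*(-1)^parity(z)), which is exactly what the
    CNOT ladder plus a single-qubit rotation computes on hardware."""
    parities = np.array([bin(z).count("1") % 2 for z in range(2 ** m)])
    return np.diag(np.exp(-1j * dt * J * (-1.0) ** parities))

# Check against direct exponentiation of the diagonal of Z (x) Z:
Z = np.diag([1.0, -1.0])
H = np.kron(Z, Z)
dt, J = 0.3, 0.8
assert np.allclose(exp_z_string(dt, J, 2), np.diag(np.exp(-1j * dt * J * np.diag(H))))
```

The parity is what the CNOT ladder accumulates onto the ancilla; the rotation then imprints the phase, and the uncomputing half of the ladder restores the ancilla, for 2m CNOTs in total.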

D Implementation of U χφ
The controlled unitary U_χφ essentially carries out a classical computation (it is a pure permutation, sending diagonal elements to diagonal elements). As such, its gate complexity can be given in terms of the classical cost of the calculation, plus an incurred logarithmic overhead which comes from making the classical calculation reversible [24]. The classical gate cost of calculating the complex number β^{(z)}_{i_q}, Eq. (13), and therefore also that of the angles χ and φ, consists of calculating (i) the product of q off-diagonal Hamiltonian matrix elements, namely d_{i_q} = Π_{j=1}^{q} d_{i_j}(z_j), and (ii) the divided differences of the exponential function with q diagonal matrix elements as inputs (energy differences, to be precise). The former can be calculated with O(1) registers and O(QM C_D) operations, where C_D is the cost of calculating a single D_j matrix element, as this is a controlled operation on the evaluation of at most Q off-diagonal matrix entries.
The latter requires roughly O(QM ) operations to generate the energy differences in the worst case, and another O(Q 2 ) operations for calculating the divided difference given the inputs. The number of registers required for the above operation scales as O(Q) [25].
With the above stated, we next present an explicit algorithm for constructing U_χφ for completeness. To that aim, we use two transformations, U_β and U_decomp, defined as follows:

U_β |i_q⟩|z⟩|0⟩ = |i_q⟩|z⟩|β^{(z)}_{i_q}⟩,  (61)

with β^{(z)}_{i_q} given in Eq. (13), and

U_decomp |β⟩|0⟩|0⟩ = |β⟩|φ⟩|χ⟩,  (62)

which decomposes a complex number β = cos φ e^{iχ} (here, |β| ≤ 1) into the two angles φ and χ. The third register of Eq. (61), and similarly the first register of Eq. (62), store complex numbers; that is, they each consist of two registers for their real and imaginary parts (hereafter, complex numbers inside kets indicate complex-number registers). The gate cost of U_decomp is a function of bit-precision only. Combined, the two sub-routines above yield U_χφ = U_β† U_decomp U_β. This is illustrated in Fig. 6. For the construction of U_β, it will be useful to rewrite the complex number β^{(z)}_{i_q} as

β^{(z)}_{i_q} = (d_{i_q}(z)/Γ_{i_q}) e^{-iΔt ΔE_{(z,...,z_q)}},

where we have defined the (complex-valued) 'effective energy difference' ΔE_{(z,...,z_q)} [3] such that

e^{-iΔt[ΔE_z,...,ΔE_{z_q}]} = (Δt^q/q!) e^{-iΔt ΔE_{(z,...,z_q)}}.

Note that e^{-iΔt ΔE_{(z,...,z_q)}} is a complex number lying inside the unit circle. In addition, U_β may be carried out by a sub-routine which computes the 'effective energy difference' (which should be followed by an O(1)-cost circuit |ΔE_{(z,...,z_q)}⟩|0⟩ → |ΔE_{(z,...,z_q)}⟩|e^{-iΔt ΔE_{(z,...,z_q)}}⟩). We next discuss a classically efficient method for calculating ΔE_{(z,...,z_q)} given the sequence of energy differences {ΔE_z, ΔE_{z_1}, ..., ΔE_{z_q}}. For simplicity we will assume the energy values are sorted (the divided difference of a function is invariant under a permutation of its inputs). To carry out the calculation, we will use the divided-differences recursion relations, Eq. (48), which we rewrite in terms of the 'effective energy differences' [3]:

(Δt^q/q!) e^{-iΔt ΔE_{(z,...,z_q)}} = (Δt^{q−1}/(q−1)!) (e^{-iΔt ΔE_{(z_1,...,z_q)}} − e^{-iΔt ΔE_{(z,...,z_{q−1})}}) / (ΔE_{z_q} − ΔE_z).

Isolating ΔE_{(z,...,z_q)}, we arrive at

ΔE_{(z,...,z_q)} = (i/Δt) ln[ (q/(Δt (ΔE_{z_q} − ΔE_z))) (e^{-iΔt ΔE_{(z_1,...,z_q)}} − e^{-iΔt ΔE_{(z,...,z_{q−1})}}) ].  (67)

Thus Eq. (67) provides a recursion relation for the effective energy differences. The initial condition for the above recursion is simply ΔE_{(z_i)} = ΔE_{z_i} (we will sometimes denote a state z by z_0 for notational convenience). In the limit where all energies in a sequence {z_i, ..., z_j} are equal, i.e., z_i = ...
= z_j = z', the above relation neatly reduces to the repeated-input rule: since the divided difference over w + 1 equal inputs is the w-th derivative over w!, one finds e^{-iΔt ΔE_{(z',...,z')}} = (−i)^w e^{-iΔt ΔE_{z'}} with w = j − i. A convenient way to calculate the divided difference, equivalently the 'effective energy difference' ΔE_{(z,...,z_q)}, relying on the recursion relations given above, is a 'pyramid scheme' as illustrated in Fig. 7. The base of the pyramid has q + 1 elements, corresponding to the 'initial' energies ΔE_{(z_i)} = ΔE_{z_i} with i = 0, ..., q. Let us denote this as level zero. Level one of the pyramid, which has only q elements, is now evaluated as follows: for each element at level one, we invoke the recursion relation Eq. (67) using the two elements below it (see Fig. 7) at level zero. To avoid ill-defined ratios, we order the energies at level zero such that repeated values are grouped together; in this case, the evaluation of elements whose two parent values coincide is carried out with the repeated-input rule above rather than with Eq. (67). Similarly, every level-two element is calculated using the two level-one elements immediately below it. This procedure is continued until the top level (level q) of the pyramid is reached, which gives the desired value of the effective energy difference ΔE_{(z,...,z_q)}. The above procedure requires O(Q) complex-valued registers and can be done reversibly with O(Q²) operations.
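Under the assumption of distinct, sorted inputs, the pyramid evaluation can be sketched as follows (our own Python rendering; to sidestep the logarithm in Eq. (67) we propagate the phases w_{jk} = e^{-iΔt·ΔE_{(j..k)}} themselves rather than the effective energies, which is algebraically equivalent):

```python
import numpy as np
from math import factorial

def dd_exp(t, xs):
    """Newton-table divided difference e^{-it[x_0,...,x_q]} (distinct inputs)."""
    x = np.asarray(xs, dtype=float)
    col = np.exp(-1j * t * x).astype(complex)
    for w in range(1, len(x)):
        col = (col[1:] - col[:-1]) / (x[w:] - x[:-w])
    return col[0]

def pyramid(dt, dE):
    """Pyramid evaluation of the phase w_{0,q} = e^{-i*dt*dE_(0..q)} via the
    recursion implied by Eq. (67):
    w_{j,j+w} = (w / (dt*(x_{j+w}-x_j))) * (w_{j+1,j+w} - w_{j,j+w-1})."""
    level = [np.exp(-1j * dt * e) for e in dE]          # base of the pyramid
    for w in range(1, len(dE)):
        level = [w / (dt * (dE[j + w] - dE[j])) * (level[j + 1] - level[j])
                 for j in range(len(dE) - w)]
    return level[0]                                      # top of the pyramid

# The top of the pyramid equals (q!/dt^q) times the exponential divided difference:
dt, dE = 0.4, [0.0, 0.3, 0.9, 1.4]
q = len(dE) - 1
assert abs(pyramid(dt, dE) - factorial(q) / dt**q * dd_exp(dt, dE)) < 1e-9
```

Each pyramid level is a single pass over the previous one, giving the O(Q²) operation count quoted above with O(Q) live registers.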
Because the routine just discussed for calculating divided differences requires the evaluation of logarithms and trigonometric functions, which are known to be somewhat cumbersome to implement on quantum computers [26], we also provide an alternative routine that is, on the one hand, slightly less efficient, requiring O(Q^2) registers, but on the other hand uses only basic arithmetic operations. To accomplish that, we first discuss an analogous classical method for calculating e^(−i∆t[∆E_(z_j),...,∆E_(z_k)]) given a sequence of real numbers {∆E_(z_j), . . . , ∆E_(z_k)}. Our calculation is based on the divided-differences Leibniz rule [7,8], Eq. (69). We use this rule as a 'time-doubling' mechanism for the exponent, from τ/2 to τ. Let us rephrase Eq. (69) as Eq. (70), where we have denoted e_jk(τ) ≡ e^(−iτ[∆E_(z_j),...,∆E_(z_k)]) (see Appendix F for proof). We can therefore choose a large enough integer ℓ that is commensurate with the precision of the complex-valued registers, set δt = ∆t/2^ℓ, and then repeatedly apply Leibniz's rule, Eq. (70), ℓ times to calculate e_0q(∆t).

Figure 7: Calculating the effective energy differences using a 'pyramid' structure. The evaluation of the divided differences of the exponential function of q + 1 input energy differences consists of calculating each level of the pyramid starting at its base. The values at the base of the pyramid, ∆E_(z_j), are simply the energy inputs ∆E_(z_j) (shown as the red line at the bottom of the pyramid), with all identical energy differences placed together as a group (energies are assumed to be sorted). To calculate the elements at the next level of the pyramid, we use the relation in Eq. (67). This procedure is continued until the final level of the pyramid is evaluated, which corresponds to the desired effective energy difference ∆E_(z,...,z_q).
The (0q)-th sub-register of the last register above contains the desired e^(−i∆t[E_z,...,E_(z_q)]). This circuit requires Q(Q − 1)/2 × ℓ complex registers, where ℓ is the number of time-doubling steps. A more resource-efficient version of the circuit, based on irreversible computation (and therefore also a similarly costly reversible version), exists which requires only O(Q) registers and O(Q^2) gates [24].
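The time-doubling mechanism can be demonstrated classically. In the sketch below (illustrative Python; the names dd_exp_table and leibniz_double are ours, and distinct energies are assumed for brevity), the table at the half time is built directly only so that the doubling step can be checked against an exact reference; one doubling step applies the divided-differences Leibniz rule (fg)[x_j, . . . , x_k] = Σ_m f[x_j, . . . , x_m] g[x_m, . . . , x_k] with f = g = exp at half time, using only multiplications and additions.

```python
import cmath

def dd_exp_table(energies, tau):
    """Upper-triangular table e[j][k] = exp(-1j*tau*.)[x_j,...,x_k] of
    divided-difference exponentials, built with the standard recursion
    (distinct energies assumed)."""
    n = len(energies)
    e = [[0j] * n for _ in range(n)]
    for j in range(n):
        e[j][j] = cmath.exp(-1j * tau * energies[j])
    for w in range(1, n):                 # table width (k - j)
        for j in range(n - w):
            k = j + w
            e[j][k] = (e[j + 1][k] - e[j][k - 1]) / (energies[k] - energies[j])
    return e

def leibniz_double(e):
    """One time-doubling step: from the table at tau/2, produce the table at
    tau via the Leibniz rule, (fg)[x_j..x_k] = sum_m f[x_j..x_m]*g[x_m..x_k]."""
    n = len(e)
    out = [[0j] * n for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            out[j][k] = sum(e[j][m] * e[m][k] for m in range(j, k + 1))
    return out
```

Applying leibniz_double ℓ times to a table initialized at δt = ∆t/2^ℓ yields the table at ∆t using basic arithmetic only, in line with the routine described above.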
To complete our analysis, we next provide a circuit, U_∆, that generates the inputs to the divided-differences sub-routine, namely the ∆E_(z_i). A sketch of the circuit is given in Fig. 8. It requires U_p (see Sec. 3.3.2), the adder U_+|x⟩|y⟩ = |x⟩|x ⊕ y⟩, and a sub-routine, U_∆E, for calculating the difference in diagonal energy following a change in the input state |z_(i−1)⟩ → P_i|z_(i−1)⟩ = |z_i⟩. We note that, conveniently, the size of these energy-difference registers is not expected to grow with system size for physical systems. The gate cost of U_∆E is O(M C_∆D_0), where C_∆D_0 is the cost of calculating the change in diagonal energy due to the action of a single permutation operator; we can therefore conclude that the gate cost of U_∆ is O(QM(C_∆D_0 + k_od + log M)).
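Classically, the role of U_∆ is simply to walk the state through the sequence of permutation operators and record the successive diagonal-energy changes. A minimal sketch (illustrative names; diag_energy stands in for evaluating the diagonal part D_0 on a computational-basis state):

```python
def energy_differences(z0, perms, diag_energy):
    """Classical analogue of the role of U_Delta: starting from z0, apply
    each permutation operator in turn and record the change in diagonal
    energy, Delta E_{z_i} = D0(z_i) - D0(z_{i-1})."""
    diffs, z = [], z0
    for p in perms:
        z_next = p(z)                 # |z_{i-1}> -> P_i |z_{i-1}> = |z_i>
        diffs.append(diag_energy(z_next) - diag_energy(z))
        z = z_next
    return diffs
```

For physical Hamiltonians each permutation touches only a few degrees of freedom, which is why the energy-difference registers (and the per-step cost C_∆D_0) need not grow with system size.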

Figure 8: A circuit for U_∆, which calculates the energy differences ∆E_(z_i).

E Alternative description of the Hamiltonian
A somewhat more efficient short-time evolution circuit can be obtained if we slightly modify the representation of the Hamiltonian. As before, every D_i (for i > 0) in H can be written as Γ_i(D_i/Γ_i), where Γ_i is an upper bound on the norm of D_i and the matrix elements of D_i/Γ_i lie within the unit circle. Each such (diagonal) element can be written as a product cos θ e^(iχ). We can thus replace every D_i with an average of two pure phase matrices, with phases e^(i(χ+θ)) and e^(i(χ−θ)) along their diagonals, since cos θ e^(iχ) = (e^(i(χ+θ)) + e^(i(χ−θ)))/2. This suggests a representation of the Hamiltonian in which each D_i is replaced by such a pair of pure phase matrices. Re-deriving the simulation algorithm using this representation simplifies the 'classical' calculation of the controlled phase U_CΦ, Eq. (29), which now includes only the divided-difference calculation.

To see the effect of this simplification, we apply the mean-value theorem for divided differences [8], which states that if f(·) is a real-valued function, then f[E_0, . . . , E_q] = f^(q)(Ẽ)/q! for some Ẽ in the range R_E = [min_j E_j, max_j E_j], where f^(q)(·) denotes the q-th derivative of f(·). For small enough ranges R_E, f^(q)(·) is approximately linear, in which case Ẽ will be the simple mean, Ẽ ≈ Σ_j E_j/(q + 1), and the error of this approximation will be of second order in the width of R_E. Choosing f(·) to be cos(·) and sin(·), with inputs [δtE_0, . . . , δtE_q], we arrive at the corresponding approximations for the real and imaginary parts of the divided-difference exponential.
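The mean-value approximation can be checked numerically. In the sketch below (illustrative Python; divided_difference is our own helper, with distinct inputs assumed), we take f = cos over a narrow range with q = 2, so that f[E_0, E_1, E_2] should be close to f''(Ẽ)/2! = −cos(Ẽ)/2 with Ẽ the simple mean:

```python
import math

def divided_difference(f, xs):
    """Standard divided-difference recursion over distinct points xs."""
    vals = [f(x) for x in xs]
    n = len(xs)
    for k in range(1, n):
        vals = [(vals[i + 1] - vals[i]) / (xs[i + k] - xs[i])
                for i in range(n - k)]
    return vals[0]

# q = 2 over a narrow range: cos[E0, E1, E2] vs. cos''(mean)/2! = -cos(mean)/2
xs = [1.00, 1.01, 1.02]
dd = divided_difference(math.cos, xs)
approx = -math.cos(sum(xs) / len(xs)) / 2.0
```

The discrepancy between dd and approx is far smaller than the width of the input range, consistent with the claimed second-order error.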